James Zou and Amirata Ghorbani (PhD candidate) extend and adapt the Shapley approach to the study of data.
They propose a fair way to quantify how much individual datasets contribute to AI model performance and companies’ bottom lines.
Each of us continuously generates a stream of data. When we buy a coffee, watch a romcom or action movie, or visit the gym or the doctor's office (tracked by our phones), we hand over our data to companies that hope to make money from that information – either by using it to train an AI system to predict our future behavior or by selling it to others.
But what is that data worth?
"There's a lot of interest in thinking about the value of data," says Professor James Zou, member of the Stanford Institute for Human-Centered Artificial Intelligence (HAI), and faculty lead of a new HAI executive education program on the subject. How should companies set prices for data they buy and sell? How much does any given dataset contribute to a company's bottom line? Should each of us receive a data dividend when companies use our data?
Motivated by these questions, James and graduate student Amirata Ghorbani have developed a new and principled approach to calculating the value of data that is used to train AI models. Their approach, detailed in a paper presented at the International Conference on Machine Learning and summarized for a slightly less technical audience on arXiv, is based on a Nobel Prize-winning economics method and improves upon existing methods for determining the worth of individual data points or datasets. In addition, it can help AI system designers identify low-value data that should be excluded from AI training sets as well as high-value data worth acquiring. It can even be used to reduce bias in AI systems.
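To make the idea concrete, here is a minimal sketch of a Shapley-style valuation of training points. It is not the authors' implementation: the toy 1-nearest-neighbor model, the function names `accuracy_1nn` and `data_shapley`, the choice to score an empty training set as 0, and the four-point dataset are all assumptions made for illustration. A point's value is its average marginal gain in validation accuracy over every order in which the training points could be added.

```python
from itertools import permutations


def accuracy_1nn(train, val):
    """Validation accuracy of a 1-nearest-neighbor classifier on 1-D points.

    Assumption for this sketch: an empty training set scores 0.0.
    """
    if not train:
        return 0.0
    correct = 0
    for x, y in val:
        nearest = min(train, key=lambda p: abs(p[0] - x))
        correct += nearest[1] == y
    return correct / len(val)


def data_shapley(train, val, utility=accuracy_1nn):
    """Exact Shapley value of each training point.

    Enumerates every ordering of the training set, so it is only
    feasible for a handful of points (n! orderings).
    """
    n = len(train)
    values = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        subset = []
        prev = utility(subset, val)
        for i in order:
            subset.append(train[i])
            cur = utility(subset, val)
            values[i] += cur - prev  # marginal gain from adding point i
            prev = cur
    return [v / len(orderings) for v in values]


# Hypothetical toy data: (feature, label). The last training point is
# deliberately mislabeled: it sits among class-1 points but carries label 0.
train = [(0.0, 0), (1.0, 0), (9.0, 1), (6.0, 0)]
val = [(0.5, 0), (1.5, 0), (7.0, 1), (9.5, 1)]

values = data_shapley(train, val)
for point, value in zip(train, values):
    print(point, round(value, 3))
```

In this toy setup the mislabeled point receives the lowest value and the only class-1 example receives the highest, and the values add up to the accuracy of the model trained on all four points. Because exact enumeration grows factorially with the number of points, any realistic dataset would need an approximation, such as sampling a limited number of random orderings.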
The data Shapley value can also be used to reduce existing biases in datasets. For example, many facial recognition systems are trained on datasets that have more images of white males than of minorities or women. When these systems are deployed in the real world, their performance suffers because they encounter more diverse populations than they were trained on. To address this problem, James and Amirata ran an experiment: After a facial recognition system had been deployed in a real setting, they calculated how much each image in the training set contributed to the model's performance in the wild. They found that the images of minorities and women had the highest Shapley values and the images of white males had the lowest. They then used this information to fix the problem – weighting the training process in favor of the more valuable images. "By giving those images higher value and giving them more weight in the training process, the data Shapley value will actually make the algorithm work better in deployment – especially for minority populations," James says.
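One way to picture "giving more weight" to valuable examples is to turn per-example values into training weights. The sketch below is an assumption for illustration only: the article does not say how the weights were derived, and the function name `weights_from_values`, the rule of clipping negative values to zero, and the sample numbers are all hypothetical.

```python
def weights_from_values(values):
    """Map per-example values to non-negative weights that average to 1.

    Assumption for this sketch: examples with negative value get weight 0.
    """
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    if total == 0:
        return [1.0] * len(values)  # no signal: fall back to uniform weights
    scale = len(values) / total
    return [c * scale for c in clipped]


def weighted_mean_loss(losses, weights):
    """Average per-example losses, counting each in proportion to its weight."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


# Hypothetical values for four training images (not taken from the study).
example_values = [0.30, 0.10, -0.05, 0.20]
weights = weights_from_values(example_values)
print([round(w, 3) for w in weights])
```

With weights like these, a training loop would count errors on high-value images more heavily than errors on low-value ones. Many training libraries accept per-example weights for this purpose.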
Excerpted from: HAI "Quantifying the Value of Data"