Deep learning algorithms that achieve state-of-the-art results on image and text recognition tasks tend to fit the entire training dataset (nearly) perfectly, including mislabeled examples and outliers. This propensity to memorize seemingly useless data, and the resulting large generalization gap, have puzzled many practitioners and are not explained by existing theories of machine learning. We provide a simple conceptual explanation and a theoretical model demonstrating that memorization of outliers and mislabeled examples is necessary for achieving close-to-optimal generalization error when learning from long-tailed data distributions. Image and text data are known to follow such distributions, and our results therefore establish a formal link between these empirical phenomena. We then demonstrate the utility of memorization and support our explanation empirically. These results rely on a new technique for efficiently estimating the memorization and influence of training data points.
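As a rough illustration of what "memorization of a training point" means in this line of work (the talk's exact formalization may differ), one standard leave-one-out formulation scores how much including an example in the training set changes the model's accuracy on that same example. For a learning algorithm A, training set S, and example (x_i, y_i):

```latex
\mathrm{mem}(\mathcal{A}, S, i)
\;=\;
\Pr_{h \sim \mathcal{A}(S)}\bigl[h(x_i) = y_i\bigr]
\;-\;
\Pr_{h \sim \mathcal{A}(S \setminus \{i\})}\bigl[h(x_i) = y_i\bigr],
```

where the probabilities are over the randomness of training. A value near 1 indicates the model predicts y_i essentially only because (x_i, y_i) was in the training set; estimating this quantity naively requires retraining per example, which is what makes efficient estimation techniques necessary.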
Based on a joint work with Chiyuan Zhang.
Bio: Vitaly Feldman is a research scientist at Apple AI Research, working on the foundations of machine learning and privacy-preserving data analysis. His recent research interests include tools for the analysis of generalization, distributed privacy-preserving learning, privacy-preserving optimization, and adaptive data analysis.
Vitaly holds a Ph.D. from Harvard (2006, advised by Leslie Valiant) and was previously a research scientist at Google Research (Brain team) and IBM Research - Almaden. His work was recognized by the COLT Best Student Paper Award in 2005 and 2013 (student co-authored) and by the IBM Research Best Paper Award in 2014, 2015, and 2016. His recent research on the foundations of adaptive data analysis has been featured in CACM Research Highlights, Science, and the research blogs of IBM, Google, and Microsoft. He served as program co-chair for the COLT 2016 and ALT 2021 conferences and as a co-organizer of the Simons Institute Program on Data Privacy in 2019.