Large-population human datasets are being generated that can transform science and medicine. New machine learning techniques are necessary to unlock this data resource and enable discoveries. I will first survey recent advances in human population genomics and describe new computational techniques we have developed to connect genetic and epigenetic variation to human disease. These methods required significant innovations in latent-variable models, non-convex optimization and histogram estimation. I collaborated closely with the largest genomics consortia to apply these approaches, which are scalable and have strong mathematical guarantees, to systematically estimate the effects of mutations and to identify disease biomarkers. In the second part of the talk, I will discuss the omnipresent challenge of bias arising from data exploration, whereby many of the apparent patterns that we see in data are false discoveries. We developed a general approach, based on information usage, to bound the bias due to exploratory data analysis, and I will also present rigorous techniques to reduce it. I will conclude with general lessons for machine learning and discuss new research directions.
James Zou is a postdoc at Microsoft Research New England and MIT. He works on machine learning methodology and its applications to human genomics. He received his Ph.D. from Harvard University in 2014, supported by an NSF Graduate Fellowship. In Spring 2014, he was a Simons research fellow at U.C. Berkeley. He has multiple first-author papers in top scientific journals (PNAS, Nature Methods) as well as top machine learning conferences (NIPS, ICML, AISTATS), and has won several paper awards.