We discuss two problems related to the general challenge of making accurate inferences about a complex distribution, in the regime in which the amount of data (i.e the sample size) is too small for the empirical distribution of the samples to be an accurate representation of the underlying distribution. The first problem is the task of accurately estimating the eigenvalues of the covariance matrix of a distribution--the "population spectrum". (These eigenvalues contain basic information about the distribution, including the presence/lack of low-dimensional structure in the distribution and the applicability of many higher-level machine learning and multivariate statistical tools.) As we show, even in the regime where the sample size is linear or sublinear in the dimensionality of the distribution, and hence the eigenvalues and eigenvectors of the empirical covariance matrix are misleading, accurate approximations to the true population spectrum are possible. In the second portion of the talk, we discuss the problem of recovering a low-rank approximation to a matrix of probabilities P, given access to an observed matrix of "counts" obtained via independent samples from the distribution defined by P. This problem can be viewed as a generalization of "community detection", and is relevant to several recent machine learning efforts, including the work on constructing "word embeddings".
This talk is based on two joint works, the first with Weihao Kong, the second with Qingqing Huang, Sham Kakade, and Weihao Kong.