Recent years have witnessed significant progress in entropy and mutual information estimation, in particular in the large alphabet regime. Concretely, there exist efficiently computable estimators whose performance with n samples is essentially that of the maximum likelihood estimator with n log(n) samples, a phenomenon termed "effective sample size enlargement". Generalizations to processes with memory (estimation of the entropy rate) and continuous distributions (estimation of the differential entropy) have remained largely open. This talk is about the challenges behind those generalizations and recent progress in this direction. For estimating the entropy rate of a Markov chain, we show that when the mixing time is not too slow, at least S^2/log(S) samples are required to consistently estimate the entropy rate, where S is the size of the state space. In contrast, the empirical entropy rate requires S^2 samples to achieve consistency even if the Markov chain is i.i.d. We propose a general approach to achieve the S^2/log(S) sample complexity, and illustrate our results through estimating the entropy rate of the English language from the Penn Treebank (PTB) and the Google 1 Billion Word Dataset. For differential entropy estimation, we characterize the minimax behavior over Besov balls, and show that a fixed-k nearest neighbor estimator adaptively achieves the minimax rates up to logarithmic factors without knowing the smoothness of the density. The "effective sample size enlargement" phenomenon holds in both the Markov chain case and the case of continuous distributions.
Joint work with Weihao Gao, Yanjun Han, Chuan-Zheng Lee, Pramod Viswanath, Tsachy Weissman, Yihong Wu, and Tiancheng Yu.