This talk will discuss the flow of information and the evolution of internal representations during deep neural network (DNN) training, aiming to demystify the compression aspect of the information bottleneck theory. The theory suggests that DNN training comprises a rapid fitting phase followed by a slower compression phase, in which the mutual information I(X;T) between the input X and internal representations T decreases. Several papers observe compression of estimated mutual information on different DNN models, but the true I(X;T) over these networks is provably either constant (discrete X) or infinite (continuous X). We will explain this discrepancy between theory and experiments, and explain what was actually measured by these past works.
To this end, an auxiliary (noisy) DNN framework will be introduced, in which I(X;T) is a meaningful quantity that depends on the network's parameters. We will show that this noisy framework is a good proxy for the original (deterministic) system both in terms of performance and the learned representations. To accurately track I(X;T) over noisy DNNs, a differential entropy estimator tailor to exploit the DNN's layered structure will be developed and theoretical guarantees on the associated minimax risk will be provided. Using this estimator along with a certain analogy to an information-theoretic communication problem, we will elucidate the geometric mechanism that drives compression of I(X;T) in noisy DNNs. Based on these findings, we will circle back to deterministic networks and explain what the past observations of compression were in fact showing. Future research directions inspired by this study aiming to facilitate a comprehensive information-theoretic understanding of deep learning will also be discussed.