A mystery of modern neural networks is their surprising generalization power in the overparametrized regime: they have so many parameters that they can interpolate the training set, even when the actual labels are replaced by purely random ones; despite this, they achieve good prediction error on unseen data.
To demystify this phenomenon, we focus on two-layer neural networks in the neural tangent (NT) regime. Under a simple data model where the n inputs are d-dimensional isotropic vectors and there are N hidden neurons, we show that as soon as Nd >> n, the minimum eigenvalue of the empirical NT kernel is bounded away from zero, and therefore the network can exactly interpolate arbitrary labels.
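The eigenvalue claim above can be checked numerically on a small example. The following is a minimal sketch, not the construction from the paper: the tanh activation, inputs normalized to the sphere of radius sqrt(d), unit-norm first-layer weights, random-sign second-layer weights, and the function names are all assumptions made for illustration.

```python
import numpy as np

def nt_features(X, W, a):
    """NT feature map of f(x) = N^{-1/2} * sum_i a_i * tanh(<w_i, x>).

    The gradient with respect to w_i is a_i * tanh'(<w_i, x>) * x, so each
    input is mapped to a vector of length N * d.
    """
    pre = X @ W.T                                   # (n, N) pre-activations
    dsig = 1.0 - np.tanh(pre) ** 2                  # tanh'(z) = 1 - tanh(z)^2
    feats = (dsig * a)[:, :, None] * X[:, None, :]  # (n, N, d)
    return feats.reshape(X.shape[0], -1) / np.sqrt(W.shape[0])

def min_eig_nt_kernel(n, d, N, seed=0):
    """Smallest eigenvalue of the empirical NT kernel K = Phi Phi^T."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)  # assumed scaling
    W = rng.standard_normal((N, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)               # assumed scaling
    a = rng.choice([-1.0, 1.0], size=N)                         # assumed signs
    Phi = nt_features(X, W, a)                                  # (n, N * d)
    K = Phi @ Phi.T                                             # (n, n)
    return np.linalg.eigvalsh(K)[0]  # eigvalsh returns ascending eigenvalues

if __name__ == "__main__":
    # Overparametrized: N * d = 1000 is much larger than n = 50.
    print("Nd >> n:", min_eig_nt_kernel(n=50, d=10, N=100))
    # Underparametrized: N * d = 20 < n = 50, so K has rank at most 20.
    print("Nd <  n:", min_eig_nt_kernel(n=50, d=5, N=4))
```

In the second call the kernel is rank deficient by construction, so its smallest eigenvalue is zero up to rounding error, which is the contrast the condition Nd >> n is meant to rule out.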
Next, we study the generalization error of NT ridge regression (including min-$\ell_2$ norm interpolation). We show that in the same overparametrization regime Nd >> n, in terms of generalization error, NT ridge regression is well approximated by kernel ridge regression (with the infinite-width kernel), which in turn is well approximated by polynomial ridge regression. A surprising phenomenon is a "self-induced" regularization due to the high-degree components of the activation function.
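As a small illustration of NT ridge regression and its min-$\ell_2$ norm limit, the sketch below fits random labels using NT features. It is a toy example under assumed choices (tanh activation, Gaussian inputs, random-sign second-layer weights, the hypothetical helper name nt_ridge), not the setting analyzed in the paper, and it only looks at the training fit rather than the generalization error.

```python
import numpy as np

def nt_ridge(Phi, y, lam):
    """Minimize ||Phi @ theta - y||^2 + lam * ||theta||^2 over theta.

    Uses the identity theta = Phi^T (Phi Phi^T + lam I)^{-1} y. With lam = 0
    and Phi Phi^T invertible, this is the min-l2-norm interpolator.
    """
    n = Phi.shape[0]
    alpha = np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)
    return Phi.T @ alpha

rng = np.random.default_rng(0)
n, d, N = 40, 8, 60                      # N * d = 480 is much larger than n = 40
X = rng.standard_normal((n, d))          # assumed data model: Gaussian inputs
W = rng.standard_normal((N, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=N)

# NT features: gradient of N^{-1/2} * sum_i a_i tanh(<w_i, x>) w.r.t. each w_i.
dsig = 1.0 - np.tanh(X @ W.T) ** 2       # (n, N)
Phi = ((dsig * a)[:, :, None] * X[:, None, :]).reshape(n, N * d) / np.sqrt(N)

y = rng.choice([-1.0, 1.0], size=n)      # purely random labels

theta_interp = nt_ridge(Phi, y, lam=0.0) # min-norm interpolation
theta_ridge = nt_ridge(Phi, y, lam=1.0)  # ridge with an arbitrary penalty

err_interp = np.max(np.abs(Phi @ theta_interp - y))
err_ridge = np.max(np.abs(Phi @ theta_ridge - y))
print("max train residual, lam = 0:", err_interp)
print("max train residual, lam = 1:", err_ridge)
print("norms:", np.linalg.norm(theta_interp), np.linalg.norm(theta_ridge))
```

With lam = 0 the random labels are fit up to numerical precision, while a positive penalty trades a nonzero training residual for a smaller parameter norm.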
Link to the arXiv paper: https://arxiv.org/abs/2007.12826
This talk is hosted by the ISL Colloquium. To receive talk announcements, subscribe to the mailing list email@example.com.
Bio: Joe Zhong is currently a postdoc at Stanford University, advised by Prof. Andrea Montanari and Prof. David Donoho. His research interests include statistics, optimization, and deep learning. Prior to this, he obtained his PhD in 2019 from Princeton University, where he was advised by Prof. Jianqing Fan, and a B.S. in mathematics from Peking University in 2014.