ISL Colloquium presents "Stochastic Descent Algorithms: Minimax Optimality, Implicit Regularization, and Deep Networks"
Stochastic descent methods have had a long history in optimization, adaptive filtering, and online learning and have recently gained tremendous popularity as the workhorse for deep learning. So much so that, it is now widely recognized that the success of deep networks is not only due to their special deep architecture, but also due to the behavior of the stochastic descent methods used, which plays a key role in reaching "good" solutions that generalize well to unseen data. In an attempt to shed some light on why this is the case, we revisit some minimax properties of stochastic gradient descent (SGD)---originally developed for quadratic loss and linear models in the context of H-infinity control in the 1990's---and extend them to general stochastic mirror descent (SMD) algorithms for general loss functions and nonlinear models. These minimax properties can be used to explain the convergence and implicit-regularization of the algorithms when the linear regression problem is over-parametrized (in what is now being called the "interpolating regime"). In the nonlinear setting, exemplified by training a deep neural network, we show that when the setup is "highly over-parametrized", stochastic descent methods enjoy similar convergence and implicit-regularization properties. This observation gives some insight into why deep networks exhibit such powerful generalization abilities. It is also a further example of what is increasingly referred to as the "blessing of dimensionality".