2021 Mathematical and Scientific Foundations of Deep Learning Annual Meeting
Organizers:
Peter Bartlett, University of California, Berkeley
Rene Vidal, Johns Hopkins University
Meeting Goals:
This meeting will bring together members of the NSFSimons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning (MoDL), which are aimed at developing mathematical and statistical tools to understand the success and limitations of deep learning, to guide the design of more effective methods, and to initiate the study of the mathematical problems that emerge. The meeting aims to report on progress in these directions in the first year and to stimulate discussions of future directions in the collaborations.
Speakers:
Andrea Montanari, Stanford University
Nati Srebro, Toyota Technological Institute at Chicago
Peter Bartlett, University of California, Berkeley
Gil Kur, Massachusetts Institute of Technology
Emmanuel Candès, Stanford University
René Vidal and Soledad Villar, Johns Hopkins University
Alejandro Ribeiro, University of Pennsylvania
Yi Ma, University of California, Berkeley

Meeting Report
The meeting brought together members of the two NSFSimons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning with other participants in the NSF SCALE MoDL program. Talks at the meeting reported on the progress of the collaborations in their first year. These collaborations aim to understand the success and limitations of deep learning.
Andrea Montanari (Stanford University) opened the workshop with a talk on highdimensional phenomena in interpolating models. This set the stage for several subsequent talks by focusing on some of the surprising properties exhibited by deeplearning methods: They employ simple gradient methods to solve nonconvex optimization problems involving rich, overparameterized models. And without any explicit regularization, they can find a perfect fit to noisy training data (an interpolating solution) and yet give accurate outofsample predictions. Andrea presented results describing the exact asymptotics of these phenomena in families of canonical highdimensional models inspired by deep networks. Rene´ Vidal (Johns Hopkins University) explored a related question: he described recent results on the dynamics of gradient flow in overparameterized linear networks, showing how overparameterization and initialization determine the rate at which an interpolating solution is approached.
The talk by Soledad Villar (Johns Hopkins University) was also related. She presented work investigating minimumnorm interpolating linear regression methods, and how approaches based on PCA can achieve improved performance in some regimes.
The talk by Gil Kur (Massachusetts Institute of Technology) explored necessary conditions for the phenomenon of benign overfitting, that is, good predictive accuracy of interpolating regression functions. His results show that whenever this phenomenon occurs, a small neighborhood of the true regression function must be almost as complex as the full model.
Nati Srebro (Toyota Technological Institute at Chicago) presented collaborative work exploring the power and limitations of gradient methods. He described recent results relating problems that can be learned by various kinds of ‘differentiable approaches’ (such as gradient descent and stochastic gradient descent) to those that can be learned with more classical models, including linear methods and statistical query approaches.
Emmanuel Candés (Stanford University) considered how machinelearning methods such as deep networks can be used to provide predictions with wellcalibrated error bars. He described recently developed methods that combine quantile regression with conformal prediction to give guaranteed confidence sets and showed how ideas from online learning can be incorporated to accommodate problems that vary over time.
Peter Bartlett (University of California, Berkeley) described an exploration of a particularly undesirable property of deep networks: those small perturbations of an input can lead to a large change of the network’s prediction. He presented recent work showing that this phenomenon of adversarial examples arises in networks with random parameters because they compute approximately linear functions.
Alejandro Ribeiro (University of Pennsylvania) described work on learning with constraints. Applications such as fair classification and robust image recognition lead to constrained stochastic optimization problems. Ribeiro’s talk described results on the difficulty of accurately solving such problems using data and how it depends on sensitivity properties as well as the notions of complexity that arise in unconstrained learning problems.
Yi Ma (University of California, Berkeley) closed the workshop with a presentation on his work on developing deeplayered learning methods based on the principle of maximizing rate reduction. He showed how this principled approach leads to components similar to those that are used in modern deep layered architectures.

Agenda
Thursday, September 30
9:30 AM Andrea Montanari  Highdimensional phenomena in interpolating models 11:00 AM Emmanuel Candès  Recent progress in predictive inference 1:00 PM Nati Srebro  The Power of Differentiable Learning 2:30 PM René Vidal and Soledad Villar  Learning Dynamics and Implicit Bias of Gradient Flow in Overparametrized Linear Models 4:00 PM Peter Bartlett  Adversarial examples in deep networks Friday, October 1
9:30 AM Alejandro Ribeiro  Learning under requirements 11:00 AM Gil Kur  On the Minimal Error of Empirical Risk Minimization 1:00 PM Yi Ma  ReduNet: Whitebox Deep (Convolution) Networks from the Principle of Rate Reduction 
Abstracts & Slides
Andrea Montanari, Stanford University
Highdimensional phenomena in interpolating models
Overparameterization poses a number of fascinating questions: (1) When is a network sufficiently large as to interpolate a sample of n random data points? (2) Why and how does overparameterization simplify the computational task of finding a global empirical minimizer? (3) How is it possible that interpolating models have good generalization properties? I will address these questions in the context of twolayer neural networks, neural tangent models and simple linear models. Among other results, I will present a toy example of a nonconvex learning problem which becomes simpler because of overparameterization.
Nati Srebro, Toyota Technological Institute at Chicago
The Power of Differentiable Learning
Training deep networks typically involves minimizing the training error over a large, overparameterized, model class. That is, it involves solving an underdetermined optimization problem with many global minima. It has become evidently clear and widely accepted, that a central element in the success of deep learning is the implicit (algorithmic) bias afforded by the optimization algorithm, i.e., a bias toward particular global minima not explicitly specified by the optimization problem. This has launched research into understanding and analyzing the implicit bias of optimization algorithms. In this talk, I will survey what we know and don’t know, about this implicit bias. In particular, I will discuss how much of our current understanding can be viewed as special cases of analysis of mirror descent under different potential functions, ask whether this limits our view and biases our understanding and attempt to break out of this limited view.
Peter Bartlett, University of California, Berkeley
Adversarial examples in deep networks
In this talk, Peter Bartlett considers ReLU networks of constant depth with independent gaussian parameters and shows that these networks compute functions that are very close to linear, so that a small step in the gradient direction leads to a large change of output. This result is for networks with constant depth, but Bartlett will also show that some constraint on depth is necessary: there are suitably deep networks that, with constant probability, compute a function that is close to constant. Combined with results characterizing benign overfitting in linear regression, this suggests two potential mechanisms behind adversarial examples, one arising from label noise and the other from random initialization.
This talk is based on joint work with Sébastien Bubeck and Yeshwanth Cherapanamjeri.
Gil Kur, Massachusetts Institute of Technology
On the minimal error of empirical risk minimization
We study the minimal error of the empirical risk minimization (ERM) procedure in the task of regression, both in the random and the fixed design settings. Our sharp lower bounds shed light on the possibility (or impossibility) of adapting to the simplicity of the model generating the data. In the fixed design setting, we show that the error is governed by the global complexity of the entire class. In contrast, in random design, ERM may only adapt to simpler models if the local neighborhoods around the regression function are nearly as complex as the class itself, a somewhat counterintuitive conclusion.
Emmanuel Candès, Stanford University
Recent progress in predictive inference
Machine learning algorithms provide predictions with a selfreported confidence score, but they are frequently inaccurate and uncalibrated, limiting their use in sensitive applications. This talk introduces two novel calibration techniques addressing two frequently discussed problems. First, we show how to calibrate learned predictors to control any statistical risk. Here, the proposed methods handles handles complex prediction tasks where the response variable is a highdimensional object (such as a set of objects detected, segmented, and classified from an image). Our new techniques, based on multiple testing, allow control of the false discovery rate and other targets, unlike conformal prediction and other previous methods. Second, one limitation of conformal inference is that we need to have access to identically distributed samples so that future test samples look like the training samples we have seen before. But what if this is not case? For example, in most settings the distribution of observations can shift drastically—think of finance or economics where market behavior can change in response to new legislation or major world events. We introduce an algorithm which can maintain prediction coverage over time despite substantial changes in the data distribution; for instance, it maintains coverage during the 2008 financial crisis and during the 2020 COVID19 period.
René Vidal and Soledad Villar, Johns Hopkins University
Learning dynamics and implicit bias of gradient flow in overparameterized linear models
Vidal Slides (PDF)
Villar Slides (PDF)Contrary to the common belief that overparameterization may hurt generalization and optimization, recent work suggests that overparameterization may bias the optimization algorithm towards solutions that generalize well — a phenomenon known as implicit regularization or implicit bias — and may also accelerate convergence — a phenomenon known as implicit acceleration. This talk will provide a detailed analysis of the dynamics of gradient flow in overparameterized twolayer linear models showing that convergence to equilibrium depends on the imbalance between input and output weights (which is fixed at initialization) and the margin of the initial solution. We will also provide an analysis of the implicit bias, showing that large hidden layer width, together with (properly scaled) random initialization, constrains[HC2] the network parameters to converge to a solution which is close to the minnorm solution. Finally, we study the generalization properties of overparameterized linear regression, showing that the risk can grow arbitrarily when the number of parameters approaches the number of samples and that such a divergence can be avoided with a PCA[HC3] like dimensionality reduction.
Alejandro Ribeiro, University of Pennsylvania
Learning under requirements: The theoretical underpinning of learning with constraints
View Slides (PDF)
View Animation (H.264 MOV)Statistical learning theory relates the problem of minimizing a statistical risk to the problem of minimizing an empirical risk. In this talk, we consider the approximation of a constrained statistical program with a constrained empirical program. To do that, we develop a theory of probably approximately correct constrained (PACC) learning that leverages and extends the standard theory of probably approximately correct (PAC) learning. Our results show that PAC and PACC learning problems have essentially sample complexities of the same order but that PACC learning problems require more samples than analogous PAC learning problems when the objective value is sensitive to perturbations of the constraint set. We illustrate the practical importance of our theory by considering applications to fair classification and robust image recognition.
Yi Ma, University of California, Berkeley
ReduNet: Whitebox Deep (Convolution) Networks from the Principle of Rate Reduction
In this talk, Yi Ma offers an entirely ‘white box’ interpretation of deep (convolution) networks from the perspective of data compression (and group invariance). In particular, he will show how modern deep layered architectures, linear (convolution) operators and nonlinear activations, and even all parameters can be derived from the principle of maximizing rate reduction (with group invariance). All layers, operators and parameters of the network are explicitly constructed via forward propagation, instead of learned via back propagation. All components of thus obtained network, called ReduNet, have precise optimization, geometric and statistical interpretation. There are also several nice surprises from this principled approach: it reveals a fundamental tradeoff between invariance and sparsity for class separability; it reveals a fundamental connection between deep networks and Fourier transform for group invariance — the computational advantage in the spectral domain (why spiking neurons?); this approach also clarifies the mathematical role of forward propagation (optimization) and backward propagation (variation). In particular, the ReduNet is amenable to finetuning via both forward and backward (stochastic) propagation, both for optimizing the same objective.