Organizers:
Peter Bartlett, University of California, Berkeley
Stacey Levine, National Science Foundation
René Vidal, University of Pennsylvania
Speakers:
Pedro Abdalla, University of California, Irvine
Rong Ge, Duke University
Alejandro Ribeiro, University of Pennsylvania
Nati Srebro, Toyota Technological Institute at Chicago
René Vidal, University of Pennsylvania
Soledad Villar, Johns Hopkins University
Bin Yu, University of California, Berkeley
Meeting Goals:
This meeting brought together members of the NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning (MoDL) and researchers working on related topics. The focus of the meeting was the set of challenging theoretical questions posed by deep learning methods and the development of mathematical and statistical tools to understand their success and limitations, to guide the design of more effective methods, and to initiate the study of the mathematical problems that emerge. The meeting reported on progress in these topics and stimulated discussions of future directions.
The meeting brought together members of the two NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning, along with other participants in the NSF SCALE MoDL program. Talks at the meeting reported on progress of the collaborations in their fifth year. These collaborations aim to understand the success and limitations of deep learning.
Pedro Abdalla (University of California, Irvine) presented results on large language model (LLM) watermarking algorithms, which secretly tag text generated by an LLM so that it can later be reliably distinguished from other text. The analysis takes a statistical perspective, using mixtures of distributions.
René Vidal (University of Pennsylvania) described work on the learning dynamics of gradient flow in various neural network settings. First, for two-layer ReLU networks trained on orthogonal separable data, he showed that training proceeds in phases, corresponding to alignment with class directions, then convergence of the loss. Second, although the function learned in this case is not robust to adversarial perturbations of the input, he described how to modify the nonlinearity to ensure robustness without sacrificing prediction accuracy. Finally, he described the dynamics of low-rank adaptation (LoRA) for a matrix factorization problem.
Rong Ge (Duke University) also described work on learning dynamics in neural networks, particularly in cases where there are several distinct mechanisms at play, and the dynamics involve a competition between these mechanisms. He presented an overparameterized sparse linear model that demonstrates a transition from feature learning to neural tangent kernel learning, as well as a transformer example that reveals a competition between in-context and in-weight learning.
Bin Yu (University of California, Berkeley) gave a talk on the role of interaction importance in deep learning models, describing algorithms that measure interaction importance in pretrained models and evaluating their effectiveness using faithfulness, predictivity, and reasoning metrics.
Nati Srebro (Toyota Technological Institute at Chicago) considered the phenomenon of weak-to-strong generalization, where a powerful learning algorithm learns from a less powerful model but ultimately outperforms it. He considered linearly parameterized two-layer networks with random first-layer features, where a weak teacher, that is, a network with few random features, learns a prediction task, and a strong learner, with many random features, learns from the teacher. He showed that the strong learner can outperform the weak teacher when gradient flow with early stopping provides a helpful inductive bias.
Soledad Villar (Johns Hopkins University) presented a number of applications of algebraic techniques (such as invariant theory, Galois theory, and representation stability) to the analysis and design of machine learning models, with applications like learning functions on graphs that are permutation invariant or learning functions on point clouds that are invariant to permutations and orthogonal transformations.
Peter Bartlett (University of California, Berkeley) described recent results on the implicit bias provided by gradient optimization methods with logistic loss for a general family of deep networks with parameterizations that satisfy a near-homogeneity property, which quantifies how their outputs grow as the parameters get large. As the logistic loss gets small, these networks converge in direction to satisfy first-order stationarity conditions for a certain margin maximization problem. Although such solutions can have desirable statistical properties, the convergence is at a logarithmic rate, and so the behavior along the trajectory is important. He also described overparameterized prediction settings, where early stopping gives statistical benefits over the asymptotic solution.
Alejandro Ribeiro (University of Pennsylvania) gave a talk on the spectral analysis of graph neural networks to elucidate their stability (robustness to graph perturbations) and transferability (convergence to continuous limits like graphons) properties. He showed that both properties are governed by the Lipschitz constant of the frequency response, which also measures spectral discriminability, revealing a fundamental tradeoff between stability and transferability on the one hand and discriminability on the other.
Thursday, September 25, 2025
9:30 AM   Pedro Abdalla | LLM Watermarking Using Mixtures and Statistical-to-Computational Gaps
11:00 AM  René Vidal | Learning Dynamics in the Feature-Learning Regime: Implicit Bias, Robustness, and Low-Rank Adaptation
1:00 PM   Bin Yu | Understanding Deep Learning Models via Interaction Importance
2:30 PM   Rong Ge | Competing Mechanisms in Training Dynamics
4:00 PM   Nati Srebro | Weak to Strong Generalization in Random Feature Models

Friday, September 26, 2025
9:30 AM   Soledad Villar | Algebraic Techniques for Machine Learning
11:00 AM  Peter Bartlett | Gradient Optimization Methods: Implicit Bias and Benefits of Early Stopping
1:00 PM   Alejandro Ribeiro | Spectral Analyses of Graph Neural Networks
Pedro Abdalla
University of California, Irvine
LLM Watermarking Using Mixtures and Statistical-to-Computational Gaps
View Slides (PDF)
Given a text, can we determine whether it was generated by a large language model (LLM) or by a human? Watermarking is a prominent approach used to tackle this question. In this talk, we will examine the watermarking problem from a statistical perspective, with a particular focus on how mixtures of distributions can be useful for watermarking. Time permitting, we will also discuss a robust framework involving tools from statistical-to-computational gaps. This is joint work with Roman Vershynin.
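To make the statistical flavor of watermark detection concrete, here is a minimal sketch of a simple "green list" detector: each position's token is pseudorandomly classified as green or not using a secret key, and a one-sided z-test on the green fraction flags watermarked text. This is not the mixture-based construction from the talk; the vocabulary size, green fraction, and hashing scheme are illustrative choices.

```python
# Minimal sketch of a "green list" text watermark detector (illustrative only;
# not the mixture-based scheme from the talk). Assumes tokens are integers and
# the green set at each position is derived from the previous token and a key.
import numpy as np

VOCAB_SIZE = 50_000
GREEN_FRACTION = 0.5  # each position's green list covers half the vocabulary

def is_green(prev_token: int, token: int, key: int) -> bool:
    """Pseudorandomly decide whether `token` is green at this position."""
    rng = np.random.default_rng(hash((prev_token, key)) % (2**32))
    return bool(rng.random(VOCAB_SIZE)[token] < GREEN_FRACTION)

def detect(tokens: list[int], key: int) -> float:
    """One-sided z-score: large values suggest the text was watermarked."""
    greens = sum(is_green(p, t, key) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    mean = n * GREEN_FRACTION
    std = np.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (greens - mean) / std

# Unwatermarked text lands near z = 0; watermarked generations, which prefer
# green tokens, produce z-scores far out in the right tail.
```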
Peter Bartlett
University of California, Berkeley
Gradient Optimization Methods: Implicit Bias and Benefits of Early Stopping
View Slides (PDF)
Deep learning methods do not explicitly control statistical complexity; instead, it seems to be implicitly controlled by the simple gradient descent algorithms used in optimizing training loss. Previous results on the asymptotic implicit bias of gradient descent with the logistic loss—and other losses with exponential tails—rely on a homogeneity property of the parameterization that is violated for many interesting examples of deep neural networks. We describe recent results on the asymptotic implicit bias of gradient descent for a general family of non-homogeneous deep networks, showing how the iterates converge in direction to satisfy first-order stationarity conditions of a margin maximization problem. Since this convergence is slow, we investigate the statistical behavior along the trajectory, and we identify overparameterized prediction settings where early stopping gives statistical benefits.
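As a much simpler warm-up to these results, the sketch below runs gradient descent with the logistic loss on a linear classifier over separable data and tracks the normalized margin; the data, step size, and iteration counts are arbitrary illustrative choices, and the non-homogeneous deep-network setting of the talk is not represented here.

```python
# Toy illustration of implicit bias in the linear, separable case only (the
# talk treats non-homogeneous deep networks). Gradient descent on the logistic
# loss sends ||w|| to infinity while the normalized margin keeps improving.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)
X += 0.5 * y[:, None] * w_star          # push every point away from the boundary

w = np.zeros(d)
lr = 0.5
for t in range(1, 100_001):
    m = np.clip(y * (X @ w), None, 50)  # margins (clipped to avoid overflow in exp)
    p = 1.0 / (1.0 + np.exp(m))         # = sigmoid(-margin)
    w += lr * (X.T @ (y * p)) / n       # gradient descent step on the logistic loss
    if t in (100, 1_000, 10_000, 100_000):
        print(t, np.min(y * (X @ w)) / np.linalg.norm(w))

# ||w|| diverges while the normalized margin min_i y_i <x_i, w> / ||w|| keeps
# improving, but only at a logarithmic rate, which is why the behavior along
# the trajectory, and early stopping, matter.
```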
Rong Ge
Duke University
Competing Mechanisms in Training Dynamics
View Slides (PDF)
Recent research has identified and analyzed many mechanisms for the training dynamics of neural networks and other overparametrized models, including neural tangent kernel and mean-field limits. However, in practice the training dynamics are often more complicated and may involve competition between different mechanisms. In this talk we will first see a simple example for overparametrized sparse linear models, where we show that under a suitable parametrization the training will transition from a feature-learning phase to an NTK phase (which memorizes noise), achieving near-optimal sample complexity for an interpolating model. We will then look at a more complicated example on the competition between in-context and in-weight learning for simple transformers.
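The lazy-versus-rich contrast behind this competition can be illustrated with a standard toy: a diagonal linear network w = u*u - v*v fit to sparse noiseless data, where the initialization scale (standing in here for the parametrization choices analyzed in the talk) determines whether training behaves like feature learning or like an NTK-style kernel fit. The dimensions, scales, and step size below are illustrative.

```python
# Minimal sketch of lazy (NTK-like) vs. feature-learning behavior in a diagonal
# linear network w = u*u - v*v on a sparse regression problem. A standard toy,
# not the specific parametrization or phase transition analyzed in the talk.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 60, 200, 3
X = rng.normal(size=(n, d))
w_true = np.zeros(d); w_true[:k] = 1.0
y = X @ w_true

def train(init_scale, steps=20_000, lr=5e-3):
    u = np.full(d, init_scale)
    v = np.full(d, init_scale)
    for _ in range(steps):
        w = u * u - v * v
        g = X.T @ (X @ w - y) / n          # gradient of the loss w.r.t. w
        u -= lr * 2 * u * g                # chain rule through w = u^2 - v^2
        v += lr * 2 * v * g
    return u * u - v * v

for scale in (1.0, 1e-3):                  # large init ~ lazy/NTK, small init ~ rich/sparse
    w_hat = train(scale)
    X_test = rng.normal(size=(500, d))
    err = np.mean((X_test @ (w_hat - w_true)) ** 2)
    nonzeros = int((np.abs(w_hat) > 1e-2).sum())
    print(f"init {scale:g}: test MSE {err:.4f}, nonzero coords {nonzeros}")
```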
Alejandro Ribeiro
University of Pennsylvania
Spectral Analyses of Graph Neural Networks
Layers of graph neural networks (GNNs) involve linear operators that admit scalar representations on the graph’s spectrum. We use these spectral representations to investigate the response of GNNs to graph perturbations (stability) and scaling towards continuous limits in the form of manifolds or graphons (transferability). We uncover that stability and transferability are determined by the Lipschitz constant of these frequency responses. Since Lipschitz constants are related to spectral discriminability, these analyses uncover fundamental stability and transferability vs. discriminability tradeoffs. GNNs that attempt to discern features aligned with graph eigenvectors associated with larger eigenvalues are more unstable and more difficult to transfer at scale. Scalar (or low-dimensional) spectral representations are common to any architecture that involves the composition of an operator. This includes GNNs as well as standard convolutional neural networks (CNNs) in time and space, along with other less common information processing architectures. We use tools of algebraic signal processing to show that all these architectures share analogous stability and transferability properties. Materials available at gnn.seas.upenn.edu.
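For readers unfamiliar with the spectral picture, the sketch below implements a polynomial graph filter H(S) = h_0 I + h_1 S + h_2 S^2, perturbs the graph, and evaluates the filter's frequency response on the spectrum of S. It is only an illustration of the objects involved (the graph, filter taps, and perturbation size are arbitrary), not the stability or transferability analysis itself; see gnn.seas.upenn.edu for the actual framework.

```python
# Minimal sketch of the spectral view of a graph filter H(S) = sum_k h_k S^k.
import numpy as np

rng = np.random.default_rng(2)
n = 30
A = (rng.random((n, n)) < 0.2).astype(float)
S = np.triu(A, 1)
S = S + S.T                                        # symmetric adjacency (shift operator)

def graph_filter(S, x, h):
    """Apply H(S) x = sum_k h_k S^k x."""
    y, Sx = np.zeros_like(x), x.copy()
    for hk in h:
        y += hk * Sx
        Sx = S @ Sx
    return y

h = np.array([0.5, 0.3, 0.1])                      # filter taps h_0, h_1, h_2
x = rng.normal(size=n)

E = 0.01 * rng.normal(size=(n, n))
E = (E + E.T) / 2                                  # small symmetric graph perturbation
y, y_pert = graph_filter(S, x, h), graph_filter(S + E, x, h)
print("relative output change:", np.linalg.norm(y - y_pert) / np.linalg.norm(y))

# Frequency response on the spectrum of S: filters whose response varies slowly
# (small Lipschitz constant) are more stable but less able to discriminate
# nearby eigenvalues, which is the tradeoff discussed in the talk.
lam = np.linalg.eigvalsh(S)
freq_response = np.polyval(h[::-1], lam)           # h_0 + h_1*lam + h_2*lam^2
```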
Nati Srebro
Toyota Technological Institute at Chicago
Weak to Strong Generalization in Random Feature Models
Weak-to-strong generalization (Burns et al., 2023) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong and complex learner like GPT-4, nor pre-training. We consider students and teachers that are random feature models, described by two-layer networks with a random and fixed bottom layer and a trained top layer. A ‘weak’ teacher, with a small number of units (i.e., random features), is trained on the population, and a ‘strong’ student, with a much larger number of units (i.e., random features), is trained only on labels generated by the weak teacher. We demonstrate, prove, and explain how the student can outperform the teacher, even though it is trained only on data labeled by the teacher, with no pretraining or other knowledge or data advantage over the teacher. We explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
Joint work with Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora and Zhiyuan Li.
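The setup can be reproduced in a few lines with random ReLU features. The sketch below is only an experimental scaffold with arbitrary choices of target function, feature counts, sample sizes, and stopping time; whether the student actually beats the teacher depends on those choices rather than on any claim made here.

```python
# Minimal experimental scaffold for the weak-to-strong setup (random ReLU
# feature models). Illustrative only: the target, feature counts, sample sizes,
# and stopping time are arbitrary, and the teacher is fit on a finite sample
# as a stand-in for training on the population.
import numpy as np

rng = np.random.default_rng(3)
d, m_weak, m_strong, n_train, n_test = 10, 20, 2000, 500, 5000

def features(X, W):
    return np.maximum(X @ W.T, 0) / np.sqrt(W.shape[0])   # random ReLU features

def target(X):
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2            # hypothetical ground truth

X_tr, X_te = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
W_weak, W_strong = rng.normal(size=(m_weak, d)), rng.normal(size=(m_strong, d))

# Weak teacher: least-squares fit of its few features to the true labels.
Phi_w = features(X_tr, W_weak)
a_w = np.linalg.lstsq(Phi_w, target(X_tr), rcond=None)[0]

# Strong student: gradient descent on the *teacher's* labels, early-stopped.
Phi_s = features(X_tr, W_strong)
y_teacher = Phi_w @ a_w
a_s = np.zeros(m_strong)
for t in range(300):                                        # early stopping: few steps
    a_s -= 0.5 * Phi_s.T @ (Phi_s @ a_s - y_teacher) / n_train

teacher_err = np.mean((features(X_te, W_weak) @ a_w - target(X_te)) ** 2)
student_err = np.mean((features(X_te, W_strong) @ a_s - target(X_te)) ** 2)
print("teacher test MSE:", teacher_err)
print("student test MSE:", student_err)
```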
René Vidal
University of Pennsylvania
Learning Dynamics in the Feature-Learning Regime: Implicit Bias, Robustness, and Low-Rank Adaptation
This talk will present new insights into the learning dynamics of gradient flow in the feature-learning regime. For orthogonally separable data, we show that the neurons of a two-layer ReLU network align with class directions, yielding an approximately low-rank structure before the loss converges at a rate of 1/t. We also show that while trained ReLU networks can be non-robust to adversarial perturbations, using normalized polynomial ReLU activations ensures both generalization and provable robustness without the need for adversarial training. Finally, we analyze Low-Rank Adaptation (LoRA) for matrix factorization, showing that gradient flow converges to a neighborhood of the optimal solution, with an error that depends on the misalignment between pretraining and finetuning tasks. These results highlight how dynamics, data structure, architecture, and initialization jointly determine generalization, robustness, and adaptation.
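As a small illustration of the last of these settings, the following sketch runs gradient descent on a LoRA-style rank-r update B @ A of a fixed "pretrained" matrix toward a finetuning target. The dimensions, ranks, initialization, and the misalignment knob are illustrative choices, not the precise model or analysis from the talk.

```python
# Minimal sketch of low-rank adaptation (LoRA) for a matrix-approximation task.
# Starting from a "pretrained" W0, gradient descent on a rank-r update B @ A
# tries to close the gap to a finetuning target W_ft.
import numpy as np

rng = np.random.default_rng(4)
d, r, r_true = 50, 4, 4
W0 = rng.normal(size=(d, d)) / np.sqrt(d)                 # pretrained weights
delta = rng.normal(size=(d, r_true)) @ rng.normal(size=(r_true, d)) / d
misalign = 0.0                                            # try > 0: adds a full-rank
W_ft = W0 + delta + misalign * rng.normal(size=(d, d)) / d  # component LoRA cannot fit

B = np.zeros((d, r))                                      # common LoRA-style init: B = 0,
A = 0.01 * rng.normal(size=(r, d))                        # A small random
lr = 0.2
for t in range(5000):
    R = W0 + B @ A - W_ft                                 # residual
    B, A = B - lr * R @ A.T, A - lr * B.T @ R             # gradient steps on ||R||_F^2 / 2

print("final error:", np.linalg.norm(W0 + B @ A - W_ft))
# With misalign = 0 the rank-r update can match the target; with misalign > 0
# gradient descent only reaches a neighborhood whose size grows with the
# mismatch between the pretrained and finetuning tasks.
```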
Soledad Villar
Johns Hopkins University
Algebraic Techniques for Machine Learning
During the 2023 PI meeting, the NSF program officers asked which areas of mathematics could play a significant role in machine learning but remain underutilized. The answer across the board was ‘algebra.’ In this talk, we present a few ideas based on algebraic techniques (invariant theory, Galois theory, and representation stability) to analyze existing machine learning models and design new ones.
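One of the simplest instances of this program is building features of a point cloud that are invariant to both permutations of the points and orthogonal transformations of the coordinates. The sketch below uses the classical Gram-matrix construction, an example of the invariant-theory viewpoint rather than the specific methods from the talk.

```python
# Minimal sketch of O(d)- and permutation-invariant features of a point cloud
# built from its Gram matrix (a standard invariant-theory construction).
import numpy as np

def invariant_features(X: np.ndarray, k: int = 4) -> np.ndarray:
    """X: (n, d) point cloud. Returns features invariant to X -> X @ Q for
    orthogonal Q and to any reordering of the n points."""
    G = X @ X.T                               # Gram matrix: removes orthogonal transforms
    # Traces of powers of G are symmetric in the points, hence permutation invariant.
    return np.array([np.trace(np.linalg.matrix_power(G, p)) for p in range(1, k + 1)])

rng = np.random.default_rng(5)
X = rng.normal(size=(7, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
perm = rng.permutation(7)

f1 = invariant_features(X)
f2 = invariant_features(X[perm] @ Q)          # rotate and reorder the same cloud
print(np.allclose(f1, f2))                    # True: the features do not change
```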
Bin Yu
University of California, Berkeley
Understanding Deep Learning Models via Interaction Importance
View Slides (PDF)
The power of deep learning models lies in their ability to capture interactions among input features, embeddings, subnetworks, and the response. In this talk, we will introduce computationally efficient algorithms that measure interaction importance for a pre-trained deep learning model, including LLMs: Contextual Decomposition for Transformers (CD-T) and the Spectral Explainer (SPEX and proxy SPEX). We evaluate their effectiveness using faithfulness, predictivity, and reasoning metrics.
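To illustrate what interaction importance means at its most basic, here is a generic ablation-based pairwise interaction score for a black-box model: a second-order finite difference that vanishes when two features contribute additively. It is not CD-T or SPEX, and the toy model and baseline below are arbitrary.

```python
# Minimal sketch of an ablation-based pairwise interaction score (a simple
# baseline notion of interaction importance, not CD-T or SPEX).
import numpy as np
from itertools import combinations

def interaction_score(f, x, baseline, i, j):
    """I(i, j) = f(x) - f(x_{-i}) - f(x_{-j}) + f(x_{-ij}),
    where x_{-S} replaces the features in S by baseline values.
    Nonzero only if features i and j jointly affect the output."""
    def ablate(idx):
        z = x.copy()
        z[list(idx)] = baseline[list(idx)]
        return f(z)
    return ablate(()) - ablate((i,)) - ablate((j,)) + ablate((i, j))

# Toy model with a genuine x0*x1 interaction and an additive x2 term.
def f(z):
    return 2.0 * z[0] * z[1] + 3.0 * z[2]

x, baseline = np.array([1.0, 2.0, 3.0]), np.zeros(3)
for i, j in combinations(range(3), 2):
    print((i, j), interaction_score(f, x, baseline, i, j))
# Only the (0, 1) pair gets a nonzero score; (0, 2) and (1, 2) are additive.
```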