2024 Mathematical and Scientific Foundations of Deep Learning Annual Meeting
Organizers:
Peter Bartlett, University of California, Berkeley
René Vidal, University of Pennsylvania
Meeting Goals:
This meeting brought together members of the NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning (MoDL) and researchers working on related topics. The focus of the meeting was the set of challenging theoretical questions posed by deep learning methods and the development of mathematical and statistical tools to understand their success and limitations, to guide the design of more effective methods, and to initiate the study of the mathematical problems that emerge. The meeting reported on progress in these topics and stimulated discussions of future directions.
Meeting Report:
The meeting brought together members of the two NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning, along with other participants in the NSF SCALE MoDL program. These collaborations aim to understand the success and limitations of deep learning. Talks at the meeting reported on the collaborations' progress over the past year.
Rong Ge (Duke University) opened the workshop with an insightful presentation on the in-context learning capabilities of linear transformers. He highlighted recent findings that reveal these models implicitly execute gradient-descent-like algorithms during their forward inference steps. He also illustrated how a single linear transformer can effectively solve regression problems with varying noise levels while simultaneously leveraging a task descriptor to enhance performance.
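To make the connection concrete, the following is a minimal, illustrative sketch (not the construction from the talk; the dimensions, stepsize, and weight choices are assumptions) showing that one step of gradient descent on in-context least-squares examples can be written as a linear-attention readout:

    import numpy as np

    # Minimal sketch: one step of gradient descent on in-context least squares,
    # written as a linear-attention readout.  Purely illustrative choices below.
    rng = np.random.default_rng(0)
    d, n, eta = 5, 20, 0.1

    w_star = rng.normal(size=d)                 # ground-truth regression vector
    X = rng.normal(size=(n, d))                 # in-context examples
    y = X @ w_star                              # noiseless labels for simplicity
    x_query = rng.normal(size=d)

    # Gradient descent on (1/2n) * ||Xw - y||^2, one step starting from w = 0:
    w1 = (eta / n) * X.T @ y
    pred_gd = w1 @ x_query

    # The same prediction as an (unnormalized) linear-attention readout:
    # values are the labels y_i, attention scores are <x_i, x_query>.
    scores = X @ x_query
    pred_attn = (eta / n) * (y * scores).sum()

    print(np.isclose(pred_gd, pred_attn))       # True: identical computations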
Elchanan Mossel (Massachusetts Institute of Technology) presented a theoretical perspective on the advantages of depth in deep learning inference. His framework is composed of four key components: identifying natural data models, ensuring computational and statistical efficiency in inference, demonstrating that depth (or another complexity measure) is essential, and establishing that the inference procedure can be efficiently learned from data.
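As the abstract later in this report explains, one setting where this program has been carried out is the broadcast model on a tree, with belief propagation as the inference procedure. The sketch below is a simplified illustration of that setting (the depth and flip probability are arbitrary choices, not parameters from the talk):

    import numpy as np

    # Sketch of the broadcast model on a complete binary tree and belief
    # propagation for the root bit.  Depth and flip probability are
    # illustrative choices.
    rng = np.random.default_rng(1)
    depth, p_flip = 8, 0.1

    def broadcast(bit, d):
        """Recursively broadcast a bit down a binary tree; return leaf values."""
        if d == 0:
            return [bit]
        leaves = []
        for _ in range(2):
            child = bit if rng.random() > p_flip else 1 - bit
            leaves += broadcast(child, d - 1)
        return leaves

    def bp_root_posterior(leaves):
        """Upward belief-propagation pass; returns P(root = 1 | leaves)."""
        msgs = [np.array([1.0 - b, float(b)]) for b in leaves]   # leaf evidence
        M = np.array([[1 - p_flip, p_flip], [p_flip, 1 - p_flip]])  # channel
        while len(msgs) > 1:
            nxt = []
            for i in range(0, len(msgs), 2):
                left, right = M @ msgs[i], M @ msgs[i + 1]       # child-to-parent messages
                belief = left * right
                nxt.append(belief / belief.sum())
            msgs = nxt
        return msgs[0][1]

    root = 1
    leaves = broadcast(root, depth)
    print(f"P(root = 1 | leaves) ~ {bp_root_posterior(leaves):.3f}")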
Jeremias Sulam (Johns Hopkins University) presented two innovative approaches aimed at making general deep-learning models more interpretable. The first approach focused on unsupervised learning in the context of inverse problems, including a principled framework for learning proximal networks that achieves state-of-the-art performance while offering insights into the learned data priors. The second approach tackled supervised classification problems in computer vision, employing a novel betting framework to assess the semantic importance of various concepts. Through these methods, Jeremias highlighted the potential to unlock deeper insights into model behavior, ultimately facilitating more transparent and trustworthy AI applications.
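For readers unfamiliar with the first setting, the following toy sketch shows the proximal-gradient loop that a learned proximal network plugs into; here the proximal step is a hand-crafted soft-threshold standing in for the learned network, and the problem sizes are illustrative assumptions:

    import numpy as np

    # Illustrative sketch of proximal gradient for a linear inverse problem
    # y = A x + noise.  The prox below is the soft-threshold (prox of an L1
    # prior); a learned proximal network would replace `prox` while keeping
    # the same iteration.
    rng = np.random.default_rng(2)
    m, n, lam, steps = 40, 100, 0.05, 200

    A = rng.normal(size=(m, n)) / np.sqrt(m)
    x_true = np.zeros(n)
    x_true[rng.choice(n, 5, replace=False)] = 1.0        # sparse ground truth
    y = A @ x_true + 0.01 * rng.normal(size=m)

    def prox(v, t):
        """Soft-thresholding: prox of t * ||.||_1 (stand-in for a learned prox)."""
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    eta = 1.0 / np.linalg.norm(A, 2) ** 2                # stepsize from Lipschitz constant
    x = np.zeros(n)
    for _ in range(steps):
        x = prox(x - eta * A.T @ (A @ x - y), eta * lam)

    print("recovery error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))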
Bin Yu (University of California, Berkeley) presented her work on the efficient fine-tuning of large deep-learning models, leveraging insights from infinite-width theory. She focused on two critical aspects of low-rank adaptation (LoRA) that enhance fine-tuning practices. First, Bin highlighted that using the same learning rate for the two LoRA matrices is suboptimal, often resulting in inefficient fine-tuning for large models. She proposed a straightforward yet effective modification: applying a significantly larger learning rate for one matrix to improve feature learning efficiency. Second, Bin discussed the impact of initialization on LoRA fine-tuning dynamics, revealing that a specific initialization scheme—setting one matrix to random values while keeping the other matrix at zero—generally yields better performance compared to the reverse initialization. Her findings offer valuable guidelines for improving the fine-tuning of large models, paving the way for more efficient and effective deep learning practices.
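The two recommendations translate directly into code. The toy sketch below fine-tunes a frozen linear layer with LoRA, using a zero-initialized B with random A and a much larger learning rate for B; all sizes and rates are illustrative assumptions rather than settings from the talk:

    import numpy as np

    # Toy LoRA fine-tuning of a frozen linear layer: initialize B = 0 with A
    # random, and use lr_B >> lr_A.  Illustrative sizes and rates only.
    rng = np.random.default_rng(3)
    d_in, d_out, r = 64, 64, 4
    lr_A, lr_B = 1e-3, 1e-1                               # much larger rate for B

    W0 = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)   # frozen pre-trained weight
    A = rng.normal(size=(r, d_in)) / np.sqrt(d_in)        # A: random init
    B = np.zeros((d_out, r))                              # B: zero init

    W_target = W0 + 0.1 * rng.normal(size=(d_out, d_in))  # fine-tuning target
    X = rng.normal(size=(256, d_in))
    Y = X @ W_target.T

    for _ in range(500):
        W = W0 + B @ A                                    # effective weight; W0 stays frozen
        err = X @ W.T - Y                                 # residual on fine-tuning data
        grad_W = err.T @ X / len(X)                       # gradient w.r.t. the effective weight
        grad_B, grad_A = grad_W @ A.T, B.T @ grad_W
        B -= lr_B * grad_B
        A -= lr_A * grad_A

    print("fit error:", np.linalg.norm(X @ (W0 + B @ A).T - Y) / np.linalg.norm(Y))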
Nikolai Matni (University of Pennsylvania) discussed the complexities of learning to control dynamical systems. He explored the fundamental question of what factors contribute to the ease or difficulty of learning to control, drawing from both theoretical insights and practical applications. Nikolai outlined key elements, such as system dynamics, data availability, and the structure of the learning algorithm, that significantly influence the learning process. He introduced a framework that analyzes how various properties of the system and the controller interact, leading to insights on when control tasks can be efficiently learned versus when they pose significant challenges. By illustrating these concepts with examples from real-world applications, Nikolai provided a comprehensive understanding of the conditions that simplify or complicate learning to control, highlighting the importance of aligning learning strategies with the specific characteristics of the control task at hand.
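As a toy illustration of one ingredient of such an analysis (the particular system, noise level, and horizon below are illustrative assumptions, not examples from the talk), identifying the dynamics of a linear system from a single trajectory reduces to least squares, and the accuracy already depends on properties such as the system's stability and how exciting the inputs are:

    import numpy as np

    # Toy sketch: identify linear dynamics x_{t+1} = A x_t + B u_t + w_t from
    # one trajectory by least squares.  Illustrative system and horizon.
    rng = np.random.default_rng(4)
    n, m, T = 3, 1, 500

    A = np.array([[0.9, 0.2, 0.0],
                  [0.0, 0.8, 0.1],
                  [0.0, 0.0, 0.7]])
    B = np.array([[0.0], [0.5], [1.0]])

    x = np.zeros(n)
    X_next, Z = [], []                             # next states and regressors
    for _ in range(T):
        u = rng.normal(size=m)                     # exciting (random) input
        x_new = A @ x + B @ u + 0.05 * rng.normal(size=n)
        Z.append(np.concatenate([x, u]))
        X_next.append(x_new)
        x = x_new

    Z, X_next = np.array(Z), np.array(X_next)
    theta, *_ = np.linalg.lstsq(Z, X_next, rcond=None)   # estimate [A B]
    A_hat, B_hat = theta[:n].T, theta[n:].T
    print("||A_hat - A|| =", np.linalg.norm(A_hat - A))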
Jingfeng Wu (University of California, Berkeley) presented work that explores the use of large stepsizes in gradient descent optimization for deep learning, challenging the conventional wisdom regarding stepsize selection. He argued that while smaller stepsizes are typically favored for stability, larger stepsizes can lead to faster convergence and enhanced generalization. Specifically, he discussed the “edge of stability” phenomenon, where gradient descent operates at the boundary between stability and instability, potentially accelerating optimization without the need for momentum or varying stepsizes. He then presented theoretical insights and empirical results demonstrating that large stepsizes can improve optimization efficiency under certain conditions, such as logistic regression with separable data and neural networks with appropriate activation functions. He also showed that this type of analysis can be extended to stochastic gradient descent and other loss functions, offering a comprehensive theory of large stepsize gradient descent.
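A small experiment in the spirit of these results can be run in a few lines; the data, stepsize, and horizon below are illustrative assumptions:

    import numpy as np

    # Gradient descent on logistic loss with linearly separable data and a
    # deliberately large constant stepsize.  Illustrative choices throughout.
    rng = np.random.default_rng(5)
    n, d, T, eta = 100, 10, 200, 50.0

    w_star = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = np.sign(X @ w_star)                        # separable labels

    def loss(w):
        return np.logaddexp(0.0, -y * (X @ w)).mean()   # numerically stable logistic loss

    w = np.zeros(d)
    losses = []
    for t in range(T):
        margins = y * (X @ w)
        p = np.exp(-np.logaddexp(0.0, margins))    # sigmoid(-margin), computed stably
        grad = -(X.T @ (y * p)) / n
        w -= eta * grad
        losses.append(loss(w))

    # The loss typically oscillates for a while before decreasing, consistent
    # with the edge-of-stability picture described above.
    print("first 5 losses:", np.round(losses[:5], 3))
    print("final loss:   ", losses[-1])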
Gitta Kutyniok (Ludwig Maximilian University of Munich) emphasized the necessity of a mathematical foundation to enhance the reliability of AI methodologies, outlining recent progress in establishing generalization bounds and reliable explainability techniques. Gitta then delved into the fundamental limitations of computability that pose challenges to AI reliability, particularly when the power consumption of current algorithms and model architectures is accounted for, proposing solutions that connect these issues to the future of AI computing. This connection not only addresses the reliability concerns but also sheds light on the pressing energy inefficiencies of current AI technologies and on how potential solutions could arise from spiking neural architectures. Toward this goal, Gitta provided theoretical guarantees on various aspects of computability with spiking network architectures. Through her talk, Gitta highlighted the critical intersection of mathematical theory and practical applications, paving the way for next-generation AI systems that are effective, trustworthy, and resource-efficient.
Misha Belkin (University of California, San Diego) closed the workshop with a presentation of his work on a theoretical understanding of the grokking phenomenon. He showed that grokking is not specific to gradient-based training of neural networks, but also occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with a general machine learning model. He showed that within the first few iterations of RFM the training accuracy reaches 100%, while the test accuracy first remains at a random level and then quickly transitions to 100%. He associated this transition with the gradual appearance of circulant patterns in the AGOP feature matrices during the early iterations, which is essential for learning modular arithmetic. Through his talk, Misha connected the emergent grokking properties of neural networks with their ability to learn features, paving the way for a deeper understanding of feature learning.
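In rough outline (a simplified sketch based on the general description above, not the exact setup from the talk), an RFM iteration alternates kernel regression under a metric M with an AGOP update of M:

    import numpy as np

    # Simplified Recursive Feature Machine (RFM) sketch: alternate (i) kernel
    # ridge regression with a Gaussian kernel whose metric is M and (ii) an
    # AGOP update of M.  Target function and hyperparameters are illustrative.
    rng = np.random.default_rng(6)
    n, d, reg, iters = 300, 10, 1e-3, 5

    X = rng.normal(size=(n, d))
    y = X[:, 0] * X[:, 1]                          # target depends on 2 of 10 coordinates

    def kernel(A, B, M):
        diff = A[:, None, :] - B[None, :, :]       # pairwise differences
        return np.exp(-0.5 * np.einsum('ijk,kl,ijl->ij', diff, M, diff))

    M = np.eye(d)
    for _ in range(iters):
        K = kernel(X, X, M)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)      # kernel ridge fit
        # Gradient of f(x) = sum_i alpha_i k(x, x_i) at each training point:
        # grad f(x) = -M @ sum_i alpha_i k(x, x_i) (x - x_i)
        G = np.zeros((n, d))
        for j in range(n):
            w = alpha * K[j]                       # alpha_i * k(x_j, x_i)
            G[j] = -M @ (w @ (X[j] - X))           # weighted sum of differences
        M = G.T @ G / n                            # AGOP update
        M /= np.trace(M) / d                       # rescale (illustrative choice)

    # The diagonal mass typically concentrates on the two coordinates y depends on.
    print("AGOP diagonal (feature relevance):", np.round(np.diag(M), 2))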
Agenda:
Thursday, September 26
9:30 AM Rong Ge | What can linear transformers learn in context
11:00 AM Elchanan Mossel | Why depth? A theoretical perspective on the advantages of depth in inference
1:00 PM Jeremias Sulam | Yay, my deep network works! But… what did it learn?
2:30 PM Bin Yu | Efficient fine-tuning of large deep-learning models via infinite-width theory and experiments
4:00 PM Nikolai Matni | What makes learning to control easy or hard?

Friday, September 27
9:30 AM Jingfeng Wu | Reimagining Gradient Descent: Large Stepsize, Oscillation, and Acceleration
11:00 AM Gitta Kutyniok | Reliable AI: From mathematical foundations to next generation AI computing
1:00 PM Misha Belkin | Emergence and Grokking in "Simple" Architectures

Abstracts:
Misha Belkin
University of California, San Diego
Emergence and Grokking in “Simple” Architectures
View Slides (PDF)
In recent years transformers have become a dominant machine learning methodology. A key element of transformer architectures is a standard neural network (MLP). I argue that MLPs alone already exhibit many remarkable behaviors observed in modern LLMs, including emergent phenomena. Furthermore, despite large amounts of work, we are still far from understanding how 2-layer MLPs learn relatively simple problems, such as “grokking” modular arithmetic. I will discuss recent progress and will argue that feature-learning kernel machines (Recursive Feature Machines) isolate some key computational aspects of modern neural architectures and are preferable to MLPs as a model for analysis of emergent phenomena as well as a powerful predictor in their own right.
Rong Ge
Duke University
What can linear transformers learn in context
View Slides (PDF)
Large language models exhibit strong in-context learning capabilities — their performance can improve given a few in-context examples provided in the prompt. Recent research used linear regression as a simple setting to understand in-context learning. Results have demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in context during their forward inference step. In this talk, we show that even linear transformers are very versatile and can go beyond simple gradient descent on more interesting data. In particular, we show how the same linear transformer can simultaneously handle regression problems with different noise levels and how it can leverage a task descriptor.
Gitta Kutyniok
Ludwig Maximilian University of Munich
Reliable AI: From mathematical foundations to next generation AI computing
Artificial intelligence is currently leading to one breakthrough after the other, in industry, public life and the sciences. However, one current major drawback worldwide, in particular, in light of regulations such as the EU AI Act and the G7 Hiroshima AI Process, is the lack of reliability of such methodologies.
In this talk, we will first highlight the role of a mathematical perspective to this highly topical research direction and survey our recent advances concerning generalization bounds and reliable explainability approaches. We then discuss fundamental limitations in terms of computability, which affect AI’s reliability, and show solutions to this serious obstacle by revealing an intriguing connection to next generation AI computing, thereby also touching upon the enormous energy problem of current AI technology.
Nikolai Matni
University of Pennsylvania
What makes learning to control easy or hard?
View Slides (PDF)
Designing autonomous systems that are simultaneously high-performing, adaptive and provably safe remains an open problem. In this talk, we will argue that in order to meet this goal, new theoretical and algorithmic tools are needed that blend the stability, robustness and safety guarantees of robust control with the flexibility, adaptability and performance of machine and reinforcement learning. We will highlight our progress towards developing such a theoretical foundation of robust learning for safe control in the context of the following case studies: (i) characterizing fundamental limits of learning-enabled control, (ii) developing novel robust imitation learning algorithms with finite sample-complexity guarantees and, if time allows, (iii) leveraging data from diverse but related tasks for efficient multi-task learning for control. In all cases, we will emphasize the interplay between robust learning, robust control and robust stability and their consequences on the sample-complexity and generalizability of the resulting learning-based control algorithms.
Elchanan Mossel
Massachusetts Institute of Technology
Why depth? A theoretical perspective on the advantages of depth in inference
View Slides (PDF)
Can theory help explain the success of deep nets on real data? One avenue to explore this question is to ask if we can find:
1. Natural data models where:
2. Inference is computationally and statistically efficient,
3. Inference requires depth (or some other measure of complexity) and
4. The inference procedure can be learned efficiently from data.
As proving depth lower bounds in theoretical computer science for explicit objects is hard, perhaps the most difficult task is to establish 3. I will discuss some recent works that try to establish 1–4 for the broadcast model on the tree, where the inference procedure is belief propagation.
Based on:
- https://arxiv.org/pdf/2402.13359
- https://dl.acm.org/doi/abs/10.1145/3564246.3585155
- https://proceedings.neurips.cc/paper_files/paper/2022/hash/77e6814d32a86b76123bd10aa7e2ad81-Abstract-Conference.html
- https://proceedings.mlr.press/v125/moitra20a.html
- https://arxiv.org/pdf/1612.09057
Jeremias Sulam
Johns Hopkins University
Yay, my deep network works! But… what did it learn?
View Slides (PDF)
Modern machine-learning methods are revolutionizing what we can do with data — from TikTok video recommendations to biomarker discovery in cancer research. Yet, the complexity of these deep models makes it harder to understand what functions these data-dependent models are computing, and which features they detect and regard as important for a given task. In this talk, I will review two approaches for making general deep-learning models more interpretable: one in an unsupervised setting, in the context of imaging inverse problems, through learned proximal networks; and one in supervised classification problems for computer vision, by testing for the semantic importance of concepts via betting.
Jingfeng Wu
University of California, Berkeley
Reimagining gradient descent: Large stepsize, oscillation, and acceleration
View Slides (PDF)
Gradient descent (GD) and its variants are pivotal in machine learning, particularly deep learning. Conventional wisdom suggests smaller stepsizes for stability, yet in practice, larger stepsizes often yield faster convergence and improved generalization, despite initial instability. This talk delves into the dynamics of GD with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize eta is so large that the loss initially oscillates. We show that GD exits the initial oscillatory phase rapidly, in O(eta) steps, and subsequently achieves an O(1/(t eta)) convergence rate. Our results imply that, given a budget of T steps, GD can achieve an accelerated loss of O(1/T^2) with an aggressive stepsize of eta = Theta(T), without any use of momentum or variable stepsize schedulers. This suggests that large stepsize GD achieves accelerated optimization by entering an initially unstable regime. Based on the new insights drawn from the linear model, I will further discuss the provable benefits of large stepsizes for GD in training non-linear neural networks.
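A back-of-the-envelope combination of the stated rates (with c standing in for the unspecified burn-in constant, an assumption made only for illustration) shows where the acceleration comes from: the burn-in takes about c*eta steps, after which the loss is O(1/(eta*(t - c*eta))). Taking eta = T/(2c) within a budget of T steps gives a loss of O(1/(eta*(T - c*eta))) = O(1/((T/2c)*(T/2))) = O(c/T^2), matching the O(1/T^2) accelerated rate quoted above.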
Bin Yu
University of California, Berkeley
Efficient fine-tuning of large deep-learning models via infinite-width theory and experiments
View Slides (PDF)
In this talk, we will describe our recent works that use both infinite-width theory and extensive experiments to obtain practical lessons regarding three aspects of low-rank adaptation (LoRA) for efficient fine-tuning of large deep-learning models:
1. Learning rate parametrization: We show that the standard LoRA with the same learning rate for A and B is suboptimal in the sense that it leads to inefficient fine-tuning (in large models), and we propose a simple modification: set a much larger learning rate for matrix B to achieve more efficient feature learning.
2. Learning rate transfer: We show that decreasing the model size of a large pre-trained model in a principled way still preserves the optimal hyperparameters for fine-tuning, extending previous work from Yang et al. (2022) for model pre-training. In particular, we introduce a novel non-uniform downsampling procedure for decreasing model size by combining results from infinite-width neural network theory with classical statistical sampling theory.
3. Impact of initialization: We briefly discuss how initialization influences LoRA fine-tuning dynamics and show that one initialization scheme (A set to random and B to zero) generally leads to better performance compared to the other initialization scheme (A set to zero and B to random).
This talk is based on joint work with Nikhil Ghosh and Soufiane Hayou at University of California, Berkeley.