Sanjeev Arora, Princeton University
Nina Balcan, Carnegie Mellon University
Sham Kakade, University of Washington
Sanjoy Dasgupta, University of California, San Diego
The second Simons Symposium on New Directions in Theoretical Machine Learning focused on the following questions:
- Does lack of fundamental understanding hold back progress toward the goal of AI and general purpose learning agents?
- What kinds of new fundamental models and understanding are needed?
- Is current theory on the right path?
This symposium brought together a diverse group of leading researchers in machine learning, theoretical computer science, natural language processing and computer vision. The goal was to achieve a joint understanding of the state of the art in artificial intelligence and to identify directions of mathematical inquiry that would be of practical interest and potentially amenable to analysis.
A prominent theme of the symposium was transformers, a neural architecture geared towards sequence processing that has defined the state of the art in language processing tasks and that is the basis of systems like GPT-3 and DALL-E that have captured the public’s imagination. As with other deep learning methods, there is still a lot of mystery around how these models work and in what sense they can generalize from the examples they have seen.
The first talk of the symposium, by Chris Manning, began with a history of compositionality in natural language processing: the idea that language has a recursive syntactic structure (as captured by parse trees, for instance) that mirrors its semantics. Manning then talked about the mystery of whether transformers, when trained on language, also encapsulate some form of compositionality, and described a variety of experiments that his group has performed to get at this fundamental question.
Sébastien Bubeck also talked about approaches to understanding transformers. In recent work, he has created a sequence processing problem called “LEGO” that involves simple constraint satisfaction. He has found that a transformer trained on this task achieves perfect performance but is unable to generalize to larger problem sizes. Interestingly, starting instead from a standard transformer (BERT) pretrained on Wikipedia significantly improves generalization. This raises intriguing questions of whether there are general-purpose relational operations (corresponding to the “attention heads” of transformers) that are useful across many different domains and problem types.
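For readers unfamiliar with the mechanism, the relational operation performed by a single attention head can be sketched in a few lines of NumPy. This is an illustrative simplification (random projection matrices, one head, no layers or training), not a faithful rendering of any model discussed above:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """One attention head: each token forms a weighted average of all
    tokens' value vectors, with weights given by query-key similarity."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq, seq) relational scores
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))      # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = attention_head(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

The point of the sketch is that the operation is purely relational: nothing in it is specific to language, which is what makes the question of transfer across domains meaningful.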
Another major theme of the symposium was theoretical and algorithmic advances in reinforcement learning.
Jitendra Malik talked about a line of work that trains a four-legged robot to walk in such a way that it is able to adapt to new terrains quickly and robustly. The starting point is to train the robot in a simulated environment in which all relevant physical parameters (e.g., frictional properties of the walking surface) are known and can be varied. Through these simulations, the robot learns a deep neural net that captures the relevant environmental information in a low-dimensional latent variable and bases its walking policy on this variable. Malik likened this to the exploratory ambulation of a child who is learning to walk. When moved to a real-world context with an unknown environment, the robot then continuously infers the appropriate values of the latent variable based on real-time feedback about the discrepancy between how it expects to move and how it is actually moving. The talk raised many interesting questions about domain adaptation in reinforcement learning.
On the theory side, talks by Simon Du, Dylan Foster and Sham Kakade addressed advances in understanding what generalization is possible in reinforcement learning. Kakade argued strongly against a popular view in empirical AI that every AI task can be formulated as reinforcement learning with a suitable reward function. Another exciting talk, by Samory Kpotufe, described recent work developing an elementary analysis of bandit problems with distributional shift.
Sanjeev Arora gave an overview of key directions in understanding the generalization behavior of neural networks, as well as the current state of the art on these questions. He identified four broad avenues of inquiry. The first concerns favorable properties of the optimization procedures used for training neural networks. One such property, also discussed at length in a talk by Nati Srebro, is that these procedures embody an implicit bias, so that the final model is regularized despite having an astronomical number of parameters (far more than would seem to be necessary). A second avenue is interpretability of deep models, a theme in several talks; drawing on harmonic analysis, Arora identified interpretability, along with the study of emergent properties of deep models, as a promising direction for theory. Relatedly, Grosse’s talk described better algorithms for influence functions, building on well-known optimization ideas. Yet another direction involves understanding how neural networks can be trained to produce representations that are useful across a variety of different but related tasks. One such methodology, also discussed by Tengyu Ma, is self-supervised learning, in which deep nets are trained to solve artificial prediction problems that implicitly capture key structure in the data space. Talks by Yejin Choi and Jacob Steinhardt elaborated at length on large language models; Choi described an intriguing “logic” filter built on top of large language models that greatly improves their performance on many tasks.
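As an illustration of the self-supervised paradigm, the following NumPy sketch implements a contrastive (InfoNCE-style) objective: embeddings of two views of the same example should be similar, and views of different examples dissimilar. It is a generic instance of the methodology, not the specific method presented at the symposium:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive loss: (z1[i], z2[i]) are embeddings of two views of
    the same example (the positive pair); all other rows of z2 serve as
    negatives. Lower loss means positives are relatively more similar."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature           # (n, n) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # positives sit on the diagonal

rng = np.random.default_rng(1)
anchor = rng.normal(size=(4, 32))
aligned = anchor + 0.01 * rng.normal(size=(4, 32))  # "views" of the same points
unrelated = rng.normal(size=(4, 32))
print(info_nce_loss(anchor, aligned) < info_nce_loss(anchor, unrelated))  # True
```

In actual self-supervised training, the views come from data augmentations and the embeddings from a deep network whose parameters are trained to minimize this loss; the sketch only shows the objective itself.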
There were numerous other talks on new directions in machine learning with strong potential for theoretical analysis. Stefanie Jegelka gave an outstanding tutorial on graph neural networks: deep networks that take graphs as input, for instance contact graphs of molecules. Moritz Hardt gave a talk with the menacing title “The invisible hand of prediction” about formally characterizing what happens when machine learning systems are deployed in contexts where their predictions influence the distribution of instances to which they will be applied. Some talks on the last day related machine learning ideas to neuroscience. Sanjoy Dasgupta described memory mechanisms that are implementable by simple networks and are also empirically seen in simple organisms. Surya Ganguli applied methods of analysis from statistical physics to study the problem of data pruning: reducing the size of training sets by eliminating the least informative instances. He argued that the claims about a “scaling effect” of dataset size should be revisited.
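The data-pruning idea from Ganguli’s talk can be illustrated schematically: score each training example by some measure of informativeness and keep only the most informative fraction. The margin-based score below is a hypothetical stand-in, not his actual metric (a small margin is taken to mean a hard, and therefore informative, example):

```python
import numpy as np

def prune_by_margin(X, y, margins, keep_frac=0.5):
    """Data-pruning sketch: keep the keep_frac of examples with the
    smallest margin scores (hardest examples), discard the rest."""
    n_keep = int(len(X) * keep_frac)
    order = np.argsort(margins)   # ascending: smallest margin first
    keep = order[:n_keep]
    return X[keep], y[keep]

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)
margins = rng.uniform(size=100)   # stand-in for a model-derived score
Xp, yp = prune_by_margin(X, y, margins)
print(len(Xp))  # 50
```

The interesting theoretical question is how the choice of scoring function and keep fraction interacts with the usual dataset-size scaling curves, which is what motivates revisiting those claims.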
In all, the symposium was a valuable exchange between theoreticians and practitioners. Quite apart from the talks, the meals and social gatherings provided extensive opportunities for discussing research priorities and thinking through potential impacts of different research avenues.
Agenda & Slides
MONDAY, SEPTEMBER 5
- Chris Manning | Compositionality (in LMs and Generalization)
- Nati Srebro | Benign Overfitting and Why it Matters
- Seb Bubeck | Unveiling Transformers with LEGO
- Sasha Rush | Quantifying LM Behaviors: Prompting and Discourse
- Tengyu Ma | Three Facets of Understanding Pre-Training: Self-Supervised Loss, Inductive Bias and Implicit Bias
TUESDAY, SEPTEMBER 6
- Yejin Choi | The Neuro-Symbolic Continuum Between Language, Knowledge and Reasoning
- Stefanie Jegelka | Representation and Learning in Graph Neural Networks - An Overview of Results
- Surbhi Goel | Sparse Feature Emergence in Deep Learning
- Jason Lee | Beyond NTK via Feature Learning with SGD
- Sanjoy Dasgupta | Memory Games
WEDNESDAY, SEPTEMBER 7
THURSDAY, SEPTEMBER 8
- Jitendra Malik | Adaptive Control via Deep RL, with Applications to Robotics
- Sham Kakade | Why Reward is Not Enough for RL
- Dylan Foster | The Statistical Complexity of Interactive Decision Making
- Jamie Morgenstern | Shifts in Distributions and Preferences in Response to Learning
- Samory Kpotufe | Tracking Significant Changes in Bandits
FRIDAY, SEPTEMBER 9
- Roger Grosse | How to Approximate Neural Net Function Space Distance and Why
- Sanjeev Arora | Deep Learning Becoming Even More of a Black Box: What Can Theory Do?
- Surya Ganguli | Beyond Neural Scaling Laws: Beating Power Law Scaling Through
- Ellen Vitercik | Theoretical Foundations of Machine Learning for Cutting Plane Selection
- Simon Du | Offline Reinforcement Learning