2596 Publications

Birth of a Transformer: A Memory Viewpoint

A. Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou

Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an “induction head” mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.
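
The abstract's view of weight matrices as associative memories can be illustrated with a minimal sketch. It is not the authors' code: the function names, the dimension, and the use of random Gaussian embeddings are all assumptions made for illustration. A matrix built as a sum of outer products of (key, value) embedding pairs returns the stored value when queried with a key, because random high-dimensional embeddings are nearly orthogonal.

```python
import numpy as np

def build_associative_memory(keys, values):
    """Return W = sum_i values[i] keys[i]^T (a sum of outer products)."""
    # keys and values hold one embedding per row, so values.T @ keys
    # accumulates the outer products over all stored pairs.
    return values.T @ keys

def recall(W, query, values):
    """Index of the stored value embedding that best matches W @ query."""
    return int(np.argmax(values @ (W @ query)))

# Hypothetical setup: random, roughly unit-norm embeddings (an assumption).
rng = np.random.default_rng(0)
dim, n_pairs = 512, 10
keys = rng.standard_normal((n_pairs, dim)) / np.sqrt(dim)
values = rng.standard_normal((n_pairs, dim)) / np.sqrt(dim)

W = build_associative_memory(keys, values)
recalled = [recall(W, keys[i], values) for i in range(n_pairs)]
print(recalled)
```

Recall degrades as the number of stored pairs approaches the embedding dimension, since the cross terms between non-orthogonal embeddings grow.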

On Learning Gaussian Multi-index Models with Gradient Flow

A. Bietti, Joan Bruna, L. Pillaud-Vivien

We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear projection and an arbitrary unknown, low-dimensional link function. As such, they constitute a natural template for feature learning in neural networks. We consider a two-timescale algorithm, whereby the low-dimensional link function is learnt with a non-parametric model infinitely faster than the subspace parametrizing the low-rank projection. By appropriately exploiting the matrix semigroup structure arising over the subspace correlation matrices, we establish global convergence of the resulting Grassmannian population gradient flow dynamics, and provide a quantitative description of its associated 'saddle-to-saddle' dynamics. Notably, the timescales associated with each saddle can be explicitly characterized in terms of an appropriate Hermite decomposition of the target link function. In contrast with these positive results, we also show that the related

Extracting thermodynamic properties from van ’t Hoff plots with emphasis on temperature-sensing ion channels

Jakob T. Bullerjahn, S. Hanson

Transient receptor potential (TRP) ion channels are among the most well-studied classes of temperature-sensing molecules. Yet, the molecular mechanism and thermodynamic basis for the temperature sensitivity of TRP channels remain to this day poorly understood. One hypothesis is that the temperature-sensing mechanism can simply be described by a difference in heat capacity between the closed and open channel states. While such a two-state model may be simplistic, it nonetheless has descriptive value, in the sense that it can be used to compare overall temperature sensitivity between different channels and mutants. Here, we introduce a mathematical framework based on the two-state model to reliably extract temperature-dependent thermodynamic potentials and heat capacities from measurements of equilibrium constants at different temperatures. Our framework is implemented in an open-source data analysis package that provides a straightforward way to fit both linear and nonlinear van ’t Hoff plots, thus avoiding some of the previous, potentially erroneous, assumptions when extracting thermodynamic variables from TRP channel electrophysiology data.

November 2, 2023
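
As a rough illustration of a nonlinear van ’t Hoff fit, the sketch below assumes a two-state model with a temperature-independent heat-capacity difference ΔCp, so that ΔH(T) = ΔH0 + ΔCp (T − T0) and ΔS(T) = ΔS0 + ΔCp ln(T/T0), with ln K = −ΔH/(RT) + ΔS/R. It does not use the package mentioned in the abstract. The function names, the reference temperature, and the constant-ΔCp assumption are illustrative choices. Under these assumptions ln K is linear in (ΔH0, ΔS0, ΔCp), so ordinary least squares is enough.

```python
import numpy as np

R = 8.314462618  # molar gas constant, J/(mol K)

def ln_K(T, dH0, dS0, dCp, T0=298.15):
    """ln K(T) for a two-state model with constant dCp (an assumption)."""
    T = np.asarray(T, dtype=float)
    dH = dH0 + dCp * (T - T0)          # enthalpy difference at T
    dS = dS0 + dCp * np.log(T / T0)    # entropy difference at T
    return -dH / (R * T) + dS / R

def fit_van_t_hoff(T, lnK, T0=298.15):
    """Least-squares estimate of (dH0, dS0, dCp) from ln K measurements."""
    T = np.asarray(T, dtype=float)
    # Each column is the coefficient of one parameter in ln K.
    A = np.column_stack([
        -1.0 / (R * T),
        np.full_like(T, 1.0 / R),
        (-(T - T0) / T + np.log(T / T0)) / R,
    ])
    params, *_ = np.linalg.lstsq(A, np.asarray(lnK, dtype=float), rcond=None)
    return params

# Hypothetical, noise-free data generated from known parameters.
true_params = (50e3, 150.0, 2000.0)    # J/mol, J/(mol K), J/(mol K)
temperatures = np.linspace(280.0, 320.0, 9)
fitted = fit_van_t_hoff(temperatures, ln_K(temperatures, *true_params))
print(fitted)
```

Setting ΔCp to zero recovers the familiar linear van ’t Hoff plot of ln K against 1/T. With real, noisy data the three columns are strongly correlated over a narrow temperature range, so parameter uncertainties should be checked.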

Nonlinear Classification Without a Processor

Sam Dillavou, Andrea Liu, Douglas Durian, et al.

Computers, as well as most neuromorphic hardware systems, use central processing and top-down algorithmic control to train for machine learning tasks. In contrast, brains are ensembles of 100 billion neurons working in tandem, giving them tremendous advantages in power efficiency and speed. Many physical systems 'learn' through history dependence, but training a physical system to perform arbitrary nonlinear tasks without a processor has not been possible. Here we demonstrate the successful implementation of such a system: a learning meta-material. This nonlinear analog circuit is composed of identical copies of a single simple element, each following the same local update rule. By applying voltages to our system (inputs), inference is performed by physics in microseconds. When labels are properly enforced (also via voltages), the system's internal state evolves in time, approximating gradient descent. Our system requires no processor. Once trained, it performs inference passively, requiring approximately 100 μW of total power dissipation across its edges. We demonstrate the flexibility and power efficiency of our system by solving nonlinear 2D classification tasks. Learning meta-materials have immense potential as fast, efficient, robust learning systems for edge computing, from smart sensors to medical devices to robotic control.

November 1, 2023

Contrastive power-efficient physical learning in resistor networks

Menachem Stern, Douglas Durian, Andrea Liu, et al.

The prospect of substantial reductions in the power consumption of AI is a major motivation for the development of neuromorphic hardware. Less attention has been given to the complementary research of power-efficient learning rules for such systems. Here we study self-learning physical systems trained by local learning rules based on contrastive learning. We show how the physical learning rule can be biased toward finding power-efficient solutions to learning problems, and demonstrate in simulations and laboratory experiments the emergence of a trade-off between power-efficiency and task performance.

November 1, 2023

Universal scaling of shear thickening transitions

Meera Ramaswamy, E. Katifori, et al.

Nearly all dense suspensions undergo dramatic and abrupt thickening transitions in their flow behavior when sheared at high stresses. Such transitions occur when the dominant interactions between the suspended particles shift from hydrodynamic to frictional. Here, we interpret abrupt shear thickening as a precursor to a rigidity transition and give a complete theory of the viscosity in terms of a universal crossover scaling function from the frictionless jamming point to a rigidity transition associated with friction, anisotropy, and shear. Strikingly, we find experimentally that for two different systems (cornstarch in glycerol and silica spheres in glycerol) the viscosity can be collapsed onto a single universal curve over a wide range of stresses and volume fractions. The collapse reveals two separate scaling regimes due to a crossover between frictionless isotropic jamming and frictional shear jamming, with different critical exponents. The material-specific behavior due to the microscale particle interactions is incorporated into a scaling variable governing the proximity to shear jamming that depends on both stress and volume fraction. This reformulation opens the door to importing the vast theoretical machinery developed to understand equilibrium critical phenomena to elucidate fundamental physical aspects of the shear thickening transition.

Phase plane dynamics of ERK phosphorylation

S. Shvartsman, Sarah McFann, Martin Wühr, Boris Y. Rubinstein

The extracellular signal–regulated kinase (ERK) controls multiple critical processes in the cell and is deregulated in human cancers, congenital abnormalities, immune diseases, and neurodevelopmental syndromes. Catalytic activity of ERK requires dual phosphorylation by an upstream kinase, in a mechanism that can be described by two sequential Michaelis-Menten steps. The estimation of individual reaction rate constants from kinetic data in the full mechanism has proved challenging. Here, we present an analytically tractable approach to parameter estimation that is based on the phase plane representation of ERK activation and yields two combinations of six reaction rate constants in the detailed mechanism. These combinations correspond to the ratio of the specificities of two consecutive phosphorylations and the probability that monophosphorylated substrate does not dissociate from the enzyme before the second phosphorylation. The presented approach offers a language for comparing the effects of mutations that disrupt ERK activation and function in vivo. As an illustration, we use phase plane representation to analyze dual phosphorylation under heterozygous conditions, when two enzyme variants compete for the same substrate.

Foveated metamers of the early visual system

B. Broderick, G. Rufo, J. Winawer, E. P. Simoncelli

Human ability to discriminate and identify visual attributes varies across the visual field, and is generally worse in the periphery than in the fovea. This decline in performance is revealed in many kinds of tasks, from detection to recognition. A parsimonious hypothesis is that the representation of any visual feature is blurred (spatially averaged) by an amount that differs for each feature, but that in all cases increases with eccentricity. Here, we examine models for two such features: local luminance and spectral energy. Each model averages the corresponding feature in pooling windows whose diameters scale linearly with eccentricity. We performed psychophysical experiments with synthetic stimuli to determine the window scaling for which human and model discrimination abilities match, called the critical scaling. We used much larger stimuli than those of previous studies, subtending 53.6 by 42.2 degrees of visual angle. We found the critical scaling for the luminance model was approximately one-fourth that of the energy model, and consistent with earlier studies, that a smaller critical scaling value was required when discriminating a synthesized image from a natural image than when discriminating two synthesized images. We offer a coherent explanation for these results in terms of alignments and misalignments of the models with human perceptual representations.
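
The pooling idea in this abstract (averaging a feature in windows whose diameters scale linearly with eccentricity) can be illustrated with a one-dimensional toy example. This is a sketch only and not the authors' model: the function name, the hard-edged windows, and the sampling grid are assumptions. The `scaling` argument plays the role of the ratio of window diameter to eccentricity.

```python
import numpy as np

def foveated_average(signal, eccentricity, scaling):
    """Average `signal` in windows of diameter scaling * |eccentricity|."""
    signal = np.asarray(signal, dtype=float)
    ecc = np.asarray(eccentricity, dtype=float)
    pooled = np.empty_like(signal)
    for i, e in enumerate(ecc):
        half_width = scaling * abs(e) / 2.0
        # Hard-edged window centered on sample i; it always contains sample i.
        mask = np.abs(ecc - e) <= half_width
        pooled[i] = signal[mask].mean()
    return pooled

# Hypothetical stimulus: a fine alternating pattern sampled every 0.1 degrees.
ecc = np.arange(-200, 201) / 10.0           # eccentricity in degrees
signal = (-1.0) ** np.arange(ecc.size)      # alternating +1 / -1 samples
pooled = foveated_average(signal, ecc, scaling=0.5)
print(pooled[200], pooled[0])               # fovea vs. far periphery
```

At zero eccentricity the window collapses to a single sample, so the pattern is preserved, while in the periphery the alternating pattern is averaged toward zero. A larger `scaling` discards more detail at every eccentricity.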

Mott insulators with boundary zeros

The topological classification of electronic band structures is based on symmetry properties of Bloch eigenstates of single-particle Hamiltonians. In parallel, topological field theory has opened the doors to the formulation and characterization of non-trivial phases of matter driven by strong electron-electron interaction. Even though important examples of topological Mott insulators have been constructed, the relevance of the underlying non-interacting band topology to the physics of the Mott phase has remained unexplored. Here, we show that the momentum structure of the Green's function zeros defining the "Luttinger surface" provides a precise topological characterization of the Mott phase that is, surprisingly, related to that of the single-particle electronic dispersion. Consideration of the zeros leads to the prediction of new phenomena: a topological Mott insulator with an inverted gap for the bulk zeros must possess gapless zeros at the boundary, which behave as a form of "topological antimatter" annihilating conventional edge states. Placing band and Mott topological insulators in contact produces distinctive observable signatures at the interface, revealing the otherwise spectroscopically elusive Green's function zeros.

Theory of shot noise in strange metals

We extend the theory of shot noise in coherent metals to shot noise in strange metals without quasiparticle excitations. This requires a generalization of the Boltzmann equation with a noise source to distribution functions which depend independently on the excitation momentum and energy. We apply this theory to a model of a strange metal with linear-in-temperature (T) resistivity, describing a Fermi surface with a spatially random Yukawa coupling to a critical boson. We find a suppression of the Fano factor in the strange metal, and describe the dependence of the shot noise on temperature and applied voltage. At low temperatures, we obtain a Fano factor equal to 1/6, in contrast to the 1/3 Fano factor in diffusive metals with quasiparticles. Our results are in general agreement with recent observations by Chen et al. (arXiv:2206.00673). We further compare the random Yukawa model to quasi-elastic electron-phonon scattering that also generates T-linear resistivity, and argue that shot noise observations offer a useful diagnostic to distinguish between them.