421 Publications

Understanding Optimization in Deep Learning with Central Flows

J. Cohen, Alex Damian, Ameet Talwalkar, J. Zico Kolter, Jason D. Lee

Optimization in deep learning remains poorly understood. A key difficulty is that optimizers exhibit complex oscillatory dynamics, referred to as "edge of stability," which cannot be captured by traditional optimization theory. In this paper, we show that the path taken by an oscillatory optimizer can often be captured by a central flow: a differential equation which directly models the time-averaged (i.e. smoothed) optimization trajectory. We empirically show that these central flows can predict long-term optimization trajectories for generic neural networks with a high degree of numerical accuracy. By interpreting these flows, we are able to understand how gradient descent makes progress even as the loss sometimes goes up; how adaptive optimizers "adapt" to the local loss landscape; and how adaptive optimizers implicitly seek out regions of weight space where they can take larger steps. These insights (and others) are not apparent from the optimizers' update rules, but are revealed by the central flows. Therefore, we believe that central flows constitute a promising tool for reasoning about optimization in deep learning.
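
As a toy illustration of the phenomenon studied here (a hypothetical numpy sketch, not the paper's central flow construction): gradient descent on a quadratic with step size near \(2/\lambda_{\max}\) oscillates in the sharp direction, while a short moving average of the iterates, a crude stand-in for the time-averaged trajectory, decays smoothly.

```python
# Toy illustration only (not the paper's central flow): GD on a quadratic
# with step size near 2/lambda_max oscillates in the sharp direction, while
# a moving average of the iterates decays smoothly.
import numpy as np

H = np.diag([10.0, 1.0])                  # Hessian; lambda_max = 10
eta = 0.199                               # just below the 2/10 stability limit
x = np.array([1.0, 1.0])
traj = [x.copy()]
for _ in range(400):
    x = x - eta * (H @ x)                 # gradient descent step
    traj.append(x.copy())
traj = np.array(traj)

print("last sharp-direction iterates:", traj[-4:, 0])  # alternating signs
w = 10                                    # moving-average window
smooth = np.convolve(traj[:, 0], np.ones(w) / w, mode="valid")
print("smoothed counterpart:         ", smooth[-4:])   # ~200x smaller: the
                                          # oscillation averages out
```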


Discrete Lehmann representation of three-point functions

Dominik Kiese, Hugo U. R. Strand, Kun Chen, Nils Wentzell, Olivier Parcollet, J. Kaye

We present a generalization of the discrete Lehmann representation (DLR) to three-point correlation and vertex functions in imaginary time and Matsubara frequency. The representation takes the form of a linear combination of judiciously chosen exponentials in imaginary time, and products of simple poles in Matsubara frequency, which are universal for a given temperature and energy cutoff. We present a systematic algorithm to generate compact sampling grids, from which the coefficients of such an expansion can be obtained by solving a linear system. We show that the explicit form of the representation can be used to evaluate diagrammatic expressions involving infinite Matsubara sums, such as polarization functions or self-energies, with controllable, high-order accuracy. This collection of techniques establishes a framework through which methods involving three-point objects can be implemented robustly, with a substantially reduced computational cost and memory footprint.
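
The flavor of such a representation can be seen in a stripped-down setting. Below is a minimal numpy sketch for a one-point imaginary-time object: expand \(G(\tau)\) in exponentials through the standard fermionic kernel and obtain the coefficients from a least-squares system on a time grid. The pole and sampling grids here are naive placeholders; choosing them judiciously, and extending to three-point objects, is the subject of the paper.

```python
# Minimal sketch of a DLR-style fit for a ONE-point object (illustration only;
# the paper treats three-point functions): represent G(tau) on [0, beta] as a
# linear combination of exponentials via the fermionic kernel
# K(tau, w) = exp(-tau*w) / (1 + exp(-beta*w)), with coefficients obtained
# by solving a least-squares system on a sampling grid.
import numpy as np

beta, cutoff = 10.0, 8.0
omegas = np.linspace(-cutoff, cutoff, 24)       # naive pole grid
taus = np.linspace(0.0, beta, 200)              # naive sampling grid

def kernel(tau, w):
    return np.exp(-tau * w) / (1.0 + np.exp(-beta * w))

# Synthetic "data": a Green's function with two poles at w = +/- 2
G = 0.5 * kernel(taus, 2.0) + 0.5 * kernel(taus, -2.0)

A = kernel(taus[:, None], omegas[None, :])      # design matrix
coeff, *_ = np.linalg.lstsq(A, G, rcond=None)
print("max fit error:", np.abs(A @ coeff - G).max())
```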


An Accurate and Efficient Scheme for Function Extension on Smooth Domains

C. Epstein, F. Fryklund, S. Jiang

A new scheme is proposed to construct an \(n\)-times differentiable function extension of an \(n\)-times differentiable function defined on a smooth domain \(D\) in \(d\) dimensions. The extension scheme relies on an explicit formula consisting of a linear combination of \(n+1\) function values in \(D\), which extends the function along directions normal to the boundary. Smoothness tangent to the boundary is automatic. The performance of the scheme is illustrated by using function extension as part of a numerical solver for the Poisson equation on domains with complex geometry in both two and three dimensions. Although the cost of extending the function increases mildly with the extension order, it remains a small fraction of the overall algorithm. Moreover, the modest additional work required for high-order function extensions leads to considerably more accurate solutions of the partial differential equation.
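
In one dimension the idea behind such an extension formula can be sketched as follows (a hedged illustration, not the paper's \(d\)-dimensional scheme): to extend \(f\) across \(t=0\), take a linear combination of the \(n+1\) values \(f(jt)\), \(j = 1, \dots, n+1\), with coefficients chosen so the first \(n\) derivatives match at the boundary; the coefficients solve a small Vandermonde system.

```python
# 1D sketch of extension by derivative matching at t = 0: set
# f_ext(-t) = sum_j c_j f(j*t), with coefficients satisfying
# sum_j c_j * j^k = (-1)^k for k = 0..n (a Vandermonde system).
# The paper applies this idea along normals to a smooth boundary.
import numpy as np

n = 4                                        # extension order
j = np.arange(1, n + 2)                      # n+1 interior sample points
V = j[None, :] ** np.arange(n + 1)[:, None]  # V[k, l] = j_l^k
c = np.linalg.solve(V, (-1.0) ** np.arange(n + 1))

f = np.tanh                                  # smooth test function on t >= 0
t = 0.05
f_ext = np.dot(c, f(j * t))                  # extended value at -t
print("extension error:", abs(f_ext - f(-t)))
```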


How does Mamba Perform Associative Recall? A Mechanistic Study

Grégoire Le Corre, T. Huang, A. Bietti

Mamba has recently emerged as a promising alternative to Transformers, demonstrating competitive performance in many language modeling tasks with linear-time computational complexity. Theoretical characterization of Mamba has largely focused on its approximation power for solving certain tasks through specific constructions. However, it remains unclear whether Mamba trained with gradient descent can learn such constructions. As a first step to address this gap, we perform a mechanistic study of simplified Mamba models on associative recall tasks. By analyzing the learned model weights and the hidden state evolution, we uncover the mechanisms used by simplified Mamba models to perform associative recall. We complement our study with a theoretical analysis of the optimization dynamics of simplified Mamba models that give rise to such mechanisms.
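
For concreteness, a hypothetical generator for an associative recall task of this kind is sketched below; the token layout and sizes are our own illustrative choices, not the paper's setup.

```python
# Hypothetical associative recall task generator (illustrative, not the
# paper's exact setup): a sequence of key-value pairs followed by a query
# key; the target is the value previously associated with that key.
import numpy as np

rng = np.random.default_rng(0)
n_keys, n_vals, n_pairs = 16, 16, 8

def make_example():
    keys = rng.choice(n_keys, size=n_pairs, replace=False)
    vals = rng.integers(0, n_vals, size=n_pairs)
    q = rng.integers(0, n_pairs)              # index of the queried pair
    seq = np.empty(2 * n_pairs + 1, dtype=int)
    seq[0::2][:n_pairs] = keys                # keys at even positions
    seq[1::2] = vals + n_keys                 # values use a disjoint token range
    seq[-1] = keys[q]                         # query key at the end
    return seq, vals[q] + n_keys              # (input tokens, target token)

seq, target = make_example()
print(seq, "->", target)
```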


Equispaced Fourier representations for efficient Gaussian process regression from a billion data points

Philip Greengard, M. Rachh, A. Barnett

We introduce a Fourier-based fast algorithm for Gaussian process regression in low dimensions. It approximates a translationally invariant covariance kernel by complex exponentials on an equispaced Cartesian frequency grid of \(M\) nodes. This results in a weight-space \(M\times M\) system matrix with Toeplitz structure, which can thus be applied to a vector in \({\mathcal O}(M \log{M})\) operations via the fast Fourier transform (FFT), independent of the number of data points \(N\). The linear system can be set up in \({\mathcal O}(N+M \log{M})\) operations using nonuniform FFTs. This enables efficient massive-scale regression via an iterative solver, even for kernels with fat-tailed spectral densities (large \(M\)). We provide bounds on both kernel approximation and posterior mean errors. Numerical experiments for squared-exponential and Matérn kernels in one, two, and three dimensions often show 1–2 orders of magnitude acceleration over state-of-the-art rank-structured solvers at comparable accuracy. Our method allows two-dimensional Matérn-\(\frac{3}{2}\) regression from \(N=10^9\) data points to be performed in two minutes on a standard desktop, with posterior mean accuracy \(10^{-3}\). This opens up spatial statistics applications 100 times larger than previously possible.
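
The kernel approximation step can be sketched in a few lines of numpy (grid spacing and truncation below are ad hoc choices for illustration, not the paper's tuned parameters): a squared-exponential kernel is replaced by a trapezoid-rule Fourier sum over an equispaced frequency grid.

```python
# Sketch of the kernel approximation idea:
# k(r) ~ h * sum_m s(m*h) * exp(2*pi*i*m*h*r), with s the spectral density
# of the kernel, summed over an equispaced frequency grid.
import numpy as np

ell = 0.1                                     # kernel length scale
k = lambda r: np.exp(-r**2 / (2 * ell**2))    # squared-exponential kernel
s = lambda xi: ell * np.sqrt(2 * np.pi) * np.exp(-2 * np.pi**2 * ell**2 * xi**2)

h, M = 0.5, 64                                # frequency spacing and half-width
m = np.arange(-M, M + 1)
r = np.linspace(-1, 1, 201)
k_approx = np.real(h * (s(m * h)[None, :] *
                        np.exp(2j * np.pi * np.outer(r, m * h))).sum(axis=1))
print("max kernel approximation error:", np.abs(k_approx - k(r)).max())
```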


Uniqueness, regularity and characteristic flow for a non strictly convex singular variational problem

Jean-Francois Babadjian, G. Francfort

This work addresses the question of uniqueness and regularity of the minimizers of a convex but not strictly convex integral functional with linear growth in a two-dimensional setting. The integrand, whose precise form derives directly from the theory of perfect plasticity, behaves quadratically close to the origin and grows linearly once a specific threshold is reached. Thus, in contrast with the only existing literature on uniqueness for functionals with linear growth, namely that pertaining to the generalized least gradient, the integrand is not a norm. We make use of hyperbolic conservation laws hidden in the structure of the problem to tackle uniqueness. Our argument relies strongly on the regularity of a vector field, the Cauchy stress in the terminology of perfect plasticity, which allows us to define characteristic lines and then to employ the method of characteristics. Using the detailed structure of the characteristic landscape evidenced in our preliminary study [BF], we show that this vector field is actually continuous, save possibly at two points. The different behaviors of the energy density at zero and at infinity imply an inequality constraint on the Cauchy stress. Under a barrier-type convexity assumption on the set where the inequality constraint is saturated, we show that uniqueness holds for pure Dirichlet boundary data devoid of any regularity properties; this is stronger than uniqueness for a given trace on the whole boundary, since our minimizers can fail to attain the boundary data. We also show a partial regularity result for the minimizer.
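
A representative integrand with the stated behavior, quadratic up to a threshold (normalized to 1 here) and linear beyond it, is the familiar perfect-plasticity-type density below; this is an illustration consistent with the abstract, not necessarily the paper's exact normalization.

```latex
% A representative integrand: quadratic near the origin, linear at infinity,
% C^1 across the threshold |xi| = 1 (illustrative normalization).
\[
  f(\xi) =
  \begin{cases}
    \dfrac{|\xi|^2}{2} & \text{if } |\xi| \le 1,\\[4pt]
    |\xi| - \dfrac12 & \text{if } |\xi| > 1,
  \end{cases}
  \qquad
  \mathcal{F}(u) = \int_\Omega f(\nabla u)\,dx .
\]
```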


A mixing time bound for Gibbs sampling from log-smooth log-concave distributions

The Gibbs sampler, also known as the coordinate hit-and-run algorithm, is a Markov chain that is widely used to draw samples from probability distributions in arbitrary dimensions. At each iteration of the algorithm, a randomly selected coordinate is resampled from the distribution that results from conditioning on all the other coordinates. We study the behavior of the Gibbs sampler on the class of log-smooth and strongly log-concave target distributions supported on \(\mathbb{R}^n\).
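
A minimal sketch of the algorithm on a concrete member of this class, a two-dimensional Gaussian target whose coordinate conditionals are available in closed form (illustrative parameters, not from the paper):

```python
# Gibbs sampler on a log-smooth, strongly log-concave target: N(0, Sigma).
# Each step resamples one randomly chosen coordinate from its exact
# conditional given the others.
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
P = np.linalg.inv(Sigma)                      # precision matrix

x = np.zeros(2)
samples = []
for _ in range(20000):
    i = rng.integers(2)                       # pick a coordinate at random
    others = [j for j in range(2) if j != i]
    # Conditional of x_i given the rest is Gaussian with
    # mean -P[i, others] @ x[others] / P[i, i] and variance 1 / P[i, i]
    mu = -P[i, others] @ x[others] / P[i, i]
    x[i] = mu + rng.standard_normal() / np.sqrt(P[i, i])
    samples.append(x.copy())

print("empirical covariance:\n", np.cov(np.array(samples[5000:]).T))
```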


Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit

Jason D. Lee, Kazusato Oko, Taiji Suzuki, D. Wu

We study the problem of gradient descent learning of a single-index target function \(f_*(x) = \sigma_*(\langle x, \theta \rangle)\) under isotropic Gaussian data in \(\mathbb{R}^d\), where the unknown link function \(\sigma_* : \mathbb{R} \to \mathbb{R}\) has information exponent \(p\) (defined as the lowest degree in the Hermite expansion). Prior works showed that gradient-based training of neural networks can learn this target with \(n \gtrsim d^{\Theta(p)}\) samples, and such complexity is predicted to be necessary by the correlational statistical query lower bound. Surprisingly, we prove that a two-layer neural network optimized by an SGD-based algorithm (on the squared loss) learns \(f_*\) with a complexity that is not governed by the information exponent. Specifically, for arbitrary polynomial single-index models, we establish a sample and runtime complexity of \(n \simeq T = \Theta(d \cdot \mathrm{polylog}\, d)\), where \(\Theta(\cdot)\) hides a constant depending only on the degree of \(\sigma_*\); this dimension dependence matches the information-theoretic limit up to polylogarithmic factors. More generally, we show that \(n \gtrsim d^{(p^*-1) \vee 1}\) samples are sufficient to achieve low generalization error, where \(p^* \le p\) is the generative exponent of the link function. Core to our analysis is the reuse of minibatches in the gradient computation, which gives rise to higher-order information beyond correlational queries.
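
The information exponent is easy to probe numerically. The sketch below (our illustration; the link function and quadrature size are arbitrary choices) estimates the Hermite coefficients \(\mathbb{E}[\sigma(z)\,\mathrm{He}_k(z)]\) by Gauss-Hermite quadrature and reads off the lowest nonzero degree.

```python
# Numerically estimate the information exponent of a link function: the
# lowest degree k >= 1 with a nonzero coefficient in its Hermite expansion
# (probabilists' polynomials He_k, standard Gaussian weight).
import numpy as np
from numpy.polynomial import hermite_e as He

sigma = lambda z: z**3 - 3 * z                # = He_3(z), so p should be 3
z, w = He.hermegauss(60)                      # Gauss-Hermite_e nodes/weights
norm = np.sqrt(2 * np.pi)                     # weights integrate exp(-z^2/2)

for k in range(1, 6):
    ck = np.sum(w * sigma(z) * He.hermeval(z, [0] * k + [1])) / norm
    print(f"E[sigma(z) He_{k}(z)] = {ck:+.4f}")
# The information exponent is the smallest k >= 1 with a nonzero coefficient.
```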


Trapped acoustic waves and raindrops: high-order accurate integral equation method for localized excitation of a periodic staircase

F. Agocs, A. Barnett

We present a high-order boundary integral equation (BIE) method for the frequency-domain acoustic scattering of a point source by a singly-periodic, infinite, corrugated boundary. We apply it to the accurate numerical study of acoustic radiation in the neighborhood of a sound-hard two-dimensional staircase modeled after the El Castillo pyramid. Such staircases support trapped waves which travel along the surface and decay exponentially away from it. We use the array scanning method (Floquet–Bloch transform) to recover the scattered field as an integral over the family of quasiperiodic solutions parameterized by on-surface wavenumber. Each such BIE solution requires the quasiperiodic Green's function, which we evaluate using an efficient integral representation of lattice sum coefficients. We avoid the singularities and branch cuts present in the array scanning integral by complex contour deformation. For each frequency, this enables a solution accurate to around 10 digits in a few seconds. We propose a residue method to extract the limiting powers carried by trapped modes far from the source. Finally, by computing the trapped mode dispersion relation, we use a simple ray model to explain an acoustic chirp-like time-domain response that is referred to in the literature as the “raindrop effect.”
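
For orientation, the quasiperiodic Green's function in question can be written as a direct image sum. The naive truncated sum below (illustrative parameters; the conditional convergence is slow) shows the object being computed, while the paper evaluates it efficiently via lattice sums.

```python
# Quasiperiodic 2D Helmholtz Green's function as a direct image sum:
# G(x) = (i/4) * sum_n alpha^n H0^(1)(k |x - n*d*e1|), alpha = exp(i*kappa*d).
# Naive truncation for illustration; the paper uses lattice sums instead.
import numpy as np
from scipy.special import hankel1

k, d, kappa = 2.0, 1.0, 0.3                   # wavenumber, period, Bloch param
x = np.array([0.13, 0.37])                    # target point (source at origin)

N = 20000                                     # large truncation (slow decay)
n = np.arange(-N, N + 1)
r = np.hypot(x[0] - n * d, x[1])              # distances to the source images
G = 0.25j * np.sum(np.exp(1j * kappa * d * n) * hankel1(0, k * r))
print("G_qp ~", G)                            # only a few digits are converged
```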


xVal: A Continuous Numerical Tokenization for Scientific Language Models

Siavash Golkar, Mariel Pettee, M. Eickenberg, A. Bietti, et al.

Due in part to their discontinuous and discrete default encodings for numbers, Large Language Models (LLMs) have not yet been commonly used to process numerically-dense scientific datasets. Rendering datasets as text, however, could help aggregate diverse and multi-modal scientific data into a single training corpus, thereby potentially facilitating the development of foundation models for science. In this work, we introduce xVal, a strategy for continuously tokenizing numbers within language models that results in a more appropriate inductive bias for scientific applications. By training specially-modified language models from scratch on a variety of scientific datasets formatted as text, we find that xVal generally outperforms other common numerical tokenization strategies on metrics including out-of-distribution generalization and computational efficiency.
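
The core idea, scaling a single dedicated [NUM] embedding vector by the (normalized) numeric value so that nearby numbers receive nearby embeddings, can be sketched as follows. This is a simplified toy: the vocabulary, normalization, and numeric readout head are not the paper's.

```python
# Toy sketch of continuous number embedding in the spirit of xVal: numbers
# map to one dedicated [NUM] token whose embedding is scaled by the value.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"[NUM]": 0, "mass": 1, "=": 2, "kg": 3}   # toy vocabulary
E = rng.standard_normal((len(vocab), 8))           # toy embedding table, dim 8

def embed(tokens):
    out = []
    for tok in tokens:
        if isinstance(tok, float):                 # a number: scale [NUM]
            out.append(tok * E[vocab["[NUM]"]])
        else:                                      # ordinary token lookup
            out.append(E[vocab[tok]])
    return np.stack(out)

X = embed(["mass", "=", 3.7, "kg"])
print(X.shape)    # (4, 8); row 2 encodes the value 3.7 continuously
```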
