CCM: Publications

Birth of a Transformer: A Memory Viewpoint

A. Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou

Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an “induction head” mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.
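The synthetic setup in this abstract can be illustrated with a loose sketch. This is not the authors' exact data-generating process: the names `make_global_bigram`, `sample_sequence`, and `triggers`, and the rule that a trigger token is always followed by its per-sequence output token, are assumptions made for illustration.

```python
import random


def make_global_bigram(vocab_size, rng):
    """Build a random row-stochastic table: row i is the next-token
    distribution after token i (the "global" bigram knowledge)."""
    table = []
    for _ in range(vocab_size):
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        table.append([w / total for w in weights])
    return table


def sample_sequence(global_bigram, triggers, length, rng):
    """Sample one sequence mixing global and context-specific bigrams.

    Assumption: each trigger token gets one output token, redrawn for
    every sequence, so it can only be predicted from the context.
    """
    vocab_size = len(global_bigram)
    in_context = {q: rng.randrange(vocab_size) for q in triggers}
    seq = [rng.randrange(vocab_size)]
    while len(seq) < length:
        prev = seq[-1]
        if prev in in_context:
            # Context-specific bigram: fixed within this sequence only.
            seq.append(in_context[prev])
        else:
            # Global bigram: shared across all sequences.
            nxt = rng.choices(range(vocab_size), weights=global_bigram[prev])[0]
            seq.append(nxt)
    return seq, in_context


rng = random.Random(0)
table = make_global_bigram(10, rng)
seq, in_context = sample_sequence(table, triggers=[1, 2], length=50, rng=rng)
```

A model can learn the rows of `table` from many sequences, whereas predicting the token after a trigger requires copying from earlier in the same sequence, which is the behavior the abstract attributes to an induction head.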


On Learning Gaussian Multi-index Models with Gradient Flow

A. Bietti, Joan Bruna, L. Pillaud-Vivien

We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear projection and an arbitrary unknown, low-dimensional link function. As such, they constitute a natural template for feature learning in neural networks. We consider a two-timescale algorithm, whereby the low-dimensional link function is learnt with a non-parametric model infinitely faster than the subspace parametrizing the low-rank projection. By appropriately exploiting the matrix semigroup structure arising over the subspace correlation matrices, we establish global convergence of the resulting Grassmannian population gradient flow dynamics, and provide a quantitative description of its associated "saddle-to-saddle" dynamics. Notably, the timescales associated with each saddle can be explicitly characterized in terms of an appropriate Hermite decomposition of the target link function. In contrast with these positive results, we also show that the related
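As a reading aid, the multi-index model described above can be written out as follows. The symbols $f^*$, $g^*$, $W_*$, $d$, and $k$ are notation chosen here for illustration and may differ from the paper's.

```latex
% Multi-index target: a low-dimensional link applied to a low-rank projection
f^*(x) = g^*\!\left(W_*^\top x\right),
\qquad x \sim \mathcal{N}(0, I_d),
\qquad W_* \in \mathbb{R}^{d \times k},
\quad W_*^\top W_* = I_k,
\quad k \ll d .
```

Here $W_*$ spans the unknown subspace and $g^* : \mathbb{R}^k \to \mathbb{R}$ is the unknown link function.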


Direct stellarator coil design using global optimization: application to a comprehensive exploration of quasi-axisymmetric devices

A. Giuliani

Many stellarator coil design problems are plagued by multiple minima, where the locally optimal coil sets can sometimes vary substantially in performance. As a result, solving a coil design problem a single time with a local optimization algorithm is usually insufficient and better optima likely do exist. To address this problem, we propose a global optimization algorithm for the design of stellarator coils and outline how to apply box constraints to the physical positions of the coils. The algorithm has a global exploration phase that searches for interesting regions of design space and is followed by three local optimization algorithms that search in these interesting regions (a "global-to-local" approach). The first local algorithm (phase I), following the globalization phase, is based on near-axis expansions and finds stellarator coils that optimize for quasisymmetry in the neighborhood of a magnetic axis. The second local algorithm (phase II) takes these coil sets and optimizes them for nested flux surfaces and quasisymmetry on a toroidal volume. The final local algorithm (phase III) polishes these configurations for an accurate approximation of quasisymmetry. Using our global algorithm, we study the trade-off between coil length, aspect ratio, rotational transform, and quality of quasi-axisymmetry. The database of stellarators, which comprises almost 140,000 coil sets, is available online and is called QUASR, for "QUAsi-symmetric Stellarator Repository".


Discriminative calibration: Check Bayesian computation from simulations and flexible classifier

Y. Yao, Justin Domke

To check the accuracy of Bayesian computations, it is common to use rank-based simulation-based calibration (SBC). However, SBC has drawbacks: The test statistic is somewhat ad-hoc, interactions are difficult to examine, multiple testing is a challenge, and the resulting p-value is not a divergence metric. We propose to replace the marginal rank test with a flexible classification approach that learns test statistics from data. This measure typically has a higher statistical power than the SBC rank test and returns an interpretable divergence measure of miscalibration, computed from classification accuracy. This approach can be used with different data generating processes to address likelihood-free inference or traditional inference methods like Markov chain Monte Carlo or variational inference. We illustrate an automated implementation using neural networks and statistically-inspired features, and validate the method with numerical and real data experiments.


Solving the Transmission Problem for Open Wave-Guides, II Outgoing Estimates

C. Epstein

The paper continues the analysis, started in [1] (Part I, arXiv:2302.04353), of the model open wave-guide problem defined by two semi-infinite, rectangular wave-guides meeting along a common perpendicular line. In Part I we reduce the solution of the physical problem to a transmission problem rephrased as a system of integral equations on the common perpendicular line. In this part we show that solutions of the integral equations introduced in Part I have asymptotic expansions, if the data allows it. Using these expansions we show that the solutions to the PDE found in each half space satisfy appropriate outgoing radiation conditions. In Part III we show that these conditions imply uniqueness of the solution to the PDE as well as uniqueness for our system of integral equations.


Solving the Transmission Problem for Open Wave-Guides, I Fundamental Solutions and Integral Equations

C. Epstein

We introduce a layer potential representation for the solution of the transmission problem defined by two dielectric channels, or open wave-guides, meeting along the straight-line interface, $\{x_1=0\}$. The main observation is that the outgoing fundamental solution for the operator $\Delta +k_1^2+q(x_2)$, acting on functions defined in ${\mathbb R}^2$, is easily constructed using the Fourier transform in the $x_1$-variable and the elementary theory of ordinary differential equations. These fundamental solutions can then be used to represent the solution to the transmission problem in half planes. The transmission boundary conditions lead to integral equations along the intersection of the half planes, which, in our normalization, is the $x_2$-axis. We show that, in appropriate Banach spaces, these integral equations are Fredholm equations of the second kind, which are therefore generically solvable. We analyze the representation of the guided modes in our formulation.


A Neural Network Warm-Start Approach for the Inverse Acoustic Obstacle Scattering Problem

Mo Zhou, J. Han, M. Rachh, Carlos Borges

In this paper, we consider the inverse acoustic obstacle problem for sound-soft star-shaped obstacles in two dimensions wherein the boundary of the obstacle is determined from measurements of the scattered field at a collection of receivers outside the object. One of the standard approaches for solving this problem is to reformulate it as an optimization problem: finding the boundary of the domain that minimizes the L2 distance between computed values of the scattered field and the given measurement data. The optimization problem is computationally challenging since the local set of convexity shrinks with increasing frequency and results in an increasing number of local minima in the vicinity of the true solution. In many practical experimental settings, low frequency measurements are unavailable due to limitations of the experimental setup or the sensors used for measurement. Thus, obtaining a good initial guess for the optimization problem plays a vital role in this environment. We present a neural network warm-start approach for solving the inverse scattering problem, where an initial guess for the optimization problem is obtained using a trained neural network. We demonstrate the effectiveness of our method with several numerical examples. For high frequency problems, this approach outperforms traditional iterative methods such as Gauss-Newton initialized without any prior (i.e., initialized using a unit circle), or initialized using the solution of a direct method such as the linear sampling method. The algorithm remains robust to noise in the scattered field measurements and also converges to the true solution for limited aperture data. However, the number of training samples required to train the neural network scales exponentially in frequency and the complexity of the obstacles considered. We conclude with a discussion of this phenomenon and potential directions for future research.
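The optimization formulation above can be sketched in a few lines. The Fourier-series parametrization of the radius and the helper names `radius`, `boundary_points`, and `l2_misfit` are assumptions for illustration. The forward scattering solver that would produce the computed field is not shown, and the neural network itself is omitted.

```python
import math


def radius(theta, a0, cos_coeffs, sin_coeffs):
    """Radius of a star-shaped boundary at angle theta, written as a
    truncated Fourier series (an assumed parametrization)."""
    r = a0
    for k, (a, b) in enumerate(zip(cos_coeffs, sin_coeffs), start=1):
        r += a * math.cos(k * theta) + b * math.sin(k * theta)
    return r


def boundary_points(a0, cos_coeffs, sin_coeffs, n_points):
    """Sample the curve r(theta) * (cos theta, sin theta) at equispaced angles."""
    points = []
    for j in range(n_points):
        theta = 2.0 * math.pi * j / n_points
        r = radius(theta, a0, cos_coeffs, sin_coeffs)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


def l2_misfit(computed, measured):
    """Discrete L2 distance between computed and measured (complex) field values."""
    return math.sqrt(sum(abs(c - m) ** 2 for c, m in zip(computed, measured)))


# The "no prior" initial guess mentioned in the abstract: a unit circle.
unit_circle = boundary_points(1.0, [], [], n_points=8)
```

In a warm-start scheme, the coefficients passed to `boundary_points` would come from a trained network rather than from the unit circle, and an iterative method such as Gauss-Newton would then reduce `l2_misfit`.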


A class of dimensionality-free metrics for the convergence of empirical measures

J. Han, Ruimeng Hu, Jihao Long

This paper concerns the convergence of empirical measures in high dimensions. We propose a new class of probability metrics and show that under such metrics, the convergence is free of the curse of dimensionality (CoD). Such a feature is critical for high-dimensional analysis and stands in contrast to classical metrics (e.g., the Wasserstein metric). The proposed metrics fall into the category of integral probability metrics, for which we specify criteria of test function spaces to guarantee the property of being free of CoD. Examples of the selected test function spaces include the reproducing kernel Hilbert spaces, Barron space, and flow-induced function spaces. Three applications of the proposed metrics are presented: 1. The convergence of empirical measure in the case of random variables; 2. The convergence of n-particle system to the solution to McKean–Vlasov stochastic differential equation; 3. The construction of an ɛ-Nash equilibrium for a homogeneous n-player game by its mean-field limit. As a byproduct, we prove that, given a distribution close to the target distribution measured by our metric and a certain representation of the target distribution, we can generate a distribution close to the target one in terms of the Wasserstein metric and relative entropy. Overall, we show that the proposed class of metrics is a powerful tool to analyze the convergence of empirical measures in high dimensions without CoD.
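For reference, the standard definition of an integral probability metric over a class of test functions is given below. The symbols $\mathcal{F}$, $\mu$, $\nu$, and $\hat{\mu}_n$ are generic notation, not necessarily the paper's.

```latex
% Integral probability metric indexed by a test function class F
d_{\mathcal{F}}(\mu, \nu)
  = \sup_{f \in \mathcal{F}}
    \left| \int f \, \mathrm{d}\mu - \int f \, \mathrm{d}\nu \right| ,
\qquad
\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i} .
```

Taking $\mathcal{F}$ to be the 1-Lipschitz functions recovers the Wasserstein-1 distance by Kantorovich-Rubinstein duality, so the choice of test function class is what separates the proposed metrics from the classical ones.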


Variational Inference with Gaussian Score Matching

C. Modi, C. Margossian, Y. Yao, R. M. Gower, D. Blei, L. Saul

Variational inference (VI) is a method to approximate the computationally intractable posterior distributions that arise in Bayesian statistics. Typically, VI fits a simple parametric distribution to be close to the target posterior, optimizing an appropriate objective such as the evidence lower bound (ELBO). In this work, we present a new approach to VI. Our method is based on the principle of score matching, namely that if two distributions are equal then their score functions (i.e., gradients of the log density) are equal at every point on their support. With this principle, we develop score-matching VI, an iterative algorithm that seeks to match the scores between the variational approximation and the exact posterior. At each iteration, score-matching VI solves an inner optimization, one that minimally adjusts the current variational estimate to match the scores at a newly sampled value of the latent variables. We show that when the variational family is a Gaussian, this inner optimization enjoys a closed-form solution, which we call Gaussian score matching VI (GSM-VI). GSM-VI is a "black box" variational algorithm in that it only requires a differentiable joint distribution, and as such it can be applied to a wide class of models. We compare GSM-VI to black box variational inference (BBVI), which has similar requirements but instead optimizes the ELBO. We first study how GSM-VI behaves as a function of the problem dimensionality, the condition number of the target covariance matrix (when the target is Gaussian), and the degree of mismatch between the approximating and exact posterior distribution. We then study GSM-VI on a collection of real-world Bayesian inference problems from the posteriorDB database of datasets and models. We find that GSM-VI is faster than BBVI and equally or more accurate. Specifically, over a wide range of target posteriors, GSM-VI requires 10-100x fewer gradient evaluations than BBVI to obtain a comparable quality of approximation.
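The score-matching condition for a Gaussian variational family can be written out explicitly. The notation ($q$, $\mu$, $\Sigma$, $z_t$, $x$) is chosen here for illustration. Only the matching condition is shown; the paper's closed-form update is not reproduced.

```latex
% Score of a Gaussian variational distribution q(z) = N(z; mu, Sigma)
\nabla_z \log q(z) = -\Sigma^{-1} (z - \mu)

% Matching the posterior score at a sampled point z_t
-\Sigma^{-1} (z_t - \mu) = \nabla_z \log p(z_t, x)
```

The right-hand side uses the joint density because $\nabla_z \log p(z \mid x) = \nabla_z \log p(z, x)$: the evidence $p(x)$ does not depend on $z$. This is consistent with the abstract's remark that only a differentiable joint distribution is required.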


FMM-accelerated solvers for the Laplace-Beltrami problem on complex surfaces in three dimensions

Dhwanit Agarwal, Michael O'Neil, M. Rachh

The Laplace–Beltrami problem on closed surfaces embedded in three dimensions arises in many areas of physics, including molecular dynamics (surface diffusion), electromagnetics (harmonic vector fields), and fluid dynamics (vesicle deformation). Using classical potential theory, the Laplace–Beltrami operator can be pre-/post-conditioned with an integral operator whose kernel is translation invariant, resulting in well-conditioned Fredholm integral equations of the second kind. These equations have the standard 1/r kernel from potential theory, and therefore the equations can be solved rapidly and accurately using a combination of fast multipole methods (FMMs) and high-order quadrature corrections. In this work we detail such a scheme, presenting two alternative integral formulations of the Laplace–Beltrami problem, each of whose solutions can be obtained via FMM acceleration. We then present several applications of the solvers, focusing on the computation of what are known as harmonic vector fields, relevant for many applications in electromagnetics. A battery of numerical results is presented for each application, detailing the performance of the solver in various geometries.
