481 Publications

Nested R̂ : Assessing the Convergence of Markov Chain Monte Carlo When Running Many Short Chains

C. Margossian, Matthew D. Hoffman, Pavel Sountsov, Lionel Riou-Durand, Aki Vehtari, Andrew Gelman

Recent developments in parallel Markov chain Monte Carlo (MCMC) algorithms allow us to run thousands of chains almost as quickly as a single chain, using hardware accelerators such as GPUs. While each chain still needs to forget its initial point during a warmup phase, the subsequent sampling phase can be shorter than in classical settings, where we run only a few chains. To determine if the resulting short chains are reliable, we need to assess how close the Markov chains are to their stationary distribution after warmup. The potential scale reduction factor Rˆ is a popular convergence diagnostic but unfortunately can require a long sampling phase to work well. We present a nested design to overcome this challenge and a generalization called nested Rˆ. This new diagnostic works under conditions similar to Rˆ and completes the workflow for GPU-friendly samplers. In addition, the proposed nesting provides theoretical insights into the utility of Rˆ, in both classical and short-chains regimes.

Show Abstract

MoMo: Momentum Models for Adaptive Learning Rates

Fabian Schaipp, R. Ohana, M. Eickenberg, Aaron Defazio, R. M. Gower

Training a modern machine learning architecture on a new task requires extensive learning-rate tuning, which comes at a high computational cost. Here we develop new Polyak-type adaptive learning rates that can be used on top of any momentum method, and require less tuning to perform well. We first develop MoMo, a Momentum Model based adaptive learning rate for SGD-M (stochastic gradient descent with momentum). MoMo uses momentum estimates of the batch losses and gradients sampled at each iteration to build a model of the loss function. Our model also makes use of any known lower bound of the loss function by using truncation, e.g. most losses are lower-bounded by zero. The models is then approximately minimized at each iteration to compute the next step. We show how MoMo can be used in combination with any momentum-based method, and showcase this by developing MoMo-Adam - which is Adam with our new model-based adaptive learning rate. We show that MoMo attains a $\mathcal{O}(1/\sqrt{K})$ convergence rate for convex problems with interpolation, needing knowledge of no problem-specific quantities other than the optimal value. Additionally, for losses with unknown lower bounds, we develop on-the-fly estimates of a lower bound, that are incorporated in our model. We demonstrate that MoMo and MoMo-Adam improve over SGD-M and Adam in terms of robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR, and Imagenet, for recommender systems on the Criteo dataset, for a transformer model on the translation task IWSLT14, and for a diffusion model.

Show Abstract

Learning Associative Memories with Gradient Descent

Vivien Cabannes, B. Şimşek, A. Bietti

This work focuses on the training dynamics of one associative memory module storing outer products of token embeddings. We reduce this problem to the study of a system of particles, which interact according to properties of the data distribution and correlations between embeddings. Through theory and experiments, we provide several insights. In overparameterized regimes, we obtain logarithmic growth of the “classification margins.” Yet, we show that imbalance in token frequencies and memory interferences due to correlated embeddings lead to oscillatory transitory regimes. The oscillations are more pronounced with large step sizes, which can create benign loss spikes, although these learning rates speed up the dynamics and accelerate the asymptotic convergence. We also find that underparameterized regimes lead to suboptimal memorization schemes. Finally, we assess the validity of our findings on small Transformer models.

Show Abstract

Listening to the noise: Blind Denoising with Gibbs Diffusion

David Heurtel-Depeiges, C. Margossian, R. Ohana, B. Régaldo-Saint Blancard

In recent years, denoising problems have become intertwined with the development of deep generative models. In particular, diffusion models are trained like denoisers, and the distribution they model coincide with denoising priors in the Bayesian picture. However, denoising through diffusion-based posterior sampling requires the noise level and covariance to be known, preventing blind denoising. We overcome this limitation by introducing Gibbs Diffusion (GDiff), a general methodology addressing posterior sampling of both the signal and the noise parameters. Assuming arbitrary parametric Gaussian noise, we develop a Gibbs algorithm that alternates sampling steps from a conditional diffusion model trained to map the signal prior to the class of noise distributions, and a Monte Carlo sampler to infer the noise parameters. Our theoretical analysis highlights potential pitfalls, guides diagnostic usage, and quantifies errors in the Gibbs stationary distribution caused by the diffusion model. We showcase our method for 1) blind denoising of natural images involving colored noises with unknown amplitude and exponent, and 2) a cosmology problem, namely the analysis of cosmic microwave background data, where Bayesian inference of "noise" parameters means constraining models of the evolution of the Universe.

Show Abstract

SILVER: Single-loop variance reduction and application to federated learning

Kazusato Oko, Shunta Akiyama, D. Wu, Tomoya Murata, Taiji Suzuki

Most variance reduction methods require multiple times of full gradient computation, which is time-consuming and hence a bottleneck in application to distributed optimization. We present a single-loop variance-reduced gradient estimator named SILVER (SIngle-Loop VariancE-Reduction) for the finite-sum non-convex optimization, which does not require multiple full gradients but nevertheless achieves the optimal gradient complexity. Notably, unlike existing methods, SILVER provably reaches second-order optimality, with exponential convergence in the Polyak-Łojasiewicz (PL) region, and achieves further speedup depending on the data heterogeneity. Owing to these advantages, SILVER serves as a new base method to design communication-efficient federated learning algorithms: we combine SILVER with local updates which gives the best communication rounds and number of communicated gradients across all range of Hessian heterogeneity, and, at the same time, guarantees second-order optimality and exponential convergence in the PL region.

Show Abstract

Amortized Variational Inference: When and Why?

C. Margossian, D. Blei

In a probabilistic latent variable model, factorized (or mean-field) variational inference (F-VI) fits a separate parametric distribution for each latent variable. Amortized variational inference (A-VI) instead learns a common inference function, which maps each observation to its corresponding latent variable’s approximate posterior. Typically, A-VI is used as a cog in the training of variational autoencoders, however it stands to reason that A-VI could also be used as a general alternative to F-VI. In this paper we study when and why A-VI can be used for approximate Bayesian inference. We derive conditions on a latent variable model which are necessary, sufficient, and verifiable under which A-VI can attain F-VI’s optimal solution, thereby closing the amortization gap. We prove these conditions are uniquely verified by simple hierarchical models, a broad class that encompasses many models in machine learning. We then show, on a broader class of models, how to expand the domain of AVI’s inference function to improve its solution, and we provide examples, e.g. hidden Markov models, where the amortization gap cannot be closed.

Show Abstract

Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models

Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, A. Bietti

Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks. When trained with gradient descent, the loss of infrequent words decreases more slowly than the loss of frequent ones. This leads to a slow decrease on the average loss as most samples come from infrequent words. On the other hand, Adam and sign-based methods are less sensitive to this problem. To establish that this behavior is caused by class imbalance, we show empirically that it can be reproduced across architectures and data types, on language transformers, vision CNNs, and linear models. On a linear model with cross-entropy loss, we show that class imbalance leads to imbalanced, correlated gradients and Hessians that have been hypothesized to benefit Adam. We also prove that, in continuous time, gradient descent converges slowly on low-frequency classes while sign descent does not.

Show Abstract

Delayed rejection Hamiltonian Monte Carlo for sampling multiscale distributions

The efficiency of Hamiltonian Monte Carlo (HMC) can suffer when sampling a distribution with a wide range of length scales, because the small step sizes needed for stability in high-curvature regions are inefficient elsewhere. To address this we present a delayed rejection (DR) variant: if an initial HMC trajectory is rejected, we make one or more subsequent proposals each using a step size geometrically smaller than the last. To reduce the cost of DR approaches, we extend the standard delayed rejection to a probabilistic framework wherein we do not make multiple proposals at every rejection, but allow the probability of a retry to depend on the probability of accepting the previous proposal. We test the scheme in several sampling tasks, including statistical applications and multiscale model distributions such as Neal’s funnel. Delayed rejection enables sampling multiscale distributions for which standard approaches such as HMC fail to explore the tails, and improves performance five-fold over optimally-tuned HMC as measured by effective sample size per gradient evaluation. Even for simpler distributions, delayed rejection provides increased robustness to step size misspecification.

Show Abstract

posteriordb: Testing, Benchmarking and Developing Bayesian Inference Algorithms

Måns Magnusson, Jakob Torgander, Paul-Christian Bürkner, Lu Zhang, B. Carpenter, Aki Vehtari

The generality and robustness of inference algorithms is critical to the success of widely used probabilistic programming languages such as Stan, PyMC, Pyro, and this http URL. When designing a new general-purpose inference algorithm, whether it involves Monte Carlo sampling or variational approximation, the fundamental problem arises in evaluating its accuracy and efficiency across a range of representative target models. To solve this problem, we propose posteriordb, a database of models and data sets defining target densities along with reference Monte Carlo draws. We further provide a guide to the best practices in using posteriordb for model evaluation and comparison. To provide a wide range of realistic target densities, posteriordb currently comprises 120 representative models and has been instrumental in developing several general inference algorithms.

Show Abstract

Good Rates From Bad Coordinates: The Exponential Average Time-dependent Rate Approach

Nicodemo Mazzaferro, Subarna Sasmal, P. Cossio, Glen M. Hocky

Our ability to calculate rate constants of biochemical processes using molecular dynamics simulations is severely limited by the fact that the time scales for reactions, or changes in conformational state, scale exponentially with the relevant free-energy barrier heights. In this work, we improve upon a recently proposed rate estimator that allows us to predict transition times with molecular dynamics simulations biased to rapidly explore one or several collective variables (CVs). This approach relies on the idea that not all bias goes into promoting transitions, and along with the rate, it estimates a concomitant scale factor for the bias termed the “CV biasing efficiency”γ. First, we demonstrate mathematically that our new formulation allows us to derive the commonly used Infrequent Metadynamics (iMetaD) estimator when using a perfect CV, where γ= 1. After testing it on a model potential, we then study the unfolding behavior of a previously well characterized coarse-grained protein, which is sufficiently complex that we can choose many different CVs to bias, but which is sufficiently simple that we are able to compute the unbiased rate directly. For this system, we demonstrate that predictions from our new Exponential Average Time-Dependent Rate (EATR) estimator converge to the true rate constant more rapidly as a function of bias deposition time than does the previous iMetaD approach, even for bias deposition times that are short. We also show that the γparameter can serve as a good metric for assessing the quality of the biasing coordinate. We demonstrate that these results hold when applying the methods to an atomistic protein folding example. Finally, we demonstrate that our approach works when combining multiple less-than-optimal bias coordinates, and adapt our method to the related “OPES flooding”approach. Overall, our time-dependent rate approach offers a powerful framework for predicting rate constants from biased simulations.

Show Abstract
  • Previous Page
  • Viewing
  • Next Page
Advancing Research in Basic Science and MathematicsSubscribe to Flatiron Institute announcements and other foundation updates