443 Publications

On Single Index Models beyond Gaussian Data

L. Pillaud-Vivien, Joan Bruna, Ph.D., Aaron Zweig, Ph.D.

Sparse high-dimensional functions have arisen as a rich framework to study the behavior of gradient-descent methods using shallow neural networks, showcasing their ability to perform feature learning beyond linear models. Amongst those functions, the simplest are single-index models $f(x) = \phi( x \cdot \theta^*)$, where the labels are generated by an arbitrary non-linear scalar link function $\phi$ applied to an unknown one-dimensional projection $\theta^*$ of the input data. By focusing on Gaussian data, several recent works have built a remarkable picture, where the so-called information exponent (related to the regularity of the link function) controls the required sample complexity. In essence, these tools exploit the stability and spherical symmetry of Gaussian distributions. In this work, building from the framework of \cite{arous2020online}, we explore extensions of this picture beyond the Gaussian setting, where both stability or symmetry might be violated. Focusing on the planted setting where $\phi$ is known, our main results establish that Stochastic Gradient Descent can efficiently recover the unknown direction $\theta^*$ in the high-dimensional regime, under assumptions that extend previous works \cite{yehudai2020learning,wu2022learning}.

Show Abstract

SGD with Large Step Sizes Learns Sparse Features

Maksym Andriushchenko, Aditya Varre, L. Pillaud-Vivien, Nicolas Flammarion

We showcase important features of the dynamics of the Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) may lead the iterates to jump from one side of a valley to the other causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics that biases it implicitly toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used: the regularization effect comes solely from the SGD dynamics influenced by the large step sizes schedule. Therefore, these observations unveil how, through the step size schedules, both gradient and noise drive together the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired from stochastic processes. This analysis allows us to shed new light on some common practices and observed phenomena when training deep networks. The code of our experiments is available at https://github.com/tml-epfl/sgd-sparse-features.

Show Abstract

A Model-Based Method for Minimizing CVaR and Beyond

Si Yi Meng, R. M. Gower

We develop a variant of the stochastic prox-linear method for minimizing the Conditional Value-at-Risk (CVaR) objective. CVaR is a risk measure focused on minimizing worst-case performance, defined as the average of the top quantile of the losses. In machine learning, such a risk measure is useful to train more robust models. Although the stochastic subgradient method (SGM) is a natural choice for minimizing the CVaR objective, we show that our stochastic prox-linear (SPL+) algorithm can better exploit the structure of the objective, while still providing a convenient closed form update. Our SPL+ method also adapts to the scaling of the loss function, which allows for easier tuning. We then specialize a general convergence theorem for SPL+ to our setting, and show that it allows for a wider selection of step sizes compared to SGM. We support this theoretical finding experimentally.

Show Abstract

BridgeStan: Efficient in-memory access to the methods of a Stan model

Edward A. Roualdes, B. Ward, B. Carpenter, Adrian Seyboldt , Seth D. Axen

Stan provides a probabilistic programming language in which users can code Bayesian models (Carpenter et al., 2017; Stan Development Team, 2022). A Stan program is transpiled to to a C++ class that links to the Stan math library to implement smooth, unconstrained posterior log densities, gradients, and Hessians as well as constraining/unconstraining transforms. Implementation is provided through automatic differentiation in the Stan math library (Carpenter et al., 2015). BridgeStan provides in-memory access to the methods of Stan models through Python, Julia, R, and Rust. This allows algorithm development in these languages with the numerical efficiency and expressiveness of Stan models. Furthermore, these features are exposed through a language agnostic C API, allowing foreign function interfaces in other languages to utilize BridgeStan with minimal additional development.

Show Abstract

Unlocking the Potential of Similarity Matching: Scalability, Supervision and Pre-training

Y. Bahroun, Shagesh Sridharan, Atithi Acharya, D. Chklovskii, A. Sengupta

While effective, the backpropagation (BP) algorithm exhibits limitations in terms of biological plausibility, computational cost, and suitability for online learning. As a result, there has been a growing interest in developing alternative biologically plausible learning approaches that rely on local learning rules. This study focuses on the primarily unsupervised similarity matching (SM) framework, which aligns with observed mechanisms in biological systems and offers online, localized, and biologically plausible algorithms. i) To scale SM to large datasets, we propose an implementation of Convolutional Nonnegative SM using PyTorch. ii) We introduce a localized supervised SM objective reminiscent of canonical correlation analysis, facilitating stacking SM layers. iii) We leverage the PyTorch implementation for pre-training architectures such as LeNet and compare the evaluation of features against BP-trained models. This work combines biologically plausible algorithms with computational efficiency opening multiple avenues for further explorations.

Show Abstract

Kernelized Diffusion Maps

L. Pillaud-Vivien, Francis Bach, Ph.D.

Spectral clustering and diffusion maps are celebrated dimensionality reduction algorithms built on eigen-elements related to the diffusive structure of the data. The core of these procedures is the approximation of a Laplacian through a graph kernel approach, however this local average construction is known to be cursed by the high-dimension 𝑑. In this article, we build a different estimator of the Laplacian, via a reproducing kernel Hilbert spaces method, which adapts naturally to the regularity of the problem. We provide non-asymptotic statistical rates proving that the kernel estimator we build can circumvent the curse of dimensionality. Finally we discuss techniques (Nyström subsampling, Fourier features) that enable to reduce the computational cost of the estimator while not degrading its overall performance.

Show Abstract

General adjoint-differentiated Laplace approximation

C. Margossian

The hierarchical prior used in Latent Gaussian models (LGMs) induces a posterior geometry prone to frustrate inference algorithms. Marginalizing out the latent Gaussian variable using an integrated Laplace approximation removes the offending geometry, allowing us to do efficient inference on the hyperparameters. To use gradient-based inference we need to compute the approximate marginal likelihood and its gradient. The adjoint-differentiated Laplace approximation differentiates the marginal likelihood and scales well with the dimension of the hyperparameters. While this method can be applied to LGMs with any prior covariance, it only works for likelihoods with a diagonal Hessian. Furthermore, the algorithm requires methods which compute the first three derivatives of the likelihood with current implementations relying on analytical derivatives. I propose a generalization which is applicable to a broader class of likelihoods and does not require analytical derivatives of the likelihood. Numerical experiments suggest the added flexibility comes at no computational cost: on a standard LGM, the new method is in fact slightly faster than the existing adjoint-differentiated Laplace approximation. I also apply the general method to an LGM with an unconventional likelihood. This example highlights the algorithm's potential, as well as persistent challenges.

Show Abstract

Multi-level analysis of the gut–brain axis shows autism spectrum disorder-associated molecular and microbial profiles

J. Morton, Dong-Min Jin, Robert H. Mills, Yan Shao, Gibraan Rahman, Daniel McDonald, Qiyun Zhu, Metin Balaban, Yueyu Jiang, Kalen Cantrell, Antonio Gonzalez, Julie Carmel, Linoy Mia Frankiensztajn, Sandra Martin-Brevet, Kirsten Berding, Brittany D. Needham, María Fernanda Zurita, Maude David, Olga V. Averina, Alexey S. Kovtun, Antonio Noto, Michele Mussap, Mingbang Wang, Daniel N. Frank, Et al.

Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by heterogeneous cognitive, behavioral and communication impairments. Disruption of the gut–brain axis (GBA) has been implicated in ASD although with limited reproducibility across studies. In this study, we developed a Bayesian differential ranking algorithm to identify ASD-associated molecular and taxa profiles across 10 cross-sectional microbiome datasets and 15 other datasets, including dietary patterns, metabolomics, cytokine profiles and human brain gene expression profiles. We found a functional architecture along the GBA that correlates with heterogeneity of ASD phenotypes, and it is characterized by ASD-associated amino acid, carbohydrate and lipid profiles predominantly encoded by microbial species in the genera Prevotella, Bifidobacterium, Desulfovibrio and Bacteroides and correlates with brain gene expression changes, restrictive dietary patterns and pro-inflammatory cytokine profiles. The functional architecture revealed in age-matched and sex-matched cohorts is not present in sibling-matched cohorts. We also show a strong association between temporal changes in microbiome composition and ASD phenotypes. In summary, we propose a framework to leverage multi-omic datasets from well-defined cohorts and investigate how the GBA influences ASD.

Show Abstract

Improving Gradient Computation for Differentiable Physics Simulation with Contacts

Yaofeng Desmond Zhong, J. Han, Biswadip Dey, Georgia Olympia Brikis

Differentiable simulation enables gradients to be back-propagated through physics simulations. In this way, one can learn the dynamics and properties of a physics system by gradient-based optimization or embed the whole differentiable simulation as a layer in a deep learning model for downstream tasks, such as planning and control. However, differentiable simulation at its current stage is not perfect and might provide wrong gradients that deteriorate its performance in learning tasks. In this paper, we study differentiable rigid-body simulation with contacts. We find that existing differentiable simulation methods provide inaccurate gradients when the contact normal direction is not fixed - a general situation when the contacts are between two moving objects. We propose to improve gradient computation by continuous collision detection and leverage the time-of-impact (TOI) to calculate the post-collision velocities. We demonstrate our proposed method, referred to as TOI-Velocity, on two optimal control problems. We show that with TOI-Velocity, we are able to learn an optimal control sequence that matches the analytical solution, while without TOI-Velocity, existing differentiable simulation methods fail to do so.

Show Abstract
  • Previous Page
  • Viewing
  • Next Page
Advancing Research in Basic Science and MathematicsSubscribe to Flatiron Institute announcements and other foundation updates