CCM: Publications

Offline supervised learning vs online direct policy optimization: A comparative study and a unified training paradigm for neural network-based optimal feedback control

Yue Zhao, J. Han

This work is concerned with solving neural network-based feedback controllers efficiently for optimal control problems. We first conduct a comparative study of two prevalent approaches: offline supervised learning and online direct policy optimization. Although the training part of the supervised learning approach is relatively easy, the success of the method heavily depends on the optimal control dataset generated by open-loop optimal control solvers. In contrast, direct policy optimization turns the optimal control problem into an optimization problem directly without any requirement of pre-computing, but the dynamics-related objective can be hard to optimize when the problem is complicated. Our results underscore the superiority of offline supervised learning in terms of both optimality and training time. To overcome the main challenge of each approach, the dataset and the optimization respectively, we combine the two and propose the Pre-train and Fine-tune strategy as a unified training paradigm for optimal feedback control, which further improves the performance and robustness significantly. Our code is accessible at https://github.com/yzhao98/DeepOptimalControl.


Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

Siavash Golkar, A. Bietti, Mariel Pettee, Michael Eickenberg, et al.

Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datasets, akin to object detection or region-based scientific analysis. We present theoretical and empirical analysis using both causal and non-causal Transformer architectures, investigating the influence of various positional encodings on performance and interpretability. In particular, we find that causal attention is much better suited for the task, and that using no positional embeddings leads to the best accuracy, though rotary embeddings are competitive and easier to train. We also show that out-of-distribution performance is tightly linked to which tokens the model uses as a bias term.


Crowdsourcing with Difficulty: A Bayesian Rating Model for Heterogeneous Items

Seong Woo Han, Ozan Adıgüzel, B. Carpenter

In applied statistics and machine learning, the "gold standards" used for training are often biased and almost always noisy. Dawid and Skene's justifiably popular crowdsourcing model adjusts for rater (coder, annotator) sensitivity and specificity, but fails to capture distributional properties of rating data gathered for training, which in turn biases training. In this study, we introduce a general purpose measurement-error model with which we can infer consensus categories by adding item-level effects for difficulty, discriminativeness, and guessability. We further show how to constrain the bimodal posterior of these models to avoid (or if necessary, allow) adversarial raters. We validate our model's goodness of fit with posterior predictive checks, the Bayesian analogue of χ


Neurosift: DANDI exploration and NWB visualization in the browser

J. Magland, J. Soules, Cody Baker, Benjamin Dichter

Neurosift, a browser-based visualization tool, is designed for the interactive exploration of Neurodata Without Borders (NWB) files, whether stored locally, on remote servers, or within the Distributed Archives for Neurophysiology Data Integration (DANDI). NWB (Rübel et al., 2022; Teeters et al., 2015) is an open data standard for neurophysiology that enables the sharing, archiving, and analysis of various types of neurophysiology data. DANDI (Rübel et al., 2022) is a cloud-based platform that supports the storage, sharing, and analysis of neurophysiology data including NWB files. With Neurosift integration, users browsing DANDI can easily open any NWB file in the browser and explore its contents, including timeseries data, images, and more. Neurosift can also be used to browse the DANDI database or individual Dandisets. Overall, Neurosift simplifies the visualization and exploration of complex NWB file structures, making it a valuable tool for neuroscientists.


Why is parameter averaging beneficial in SGD? An objective smoothing perspective

Atsushi Nitanda, Ryuhei Kikuchi, Shugo Maeda, D. Wu

It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a solution with good generalization performance; such implicit bias is often characterized in terms of the sharpness of the minima. Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD which eliminates sharp local minima by the convolution using the stochastic gradient noise. We follow this line of research and study the commonly-used averaged SGD algorithm, which has been empirically observed in Izmailov et al. (2018) to prefer a flat minimum and therefore achieves better generalization. We prove that in certain problem settings, averaged SGD can efficiently optimize the smoothed objective which avoids sharp local minima. In experiments, we verify our theory and show that parameter averaging with an appropriate step size indeed leads to significant improvement in the performance of SGD.
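As a rough illustration of the parameter averaging discussed in this abstract, the following minimal sketch runs noisy gradient steps and keeps a running mean of the iterates. The function name `averaged_sgd`, the Gaussian gradient noise, and the quadratic test objective are all assumptions made for illustration; they are not taken from the paper and do not reproduce its problem settings.

```python
import random


def averaged_sgd(grad, x0, step_size, n_steps, noise_std, seed=0):
    """Run noisy gradient descent and return (last iterate, averaged iterate).

    `grad` is the exact gradient of a 1D objective; Gaussian noise is added
    to mimic a stochastic gradient (an assumption of this sketch).
    """
    rng = random.Random(seed)
    x = x0
    avg = x0  # running mean of x_0, ..., x_t
    for t in range(1, n_steps + 1):
        g = grad(x) + rng.gauss(0.0, noise_std)  # noisy gradient estimate
        x = x - step_size * g                    # plain SGD step
        avg += (x - avg) / (t + 1)               # incremental mean update
    return x, avg


# Illustrative objective f(x) = 0.5 * x**2, whose gradient is x.
last, averaged = averaged_sgd(lambda x: x, x0=5.0, step_size=0.1,
                              n_steps=5000, noise_std=1.0)
```

On this toy problem the last iterate keeps fluctuating at a scale set by the step size, while the averaged iterate settles much closer to the minimizer at zero.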


Simulation-Based Stacking

Yuling Yao, B. Régaldo-Saint Blancard, Justin Domke

Simulation-based inference has been popular for amortized Bayesian computation. It is typical to have more than one posterior approximation, from different inference algorithms, different architectures, or simply the randomness of initialization and stochastic gradients. With a consistency guarantee, we present a general posterior stacking framework to make use of all available approximations. Our stacking method is able to combine densities, simulation draws, confidence intervals, and moments, and address the overall precision, calibration, coverage, and bias of the posterior approximation at the same time. We illustrate our method on several benchmark simulations and a challenging cosmological inference task.


GIST: Gibbs self-tuning for locally adaptive Hamiltonian Monte Carlo

N. Bou-Rabee, B. Carpenter, Milo Marsden

We introduce a novel and flexible framework for constructing locally adaptive Hamiltonian Monte Carlo (HMC) samplers by Gibbs sampling the algorithm's tuning parameters conditionally based on the position and momentum at each step. For adaptively sampling path lengths, this framework -- which we call Gibbs self-tuning (GIST) -- encompasses randomized HMC, multinomial HMC, the No-U-Turn Sampler (NUTS), and the Apogee-to-Apogee Path Sampler as special cases. The GIST framework is illustrated with a novel alternative to NUTS for locally adapting path lengths, evaluated with an exact Hamiltonian for a high-dimensional, ill-conditioned Gaussian measure and with the leapfrog integrator for a suite of diverse models.


Improved statistical and computational complexity of the mean-field Langevin dynamics under structured data

Atsushi Nitanda, Kazusato Oko, Taiji Suzuki, D. Wu

Recent works have shown that neural networks optimized by gradient-based methods can adapt to sparse or low-dimensional target functions through feature learning; an often studied target is the sparse parity function on the unit hypercube. However, such an isotropic data setting does not capture the anisotropy and low intrinsic dimensionality exhibited in realistic datasets. In this work, we address this shortcoming by studying how gradient-based feature learning interacts with structured (anisotropic) input data: we consider the classification of k-sparse parity on a high-dimensional orthotope where the feature coordinates have varying magnitudes, and analyze the learning complexity of the mean-field Langevin dynamics (MFLD), which describes the noisy gradient descent update on a two-layer neural network. We show that the statistical complexity (i.e. sample size) and computational complexity (i.e. network width) of MFLD can both be improved when prominent directions of the anisotropic input data align with the support of the target function. Moreover, by employing a coordinate transform determined by the gradient covariance, the width can be made independent of the target degree. Lastly, we demonstrate the benefit of feature learning by establishing a kernel lower bound on the classification error, which applies to neural networks in the lazy regime.
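For readers unfamiliar with the sparse parity target mentioned in this abstract, here is a minimal sketch of the standard isotropic version: inputs are drawn uniformly from the hypercube {-1, +1}^d and the label is the product of the coordinates in a small support set. The helper names `sparse_parity` and `sample_hypercube` are hypothetical, and the sketch does not model the anisotropic orthotope setting studied in the paper.

```python
import random
from math import prod


def sparse_parity(x, support):
    """Label of x in {-1, +1}^d: the product of the coordinates in `support`."""
    return prod(x[i] for i in support)


def sample_hypercube(dim, rng):
    """Draw one point uniformly from {-1, +1}^dim."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


rng = random.Random(0)
support = [0, 3, 7]                # the relevant coordinates, so the sparsity is 3
x = sample_hypercube(20, rng)      # 20-dimensional input
y = sparse_parity(x, support)      # label in {-1, +1}
```

Only the coordinates in `support` affect the label, which is what makes the target sparse even when the ambient dimension is large.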


Efficient convergent boundary integral methods for slender bodies

D. Malhotra, A. Barnett

The interaction of fibers in a viscous (Stokes) fluid plays a crucial role in industrial and biological processes, such as sedimentation, rheology, transport, cell division, and locomotion. Numerical simulations generally rely on slender body theory (SBT), an asymptotic, nonconvergent approximation whose error blows up as fibers approach each other. Yet convergent boundary integral equation (BIE) methods which completely resolve the fiber surface have so far been impractical due to the prohibitive cost of layer-potential quadratures in such high aspect-ratio 3D geometries. We present a high-order Nyström quadrature scheme with aspect-ratio independent cost, making such BIEs practical. It combines centerline panels (each with a small number of poloidal Fourier modes), toroidal Green's functions, generalized Chebyshev quadratures, HPC parallel implementation, and FMM acceleration. We also present new BIE formulations for slender bodies that lead to well-conditioned linear systems upon discretization. We test Laplace and Stokes Dirichlet problems, and Stokes mobility problems, for slender rigid closed fibers with (possibly varying) circular cross-section, at separations down to 1/20 of the slender radius, reporting convergence typically to at least 10 digits. We use this to quantify the breakdown of numerical SBT for close-to-touching rigid fibers. We also apply the methods to time-step the sedimentation of 512 loops with up to 1.65 million unknowns at around 7 digits of accuracy.


Galaxy clustering analysis with SimBIG and the wavelet scattering transform

B. Régaldo-Saint Blancard, ChangHoon Hahn, Shirley Ho, Jiamin Hou, Pablo Lemos, Elena Massara, C. Modi, Azadeh Moradinezhad Dizgah, Liam Parker, Y. Yao, M. Eickenberg

The non-Gaussian spatial distribution of galaxies traces the large-scale structure of the Universe and therefore constitutes a prime observable to constrain cosmological parameters. We conduct Bayesian inference of the ΛCDM parameters Ωm, Ωb, h, ns, and σ8 from the Baryon Oscillation Spectroscopic Survey CMASS galaxy sample by combining the wavelet scattering transform (WST) with a simulation-based inference approach enabled by the SimBIG forward model. We design a set of reduced WST statistics that leverage symmetries of redshift-space data. Posterior distributions are estimated with a conditional normalizing flow trained on 20,000 simulated SimBIG galaxy catalogs with survey realism. We assess the accuracy of the posterior estimates using simulation-based calibration and quantify generalization and robustness to the change of forward model using a suite of 2000 test simulations. When probing scales down to kmax=0.5 h/Mpc, we are able to derive accurate posterior estimates that are robust to the change of forward model for all parameters, except σ8. We mitigate the robustness issues with σ8 by removing the WST coefficients that probe scales smaller than k∼0.3 h/Mpc. Applied to the Baryon Oscillation Spectroscopic Survey CMASS sample, our WST analysis yields seemingly improved constraints compared with those obtained from a standard perturbation-theory-based power spectrum analysis with kmax=0.25 h/Mpc for all parameters except h. However, we still raise concerns about these results. The observational predictions vary significantly across different normalizing flow architectures, which we interpret as a form of model misspecification. This highlights a key challenge for forward modeling approaches when using summary statistics that are sensitive to detailed model-specific or observational imprints on galaxy clustering.
