443 Publications

cryoJAX: A Cryo-electron Microscopy Image Simulation Library In JAX

Michael J. O'Brien, S. Hanson, D. Needleman, et al.

While cryo-electron microscopy (cryo-EM) has come to prominence in the last decade due to its ability to resolve biomolecular complexes at atomic resolution, advancements in experimental and computational methods have made cryo-EM promising for investigating intracellular organization and heterogeneous molecular states. A primary challenge for these alternative applications is the development of techniques for cryo-EM data analysis, which are computationally demanding. To this end, it is advantageous to leverage advanced scientific computing frameworks for statistical analysis. One such framework is JAX, an emerging array-oriented Python numerical computing package for automatic differentiation and vectorization with a growing ecosystem for statistical inference and machine learning. We have developed cryoJAX, a cryo-EM image-simulation library for building computational data-analysis applications in JAX. CryoJAX is a flexible modeling language for cryo-EM image formation and therefore can support a wide range of downstream data analyses. By integrating with the JAX ecosystem, cryoJAX enables the development and deployment of algorithms for the growing breadth of scientific applications for cryo-EM.
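The core of any cryo-EM image simulator is a projection model: integrate a 3D density along the optical axis, then blur it in the image plane. A minimal pure-Python sketch of that idealized forward model (this is not cryoJAX's actual API; all names and parameters here are illustrative, and real simulators add a contrast transfer function and noise):

```python
import math

def simulate_image(atoms, n=16, pixel_size=1.0, sigma=1.5):
    """Project 3D point scatterers onto a 2D grid of Gaussian blobs.

    Toy weak-phase image-formation model: the z coordinate is integrated
    out (ignored), as in an ideal untilted projection, and each atom
    contributes an isotropic Gaussian of width `sigma` in the x-y plane.
    `atoms` is a list of (x, y, z) coordinates in pixel units.
    """
    img = [[0.0] * n for _ in range(n)]
    for (ax, ay, _az) in atoms:
        for i in range(n):
            for j in range(n):
                x, y = j * pixel_size, i * pixel_size  # pixel centers
                r2 = (x - ax) ** 2 + (y - ay) ** 2
                img[i][j] += math.exp(-r2 / (2.0 * sigma ** 2))
    return img

# two 'atoms'; the image peaks at their projected positions
image = simulate_image([(5.0, 5.0, 0.0), (10.0, 8.0, 3.0)])
```

Written with JAX arrays instead of nested lists, the same model becomes differentiable and vectorizable over particle stacks, which is the point of building the simulator in that framework.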

A Lightweight, Geometrically Flexible Fast Algorithm for the Evaluation of Layer and Volume Potentials

F. Fryklund, L. Greengard, S. Jiang, Samuel Potter

Over the last two decades, several fast, robust, and high-order accurate methods have been developed for solving the Poisson equation in complicated geometry using potential theory. In this approach, rather than discretizing the partial differential equation itself, one first evaluates a volume integral to account for the source distribution within the domain, followed by solving a boundary integral equation to impose the specified boundary conditions. Here, we present a new fast algorithm which is easy to implement and compatible with virtually any discretization technique, including unstructured domain triangulations, such as those used in standard finite element or finite volume methods. Our approach combines earlier work on potential theory for the heat equation, asymptotic analysis, the nonuniform fast Fourier transform (NUFFT), and the dual-space multilevel kernel-splitting (DMK) framework. It is insensitive to flaws in the triangulation, permitting not just nonconforming elements, but arbitrary aspect ratio triangles, gaps and various other degeneracies. On a single CPU core, the scheme computes the solution at a rate comparable to that of the fast Fourier transform (FFT) in work per gridpoint.
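For context, the brute-force baseline that such fast algorithms accelerate is direct summation of the volume integral over a triangulation. A sketch of that baseline for the 2D Newtonian potential, with one-point centroid quadrature (schematic only; the paper's scheme uses the NUFFT and DMK framework to reach near-FFT cost per gridpoint):

```python
import math

def volume_potential(targets, centroids, areas, f):
    """Naive O(N*M) direct evaluation of the 2D volume potential

        u(x) = -(1/(2*pi)) * integral of log|x - y| * f(y) dy,

    with one-point (centroid) quadrature on a triangulation."""
    out = []
    for tx, ty in targets:
        u = 0.0
        for (sx, sy), a in zip(centroids, areas):
            r = math.hypot(tx - sx, ty - sy)
            if r > 1e-14:  # crude skip of the weakly singular self-term
                u -= math.log(r) / (2.0 * math.pi) * f(sx, sy) * a
        out.append(u)
    return out

# crude triangulation of the unit square: each grid cell split in two
n, h = 8, 1.0 / 8
centroids, areas = [], []
for i in range(n):
    for j in range(n):
        x0, y0 = j * h, i * h
        centroids.append((x0 + 2 * h / 3, y0 + h / 3))  # lower triangle
        centroids.append((x0 + h / 3, y0 + 2 * h / 3))  # upper triangle
        areas.extend([h * h / 2, h * h / 2])

u = volume_potential([(0.25, 0.5), (0.75, 0.5)], centroids, areas,
                     lambda x, y: 1.0)
```

Note that this naive sum is insensitive to mesh quality in exactly the way the abstract describes: only centroids and areas enter, so nonconforming or high-aspect-ratio triangles pose no difficulty; the challenge the paper addresses is doing this fast and to high order.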

Understanding the Mechanisms of Fast Hyperparameter Transfer

The growing scale of deep learning models has rendered exhaustive hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware HPs, which can enable direct transfer of optimal settings from small-scale grid searches to large models with minimal performance loss. Such approaches are useful when the optimal settings converge "fast" enough with scale. While approaches like the Maximal Update Parameterization (μP) have empirically displayed fast transfer when scaling model width, a deeper conceptual understanding of the mechanisms that enable this is still missing. Our work establishes a systematic conceptual framework for analyzing fast HP transfer across different synthetic and practical scenarios. In synthetic settings, we present various quantitative examples where transfer either offers a provable computational advantage or fails even under μP. We then propose a key property that enables the fast transfer often observed in practice: through a novel decomposition of the optimization trajectory, we identify one component that rapidly converges with model width and determines the optimal HPs, and another that continues to improve the loss with increased width but has negligible impact on HP choice. We conjecture that this decomposition elucidates the key mechanisms behind fast transfer and empirically validate it in practical settings such as LLM training.
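The mechanics of width transfer can be made concrete. Under a simplified reading of μP's rules for Adam, the hidden-layer learning rate scales like 1/width and hidden-weight initialization like 1/sqrt(fan_in), so a value tuned on a narrow proxy model is rescaled rather than re-searched (a sketch of the recipe, not the full parameterization, which also treats input and output layers specially):

```python
import math

def mup_transfer(base_lr, base_width, width):
    """Transfer a tuned hidden-layer learning rate from a proxy model of
    `base_width` to a target model of `width`, under a simplified μP-style
    rule for Adam: hidden-layer LR scales like 1/width, so the tuned value
    is multiplied by base_width / width.  Hidden-weight init std scales
    like 1/sqrt(fan_in)."""
    lr = base_lr * base_width / width
    init_std = 1.0 / math.sqrt(width)
    return lr, init_std

# tune at width 256, deploy at width 1024: LR shrinks by 4x
lr, std = mup_transfer(base_lr=1e-2, base_width=256, width=1024)
```

"Fast transfer" in the paper's sense is the empirical fact that the argmax of the tuning curve stabilizes at small width, so the rescaled value stays near-optimal as width grows.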

Comparing cryo-EM methods and molecular dynamics simulation to investigate heterogeneity in ligand-bound TRPV1

M. Astore, David Silva-Sánchez, R. Blackwell, P. Cossio, S. Hanson

Cryogenic electron microscopy (cryo-EM) has emerged as a powerful method for resolving the structure of biological macromolecules. Recently, several computational methods have been developed to study the heterogeneity of molecules in single-particle cryo-EM. In this study, we analyze a publicly available dataset of TRPV1 using five such methods: 3DFlex, 3DVA, cryoDRGN, ManifoldEM, and Bayesian ensemble reweighting. We find significant heterogeneity, but each method produces different results, with some detecting only compositional or conformational heterogeneity. To compare these diverse results, we develop AnaVox to quantitatively determine agreement between heterogeneity methods. Furthermore, applying Bayesian ensemble reweighting combined with molecular dynamics simulations supports the presence of these rarer states within the sample. This study shows that although current methods reveal the presence of heterogeneity, their stochasticity and potential bias present challenges for their routine use. However, with future development, these tools will enable the use of cryo-EM data for quantitative biophysical investigations.
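Of the five methods compared, Bayesian ensemble reweighting is the simplest to state: given per-structure image log-likelihoods, posterior weights over the candidate ensemble follow from Bayes' rule. A schematic sketch (the actual estimator involves more careful marginalization over poses and imaging parameters):

```python
import math

def reweight(log_likelihoods, log_prior=None):
    """Bayesian ensemble reweighting, schematically: given per-model
    summed image log-likelihoods, log_likelihoods[m] = sum_i log p(y_i | x_m),
    return normalized posterior weights over the candidate ensemble."""
    m = len(log_likelihoods)
    if log_prior is None:
        log_prior = [0.0] * m  # uniform prior over ensemble members
    logw = [lp + ll for lp, ll in zip(log_prior, log_likelihoods)]
    # log-sum-exp normalization for numerical stability
    mx = max(logw)
    z = mx + math.log(sum(math.exp(v - mx) for v in logw))
    return [math.exp(v - z) for v in logw]

# a model 3x more likely under the data receives 3x the weight
weights = reweight([0.0, math.log(3.0)])
```

The log-sum-exp step matters in practice: summed image log-likelihoods for real datasets are large negative numbers, and naive exponentiation underflows.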

Stabilizing the singularity swap quadrature for near-singular line integrals

David Krantz, A. Barnett, Anna-Karin Tornberg

Singularity swap quadrature (SSQ) is an effective method for evaluating, at nearby targets, potentials due to densities on curves in three dimensions. While highly accurate in most settings, it is known to suffer from catastrophic cancellation when the kernel exhibits both near-vanishing numerators and strong singularities, as arises with scalar double layer potentials or tensorial kernels in Stokes flow or linear elasticity. This precision loss turns out to be tied to the interpolation basis, namely monomial (for open curves) or Fourier (for closed curves). We introduce a simple yet powerful remedy: target-specific translated monomial and Fourier bases that explicitly incorporate the near-vanishing behavior of the kernel numerator. We combine this with a stable evaluation of the constant term which now dominates the integral, significantly reducing cancellation. We show that our approach achieves close to machine precision for prototype integrals, and up to ten orders of magnitude lower error than standard SSQ at extremely close evaluation distances, without significant additional computational cost.
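The underlying cancellation mechanism is generic and easy to reproduce: a function that nearly vanishes at the evaluation point is computed accurately in a basis translated to that point, but loses most of its digits in the standard monomial basis. A textbook illustration (this demonstrates the basis effect only, not SSQ itself):

```python
from math import comb

def horner(coeffs, x):
    """Evaluate sum_k coeffs[k] * x**k by Horner's rule."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

# p(x) = (x - 1)**10 expanded in the standard monomial basis:
# coefficients are the alternating binomial coefficients.
mono = [comb(10, k) * (-1) ** (10 - k) for k in range(11)]

x = 1.0001                    # evaluation point very near the root
naive   = horner(mono, x)     # monomial basis: catastrophic cancellation
shifted = (x - 1.0) ** 10     # translated basis (x - x0)**k: accurate
```

The intermediate Horner values are O(100), so the monomial result carries absolute errors around machine epsilon times that scale, many orders of magnitude larger than the true value of about 1e-40, which the translated form recovers directly. Target-specific translated bases apply the same idea inside the SSQ interpolation step.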

Facilitating analysis of open neurophysiology data on the DANDI Archive using large language model tools

The DANDI Archive is a key resource for sharing open neurophysiology data, hosting over 400 datasets in the Neurodata Without Borders (NWB) format. While these datasets hold tremendous potential for reanalysis and discovery, many researchers face barriers to reuse, including unfamiliarity with access methods and difficulty identifying relevant content. Here we introduce an AI-powered, agentic chat assistant and a notebook generation pipeline. The chat assistant serves as an interactive tool for exploring DANDI datasets. It leverages large language models (LLMs) and integrates with agentic tools to guide users through data access, visualization, and preliminary analysis. The notebook generator analyzes dataset structure with minimal human input, executing inspection scripts and generating visualizations. It then produces an instructional Python notebook tailored to the dataset. We applied this system to 12 recent datasets. Review by neurophysiology data specialists found the generated notebooks to be generally accurate and well-structured, with most notebooks rated as “very helpful.” This work demonstrates how AI can support FAIR principles by leveraging data standards and lowering barriers to data reuse and engagement.
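The notebook generator's final artifact is an ordinary Jupyter notebook, which is just structured JSON (nbformat 4). A toy stand-in for the assembly step, with the LLM-driven dataset inspection replaced by a hand-written list of cells (the helper name and cell contents here are illustrative):

```python
import json

def build_notebook(cells):
    """Assemble a minimal Jupyter notebook (nbformat 4) from a list of
    (cell_type, source) pairs.  Code cells additionally carry empty
    `outputs` and a null `execution_count`, per the nbformat schema."""
    nb = {
        "cells": [
            {
                "cell_type": kind,
                "metadata": {},
                "source": src.splitlines(keepends=True),
                **({"outputs": [], "execution_count": None}
                   if kind == "code" else {}),
            }
            for kind, src in cells
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }
    return json.dumps(nb, indent=1)

doc = build_notebook([
    ("markdown", "# Exploring a DANDI dataset\nData access and plotting."),
    ("code", "print('hello NWB')"),
])
```

In the paper's pipeline, the cell list is produced by the model after executing inspection scripts against the dataset; the serialization step itself is this simple.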

A Model-Guided Neural Network Method for the Inverse Scattering Problem

Olivia Tsang, O. Melia, Vasileios Charisopoulos, Jeremy Hoskins, Rebecca Willett

Inverse medium scattering is an ill-posed, nonlinear wave-based imaging problem arising in medical imaging, remote sensing, and non-destructive testing. Machine learning (ML) methods offer increased inference speed and flexibility in capturing prior knowledge of imaging targets relative to classical optimization-based approaches; however, they perform poorly in regimes where the scattering behavior is highly nonlinear. A key limitation is that ML methods struggle to incorporate the physics governing the scattering process, which are typically inferred implicitly from the training data or loosely enforced via architectural design. In this paper, we present a method that endows a machine learning framework with explicit knowledge of problem physics, in the form of a differentiable solver representing the forward model. The proposed method progressively refines reconstructions of the scattering potential using measurements at increasing wave frequencies, following a classical strategy to stabilize recovery. Empirically, we find that our method provides high-quality reconstructions at a fraction of the computational or sampling costs of competing approaches.
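The frequency-continuation strategy can be illustrated on a one-parameter toy: fitting a high-frequency measurement directly gets trapped in a spurious minimum, while sweeping from low to high frequency tracks the true solution (a caricature with a scalar unknown and a sinusoidal "forward model"; the paper's setting is a full differentiable PDE solver inside a learned reconstruction):

```python
import math

def recover(q_true, freqs, steps=400):
    """Toy frequency continuation: minimize (sin(k*q) - sin(k*q_true))**2
    over q by gradient descent, sweeping k from low to high.  Low
    frequencies give a wide basin of attraction; higher frequencies
    sharpen the estimate."""
    q = 0.0
    for k in freqs:
        lr = 0.5 / k ** 2  # step size scaled to keep the iteration stable
        for _ in range(steps):
            resid = math.sin(k * q) - math.sin(k * q_true)
            grad = 2.0 * resid * k * math.cos(k * q)
            q -= lr * grad
    return q

q_hat = recover(0.7, freqs=[1.0, 2.0, 4.0])   # sweep: finds q_true
q_bad = recover(0.7, freqs=[4.0])             # high-k only: spurious minimum
```

Starting directly at k = 4 lands on a nearby solution of the high-frequency data that is far from the truth; the sweep avoids it, which is exactly the stabilization the progressive refinement in the paper inherits.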

Protein Design with Agent Rosetta: A Case Study for Specialized Scientific Agents

Jacopo Teneggi, Tanya Marwah, A. Bietti, P. Douglas Renfrew, Vikram Mulligan, S. Golkar

Large language models (LLMs) are increasingly capable of emulating reasoning and using tools, creating opportunities for autonomous agents that execute complex scientific tasks. Protein design provides a natural case study: existing deep learning models achieve strong results, but they are typically restricted to canonical amino acids and narrow objectives, leaving space for a generalist tool for broad design pipelines. We introduce Agent Rosetta, an LLM agent built on top of the Rosetta suite---the leading physics-based software for heteropolymer design, capable of modeling non-canonical building blocks and geometries. Agent Rosetta is a single-agent, multi-turn framework that iteratively refines heteropolymers to achieve the goals of a user-defined task brief, combining the biophysical knowledge of modern LLMs with the accuracy of Rosetta's physics-based methods. In evaluations, Agent Rosetta achieves performance comparable to specialized deep learning models, especially when combined with inference-time techniques such as best-of-n sampling. Interestingly, we find that prompt engineering alone is insufficient for reliably producing RosettaScripts actions. This underscores the need for building a comprehensive environment that, for example, simplifies the most challenging aspects of RosettaScripts syntax. These results demonstrate that combining frontier LLMs with established domain-specific scientific tools can yield flexible agentic frameworks that not only lower barriers to use but also achieve performance competitive with specialized deep learning models.
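The best-of-n inference-time technique mentioned above is itself a small loop: sample n candidates from a stochastic proposer and keep the one the scorer ranks best. A generic sketch with hypothetical stand-ins for the LLM proposal step and a Rosetta energy evaluation (neither function below corresponds to a real Agent Rosetta or Rosetta API):

```python
import random

def best_of_n(propose, score, n, seed=0):
    """Best-of-n sampling: draw n candidates from `propose` and return
    the one with the lowest `score` (lower = better, as for an energy).
    `propose` and `score` are hypothetical stand-ins for the agent's
    LLM proposal step and a physics-based evaluation."""
    rng = random.Random(seed)
    best, best_s = None, float("inf")
    for _ in range(n):
        cand = propose(rng)
        s = score(cand)
        if s < best_s:
            best, best_s = cand, s
    return best, best_s

# toy stand-ins: 'designs' are numbers, 'energy' is distance to a target
target = 3.0
design, energy = best_of_n(
    propose=lambda rng: rng.uniform(0.0, 10.0),
    score=lambda x: abs(x - target),
    n=64,
)
```

The value of the technique rests entirely on the scorer being trustworthy, which is why pairing sampled LLM proposals with Rosetta's physics-based scoring is a natural fit.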

From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers

Ryotaro Kawata, Yujin Song, A. Bietti, Naoki Nishikawa, Taiji Suzuki, Samuel Vaiter, D. Wu

Transformers can implement both generalizable algorithms (e.g., induction heads) and simple positional shortcuts (e.g., memorizing fixed output positions). In this work, we study how the choice of pretraining data distribution steers a shallow transformer toward one behavior or the other. Focusing on a minimal trigger-output prediction task -- copying the token immediately following a special trigger upon its second occurrence -- we present a rigorous analysis of gradient-based training of a single-layer transformer. In both the infinite and finite sample regimes, we prove a transition in the learned mechanism: if input sequences exhibit sufficient diversity, measured by a low “max-sum” ratio of trigger-to-trigger distances, the trained model implements an induction head and generalizes to unseen contexts; by contrast, when this ratio is large, the model resorts to a positional shortcut and fails to generalize out-of-distribution (OOD). We also reveal a trade-off between the pretraining context length and OOD generalization, and derive the optimal pretraining distribution that minimizes computational cost per sample. Finally, we validate our theoretical predictions with controlled synthetic experiments, demonstrating that broadening context distributions robustly induces induction heads and enables OOD generalization. Our results shed light on the algorithmic biases of pretrained transformers and offer conceptual guidelines for data-driven control of their learned behaviors.
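Both the task and the diversity measure are concrete enough to sketch. Below, the first function is the rule an induction head implements on the trigger-output task; the second is a schematic version of the "max-sum" ratio over trigger-to-trigger distances (the paper's precise definition may differ in detail):

```python
def induction_copy(seq, trigger):
    """The trigger-output task, solved by the rule an induction head
    implements: at the second occurrence of `trigger`, emit the token
    that immediately followed its first occurrence."""
    first = seq.index(trigger)
    second = seq.index(trigger, first + 1)
    assert second == len(seq) - 1  # the query sits at the second trigger
    return seq[first + 1]

def max_sum_ratio(distances):
    """Schematic diversity measure: ratio of the largest
    trigger-to-trigger distance to their sum across training sequences.
    Small ratio = diverse distances (induction head emerges);
    large ratio = one distance dominates (a positional shortcut suffices)."""
    return max(distances) / sum(distances)

out = induction_copy(["a", "T", "b", "c", "T"], trigger="T")
```

When all training sequences share one trigger distance, the ratio is maximal for its length and predicting a fixed relative position reproduces the data, so nothing forces the content-based induction mechanism to form.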

Emergence of Linear Truth Encodings in Language Models

Shauli Ravfogel, Gilad Yehudai, Tal Linzen, Joan Bruna, A. Bietti

Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and vice-versa), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then---over a longer horizon---learn to linearly separate true from false, which in turn lowers language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.
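The simplest linear probe used in such studies is a difference-of-means direction between representations of true and false statements, with the midpoint as threshold. A self-contained sketch on toy "hidden states" (illustrative data; real probes are fit on model activations):

```python
def mean(vs):
    n = len(vs)
    return [sum(v[i] for v in vs) / n for i in range(len(vs[0]))]

def truth_probe(true_reps, false_reps):
    """Fit a difference-of-means linear probe: the separating direction
    is w = mean(true) - mean(false), and the bias places the decision
    boundary at the midpoint between the two class means."""
    mt, mf = mean(true_reps), mean(false_reps)
    w = [a - b for a, b in zip(mt, mf)]
    b = -sum(wi * (a + c) / 2.0 for wi, a, c in zip(w, mt, mf))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b > 0.0

# toy 'hidden states': true statements shifted along one coordinate
probe = truth_probe(
    true_reps=[[1.0, 0.2], [1.1, -0.1], [0.9, 0.0]],
    false_reps=[[-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.0]],
)
```

The existence of such a direction is exactly what the probing literature reports; the paper's contribution is a toy model in which the training dynamics that produce it can be watched end-to-end.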
