Publications

Deep Learning Sequence Models for Transcriptional Regulation

Deciphering the regulatory code of gene expression and interpreting the transcriptional effects of genome variation are critical challenges in human genetics. Modern experimental technologies have resulted in an abundance of data, enabling the development of sequence-based deep learning models that link patterns embedded in DNA to the biochemical and regulatory properties contributing to transcriptional regulation, including modeling epigenetic marks, 3D genome organization, and gene expression, with tissue and cell-type specificity. Such methods can predict the functional consequences of any noncoding variant in the human genome, even rare or never-before-observed variants, and systematically characterize their consequences beyond what is tractable from experiments or quantitative genetics studies alone. Recently, the development and application of interpretability approaches have led to the identification of key sequence patterns contributing to the predicted tasks, providing insights into the underlying biological mechanisms learned and revealing opportunities for improvement in future models.

BlastAssist: a deep learning pipeline to measure interpretable features of human embryos

Helen Y Yang, Brian D Leahy, D. Needleman

Can the BlastAssist deep learning pipeline perform comparably to or outperform human experts and embryologists at measuring interpretable, clinically relevant features of human embryos in IVF?

The Hund-metal path to strong electronic correlations

A. Georges, G. Kotliar

Different families of materials follow distinct routes to strong-correlation physics. In so-called Mott insulator systems, the Coulomb repulsion of electrons impedes their motion and blocks their kinetic energy. Materials in the heavy-fermion family have two fluids of electrons, which live rather independent lives at high temperatures: mobile electrons and localized f electrons that form local magnetic moments. At very low temperatures, the hybridization, or quantum mechanical mixing, between those two species of electrons becomes relevant. Then a single fluid of itinerant, albeit slowly moving, “heavy” electronic quasiparticles emerges below a characteristic scale known as the Kondo temperature.

Galaxy clustering analysis with SimBIG and the wavelet scattering transform

B. Régaldo-Saint Blancard, ChangHoon Hahn, Shirley Ho, Jiamin Hou, Pablo Lemos, Elena Massara, C. Modi, Azadeh Moradinezhad Dizgah, Liam Parker, Y. Yao, M. Eickenberg

The non-Gaussian spatial distribution of galaxies traces the large-scale structure of the Universe and therefore constitutes a prime observable to constrain cosmological parameters. We conduct Bayesian inference of the ΛCDM parameters Ωm, Ωb, h, ns, and σ8 from the Baryon Oscillation Spectroscopic Survey CMASS galaxy sample by combining the wavelet scattering transform (WST) with a simulation-based inference approach enabled by the SimBIG forward model. We design a set of reduced WST statistics that leverage symmetries of redshift-space data. Posterior distributions are estimated with a conditional normalizing flow trained on 20,000 simulated SimBIG galaxy catalogs with survey realism. We assess the accuracy of the posterior estimates using simulation-based calibration and quantify generalization and robustness to the change of forward model using a suite of 2,000 test simulations. When probing scales down to kmax = 0.5 h/Mpc, we are able to derive accurate posterior estimates that are robust to the change of forward model for all parameters except σ8. We mitigate the robustness issues with σ8 by removing the WST coefficients that probe scales smaller than k ∼ 0.3 h/Mpc. Applied to the Baryon Oscillation Spectroscopic Survey CMASS sample, our WST analysis yields seemingly improved constraints relative to those obtained from a standard perturbation-theory-based power spectrum analysis with kmax = 0.25 h/Mpc for all parameters except h. However, we still raise concerns about these results. The observational predictions vary significantly across different normalizing flow architectures, which we interpret as a form of model misspecification. This highlights a key challenge for forward modeling approaches when using summary statistics that are sensitive to detailed model-specific or observational imprints on galaxy clustering.
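As a rough illustration of the simulation-based calibration check mentioned in this abstract, the sketch below computes rank statistics for a single parameter. The function name `sbc_ranks`, the array shapes, and the toy numbers are assumptions made for illustration; the abstract does not specify how the authors implement the test.

```python
import numpy as np

def sbc_ranks(true_params, posterior_samples):
    """Rank of each true parameter value among its posterior draws.

    true_params: shape (n_sims,), the value used to generate each test simulation.
    posterior_samples: shape (n_sims, n_draws), posterior draws for each simulation.
    For a well-calibrated posterior, the ranks are uniform on 0..n_draws.
    """
    true_params = np.asarray(true_params, dtype=float)
    posterior_samples = np.asarray(posterior_samples, dtype=float)
    # Count, per simulation, how many draws fall below the true value.
    return np.sum(posterior_samples < true_params[:, None], axis=1)

# Toy inputs (made up, not SimBIG outputs): two simulations, four draws each.
ranks = sbc_ranks(
    [0.5, 0.0],
    [[0.1, 0.4, 0.6, 0.9],
     [0.1, 0.2, 0.3, 0.4]],
)
```

In practice one would histogram the ranks over many test simulations and look for departures from uniformity, which indicate over- or under-confident posteriors.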

Promoter and Gene-Body RNA-Polymerase II co-exist in partial demixed condensates

Arya Changiarath, Jasper J. Michels, S. Hanson

In cells, transcription is tightly regulated on multiple layers. The condensation of the transcription machinery into distinct phases is hypothesized to spatio-temporally fine-tune RNA polymerase II behaviour during two key stages, transcription initiation and the elongation of the nascent RNA transcripts. However, it has remained unclear whether these phases would mix when present at the same time or remain distinct chemical environments, either as multi-phase condensates or as entirely separate condensates. Here we combine particle-based multi-scale simulations and experiments in the model organism C. elegans to characterise the biophysical properties of RNA polymerase II condensates. Both simulations and the in vivo work describe a lower critical solution temperature (LCST) behaviour of RNA polymerase II, with condensates dissolving at lower temperatures whereas higher temperatures promote condensate stability, which highlights that these condensates are physicochemically distinct from heterochromatin condensates. The LCST behavior of the disordered C-terminal domain (CTD) of RNA polymerase II correlates with gradual shifts in the transcription program but is largely uncoupled from the classical stress response. Expanding the simulations, we model how the degree of phosphorylation of the CTD, which is characteristic of each step of transcription, controls the existence and morphology of multi-phasic condensates. We show that the two phases putatively underpinning the initiation of transcription and transcription elongation constitute distinct chemical environments and are in agreement with RNA polymerase II condensates observed in C. elegans embryos by super-resolution microscopy. Our analysis shows how, depending on its post-translational modifications and its interaction partners, a single protein can form multiple partially engulfed condensates, potentially promoting the selective recruitment of additional factors to these two phases.

Supercharged coiled-coil protein with N-terminal decahistidine tag boosts siRNA complexation and delivery efficiency of a lipoproteoplex

Jonathan W. Sun, Joseph S. Thomas, D. Renfrew, et al.

Short interfering RNA (siRNA) therapeutics have soared in popularity due to their highly selective and potent targeting of faulty genes, providing a non-palliative approach to address diseases. Despite their potential, effective transfection of siRNA into cells requires the assistance of an accompanying vector. Vectors constructed from non-viral materials, while offering safer and non-cytotoxic profiles, often grapple with lackluster loading and delivery efficiencies, necessitating substantial milligram quantities of expensive siRNA to confer the desired downstream effects. We detail the recombinant synthesis of a diverse series of coiled-coil supercharged protein (CSP) biomaterials systematically designed to investigate the impact of two arginine point mutations (Q39R and N61R) and decahistidine tags on liposomal siRNA delivery. The most efficacious variant, N8, exhibits a twofold increase in its affinity to siRNA and achieves a twofold enhancement in transfection activity with minimal cytotoxicity in vitro. Subsequent analysis unveils the destabilizing effect of the Q39R and N61R supercharging mutations and the incorporation of C-terminal decahistidine tags on α-helical secondary structure. Cross-correlational regression analyses reveal that the amount of helical character in these mutants is key in N8's enhanced siRNA complexation and downstream delivery efficiency.

Quaia, the Gaia-unWISE Quasar Catalog: An All-sky Spectroscopic Quasar Sample

Kate Storey-Fisher, D. Hogg, Hans-Walter Rix, Anna-Christina Eilers, Giulio Fabbian, Michael R. Blanton, David Alonso

We present a new, all-sky quasar catalog, Quaia, that samples the largest comoving volume of any existing spectroscopic quasar sample. The catalog draws on the 6,649,162 quasar candidates identified by the Gaia mission that have redshift estimates from the space observatory's low-resolution blue photometer/red photometer spectra. This initial sample is highly homogeneous and complete, but has low purity, and 18% of even the bright (G < 20.0) confirmed quasars have discrepant redshift estimates (∣Δz/(1 + z)∣ > 0.2) compared to those from the Sloan Digital Sky Survey (SDSS). In this work, we combine the Gaia candidates with unWISE infrared data (based on the Wide-field Infrared Survey Explorer survey) to construct a catalog useful for cosmological and astrophysical quasar studies. We apply cuts based on proper motions and colors, reducing the number of contaminants by approximately four times. We improve the redshifts by training a k-Nearest Neighbor model on SDSS redshifts, and achieve estimates on the G < 20.0 sample with only 6% (10%) catastrophic errors with ∣Δz/(1 + z)∣ > 0.2 (0.1), a reduction of approximately three times (approximately two times) compared to the Gaia redshifts. The final catalog has 1,295,502 quasars with G < 20.5, and 755,850 candidates in an even cleaner G < 20.0 sample, with accompanying rigorous selection function models. We compare Quaia to existing quasar catalogs, showing that its large effective volume makes it a highly competitive sample for cosmological large-scale structure analyses. The catalog is publicly available at 10.5281/zenodo.10403370.
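The catastrophic-error criterion quoted in this abstract, ∣Δz/(1 + z)∣ above a threshold of 0.2 or 0.1, can be written as a short function. This is a minimal sketch: the function name `catastrophic_fraction` and the toy redshifts are assumptions, as is the choice to use the reference (for example SDSS) redshift in the denominator.

```python
import numpy as np

def catastrophic_fraction(z_est, z_ref, threshold=0.2):
    """Fraction of objects with |z_est - z_ref| / (1 + z_ref) above threshold.

    Assumption: the reference redshift z_ref is used in the denominator.
    """
    z_est = np.asarray(z_est, dtype=float)
    z_ref = np.asarray(z_ref, dtype=float)
    frac_err = np.abs(z_est - z_ref) / (1.0 + z_ref)
    return float(np.mean(frac_err > threshold))

# Toy redshifts (made up for illustration, not catalog data).
z_ref = np.array([1.0, 2.0, 0.5, 3.0])
z_est = np.array([1.3, 2.9, 0.52, 3.05])

frac_02 = catastrophic_fraction(z_est, z_ref, threshold=0.2)
frac_01 = catastrophic_fraction(z_est, z_ref, threshold=0.1)
```

With these toy values, one of four objects exceeds the 0.2 threshold and two of four exceed the 0.1 threshold, which mirrors how the abstract reports a pair of rates for the two cuts.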

Estimating Shape Distances on Neural Representations with Limited Samples

Dean A. Pospisil, B. Larsen, S. Harvey, A. Williams

Measuring geometric similarity between high-dimensional network representations is a topic of longstanding interest to neuroscience and deep learning. Although many methods have been proposed, only a few works have rigorously analyzed their statistical efficiency or quantified estimator uncertainty in data-limited regimes. Here, we derive upper and lower bounds on the worst-case convergence of standard estimators of shape distance—a measure of representational dissimilarity proposed by Williams et al. (2021). These bounds reveal the challenging nature of the problem in high-dimensional feature spaces. To overcome these challenges, we introduce a novel method-of-moments estimator with a tunable bias-variance tradeoff parameterized by an upper bound on bias. We show that this estimator achieves superior performance to standard estimators in simulation and on neural data, particularly in high-dimensional settings. Our theoretical work and estimator thus respectively define and dramatically expand the scope of neural data for which geometric similarity can be accurately measured.

ERK inhibits Cic repressor function via multisite phosphorylation

Sayantanee Paul, Khandan Ilkhani, S. Shvartsman, et al.

The receptor tyrosine kinase (RTK)/Extracellular Signal-Regulated Kinase (ERK) signaling pathway controls cell proliferation, differentiation, and survival. How ERK activation is relayed to its phosphorylation targets is not well understood. The transcriptional repressor Capicua (Cic) has emerged as a key target for ERK-mediated downregulation in Drosophila and mammals, and mutations in human CIC result in cancer and neurological diseases. Phosphorylation by ERK is critical for Cic downregulation, but the identities of phosphosites in Drosophila Cic are unknown. Here, we identify sites of phosphorylation in Cic that are directly targeted by ERK and validate their developmental functions in vivo using mutant Cic variants. Cic phosphosites are distributed throughout the length of the protein, and a group of centrally located sites appears to have a primary role in Cic downregulation. Cic mutated in 20 high-confidence sites behaves as a “super-repressor” in vivo that is largely insensitive to ERK-mediated downregulation, despite fully retaining the ability to bind to ERK. No single site is sufficient to turn off Cic activity; instead, we find that ERK must phosphorylate multiple sites in Cic simultaneously to achieve full downregulation. This multisite phosphorylation likely targets phosphodegrons that are recognized by ubiquitin ligases such as Ago/FBXW7 and contributes to Cic degradation. This study advances our understanding of the molecular mechanisms of signal interpretation downstream of the RTK/ERK signaling network.

Combining machine learning with structure-based protein design to predict and engineer post-translational modifications of proteins

Moritz Ertelt, V. Mulligan, et al.

Post-translational modifications (PTMs) of proteins play a vital role in their function and stability. These modifications influence protein folding, signaling, protein-protein interactions, enzyme activity, binding affinity, aggregation, degradation, and much more. To date, over 400 types of PTMs have been described, representing chemical diversity well beyond the genetically encoded amino acids. Such modifications pose a challenge to the successful design of proteins, but also represent a major opportunity to diversify the protein engineering toolbox. To this end, we first trained artificial neural networks (ANNs) to predict eighteen of the most abundant PTMs, including protein glycosylation, phosphorylation, methylation, and deamidation. In a second step, these models were implemented inside the computational protein modeling suite Rosetta, which allows flexible combination with existing protocols to model the modified sites and understand their impact on protein stability as well as function. Lastly, we developed a new design protocol that either maximizes or minimizes the predicted probability of a particular site being modified. We find that this combination of ANN prediction and structure-based design can enable the modification of existing, as well as the introduction of novel, PTMs. The potential applications of our work include, but are not limited to, glycan masking of epitopes, strengthening protein-protein interactions through phosphorylation, as well as protecting proteins from deamidation liabilities. These applications are especially important for the design of new protein therapeutics where PTMs can drastically change the therapeutic properties of a protein. Our work adds novel tools to Rosetta’s protein engineering toolbox that allow for the rational design of PTMs.
