381 Publications

Application of ensemble pharmacophore-based virtual screening to the discovery of novel antimitotic tubulin inhibitors

Laura Gallego-Yerga, Rodrigo Ochoa, Isaías Lans, Carlos Peña-Varas, Melissa Alegría-Arcos, P. Cossio, David Ramírez, Rafael Peláez

Tubulin is a well-validated target for herbicides, fungicides, anti-parasitic, and anti-tumor drugs. Many of the non-cancer tubulin drugs bind to its colchicine site but no colchicine-site anticancer drug is available. The colchicine site is composed of three interconnected sub-pockets that fit their ligands and modify others’ preference, making the design of molecular hybrids (that bind to more than one sub-pocket) a difficult task. Taking advantage of the more than eighty published X-ray structures of tubulin in complex with ligands bound to the colchicine site, we generated an ensemble of pharmacophore representations that flexibly sample the interactional space between the ligands and target. We searched the ZINC database for scaffolds able to fit several of the sub-pockets, such as tetrazoles, sulfonamides and diarylmethanes, and selected roughly 8000 compounds with favorable predicted properties. A Flexi-pharma virtual screening, based on the ensemble pharmacophore, was performed by two different methodologies. Combining the scaffolds that best fit the ensemble pharmacophore representation, we designed a new family of ligands, resulting in a novel tubulin modulator. We synthesized tetrazole 5 and tested it as a tubulin inhibitor in vitro. In good agreement with the design principles, it demonstrated micromolar activity against in vitro tubulin polymerization and nanomolar anti-proliferative effect against human epithelioid carcinoma HeLa cells through microtubule disruption, as shown by immunofluorescence confocal microscopy. The integrative methodology succeeds in the design of new scaffolds for flexible proteins with structural coupling between pockets, thus expanding the way in which computational methods can be used as significant tools in the drug design process.

COP-E-CAT: Cleaning and Organization Pipeline for EHR Computational and Analytic Tasks

Aishwarya Mandyam, Elizabeth C. Yoo, J. Soules, Krzysztof Laudanski, Barbara E. Engelhardt

In order to ensure that analyses of complex electronic healthcare record (EHR) data are reproducible and generalizable, it is crucial for researchers to use comparable preprocessing, filtering, and imputation strategies. We introduce COP-E-CAT: Cleaning and Organization Pipeline for EHR Computational and Analytic Tasks, an open-source processing and analysis software for MIMIC-IV, a ubiquitous benchmark EHR dataset. COP-E-CAT allows users to select filtering characteristics and preprocess covariates to generate data structures for use in downstream analysis tasks. This user-friendly approach shows promise in facilitating reproducibility and comparability among studies that leverage the MIMIC-IV data, and enhances EHR accessibility to a wider spectrum of researchers than current data processing methods. We demonstrate the versatility of our workflow by describing three use cases: ensemble prediction, reinforcement learning, and dimension reduction. The software is available at: https://github.com/eyeshoe/cop-e-cat.

AI-assisted superresolution cosmological simulations – II. Halo substructures, velocities, and higher order statistics

Yueying Ni, Y. Li, Patrick Lachance, Rupert A. C. Croft, Tiziana Di Matteo, Simeon Bird, Yu Feng

In this work, we expand and test the capabilities of our recently developed super-resolution (SR) model to generate high-resolution (HR) realizations of the full phase-space matter distribution, including both displacement and velocity, from computationally cheap low-resolution (LR) cosmological N-body simulations. The SR model enhances the simulation resolution by generating 512 times more tracer particles, extending into the deeply non-linear regime where complex structure formation processes take place. We validate the SR model by deploying it in 10 test simulations of box size 100 Mpc/h, and examine the matter power spectra, bispectra and 2D power spectra in redshift space. We find the generated SR field matches the true HR result at percent level down to scales of k ~ 10 h/Mpc. We also identify and inspect dark matter halos and their substructures. Our SR model generates visually authentic small-scale structures that cannot be resolved by the LR input and are in good statistical agreement with the real HR results. The SR model performs satisfactorily on the halo occupation distribution, halo correlations in both real and redshift space, and the pairwise velocity distribution, matching the HR results with comparable scatter, thus demonstrating its potential in making mock halo catalogs. The SR technique can be a powerful and promising tool for modelling small-scale galaxy formation physics in large cosmological volumes.

Tree-aggregated predictive modeling of microbiome data

Jacob Bien, Xiaohan Yan, Léo Simpson, C. Müller

Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while simultaneously integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.

July 15, 2021

Inverse-Dirichlet Weighting Enables Reliable Training of Physics Informed Neural Networks

Suryanarayana Maddu, Dominik Sturm, Ivo F. Sbalzarini, C. Müller

We characterize and remedy a failure mode that may arise from multi-scale dynamics with scale imbalances during training of deep neural networks, such as Physics Informed Neural Networks (PINNs). PINNs are popular machine-learning templates that allow for seamless integration of physical equation models with data. Their training amounts to solving an optimization problem over a weighted sum of data-fidelity and equation-fidelity objectives. Conflicts between objectives can arise from scale imbalances, heteroscedasticity in the data, stiffness of the physical equation, or from catastrophic interference during sequential training. We explain the training pathology arising from this and propose a simple yet effective inverse-Dirichlet weighting strategy to alleviate the issue. We compare with Sobolev training of neural networks, providing the baseline of analytically ϵ-optimal training. We demonstrate the effectiveness of inverse-Dirichlet weighting in various applications, including a multi-scale model of active turbulence, where we show orders of magnitude improvement in accuracy and convergence over conventional PINN training. For inverse modeling using sequential training, we find that inverse-Dirichlet weighting protects a PINN against catastrophic forgetting.
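
The abstract does not state the weighting rule itself. As an assumption made purely for illustration, the sketch below weights each objective by the ratio of the largest gradient standard deviation across objectives to that objective's own gradient standard deviation, then smooths the result with an exponential moving average. The function name `inverse_dirichlet_weights`, the `alpha` smoothing parameter, and the toy gradient vectors are all hypothetical.

```python
import numpy as np

def inverse_dirichlet_weights(grads, prev_weights=None, alpha=0.5):
    """Assumed weighting rule: weight_k = max_t std(grad_t) / std(grad_k).

    grads        : list of 1-D arrays, one flattened gradient per objective
    prev_weights : weights from the previous training step, or None
    alpha        : moving-average factor (hypothetical default)
    """
    stds = np.array([np.std(g) for g in grads])
    proposed = stds.max() / stds  # objectives with small gradient spread get large weights
    if prev_weights is None:
        return proposed
    # Exponential moving average to damp step-to-step fluctuations.
    return alpha * np.asarray(prev_weights, dtype=float) + (1.0 - alpha) * proposed

# Toy example: the second objective has gradients 100 times smaller.
grad_data = np.array([1.0, -1.0])
grad_equation = np.array([0.01, -0.01])
weights = inverse_dirichlet_weights([grad_data, grad_equation])
```

In this toy case the second objective is scaled up so that both gradient distributions have comparable spread, which is the kind of scale balancing the abstract describes.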

July 2, 2021

Regression models for compositional data: General log-contrast formulations, proximal optimization, and microbiome data applications

Patrick L. Combettes, C. Müller

Compositional data sets are ubiquitous in science, including geology, ecology, and microbiology. In microbiome research, compositional data primarily arise from high-throughput sequence-based profiling experiments. These data comprise microbial compositions in their natural habitat and are often paired with covariate measurements that characterize physicochemical habitat properties or the physiology of the host. Inferring parsimonious statistical associations between microbial compositions and habitat- or host-specific covariate data is an important step in exploratory data analysis. A standard statistical model linking compositional covariates to continuous outcomes is the linear log-contrast model. This model describes the response as a linear combination of log-ratios of the original compositions and has been extended to the high-dimensional setting via regularization. In this contribution, we propose a general convex optimization model for linear log-contrast regression which includes many previous proposals as special cases. We introduce a proximal algorithm that solves the resulting constrained optimization problem exactly with rigorous convergence guarantees. We illustrate the versatility of our approach by investigating the performance of several model instances on soil and gut microbiome data analysis tasks.
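
As a minimal sketch of the linear log-contrast model mentioned above, the response can be written as y = log(X) β + noise with the zero-sum constraint Σ β_j = 0, which makes the fit invariant to rescaling each composition. The code below solves only the unregularized least-squares case through its KKT system; it is not the proximal algorithm proposed in the paper, and the function name `fit_log_contrast` is hypothetical.

```python
import numpy as np

def fit_log_contrast(X, y):
    """Least-squares fit of y ~ log(X) @ beta subject to sum(beta) == 0.

    X : (n, p) array of strictly positive compositional parts
    y : (n,) array of continuous responses
    """
    Z = np.log(X)
    p = Z.shape[1]
    ones = np.ones((p, 1))
    # KKT system of: minimize ||y - Z beta||^2 subject to 1^T beta = 0.
    kkt = np.block([[2.0 * Z.T @ Z, ones],
                    [ones.T, np.zeros((1, 1))]])
    rhs = np.concatenate([2.0 * Z.T @ y, [0.0]])
    solution = np.linalg.solve(kkt, rhs)
    return solution[:p]  # drop the Lagrange multiplier
```

Because the coefficients sum to zero, multiplying any row of `X` by a positive constant leaves the fitted values unchanged, so raw counts and normalized proportions give the same coefficients.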

A Bayesian approach for extracting free energy profiles from cryo-electron microscopy experiments using a path collective variable

Julian Giraldo-Barreto, Sebastian Ortiz, E. Thiede, Karen Palacio-Rodriguez, B. Carpenter, A. Barnett, P. Cossio

Cryo-electron microscopy (cryo-EM) extracts single-particle density projections of individual biomolecules. Although cryo-EM is widely used for 3D reconstruction, due to its single-particle nature, it has the potential to provide information about the biomolecule's conformational variability and underlying free energy landscape. However, treating cryo-EM as a single-molecule technique is challenging because of the low signal-to-noise ratio (SNR) in the individual particles. In this work, we developed the cryo-BIFE method, cryo-EM Bayesian Inference of Free Energy profiles, that uses a path collective variable to extract free energy profiles and their uncertainties from cryo-EM images. We tested the framework over several synthetic systems, where we controlled the imaging parameters and conditions. We found that for realistic cryo-EM environments and relevant biomolecular systems, it is possible to recover the underlying free energy, with the pose accuracy and SNR as crucial determinants. Then, we used the method to study the conformational transitions of a calcium-activated channel with real cryo-EM particles. Interestingly, we recover the most probable conformation (used to generate a high resolution reconstruction of the calcium-bound state), and we find two additional meta-stable states, one of which corresponds to the calcium-unbound conformation. As expected for turnover transitions within the same sample, the activation barriers are of the order of a couple of $k_BT$. Extracting free energy profiles from cryo-EM will enable a more complete characterization of the thermodynamic ensemble of biomolecules.
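
For orientation, a free energy profile along a collective variable $s$ is conventionally tied to the equilibrium probability density $\rho(s)$ by the Boltzmann relation below. This is the standard textbook relation, not the paper's full Bayesian posterior, and the symbols $G$, $\rho$, and $s$ are notation assumed here.

$$
G(s) = -k_B T \ln \rho(s) + \text{const}, \qquad \frac{\rho(s_1)}{\rho(s_2)} = \exp\!\left(-\frac{G(s_1) - G(s_2)}{k_B T}\right)
$$

Under this relation, a free energy difference of $2\,k_BT$ corresponds to a population ratio of $e^{-2} \approx 0.14$, which is why barriers of a few $k_BT$ are compatible with several states coexisting in one sample.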

A causal view on compositional data

Elisabeth Ailer, Niki Kilbertus, C. Müller

Many scientific datasets are compositional in nature. Important examples include species abundances in ecology, rock compositions in geology, topic compositions in large-scale text corpora, and sequencing count data in molecular biology. Here, we provide a causal view on compositional data in an instrumental variable setting where the composition acts as the cause. Throughout, we pay particular attention to the interpretation of compositional causes from the viewpoint of interventions and crisply articulate potential pitfalls for practitioners. Focusing on modern high-dimensional microbiome sequencing data as a timely illustrative use case, our analysis first reveals that popular one-dimensional information-theoretic summary statistics, such as diversity and richness, may be insufficient for drawing causal conclusions from ecological data. Instead, we advocate for multivariate alternatives using statistical data transformations and regression techniques that take the special structure of the compositional sample space into account. In a comparative analysis on synthetic and semi-synthetic data we show the advantages and limitations of our proposal. We posit that our framework may provide a useful starting point for cause-effect estimation in the context of compositional data.

June 21, 2021

Rank-normalization, folding, and localization: An improved \(\widehat{R}\) for assessing convergence of MCMC

Aki Vehtari, Andrew Gelman, Daniel Simpson, B. Carpenter, Paul-Christian Bürkner

Markov chain Monte Carlo is a key computational tool in Bayesian statistics, but it can be challenging to monitor the convergence of an iterative stochastic algorithm. In this paper we show that the convergence diagnostic \(\widehat{R}\) of Gelman and Rubin (1992) has serious flaws. Traditional \(\widehat{R}\) will fail to correctly diagnose convergence failures when the chain has a heavy tail or when the variance varies across the chains. In this paper we propose an alternative rank-based diagnostic that fixes these problems. We also introduce a collection of quantile-based local efficiency measures, along with a practical approach for computing Monte Carlo error estimates for quantiles. We suggest that common trace plots should be replaced with rank plots from multiple chains. Finally, we give recommendations for how these methods should be used in practice.
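
The rank-based diagnostic can be sketched as follows, under a few assumptions: chains are split in half, draws are replaced by normal scores of their pooled ranks using the offset (r - 3/8) / (S + 1/4), and the reported value is the larger of the statistic computed on the draws and on the draws folded around their median. The function names are hypothetical, and this is an illustrative sketch rather than a reference implementation.

```python
import numpy as np
from statistics import NormalDist

_inv_cdf = np.vectorize(NormalDist().inv_cdf)

def split_chains(draws):
    """Split each chain (rows = chains, columns = iterations) into two halves."""
    n_iter = draws.shape[1]
    half = n_iter // 2  # an odd middle draw is dropped
    return np.concatenate([draws[:, :half], draws[:, n_iter - half:]], axis=0)

def rank_normalize(draws):
    """Replace draws by normal scores of their pooled ranks (ties get average ranks)."""
    flat = draws.ravel()
    _, inverse, counts = np.unique(flat, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    z = _inv_cdf((ranks - 3.0 / 8.0) / (flat.size + 0.25))
    return z.reshape(draws.shape)

def basic_rhat(draws):
    """Classical potential scale reduction factor from within/between variances."""
    n = draws.shape[1]
    within = draws.var(axis=1, ddof=1).mean()
    between = n * draws.mean(axis=1).var(ddof=1)
    var_plus = (n - 1) / n * within + between / n
    return float(np.sqrt(var_plus / within))

def rank_normalized_rhat(draws):
    """Maximum of rank-normalized split R-hat on the draws and on the folded draws."""
    split = split_chains(np.asarray(draws, dtype=float))
    bulk = basic_rhat(rank_normalize(split))
    folded = np.abs(split - np.median(split))  # folding exposes differences in scale
    tail = basic_rhat(rank_normalize(folded))
    return max(bulk, tail)
```

The folded statistic is what lets the diagnostic react when chains share a location but differ in variance, a case where the classical statistic stays close to 1.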
