152 Publications

Solving Fredholm second-kind integral equations with singular right-hand sides on non-smooth boundaries

Johan Helsing, S. Jiang

A numerical scheme is presented for the solution of Fredholm second-kind boundary integral equations with right-hand sides that are singular at a finite set of boundary points. The boundaries themselves may be non-smooth. The scheme, which builds on recursively compressed inverse preconditioning (RCIP), is universal in that it is independent of the nature of the singularities. Strong right-hand-side singularities, such as $1/|r|^\alpha$ with $\alpha$ close to $1$, can be treated to full machine precision. Adaptive refinement is used only in the recursive construction of the preconditioner, leading to an optimal number of discretization points and superior stability in the solve phase. The performance of the scheme is illustrated via several numerical examples, including an application to an integral equation derived from the linearized BGKW kinetic equation for steady Couette flow.
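
For orientation, a generic problem of this class can be written schematically (the specific kernels and parametrizations are detailed in the paper) as a second-kind equation on a boundary $\Gamma$,

$$\rho(r) + \int_{\Gamma} K(r,r')\,\rho(r')\,\mathrm{d}\sigma(r') = f(r), \qquad r \in \Gamma,$$

where the right-hand side $f$ may blow up like $1/|r-r_0|^{\alpha}$ as $r$ approaches one of a finite set of singular boundary points $r_0$.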

August 23, 2021

latentcor: An R Package for estimating latent correlations from mixed data types

Mingze Huang, C. Müller, Irina Gaynanova

We present `latentcor`, an R package for correlation estimation from data with mixed variable types. Mixed variable types, including continuous, binary, ordinal, zero-inflated, and truncated data, are routinely collected in many areas of science. Accurate estimation of correlations among such variables is often the first critical step in statistical analysis workflows. Pearson correlation, the default choice, is not well suited for mixed data types, as the underlying normality assumption is violated. The concept of semi-parametric latent Gaussian copula models, on the other hand, provides a unifying way to estimate correlations between mixed data types. The R package `latentcor` comprises a comprehensive list of these models, enabling the estimation of correlations between any of continuous/binary/ternary/zero-inflated (truncated) variable types. The underlying implementation takes advantage of a fast multi-linear interpolation scheme with an efficient choice of interpolation grid points, giving the package a small memory footprint without compromising estimation accuracy. This makes latent correlation estimation readily available for modern high-throughput data analysis.
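
One concrete instance of the bridge-function machinery, for a pair of continuous variables under the latent Gaussian copula model (a standard result, not specific to this package): the latent correlation $\sigma$ is recovered from the rank-based Kendall's $\hat{\tau}$ in closed form,

$$\hat{\sigma} = \sin\!\left(\frac{\pi}{2}\,\hat{\tau}\right),$$

while mixed-type pairs (e.g., binary/continuous) use analogous bridge functions that must be inverted numerically, which is where the package's interpolation scheme comes in.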

August 20, 2021

Phase Retrieval with Holography and Untrained Priors: Tackling the Challenges of Low-Photon Nanoscale Imaging

Phase retrieval is the inverse problem of recovering a signal from magnitude-only Fourier measurements, and it underlies numerous imaging modalities, such as Coherent Diffraction Imaging (CDI). A variant of this setup, known as holography, includes a reference object placed adjacent to the specimen of interest before measurements are collected. The resulting inverse problem, known as holographic phase retrieval, is well known to have improved conditioning relative to the original. This innovation, holographic CDI, becomes crucial at the nanoscale, where imaging specimens such as viruses, proteins, and crystals requires low-photon measurements. These measurements are highly corrupted by Poisson shot noise and often lack low-frequency content as well. In this work, we introduce a dataset-free deep learning framework for holographic phase retrieval adapted to these challenges. The key ingredients of our approach are the explicit and flexible incorporation of the physical forward model into an automatic differentiation procedure, the Poisson log-likelihood objective function, and an optional untrained deep image prior. We perform extensive evaluation under realistic conditions. Compared to competing classical methods, our method recovers signal from higher noise levels and is more resilient to suboptimal reference design, as well as to large missing regions of low frequencies in the observations. To the best of our knowledge, this is the first work to consider a dataset-free machine learning approach for holographic phase retrieval.
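
As a minimal sketch of the data-fidelity term only (assuming the known reference is supported on a region disjoint from the specimen, so that adding the two arrays encodes the side-by-side holographic arrangement; names are illustrative, and this is not the authors' implementation):

```python
import numpy as np

def poisson_nll(x, ref, counts):
    """Poisson negative log-likelihood for holographic phase retrieval.

    x      : current estimate of the specimen (2D array)
    ref    : known reference object, supported away from the specimen
    counts : observed photon counts (magnitude-squared Fourier data)
    """
    # Forward model: far-field intensity of specimen-plus-reference.
    lam = np.abs(np.fft.fft2(x + ref)) ** 2
    lam = np.maximum(lam, 1e-12)  # guard against log(0) in dark regions
    # Up to a term constant in x, -log p(counts | lam) = sum(lam - counts*log(lam)).
    return float(np.sum(lam - counts * np.log(lam)))
```

In the paper's framework, an objective of this type is minimized via automatic differentiation, optionally with the image parametrized by an untrained deep prior.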


Application of ensemble pharmacophore-based virtual screening to the discovery of novel antimitotic tubulin inhibitors

Laura Gallego-Yerga, Rodrigo Ochoa, Isaías Lans, Carlos Peña-Varas, Melissa Alegría-Arcos, P. Cossio, David Ramírez, Rafael Peláez

Tubulin is a well-validated target for herbicides, fungicides, anti-parasitic, and anti-tumor drugs. Many of the non-cancer tubulin drugs bind to its colchicine site, but no colchicine-site anticancer drug is available. The colchicine site is composed of three interconnected sub-pockets that fit their ligands and modify the preferences of the others, making the design of molecular hybrids (that bind to more than one sub-pocket) a difficult task. Taking advantage of the more than eighty published X-ray structures of tubulin in complex with ligands bound to the colchicine site, we generated an ensemble of pharmacophore representations that flexibly sample the interactional space between the ligands and the target. We searched the ZINC database for scaffolds able to fit several of the sub-pockets, such as tetrazoles, sulfonamides, and diarylmethanes, and selected roughly 8000 compounds with favorable predicted properties. A Flexi-pharma virtual screening, based on the ensemble pharmacophore, was performed using two different methodologies. Combining the scaffolds that best fit the ensemble pharmacophore representation, we designed a new family of ligands, resulting in a novel tubulin modulator. We synthesized tetrazole 5 and tested it as a tubulin inhibitor in vitro. In good agreement with the design principles, it showed micromolar activity against tubulin polymerization in vitro and a nanomolar anti-proliferative effect against human epithelioid carcinoma HeLa cells through microtubule disruption, as shown by immunofluorescence confocal microscopy. The integrative methodology succeeds in the design of new scaffolds for flexible proteins with structural coupling between pockets, thus expanding the ways in which computational methods can serve as significant tools in the drug design process.


COP-E-CAT: Cleaning and Organization Pipeline for EHR Computational and Analytic Tasks

Aishwarya Mandyam, Elizabeth C. Yoo, J. Soules, Krzysztof Laudanski, Barbara E. Engelhardt

To ensure that analyses of complex electronic health record (EHR) data are reproducible and generalizable, it is crucial for researchers to use comparable preprocessing, filtering, and imputation strategies. We introduce COP-E-CAT: Cleaning and Organization Pipeline for EHR Computational and Analytic Tasks, open-source processing and analysis software for MIMIC-IV, a ubiquitous benchmark EHR dataset. COP-E-CAT allows users to select filtering characteristics and preprocess covariates to generate data structures for use in downstream analysis tasks. This user-friendly approach shows promise in facilitating reproducibility and comparability among studies that leverage the MIMIC-IV data, and it makes EHRs accessible to a wider spectrum of researchers than current data processing methods do. We demonstrate the versatility of our workflow by describing three use cases: ensemble prediction, reinforcement learning, and dimension reduction. The software is available at: https://github.com/eyeshoe/cop-e-cat.
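
To illustrate the select-filter-impute pattern such a pipeline automates, here is a hypothetical pandas sketch; the function, column names, and thresholds are invented for illustration and are not COP-E-CAT's actual API (see the repository for that):

```python
import pandas as pd

def build_cohort(admissions: pd.DataFrame, vitals: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical select -> filter -> pivot -> impute preprocessing sketch."""
    cohort = admissions.loc[admissions["age"] >= 18, ["stay_id"]]  # filtering step
    df = vitals.merge(cohort, on="stay_id")                        # restrict to cohort
    wide = df.pivot_table(index="stay_id", columns="item",
                          values="value", aggfunc="mean")          # one row per stay
    return wide.fillna(wide.median())                              # simple imputation
```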


Discrete Lehmann representation of imaginary time Green’s functions

We present an efficient basis for imaginary time Green's functions based on a low-rank decomposition of the spectral Lehmann representation. The basis functions are simply a set of well-chosen exponentials, so the corresponding expansion may be thought of as a discrete form of the Lehmann representation using an effective spectral density which is a sum of $\delta$ functions. The basis is determined only by an upper bound on the product $\beta \omega_{\max}$, with $\beta$ the inverse temperature and $\omega_{\max}$ an energy cutoff, and by a user-defined error tolerance $\epsilon$. The number $r$ of basis functions scales as $\mathcal{O}\left(\log(\beta \omega_{\max}) \log (1/\epsilon)\right)$. The discrete Lehmann representation of a particular imaginary time Green's function can be recovered by interpolation at a set of $r$ imaginary time nodes. Both the basis functions and the interpolation nodes can be obtained rapidly using standard numerical linear algebra routines. Due to the simple form of the basis, the discrete Lehmann representation of a Green's function can be explicitly transformed to the Matsubara frequency domain, or obtained directly by interpolation on a Matsubara frequency grid. We benchmark the efficiency of the representation on simple cases, and with a high-precision solution of the Sachdev-Ye-Kitaev equation at low temperature. We compare our approach with the related intermediate representation method, and introduce an improved algorithm to build the intermediate representation basis and a corresponding sampling grid.
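
In formulas (fermionic conventions; signs and normalizations vary across the literature), the Lehmann representation and its discrete counterpart read

$$G(\tau) = \int_{-\infty}^{\infty} K(\tau,\omega)\,\rho(\omega)\,\mathrm{d}\omega, \qquad K(\tau,\omega) = -\frac{e^{-\tau\omega}}{1+e^{-\beta\omega}}, \qquad G(\tau) \approx \sum_{k=1}^{r} K(\tau,\omega_k)\,\widehat{g}_k,$$

where the $r$ frequencies $\omega_k$ depend only on $\beta\omega_{\max}$ and $\epsilon$, and the coefficients $\widehat{g}_k$ of a given Green's function are fixed by interpolation at $r$ imaginary time nodes.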

July 27, 2021

Tree-aggregated predictive modeling of microbiome data

Jacob Bien, Xiaohan Yan, Léo Simpson, C. Müller

Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess of zero and low-count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.
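
Schematically (a sketch consistent with the abstract; notation illustrative), trac keeps the leaf-level log-contrast model but reparametrizes its coefficients through the taxonomic tree: with $A \in \{0,1\}^{p \times |\mathcal{T}|}$ the binary matrix whose column $u$ indicates the leaves descending from tree node $u$, setting $\beta = A\gamma$ gives

$$y \approx \log(X)\,\beta = \big(\log(X)\,A\big)\,\gamma,$$

so a single nonzero node-level coefficient $\gamma_u$ acts on its entire subtree at once (the corresponding column of $\log(X)A$ is, up to scaling, the log of the geometric mean of the leaf abundances under $u$), and a sparsity penalty on $\gamma$ selects the aggregation level data-adaptively.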

July 15, 2021

Inverse-Dirichlet Weighting Enables Reliable Training of Physics Informed Neural Networks

Suryanarayana Maddu, Dominik Sturm, Ivo F. Sbalzarini, C. Müller

We characterize and remedy a failure mode that may arise during training of deep neural networks, such as Physics-Informed Neural Networks (PINNs), from multi-scale dynamics with scale imbalances. PINNs are popular machine-learning templates that allow for seamless integration of physical equation models with data. Their training amounts to solving an optimization problem over a weighted sum of data-fidelity and equation-fidelity objectives. Conflicts between objectives can arise from scale imbalances, heteroscedasticity in the data, stiffness of the physical equation, or catastrophic interference during sequential training. We explain the training pathology arising from such conflicts and propose a simple yet effective inverse-Dirichlet weighting strategy to alleviate it. We compare with Sobolev training of neural networks, which provides a baseline of analytically $\epsilon$-optimal training. We demonstrate the effectiveness of inverse-Dirichlet weighting in various applications, including a multi-scale model of active turbulence, where we show orders-of-magnitude improvements in accuracy and convergence over conventional PINN training. For inverse modeling using sequential training, we find that inverse-Dirichlet weighting protects a PINN against catastrophic forgetting.
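
A minimal PyTorch sketch of gradient-variance-based loss balancing in the spirit of inverse-Dirichlet weighting; the exact normalization used in the paper may differ, and `losses`/`params` are placeholders for the objective terms and network parameters:

```python
import torch

def inverse_dirichlet_weights(losses, params, eps=1e-12):
    # Weight each objective by the inverse variance of its parameter gradient,
    # so that no single term dominates the descent direction.
    variances = []
    for loss in losses:
        grads = torch.autograd.grad(loss, params, retain_graph=True)
        flat = torch.cat([g.reshape(-1) for g in grads])
        variances.append(flat.var().item())
    v_max = max(variances)
    return [v_max / (v + eps) for v in variances]

# Usage (schematic): total = sum(w * L for w, L in zip(weights, losses))
```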

July 2, 2021

Regression models for compositional data: General log-contrast formulations, proximal optimization, and microbiome data applications

Patrick L. Combettes, C. Müller

Compositional data sets are ubiquitous in science, including geology, ecology, and microbiology. In microbiome research, compositional data primarily arise from high-throughput sequence-based profiling experiments. These data comprise microbial compositions in their natural habitat and are often paired with covariate measurements that characterize physicochemical habitat properties or the physiology of the host. Inferring parsimonious statistical associations between microbial compositions and habitat- or host-specific covariate data is an important step in exploratory data analysis. A standard statistical model linking compositional covariates to continuous outcomes is the linear log-contrast model. This model describes the response as a linear combination of log-ratios of the original compositions and has been extended to the high-dimensional setting via regularization. In this contribution, we propose a general convex optimization model for linear log-contrast regression which includes many previous proposals as special cases. We introduce a proximal algorithm that solves the resulting constrained optimization problem exactly with rigorous convergence guarantees. We illustrate the versatility of our approach by investigating the performance of several model instances on soil and gut microbiome data analysis tasks.
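
Concretely, the linear log-contrast model and its general convex extension can be sketched (notation illustrative) as

$$\min_{\beta \in \mathbb{R}^p} \; f\!\big(y - \log(X)\,\beta\big) + g(\beta) \quad \text{subject to} \quad \mathbf{1}^{\top}\beta = 0,$$

where $\log(X)$ is applied entrywise to the compositions, the zero-sum constraint makes the model a genuine log-contrast (invariant to the compositional scale), $f$ is a convex data-fidelity term (e.g., least squares or a robust loss), and $g$ is a convex regularizer (e.g., $\ell_1$ for sparsity); particular choices of $f$ and $g$ recover earlier sparse log-contrast proposals.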
