162 Publications

Modeling transcriptional regulation of model species with deep learning

E. Cofer, A. Wong, O. Troyanskaya, et al.

To enable large-scale analyses of regulatory logic in model species, we developed DeepArk, a set of deep learning models of the cis-regulatory codes of four widely studied species: Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, and Mus musculus. DeepArk accurately predicts the presence of thousands of different context-specific regulatory features, including chromatin states, histone marks, and transcription factors. In vivo studies show that DeepArk can predict the regulatory impact of any genomic variant (including rare or not previously observed variants), and that it enables the regulatory annotation of understudied model species.

April 19, 2021

ChIP-BIT2: a software tool to detect weak binding events using a Bayesian integration approach

X. Chen, A. Neuwald, L. Hilakivi-Clarke, R. Clarke, J. Xuan

Background
ChIP-seq combines chromatin immunoprecipitation assays with sequencing and identifies genome-wide binding sites for DNA binding proteins. While many binding sites have strong ChIP-seq ‘peak’ observations and are well captured, there are still regions bound by proteins weakly, with a relatively low ChIP-seq signal enrichment. These weak binding sites, especially those at promoters and enhancers, are functionally important because they also regulate nearby gene expression. Yet, it remains a challenge to accurately identify weak binding sites in ChIP-seq data due to the ambiguity in differentiating these weak binding sites from the amplified background DNAs.

Results
ChIP-BIT2 (http://sourceforge.net/projects/chipbitc/) is a software package for ChIP-seq peak detection. ChIP-BIT2 employs a mixture model integrating protein and control ChIP-seq data and predicts strong or weak protein binding sites at promoters, enhancers, or other genomic locations. For binding sites at gene promoters, ChIP-BIT2 simultaneously predicts their target genes. ChIP-BIT2 has been validated on benchmark regions and tested using large-scale ENCODE ChIP-seq data, demonstrating its high accuracy and wide applicability.

Conclusion
ChIP-BIT2 is an efficient ChIP-seq peak caller. It provides a better lens to examine weak binding sites and can refine or extend the existing binding site collection, providing additional regulatory regions for decoding the mechanism of gene expression regulation.

April 15, 2021

An automated framework for efficiently designing deep convolutional neural networks in genomics

Convolutional neural networks (CNNs) have become a standard for analysis of biological sequences. Tuning of network architectures is essential for a CNN’s performance, yet it requires substantial knowledge of machine learning and commitment of time and effort. This process thus imposes a major barrier to broad and effective application of modern deep learning in genomics. Here we present Automated Modelling for Biological Evidence-based Research (AMBER), a fully automated framework to efficiently design and apply CNNs for genomic sequences. AMBER designs optimal models for user-specified biological questions through the state-of-the-art neural architecture search (NAS). We applied AMBER to the task of modelling genomic regulatory features and demonstrated that the predictions of the AMBER-designed model are significantly more accurate than the equivalent baseline non-NAS models and match or even exceed published expert-designed models. Interpretation of AMBER architecture search revealed its design principles of utilizing the full space of computational operations for accurately modelling genomic sequences. Furthermore, we illustrated the use of AMBER to accurately discover functional genomic variants in allele-specific binding and disease heritability enrichment. AMBER provides an efficient automated method for designing accurate deep learning models in genomics.

AMBIENT: Accelerated Convolutional Neural Network Architecture Search for Regulatory Genomics

Convolutional neural networks (CNNs) have become a standard approach for modeling genomic sequences. CNNs can be effectively built by Neural Architecture Search (NAS) by trading computing power for accurate neural architectures. Yet, the consumption of immense computing power is a major practical, financial, and environmental issue for deep learning. Here, we present a novel NAS framework, AMBIENT, that generates highly accurate CNN architectures for biological sequences of diverse functions, while substantially reducing the computing cost of conventional NAS.

February 27, 2021

mRNA-1273 efficacy in a severe COVID-19 model: attenuated activation of pulmonary immune cells after challenge

M. Meyer, Y. Wang, D. Edwards, G. Smith, A. Rubenstein, P. Ramanathan, C. Mire, C. Pietzch, X. Chen, Y. Ge, W. Cheng, C. Henry, A. Woods, L. Ma, G. Stewart-Jones, K. Bock, M. Minai, B. Nagata, S. Periasamy, P. Shi, B. Graham, I. Moore, I. Ramos, O. Troyanskaya, E. Zaslavsky, A. Carfi, S. Sealfon, A. Bukreyev

The mRNA-1273 vaccine was recently determined to be effective against severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) from interim Phase 3 results. Human studies, however, cannot provide the controlled response to infection and complex immunological insight that are only possible with preclinical studies. Hamsters are the only model that reliably exhibits more severe SARS-CoV-2 disease similar to that of hospitalized patients, making them pertinent for vaccine evaluation. We demonstrate that prime or prime-boost administration of mRNA-1273 in hamsters elicited robust neutralizing antibodies, ameliorated weight loss, suppressed SARS-CoV-2 replication in the airways, and better protected against disease at the highest prime-boost dose. Unlike in mice and non-human primates, mRNA-1273-mediated immunity was non-sterilizing and coincided with an anamnestic response. Single-cell RNA sequencing of lung tissue permitted high-resolution analysis, which is not possible in vaccinated humans. mRNA-1273 prevented inflammatory cell infiltration and the reduction of lymphocyte proportions, but enabled antiviral responses conducive to lung homeostasis. Surprisingly, infection triggered transcriptome programs in some types of immune cells from vaccinated hamsters that were shared, albeit attenuated, with mock-vaccinated hamsters. Our results support the use of mRNA-1273 in a two-dose schedule and provide insight into the potential responses within the lungs of vaccinated humans who are exposed to SARS-CoV-2.

January 25, 2021

Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk

C. Park, J. Zhou, A. Wong, K. Chen, C. Theesfeld, R. Darnell, O. Troyanskaya

Despite the strong genetic basis of psychiatric disorders, the underlying molecular mechanisms are largely unmapped. RNA-binding proteins (RBPs) are responsible for most post-transcriptional regulation, from splicing to translation to localization. RBPs thus act as key gatekeepers of cellular homeostasis, especially in the brain. However, quantifying the pathogenic contribution of noncoding variants impacting RBP target sites is challenging. Here, we leverage a deep learning approach that can accurately predict the RBP target site dysregulation effects of mutations and discover that RBP dysregulation is a principal contributor to psychiatric disorder risk. RBP dysregulation explains a substantial amount of heritability not captured by large-scale molecular quantitative trait loci studies and has a stronger impact than common coding region variants. We share the genome-wide profiles of RBP dysregulation, which we use to identify DDHD2 as a candidate schizophrenia risk gene. This resource provides a new analytical framework to connect the full range of RNA regulation to complex disease.

Nature Genetics, 53(2): 166-173
January 18, 2021

A Multimodal and Integrated Approach to Interrogate Human Kidney Biopsies with Rigor and Reproducibility: Guidelines from the Kidney Precision Medicine Project

T. El-Achkar, C. Park, R. Sealfon, O. Troyanskaya, et al.

Comprehensive and spatially mapped molecular atlases of organs at a cellular level are a critical resource to gain insights into pathogenic mechanisms and personalized therapies for diseases. The Kidney Precision Medicine Project (KPMP) is an endeavor to generate 3-dimensional (3D) molecular atlases of healthy and diseased kidney biopsies using multiple state-of-the-art OMICS and imaging technologies across several institutions. Obtaining rigorous and reproducible results from disparate methods and at different sites to interrogate biomolecules at a single cell level or in 3D space is a significant challenge that can be a futile exercise if not well controlled. We describe a "follow the tissue" pipeline for generating a reliable and authentic single cell/region 3D molecular atlas of human adult kidney. Our approach emphasizes quality assurance, quality control, validation and harmonization across different OMICS and imaging technologies from sample procurement, processing, storage, shipping to data generation, analysis and sharing. We established benchmarks for quality control, rigor, reproducibility and feasibility across multiple technologies through a pilot experiment using common source tissue that was processed and analyzed at different institutions and different technologies. A peer review system was established to critically review quality control measures and the reproducibility of data generated by each technology before being approved to interrogate clinical biopsy specimens. The process established economizes the use of valuable biopsy tissue for multi-OMICS and imaging analysis with stringent quality control to ensure rigor and reproducibility of results and serves as a model for precision medicine projects across laboratories, institutions and consortia.

Identifying intracellular signaling modules and exploring pathways associated with breast cancer recurrence

X. Chen, J. Gu, A. Neuwald, L. Hilakivi-Clarke, R. Clarke, J. Xuan

Exploring complex modularization of intracellular signal transduction pathways is critical to understanding aberrant cellular responses during disease development and drug treatment. IMPALA (Inferred Modularization of PAthway LAndscapes) integrates information from high-throughput gene expression experiments and genome-scale knowledge databases to identify aberrant pathway modules, thereby providing a powerful sampling strategy to reconstruct and explore pathway landscapes. Here IMPALA identifies pathway modules associated with breast cancer recurrence and Tamoxifen resistance. Focusing on estrogen-receptor (ER) signaling, IMPALA identifies alternative pathways from gene expression data of Tamoxifen-treated ER-positive breast cancer patient samples. These pathways were often interconnected through cytoplasmic genes such as IRS1/2, JAK1, YWHAZ, CSNK2A1, MAPK1 and HSP90AA1 and significantly enriched with ErbB, MAPK, and JAK-STAT signaling components. Characterization of the pathway landscape revealed key modules associated with ER signaling and with cell cycle and apoptosis signaling. We validated IMPALA-identified pathway modules using data from four different breast cancer cell lines, including models sensitive and resistant to Tamoxifen. Results showed that a majority of genes in cell cycle/apoptosis modules that were up-regulated in breast cancer patients with short survival (< 5 years) were also over-expressed in drug-resistant cell lines, whereas the transcription factors JUN, FOS, and STAT3 were down-regulated in both patient and drug-resistant cell lines. Hence, IMPALA-identified pathways were associated with Tamoxifen resistance and an increased risk of breast cancer recurrence. The IMPALA package is available at https://dlrl.ece.vt.edu/software/.

Scientific Reports, 11(1): 385
January 11, 2021