CCB: Publications

Understanding multicellular function and disease with human tissue-specific networks

C.S. Greene, et al.

Tissue and cell-type identity lie at the core of human physiology and disease. Understanding the genetic underpinnings of complex tissues and individual cell lineages is crucial for developing improved diagnostics and therapeutics. We present genome-wide functional interaction networks for 144 human tissues and cell types developed using a data-driven Bayesian methodology that integrates thousands of diverse experiments spanning tissue and disease states. Tissue-specific networks predict lineage-specific responses to perturbation, identify the changing functional roles of genes across tissues and illuminate relationships among diseases. We introduce NetWAS, which combines genes with nominally significant genome-wide association study (GWAS) P values and tissue-specific networks to identify disease-gene associations more accurately than GWAS alone. Our webserver, GIANT, provides an interface to human tissue networks through multi-gene queries, network visualization, analysis tools including NetWAS and downloadable networks. GIANT enables systematic exploration of the landscape of interacting genes that shape specialized cellular functions across more than a hundred human tissues and cell types.

Tissue-Aware Data Integration Approach for the Inference of Pathway Interactions in Metazoan Organisms

C. Park, A. Krishnan, Q. Zhu, A. Wong, Y. Lee, O. Troyanskaya

MOTIVATION:
Leveraging the large compendium of genomic data to predict biomedical pathways and specific mechanisms of protein interactions genome-wide in metazoan organisms has been challenging. In contrast to unicellular organisms, biological and technical variation originating from diverse tissues and cell-lineages is often the largest source of variation in metazoan data compendia. Therefore, a new computational strategy accounting for the tissue heterogeneity in the functional genomic data is needed to accurately translate the vast amount of human genomic data into specific interaction-level hypotheses.

RESULTS:
We developed an integrated, scalable strategy for inferring multiple human gene interaction types that takes advantage of data from diverse tissue and cell-lineage origins. Our approach specifically predicts both the presence of a functional association and the most likely interaction type among human genes or their protein products on a whole-genome scale. We demonstrate that directly incorporating tissue contextual information improves the accuracy of our predictions, and further, that such genome-wide results can be used to significantly refine regulatory interactions from primary experimental datasets (e.g. ChIP-Seq, mass spectrometry).

AVAILABILITY AND IMPLEMENTATION:
An interactive website hosting all of our interaction predictions is publicly available at http://pathwaynet.princeton.edu. Software was implemented using the open-source Sleipnir library, which is available for download at https://bitbucket.org/libsleipnir/libsleipnir.bitbucket.org.

Lymphocyte Invasion in IC10/Basal-Like Breast Tumors Is Associated with Wild-Type TP53

D. Quigley, L. Silwal-Pandit, R. Dannenfelser, A. Langerød, H. Vollan, C. Vaske, J. Siegel, O. Troyanskaya, S. Chin, C. Caldas, A. Balmain, A. Børresen-Dale, V. Kristensen

Lymphocytic infiltration is associated with better prognosis in several epithelial malignancies including breast cancer. The tumor suppressor TP53 is mutated in approximately 30% of breast adenocarcinomas, with varying frequency across molecular subtypes. In this study of 1,420 breast tumors, we tested for interaction between TP53 mutation status and tumor subtype determined by PAM50 and integrative cluster analysis. In integrative cluster 10 (IC10)/basal-like breast cancer, we identify an association between lymphocytic infiltration, determined by an expression score, and retention of wild-type TP53. The expression-derived score agreed with the degree of lymphocytic infiltration assessed by pathologic review, and application of the Nanodissect algorithm suggested that this infiltration consisted primarily of cytotoxic T lymphocytes (CTL). Elevated expression of this CTL signature was associated with longer survival in IC10/basal-like tumors. These findings identify a new link between the TP53 pathway and the adaptive immune response in estrogen receptor (ER)-negative breast tumors, suggesting a connection between TP53 inactivation and failure of tumor immunosurveillance.

Targeted exploration and analysis of large cross-platform human transcriptomic compendia

Q. Zhu, A. Wong, A. Krishnan, M. Aure, A. Tadych, R. Zhang, D. Corney, C. Greene, L. Bongo, V. Kristensen, M. Charikar, K. Li, O. Troyanskaya

We present SEEK (search-based exploration of expression compendia; http://seek.princeton.edu/), a query-based search engine for very large transcriptomic data collections, including thousands of human data sets from many different microarray and high-throughput sequencing platforms. SEEK uses a query-level cross-validation–based algorithm to automatically prioritize data sets relevant to the query and a robust search approach to identify genes, pathways and processes co-regulated with the query. SEEK provides multigene query searching with iterative metadata-based search refinement and extensive visualization-based analysis options.

Global Quantitative Modeling of Chromatin Factor Interactions

J. Zhou, O. Troyanskaya

Chromatin is the driver of gene regulation, yet understanding the molecular interactions underlying chromatin factor combinatorial patterns (or the “chromatin codes”) remains a fundamental challenge in chromatin biology. Here we developed a global modeling framework that leverages chromatin profiling data to produce a systems-level view of the macromolecular complex of chromatin. Our model utilizes maximum entropy modeling with regularization-based structure learning to statistically dissect dependencies between chromatin factors and produce an accurate probability distribution of the chromatin code. Our unsupervised quantitative model, trained on genome-wide chromatin profiles of 73 histone marks and chromatin proteins from modENCODE, enabled various data-driven inferences about chromatin profiles and interactions. We provided a highly accurate predictor of chromatin factor pairwise interactions validated by known experimental evidence, and for the first time enabled higher-order interaction prediction. Our predictions can thus help guide future experimental studies. The model can also serve as an inference engine for predicting unknown chromatin profiles: we demonstrated that with this approach we can leverage data from well-characterized cell types to help understand less-studied cell types or conditions.

Deep Supervised and Convolutional Generative Stochastic Network for Protein Secondary Structure Prediction

J. Zhou, O. Troyanskaya

Predicting protein secondary structure is a fundamental problem in protein structure prediction. Here we present a new supervised generative stochastic network (GSN) based method to predict local secondary structure with deep hierarchical representations. GSN is a recently proposed deep learning technique (Bengio & Thibodeau-Laufer, 2013) to globally train deep generative models. We present the supervised extension of GSN, which learns a Markov chain to sample from a conditional distribution, and apply it to protein structure prediction. To scale the model to full-sized, high-dimensional data, like protein sequences with hundreds of amino acids, we introduce a convolutional architecture, which allows efficient learning across multiple layers of hierarchical representations. Our architecture uniquely focuses on predicting structured low-level labels informed with both low and high-level representations learned by the model. In our application this corresponds to labeling the secondary structure state of each amino-acid residue. We trained and tested the model on separate sets of non-homologous proteins sharing less than 30% sequence identity. Our model achieves 66.4% Q8 accuracy on the CB513 dataset, better than the previously reported best performance of 64.9% (Wang et al., 2011) for this challenging secondary structure prediction problem.

Broad Metabolic Sensitivity Profiling of a Prototrophic Yeast Deletion Collection

B. VanderSluis, O. Troyanskaya

Genome-wide sensitivity screens in yeast have been immensely popular following the construction of a collection of deletion mutants of non-essential genes. However, the auxotrophic markers in this collection preclude experiments on minimal growth medium, one of the most informative metabolic environments. Here we present quantitative growth analysis for mutants in all 4,772 non-essential genes from our prototrophic deletion collection across a large set of metabolic conditions.

Individual and Combined Effects of DNA Methylation and Copy Number Alterations on miRNA Expression in Breast Tumors

M. Aure, O. Troyanskaya

The global effect of copy number and epigenetic alterations on miRNA expression in cancer is poorly understood. In the present study, we integrate genome-wide DNA methylation, copy number and miRNA expression and identify genetic mechanisms underlying miRNA dysregulation in breast cancer.

RESULTS:
We identify 70 miRNAs whose expression was associated with alterations in copy number or methylation, or both. Among these, five miRNA families are represented. Interestingly, the members of these families are encoded on different chromosomes and are complementarily altered by gain or hypomethylation across the patients. In an independent breast cancer cohort of 123 patients, 41 of the 70 miRNAs were confirmed with respect to aberration pattern and association to expression. In vitro functional experiments were performed in breast cancer cell lines with miRNA mimics to evaluate the phenotype of the replicated miRNAs. let-7e-3p, which in tumors is found associated with hypermethylation, is shown to induce apoptosis and reduce cell viability, and low let-7e-3p expression is associated with poorer prognosis. The overexpression of three other miRNAs associated with copy number gain, miR-21-3p, miR-148b-3p and miR-151a-5p, increases proliferation of breast cancer cell lines. In addition, miR-151a-5p enhances the levels of phosphorylated AKT protein.

CONCLUSIONS:
Our data provide novel evidence of the mechanisms behind miRNA dysregulation in breast cancer. The study contributes to the understanding of how methylation and copy number alterations influence miRNA expression, emphasizing miRNA functionality through redundant encoding, and suggests novel miRNAs important in breast cancer.

Defining Cell-Type Specificity at the Transcriptional Level in Human Disease

W. Ju, C. Greene, F. Eichinger, V. Nair, J. Hodgin, M. Bitzer, Y. Lee, Q. Zhu, M. Kehata, M. Li, S. Jiang, M. Rastaldi, C. Cohen, O. Troyanskaya, M. Kretzler

Cell-lineage-specific transcripts are essential for differentiated tissue function, implicated in hereditary organ failure, and mediate acquired chronic diseases. However, experimental identification of cell-lineage-specific genes in a genome-scale manner is infeasible for most solid human tissues. We developed the first genome-scale method to identify genes with cell-lineage-specific expression, even in lineages not separable by experimental microdissection. Our machine-learning-based approach leverages high-throughput data from tissue homogenates in a novel iterative statistical framework. We applied this method to chronic kidney disease and identified transcripts specific to podocytes, key cells in the glomerular filter responsible for hereditary and most acquired glomerular kidney disease. In a systematic evaluation of our predictions by immunohistochemistry, our in silico approach was significantly more accurate (65% accuracy in human) than predictions based on direct measurement of in vivo fluorescence-tagged murine podocytes (23%). Our method identified genes implicated as causal in hereditary glomerular disease and involved in molecular pathways of acquired and chronic renal diseases. Furthermore, based on expression analysis of human kidney disease biopsies, we demonstrated that expression of the podocyte genes identified by our approach is significantly related to the degree of renal impairment in patients. Our approach is broadly applicable to define lineage specificity in both cell physiology and human disease contexts. We provide a user-friendly website that enables researchers to apply this method to any cell-lineage or tissue of interest. Identified cell-lineage-specific transcripts are expected to play essential tissue-specific roles in organogenesis and disease and can provide starting points for the development of organ-specific diagnostics and therapies.

Ontology-aware classification of tissue and cell-type signals in gene expression profiles across platforms and technologies

Y.-S. Lee, A. Krishnan, Q. Zhu, O. Troyanskaya

We present Unveiling RNA Sample Annotation (URSA), a method that leverages the complex tissue/cell-type relationships and simultaneously estimates the probabilities associated with hundreds of tissues/cell-types for any given gene expression profile. URSA provides accurate and intuitive probability values for expression profiles across independent studies and outperforms other methods, irrespective of data preprocessing techniques. Moreover, without re-training, URSA can be used to classify samples from diverse microarray platforms and even from next-generation sequencing technology. Finally, we provide a molecular interpretation for the tissue and cell-type models as the biological basis for URSA’s classifications.
