CCB: Publications

Implications of Big Data for cell biology

“Big Data” has surpassed “systems biology” and “omics” as the hottest buzzword in the biological sciences, but is there any substance behind the hype? Certainly, we have learned about various aspects of cell and molecular biology from the many individual high-throughput data sets that have been published in the past 15–20 years. These data, although useful as individual data sets, can provide much more knowledge when interrogated with Big Data approaches, such as applying integrative methods that leverage the heterogeneous data compendia in their entirety. Here we discuss the benefits and challenges of such Big Data approaches in biology and how cell and molecular biologists can best take advantage of them.

IMP 2.0: A Multi-Species Functional Genomics Portal for Integration, Visualization and Prediction of Protein Functions and Networks

A. Wong, A. Krishnan, V. Yao, A. Tadych, O. Troyanskaya

IMP (Integrative Multi-species Prediction), originally released in 2012, is an interactive web server that enables molecular biologists to interpret experimental results and to generate hypotheses in the context of a large cross-organism compendium of functional predictions and networks. The system provides biologists with a framework to analyze their candidate gene sets in the context of functional networks, expanding or refining their sets using functional relationships predicted from integrated high-throughput data. IMP 2.0 integrates updated prior knowledge and data collections from the last three years in the seven supported organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Caenorhabditis elegans, and Saccharomyces cerevisiae) and extends function prediction coverage to include human disease. IMP identifies homologs with conserved functional roles for disease knowledge transfer, allowing biologists to analyze disease contexts and predictions across all organisms. Additionally, IMP 2.0 implements a new flexible platform for experts to generate custom hypotheses about biological processes or diseases, making sophisticated data-driven methods easily accessible to researchers. IMP does not require any registration or installation and is freely available for use at http://imp.princeton.edu.

Oxopiperazine helix mimetics for control of hypoxia-inducible gene expression

P.S. Arora, B.B. Lao, R. Bonneau, K. Drew

The present invention relates to oxopiperazines that mimic helix αB of the C-terminal transactivation domain of HIF1α. Also disclosed are pharmaceutical compositions containing these oxopiperazines and methods of using these oxopiperazines (e.g., to reduce gene transcription, treat or prevent disorders mediated by interaction of HIF1α with CREB-binding protein and/or p300, reduce or prevent angiogenesis in a tissue, induce apoptosis, and decrease cell survival and/or proliferation).

Sparse and Compositionally Robust Inference of Microbial Ecological Networks

Z.D. Kurtz, C. Müller, E. Miraldi, D.R. Littman, M.J. Blaser, R. Bonneau

16S ribosomal RNA (rRNA) gene and other environmental sequencing techniques provide snapshots of microbial communities, revealing phylogeny and the abundances of microbial populations across diverse ecosystems. While changes in microbial community structure are demonstrably associated with certain environmental conditions (from metabolic and immunological health in mammals to ecological stability in soils and oceans), identification of underlying mechanisms requires new statistical tools, as these datasets present several technical challenges. First, the abundances of microbial operational taxonomic units (OTUs) from amplicon-based datasets are compositional. Counts are normalized to the total number of counts in the sample. Thus, microbial abundances are not independent, and traditional statistical metrics (e.g., correlation) for the detection of OTU-OTU relationships can lead to spurious results. Secondly, microbial sequencing-based studies typically measure hundreds of OTUs on only tens to hundreds of samples; thus, inference of OTU-OTU association networks is severely under-powered, and additional information (or assumptions) are required for accurate inference. Here, we present SPIEC-EASI (SParse InversE Covariance Estimation for Ecological Association Inference), a statistical method for the inference of microbial ecological networks from amplicon sequencing datasets that addresses both of these issues. SPIEC-EASI combines data transformations developed for compositional data analysis with a graphical model inference framework that assumes the underlying ecological association network is sparse. To reconstruct the network, SPIEC-EASI relies on algorithms for sparse neighborhood and inverse covariance selection. To provide a synthetic benchmark in the absence of an experimentally validated gold-standard network, SPIEC-EASI is accompanied by a set of computational tools to generate OTU count data from a set of diverse underlying network topologies. SPIEC-EASI outperforms state-of-the-art methods to recover edges and network properties on synthetic data under a variety of scenarios. SPIEC-EASI also reproducibly predicts previously unknown microbial associations using data from the American Gut project.
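
The compositionality issue described in this abstract can be illustrated with a small sketch. The centered log-ratio (CLR) transform is one widely used transformation from compositional data analysis; the abstract does not say which transformations SPIEC-EASI uses, so this is only an illustration under that assumption and not the authors' implementation. The function name `clr_transform`, the `pseudocount` parameter, and the example counts are all hypothetical.

```python
import math


def clr_transform(counts, pseudocount=1.0):
    """Centered log-ratio (CLR) transform of one sample's OTU counts.

    `pseudocount` is an illustrative assumption used to avoid log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log expresses each OTU relative to the
    # geometric mean of the sample.
    return [value - mean_log for value in logs]


# Hypothetical counts for three OTUs; `deep` is the same community
# sequenced three times more deeply than `shallow`.
shallow = [10, 20, 70]
deep = [30, 60, 210]

clr_shallow = clr_transform(shallow, pseudocount=0.0)
clr_deep = clr_transform(deep, pseudocount=0.0)

# With no pseudocount, the CLR values do not depend on sequencing depth.
depth_invariant = all(
    math.isclose(a, b, abs_tol=1e-12) for a, b in zip(clr_shallow, clr_deep)
)
```

Because each value is taken relative to the sample's own geometric mean, the transformed values of a sample sum to zero and are unaffected by total read count, which is the kind of normalization effect the abstract identifies as a source of spurious correlations.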

Low-Variance RNAs Identify Parkinson’s Disease Molecular Signature in Blood

C. Mese, C. Gerald, X. Li, Y. Ge, H. Pincas, A. Wong, A. Krishnan, O. Troyanskaya, D. Raymond, R. Saunders-Pullman, S. Bressman, Z. Yue, C. Sealfon

The diagnosis of Parkinson's disease (PD) is usually not established until advanced neurodegeneration leads to clinically detectable symptoms. Previous blood PD transcriptome studies show low concordance, possibly resulting from the use of microarray technology, which has high measurement variation. The Leucine-rich repeat kinase 2 (LRRK2) G2019S mutation predisposes to PD. Using preclinical and clinical studies, we sought to develop a novel statistically motivated transcriptomic-based approach to identify a molecular signature in the blood of Ashkenazi Jewish PD patients, including LRRK2 mutation carriers. Using a digital gene expression platform to quantify 175 messenger RNA (mRNA) markers with low coefficients of variation (CV), we first compared whole-blood transcript levels in mouse models (1) overexpressing wild-type (WT) LRRK2, (2) overexpressing G2019S LRRK2, (3) lacking LRRK2 (knockout), and (4) in WT controls. We then studied an Ashkenazi Jewish cohort of 34 symptomatic PD patients (both WT LRRK2 and G2019S LRRK2) and 32 asymptomatic controls. The expression profiles distinguished the four mouse groups with different genetic backgrounds. In patients, we detected significant differences in blood transcript levels both between individuals differing in LRRK2 genotype and between PD patients and controls. Discriminatory PD markers included genes associated with innate and adaptive immunity and inflammatory disease. Notably, gene expression patterns in levodopa-treated PD patients were significantly closer to those of healthy controls in a dose-dependent manner. We identify whole-blood mRNA signatures correlating with LRRK2 genotype and with PD disease state. This approach may provide insight into pathogenesis and a route to early disease detection.

Understanding multicellular function and disease with human tissue-specific networks

C.S. Greene, et al.

Tissue and cell-type identity lie at the core of human physiology and disease. Understanding the genetic underpinnings of complex tissues and individual cell lineages is crucial for developing improved diagnostics and therapeutics. We present genome-wide functional interaction networks for 144 human tissues and cell types developed using a data-driven Bayesian methodology that integrates thousands of diverse experiments spanning tissue and disease states. Tissue-specific networks predict lineage-specific responses to perturbation, identify the changing functional roles of genes across tissues and illuminate relationships among diseases. We introduce NetWAS, which combines genes with nominally significant genome-wide association study (GWAS) P values and tissue-specific networks to identify disease-gene associations more accurately than GWAS alone. Our webserver, GIANT, provides an interface to human tissue networks through multi-gene queries, network visualization, analysis tools including NetWAS and downloadable networks. GIANT enables systematic exploration of the landscape of interacting genes that shape specialized cellular functions across more than a hundred human tissues and cell types.

Interpreting 4C-Seq data: how far can we go?

R. Raviram, P.P. Rocha, R. Bonneau, J.A. Skok

The linear sequence of the genome has been extremely valuable in mapping regulatory elements relative to the genes they control. However, it has become increasingly evident that characterizing the three-dimensional organization of the genome is critical to get a better understanding of long-range regulation. Early studies using fluorescent in-situ hybridization (FISH) revealed that individual chromosomes occupy distinct spaces in the nucleus with minimal intermingling between territories[1]. Recent advances using chromosome conformation capture (3C) techniques have confirmed these findings and further improved the depth at which we can determine the organization of chromosomes and the physical interactions that occur within and between them[2, 3]. Variations of the 3C technique include (i) Hi-C, to capture all pairwise interactions, (ii) 5C, to capture interactions within and between loci of interest and (iii) 4C-Seq, to capture all interactions with a single locus of interest. The choice of technique depends on the biological question being asked and the scale at which this needs to be examined. While Hi-C has been instrumental in characterizing higher-order organization of chromosomes in the nucleus, it lacks the resolution that is required for analysis of specific interactions, such as between enhancers and promoters. This can be achieved with 4C-Seq, which allows interrogation of interactions from a single viewpoint or bait, to the rest of the genome. Several studies have used 4C-Seq to better understand phenomena such as X chromosome inactivation[4], enhancer-promoter interactions[5, 6], organization of antigen receptor loci[7], choice of translocation partners[8, 9] and collinear transcriptional regulation[10]. Here we aim to focus on the current state of the 4C-Seq method and the limitations and challenges of the associated computational analysis.

Oxopiperazine helix mimetics as inhibitors of the p53-MDM2 interaction

P.S. Arora, B.B. Lao, D. Guarracino, R. Bonneau, K. Drew

The present invention relates to oligooxopiperazines for modulating the p53-Mdm2 interaction. Methods of using the oligooxopiperazines are also disclosed.

Tissue-Aware Data Integration Approach for the Inference of Pathway Interactions in Metazoan Organisms

C. Park, A. Krishnan, Q. Zhu, A. Wong, Y. Lee, O. Troyanskaya

MOTIVATION:
Leveraging the large compendium of genomic data to predict biomedical pathways and specific mechanisms of protein interactions genome-wide in metazoan organisms has been challenging. In contrast to unicellular organisms, biological and technical variation originating from diverse tissues and cell-lineages is often the largest source of variation in metazoan data compendia. Therefore, a new computational strategy accounting for the tissue heterogeneity in the functional genomic data is needed to accurately translate the vast amount of human genomic data into specific interaction-level hypotheses.

RESULTS:
We developed an integrated, scalable strategy for inferring multiple human gene interaction types that takes advantage of data from diverse tissue and cell-lineage origins. Our approach predicts both the presence of a functional association and the most likely interaction type among human genes or their protein products on a whole-genome scale. We demonstrate that directly incorporating tissue contextual information improves the accuracy of our predictions, and further, that such genome-wide results can be used to significantly refine regulatory interactions from primary experimental datasets (e.g. ChIP-Seq, mass spectrometry).

AVAILABILITY AND IMPLEMENTATION:
An interactive website hosting all of our interaction predictions is publicly available at http://pathwaynet.princeton.edu. Software was implemented using the open-source Sleipnir library, which is available for download at https://bitbucket.org/libsleipnir/libsleipnir.bitbucket.org.

Lymphocyte Invasion in IC10/Basal-Like Breast Tumors Is Associated with Wild-Type TP53

D. Quigley, L. Silwal-Pandit, R. Dannenfelser, A. Langerød, H. Vollan, C. Vaske, J. Siegel, O. Troyanskaya, S. Chin, C. Caldas, A. Balmain, A. Børresen-Dale, V. Kristensen

Lymphocytic infiltration is associated with better prognosis in several epithelial malignancies including breast cancer. The tumor suppressor TP53 is mutated in approximately 30% of breast adenocarcinomas, with varying frequency across molecular subtypes. In this study of 1,420 breast tumors, we tested for interaction between TP53 mutation status and tumor subtype determined by PAM50 and integrative cluster analysis. In integrative cluster 10 (IC10)/basal-like breast cancer, we identify an association between lymphocytic infiltration, determined by an expression score, and retention of wild-type TP53. The expression-derived score agreed with the degree of lymphocytic infiltration assessed by pathologic review, and application of the Nanodissect algorithm suggested that this infiltration consists primarily of cytotoxic T lymphocytes (CTL). Elevated expression of this CTL signature was associated with longer survival in IC10/basal-like tumors. These findings identify a new link between the TP53 pathway and the adaptive immune response in estrogen receptor (ER)-negative breast tumors, suggesting a connection between TP53 inactivation and failure of tumor immunosurveillance.
