CCB: Publications

Detailing Regulatory Networks through Large Scale Data Integration

C. Huttenhower, K. Mutungu, N. Indik, W. Yang, M. Schroeder, J. Forman, O. Troyanskaya, H. Coller

MOTIVATION:
Much of a cell's regulatory response to changing environments occurs at the transcriptional level. Particularly in higher organisms, transcription factors (TFs), microRNAs and epigenetic modifications can combine to form a complex regulatory network. Part of this system can be modeled as a collection of regulatory modules: co-regulated genes, the conditions under which they are co-regulated and sequence-level regulatory motifs.

RESULTS:
We present the Combinatorial Algorithm for Expression and Sequence-based Cluster Extraction (COALESCE) system for regulatory module prediction. The algorithm is efficient enough to discover expression biclusters and putative regulatory motifs in metazoan genomes (>20,000 genes) and very large microarray compendia (>10,000 conditions). Using Bayesian data integration, it can also include diverse supporting data types such as evolutionary conservation or nucleosome placement. We validate its performance using a functional evaluation of co-clustered genes, known yeast and Escherichia coli TF targets, synthetic data and various metazoan data compendia. In all cases, COALESCE performs as well as or better than current biclustering and motif prediction tools, with high accuracy in functional and TF/target assignments and zero false positives on synthetic data. COALESCE provides an efficient and flexible platform within which large, diverse data collections can be integrated to predict metazoan regulatory networks.

AVAILABILITY:
Source code (C++) is available at http://function.princeton.edu/sleipnir, and supporting data and a web interface are provided at http://function.princeton.edu/coalesce.


Graphle: Interactive Exploration of Large, Dense Graphs

C. Huttenhower, S. Mehmood, O. Troyanskaya

BACKGROUND:
A wide variety of biological data can be modeled as network structures, including experimental results (e.g. protein-protein interactions), computational predictions (e.g. functional interaction networks), or curated structures (e.g. the Gene Ontology). While several tools exist for visualizing large graphs at a global level or small graphs in detail, previous systems have generally not allowed interactive analysis of dense networks containing thousands of vertices at a level of detail useful for biologists. Investigators often wish to explore specific portions of such networks from a detailed, gene-specific perspective, and balancing this requirement with the networks' large size, complex structure, and rich metadata is a substantial computational challenge.

RESULTS:
Graphle is an online interface to large collections of arbitrary undirected, weighted graphs, each possibly containing tens of thousands of vertices (e.g. genes) and hundreds of millions of edges (e.g. interactions). These are stored on a centralized server and accessed efficiently through an interactive Java applet. The Graphle applet allows a user to examine specific portions of a graph, retrieving the relevant neighborhood around a set of query vertices (genes). This neighborhood can then be refined and modified interactively, and the results can be saved either as publication-quality images or as raw data for further analysis. The Graphle web site currently includes several hundred biological networks representing predicted functional relationships from three heterogeneous data integration systems: S. cerevisiae data from bioPIXIE, E. coli data using MEFIT, and H. sapiens data from HEFalMp.

CONCLUSIONS:
Graphle serves as a search and visualization engine for biological networks, which can be managed locally (simplifying collaborative data sharing) and investigated remotely. The Graphle framework is freely downloadable and easily installed on new servers, allowing any lab to quickly set up a Graphle site from which their own biological network data can be shared online.


Systems-Level Dynamic Analyses of Fate Change in Murine Embryonic Stem Cells

R. Lu, F. Markowetz, R. Unwin, J. Leek, E. Airoldi, B. MacArthur, A. Lachmann, R. Rozov, A. Ma'ayan, L. Boyer, O. Troyanskaya, A. Whetton, I. Lemischka

Molecular regulation of embryonic stem cell (ESC) fate involves a coordinated interaction between epigenetic, transcriptional and translational mechanisms. It is unclear how these different molecular regulatory mechanisms interact to regulate changes in stem cell fate. Here we present a dynamic systems-level study of cell fate change in murine ESCs following a well-defined perturbation. Global changes in histone acetylation, chromatin-bound RNA polymerase II, messenger RNA (mRNA), and nuclear protein levels were measured over 5 days after downregulation of Nanog, a key pluripotency regulator. Our data demonstrate how a single genetic perturbation leads to progressive widespread changes in several molecular regulatory layers, and provide a dynamic view of information flow in the epigenome, transcriptome and proteome. We observe that a large proportion of changes in nuclear protein levels are not accompanied by concordant changes in the expression of corresponding mRNAs, indicating important roles for translational and post-translational regulation of ESC fate. Gene-ontology analysis across different molecular layers indicates that although chromatin reconfiguration is important for altering cell fate, it is preceded by transcription-factor-mediated regulatory events. The temporal order of gene expression alterations shows the order of the regulatory network reconfiguration and offers further insight into the gene regulatory network. Our studies extend the conventional systems biology approach to include many molecular species, regulatory layers and temporal series, and underscore the complexity of the multilayer regulatory mechanisms responsible for changes in protein expression that determine stem cell fate.


The Impact of Incomplete Knowledge on Evaluation: An Experimental Benchmark for Protein Function Prediction

C. Huttenhower, M. Hibbs, C. Myers, A. Caudy, D. Hess, O. Troyanskaya

MOTIVATION:
Rapidly expanding repositories of highly informative genomic data have generated increasing interest in methods for protein function prediction and inference of biological networks. The successful application of supervised machine learning to these tasks requires a gold standard for protein function: a trusted set of correct examples, which can be used to assess performance through cross-validation or other statistical approaches. Since gene annotation is incomplete for even the best studied model organisms, the biological reliability of such evaluations may be called into question.

RESULTS:
We address this concern by constructing and analyzing an experimentally based gold standard through comprehensive validation of protein function predictions for mitochondrion biogenesis in Saccharomyces cerevisiae. Specifically, we determine that (i) current machine learning approaches are able to generalize and predict novel biology from an incomplete gold standard and (ii) incomplete functional annotations adversely affect the evaluation of machine learning performance. While computational approaches performed better than predicted in the face of incomplete data, relative comparison of competing approaches, even those employing the same training data, is problematic with a sparse gold standard. Incomplete knowledge causes individual methods' performances to be differentially underestimated, resulting in misleading performance evaluations. We provide a benchmark gold standard for yeast mitochondria to complement current databases and an analysis of our experimental results in the hopes of mitigating these effects in future comparative evaluations.

AVAILABILITY:
The mitochondrial benchmark gold standard, as well as experimental results and additional data, is available at http://function.princeton.edu/mitochondria.


Global Prediction of Tissue-Specific Gene Expression and Context-Dependent Gene Networks in Caenorhabditis elegans

M. Chikina, C. Huttenhower, C. Murphy, O. Troyanskaya

Tissue-specific gene expression plays a fundamental role in metazoan biology and is an important aspect of many complex diseases. Nevertheless, an organism-wide map of tissue-specific expression remains elusive due to difficulty in obtaining these data experimentally. Here, we leveraged existing whole-animal Caenorhabditis elegans microarray data representing diverse conditions and developmental stages to generate accurate predictions of tissue-specific gene expression and experimentally validated these predictions. These patterns of tissue-specific expression are more accurate than existing high-throughput experimental studies for nearly all tissues; they also complement existing experiments by addressing tissue-specific expression present at particular developmental stages and in small tissues. We used these predictions to address several experimentally challenging questions, including the identification of tissue-specific transcriptional motifs and the discovery of potential miRNA regulation specific to particular tissues. We also investigate the role of tissue context in gene function through tissue-specific functional interaction networks. To our knowledge, this is the first study producing high-accuracy predictions of tissue-specific expression and interactions for a metazoan organism based on whole-animal data.


Exploring the Human Genome with Functional Maps

C. Huttenhower, E. Haley, M. Hibbs, V. Dumeaux, D. Barrett, H. Coller, O. Troyanskaya

Human genomic data of many types are readily available, but the complexity and scale of human molecular biology make it difficult to integrate this body of data, understand it from a systems level, and apply it to the study of specific pathways or genetic disorders. An investigator could best explore a particular protein, pathway, or disease if given a functional map summarizing the data and interactions most relevant to his or her area of interest. Using a regularized Bayesian integration system, we provide maps of functional activity and interaction networks in over 200 areas of human cellular biology, each including information from approximately 30,000 genome-scale experiments pertaining to approximately 25,000 human genes. Key to these analyses is the ability to efficiently summarize this large data collection from a variety of biologically informative perspectives: prediction of protein function and functional modules, cross-talk among biological processes, and association of novel genes and pathways with known genetic disorders. In addition to providing maps of each of these areas, we also identify biological processes active in each data set. Experimental investigation of five specific genes, AP3B1, ATP6AP1, BLOC1S1, LAMP2, and RAB11A, has confirmed novel roles for these proteins in the proper initiation of macroautophagy in amino acid-starved human fibroblasts. Our functional maps can be explored using HEFalMp (Human Experimental/Functional Mapper), a web interface allowing interactive visualization and investigation of this large body of information.


Aneuploidy Prediction and Tumor Classification with Heterogeneous Hidden Conditional Random Fields

Z. Barutcuoglu, E. Airoldi, V. Dumeaux, R. Schapire, O. Troyanskaya

MOTIVATION:
The heterogeneity of cancer cannot always be recognized by tumor morphology, but may be reflected by the underlying genetic aberrations. Array comparative genome hybridization (array-CGH) methods provide high-throughput data on genetic copy numbers, but determining the clinically relevant copy number changes remains a challenge. Conventional classification methods for linking recurrent alterations to clinical outcome ignore sequential correlations in selecting relevant features. Conversely, existing sequence classification methods can only model overall copy number instability, without regard to any particular position in the genome.

RESULTS:
Here, we present the heterogeneous hidden conditional random field, a new integrated array-CGH analysis method for jointly classifying tumors, inferring copy numbers and identifying clinically relevant positions in recurrent alteration regions. By capturing the sequentiality as well as the locality of changes, our integrated model provides better noise reduction, and achieves more relevant gene retrieval and more accurate classification than existing methods. We provide an efficient L1-regularized discriminative training algorithm, which notably selects a small set of candidate genes most likely to be clinically relevant and driving the recurrent amplicons of importance. Our method thus provides unbiased starting points in deciding which genomic regions and which genes in particular to pursue for further examination. Our experiments on synthetic data and real genomic cancer prediction data show that our method is superior, both in prediction accuracy and relevant feature discovery, to existing methods. We also demonstrate that it can be used to generate novel biological hypotheses for breast cancer.


Computationally Driven, Quantitative Experiments Discover Genes Required for Mitochondrial Biogenesis

D. Hess, C. Myers, C. Huttenhower, M. Hibbs, A. Hayes, J. Paw, J. Clore, R. Mendoza, B. San Luis, C. Nislow, G. Giaever, M. Costanzo, O. Troyanskaya, A. Caudy

Mitochondria are central to many cellular processes including respiration, ion homeostasis, and apoptosis. Using computational predictions combined with traditional quantitative experiments, we have identified 100 proteins whose deficiency alters mitochondrial biogenesis and inheritance in Saccharomyces cerevisiae. In addition, we used computational predictions to perform targeted double-mutant analysis detecting another nine genes with synthetic defects in mitochondrial biogenesis. This represents an increase of about 25% over previously known participants. Nearly half of these newly characterized proteins are conserved in mammals, including several orthologs known to be involved in human disease. Mutations in many of these genes demonstrate statistically significant mitochondrial transmission phenotypes more subtle than could be detected by traditional genetic screens or high-throughput techniques, and 47 have not been previously localized to mitochondria. We further characterized a subset of these genes using growth profiling and dual immunofluorescence, which identified genes specifically required for aerobic respiration and an uncharacterized cytoplasmic protein required for normal mitochondrial motility. Our results demonstrate that by leveraging computational analysis to direct quantitative experimental assays, we have characterized mutants with subtle mitochondrial defects whose phenotypes were undetected by high-throughput methods.


Directing Experimental Biology: A Case Study in Mitochondrial Biogenesis

M. Hibbs, C. Myers, C. Huttenhower, D. Hess, K. Li, A. Caudy, O. Troyanskaya

Computational approaches have promised to organize collections of functional genomics data into testable predictions of gene and protein involvement in biological processes and pathways. However, few such predictions have been experimentally validated on a large scale, leaving many bioinformatic methods unproven and underutilized in the biology community. Further, it remains unclear what biological concerns should be taken into account when using computational methods to drive real-world experimental efforts. To investigate these concerns and to establish the utility of computational predictions of gene function, we experimentally tested hundreds of predictions generated from an ensemble of three complementary methods for the process of mitochondrial organization and biogenesis in Saccharomyces cerevisiae. The biological data with respect to the mitochondria are presented in a companion manuscript published in PLoS Genetics (doi:10.1371/journal.pgen.1000407). Here we analyze and explore the results of this study that are broadly applicable for computationalists applying gene function prediction techniques, including a new experimental comparison with 48 genes representing the genomic background. Our study leads to several conclusions that are important to consider when driving laboratory investigations using computational prediction approaches. While most genes in yeast are already known to participate in at least one biological process, we confirm that genes with known functions can still be strong candidates for annotation of additional gene functions. We find that different analysis techniques and different underlying data can both greatly affect the types of functional predictions produced by computational methods. This diversity allows an ensemble of techniques to substantially broaden the biological scope and breadth of predictions. 
We also find that performing prediction and validation steps iteratively allows us to more completely characterize a biological area of interest. While this study focused on a specific functional area in yeast, many of these observations may be useful in the contexts of other processes and organisms.


Selected Proceedings of the First Summit on Translational Bioinformatics 2008

A. Butte, I. Sarkar, M. Ramoni, Y. Lussier, O. Troyanskaya

In 2005, Dr. Elias Zerhouni, Director of the National Institutes of Health (NIH), wrote:

"It is the responsibility of those of us involved in today's biomedical research enterprise to translate the remarkable scientific innovations we are witnessing into health gains for the nation... At no other time has the need for a robust, bidirectional information flow between basic and translational scientists been so necessary."

In that publication, Dr. Zerhouni introduced his ideas to re-engineer the way clinical research was performed in the United States. With the doubling of the NIH budget in the past decade, and coincident completion of the Human Genome Project, there is a perceived need to translate products of the genome era into products for clinical care.
