CCB: Publications

Selected Proceedings of the First Summit on Translational Bioinformatics 2008

A. Butte, I. Sarkar, M. Ramoni, Y. Lussier, O. Troyanskaya

In 2005, Dr. Elias Zerhouni, Director of the National Institutes of Health (NIH), wrote:

"It is the responsibility of those of us involved in today's biomedical research enterprise to translate the remarkable scientific innovations we are witnessing into health gains for the nation... At no other time has the need for a robust, bidirectional information flow between basic and translational scientists been so necessary."

In that publication, Dr. Zerhouni introduced his ideas to re-engineer the way clinical research was performed in the United States. With the doubling of the NIH budget in the past decade, and coincident completion of the Human Genome Project, there is a perceived need to translate products of the genome era into products for clinical care.

Coordinated Concentration Changes of Transcripts and Metabolites in Saccharomyces cerevisiae

P. Bradley, M. Bauer, J. Rabinowitz, O. Troyanskaya

Metabolite concentrations can regulate gene expression, which can in turn regulate metabolic activity. The extent to which functionally related transcripts and metabolites show similar patterns of concentration changes, however, remains unestablished. We measure and analyze the metabolomic and transcriptional responses of Saccharomyces cerevisiae to carbon and nitrogen starvation. Our analysis demonstrates that transcripts and metabolites show coordinated response dynamics. Furthermore, metabolites and gene products whose concentration profiles are alike tend to participate in related biological processes. To identify specific, functionally related genes and metabolites, we develop an approach based on Bayesian integration of the joint metabolomic and transcriptomic data. This algorithm finds interactions by evaluating transcript–metabolite correlations in light of the experimental context in which they occur and the class of metabolite involved. It effectively predicts known enzymatic and regulatory relationships, including a gene–metabolite interaction central to the glycolytic–gluconeogenetic switch. This work provides quantitative evidence that functionally related metabolites and transcripts show coherent patterns of behavior on the genome scale and lays the groundwork for building gene–metabolite interaction networks directly from systems-level data.

Predicting Cellular Growth from Gene Expression Signatures

E. Airoldi, C. Huttenhower, D. Gresham, C. Lu, A. Caudy, M. Dunham, J. Broach, D. Botstein, O. Troyanskaya

Maintaining balanced growth in a changing environment is a fundamental systems-level challenge for cellular physiology, particularly in microorganisms. While the complete set of regulatory and functional pathways supporting growth and cellular proliferation is not yet known, portions of it are well understood. In particular, cellular proliferation is governed by mechanisms that are highly conserved from unicellular to multicellular organisms, and the disruption of these processes in metazoans is a major factor in the development of cancer. In this paper, we develop statistical methodology to identify quantitative aspects of the regulatory mechanisms underlying cellular proliferation in Saccharomyces cerevisiae. We find that the expression levels of a small set of genes can be exploited to predict the instantaneous growth rate of any cellular culture with high accuracy. The predictions obtained in this fashion are robust to changing biological conditions, experimental methods, and technological platforms. The proposed model is also effective in predicting growth rates for the related yeast Saccharomyces bayanus and the highly diverged yeast Schizosaccharomyces pombe, suggesting that the underlying regulatory signature is conserved across a wide range of unicellular evolution. We investigate the biological significance of the gene expression signature that the predictions are based upon from multiple perspectives: by perturbing the regulatory network through the Ras/PKA pathway, observing strong upregulation of growth rate even in the absence of appropriate nutrients, and discovering putative transcription factor binding sites, observing enrichment in growth-correlated genes. More broadly, the proposed methodology enables biological insights about growth at an instantaneous time scale, inaccessible by direct experimental methods. Data and tools enabling others to apply our methods are available at http://function.princeton.edu/growthrate.

Computational Analysis of the Yeast Proteome: Understanding and Exploiting Functional Specificity in Genomic Data

C. Huttenhower, C. Myers, M. Hibbs, O. Troyanskaya

Modern experimental techniques have produced a wealth of high-throughput data that has enabled the ongoing genomic revolution. As the field continues to integrate experimental and computational analyses of this data, it is essential that performance evaluations of high-throughput results be carried out in a consistent and biologically informative manner. Here, we present an overview of evaluation techniques for high-throughput experimental data and computational methods, and we discuss a number of potential pitfalls in this process. These primarily involve the biological diversity of genomic data, which can be masked or misrepresented in overly simplified global evaluations. We describe systems for preserving information about biological context during dataset evaluation, which can help to ensure that multiple different evaluations are more directly comparable. This biological variety in high-throughput data can also be taken advantage of computationally through data integration and process specificity to produce richer systems-level predictions of cellular function. An awareness of these considerations can greatly improve the evaluation and analysis of any high-throughput experimental dataset.

A Genomewide Functional Network for the Laboratory Mouse

Y. Guan, C. Myers, R. Lu, I. Lemischka, C. Bult, O. Troyanskaya

Establishing a functional network is invaluable to our understanding of gene function, pathways, and systems-level properties of an organism and can be a powerful resource in directing targeted experiments. In this study, we present a functional network for the laboratory mouse based on a Bayesian integration of diverse genetic and functional genomic data. The resulting network includes probabilistic functional linkages among 20,581 protein-coding genes. We show that this network can accurately predict novel functional assignments and network components and present experimental evidence for predictions related to Nanog homeobox (Nanog), a critical gene in mouse embryonic stem cell pluripotency. An analysis of the global topology of the mouse functional network reveals multiple biologically relevant systems-level features of the mouse proteome. Specifically, we identify the clustering coefficient as a critical characteristic of central modulators that affect diverse pathways as well as genes associated with different phenotype traits and diseases. In addition, a cross-species comparison of functional interactomes on a genomic scale revealed distinct functional characteristics of conserved neighborhoods as compared to subnetworks specific to higher organisms. Thus, our global functional network for the laboratory mouse provides the community with a key resource for discovering protein functions and novel pathway components as well as a tool for exploring systems-level topological and evolutionary features of cellular interactomes. To facilitate exploration of this network by the biomedical research community, we illustrate its application in function and disease gene discovery through an interactive, Web-based, publicly available interface at http://mouseNET.princeton.edu.
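
The clustering coefficient mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the probabilistic linkages have already been thresholded into an unweighted, undirected graph, which is a simplification and not necessarily the authors' procedure. The gene names `g1` to `g4` and the function name are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adj, node):
    """Local clustering coefficient of `node` in an undirected graph.

    `adj` maps each node to the set of its neighbors. The result is the
    fraction of neighbor pairs that are themselves linked.
    """
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        # Fewer than two neighbors: no neighbor pairs exist, report 0.
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Toy graph with hypothetical gene identifiers (not real network data).
toy_network = {
    "g1": {"g2", "g3", "g4"},
    "g2": {"g1", "g3"},
    "g3": {"g1", "g2"},
    "g4": {"g1"},
}

coefficients = {g: clustering_coefficient(toy_network, g) for g in toy_network}
```

In this toy graph, `g1` has three neighbors of which only one pair is linked, so its coefficient is 1/3, while `g2` sits in a closed triangle and scores 1.0.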

The Sleipnir Library for Computational Functional Genomics

C. Huttenhower, M. Schroeder, M. Chikina

MOTIVATION:
Biological data generation has accelerated to the point where hundreds or thousands of whole-genome datasets of various types are available for many model organisms. This wealth of data can lead to valuable biological insights when analyzed in an integrated manner, but the computational challenge of managing such large data collections is substantial. In order to mine these data efficiently, it is necessary to develop methods that use storage, memory and processing resources carefully.
RESULTS:
The Sleipnir C++ library implements a variety of machine learning and data manipulation algorithms with a focus on heterogeneous data integration and efficiency for very large biological data collections. Sleipnir allows microarray processing, functional ontology mining, clustering, Bayesian learning and inference and support vector machine tasks to be performed for heterogeneous data on scales not previously practical. In addition to the library, which can easily be integrated into new computational systems, prebuilt tools are provided to perform a variety of common tasks. Many tools are multithreaded for parallelization in desktop or high-throughput computing environments, and most tasks can be performed in minutes for hundreds of datasets using a standard personal computer.
AVAILABILITY:
Source code (C++) and documentation are available at http://function.princeton.edu/sleipnir and compiled binaries are available from the authors on request.

Assessing the Functional Structure of Genomic Data

C. Huttenhower, O. Troyanskaya

Motivation: The availability of genome-scale data has enabled an abundance of novel analysis techniques for investigating a variety of systems-level biological relationships. As thousands of such datasets become available, they provide an opportunity to study high-level associations between cellular pathways and processes. This also allows the exploration of shared functional enrichments between diverse biological datasets, and it serves to direct experimenters to areas of low data coverage or with high probability of new discoveries.

Results: We analyze the functional structure of Saccharomyces cerevisiae datasets from over 950 publications in the context of over 140 biological processes. This includes a coverage analysis of biological processes given current high-throughput data, a data-driven map of associations between processes, and a measure of similar functional activity between genome-scale datasets. This uncovers subtle gene expression similarities in three otherwise disparate microarray datasets due to a shared strain background. We also provide several means of predicting areas of yeast biology likely to benefit from additional high-throughput experimental screens.

Availability: Predictions are provided in supplementary tables; software and additional data are available from the authors by request.

Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast

M. Brauer, C. Huttenhower, E. Airoldi, R. Rosenstein, J. Matese, D. Gresham, V. Boer, O. Troyanskaya, D. Botstein

We studied the relationship between growth rate and genome-wide gene expression, cell cycle progression, and glucose metabolism in 36 steady-state continuous cultures limited by one of six different nutrients (glucose, ammonium, sulfate, phosphate, uracil, or leucine). The expression of more than one quarter of all yeast genes is linearly correlated with growth rate, independent of the limiting nutrient. The subset of negatively growth-correlated genes is most enriched for peroxisomal functions, whereas positively correlated genes mainly encode ribosomal functions. Many (not all) genes associated with stress response are strongly correlated with growth rate, as are genes that are periodically expressed under conditions of metabolic cycling. We confirmed a linear relationship between growth rate and the fraction of the cell population in the G0/G1 cell cycle phase, independent of limiting nutrient. Cultures limited by auxotrophic requirements wasted excess glucose, whereas those limited on phosphate, sulfate, or ammonia did not; this phenomenon (reminiscent of the “Warburg effect” in cancer cells) was confirmed in batch cultures. Using an aggregate of gene expression values, we predict (in both continuous and batch cultures) an “instantaneous growth rate.” This concept is useful in interpreting the system-level connections among growth rate, metabolism, stress, and the cell cycle.
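
As a rough illustration of what "linearly correlated with growth rate" means in the abstract above, the sketch below fits an ordinary least-squares line to each gene's expression across cultures grown at different rates. The growth rates, gene names, and expression values are made up for illustration, and the paper's actual statistical procedure may differ.

```python
from statistics import mean


def linear_fit(x, y):
    """Least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Hypothetical growth rates (per hour) for six cultures under one limitation.
growth_rates = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]

# Made-up log-expression values; "geneA" and "geneB" are placeholder names.
expression = {
    "geneA": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],    # rises with growth rate
    "geneB": [0.9, 0.7, 0.5, 0.3, 0.1, -0.1],   # falls with growth rate
}

fits = {gene: linear_fit(growth_rates, values)
        for gene, values in expression.items()}
```

A gene with a positive slope and a correlation near 1 would count as positively growth-correlated in this toy setting; a negative slope with a correlation near -1 would count as negatively growth-correlated.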

Predicting Gene Function in a Hierarchical Context with an Ensemble of Classifiers

Y. Guan, C. Myers, D. Hess, Z. Barutcuoglu, A. Caudy, O. Troyanskaya

BACKGROUND:
The wide availability of genome-scale data for several organisms has stimulated interest in computational approaches to gene function prediction. Diverse machine learning methods have been applied to unicellular organisms with some success, but few have been extensively tested on higher level, multicellular organisms. A recent mouse function prediction project (MouseFunc) brought together nine bioinformatics teams applying a diverse array of methodologies to mount the first large-scale effort to predict gene function in the laboratory mouse.

RESULTS:
In this paper, we describe our contribution to this project, an ensemble framework based on the support vector machine that integrates diverse datasets in the context of the Gene Ontology hierarchy. We carry out a detailed analysis of the performance of our ensemble and provide insights into which methods work best under a variety of prediction scenarios. In addition, we applied our method to Saccharomyces cerevisiae and have experimentally confirmed functions for a novel mitochondrial protein.

CONCLUSION:
Our method consistently performs among the top methods in the MouseFunc evaluation. Furthermore, it exhibits good classification performance across a variety of cellular processes and functions in both a multicellular organism and a unicellular organism, indicating its ability to discover novel biology in diverse settings.

A Critical Assessment of Mus musculus Gene Function Prediction Using Integrated Genomic Evidence

L. Peña-Castillo, M. Tasan, C. Myers, H. Lee, T. Joshi, C. Zhang, Y. Guan, M. Leone, A. Pagnani, W. Kim, C. Krumpelman, W. Tian, G. Obozinski, Y. Qi, S. Mostafavi, G. Lin, G. Berriz, F. Gibbons, G. Lanckriet, J. Qiu, C. Grant, Z. Barutcuoglu, D. Hill, D. Warde-Farley, C. Grouios, D. Ray, J. Blake, M. Deng, M. Jordan, W. Noble, Q. Morris, J. Klein-Seetharaman, Z. Bar-Joseph, T. Chen, F. Sun, O. Troyanskaya, E. Marcotte, D. Xu, T. Hughes, F. Roth

BACKGROUND:
Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated.
RESULTS:
In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%.
CONCLUSION:
We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.
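
The "precision at a recall rate of 20%" figure quoted in the abstract above is a standard ranking metric. The sketch below shows one common way to compute it for a single GO term, assuming predictions are ranked by score and precision is read off at the first rank where the recall target is reached. The scores, labels, and function name are hypothetical, and the evaluation in the paper may have used a different convention (for example, interpolation).

```python
def precision_at_recall(scores, labels, recall_target):
    """Precision at the first rank where recall reaches `recall_target`.

    `scores` are prediction confidences; `labels` are 1 for genes truly
    annotated to the term and 0 otherwise. Ties keep their input order.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("no positive examples; recall is undefined")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        true_positives += is_positive
        if true_positives / total_positives >= recall_target:
            return true_positives / rank
    # Only reached if recall_target exceeds 1.0.
    return true_positives / len(ranked)


# Made-up scores and labels for ten genes and one hypothetical GO term.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]

p_at_20 = precision_at_recall(scores, labels, 0.2)
p_at_40 = precision_at_recall(scores, labels, 0.4)
p_at_100 = precision_at_recall(scores, labels, 1.0)
```

Averaging such per-term values over all evaluated GO terms would give a summary comparable in spirit to the 41% figure, under the assumptions stated above.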
