698 Publications

A Genomewide Functional Network for the Laboratory Mouse

Y. Guan, C. Myers, R. Lu, I. Lemischka, C. Bult, O. Troyanskaya

Establishing a functional network is invaluable to our understanding of gene function, pathways, and systems-level properties of an organism and can be a powerful resource in directing targeted experiments. In this study, we present a functional network for the laboratory mouse based on a Bayesian integration of diverse genetic and functional genomic data. The resulting network includes probabilistic functional linkages among 20,581 protein-coding genes. We show that this network can accurately predict novel functional assignments and network components and present experimental evidence for predictions related to Nanog homeobox (Nanog), a critical gene in mouse embryonic stem cell pluripotency. An analysis of the global topology of the mouse functional network reveals multiple biologically relevant systems-level features of the mouse proteome. Specifically, we identify the clustering coefficient as a critical characteristic of central modulators that affect diverse pathways as well as genes associated with different phenotype traits and diseases. In addition, a cross-species comparison of functional interactomes on a genomic scale revealed distinct functional characteristics of conserved neighborhoods as compared to subnetworks specific to higher organisms. Thus, our global functional network for the laboratory mouse provides the community with a key resource for discovering protein functions and novel pathway components as well as a tool for exploring systems-level topological and evolutionary features of cellular interactomes. To facilitate exploration of this network by the biomedical research community, we illustrate its application in function and disease gene discovery through an interactive, Web-based, publicly available interface at http://mouseNET.princeton.edu.
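The clustering coefficient mentioned in the abstract can be illustrated with a minimal sketch. It assumes the probabilistic linkages have already been thresholded into an unweighted, undirected graph; the gene names and edges below are hypothetical placeholders, not data from the mouse network.

```python
from itertools import combinations

def local_clustering(adjacency, node):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to check
    linked = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return 2.0 * linked / (k * (k - 1))

# Hypothetical toy network (assumption): symmetric adjacency sets.
toy_network = {
    "geneA": {"geneB", "geneC", "geneD"},
    "geneB": {"geneA", "geneC"},
    "geneC": {"geneA", "geneB"},
    "geneD": {"geneA"},
}

for gene in sorted(toy_network):
    print(gene, round(local_clustering(toy_network, gene), 3))
```

A node whose neighbors are densely interlinked scores near 1, while a hub that bridges otherwise unconnected neighbors scores near 0.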

September 26, 2008

The Sleipnir Library for Computational Functional Genomics

C. Huttenhower, M. Schroeder, M. Chikina

MOTIVATION:
Biological data generation has accelerated to the point where hundreds or thousands of whole-genome datasets of various types are available for many model organisms. This wealth of data can lead to valuable biological insights when analyzed in an integrated manner, but the computational challenge of managing such large data collections is substantial. In order to mine these data efficiently, it is necessary to develop methods that use storage, memory and processing resources carefully.
RESULTS:
The Sleipnir C++ library implements a variety of machine learning and data manipulation algorithms with a focus on heterogeneous data integration and efficiency for very large biological data collections. Sleipnir allows microarray processing, functional ontology mining, clustering, Bayesian learning and inference and support vector machine tasks to be performed for heterogeneous data on scales not previously practical. In addition to the library, which can easily be integrated into new computational systems, prebuilt tools are provided to perform a variety of common tasks. Many tools are multithreaded for parallelization in desktop or high-throughput computing environments, and most tasks can be performed in minutes for hundreds of datasets using a standard personal computer.
AVAILABILITY:
Source code (C++) and documentation are available at http://function.princeton.edu/sleipnir and compiled binaries are available from the authors on request.

Assessing the Functional Structure of Genomic Data

C. Huttenhower, O. Troyanskaya

Motivation: The availability of genome-scale data has enabled an abundance of novel analysis techniques for investigating a variety of systems-level biological relationships. As thousands of such datasets become available, they provide an opportunity to study high-level associations between cellular pathways and processes. This also allows the exploration of shared functional enrichments between diverse biological datasets, and it serves to direct experimenters to areas of low data coverage or with high probability of new discoveries.

Results: We analyze the functional structure of Saccharomyces cerevisiae datasets from over 950 publications in the context of over 140 biological processes. This includes a coverage analysis of biological processes given current high-throughput data, a data-driven map of associations between processes, and a measure of similar functional activity between genome-scale datasets. This uncovers subtle gene expression similarities in three otherwise disparate microarray datasets due to a shared strain background. We also provide several means of predicting areas of yeast biology likely to benefit from additional high-throughput experimental screens.

Availability: Predictions are provided in supplementary tables; software and additional data are available from the authors by request.

June 1, 2008

Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast

M. Brauer, C. Huttenhower, E. Airoldi, R. Rosenstein, J. Matese, D. Gresham, V. Boer, O. Troyanskaya, D. Botstein

We studied the relationship between growth rate and genome-wide gene expression, cell cycle progression, and glucose metabolism in 36 steady-state continuous cultures limited by one of six different nutrients (glucose, ammonium, sulfate, phosphate, uracil, or leucine). The expression of more than one quarter of all yeast genes is linearly correlated with growth rate, independent of the limiting nutrient. The subset of negatively growth-correlated genes is most enriched for peroxisomal functions, whereas positively correlated genes mainly encode ribosomal functions. Many (not all) genes associated with stress response are strongly correlated with growth rate, as are genes that are periodically expressed under conditions of metabolic cycling. We confirmed a linear relationship between growth rate and the fraction of the cell population in the G0/G1 cell cycle phase, independent of limiting nutrient. Cultures limited by auxotrophic requirements wasted excess glucose, whereas those limited on phosphate, sulfate, or ammonia did not; this phenomenon (reminiscent of the “Warburg effect” in cancer cells) was confirmed in batch cultures. Using an aggregate of gene expression values, we predict (in both continuous and batch cultures) an “instantaneous growth rate.” This concept is useful in interpreting the system-level connections among growth rate, metabolism, stress, and the cell cycle.

Predicting Gene Function in a Hierarchical Context with an Ensemble of Classifiers

Y. Guan, C. Myers, D. Hess, Z. Barutcuoglu, A. Caudy, O. Troyanskaya

BACKGROUND:
The wide availability of genome-scale data for several organisms has stimulated interest in computational approaches to gene function prediction. Diverse machine learning methods have been applied to unicellular organisms with some success, but few have been extensively tested on higher level, multicellular organisms. A recent mouse function prediction project (MouseFunc) brought together nine bioinformatics teams applying a diverse array of methodologies to mount the first large-scale effort to predict gene function in the laboratory mouse.

RESULTS:
In this paper, we describe our contribution to this project, an ensemble framework based on the support vector machine that integrates diverse datasets in the context of the Gene Ontology hierarchy. We carry out a detailed analysis of the performance of our ensemble and provide insights into which methods work best under a variety of prediction scenarios. In addition, we applied our method to Saccharomyces cerevisiae and have experimentally confirmed functions for a novel mitochondrial protein.

CONCLUSION:
Our method consistently performs among the top methods in the MouseFunc evaluation. Furthermore, it exhibits good classification performance across a variety of cellular processes and functions in both a multicellular organism and a unicellular organism, indicating its ability to discover novel biology in diverse settings.

A Critical Assessment of Mus musculus Gene Function Prediction Using Integrated Genomic Evidence

L. Peña-Castillo, M. Tasan, C. Myers, H. Lee, T. Joshi, C. Zhang, Y. Guan, M. Leone, A. Pagnani, W. Kim, C. Krumpelman, W. Tian, G. Obozinski, Y. Qi, S. Mostafavi, G. Lin, G. Berriz, F. Gibbons, G. Lanckriet, J. Qiu, C. Grant, Z. Barutcuoglu, D. Hill, D. Warde-Farley, C. Grouios, D. Ray, J. Blake, M. Deng, M. Jordan, W. Noble, Q. Morris, J. Klein-Seetharaman, Z. Bar-Joseph, T. Chen, F. Sun, O. Troyanskaya, E. Marcotte, D. Xu, T. Hughes, F. Roth

BACKGROUND:
Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated.
RESULTS:
In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%.
CONCLUSION:
We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allow predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.
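The "precision at a recall rate of 20%" figure quoted in the results can be made concrete with a small sketch. This is one common way to compute precision at a fixed recall from ranked predictions; it is an assumption that the evaluation used this exact definition, and the scores and labels below are invented for illustration.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision at the first rank cutoff whose recall reaches target_recall."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= target_recall:
            return true_pos / rank
    return true_pos / len(ranked)

# Hypothetical predictions for one GO term (assumption): 1 = gene has the term.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]

print(precision_at_recall(scores, labels, 0.2))
print(precision_at_recall(scores, labels, 0.5))
```

Averaging this value over many GO terms would give a summary in the same spirit as the figure reported above.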

Exploring the Functional Landscape of Gene Expression: Directed Search of Large Microarray Compendia

M. Hibbs, D. Hess, C. Myers, C. Huttenhower, K. Li, O. Troyanskaya

MOTIVATION:
The increasing availability of gene expression microarray technology has resulted in the publication of thousands of microarray gene expression datasets investigating various biological conditions. This vast repository is still underutilized due to the lack of methods for fast, accurate exploration of the entire compendium.

RESULTS:
We have collected Saccharomyces cerevisiae gene expression microarray data containing roughly 2400 experimental conditions. We analyzed the functional coverage of this collection and we designed a context-sensitive search algorithm for rapid exploration of the compendium. A researcher using our system provides a small set of query genes to establish a biological search context; based on this query, we weight each dataset's relevance to the context, and within these weighted datasets we identify additional genes that are co-expressed with the query set. Our method exhibits an average increase in accuracy of 273% compared to previous mega-clustering approaches when recapitulating known biology. Further, we find that our search paradigm identifies novel biological predictions that can be verified through further experimentation. Our methodology provides the ability for biological researchers to explore the totality of existing microarray data in a manner useful for drawing conclusions and formulating hypotheses, which we believe is invaluable for the research community.

AVAILABILITY:
Our query-driven search engine, called SPELL, is available at http://function.princeton.edu/SPELL.

SUPPLEMENTARY INFORMATION:
Several additional data files, figures and discussions are available at http://function.princeton.edu/SPELL/supplement.
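The query-driven idea described in the results (weight each dataset by its relevance to the query, then rank genes by weighted co-expression) can be sketched as follows. This is a simplified illustration and not the SPELL implementation: the choice of Pearson correlation, the zero floor on weights, and all gene names and expression values are assumptions.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def dataset_weight(dataset, query):
    """Mean pairwise correlation among query genes, floored at zero."""
    pairs = list(combinations([g for g in query if g in dataset], 2))
    if not pairs:
        return 0.0
    mean_r = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(mean_r, 0.0)

def rank_genes(datasets, query):
    """Rank non-query genes by weighted mean correlation to the query set."""
    totals, weight_sums = {}, {}
    for dataset in datasets:
        w = dataset_weight(dataset, query)
        present = [q for q in query if q in dataset]
        if w == 0.0 or not present:
            continue  # dataset carries no signal for this query context
        for gene, profile in dataset.items():
            if gene in query:
                continue
            r = sum(pearson(profile, dataset[q]) for q in present) / len(present)
            totals[gene] = totals.get(gene, 0.0) + w * r
            weight_sums[gene] = weight_sums.get(gene, 0.0) + w
    scores = {g: totals[g] / weight_sums[g] for g in totals}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy compendium (assumption): gene -> expression profile.
relevant = {"q1": [1, 2, 3, 4], "q2": [2, 4, 6, 8], "x": [1, 2, 3, 5], "y": [4, 3, 2, 1]}
irrelevant = {"q1": [1, 2, 3, 4], "q2": [4, 1, 3, 2], "x": [4, 3, 2, 1], "y": [1, 2, 3, 4]}

ranked = rank_genes([relevant, irrelevant], ["q1", "q2"])
print(ranked)
```

In this toy case the second dataset receives zero weight because the query genes disagree within it, so its misleading profiles for `x` and `y` do not affect the ranking.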

Context-Sensitive Data Integration and Prediction of Biological Networks

C. Myers, O. Troyanskaya

MOTIVATION:
Several recent methods have addressed the problem of heterogeneous data integration and network prediction by modeling the noise inherent in high-throughput genomic datasets, which can dramatically improve specificity and sensitivity and allow the robust integration of datasets with heterogeneous properties. However, experimental technologies capture different biological processes with varying degrees of success, and thus, each source of genomic data can vary in relevance depending on the biological process one is interested in predicting. Accounting for this variation can significantly improve network prediction, but to our knowledge, no previous approaches have explicitly leveraged this critical information about biological context.
RESULTS:
We confirm the presence of context-dependent variation in functional genomic data and propose a Bayesian approach for context-sensitive integration and query-based recovery of biological process-specific networks. By applying this method to Saccharomyces cerevisiae, we demonstrate that leveraging contextual information can significantly improve the precision of network predictions, including assignment for uncharacterized genes. We expect that this general context-sensitive approach can be applied to other organisms and prediction scenarios.
AVAILABILITY:
A software implementation of our approach is available on request from the authors.
SUPPLEMENTARY INFORMATION:
Supplementary data are available at http://avis.princeton.edu/contextPIXIE/

Nearest Neighbor Networks: Clustering Expression Data Based on Gene Neighborhoods

C. Huttenhower, A. Flamholz, J. Landis, S. Sahi, C. Myers, K. Olszewski, M. Hibbs, N. Siemers, O. Troyanskaya, H. Coller

Background
The availability of microarrays measuring thousands of genes simultaneously across hundreds of biological conditions represents an opportunity to understand both individual biological pathways and the integrated workings of the cell. However, translating this amount of data into biological insight remains a daunting task. An important initial step in the analysis of microarray data is clustering of genes with similar behavior. A number of classical techniques are commonly used to perform this task, particularly hierarchical and K-means clustering, and many novel approaches have been suggested recently. While these approaches are useful, they are not without drawbacks; these methods can find clusters in purely random data, and even clusters enriched for biological functions can be skewed towards a small number of processes (e.g. ribosomes).

Results
We developed Nearest Neighbor Networks (NNN), a graph-based algorithm to generate clusters of genes with similar expression profiles. This method produces clusters based on overlapping cliques within an interaction network generated from mutual nearest neighborhoods. This focus on nearest neighbors rather than on absolute distance measures allows us to capture clusters with high connectivity even when they are spatially separated, and requiring mutual nearest neighbors allows genes with no sufficiently similar partners to remain unclustered. We compared the clusters generated by NNN with those generated by eight other clustering methods. NNN was particularly successful at generating functionally coherent clusters with high precision, and these clusters generally represented a much broader selection of biological processes than those recovered by other methods.

Conclusion
The Nearest Neighbor Networks algorithm is a valuable clustering method that effectively groups genes that are likely to be functionally related. It is particularly attractive due to its simplicity, its success in the analysis of large datasets, and its ability to span a wide range of biological functions with high precision.
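The two ingredients named in the abstract, mutual nearest neighbors and overlapping cliques, can be illustrated with a rough sketch. It is not the published NNN algorithm: Euclidean distance stands in for an expression similarity measure, cliques are found by brute force at a fixed size, and the points and parameters are hypothetical.

```python
from itertools import combinations
from math import dist

def mutual_knn_graph(points, k):
    """Undirected edges between items that are in each other's k nearest neighbors."""
    names = list(points)
    nearest = {}
    for a in names:
        others = sorted((b for b in names if b != a),
                        key=lambda b: dist(points[a], points[b]))
        nearest[a] = set(others[:k])
    return {a: {b for b in nearest[a] if a in nearest[b]} for a in names}

def cliques_of_size(graph, size):
    """All fully connected groups of the given size (brute force, toy scale only)."""
    return [set(group) for group in combinations(sorted(graph), size)
            if all(b in graph[a] for a, b in combinations(group, 2))]

def merge_overlapping(groups):
    """Union groups that share at least one member."""
    clusters = []
    for group in groups:
        group = set(group)
        for c in [c for c in clusters if c & group]:
            group |= c
            clusters.remove(c)
        clusters.append(group)
    return clusters

# Hypothetical two-condition expression profiles (assumption).
points = {
    "g1": (0, 0), "g2": (0, 1), "g3": (1, 0),
    "g4": (10, 10), "g5": (10, 11), "g6": (11, 10),
    "g7": (50, 50),  # outlier with no mutual neighbors
}

graph = mutual_knn_graph(points, k=2)
clusters = merge_overlapping(cliques_of_size(graph, 3))
print(clusters)
```

Because edges require the neighbor relationship to hold in both directions, the outlier `g7` ends up with no edges and stays unclustered, which mirrors the behavior the abstract describes for genes without sufficiently similar partners.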

July 12, 2007