661 Publications

Finding Function: Evaluation Methods for Functional Genomic Data

C. Myers, D. Barrett, M. Hibbs, C. Huttenhower, O. Troyanskaya

BACKGROUND:
Accurate evaluation of the quality of genomic or proteomic data and computational methods is vital to our ability to use them for formulating novel biological hypotheses and directing further experiments. There is currently no standard approach to evaluation in functional genomics. Our analysis of existing approaches shows that they are inconsistent and contain substantial functional biases that render the resulting evaluations misleading both quantitatively and qualitatively. These problems make it essentially impossible to compare computational methods or large-scale experimental datasets and also result in conclusions that generalize poorly in most biological applications.
RESULTS:
We reveal issues with current evaluation methods here and suggest new approaches to evaluation that facilitate accurate and representative characterization of genomic methods and data. Specifically, we describe a functional genomics gold standard based on curation by expert biologists and demonstrate its use as an effective means of evaluation of genomic approaches. Our evaluation framework and gold standard are freely available to the community through our website.
CONCLUSION:
Proper methods for evaluating genomic data and computational approaches will determine how much we, as a community, are able to learn from the wealth of available data. We propose one possible solution to this problem here but emphasize that this topic warrants broader community discussion.


Visualization Methods for Statistical Analysis of Microarray Clusters

M. Hibbs, N. Dirksen, K. Li, O. Troyanskaya

BACKGROUND:
The most common method of identifying groups of functionally related genes in microarray data is to apply a clustering algorithm. However, it is impossible to determine which clustering algorithm is most appropriate to apply, and it is difficult to verify the results of any algorithm due to the lack of a gold standard. Appropriate data visualization tools can aid this analysis process, but existing visualization methods do not specifically address this issue.
RESULTS:
We present several visualization techniques that incorporate meaningful, noise-robust statistics for analyzing the results of clustering algorithms on microarray data. These include a rank-based visualization method that is more robust to noise, a difference display method to aid assessment of cluster quality and detection of outliers, and a projection of high-dimensional data into a three-dimensional space in order to examine relationships between clusters. Our methods are interactive and dynamically linked together for comprehensive analysis. Further, our approach applies to both protein and gene expression microarrays, and our architecture is scalable for use on both desktop/laptop screens and large-scale display devices. This methodology is implemented in GeneVAnD (Genomic Visual ANalysis of Datasets) and is available at http://function.princeton.edu/GeneVAnD.
CONCLUSION:
Incorporating relevant statistical information into data visualizations is key for analysis of large biological datasets, particularly because of high levels of noise and the lack of a gold standard for comparisons. We developed several new visualization techniques and demonstrated their effectiveness for evaluating cluster quality and relationships between clusters.
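
To illustrate why a rank-based view can be more robust to noise, here is a minimal sketch. It is not the GeneVAnD implementation; the function name `rank_transform` and the toy values are assumptions made for illustration. It replaces each measurement in a profile with its rank, so that a single extreme value cannot dominate the display scale.

```python
def rank_transform(values):
    """Replace each value by its 1-based rank; tied values share the average rank.

    Hypothetical helper for illustration only, not taken from GeneVAnD.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of values tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


# Toy expression profile (made-up numbers) with and without a noisy outlier.
clean_profile = [0.1, 0.5, 0.9, 1.2]
noisy_profile = [0.1, 0.5, 0.9, 40.0]

clean_ranks = rank_transform(clean_profile)
noisy_ranks = rank_transform(noisy_profile)
```

Both profiles map to the same ranks, so the outlier in the second profile changes nothing in a rank-based display, whereas it would compress the other three values on a linear color scale.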


Global Analysis of Gene Function in Yeast by Quantitative Phenotypic Profiling

We present a method for the global analysis of the function of genes in budding yeast based on hierarchical clustering of the quantitative sensitivity profiles of the 4756 strains with individual homozygous deletion of nonessential genes to a broad range of cytotoxic or cytostatic agents. This method is superior to other global methods of identifying the function of genes involved in the various DNA repair and damage checkpoint pathways as well as other interrogated functions. Analysis of the phenotypic profiles of the 51 diverse treatments places a total of 860 genes of unknown function in clusters with genes of known function. We demonstrate that this can not only identify the function of unknown genes but can also suggest the mechanism of action of the agents used. This method will be useful when used alone and in conjunction with other global approaches to identify gene function in yeast.
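
The clustering idea can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the distance measure (one minus Pearson correlation), the average-linkage rule, the strain names, and the sensitivity values are all assumptions chosen for the example.

```python
from itertools import combinations
from statistics import mean


def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 1.0  # assumption: treat a flat profile as uncorrelated
    return 1.0 - cov / (vx * vy) ** 0.5


def average_linkage(profiles, n_clusters):
    """Naive agglomerative clustering; merges the closest pair until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def cluster_distance(c1, c2):
        # Average pairwise distance between members of the two clusters.
        return mean(pearson_distance(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Made-up sensitivity scores of four hypothetical deletion strains to four agents.
profiles = {
    "repair_known_1": [0.9, 0.8, 0.1, 0.2],
    "repair_known_2": [0.8, 0.9, 0.2, 0.1],
    "unknown_orf": [0.85, 0.75, 0.15, 0.25],
    "transport_known": [0.1, 0.2, 0.9, 0.8],
}
clusters = average_linkage(profiles, n_clusters=2)
```

In this toy input the uncharacterized strain falls into the cluster with the two hypothetical repair strains, which is the kind of co-clustering the abstract uses to suggest a function for a gene of unknown function. For thousands of strains a library routine would be preferable to this quadratic-per-merge loop.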

January 17, 2006

Hierarchical Multi-Label Prediction of Gene Function

Z. Barutcuoglu, R. Schapire, O. Troyanskaya

Abstract
Motivation: Assigning functions for unknown genes based on diverse large-scale data is a key task in functional genomics. Previous work on gene function prediction has addressed this problem using independent classifiers for each function. However, such an approach ignores the structure of functional class taxonomies, such as the Gene Ontology (GO). Over a hierarchy of functional classes, a group of independent classifiers where each one predicts gene membership to a particular class can produce a hierarchically inconsistent set of predictions, where for a given gene a specific class may be predicted positive while its inclusive parent class is predicted negative. Taking the hierarchical structure into account resolves such inconsistencies and provides an opportunity for leveraging all classifiers in the hierarchy to achieve higher specificity of predictions.
Results: We developed a Bayesian framework for combining multiple classifiers based on the functional taxonomy constraints. Using a hierarchy of support vector machine (SVM) classifiers trained on multiple data types, we combined predictions in our Bayesian framework to obtain the most probable consistent set of predictions. Experiments show that over a 105-node subhierarchy of the GO, our Bayesian framework improves predictions for 93 nodes. As an additional benefit, our method also provides implicit calibration of SVM margin outputs to probabilities. Using this method, we make function predictions for multiple proteins, and experimentally confirm predictions for proteins involved in mitosis.
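
The inconsistency problem described above can be made concrete with a small sketch. This is not the paper's Bayesian network: it assumes a tree-shaped hierarchy (the GO itself is a directed acyclic graph), treats each classifier output as an independent, already calibrated probability, and uses invented class probabilities. The function names are hypothetical.

```python
import math


def find_inconsistencies(parent, labels):
    """List classes predicted positive whose parent class is predicted negative."""
    return [
        node for node, label in labels.items()
        if label == 1 and parent.get(node) is not None and labels[parent[node]] == 0
    ]


def most_probable_consistent(parent, prob, eps=1e-9):
    """Pick the 0/1 labeling that maximizes the product of per-class probabilities,
    subject to the rule that a class can be positive only if its parent is positive.
    Simplified stand-in for illustration; assumes each class has at most one parent.
    """
    p = {n: min(max(v, eps), 1 - eps) for n, v in prob.items()}  # avoid log(0)
    children = {n: [] for n in p}
    roots = []
    for n in p:
        if parent.get(n) is None:
            roots.append(n)
        else:
            children[parent[n]].append(n)

    def negative_score(n):
        # Log-probability of labeling n and all of its descendants negative.
        return math.log(1 - p[n]) + sum(negative_score(c) for c in children[n])

    def mark_negative(n, labels):
        labels[n] = 0
        for c in children[n]:
            mark_negative(c, labels)

    def solve(n):
        # Best log-score and labels for the subtree at n when n may be 0 or 1.
        pos_score, pos_labels = math.log(p[n]), {n: 1}
        for c in children[n]:
            score, labels = solve(c)
            pos_score += score
            pos_labels.update(labels)
        neg_score = negative_score(n)
        if pos_score >= neg_score:
            return pos_score, pos_labels
        neg_labels = {}
        mark_negative(n, neg_labels)
        return neg_score, neg_labels

    result = {}
    for root in roots:
        result.update(solve(root)[1])
    return result


# Invented three-level hierarchy and classifier outputs.
parent = {"cell cycle": None, "mitosis": "cell cycle", "chromosome segregation": "mitosis"}
prob = {"cell cycle": 0.4, "mitosis": 0.9, "chromosome segregation": 0.8}

independent = {n: int(v >= 0.5) for n, v in prob.items()}  # threshold each class alone
conflicts = find_inconsistencies(parent, independent)
consistent = most_probable_consistent(parent, prob)
```

Thresholding each class independently marks "mitosis" positive while its parent "cell cycle" is negative. The joint labeling instead lets the strong evidence at the two specific classes pull the weakly scored parent to positive, which loosely mirrors how the abstract describes using all classifiers in the hierarchy.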


Discovery of Biological Networks from Diverse Functional Genomic Data

C. Myers, D. Robson, A. Wible, M. Hibbs, C. Chiriac, C. Theesfeld, K. Dolinski, O. Troyanskaya

We have developed a general probabilistic system for query-based discovery of pathway-specific networks through integration of diverse genome-wide data. This framework was validated by accurately recovering known networks for 31 biological processes in Saccharomyces cerevisiae and experimentally verifying predictions for the process of chromosomal segregation. Our system, bioPIXIE, a public, comprehensive system for integration, analysis, and visualization of biological network predictions for S. cerevisiae, is freely accessible over the World Wide Web.

December 19, 2005

Putting the ‘Bio’ into Bioinformatics

A report on the 13th Annual Conference on Intelligent Systems for Molecular Biology (ISMB), Detroit, USA, 25-29 June 2005.

The annual meeting on computational methods for molecular biology brought together 1,731 attendees and covered a diversity of topics from sequence analysis and text mining to structural bioinformatics and pathway prediction. This year saw an increased emphasis on the biological problems that bioinformatic methods are being developed to solve; in addition to many novel developments in traditional areas of bioinformatics, a substantial number of talks focused on integrative approaches, pathway analysis, and comparative genomics. Also on the menu this year were ways of making bioinformatic methods more 'data-centric' and how to make new technologies easily accessible to biologists.

September 29, 2005

Tools and Applications for Large-Scale Display Walls

G. Wallace, O. Anshus, P. Bi, H. Chen, HY. Chen, D. Clark, P. Cook, A. Finkelstein, T. Funkhouser, A. Gupta, M. Hibbs, K. Li, Z. Liu, R. Samanta, R. Sukthankar, O. Troyanskaya

Increased processor and storage capacities have supported the computational sciences, but have simultaneously unleashed a data avalanche on the scientific community. As a result, scientific research is limited by data analysis and visualization capabilities. These new bottlenecks have been the driving motivation behind the Princeton scalable display wall project. To create a scalable and easy-to-use large-format display system for collaborative visualization, the authors have developed various techniques, software tools, and applications.


Putting Microarrays in a Context: Integrated Analysis of Diverse Biological Data

In recent years, multiple types of high-throughput functional genomic data that facilitate rapid functional annotation of sequenced genomes have become available. Gene expression microarrays are the most commonly available source of such data. However, genomic data often sacrifice specificity for scale, yielding very large quantities of relatively lower-quality data than traditional experimental methods. Thus sophisticated analysis methods are necessary to make accurate functional interpretation of these large-scale data sets. This review presents an overview of recently developed methods that integrate the analysis of microarray data with sequence, interaction, localisation and literature data, and further outlines current challenges in the field. The focus of this review is on the use of such methods for gene function prediction, understanding of protein regulation and modelling of biological networks.


Visualization-Based Discovery and Analysis of Genomic Aberrations in Microarray Data

C. Myers, X. Chen, O. Troyanskaya

Background
Chromosomal copy number changes (aneuploidies) play a key role in cancer progression and molecular evolution. These copy number changes can be studied using microarray-based comparative genomic hybridization (array CGH) or gene expression microarrays. However, accurate identification of amplified or deleted regions requires a combination of visual and computational analysis of these microarray data.

Results
We have developed ChARMView, a visualization and analysis system for guided discovery of chromosomal abnormalities from microarray data. Our system facilitates manual or automated discovery of aneuploidies through dynamic visualization and integrated statistical analysis. ChARMView can be used with array CGH and gene expression microarray data, and multiple experiments can be viewed and analyzed simultaneously.

Conclusion
ChARMView is an effective and accurate visualization and analysis system for recognizing even small aneuploidies or subtle expression biases, identifying recurring aberrations in sets of experiments, and pinpointing functionally relevant copy number changes. ChARMView is freely available under the GNU GPL at http://function.princeton.edu/ChARMView.

December 21, 2004

Accurate Detection of Aneuploidies in Array CGH and Gene Expression Microarray Data

C. Myers, M. Dunham, S. Kung, O. Troyanskaya

MOTIVATION:
Chromosomal copy number changes (aneuploidies) are common in cell populations that undergo multiple cell divisions, including yeast strains, cell lines and tumor cells. Identification of aneuploidies is critical in evolutionary studies, where changes in copy number serve an adaptive purpose, as well as in cancer studies, where amplifications and deletions of chromosomal regions have been identified as a major pathogenetic mechanism. Aneuploidies can be studied at the whole-genome level using array CGH (a microarray-based method that measures DNA content), but their presence also affects gene expression. In gene expression microarray analysis, identification of copy number changes is especially important in preventing aberrant biological conclusions based on spurious gene expression correlation or masked phenotypes that arise due to aneuploidies. Previously suggested approaches for aneuploidy detection from microarray data mostly focus on array CGH, address only whole-chromosome or whole-arm copy number changes, and rely on thresholds or other heuristics, making them unsuitable for fully automated general application to gene expression datasets. There is a need for a general and robust method for identification of aneuploidies of any size from both array CGH and gene expression microarray data.
RESULTS:
We present ChARM (Chromosomal Aberration Region Miner), a robust and accurate expectation-maximization based method for identification of segmental aneuploidies (partial chromosome changes) from gene expression and array CGH microarray data. Systematic evaluation of the algorithm on synthetic and biological data shows that the method is robust to noise, aneuploidal segment size and P-value cutoff. Using our approach, we identify known chromosomal changes and predict novel potential segmental aneuploidies in commonly used yeast deletion strains and in breast cancer. ChARM can be routinely used to identify aneuploidies in array CGH datasets and to screen gene expression data for aneuploidies or array biases. Our methodology is sensitive enough to detect statistically significant and biologically relevant aneuploidies even when expression or DNA content changes are subtle as in mixed populations of cells.
AVAILABILITY:
Code is available by request from the authors and in the Web supplement at http://function.cs.princeton.edu/ChARM/

December 12, 2004