Modeling Complex Genetic Interactions in a Simple Eukaryotic Genome: Actin Displays a Rich Spectrum of Complex Haploinsufficiencies
Multigenic influences are major contributors to human genetic disorders. Since humans are highly polymorphic, there is a large number of possible detrimental, multiallelic gene pairs. The actin cytoskeleton of yeast was used to determine the potential for deleterious bigenic interactions; approximately 4800 complex hemizygote strains were constructed between an actin-null allele and the nonessential gene deletion collection. We found 208 genes that have deleterious complex haploinsufficient (CHI) interactions with actin. This set is enriched for genes with Gene Ontology terms shared with actin, including several actin-binding protein genes, and nearly half of the CHI genes have defects in actin organization when deleted. Interactions were frequently seen with genes for multiple components of a complex or with genes involved in the same function. For example, many of the genes for the large ribosomal subunit (RPLs) were CHI with act1Δ and had actin organization defects when deleted. This was generally true of only one RPL paralog of apparently duplicate genes, suggesting functional specialization between ribosomal genes. In many cases, CHI interactions could be attributed to localized defects on the actin protein. Spatial congruence in these data suggests that the loss of binding to specific actin-binding proteins causes subsets of CHI interactions.
Gene expression microarrays are becoming increasingly widespread, especially as a way to rapidly identify putative functions of unknown genes. Accurate microarray data analysis, however, remains a challenge. The recent availability of multiple types of high-throughput functional genomic data can facilitate accurate and effective analysis of microarray experiments and thereby accelerate functional annotation of sequenced genomes. But genomic data often sacrifice specificity for scale, yielding very large quantities of data that are of lower quality than data from traditional experimental methods. Advanced analysis methods are thus necessary to make accurate functional interpretations of these large-scale datasets. This chapter outlines recently developed methods that integrate the analysis of microarray data with sequence, interaction, localization, and literature data and further outlines specific problems in currently available integrated analysis technologies.
Accurate evaluation of the quality of genomic or proteomic data and computational methods is vital to our ability to use them for formulating novel biological hypotheses and directing further experiments. There is currently no standard approach to evaluation in functional genomics. Our analysis of existing approaches shows that they are inconsistent and contain substantial functional biases that render the resulting evaluations misleading both quantitatively and qualitatively. These problems make it essentially impossible to compare computational methods or large-scale experimental datasets and also result in conclusions that generalize poorly in most biological applications.
We reveal issues with current evaluation methods here and suggest new approaches to evaluation that facilitate accurate and representative characterization of genomic methods and data. Specifically, we describe a functional genomics gold standard based on curation by expert biologists and demonstrate its use as an effective means of evaluation of genomic approaches. Our evaluation framework and gold standard are freely available to the community through our website.
Proper methods for evaluating genomic data and computational approaches will determine how much we, as a community, are able to learn from the wealth of available data. We propose one possible solution to this problem here but emphasize that this topic warrants broader community discussion.
The most common method of identifying groups of functionally related genes in microarray data is to apply a clustering algorithm. However, it is impossible to determine which clustering algorithm is most appropriate to apply, and it is difficult to verify the results of any algorithm because no gold standard exists. Appropriate data visualization tools can aid this analysis process, but existing visualization methods do not specifically address this issue.
We present several visualization techniques that incorporate meaningful, noise-robust statistics for analyzing the results of clustering algorithms on microarray data. These include a rank-based visualization method that is more robust to noise, a difference display method to aid assessment of cluster quality and detection of outliers, and a projection of high-dimensional data into a three-dimensional space in order to examine relationships between clusters. Our methods are interactive and are dynamically linked together for comprehensive analysis. Further, our approach applies to both protein and gene expression microarrays, and our architecture is scalable for use on both desktop/laptop screens and large-scale display devices. This methodology is implemented in GeneVAnD (Genomic Visual ANalysis of Datasets) and is available at http://function.princeton.edu/GeneVAnD.
Incorporating relevant statistical information into data visualizations is key for analysis of large biological datasets, particularly because of high levels of noise and the lack of a gold standard for comparisons. We developed several new visualization techniques and demonstrated their effectiveness for evaluating cluster quality and relationships between clusters.
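The intuition behind a rank-based display can be sketched in a few lines. The example below is only an illustration under stated assumptions: it is not the GeneVAnD implementation, and the function name `rank_transform` and the choice of average ranks for ties are hypothetical. It replaces each measurement in an expression profile with its rank, so that a single extreme value cannot dominate the color scale of a display.

```python
from typing import List, Sequence


def rank_transform(values: Sequence[float]) -> List[float]:
    """Replace each value with its 1-based rank; tied values share the average rank.

    Hypothetical helper for illustration; not taken from GeneVAnD.
    """
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Average of the 1-based positions start+1 .. end+1.
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


# Toy profile (made-up numbers): one noisy outlier among modest values.
profile = [0.5, 100.0, -2.0, 0.5]
profile_ranks = rank_transform(profile)
```

Because only the ordering of the values matters, changing the outlier from 100.0 to 1000.0 leaves the ranks unchanged, which is the sense in which a rank-based view is less sensitive to noise than a view of the raw values.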
We present a method for the global analysis of the function of genes in budding yeast based on hierarchical clustering of the quantitative sensitivity profiles of the 4756 strains with individual homozygous deletion of nonessential genes to a broad range of cytotoxic or cytostatic agents. This method is superior to other global methods of identifying the function of genes involved in the various DNA repair and damage checkpoint pathways as well as other interrogated functions. Analysis of the phenotypic profiles of the 51 diverse treatments places a total of 860 genes of unknown function in clusters with genes of known function. We demonstrate that this can not only identify the function of unknown genes but can also suggest the mechanism of action of the agents used. This method will be useful when used alone and in conjunction with other global approaches to identify gene function in yeast.
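As a rough sketch of what hierarchical clustering of sensitivity profiles involves, the example below groups a handful of toy profiles by agglomerative clustering. The abstract does not state the distance measure or linkage rule, so the Pearson correlation distance and average linkage used here are assumptions, as are the function names, the strain labels, and the numbers, all of which are made up for illustration.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - Pearson correlation; 0 means identical profile shape."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    # A flat profile has no defined correlation; treat it as uncorrelated.
    return 1.0 - cov / norm if norm > 0 else 1.0


def average_linkage_clusters(
    profiles: Dict[str, Sequence[float]], n_clusters: int
) -> List[List[str]]:
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters: List[List[str]] = [[name] for name in sorted(profiles)]

    def linkage(pair: Tuple[int, int]) -> float:
        # Average pairwise distance between members of the two clusters.
        left, right = pair
        dists = [
            pearson_distance(profiles[g], profiles[h])
            for g in clusters[left]
            for h in clusters[right]
        ]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        left, right = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[left] = sorted(clusters[left] + clusters[right])
        del clusters[right]  # safe because left < right
    return sorted(clusters)


# Toy sensitivity scores of four hypothetical deletion strains to four agents.
toy_profiles = {
    "strain_a": [1.0, 2.0, 3.0, 4.0],
    "strain_b": [2.0, 4.0, 6.0, 8.5],
    "strain_c": [4.0, 3.0, 2.0, 1.0],
    "strain_d": [8.0, 6.0, 4.5, 2.0],
}
toy_clusters = average_linkage_clusters(toy_profiles, n_clusters=2)
```

In this toy case, strains whose sensitivities rise and fall together across agents end up in the same cluster, which is the basis for inferring the function of an uncharacterized gene from the known genes it clusters with. The pairwise search is quadratic per merge and is meant for illustration only, not for thousands of strains.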
Motivation: Assigning functions for unknown genes based on diverse large-scale data is a key task in functional genomics. Previous work on gene function prediction has addressed this problem using independent classifiers for each function. However, such an approach ignores the structure of functional class taxonomies, such as the Gene Ontology (GO). Over a hierarchy of functional classes, a group of independent classifiers where each one predicts gene membership to a particular class can produce a hierarchically inconsistent set of predictions, where for a given gene a specific class may be predicted positive while its inclusive parent class is predicted negative. Taking the hierarchical structure into account resolves such inconsistencies and provides an opportunity for leveraging all classifiers in the hierarchy to achieve higher specificity of predictions.
Results: We developed a Bayesian framework for combining multiple classifiers based on the functional taxonomy constraints. Using a hierarchy of support vector machine (SVM) classifiers trained on multiple data types, we combined predictions in our Bayesian framework to obtain the most probable consistent set of predictions. Experiments show that over a 105-node subhierarchy of the GO, our Bayesian framework improves predictions for 93 nodes. As an additional benefit, our method also provides implicit calibration of SVM margin outputs to probabilities. Using this method, we make function predictions for multiple proteins, and experimentally confirm predictions for proteins involved in mitosis.
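The kind of hierarchical inconsistency that motivates this work can be made concrete with a small check. The sketch below is not the Bayesian framework described above; it only detects cases in which independent classifiers call a class positive while one of its parent classes is called negative. The function name, the three-term hierarchy, and the predictions are hypothetical placeholders rather than real GO identifiers or results.

```python
from typing import Dict, List, Tuple


def find_inconsistencies(
    parents_of: Dict[str, List[str]], predicted_positive: Dict[str, bool]
) -> List[Tuple[str, str]]:
    """Return (child, parent) pairs where the child is positive but the parent is negative."""
    violations = []
    for term, parents in parents_of.items():
        for parent in parents:
            if predicted_positive[term] and not predicted_positive[parent]:
                violations.append((term, parent))
    return sorted(violations)


# Hypothetical three-level hierarchy; each term lists its direct parents,
# which allows for terms with several parents as in a directed acyclic graph.
toy_hierarchy = {
    "process": [],
    "cell_cycle": ["process"],
    "mitosis": ["cell_cycle"],
}

# Made-up outputs of independent per-class classifiers for one gene.
toy_predictions = {"process": True, "cell_cycle": False, "mitosis": True}

toy_violations = find_inconsistencies(toy_hierarchy, toy_predictions)
```

Here the gene is predicted to belong to the specific class but not to its more general parent, so the pair is reported as a violation. Resolving such violations by finding the most probable consistent assignment is the role of the Bayesian framework and is beyond this sketch.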
We have developed a general probabilistic system for query-based discovery of pathway-specific networks through integration of diverse genome-wide data. This framework was validated by accurately recovering known networks for 31 biological processes in Saccharomyces cerevisiae and experimentally verifying predictions for the process of chromosomal segregation. Our system, bioPIXIE, a public, comprehensive system for integration, analysis, and visualization of biological network predictions for S. cerevisiae, is freely accessible over the World Wide Web.
A report on the 13th Annual Conference on Intelligent Systems for Molecular Biology (ISMB), Detroit, USA, 25-29 June 2005.
The annual meeting on computational methods for molecular biology brought together 1,731 attendees and covered a diversity of topics from sequence analysis and text mining to structural bioinformatics and pathway prediction. This year saw an increased emphasis on the biological problems that bioinformatic methods are being developed to solve; in addition to many novel developments in traditional areas of bioinformatics, a substantial number of talks focused on integrative approaches, pathway analysis, and comparative genomics. Also on the menu this year were ways of making bioinformatic methods more 'data-centric' and how to make new technologies easily accessible to biologists.
Increased processor and storage capacities have supported the computational sciences, but have simultaneously unleashed a data avalanche on the scientific community. As a result, scientific research is limited by data analysis and visualization capabilities. These new bottlenecks have been the driving motivation behind the Princeton scalable display wall project. To create a scalable and easy-to-use large-format display system for collaborative visualization, the authors have developed various techniques, software tools, and applications.