In the symphony of life, if DNA is the conductor, proteins make the music. By controlling everything from the contraction of a muscle before a leap to the digestion of a bite of food, to the expansion of a pupil in the darkness, proteins drive the actions that define behavior, health and the ability of a species to propagate over generations. Though the music they create is critical to every aspect of life, the ‘how’ remains a mystery: For the majority of proteins, a complete picture of how they function remains unknown.
A key reason for this enduring mystery is that even though high-throughput genomics experiments can quickly call out associations between proteins and disease, drilling into the detailed operations of one protein still requires hundreds of lab experiments — and it may still provide an incomplete picture. Most assays are only able to identify a single protein function. And most proteins have many.
“Answering biological questions around protein function requires knowing a huge amount of context: proteins in the context of other proteins and organisms in the context of different species,” says Meet Barot, a doctoral student at the Center for Data Science at New York University. “This multimodal nature of protein function is part of what makes the problem so challenging.”
The genomics revolution has further complicated matters. “Lab experiments that can help determine protein function cannot keep pace with the high-throughput technologies discovering the genes,” says Richard Bonneau, a group leader for systems biology at the Flatiron Institute’s Center for Computational Biology (CCB) in New York City.
To avoid the complexity of deducing a protein’s function directly, scientists have come to increasingly rely on protein function prediction. This comprises a host of computational methods that aim to leverage what we do know about protein functions and extend this knowledge to proteins with unknown roles.
New software called NetQuilt is making important inroads into the field of protein function prediction. With a deep learning approach that mines different kinds of protein data from multiple species, NetQuilt carefully stitches patches of information together, like squares on a quilt, to robustly predict what proteins do. Developed by Bonneau and Barot, along with Vladimir Gligorijević, a research scientist at the CCB, and Kyunghyun Cho of NYU, NetQuilt is the first such method to integrate information about what proteins are made of and how they interact with each other, across different species.
The details of the software, published in February in Bioinformatics, stand to radically improve the quality of protein function prediction, in part by unveiling aspects of how proteins interact that have been invisible to researchers until now. “Proteins play an important role in many complex diseases. Knowing protein functions is the first step in designing successful drugs and therapies,” says Gligorijević.
A computational approach to the protein function problem
Determining protein function experimentally is challenging, which places a premium on computation. The first computational methods to predict protein function relied on straightforward comparison. Algorithms would take the amino acid sequence of a protein with unknown function and compare it to sequences of proteins with known functions. A match was taken to mean that the two proteins had similar functions. Yet, as high-throughput methods increased the number of available sequences to compare, problems arose, wrote Iddo Friedberg, an associate professor of veterinary microbiology and preventive medicine at Iowa State University, in a 2006 paper. Errors stemming from the assumption that sequence similarity means function similarity bubbled to the surface. As the diversity of sequences grew, the statistical power in connecting sequence similarity to function dimmed. “Errors in protein function attribution based on sequence similarity propagated throughout databases and remains a problem today,” says Friedberg. In response to this growing problem, scientists began to supplement sequences with new data sources, like networks that show how proteins interact with each other.
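The comparison-based approach described above can be sketched in a few lines of Python. This is a toy illustration, not the method of any particular tool: real pipelines typically rely on sequence alignment, whereas this sketch substitutes a simple overlap score between short subsequences (k-mers). The function names, the threshold and the sequences and labels are all made up for the example.

```python
def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in an amino acid sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def transfer_function(query, annotated, k=3, threshold=0.5):
    """Copy the label of the most similar annotated sequence, if similar enough.

    Returns (label, score); label is None when no sequence clears the threshold.
    """
    q = kmer_set(query, k)
    best_label, best_score = None, 0.0
    for seq, label in annotated.items():
        score = jaccard(q, kmer_set(seq, k))
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score


# Hypothetical sequences and labels, for illustration only.
annotated = {
    "MKTAYIAKQRQISFVK": "kinase",
    "GSHMLEDPVVLQRRDW": "transporter",
}
label, score = transfer_function("MKTAYIAKQRQISFVR", annotated)
```

The weakness Friedberg describes is visible in the design: the label is copied whenever the score clears a threshold, so one wrong annotation in the reference set is passed on to every similar query.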
Broadly speaking, three sources of information reveal pieces of the protein function puzzle: sequence, structure and protein-protein interactions (PPIs). These different data sources complement each other. For example, a sequence can reveal that a protein acts as a catalyst in a chemical reaction, while the PPI network can reveal additional context in which the reaction takes place — where in the cell it occurs, and what other molecules are involved. “We see that sequence-based methods tend to work better for molecular function,” says Friedberg, whereas other approaches better illuminate biological pathways. “Any method that is single-source will inherently miss a lot, especially since sequence and protein-protein interactions are the biggest predictors of function,” adds Barot.
A software program called deepNF, built in 2018, sought to address these challenges. Developed by Barot, Gligorijević and Bonneau, deepNF predicted protein function by integrating different types of protein network information, but the software was limited to a single species. “We thought, wouldn’t it be better to extend this to multiple species?” says Barot. “We were always limited by the single-organism approach,” adds Gligorijević. “Say you have an organism that’s not well characterized. If you can integrate networks from similar species into the analysis, that would be a huge advantage.” With the goal of integrating sequence and PPI data from different species into one protein function prediction method, the idea for NetQuilt was born.
Setting a new standard for integrating diverse sources of data
Combining PPI networks from different species was far more complicated than just building one network onto another. “PPI networks are so different,” explains Bonneau. “In humans, you have 20,000 genes. In yeast, you have 8,000. Each network has a totally different topology.” Gligorijević adds, “The challenge was, how do you put all of the networks into the same space?”
As a start, the scientists applied an alignment algorithm called IsoRank, developed in 2008 in part by Bonnie Berger, a professor of mathematics, engineering and computer science at the Massachusetts Institute of Technology. The algorithm allowed the researchers to align both protein sequence and PPI network information from different species. “We make an N × N quilt, where N is the number of proteins,” explains Barot. Integrating information across about 10 species yielded a space with about 140,000 proteins, a sevenfold increase over the number of proteins in humans. Network patches were stitched together in ways that maximized meaningful similarities between datasets of different species. Using both sequence and network information was especially important in obtaining a comprehensive picture of protein function. In cases where sequence similarity was limited, networks could add important information, and vice versa.
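One way to picture the N × N quilt is as a block matrix: interactions within each species fill the patches along the diagonal, and cross-species similarity scores fill the patches that connect them. The Python sketch below assumes that layout for illustration. It is not NetQuilt's code and does not implement IsoRank; the cross-species scores are simply supplied as input, and every name and number in it is hypothetical.

```python
def build_quilt(networks, cross_scores):
    """Assemble one N x N matrix from per-species networks and cross-species scores.

    networks: {species: (protein_ids, set of (id_a, id_b) interactions)}
    cross_scores: {(id_a, id_b): similarity between proteins of different species}
    """
    proteins = [p for ids, _ in networks.values() for p in ids]
    index = {p: i for i, p in enumerate(proteins)}
    n = len(proteins)
    quilt = [[0.0] * n for _ in range(n)]

    # Diagonal patches: interactions observed within each species.
    for _, edges in networks.values():
        for a, b in edges:
            quilt[index[a]][index[b]] = 1.0
            quilt[index[b]][index[a]] = 1.0

    # Off-diagonal patches: alignment-style scores linking proteins across species.
    for (a, b), score in cross_scores.items():
        quilt[index[a]][index[b]] = score
        quilt[index[b]][index[a]] = score

    return proteins, quilt


# Hypothetical toy data: three "human" and two "yeast" proteins.
networks = {
    "human": (["h1", "h2", "h3"], {("h1", "h2")}),
    "yeast": (["y1", "y2"], {("y1", "y2")}),
}
cross_scores = {("h1", "y1"): 0.9, ("h2", "y2"): 0.7}
proteins, quilt = build_quilt(networks, cross_scores)
```

Each row of the resulting matrix describes one protein in terms of all proteins from all species, which is what lets a single model train on every organism at once.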
This ‘meta-network’ of proteins, built from sequence and PPI network information from multiple species, was then used as an input to train the deep learning algorithms to predict function, in a unique and direct way. Other methods will typically cluster proteins, learn which proteins are similar to others, and then predict function and interactions. “Instead, in NetQuilt, Meet and Vlad said, ‘Let’s directly predict all of the functions.’ They connected the map directly to [protein] function prediction,” says Bonneau.
Specifically, the algorithm selected functions from Gene Ontology (GO), a repository of terms that is the gold standard among biologists for representing protein function. Some GO terms are well represented in the repository. In the human genome, for example, the term ‘monosaccharide binding,’ a molecular function, describes 70 distinct genes. Other terms appear rarely, like the biological process term ‘positive regulation of optic nerve formation,’ which describes just one human gene. The power of the deep learning algorithm used by NetQuilt increases with the number of training examples, or proteins, the algorithm can learn from. Increasing the training examples to include proteins from multiple species allowed the algorithm to train on rarer GO terms, while also improving its prediction power on the more abundant terms.
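The effect of pooling species on rare terms comes down to counting. The sketch below assumes, purely for illustration, that a term needs a minimum number of annotated proteins before a model can train on it; the cutoff, the protein names and the term labels are placeholders rather than real GO identifiers or NetQuilt settings.

```python
from collections import Counter


def trainable_terms(annotations, min_examples=2):
    """Return the terms annotated on at least min_examples proteins."""
    counts = Counter(term for terms in annotations.values() for term in terms)
    return {term for term, n in counts.items() if n >= min_examples}


# Hypothetical annotations: protein id -> set of function terms.
human = {"h1": {"binding"}, "h2": {"binding"}, "h3": {"rare_process"}}
mouse = {"m1": {"rare_process"}, "m2": {"binding"}}

single_species = trainable_terms(human)            # rare term has one example
pooled = trainable_terms({**human, **mouse})       # rare term now has two
```

In this toy case the rare term falls below the cutoff in one species alone and clears it once the second species is added, while the common term gains an extra example as well.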
Putting NetQuilt to the test
In test cases with the bacterium E. coli, humans and mice, NetQuilt substantially outperformed four other leading methods for protein function prediction. “‘This cannot be possible,’ we thought, when we saw how good [NetQuilt] was,” says Gligorijević. “We were really surprised at the strength of the predictions.” Even when used to predict protein functions within a single species, the large number and diversity of examples that came from multiple species in the deep learning training boosted the power of NetQuilt’s predictions. “We saw the biggest jump in performance when we patched different species together,” says Barot.
The scientists designed a test case modeled on the increasingly common scenario in which an organism is newly sequenced, but its PPIs remain a black box. NetQuilt analyzed the proteins of related species in order to predict the functions of a test organism, that is, an organism with its PPI information left out of consideration. NetQuilt first determined the sequence similarity between the test organism’s proteins and the proteins from a set of similar organisms. Then, by examining the known networks, NetQuilt predicted a network for the test organism’s proteins.
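A simple way to see how a network can be suggested from sequence similarity alone is to map each protein in the test organism to its closest match in a well-studied species, then copy over the interactions known between those matches. The Python sketch below assumes that approach for illustration; the article does not spell out NetQuilt's actual procedure, and the mapping and edges here are hypothetical.

```python
def predict_network(best_match, known_edges):
    """Propose interactions for a test organism from a reference network.

    best_match: {test_protein: most similar reference protein}
    known_edges: iterable of (ref_a, ref_b) interactions in the reference species
    """
    known = {frozenset(edge) for edge in known_edges}
    test_proteins = sorted(best_match)
    predicted = set()
    for i, a in enumerate(test_proteins):
        for b in test_proteins[i + 1:]:
            # Predict a-b if their reference counterparts are known to interact.
            if frozenset((best_match[a], best_match[b])) in known:
                predicted.add((a, b))
    return predicted


# Hypothetical sequence-based matches and reference interactions.
best_match = {"t1": "r1", "t2": "r2", "t3": "r3"}
known_edges = [("r1", "r2"), ("r3", "r2")]
predicted = predict_network(best_match, known_edges)
```

Two test proteins that map to the same reference protein produce no predicted edge unless the reference network records a self-interaction, which keeps the sketch from inventing links out of duplicate matches.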
Using the suggested network, NetQuilt then predicted the protein functions of the test organism. “The model was very robust in predicting function,” says Gligorijević, referring to NetQuilt’s performance, which he says matches or exceeds that of three other protein function prediction methods. As DNA sequencing supplements more traditional species discovery approaches, such as in the International Barcode of Life project, more organisms with unknown protein functions and interaction networks are cropping up in databases. NetQuilt could be especially useful for connecting the dots between proteins and using these predicted networks to predict protein function. For example, the Microbiome Immunity Project, which seeks to discover new bacteria and understand what their proteins do, could be an ideal candidate for protein network prediction, says Barot.
The future of computational approaches to biological problems
Computational approaches to protein function prediction have one challenge in common: access to computing power. In fact, the Microbiome Immunity Project has issued a call for volunteers to help run analyses using the spare processing power on their own personal computers. The computational demands will only increase as the number of proteins in a dataset increases. During the development of NetQuilt, the Simons Foundation offered unparalleled computing opportunities, says Gligorijević. “The Flatiron Institute has the best computational infrastructure I’ve experienced,” he says. A computing network dedicated to Flatiron researchers, known as the Scientific Computing Core (SCC), directed by Ian Fisk and Nick Carriero, enabled the NetQuilt team to run numerous experiments at once and train the model in different evaluation settings. “The SCC has been critical to our work,” says Bonneau. “They have built a whole new architecture for us.”
Through efforts like the ones from Bonneau’s group, computational biology is moving from similarity matching in protein function determination to extracting the maximum value from a dataset. Indeed, predicting protein interaction networks with no prior knowledge of them catapults our understanding far beyond the dataset and offers a window into a cellular symphony that is otherwise inaccessible. Computational methods like NetQuilt are poised to close the gap between the ever-increasing number of genomic sequences and our knowledge of what those sequences do. “Computational biology is crossing new thresholds of utility,” says Bonneau.
Numerous biological problems require protein function prediction for their answers. The more species we discover, and the more DNA sequences we generate, the wider the range of possible protein functions. “Scientists are constantly discovering new functions,” says Barot. “Plus, you can have designer proteins with new functions.” One compelling example of this phenomenon is an experimental nasal spray treatment for COVID-19 based on designer proteins crafted by scientists at the Institute for Protein Design at the University of Washington in Seattle. “There’s really no end to function discovery,” says Barot. And so there’s no end to the need for computational biology approaches that can propel protein function prediction forward.