Z. Pan, Chandra L. Theesfeld
Genetic diagnosis promises to guide treatment and manage expectations for patients and physicians. Yet even when a variant in a disease gene is identified, the assignment of pathogenic impact is not always possible.1 Of the 215 million possible substitutions in approximately 19,900 genes, 71 million are missense mutations that result in an amino acid substitution rather than a stop codon or a frameshift.2 Only 4 million missense variants have been observed, of which approximately 2% have been clinically classified as pathogenic or benign by testing companies and collected in the public ClinVar repository. The rest are classified as variants of uncertain significance (VUS) due to the dearth of information on the functional impact or pathogenic consequences of the mutation.
A key challenge is to understand how changes in protein sequence affect function and contribute to disease. While mutational scanning assays enable scientists to test thousands of substitutions at a time in cell lines, it is not possible to experimentally test all mutations, let alone assess fitness in humans. To meet this challenge, computational approaches that integrate many types of information to predict functional impact are becoming increasingly sophisticated and increasingly accurate at classifying variants.
An early and powerful strategy for modeling the pathogenic impact of variants employed evolutionary sequence information from multiple sequence alignments (MSA). This approach examines sequence conservation across species and within humans, as demonstrated in models such as PolyPhen and SIFT.3 Integrating functional insights about protein domains, coupled with artificial intelligence, further enhances these models.3 Prediction of a correct 3-dimensional protein structure has long been a holy grail of research. Marks et al.4 suggested a global statistical model to massively reduce the search space of protein conformations by linking the pairwise correlations from MSA to fold a protein into a correct 3-dimensional structure (directly from Marks et al.4). AlphaFold5 marked a significant advance in the field by using a large language model (LLM) to associate protein structure with MSA with unprecedented accuracy, effectively solving the “protein folding problem.” Such models are powered by the ability of protein LLMs to learn not just amino acid relationships in linear sequences but also extremely rich relationships across many dimensions and contexts.
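To make the two MSA signals mentioned here concrete, the following minimal sketch computes per-column conservation (Shannon entropy) and a pairwise correlation between columns (mutual information) on a toy alignment. The sequences and function names are hypothetical and chosen only for illustration. Mutual information is a simple local stand-in; it is not the global statistical model of Marks et al., nor the method used by PolyPhen, SIFT, or AlphaFold.

```python
import math
from collections import Counter

# Toy alignment: hypothetical sequences, one row per species.
MSA = [
    "MKVLA",
    "MKILA",
    "MRVLG",
    "MRILG",
]

def column(msa, i):
    """Return the residues found at alignment position i."""
    return [seq[i] for seq in msa]

def entropy(symbols):
    """Shannon entropy in bits; 0 means the column is perfectly conserved."""
    n = len(symbols)
    return sum((c / n) * math.log2(n / c) for c in Counter(symbols).values())

def mutual_information(msa, i, j):
    """Mutual information in bits between columns i and j.

    High values suggest the two positions vary together (co-evolve).
    """
    a, b = column(msa, i), column(msa, j)
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

# Per-position conservation: lower entropy means stronger conservation.
conservation = [entropy(column(MSA, i)) for i in range(len(MSA[0]))]

# Positions 1 and 4 change together (K with A, R with G);
# positions 1 and 2 vary independently of each other.
coupled = mutual_information(MSA, 1, 4)
independent = mutual_information(MSA, 1, 2)
```

In this toy case the invariant positions (0 and 3) have zero entropy, which is the kind of signal a conservation-based predictor would read as intolerance to substitution, while the co-varying pair (1 and 4) carries the kind of pairwise signal that structure prediction from MSA builds on. Real alignments additionally require sequence weighting, gap handling, and corrections for phylogenetic bias, none of which is shown here.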