Presenter: Jamie Morton, Ph.D., Flatiron Fellow, Systems Biology
Title: Biological sequence alignments from language representations
Computing sequence similarity is a fundamental task in biology: alignment forms the basis for annotating genes and genomes and provides the core data structures for all evolutionary analysis. Standard approaches rely on variations of edit distance to obtain explicit alignments between pairs of biological sequences. However, these approaches fail to account for long-range interactions across sequences and cannot handle evolutionary events that permute sequence modules; modeling such events is critical for comparing protein sequences that have diverged over larger timescales. We provide a means to obtain explicit alignments from residue embeddings learned by a deep language model. We show that a bipartite matching of residue embeddings for pairs of proteins can be used to detect and generate accurate protein sequence alignments, and that this pairwise latent alignment, coupled with LSTMs and/or attention models, is competitive with state-of-the-art alignment tools. This work provides a novel way to interpret the sequence representations learned by these language models and presents opportunities to improve the scaling and sensitivity of widely used alignment methods.
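As a rough illustration of the bipartite matching idea in the abstract, the Python sketch below pairs residues across two proteins by solving a minimum-cost assignment over their embeddings. It is not the speaker's implementation. The toy 2-D embeddings, the squared Euclidean cost, and the function names (cost_matrix, min_cost_matching) are assumptions made for illustration, and the solver is a textbook Hungarian algorithm rather than anything specific to the work presented.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cost_matrix(x_emb, y_emb):
    """cost[i][j] = distance between residue i of x and residue j of y."""
    return [[squared_distance(a, b) for b in y_emb] for a in x_emb]


def min_cost_matching(cost):
    """Hungarian algorithm (potentials form), O(n^2 * m).

    Matches every row to a distinct column with minimum total cost.
    Requires len(rows) <= len(columns). Returns sorted (row, col) pairs.
    """
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("need at least as many columns as rows")
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (0 = none)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path just found.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)


# Hypothetical residue embeddings: y holds the same three residues as x,
# but in a permuted order.
x_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_emb = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

matches = min_cost_matching(cost_matrix(x_emb, y_emb))
print(matches)  # [(0, 1), (1, 2), (2, 0)]
```

In this toy case the matching recovers the permutation exactly: residue 0 of x pairs with residue 1 of y, and so on. Unlike an edit-distance alignment, the assignment is not constrained to preserve residue order, which is the property that lets a matching of this kind pair up rearranged sequence modules.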