Systems Biology Group Meeting

Presenter: Daniel Berenberg

Topic: Language modeling for DNA sequences

Abstract: Next-generation sequencing efforts have produced massive repositories of reliable whole-genome data, comprising both coding and non-coding nucleotide sequences. Very deep language models with close to, and in some cases more than, a billion parameters have shown unprecedented performance across a variety of input domains, including natural language text and protein sequences. Critically, the representations these models learn in both domains have come to be regarded as general-purpose featurizations that support state-of-the-art performance on property prediction tasks such as protein function classification, structural alignment, and binding affinity prediction. In this work, we aim to leverage the abundance of genomic sequence data and the power of large language models to develop meaningful feature extractors for nucleotide sequences. If successful, this effort will yield a so-called ‘neural metagenomics pipeline’ that allows biologists to analyze genomic samples and obtain valuable information quickly and automatically from an entirely data-driven perspective. In this talk, I will describe our progress, current challenges, and future work.

This is a joint project with Tymor Hamamsy, advised by Professors Rich Bonneau and Kyunghyun Cho (NYU).
