CCB: Publications

Protein Structural Alignments From Sequence

J. Morton, C. E.M. Strauss, R. Blackwell, D. Berenberg, V. Gligorijevic, R. Bonneau

Computing sequence similarity is a fundamental task in biology, with alignment forming the basis for the annotation of genes and genomes and providing the core data structures for evolutionary analysis. Standard approaches are a mainstay of modern molecular biology and rely on variations of edit distance to obtain explicit alignments between pairs of biological sequences. However, sequence alignment algorithms struggle with remote homology tasks and cannot identify similarities between many pairs of proteins with similar structures and likely homology. Recent work suggests that using machine learning language models can improve remote homology detection. To this end, we introduce DeepBLAST, which obtains explicit alignments from residue embeddings learned from a protein language model integrated into an end-to-end differentiable alignment framework. This approach can be accelerated on GPU architectures and outperforms conventional sequence alignment techniques in terms of both speed and accuracy when identifying structurally similar proteins.
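
As a rough illustration of what a differentiable alignment over residue embeddings can look like, the sketch below runs a Needleman-Wunsch-style recursion in which match scores come from the cosine similarity of per-residue vectors and the hard maximum is replaced by a temperature-scaled log-sum-exp. It is not the DeepBLAST implementation: the function names, the linear gap penalty, and the choice of cosine similarity are assumptions made for this example, and the embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def smooth_max(values, temperature):
    """Log-sum-exp relaxation of max; approaches the hard max as temperature -> 0."""
    top = max(values)
    return top + temperature * math.log(
        sum(math.exp((x - top) / temperature) for x in values)
    )


def soft_alignment_score(emb_x, emb_y, gap=-1.0, temperature=1.0):
    """Smoothed global alignment score between two lists of residue embeddings."""
    n, m = len(emb_x), len(emb_y)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Leading gaps: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + cosine(emb_x[i - 1], emb_y[j - 1])
            delete = score[i - 1][j] + gap
            insert = score[i][j - 1] + gap
            score[i][j] = smooth_max((match, delete, insert), temperature)
    return score[n][m]
```

Because every step is a smooth function of the embeddings, a score of this kind can in principle be backpropagated through in an automatic-differentiation framework; the pure-Python version here only shows the recursion.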


Specificities of modeling membrane proteins using multi-template homology modeling

J. Koehler, R. Bonneau

Structures of membrane proteins are challenging to determine experimentally and currently represent only about 2% of the structures in the Protein Data Bank. Because of this disparity, methods for modeling membrane proteins are fewer and of lower quality than those for modeling soluble proteins. However, better expression, crystallization, and cryo-EM techniques have prompted a recent increase in experimental structures of membrane proteins, which can act as templates to predict the structure of closely related proteins through homology modeling. Because homology modeling relies on a structural template, it is easier and more accurate than fold recognition methods or de novo modeling, which are used when the sequence similarity between the query sequence and the sequence of related proteins in structural databases is below 25%. In homology modeling, a query sequence is mapped onto the coordinates of a single template and refined. With the increase in available templates, several templates often cover overlapping segments of the query sequence. Multi-template modeling can be used to identify the best template for local segments and join them into a single model. Here we provide a protocol for modeling membrane proteins from multiple templates in the Rosetta software suite. This approach takes advantage of several integrated frameworks, namely RosettaScripts, RosettaCM, and RosettaMP with the membrane scoring function.


Computational design of mixed chirality peptide macrocycles with internal symmetry

V. Mulligan, C. Kang, M. Sawaya, S. Rettie, X. Li, I. Antselovich, T. Craven, A. Watkins, J. Labonte, F. DiMaio, T. Yeates, D. Baker

Cyclic symmetry is frequent in protein and peptide homo‐oligomers, but extremely rare within a single chain, as it is not compatible with free N‐ and C‐termini. Here we describe the computational design of mixed‐chirality peptide macrocycles with rigid structures that feature internal cyclic symmetries or improper rotational symmetries inaccessible to natural proteins. Crystal structures of three C2‐ and C3‐symmetric macrocycles, and of six diverse S2‐symmetric macrocycles, match the computationally‐designed models with backbone heavy‐atom RMSD values of 1 Å or better. Crystal structures of an S4‐symmetric macrocycle (consisting of a sequence and structure segment mirrored at each of three successive repeats) designed to bind zinc reveal a large‐scale zinc‐driven conformational change from an S4‐symmetric apo‐state to a nearly inverted S4‐symmetric holo‐state almost identical to the design model. These symmetric structures provide promising starting points for applications ranging from design of cyclic peptide based metal organic frameworks to creation of high affinity binders of symmetric protein homo‐oligomers. More generally, this work demonstrates the power of computational design for exploring symmetries and structures not found in nature, and for creating synthetic switchable systems.


Identification of new therapeutic targets in CRLF2-overexpressing B-ALL through discovery of TF-gene regulatory interactions

S. Badri, B. Carella, P. Lhoumaud, D. Castro, C. Skok Gibbs, R. Raviram, S. Narang, N. Evensen, A. Watters, W. Carroll, R. Bonneau, J. Skok

Although genetic alterations are initial drivers of disease, aberrantly activated transcriptional regulatory programs are often responsible for the maintenance and progression of cancer. CRLF2-overexpression in B-ALL patients leads to activation of JAK-STAT, PI3K and ERK/MAPK signaling pathways and is associated with poor outcome. Although inhibitors of these pathways are available, there remains the issue of treatment-associated toxicities, thus it is important to identify new therapeutic targets. Using a network inference approach, we reconstructed a B-ALL specific transcriptional regulatory network to evaluate the impact of CRLF2-overexpression on downstream regulatory interactions.

Comparing RNA-seq from CRLF2-High and other B-ALL patients (CRLF2-Low), we defined a CRLF2-High gene signature. Patient-specific chromatin accessibility was interrogated to identify altered putative regulatory elements that could be linked to transcriptional changes. To delineate these regulatory interactions, a B-ALL cancer-specific regulatory network was inferred using 868 B-ALL patient samples from the NCI TARGET database coupled with priors generated from ATAC-seq peak TF-motif analysis. CRISPRi, siRNA knockdown and ChIP-seq of nine TFs involved in the inferred network were analyzed to validate predicted TF-gene regulatory interactions.

In this study, a B-ALL specific regulatory network was constructed using ATAC-seq derived priors. Inferred interactions were used to identify differential patient-specific transcription factor activities predicted to control CRLF2-High deregulated genes, thereby enabling identification of new potential therapeutic targets.


A convolutional neural network for common coordinate registration of high-resolution histology images

A. Daly, K. Geras, R. Bonneau

Registration of histology images from multiple sources is a pressing problem in large-scale studies of spatial -omics data. Researchers often perform “common coordinate registration,” akin to segmentation, in which samples are partitioned based on tissue type to allow for quantitative comparison of similar regions across samples. Accuracy in such registration requires both high image resolution and global awareness, which mark a difficult balancing act for contemporary deep learning architectures. We present a novel convolutional neural network (CNN) architecture that combines (1) a local classification CNN that extracts features from image patches sampled sparsely across the tissue surface, and (2) a global segmentation CNN that operates on these extracted features. This hybrid network can be trained in an end-to-end manner, and we demonstrate its relative merits over competing approaches on a reference histology dataset as well as two published spatial transcriptomics datasets. We believe that this paradigm will greatly enhance our ability to process spatial -omics data, and has general purpose applications for the processing of high-resolution histology images on commercially available GPUs.


Context-aware dimensionality reduction deconvolutes gut microbial community dynamics

C. Martino, L. Shenhav, C. Marotz, G. Armstrong, D. McDonald, Y. Vásquez-Baeza, J. Morton, L. Jiang, M. Dominguez-Bello, A. Swafford, E. Halperin, R. Knight

The translational power of human microbiome studies is limited by high interindividual variation. We describe a dimensionality reduction tool, compositional tensor factorization (CTF), that incorporates information from the same host across multiple samples to reveal patterns driving differences in microbial composition across phenotypes. CTF identifies robust patterns in sparse compositional datasets, allowing for the detection of microbial changes associated with specific phenotypes that are reproducible across datasets.


Microbe-metabolite associations linked to the rebounding murine gut microbiome post-colonization with vancomycin resistant Enterococcus faecium

A. Mu, G. Carter, L. Li, N. Isles, A. Vrbanac, J. Morton, A. Jarmusch, D. De Souza, V. Narayana, K. Kanojia, B. Nijagal, M. McConville, R. Knight, B. Howden, T. Stinear

Vancomycin-resistant Enterococcus faecium (VREfm) is an emerging antibiotic-resistant pathogen. Strain-level investigations are beginning to reveal the molecular mechanisms used by VREfm to colonize regions of the human bowel. However, the role of commensal bacteria during VREfm colonization, in particular following antibiotic treatment, remains largely unknown. We employed amplicon 16S rRNA gene sequencing and metabolomics in a murine model system to investigate functional roles of the gut microbiome during VREfm colonization. First-order taxonomic shifts between Bacteroidetes and Tenericutes within the gut microbial community composition were detected both in response to pretreatment using ceftriaxone and to subsequent VREfm challenge. Using neural networking approaches to find cooccurrence profiles of bacteria and metabolites, we detected key metabolome features associated with butyric acid during and after VREfm colonization. These metabolite features were associated with Bacteroides, indicative of a transition toward a preantibiotic naive microbiome. This study shows the impacts of antibiotics on the gut ecosystem and the progression of the microbiome in response to colonization with VREfm. Our results offer insights toward identifying potential nonantibiotic alternatives to eliminate VREfm through metabolic reengineering to preferentially select for Bacteroides.


Alternative Activation of Macrophages Is Accompanied by Chromatin Remodeling Associated with Lineage-Dependent DNA Shape Features Flanking PU.1 Motifs

M. Tang, E. Miraldi, N. Girgis, R. Bonneau, P. Loke

IL-4 activates macrophages to adopt distinct phenotypes associated with clearance of helminth infections and tissue repair, but the phenotype depends on the cellular lineage of these macrophages. The molecular basis of chromatin remodeling in response to IL-4 stimulation in tissue-resident and monocyte-derived macrophages is not understood. In this study, we find that IL-4 activation of different lineages of peritoneal macrophages in mice is accompanied by lineage-specific chromatin remodeling in regions enriched with binding motifs of the pioneer transcription factor PU.1. The PU.1 motif is similarly associated with both tissue-resident and monocyte-derived IL-4-induced accessible regions but has different lineage-specific DNA shape features and predicted cofactors. Mutation studies based on natural genetic variation between C57BL/6 and BALB/c mouse strains indicate that accessibility of these IL-4-induced regions can be regulated through differences in DNA shape without direct disruption of PU.1 motifs. We propose a model whereby DNA shape features of stimulation-dependent genomic elements contribute to differences in the accessible chromatin landscape of alternatively activated macrophages on different genetic backgrounds that may contribute to phenotypic variations in immune responses.


CRISPR-Decryptr reveals cis-regulatory elements from noncoding perturbation screens

A. Rasmussen, T. Äijö, M. Gabitto, N. Carriero, N. Sanjana, J. Skok, R. Bonneau

Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-Cas9 genome editing methods provide the tools necessary to examine phenotypic impacts of targeted perturbations in high-throughput screens. While these technologies have the potential to reveal functional elements with direct therapeutic applications, statistical techniques to analyze noncoding screen data remain limited. We present CRISPR-Decryptr, a computational tool for the analysis of CRISPR noncoding screens. Our method leverages experimental design: accounting for multiple conditions, controls, and replicates to infer the regulatory landscape of noncoding genomic regions. We validate our method on a variety of mutagenesis, CRISPR activation, and CRISPR interference screens, extracting new insights from previously published data.


NetQuilt: Deep Multispecies Network-based Protein Function Prediction using Homology-informed Network Similarity

M. Barot, V. Gligorijevic, K. Cho, R. Bonneau

Transferring knowledge between species is challenging: different species contain distinct proteomes and cellular architectures, which cause their proteins to carry out different functions via different interaction networks. Many approaches to proteome and biological network functional annotation use sequence similarity to transfer knowledge between species. These similarity-based approaches cannot produce accurate predictions for proteins without homologues of known function, as many functions require cellular or organismal context for meaningful function prediction. In order to supply this context, network-based methods use protein-protein interaction (PPI) networks as a source of information for inferring protein function and have demonstrated promising results in function prediction. However, the majority of these methods are tied to a network for a single species, and many species lack biological networks. In this work, we integrate sequence and network information across multiple species by applying an IsoRank-derived network alignment algorithm to create a meta-network profile of the proteins of multiple species. We then use this integrated multispecies meta-network as input features to train a maxout neural network with Gene Ontology terms as target labels. Our multispecies approach takes advantage of more training examples, and more diverse examples from multiple organisms, and consequently leads to significant improvements in function prediction performance. Further, we evaluate our approach in a setting in which an organism’s PPI network is left out, using other organisms’ network information and sequence homology in order to make predictions for the left-out organism, to simulate cases in which a newly sequenced species has no network information available.
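
To make the network alignment step more concrete, the sketch below implements a plain IsoRank-style iteration between two small networks: the similarity of a node pair is repeatedly updated from the similarity of its neighbor pairs, mixed with a prior similarity matrix that stands in for sequence homology. This is a generic illustration, not the IsoRank-derived variant used in NetQuilt; the function name `isorank`, the parameters `alpha` and `iterations`, and the toy inputs are assumptions, and the adjacency matrices are assumed to be symmetric 0/1 lists of lists.

```python
def isorank(adj_a, adj_b, prior, alpha=0.6, iterations=50):
    """IsoRank-style node-pair similarity between two undirected networks.

    adj_a, adj_b: symmetric 0/1 adjacency matrices (lists of lists).
    prior: n x m non-negative matrix, e.g. sequence similarity (assumed input).
    alpha: weight of network topology versus the prior.
    """
    n, m = len(adj_a), len(adj_b)
    deg_a = [sum(row) for row in adj_a]
    deg_b = [sum(row) for row in adj_b]

    # Normalize the prior so that all pair scores sum to one.
    total = sum(sum(row) for row in prior)
    e = [[p / total for p in row] for row in prior]
    r = [row[:] for row in e]

    for _ in range(iterations):
        updated = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                # Each neighbor pair (u, v) spreads its score evenly
                # over all pairs of its own neighbors.
                support = 0.0
                for u in range(n):
                    if not adj_a[i][u]:
                        continue
                    for v in range(m):
                        if adj_b[j][v]:
                            support += r[u][v] / (deg_a[u] * deg_b[v])
                updated[i][j] = alpha * support + (1 - alpha) * e[i][j]
        r = updated
    return r


# Toy usage: two three-node path graphs with an uninformative prior.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
uniform_prior = [[1.0] * 3 for _ in range(3)]
scores = isorank(path, path, uniform_prior)
```

In this toy case the pair of central nodes receives the highest score, since topology alone distinguishes the hub from the endpoints; with a real homology-derived prior the two sources of evidence would be blended according to `alpha`.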
