CCB: Publications

4C-ker: a method to reproducibly identify genome-wide interactions captured by 4C-Seq experiments

R Raviram, P Rocha, C. Müller, E. Miraldi, S Badri, Y Fu, E Swanzey, C Proudhon, V Snetkova, R. Bonneau, J Skok

4C-Seq has proven to be a powerful technique to identify genome-wide interactions with a single locus of interest (or “bait”) that can be important for gene regulation. However, analysis of 4C-Seq data is complicated by the many biases inherent to the technique. An important consideration when dealing with 4C-Seq data is the differences in resolution of signal across the genome that result from differences in 3D distance separation from the bait. This leads to the highest signal in the region immediately surrounding the bait and increasingly lower signals in far-cis and trans. Another important aspect of 4C-Seq experiments is the resolution, which is greatly influenced by the choice of restriction enzyme and the frequency at which it can cut the genome. Thus, it is important that a 4C-Seq analysis method is flexible enough to analyze data generated using different enzymes and to identify interactions across the entire genome. Current methods for 4C-Seq analysis only identify interactions in regions near the bait or in regions located in far-cis and trans, but no method comprehensively analyzes 4C signals of different length scales. In addition, some methods also fail in experiments where chromatin fragments are generated using frequent-cutter restriction enzymes. Here, we describe 4C-ker, a hidden Markov model-based pipeline that identifies regions throughout the genome that interact with the 4C bait locus. In addition, we incorporate methods for the identification of differential interactions in multiple 4C-Seq datasets collected from different genotypes or experimental conditions. Adaptive window sizes are used to correct for differences in signal coverage in near-bait regions, far-cis and trans chromosomes. Using several datasets, we demonstrate that 4C-ker outperforms all existing 4C-Seq pipelines in its ability to reproducibly identify interaction domains at all genomic ranges with different resolution enzymes.
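As a rough illustration of why adaptive windows help, the sketch below groups fragment positions into windows that each hold the same number of observed fragments, so windows come out narrow where coverage is dense (near the bait) and wide where it is sparse. This is only one simple scheme under assumed inputs; the function name `adaptive_windows`, the parameter `k`, and the toy coordinates are hypothetical and are not taken from 4C-ker itself.

```python
def adaptive_windows(fragment_positions, k):
    """Group fragment positions into windows of (at most) k fragments each.

    Hypothetical helper for illustration only, not the 4C-ker algorithm.
    Returns a list of (start, end, n_fragments) tuples.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    positions = sorted(fragment_positions)
    windows = []
    for start in range(0, len(positions), k):
        chunk = positions[start:start + k]
        windows.append((chunk[0], chunk[-1], len(chunk)))
    return windows


# Toy coordinates: dense near the bait, sparse further away (made-up values).
toy_positions = [100, 110, 120, 130, 500, 900, 2000, 5000]
for start, end, n in adaptive_windows(toy_positions, k=2):
    print(f"window {start}-{end}: {n} fragments, width {end - start}")
```

With these toy values the window width grows with distance from the dense region while the fragment count per window stays constant.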

Environmental gene regulatory influence networks in rice (Oryza sativa): response to water deficit, high temperature and agricultural environments

O. Wilkins, C. Hafemeister, A. Plessis, M.-M. Holloway-Phillips, G. Pham, A.B. Nicotra, G.B. Gregorio, S.V.K. Jagadish, E.M. Septiningsih, R. Bonneau, M. Purugganan

We inferred an environmental gene regulatory influence network (EGRIN) of the response of tropical Asian rice (Oryza sativa) to high temperatures, water deficit and agricultural environments. This network integrates transcriptome data (RNA-seq) and chromatin accessibility measurements (ATAC-seq) from five rice cultivars that were grown in controlled experiments and in agricultural fields. We identified open chromatin regions covering ~2% of the genome. These regions were highly overrepresented proximal to the transcriptional start sites of genes and were used to define the promoters for all genes. We used the occurrences of known cis-regulatory motifs in the promoters to generate a network prior comprising 77,071 interactions. We then estimated the regulatory activity of each TF (TFA; 143 TFs) based on the expression of its target genes in the network prior across 360 experimental conditions. We inferred an EGRIN using the estimated TFA, rather than the TF expression, as the regulator. The EGRIN identified hypotheses for 4,052 genes regulated by 113 TFs; of these, 18% were in the network prior. We resolved distinct regulatory roles for members of a large TF family, including a putative regulatory connection between abiotic stress and the circadian clock, as well as specific regulatory functions for TFs in the drought response. We find that TFA estimation is an effective way of incorporating multiple genome-scale measurements into network inference and that supplementing data from controlled experimental conditions with data from outdoor field conditions increases the resolution of EGRIN inference.
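The idea of estimating TF activity from target-gene expression can be sketched in a few lines. The example assumes a linear model in which expression is approximately the prior matrix times the activity matrix (X ≈ P·A) and solves for A by ordinary least squares; that modeling choice, the function name `estimate_tfa`, and the toy matrices are assumptions made for illustration and are not details given in the abstract.

```python
def estimate_tfa(prior, expression):
    """Least-squares solve of expression ~ prior * activity.

    prior:      genes x TFs list of lists (1 = TF targets gene, 0 = not)
    expression: genes x conditions list of lists
    returns:    TFs x conditions activity matrix (list of lists)
    Hypothetical helper; assumes the prior columns are linearly independent.
    """
    n_genes, n_tfs = len(prior), len(prior[0])
    n_cond = len(expression[0])
    # Normal equations: (P^T P) A = P^T X
    ptp = [[sum(prior[g][i] * prior[g][j] for g in range(n_genes))
            for j in range(n_tfs)] for i in range(n_tfs)]
    ptx = [[sum(prior[g][i] * expression[g][c] for g in range(n_genes))
            for c in range(n_cond)] for i in range(n_tfs)]
    # Gauss-Jordan elimination with partial pivoting on [P^T P | P^T X]
    aug = [[float(v) for v in ptp[i] + ptx[i]] for i in range(n_tfs)]
    for col in range(n_tfs):
        pivot = max(range(col, n_tfs), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("prior columns are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n_tfs):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n_tfs:] for row in aug]


# Toy data: 3 genes, 2 TFs, 2 conditions (made-up values).
toy_prior = [[1, 0], [1, 1], [0, 1]]
toy_expression = [[2, 1], [3, 4], [1, 3]]
print(estimate_tfa(toy_prior, toy_expression))
```

The estimated activities, rather than the TFs' own expression levels, would then serve as the regulators in a downstream network inference step.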

Tweeting identity? Ukrainian, Russian, and #Euromaidan

M MacDuffee Metzger, R. Bonneau, J Nagler, J Tucker

Why and when do group identities become salient? Existing scholarship has suggested that insecurity and competition over political and economic resources as well as increased perceptions of threat from the out-group tend to increase the salience of ethnic identities. Most of the work on ethnicity, however, is either experimental and deals with how people respond once identity has already been primed, is based on self-reported measures of identity, or is driven by election results. In contrast, here we examine events in Ukraine from late 2013 (the beginning of the Euromaidan protests) through the end of 2014 to see if particular moments of heightened political tension led to increased identification as either “Russian” or “Ukrainian” among Ukrainian citizens. In tackling this question, we use a novel methodological approach by testing the hypothesis that those who prefer to use Ukrainian to communicate on Twitter will use Ukrainian (at the expense of Russian) following moments of heightened political awareness and those who prefer to use Russian will do the opposite. Interestingly, our primary finding is a negative result: we do not find evidence that key political events in the Ukrainian crisis led to a reversion to the language of choice at the aggregate level, which is interesting given how much ink has been spilt on the question of the extent to which Euromaidan reflected an underlying Ukrainian vs. Russian conflict. However, we unexpectedly find that both those who prefer Russian and those who prefer Ukrainian begin using Russian with a greater frequency following the annexation of Crimea, thus contributing a whole new set of puzzles – and a method for exploring these puzzles – that can serve as a basis for future research.

Robust classification of protein variation using structural modelling and large-scale data integration

E Baugh, R Simmons-Edler, C. Müller, R Alford, N. Volfovsky, R. Bonneau

Existing methods for interpreting protein variation focus on annotating mutation pathogenicity rather than detailed interpretation of variant deleteriousness and frequently use only sequence-based or structure-based information. We present VIPUR, a computational framework that seamlessly integrates sequence analysis and structural modelling (using the Rosetta protein modelling suite) to identify and interpret deleterious protein variants. To train VIPUR, we collected 9477 protein variants with known effects on protein function from multiple organisms and curated structural models for each variant from crystal structures and homology models. VIPUR can be applied to mutations in any organism's proteome with improved generalized accuracy (AUROC 0.83) and interpretability (AUPR 0.87) compared to other methods. We demonstrate that VIPUR's predictions of deleteriousness match the biological phenotypes in ClinVar and provide a clear ranking of prediction confidence. We use VIPUR to interpret known mutations associated with inflammation and diabetes, demonstrating the structural diversity of disrupted functional sites and improved interpretation of mutations associated with human diseases. Lastly, we demonstrate VIPUR's ability to highlight candidate variants associated with human diseases by applying VIPUR to de novo variants associated with autism spectrum disorders.

A Miniature Protein Stabilized by a Cation−π Interaction Network

T Craven, M Cho, N Traaseth, R. Bonneau, K Kirshenbaum

The design of folded miniature proteins is predicated on establishing noncovalent interactions that direct the self-assembly of discrete thermostable tertiary structures. In this work, we describe how a network of cation−π interactions present in proteins containing “WSXWS motifs” can be emulated to stabilize the core of a miniature protein. This 19-residue protein sequence recapitulates a set of interdigitated arginine and tryptophan residues that stabilize a distinctive β-strand:loop:PPII-helix topology. Validation of the compact fold determined by NMR was carried out by mutagenesis of the cation−π network and by comparison to the corresponding disulfide-bridged structure. These results support the involvement of a coordinated set of cation−π interactions that stabilize the tertiary structure.

Text Classification for Automatic Detection of E-Cigarette Use and Use for Smoking Cessation from Twitter: A Feasibility Pilot

Y. Aphinyanaphongs, A. Lulejian, D.P. Brown, R. Bonneau, P. Krebs

Rapid increases in e-cigarette use and potential exposure to harmful byproducts have shifted public health focus to e-cigarettes as a possible drug of abuse. Effective surveillance of use and prevalence would allow appropriate regulatory responses. An ideal surveillance system would collect usage data in real time, focus on populations of interest, include populations unable to take the survey, allow a breadth of questions to answer, and enable geo-location analysis. Social media streams may provide this ideal system. To realize this use case, a foundational question is whether we can detect e-cigarette use at all. This work reports two pilot tasks using text classification to automatically identify Tweets that indicate e-cigarette use and/or e-cigarette use for smoking cessation. We build and define both datasets and compare the performance of four state-of-the-art classifiers and a keyword search for each task. Our results demonstrate excellent classifier performance of up to 0.90 and 0.94 area under the curve in each category. These promising initial results form the foundation for further studies to realize the ideal surveillance solution.
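For readers unfamiliar with the metric, area under the ROC curve can be computed directly as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (ties count as half). The sketch below shows that calculation on made-up labels and scores; the function name `auroc` and the toy data are hypothetical and unrelated to the study's classifiers or datasets.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 1 (positive) or 0 (negative)
    scores: classifier scores, higher meaning "more likely positive"
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 1 = tweet indicates e-cigarette use, 0 = it does not (made-up).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(auroc(toy_labels, toy_scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.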

Antibiotic perturbation of the murine gut microbiome enhances the adiposity, insulin resistance, and liver disease associated with high-fat diet

D Mahana, C Trent, Z Kurtz, N Bokulich, T Battaglia, J Chung, C. Müller, H Li, R. Bonneau, M Blaser

Background
Obesity, type 2 diabetes, and non-alcoholic fatty liver disease (NAFLD) are serious health concerns, especially in Western populations. Antibiotic exposure and high-fat diet (HFD) are important and modifiable factors that may contribute to these diseases.

Methods
To investigate the relationship of antibiotic exposure with microbiome perturbations in a murine model of growth promotion, C57BL/6 mice received lifelong sub-therapeutic antibiotic treatment (STAT), or not (control), and were fed HFD starting at 13 weeks. To characterize microbiota changes caused by STAT, the V4 region of the 16S rRNA gene was examined from collected fecal samples and analyzed.

Results
In this model, which included HFD, STAT mice developed increased weight and fat mass compared to controls. Although results in males and females were not identical, insulin resistance and NAFLD were more severe in the STAT mice. Fecal microbiota from STAT mice were distinct from controls. Compared with controls, STAT exposure led to early conserved diet-independent microbiota changes indicative of an immature microbial community. Key taxa were identified as STAT-specific and several were found to be predictive of disease. Inferred network models showed topological shifts concurrent with growth promotion and suggest the presence of keystone species.

Conclusions
These studies form the basis for new models of type 2 diabetes and NAFLD that involve microbiome perturbation.

Breaking TADs: insights into hierarchical genome organization

P.P Rocha, R. Raviram, R. Bonneau, J.A. Skok

The 3D organization of chromosomes enables cells to balance the biophysical constraints of the crowded nucleus with the functional dynamics of gene regulation. Physical contacts between genes and their regulatory elements are essential for proper transcriptional control, and maintenance of these interactions is critical for preventing aberrations in physiological processes that could manifest as disease states. The first insights into global nuclear organization came from imaging studies using FISH (fluorescence in situ hybridization) analyses, which demonstrated that chromosomes occupy individual territories in the nucleus with minimal intermingling between them [1]. The development of chromosome conformation capture (3C), in which chromatin fragments in close physical proximity can be detected, enabled the characterization of molecular interactions between different loci [2]. When 3C-based techniques incorporated massively parallel sequencing (as in Hi-C), the description of molecular chromatin interactions at a genome-wide scale finally became possible [3]. Hi-C was the first unbiased approach aimed at capturing all interactions in the nucleus, thereby providing a snapshot of nuclear organization at the global scale. The first Hi-C study revealed that each chromosomal territory is further divided into large domains of 5–10 Mb that physically separate two compartments (A and B), which strongly correlate with active and inactive chromatin, respectively [3]. Furthermore, this study demonstrated that interactions between loci in the same compartment occur at a higher frequency than between loci in different compartments [3]. With the progressive decrease in sequencing costs, higher-resolution Hi-C revealed a new level of nuclear organization where compartments A and B can be further divided into “topologically associated domains” (TADs) [4–6]. In mammalian cells these domains range in size from a few hundred kb to 5 Mb (with an average of 1 Mb). Since they exhibit a high degree of conservation between cell types and species, it was proposed that TADs represent the fundamental unit of physical organization of the genome [5].

Biophysically Motivated Regulatory Network Inference: Progress and Prospects

R. Bonneau

Thanks to the confluence of genomic technology and computational developments, the possibility of network inference methods that automatically learn large comprehensive models of cellular regulation is closer than ever. This perspective focuses on enumerating the elements of computational strategies that, when coupled to appropriate experimental designs, can lead to accurate large-scale models of chromatin state and transcriptional regulatory structure and dynamics. We highlight 4 research questions that require further investigation in order to make progress in network inference: (1) using overall constraints on network structure such as sparsity, (2) use of informative priors and data integration to constrain individual model parameters, (3) estimation of latent regulatory factor activity under varying cell conditions, and (4) new methods for learning and modeling regulatory factor interactions. We conclude that methods combining advances in these 4 categories of required effort with new genomic technologies will result in biophysically motivated dynamic genome-wide regulatory network models for several of the best-studied organisms and cell types.

Bacillus subtilis Systems Biology: Applications of -Omics Techniques to the Study of Endospore Formation

A.R. Bate, R. Bonneau, P. Eichenberger

The principal B. subtilis laboratory strain, strain 168, is derived from a parent strain isolated in Marburg, Germany, following a mutagenesis procedure (1). The popularity of this strain arose after it was shown to be competent for genetic transformation (2, 3), which paved the way for myriad molecular genetics analyses that led to a detailed understanding of the biology of B. subtilis and related Gram-positive bacteria.
