Publications

Self-calibrating neural networks for dimensionality reduction

Recently, a novel family of biologically plausible online algorithms for reducing the dimensionality of streaming data has been derived from the similarity matching principle. In these algorithms, the number of output dimensions can be determined adaptively by thresholding the singular values of the input data matrix. However, setting such a threshold requires knowing the magnitude of the desired singular values in advance. Here we propose online algorithms where the threshold is self-calibrating based on the singular values computed from the existing observations. To derive these algorithms from the similarity matching cost function we propose novel regularizers. As before, these online algorithms can be implemented by Hebbian/anti-Hebbian neural networks in which the learning rule depends on the chosen regularizer. We demonstrate both mathematically and via simulation the effectiveness of these online algorithms in various settings.
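
The difficulty with a fixed threshold can be illustrated with a small offline sketch. This is not the online Hebbian/anti-Hebbian algorithm from the abstract, and the relative rule below (keep singular values above a fraction of the largest one) is only an assumed stand-in for a self-calibrating threshold. The function names and the `fraction` parameter are hypothetical.

```python
import numpy as np


def rank_by_fixed_threshold(X, threshold):
    """Count singular values of X above a fixed, user-chosen threshold."""
    s = np.linalg.svd(X, compute_uv=False)
    return int(np.sum(s > threshold))


def rank_by_relative_threshold(X, fraction):
    """Count singular values above `fraction` times the largest one.

    Assumed illustrative rule: the threshold is calibrated from the data
    itself, so it does not depend on the overall scale of X.
    """
    s = np.linalg.svd(X, compute_uv=False)  # sorted in descending order
    return int(np.sum(s > fraction * s[0]))


# Two data matrices with the same structure but different overall scale.
X = np.diag([10.0, 5.0, 0.1])
X_small = 0.01 * X

fixed_big = rank_by_fixed_threshold(X, threshold=1.0)
fixed_small = rank_by_fixed_threshold(X_small, threshold=1.0)
relative_big = rank_by_relative_threshold(X, fraction=0.1)
relative_small = rank_by_relative_threshold(X_small, fraction=0.1)
```

In this toy case the fixed threshold selects a different number of dimensions once the data are rescaled, while the data-derived threshold does not, which is the motivation the abstract gives for self-calibration.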

Do retinal ganglion cells project natural scenes to their principal subspace and whiten them?

R. Abbasi-Asl, C. Pehlevan, B. Yu, D. Chklovskii

Several theories of early sensory processing suggest that it whitens sensory stimuli. Here, we test three key predictions of the whitening theory using recordings from 152 ganglion cells in salamander retina responding to natural movies. We confirm the previous finding that firing rates of ganglion cells are less correlated compared to natural scenes, although significant correlations remain. We show that while the power spectrum of ganglion cells decays less steeply than that of natural scenes, it is not completely flattened. Finally, we find evidence that only the top principal components of the visual stimulus are transmitted.
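
As a point of reference for what "projecting to the principal subspace and whitening" means, here is a minimal sketch of PCA whitening restricted to the top `k` components. It is a textbook construction applied to synthetic data, not the analysis performed on the retinal recordings; the function name and the choice of `k` are assumptions.

```python
import numpy as np


def whiten_top_components(X, k):
    """Project rows of X onto the top-k principal components and whiten.

    X has shape (n_samples, n_features). The returned array has shape
    (n_samples, k) and an identity sample covariance.
    """
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = Xc.T @ Xc / (X.shape[0] - 1)         # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]      # indices of the top-k
    # Scale each principal direction by 1/sqrt(variance) to equalize power.
    W = eigvecs[:, order] / np.sqrt(eigvals[order])
    return Xc @ W


rng = np.random.default_rng(0)
# Synthetic correlated data standing in for stimulus features.
mixing = rng.normal(size=(5, 5))
X = rng.normal(size=(200, 5)) @ mixing

Y = whiten_top_components(X, k=2)
Y_cov = Y.T @ Y / (Y.shape[0] - 1)
```

A perfectly whitened output would have uncorrelated components with equal variance, which is the idealized prediction the abstract reports as only partially met by ganglion cell responses.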

4C-ker: a method to reproducibly identify genome-wide interactions captured by 4C-Seq experiments

R Raviram, P Rocha, C. Müller, E. Miraldi, S Badri, Y Fu, E Swanzey, C Proudhon, V Snetkova, R. Bonneau, J Skok

4C-Seq has proven to be a powerful technique to identify genome-wide interactions with a single locus of interest (or “bait”) that can be important for gene regulation. However, analysis of 4C-Seq data is complicated by the many biases inherent to the technique. An important consideration when dealing with 4C-Seq data is the differences in resolution of signal across the genome that result from differences in 3D distance separation from the bait. This leads to the highest signal in the region immediately surrounding the bait and increasingly lower signals in far-cis and trans. Another important aspect of 4C-Seq experiments is the resolution, which is greatly influenced by the choice of restriction enzyme and the frequency at which it can cut the genome. Thus, it is important that a 4C-Seq analysis method is flexible enough to analyze data generated using different enzymes and to identify interactions across the entire genome. Current methods for 4C-Seq analysis only identify interactions in regions near the bait or in regions located in far-cis and trans, but no method comprehensively analyzes 4C signals of different length scales. In addition, some methods also fail in experiments where chromatin fragments are generated using frequent cutter restriction enzymes. Here, we describe 4C-ker, a Hidden-Markov Model based pipeline that identifies regions throughout the genome that interact with the 4C bait locus. In addition, we incorporate methods for the identification of differential interactions in multiple 4C-seq datasets collected from different genotypes or experimental conditions. Adaptive window sizes are used to correct for differences in signal coverage in near-bait regions, far-cis and trans chromosomes. Using several datasets, we demonstrate that 4C-ker outperforms all existing 4C-Seq pipelines in its ability to reproducibly identify interaction domains at all genomic ranges with different resolution enzymes.

Environmental gene regulatory influence networks in rice (Oryza sativa): response to water deficit, high temperature and agricultural environments

O. Wilkins, C. Hafemiester, A. Plessis, M.-M. Holloway-Phillips, G. Pham, A.B. Nicotra, G.B. Gregorio, S.V.K. Jagadish, E.M. Septiningsih, R. Bonneau, M. Purugganan

We inferred an environmental gene regulatory influence network (EGRIN) of the response of tropical Asian rice (Oryza sativa) to high temperatures, water deficit and agricultural environments. This network integrates transcriptome data (RNA-seq) and chromatin accessibility measurements (ATAC-seq) from five rice cultivars that were grown in controlled experiments and in agricultural fields. We identified open chromatin regions covering ~2% of the genome. These regions were highly overrepresented proximal to the transcriptional start sites of genes and were used to define the promoters for all genes. We used the occurrences of known cis-regulatory motifs in the promoters to generate a network prior comprising 77,071 interactions. We then estimated the regulatory activity of each TF (TFA; 143 TFs) based on the expression of its target genes in the network prior across 360 experimental conditions. We inferred an EGRIN using the estimated TFA, rather than the TF expression, as the regulator. The EGRIN identified hypotheses for 4,052 genes regulated by 113 TFs; of these, 18% were in the network prior. We resolved distinct regulatory roles for members of a large TF family, including a putative regulatory connection between abiotic stress and the circadian clock, as well as specific regulatory functions for TFs in the drought response. We find that TFA estimation is an effective way of incorporating multiple genome-scale measurements into network inference and that supplementing data from controlled experimental conditions with data from outdoor field conditions increases the resolution of EGRIN inference.

Tweeting identity? Ukrainian, Russian, and #Euromaidan

M MacDuffee Metzger, R. Bonneau, J Nagler, J Tucker

Why and when do group identities become salient? Existing scholarship has suggested that insecurity and competition over political and economic resources as well as increased perceptions of threat from the out-group tend to increase the salience of ethnic identities. Most of the work on ethnicity, however, is either experimental and deals with how people respond once identity has already been primed, is based on self-reported measures of identity, or driven by election results. In contrast, here we examine events in Ukraine from late 2013 (the beginning of the Euromaidan protests) through the end of 2014 to see if particular moments of heightened political tension led to increased identification as either “Russian” or “Ukrainian” among Ukrainian citizens. In tackling this question, we use a novel methodological approach by testing the hypothesis that those who prefer to use Ukrainian to communicate on Twitter will use Ukrainian (at the expense of Russian) following moments of heightened political awareness and those who prefer to use Russian will do the opposite. Interestingly, our primary finding is a negative result: we do not find evidence that key political events in the Ukrainian crisis led to a reversion to the language of choice at the aggregate level, which is interesting given how much ink has been spilt on the question of the extent to which Euromaidan reflected an underlying Ukrainian vs. Russian conflict. However, we unexpectedly find that both those who prefer Russian and those who prefer Ukrainian begin using Russian with a greater frequency following the annexation of Crimea, thus contributing a whole new set of puzzles – and a method for exploring these puzzles – that can serve as a basis for future research.

Robust classification of protein variation using structural modelling and large-scale data integration

E Baugh, R Simmons-Edler, C. Müller, R Alford, N. Volfovsky, R. Bonneau

Existing methods for interpreting protein variation focus on annotating mutation pathogenicity rather than detailed interpretation of variant deleteriousness and frequently use only sequence-based or structure-based information. We present VIPUR, a computational framework that seamlessly integrates sequence analysis and structural modelling (using the Rosetta protein modelling suite) to identify and interpret deleterious protein variants. To train VIPUR, we collected 9477 protein variants with known effects on protein function from multiple organisms and curated structural models for each variant from crystal structures and homology models. VIPUR can be applied to mutations in any organism's proteome with improved generalized accuracy (AUROC .83) and interpretability (AUPR .87) compared to other methods. We demonstrate that VIPUR's predictions of deleteriousness match the biological phenotypes in ClinVar and provide a clear ranking of prediction confidence. We use VIPUR to interpret known mutations associated with inflammation and diabetes, demonstrating the structural diversity of disrupted functional sites and improved interpretation of mutations associated with human diseases. Lastly, we demonstrate VIPUR's ability to highlight candidate variants associated with human diseases by applying VIPUR to de novo variants associated with autism spectrum disorders.

Inferring causal molecular networks: empirical assessment through a community-based effort

Steven M Hill, Laura M Heiser, Thomas Cokelaer, Michael Unger, Nicole K Nesser, Daniel E Carlin, Yang Zhang, Artem Sokolov, Evan O Paull, Chris K Wong, C. Müller, et al.

It remains unclear whether causal, rather than merely correlational, relationships in molecular networks can be inferred in complex biological settings. Here we describe the HPN-DREAM network inference challenge, which focused on learning causal influences in signaling networks. We used phosphoprotein data from cancer cell lines as well as in silico data from a nonlinear dynamical model. Using the phosphoprotein data, we scored more than 2,000 networks submitted by challenge participants. The networks spanned 32 biological contexts and were scored in terms of causal validity with respect to unseen interventional data. A number of approaches were effective, and incorporating known biology was generally advantageous. Additional sub-challenges considered time-course prediction and visualization. Our results suggest that learning causal relationships may be feasible in complex settings such as disease states. Furthermore, our scoring approach provides a practical way to empirically assess inferred molecular networks in a causal sense.

Probabilistic Modelling of Chromatin Code Landscape Reveals Functional Diversity of Enhancer-like Chromatin States

J Zhou, O. Troyanskaya

Interpreting the functional state of chromatin from the combinatorial binding patterns of chromatin factors, that is, the chromatin codes, is crucial for decoding the epigenetic state of the cell. Here we present a systematic map of Drosophila chromatin states derived from data-driven probabilistic modelling of dependencies between chromatin factors. Our model not only recapitulates enhancer-like chromatin states as indicated by widely used enhancer marks but also divides these states into three functionally distinct groups, of which only one specific group possesses active enhancer activity. Moreover, we discover a strong association between one specific enhancer state and RNA Polymerase II pausing, linking transcription regulatory potential and chromatin organization. We also observe that with the exception of long-intron genes, chromatin state transition positions in transcriptionally active genes align with an absolute distance to their corresponding transcription start site, regardless of gene length. Using our method, we provide a resource that helps elucidate the functional and spatial organization of the chromatin code landscape.

Fast Direct Methods for Gaussian Processes

Sivaram Ambikasaran, Daniel Foreman-Mackey, L. Greengard, David W. Hogg, Michael O'Neil

A number of problems in probability and statistics can be addressed using the multivariate normal (Gaussian) distribution. In the one-dimensional case, computing the probability for a given mean and variance simply requires the evaluation of the corresponding Gaussian density. In the $n$-dimensional setting, however, it requires the inversion of an $n \times n$ covariance matrix, $C$, as well as the evaluation of its determinant, $\det(C)$. In many cases, such as regression using Gaussian processes, the covariance matrix is of the form $C = \sigma^2 I + K$, where $K$ is computed using a specified covariance kernel which depends on the data and additional parameters (hyperparameters). The matrix $C$ is typically dense, causing standard direct methods for inversion and determinant evaluation to require $\mathcal O(n^3)$ work. This cost is prohibitive for large-scale modeling. Here, we show that for the most commonly used covariance functions, the matrix $C$ can be hierarchically factored into a product of block low-rank updates of the identity matrix, yielding an $\mathcal O (n\log^2 n) $ algorithm for inversion. More importantly, we show that this factorization enables the evaluation of the determinant $\det(C)$, permitting the direct calculation of probabilities in high dimensions under fairly broad assumptions on the kernel defining $K$. Our fast algorithm brings many problems in marginalization and the adaptation of hyperparameters within practical reach using a single CPU core. The combination of nearly optimal scaling in terms of problem size with high-performance computing resources will permit the modeling of previously intractable problems. We illustrate the performance of the scheme on standard covariance kernels.
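
For context, the quantity being accelerated can be written down with the standard dense approach. The sketch below evaluates the Gaussian log-likelihood for $C = \sigma^2 I + K$ using a Cholesky factorization, which costs $\mathcal O(n^3)$. It is the baseline direct method the abstract refers to, not the hierarchical factorization the paper proposes. The squared-exponential kernel and all function and parameter names are illustrative assumptions.

```python
import numpy as np


def squared_exponential_kernel(x, amplitude=1.0, length_scale=1.0):
    """Assumed example kernel: K_ij = a^2 exp(-(x_i - x_j)^2 / (2 l^2))."""
    diff = x[:, None] - x[None, :]
    return amplitude**2 * np.exp(-0.5 * (diff / length_scale) ** 2)


def gaussian_log_likelihood(y, K, sigma):
    """Log-density of a zero-mean Gaussian with covariance C = sigma^2 I + K.

    Dense O(n^3) baseline: both the solve with C and log det(C) are
    obtained from a single Cholesky factor C = L L^T.
    """
    n = y.shape[0]
    C = sigma**2 * np.eye(n) + K
    L = np.linalg.cholesky(C)
    # Solve C alpha = y in two steps: L z = y, then L^T alpha = z.
    z = np.linalg.solve(L, y)
    alpha = np.linalg.solve(L.T, z)
    # log det(C) = 2 * sum(log(diag(L))).
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (y @ alpha + log_det + n * np.log(2.0 * np.pi))


x = np.array([0.0, 1.0, 2.5])
y = np.array([0.5, -1.0, 2.0])
K = squared_exponential_kernel(x)
log_lik = gaussian_log_likelihood(y, K, sigma=0.3)
```

Both the quadratic form and the determinant come from the same factorization here, which mirrors the abstract's point that a fast factorization of $C$ yields the inverse and $\det(C)$ together.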

Actomyosin-driven left-right asymmetry: from molecular torques to chiral self organization

S. Naganathan, T. Middelkoop, S. Fürthauer, S. Grill

Chirality, or mirror asymmetry, is a common theme in biology found in organismal body plans, tissue patterns and even in individual cells. In many cases the emergence of chirality is driven by actin cytoskeletal dynamics. Although it is well established that the actin cytoskeleton generates rotational forces at the molecular level, we are only beginning to understand how this can result in chiral behavior of the entire actin network in vivo. In this review, we give an overview of the actin-driven chiralities known to date across different length scales. Moreover, we evaluate recent quantitative models demonstrating that chiral symmetry breaking of cells can be achieved by properly aligning molecular-scale torque generation processes in the actomyosin cytoskeleton.
