The convergence of experimental neuroscience with data science is a fundamental goal of the Simons Collaboration on the Global Brain. The collaboration supports projects that use new technologies to record the activity of large neural populations at single-cell resolution, in combination with mathematical analyses, to investigate how neural coding and dynamics represent and process information relevant to internal cognitive states and behavior. In this same collaborative spirit, CodeNeuro is a unique event that brings together neuroscientists and data scientists to develop new approaches to analyzing data.
Hacking the Brain
It’s 10 p.m. in New York City, and a group of coders huddles around a table, laptops glowing green and black. They are participating in CodeNeuro, a two-day event that brings together neuroscientists and data scientists to address some of the most vexing data-analysis problems in neuroscience today. Twelve hours in, the hackathon is still going strong, as teams of neuroscientists and programmers compete to build the best algorithms to address a particularly challenging problem in neuroscience.
The event was spearheaded by Jeremy Freeman, a group leader at the Howard Hughes Medical Institute’s Janelia Research Campus and an investigator for the Simons Collaboration on the Global Brain (SCGB). Freeman hosted the event along with co-organizers Nick Sofroniew at Janelia, Michael Broxton and Logan Grosenick at Stanford, Matt Conlen at NewInc, and Jeff Hammerbacher at Mount Sinai. Freeman got the idea for CodeNeuro about a year and a half ago, when he realized that neuroscience would need better tools for computing and analysis as the field generated larger and more complicated datasets. He had begun to use cloud-based computing technologies in his lab to analyze his group’s data, eventually turning to a platform called Spark, which has been gaining traction in the business world as a tool to analyze the massive datasets generated by, for instance, social-media users.
Realizing the potential for synergy between neuroscience and data science, Freeman gave a presentation at the 2013 Spark Summit conference about his group’s use of Spark to analyze large-scale neuroscience data. He described how new technology has enabled scientists to observe the activity of neurons in the entire brains of animals: In a zebrafish, for example, researchers can see the simultaneous activity of nearly 100,000 neurons for several minutes. One experiment could easily generate a terabyte of data.
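The terabyte figure is easy to sanity-check with back-of-the-envelope arithmetic. The acquisition parameters below are illustrative assumptions, not the actual settings of the experiments Freeman described:

```python
# Rough estimate of whole-brain imaging data volume.
# All parameters here are assumed for illustration only.
pixels_per_volume = 2048 * 1024 * 40   # x pixels * y pixels * z planes
bytes_per_pixel = 2                    # 16-bit camera output
volumes_per_second = 2                 # whole-brain sampling rate
minutes = 10

total_bytes = (pixels_per_volume * bytes_per_pixel
               * volumes_per_second * 60 * minutes)
print(f"{total_bytes / 1e12:.1f} TB")  # → 0.2 TB
```

Even under these conservative assumptions a ten-minute session produces a fifth of a terabyte, so longer sessions or faster sampling rates reach the terabyte scale quickly.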
Freeman was the only neuroscientist at the conference, and his talk generated a lot of excitement in the Spark community. He decided that the next step was to get some of the Spark Summit people in the same room with neuroscientists, and so CodeNeuro was born. The first CodeNeuro took place last fall in San Francisco, and the New York conference, held in April, was an even bigger event, featuring an evening of speakers, ranging from neuroscientists to geneticists to industry data analysts, followed by a day of software tutorials and a hackathon.
“There was no other venue to get these communities talking together,” said Freeman. “Most of neuroscience uses a simple workflow: Collect data, then analyze it in MATLAB on a single computer,” he added, referring to a widely used scientific computing environment. “But this wasn’t a scalable solution. We realized our needs lined up with what was being offered by Spark.”
Launched in 2009 at Berkeley’s AMPLab, Spark is one of the fastest processing engines for large-scale data; for some in-memory workloads, it can run up to a hundred times faster than Hadoop MapReduce. It’s open source, meaning anyone can view and alter its code, and it operates on datasets typically stored in a distributed cloud system, such as Amazon S3. Spark is catching on rapidly in business and has been used in a variety of industries, including insurance, media and healthcare. Anyone with basic programming skills can use an interface to tell Spark how to analyze data, and Spark responds at speeds single computers can’t match.
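Spark’s core model is simple: apply a computation independently to partitions of a dataset that live on different machines, then merge the partial results. The sketch below imitates that pattern in plain Python, with no Spark involved; the “partitions” and the per-neuron fluorescence samples are made-up stand-ins meant only to show why the pattern scales:

```python
from functools import reduce

# Simulated "partitions" of a dataset of (neuron_id, fluorescence) samples.
# In Spark these would live on different machines; here they are plain lists.
partitions = [
    [(0, 1.2), (1, 0.4), (0, 0.8)],
    [(1, 0.6), (2, 2.0), (0, 1.0)],
]

def map_partition(part):
    """Local aggregation: per-neuron (sum, count) within one partition."""
    acc = {}
    for nid, f in part:
        s, c = acc.get(nid, (0.0, 0))
        acc[nid] = (s + f, c + 1)
    return acc

def merge(a, b):
    """Combine two partial aggregates, as in Spark's reduce step."""
    out = dict(a)
    for nid, (s, c) in b.items():
        s0, c0 = out.get(nid, (0.0, 0))
        out[nid] = (s0 + s, c0 + c)
    return out

partials = [map_partition(p) for p in partitions]  # parallel on a real cluster
totals = reduce(merge, partials)                   # cheap final combine
means = {nid: s / c for nid, (s, c) in totals.items()}
print(means)  # per-neuron mean fluorescence
```

Because each partition is summarized locally before anything is combined, only small partial results cross the network, which is what lets the same code scale from one laptop to hundreds of machines.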
Freeman has developed a suite of Spark-based tools, collectively called Thunder, tailored to the needs of neuroscientists and containing a variety of common data-processing tools and algorithms. He hopes others will add to it, enabling Thunder to serve as a repository for such techniques so that anyone in the neuroscience community can benefit from the best analytical tools available, instead of having to invent them in isolation. “We need to move away from a world where labs independently solve problems,” Freeman said.
Patrick Kaifosh, a graduate student mentored by Larry Abbott, an SCGB investigator who directs Columbia University’s Center for Theoretical Neuroscience, sat quietly at a table listening to the CodeNeuro ground rules. It was Saturday morning, and the hackathon was about to begin.
The problem the teams were assigned is pervasive and notoriously difficult. It has to do with data generated by an increasingly prevalent technique called ‘calcium imaging,’ a method used by many SCGB investigators. In calcium imaging, neurons are genetically modified to light up whenever there’s an increase in calcium ions inside them. An uptick in calcium is thought to reflect an increase in the activity of the neuron, and a specially designed microscope can detect the calcium-triggered bursts of light from hundreds or thousands of neurons at once. The tricky part is that the neurons’ cell bodies are difficult to isolate. Flickering fibers from other neurons light up in the background, making it surprisingly difficult to identify the signal of a single neuron. This problem, called ‘source extraction,’ is what the teams gathered to solve.
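One common family of approaches to source extraction factorizes the imaging movie into spatial footprints and temporal activity traces, for example with non-negative matrix factorization (NMF). The sketch below, run on synthetic data with made-up footprints, is a minimal illustration of that idea, not a reconstruction of any team’s actual entry:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "movie": two neurons with fixed spatial footprints and
# fluctuating activity, flattened to a (time x pixels) matrix.
T, P, K = 200, 64, 2
footprints = np.zeros((K, P))
footprints[0, 5:15] = 1.0      # neuron 0 occupies pixels 5-14
footprints[1, 30:42] = 1.0     # neuron 1 occupies pixels 30-41
traces = rng.random((T, K))
movie = traces @ footprints + 0.01 * rng.random((T, P))

# Plain multiplicative-update NMF: movie ~ W @ H, where rows of H are
# recovered spatial footprints and columns of W are activity traces.
W = rng.random((T, K)) + 0.1
H = rng.random((K, P)) + 0.1
for _ in range(300):
    H *= (W.T @ movie) / (W.T @ W @ H + 1e-9)
    W *= (movie @ H.T) / (W @ H @ H.T + 1e-9)

# Each recovered footprint should peak inside one true neuron's pixels.
print(sorted(np.argmax(H, axis=1)))
```

Real calcium data is far messier than this toy example — footprints overlap, neuropil fluoresces in the background, and the recording drifts — which is exactly why the problem filled a hackathon.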
Each team worked on the same datasets, donated by Adam Packer, a postdoctoral fellow working with Michael Hausser at University College London, and Simon Peron, a postdoctoral fellow working with another SCGB investigator, Karel Svoboda, at Janelia. Because of the particular way these datasets were collected, the actual location of each neuron was known, allowing each team to assess its algorithm’s performance against what they call ‘ground truth.’
Kaifosh is particularly attuned to problems of calcium imaging analysis, since he has designed a suite of algorithms, called SIMA, which addresses problems such as motion correction and source extraction using a variety of sophisticated mathematical techniques. The program has been downloaded by dozens of labs around the world. But, he said, SIMA “is written to run on a single computer.”
He pointed out that one issue with simply downloading an algorithm and pressing ‘go’ is that the data from labs are “highly variable.” That is, the resolution of the data in time and space varies, the type of neuron varies, and the type of calcium indicator used to generate the bursts of light varies. This means there may not be a single algorithm that best suits all datasets.
For this reason, the groups also set out to create metrics to evaluate how well an algorithm performs for a given set of data. The idea is that neuroscientists can run each algorithm on their dataset and use the metrics to determine which one is the best for their analysis. As Kaifosh noted, this is meant to ensure that “every grad student doesn’t have to design and implement every algorithm.”
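A benchmark of this kind needs a scoring rule. One simple, hypothetical metric — not necessarily the one the teams settled on — matches each detected cell center to the nearest unclaimed ground-truth center within a pixel tolerance, then reports precision and recall:

```python
import math

def match_score(found, truth, tol=5.0):
    """Greedily match detected cell centers to ground-truth centers
    within `tol` pixels, one-to-one; return (precision, recall)."""
    unmatched = list(truth)
    hits = 0
    for fx, fy in found:
        best, best_d = None, tol
        for t in unmatched:
            d = math.hypot(fx - t[0], fy - t[1])
            if d <= best_d:
                best, best_d = t, d
        if best is not None:
            unmatched.remove(best)   # each true cell claimed at most once
            hits += 1
    precision = hits / len(found) if found else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

truth = [(10, 10), (40, 12), (25, 30)]
found = [(11, 9), (41, 13), (80, 80)]   # two near-hits, one false positive
print(match_score(found, truth))
```

Precision penalizes spurious detections and recall penalizes missed cells, so reporting both — rather than a single accuracy number — lets a lab choose the trade-off that suits its data.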
In other words, the hackathon wasn’t so much about creating one ‘best’ algorithm, but about creating a variety of algorithms, each with its own strengths and weaknesses. In addition to the teams building algorithms, a small group of programmers worked on the front and back ends of a Web interface that would allow scientists to submit an algorithm or use one. By the time all the efforts begun at the hackathon are complete, there should be an easily accessible set of algorithms and benchmarks for neuroscientists to use.
The algorithms developed at CodeNeuro may have neuroscience applications beyond calcium imaging data. Logan Grosenick, a graduate student at Stanford University, thinks these algorithms could “apply to voltage-sensing data, or imaging neurotransmitters. The data will be similar.” He added that in multiple areas of neuroscience, “our ability to collect data has outstripped our ability to compute it.” Felipe Gerhard’s work with epilepsy data illustrates this. Gerhard, a postdoctoral fellow at Brown University, analyzes data taken from neurons in patients undergoing a neurosurgical procedure to remove a portion of the brain known to be a source of seizures. At the onset of the procedure, a team inserts an array of electrodes to record brain activity in single neurons. “Clinicians don’t have the expertise to analyze the data,” he said. To help them, he combines concepts from data science, statistics and machine learning.
Sean Skwerer, a postdoctoral fellow at the Yale School of Public Health, studies how genetic variants influence schizophrenia. Because so many genes are involved, he said, “it’s very high-dimensional data.” He added, “The problems you see in calcium imaging are problems you see in all kinds of data. These are fundamental problems.”
The applications of the CodeNeuro algorithms may reach beyond biology, a prospect that excited a lot of the non-neuroscientists in the crowd. Paco Nathan, who works at Databricks, a company supporting Spark development, noted, “These same techniques are used in analyzing satellite images, and even in finance.” Maxwell Rebo, a programmer at Factr, which curates and analyzes data for a variety of organizations, including the United Nations, said that his company uses “similar methods for extracting signals in social networks. We want to find when users are in synchrony, just like neuroscientists want to find when neurons are in synchrony.” So it’s possible that advances made by neuroscientists could influence large-scale analytics everywhere.
While the CodeNeuro crowd may be poised to accelerate data analysis and provide new tools for neuroscience and beyond, it’s still difficult to say precisely how these new tools will lead to real scientific advances. But that says more about the state of neuroscience than the abilities of the CodeNeuro teams. At the end of the opening night of CodeNeuro, Freeman moderated a panel of leading neuroscientists: Eve Marder of Brandeis University, a member of the advisory committee to President Obama’s BRAIN Initiative; Larry Abbott; and Anthony Movshon of New York University, an SCGB investigator known for his pathbreaking work in the visual system.
All three cautioned that just having new tools wasn’t enough. “We now have all of these parallel technologies for acquiring data from many cells at the same time,” said Movshon. “But I don’t think as a field we’ve come to grips with what we’re doing with those large sets of data. We aren’t asking the right questions yet.”
Marder was enthusiastic about the possibilities of new and complex analytical tools, but she also expressed the fear that “the next generation may be using tools to process and analyze their data that they fundamentally do not understand.”
Abbott elaborated, stating that in so-called big data, “we have a visualization problem. We’re trying to characterize something in a 2-D graph that occurs in a high dimension. It’s an unusual time in neuroscience: The methodologies are way ahead of the ideas.”
But the types of tools that the CodeNeuro crowd is creating could eventually lead to greater understanding. In a sense, the process is about creating a platform just as much as it is about creating the specific tools to put on it. As Freeman says, “The competition is just a fun way to get everyone together, get them on the same page.” What he’s really aiming for is a way “to bring together different approaches in a common framework that’s flexible and scalable.”
Neuroscience has always been an interdisciplinary effort, and there’s no telling where the next major advance will come from. But if history is any guide, it’s necessary to have the best minds from diverse fields and technologies on the same page.