In the past five years, machine learning and neural networks have exploded in size and popularity. But now researchers in the field are facing growing pains. With bigger and more powerful models, these systems are becoming unwieldy, especially for scientists trying to understand their results in scientifically rigorous ways.
Flatiron Research Fellow Brett Larsen has been studying ways to make deep learning models more efficient by developing better ways to train them and better datasets to train them on. This work helps make large deep learning models more accessible for scientific applications and could help improve these models’ accuracy and efficiency for many uses.
Larsen has had a joint appointment at the Flatiron Institute’s Centers for Computational Neuroscience (CCN) and Computational Mathematics (CCM) since 2022. Prior to that, he earned a doctorate from Stanford University and worked as a visiting researcher at Sandia National Laboratories. He also holds a master’s degree in statistics from Stanford, a master’s degree in scientific computing from the University of Cambridge, and a bachelor’s degree in physics and electrical engineering from Arizona State University.
Larsen recently spoke to the Simons Foundation about his work and the future of deep learning.
What are you currently working on?
I work on topics in efficient deep learning, which is a branch of machine learning. Right now, we have more and more computational power, so we’re leveraging this to train bigger and bigger models. This approach has proven incredibly successful in terms of improving our models’ capabilities, but it has also led us to often neglect questions of efficiency for both training and deploying neural networks. The problem is that you can only scale up for so long — it’s not sustainable forever. Efficiency provides another avenue for improvement when there is some computing limit. We’re also already at a point where state-of-the-art models are inaccessible to research groups who don’t have multi-million-dollar budgets for computing, especially those in academia.
Additionally, this approach to deep learning makes it very hard to do science because it takes a lot of repeat experiments to understand the model. This is probably the most important part for me. When you train a model for, say, image classification, you give it a bunch of images for it to label. If you want to publish your results, you have to repeat the experiment multiple times to verify that what you were observing is generally true and not a fluke of that single trial — very similar to how we repeat experiments in the natural sciences. That worked when we had smaller models. But now, because the networks we train are so big and expensive, you’re lucky to get the money or time to do one run. By making models more efficient, we can reduce costs and actually be able to study and understand what the model is doing.
How does your work make the models more efficient?
There are three main approaches you can take to improve efficiency: make a better training algorithm (an area called optimization); reduce the number of parameters (or alternatively improve the architecture without increasing parameters); or train the model with less data. I’m studying the latter two.
I recently finished a project focusing on the parameter issue. This project looked at a topic called ‘neural network pruning.’ When you train a model, it has a bunch of weights that define the function that your network has learned. In 2018, one of my collaborators, Jonathan Frankle, and his advisor, Michael Carbin, proposed an idea called the ‘lottery ticket hypothesis.’ It proposes that you can get rid of a bunch of these weights in a specific way and still be able to train the model to have the same performance on a task like classifying images. Essentially, there’s a smaller network hiding in the overall network that can achieve the same accuracy, and this turned out to be true for many common tasks in machine learning. This smaller network can be an order of magnitude smaller, so finding it has the potential to greatly reduce the computational budget needed to run the network. This is especially helpful if we have a network that we want to run on as little power as possible after training, like one that analyzes photos on your phone.
The process used to find the smaller network is known as iterative magnitude pruning (IMP). Basically, you take the big network, train it and then delete — or prune — some of the weights. You then retrain the model without these weights and keep repeating this process with smaller and smaller networks until the performance falls off. It’s a pretty time-consuming process because you have to keep retraining to identify the next set of weights to prune. Remarkably, though, it turns out you can prune 80 percent to 90 percent of the weights and still get identical performance compared to the original on certain image classification tasks. We wanted to understand the mechanism behind why this works and what eventually stops you from pruning more weights.
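The core prune-retrain loop described above can be sketched in a few lines of Python. This is a toy illustration only, not Larsen's actual code: it uses a random weight vector in place of a trained network and elides the retraining step, tracking only how the mask shrinks as the lowest-magnitude weights are removed each round.

```python
import numpy as np

def magnitude_prune(weights, mask, fraction):
    """One step of magnitude pruning: zero out the smallest-magnitude
    surviving weights. `fraction` is the share of *remaining* weights
    removed this round."""
    surviving = np.flatnonzero(mask)
    n_prune = int(len(surviving) * fraction)
    # Surviving indices, ordered from smallest to largest magnitude
    order = surviving[np.argsort(np.abs(weights[surviving]))]
    new_mask = mask.copy()
    new_mask[order[:n_prune]] = 0
    return new_mask

rng = np.random.default_rng(0)
w = rng.normal(size=1000)   # stand-in for a trained network's weights
mask = np.ones_like(w)      # 1 = weight kept, 0 = weight pruned

# Iterate: prune 20% of the remaining weights per round. In real IMP
# you would retrain (or rewind and retrain) between rounds; that step
# is omitted here.
for _ in range(10):
    mask = magnitude_prune(w, mask, 0.2)

sparsity = 1 - mask.mean()  # fraction of weights removed overall
```

After ten rounds of removing 20 percent per round, roughly 89 percent of the weights are gone, in line with the 80 to 90 percent sparsity the article mentions for certain image tasks.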
The recent work that my colleagues and I just published in ICLR 2023 provides answers to these questions. We asked whether, as you retrain these smaller and smaller networks, you end up at the same solution or just a similar one. You can think about training a network like traveling through a landscape with mountains and valleys. You want to get to the lowest valley, which represents a certain combination of weights for the model’s function, but there can be many valleys in the landscape at the same height. We wanted to know if the pruning takes you to the same optimal valley or just a similar valley. If it ends up being the same valley, that tells you something about how the pruning method is working; it means that we’re deleting the weights in such a way that there still exists a path back to the same valley. Through our experiments, we found the models were indeed going back to the same valley repeatedly, and thus it’s likely not possible to remove the retraining piece of this pruning algorithm to speed it up.
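One common way to test the "same valley" question is to evaluate the loss along the straight line between two trained solutions: if the loss never rises above the endpoints, the two solutions sit in one linearly connected basin. The sketch below is a hedged toy version of that test using a simple quadratic loss, not the paper's actual networks or methodology.

```python
import numpy as np

target = np.array([1.0, -2.0, 0.5])

def loss(w):
    # Toy quadratic "valley" with its minimum at `target`
    return np.sum((w - target) ** 2)

# Two hypothetical solutions reached after pruning and retraining,
# both sitting near the bottom of the same valley
w_a = target + 1e-3 * np.array([1.0, 0.0, -1.0])
w_b = target + 1e-3 * np.array([-1.0, 1.0, 0.0])

# Evaluate the loss along the line segment between the two solutions
alphas = np.linspace(0.0, 1.0, 21)
losses = [loss((1 - a) * w_a + a * w_b) for a in alphas]

# The "barrier" is how far the loss rises above the endpoints.
# No barrier means the two solutions share one connected valley.
barrier = max(losses) - max(loss(w_a), loss(w_b))
```

For two points in the same convex valley, as here, the barrier is zero or negative; between genuinely different valleys, the interpolated loss would spike in the middle.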
We also showed why IMP eventually stops working. At a high level, it’s eventually not possible to find a smaller network with a path back to the same valley because the valley becomes too ‘steep.’ This means we can look at a trained network and determine the maximum percentage of weights that can be deleted to retain the same performance, which significantly speeds up the process.
And what projects are you working on for the data side of making networks more efficient?
I’ve been fascinated lately with how we can curate better datasets. Often models are trained with datasets that basically scrape everything from the internet and throw it at the model. It can be a pain to go through your data and actually check things, so people often ignore this. But it’s incredibly important not to feed random nonsense into your model because then it will learn and regurgitate nonsense; the data defines the patterns that your model learns. So, one of the things I’m working on is a way to quantify what a quality dataset looks like and improve ways to filter datasets so we can better shape the patterns we show our models.
I’m also identifying datasets that can train a model more efficiently. For example, if you’re training a model to identify cats and dogs, there are some images you can use that are easier to identify than others. A picture of a dog standing in a field is an easy-to-identify image compared to a dog in the middle of a crowd or one obstructed by a fence. It turns out that in some parts of training, a model will train faster with the easy-to-identify images, and in other cases it’s the more challenging images that really help.
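One simple way to act on this easy-versus-hard distinction is to score each training example by a reference model's loss and keep only the easiest or hardest subset for a given phase of training. The sketch below is an illustrative heuristic, not Larsen's method; per-example loss is just one of several possible difficulty scores.

```python
import numpy as np

def select_by_difficulty(losses, keep_fraction, easiest=True):
    """Return indices of a subset of examples chosen by difficulty,
    using per-example loss as a stand-in difficulty score."""
    k = int(len(losses) * keep_fraction)
    order = np.argsort(losses)  # easiest (lowest loss) first
    idx = order[:k] if easiest else order[-k:]
    return np.sort(idx)

# Synthetic per-example losses standing in for a real model's scores
rng = np.random.default_rng(1)
example_losses = rng.exponential(size=100)

# Keep the easiest 25% early in training, the hardest 25% later
easy = select_by_difficulty(example_losses, 0.25, easiest=True)
hard = select_by_difficulty(example_losses, 0.25, easiest=False)
```

A training pipeline could then feed the `easy` subset in early epochs and shift toward the `hard` subset once the model has learned the broad patterns.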
Lastly, in contrast to image classification tasks where the model is trained on the same image many times, it is very common to do only one pass through the data in language models. But as we scale up these models, there are more and more cases where you are data limited. I’m also looking at how many times a model can see an example before the returns diminish, and how this depends on your mix of different types of data, such as natural language versus code. Ultimately, it all comes down to finding the right data to feed the model and providing guidance to help people select and curate that data.
How could this work aid some of the larger issues facing deep learning networks today, such as biases and wrong answers?
Addressing these challenges will require the deep learning community to develop an array of tools, but I certainly expect curating and augmenting the training data to play a significant role. For example, even simply tagging certain data so the model learns it is an example of something you don’t want has been shown to be an effective way to shape the model’s behavior.
For my work, the overall goal is to better translate between what you want the model to do and how you shape your dataset. This remains an ambitious challenge, but I hope that one of the end results is to give people tools that help with these larger issues.