Achieving More With Less: Optimizing Efficiency in Supercomputing

Flatiron Institute software engineer Géraud Krawezik works to minimize supercomputer energy use — with a focus on improving software performance for scientists.

Géraud Krawezik is a software engineer in the Scientific Computing Core. Credit: John Smock

Supercomputers, which harness the power of many thousands of interconnected processing cores, use an immense amount of energy, with some drawing as much as 30 megawatts. Over the course of a year, such a machine consumes as much electricity as a small city.

Software engineer Géraud Krawezik works to make sure supercomputing power is used efficiently at the Flatiron Institute’s Scientific Computing Core, or SCC. With projects ranging from developing code to testing new hardware, Krawezik aims to optimize the center’s computing. Through this work, he helps scientists write better code to speed up scientific discovery, and he moderates the center’s energy consumption with innovative techniques for allocating power resources more effectively.

Before joining the Flatiron Institute, Krawezik worked at several corporations and startups on a range of research, computing and development projects, including scientific software development, graph algorithms for data analysis, and projects in finance and consumer electronics. He earned his Ph.D. in high-performance computing from Paris-Sud University and completed a postdoctoral research position at the University of Illinois Urbana-Champaign.

 

What are you currently working on?

Right now, I have two main projects. The first is helping the structural and molecular biophysics group, which is a collaboration between the institute’s Center for Computational Biology and Center for Computational Mathematics. These researchers build molecular dynamics simulations of protein folding and compare them with cryogenic electron microscopy data. These types of simulations and observations are useful for understanding a whole host of biological processes at the molecular level — from how certain proteins react to temperature changes to how the coronavirus uses its spike protein to infiltrate our cells. I help them by making sure everything on the computing end runs smoothly: optimizing software, configuring compute nodes, and installing and setting up programs.

Coronavirus spike protein, simulated (left) and as seen under cryo-EM after classification (right). Importantly, the conformation obtained in the simulation is similar to what is seen under a microscope. Left: Simulation by SMBp, rendering by Phu Tang. Right: Géraud Krawezik, original micrographs from Pilar Cossio.

In my experience, scientific software efficiency varies dramatically between different packages. There are actually very few programs that can utilize supercomputers to their full potential, and that’s largely because most of them were not developed by software specialists. Often when scientists write code, they just want it to work and do not — or don’t have time to — focus on making it efficient. By handling the technical challenges of computing and working with large datasets, my team allows Flatiron Institute scientists to focus purely on their scientific research.

This ties into the second part of my work, where I essentially make machine learning algorithms play nice with disk storage. Whenever people train neural networks, they start with large datasets. These datasets take up a lot of room and consist of millions of files that the computers have to access in random order. That access pattern is inefficient for the storage systems used in computing centers, which were often designed for single large files. I’m working to streamline this by improving the software so the computers are more efficient both as they compute and as they read from storage.
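
As one illustration of the general idea (not the specific tooling used at the SCC), the sketch below bundles many small training files into a handful of large tar shards so that the storage system sees long sequential reads instead of millions of random ones. The directory names and shard size are made up for the example.

```python
import tarfile
from pathlib import Path

# Hypothetical locations; any directory holding huge numbers of small files would do.
DATASET_DIR = Path("dataset/raw")     # millions of small training files
SHARD_DIR = Path("dataset/shards")    # a few large, sequentially readable archives
FILES_PER_SHARD = 10_000


def pack_into_shards(dataset_dir: Path, shard_dir: Path, files_per_shard: int) -> None:
    """Bundle small files into large tar shards so storage sees big sequential transfers."""
    shard_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in dataset_dir.rglob("*") if p.is_file())
    for start in range(0, len(files), files_per_shard):
        shard_path = shard_dir / f"shard-{start // files_per_shard:06d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for f in files[start : start + files_per_shard]:
                # Store paths relative to the dataset root inside the archive.
                tar.add(str(f), arcname=str(f.relative_to(dataset_dir)))


def iter_samples(shard_path: Path):
    """Stream (name, bytes) pairs back out of one shard in a single sequential pass."""
    with tarfile.open(shard_path, "r") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()


if __name__ == "__main__":
    pack_into_shards(DATASET_DIR, SHARD_DIR, FILES_PER_SHARD)
```

A training job can then stream samples shard by shard instead of opening each tiny file individually, which is far friendlier to a parallel filesystem.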

 

Why is efficiency so important in computing?

I think computers should be used optimally no matter what. To give an analogy, a hammer might work if you hit a nail with the handle instead of the head, but that’s a poor use of the tool. More importantly, efficiency matters for the environment. There’s some irony in the fact that if you’re simulating climate change on a supercomputer, you could be contributing to the problem through the energy your computing uses. For example, if your code is poorly optimized, it might be running at 10% of its full performance potential, meaning it has to run 10 times as long. At roughly the same power draw, that means about 90% of the electricity it consumes is wasted, so if we optimize the code, we can save a lot of energy.

We can also save energy by paying more attention to how we use our hardware. A while back I did a test on one of our small graphics processing unit (GPU) clusters that was running at 400 watts per card. I found that lowering this to just 225 watts still gave us 90% of the performance. It’s such a small performance difference that it’s not worth running the GPUs at 400 watts. As a result, we now cap the power of these GPUs, which saves a lot of energy. And since we’re limited in how much power we can put into our data center, it also helps us stretch the power we have to accommodate more hardware.
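
The SCC’s actual mechanism isn’t described here, but as a rough sketch of how such a cap can be applied, the snippet below uses NVIDIA’s NVML Python bindings (the pynvml module) to set a per-card limit and read back the current draw. The 225-watt target simply mirrors the example above; setting a limit normally requires administrator privileges (the command-line equivalent is nvidia-smi’s power-limit option).

```python
# Sketch only: cap GPU power via NVML's Python bindings (pynvml).
# The 225 W target is illustrative; pick a value suited to your own cards and workload.
import pynvml

TARGET_WATTS = 225


def cap_gpu_power(target_watts: int) -> None:
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            # NVML works in milliwatts; clamp the request to what the card allows.
            min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
            new_limit = min(max(target_watts * 1000, min_mw), max_mw)
            pynvml.nvmlDeviceSetPowerManagementLimit(handle, new_limit)  # needs admin rights
            draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000
            print(f"GPU {i}: limit set to {new_limit / 1000:.0f} W, drawing {draw_w:.0f} W")
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    cap_gpu_power(TARGET_WATTS)
```

Running the same workload before and after changing the cap gives the kind of performance-per-watt comparison described above.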

Performance per unit of energy is on the rise in the computing world. At the Flatiron Institute, one of our supercomputers — which can perform 65.091 billion flops, or floating-point operations per second, per watt of power — is now the most energy-efficient in the world. But compared with the theoretical peak, there is still a lot of room for improvement. Our computing centers together use about a megawatt of power, but others, like Frontier, can use 20 or 30 megawatts. Sadly, many of the new processors and attached graphics cards, which are growing more and more energy-hungry to handle more information and computation, are making the problem worse. For these very large systems, power consumption has increased roughly tenfold over the past 20 years. It makes me wonder: Will the supercomputers of 2040 require their own nuclear power plants?

Whenever new computers or new types of processors come out, I like to run benchmark tests to see how they behave. This helps us see if it’s worth upgrading to the latest and greatest or keeping what we already have.
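
A toy version of such a benchmark, which assumes nothing about the SCC’s actual test suite, is to time a large matrix multiplication and report the achieved floating-point rate; real evaluations use established benchmark suites, but the principle is the same.

```python
# Minimal throughput check: time a dense matrix multiply and report GFLOP/s.
# Matrix size and repeat count are arbitrary illustrative choices.
import time

import numpy as np

N = 4096       # matrix dimension
REPEATS = 5    # average over a few runs to smooth out noise


def matmul_gflops(n: int, repeats: int) -> float:
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    a @ b                               # warm-up run, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    elapsed = time.perf_counter() - start
    flops = 2.0 * n**3 * repeats        # a dense n-by-n multiply costs about 2*n^3 operations
    return flops / elapsed / 1e9


if __name__ == "__main__":
    print(f"{matmul_gflops(N, REPEATS):.1f} GFLOP/s")
```

Dividing a measured rate like this by the node’s power draw gives a flops-per-watt number like the one quoted above.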

 

Are the steps you’re taking to optimize computing at the SCC applicable elsewhere?

Yes, certainly. Because other groups outside the Flatiron Institute use the same common software our scientists do, they run into the same problems, so any tweaks and tips we come up with will help them as well. In the coming year I have some software improvement projects I’d like to develop and make available to the scientific research and computing community through a blog or another online platform, because I think they could be widely useful to other groups looking to improve their computing efficiency. It’s increasingly common for computer scientists to collaborate and share information between supercomputing centers. A goal of the Flatiron Institute is to spread knowledge, and as part of that we like to share what we’re learning with other computing centers.