By Gary Grider
Tucked in the foothills of the Jemez Mountains in northern New Mexico, among the ponderosa pines and under endless blue skies, sits one of the world’s fastest computers. Trinity is a 42-petaflop supercomputer (a petaflop is one quadrillion floating point operations per second, in case you’re counting) that resides at Los Alamos National Laboratory and can perform complex 3D simulations of everything from ocean currents to asteroid impacts.
While a remote mountain town might seem an odd place for this computer to call home, it makes sense when you consider Los Alamos’ history. The Lab was founded during World War II as the site of the top-secret Manhattan Project, where scientists toiled away to build the first atomic bomb. What they didn’t realize is that, in the process, they were ushering in the era of Big Science. Today, Big Science brings together theory, modeling, experiments that produce massive amounts of data, and supercomputers that run incredibly sophisticated simulations, providing feedback and validation for those theories and models. When J. Robert Oppenheimer led his all-star team of scientists to unravel the secrets of the atom, they were embarking on an integrated research program at a scale the world had rarely seen.
Computers were here from the start. During the war years, the term “computers” applied to mathematicians—mostly women—who worked the differential equations by hand, with help from mechanical desktop calculators and simple punch-card machines from IBM. These were the first steps in the process of inventing how to use computers. Los Alamos scientists went on to run the first production job on the world’s first general-purpose electronic digital computer, ENIAC, and Nicholas Metropolis spearheaded development of the Lab’s own computer, playfully dubbed MANIAC, in 1952, to continue the work of modeling nuclear processes.
Working with corporate collaborators, the Lab has been stretching the boundaries of computing ever since, with innovation following innovation as the Lab’s computers often topped the list of the fastest in the world. In 2008, the Lab’s Roadrunner supercomputer became the first to break the petaflop barrier, processing a thousand trillion floating point operations each second. That kind of speed enables resolution in simulations that would have been unimaginable 70 years ago. In a global ocean climate model, for example, scientists can look at individual eddies in an ocean current. (See image below.)
None of these computers were “plug and play.” For each one, the Lab and its corporate partners developed new software and hardware to make it run. Those innovations benefited public and private computer users everywhere, from how best to network very large clusters of computer processors to how to manage the data they produced.
Roadrunner’s petaflop speed, for instance, spun out data at unprecedented rates during simulations that ran for many months. Storage technology in that era struggled to keep up with the machine’s ability to generate and consume data. During long-running calculations at very large scale, with thousands of processors operating for weeks to months, failures occur, potentially several per day. One method for dealing with these recurring and somewhat random failures is checkpoint-restart, in which the application periodically saves a snapshot of its current state to guard against the next failure. After a failure, the program restarts from the most recent checkpoint instead of from the beginning, so it can continue for long periods and make forward progress toward a meaningful scientific result.
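For readers who want to see the idea in miniature, here is a minimal Python sketch of checkpoint-restart. It is an illustration only, not the Lab’s software: the toy "simulation" just sums integers, and the function names, the JSON file format, and the `fail_at` knob that simulates a crash are all assumptions made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the snapshot atomically: temp file first, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach storage before the rename
    os.replace(tmp_path, path)  # readers see either the old or the new checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every, fail_at=None):
    """Toy simulation: sum the integers 1..n_steps, checkpointing periodically.

    fail_at is a hypothetical knob that simulates a crash once that many
    steps have completed.
    """
    state = load_checkpoint(path)  # resume from the last snapshot, if any
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["step"] += 1
        state["total"] += state["step"]
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run crashes after step 37 with a checkpoint every 10 steps, calling `run` again with the same path picks up from step 30, so only seven steps of work are repeated.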
If the stable storage that holds checkpoints is too slow, then computing time is lost either through spending too much time checkpointing, which bogs down the program, or by not checkpointing, which amplifies the effect of each failure.
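The trade-off can be put in rough numbers. The Python sketch below uses a standard first-order model from the fault-tolerance literature (Young’s approximation), which is not taken from this article: the fraction of machine time lost is the checkpoint write time divided by the checkpoint interval, plus the expected rework after a failure, about half an interval per mean time between failures. The function names and the sample figures are assumptions chosen for illustration.

```python
import math


def waste_fraction(interval, write_time, mtbf):
    """First-order fraction of machine time lost.

    write_time / interval   -> time spent writing checkpoints
    interval / (2 * mtbf)   -> expected work redone after each failure
    All three arguments are in the same unit (seconds here).
    """
    return write_time / interval + interval / (2.0 * mtbf)


def young_interval(write_time, mtbf):
    """Checkpoint interval that minimizes waste_fraction (Young's approximation)."""
    return math.sqrt(2.0 * write_time * mtbf)


# Hypothetical numbers: one failure every 8 hours, and a checkpoint that takes
# 20 minutes on slow storage versus 1 minute on fast storage.
MTBF = 8 * 3600
for label, write_time in [("slow storage", 20 * 60), ("fast storage", 60)]:
    best = young_interval(write_time, MTBF)
    lost = waste_fraction(best, write_time, MTBF)
    print(f"{label}: checkpoint every {best / 60:.0f} min, {lost:.1%} of time lost")
```

Under these assumed figures, even the best possible interval loses roughly 29 percent of the machine’s time with the slow storage and about 6 percent with the fast storage, which is why faster checkpoint storage matters so much.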
The challenge intensified with Trinity, with its Cray architecture and two kinds of Intel processors. When fully installed, it will run about 40 times faster than Roadrunner and will have memory roughly equal to that of all the laptops in New Mexico. That performance would only make the checkpoint problems worse. But several years ago, we invented burst buffers, paving the way for Trinity. Using solid-state flash memory, similar to the memory in the average smartphone, burst buffers take the rapid-fire data off the supercomputer processors and dole it out to slower disk drives while keeping the data handy for a restart. Performance improves, and flash memory for burst buffers is cheaper than disk drives when measured by cost per unit of bandwidth (basically, access speed).
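The burst-buffer idea can be sketched as a small two-tier store in Python. This is a toy model under stated assumptions, not Trinity’s actual software: the `BurstBuffer` class and its methods are hypothetical, plain dictionaries stand in for flash and disk, and draining is an explicit call here, whereas a real system would move data in the background.

```python
from collections import OrderedDict


class BurstBuffer:
    """Toy two-tier store: a small fast tier in front of a large slow tier."""

    def __init__(self, capacity):
        self.capacity = capacity   # max number of checkpoints the fast tier holds
        self.fast = OrderedDict()  # stands in for flash, oldest entry first
        self.slow = {}             # stands in for disk
        self.dirty = set()         # names in the fast tier not yet copied to disk

    def write(self, name, data):
        """Absorb a burst: land the data in the fast tier without touching disk."""
        if name not in self.fast and len(self.fast) >= self.capacity:
            self._make_room()
        self.fast[name] = data
        self.fast.move_to_end(name)
        self.dirty.add(name)

    def drain(self, max_items=None):
        """Copy not-yet-saved data to the slow tier; returns how many were copied."""
        names = [n for n in self.fast if n in self.dirty]
        if max_items is not None:
            names = names[:max_items]
        for n in names:
            self.slow[n] = self.fast[n]
            self.dirty.discard(n)
        return len(names)

    def read(self, name):
        """Serve a restart from the fast tier when possible, else from the slow tier."""
        if name in self.fast:
            return self.fast[name]
        return self.slow[name]

    def _make_room(self):
        """Evict the oldest fast-tier entry, saving it to the slow tier first if needed."""
        oldest = next(iter(self.fast))
        if oldest in self.dirty:
            self.slow[oldest] = self.fast[oldest]
            self.dirty.discard(oldest)
        del self.fast[oldest]
```

The application’s `write` returns as soon as the data is in the fast tier, so the simulation can get back to computing; the slower copy to disk happens later, and the most recent checkpoints stay in the fast tier for a quick restart.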
Burst buffers were installed for the first time on Trinity to support its crucial nuclear stockpile simulations. Other Department of Energy laboratories, academia, corporations, and European supercomputer user sites are rapidly adopting this new technology. Our software engineers also went on to develop another storage tool that allows supercomputers to save extremely large data sets for years on relatively inexpensive devices similar to those used by cloud-based businesses like Amazon. Cloud-style inexpensive disk storage had not been applied to high performance computing before.
Trinity and these storage tools continue the tradition of close collaboration between Los Alamos and computer vendors on the very latest developments in computing technology. Big Science and its constant companion, Big Data, rely on the most advanced computers to simulate how the world works or to solve a mystery whose solution hides in a vast sea of data. We take on challenges at a grand scale, from climate modeling to genetics, earthquakes to cancer, black holes to nuclear physics. It is work that tests the limits of computing superpower. The computing innovations we develop to solve these problems give others the tools to address more everyday problems, such as simulating a car crash to improve real-world safety. That research ultimately enriches everyone’s life.
Gary Grider is division leader of High Performance Computing at Los Alamos National Laboratory and a recognized international expert on supercomputing.