Using 1 trillion files helps scientists find a needle in a haystack

Los Alamos National Lab
Dec 5, 2018

By Bradley Wade Settlemyer

The Trinity supercomputer broke a world record for how quickly it generated a trillion files. A Los Alamos National Laboratory scientist, having trouble solving a stubborn research problem, needed some help — his scientific simulations had generated a sea of data, but it took so long to search the data that he couldn’t find the information he needed. He found himself looking for the proverbial needle in a haystack. At the same time, the lab’s storage research team had been hard at work on another classic big data problem: creating massive numbers of files as quickly as possible. The day the team met with the scientist, you could say that Big Science and Big Data put their heads together — and now they’re making history.

A modern laptop computer typically has just short of a million files in all of its folders. And the Trinity supercomputer can create a million files in just about 5 seconds. But in extreme scale simulation, science researchers often deal with quantities far beyond a million. In fact, in this physicist’s simulation, he needed to generate trillions of particles — a million times larger than a million — and then look at the trajectory of only a few of them. Imagine you’re standing in the Sahara looking at trillions of grains of sand around your feet. Your challenge is to locate just one of them, then track its every movement as a dust devil whips through.

If the lab scientist tried to create a file for each of those trillion particles using Trinity, it would take 57 days just to create the files in that folder — and the supercomputer wouldn’t be doing anything else during that time. Trinity is too important to the lab’s stockpile stewardship mission to simply perform this one task for 57 days. A typical day in the life of Trinity supports multiple scientists, each pursuing important research projects in materials science, plasma physics, fluid dynamics — you name it. There had to be a better solution.
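As a rough back-of-envelope check (not a lab measurement), the 57-day figure follows from the rates quoted above: at about a million files every five seconds, a trillion files works out to roughly five million seconds. Here is a short Python sketch of that arithmetic, using only the article's round numbers:

# Rough back-of-envelope check of the 57-day figure, using only the
# rates quoted above (the article's round numbers, not lab data).
files_per_batch = 1_000_000          # Trinity creates about a million files...
seconds_per_batch = 5                # ...in roughly five seconds
target_files = 1_000_000_000_000     # one trillion files

total_seconds = target_files / files_per_batch * seconds_per_batch
total_days = total_seconds / (60 * 60 * 24)
print(f"~{total_days:.0f} days")     # prints "~58 days", in line with the ~57 days cited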

The Ultrascale Systems Research Center in the lab’s High Performance Computing Division is tasked with realizing the next generation of supercomputing. With efforts in storage research, novel computer architectures and extreme scale platform management, the center is uniquely positioned to tackle these seemingly impossible computing challenges. In particular, a collaboration between the center and Carnegie Mellon University had developed an experimental file system designed to support unprecedented numbers of files and folders. It wasn’t obvious it would work, but it seemed like a chance worth taking.

In February and March this year, the scientist began using the experimental file system to track particles on Trinity. It had been a long journey with many obstacles to overcome, but success was finally in sight. Still, it was not until May 2018 that Trinity churned out a trillion files in about two minutes for the first time. That staggering rate translates to about 7 billion files a second, approximately 20,000 times faster than running on Trinity without the new file system. Days later, the pace had jumped to two trillion files in two and a half minutes. The team never set out to create a trillion files. They simply wanted to improve data management for scientists. But when they looked down and saw the trillion files, they felt a brief moment of satisfaction: High-performance computing at Los Alamos continues to lead the way on extreme scale science.

In the high-performance computing universe, speed and efficiency in handling mind-boggling amounts of information are everything. Supercomputers enable previously impossible science, turning lifetimes of data-gathering into minutes. With new tools, scientists can manage ever-growing data streams faster and more efficiently than ever before. And the future of research depends on it.

Next challenge? So-called exascale computing. Running 50 times faster than today’s fastest supercomputers, exascale machines will help scientists simulate complex natural and engineered systems that range from the atomic to the cosmic. That research will include grand challenges in biology, astrophysics, materials and earth systems. Projects like the trillion-file effort are steps toward that exascale goal, and it’s almost here. The U.S. Department of Energy’s Exascale Computing Project plans to have the next superfast generation of computers running by 2021.

Stay tuned.
