HPC in Transition: Hunter and Herder Will Bring New Opportunities, New Challenges

Illustration of railway tracks with a light on the horizon.
Image ©HLRS, generated with Firefly by Groothuis.

HLRS’s coming supercomputers will not only enable traditional high-performance computing to reach new heights, but will also support complementary approaches involving AI, deep learning, and data analytics.

The announcement of HLRS’s next-generation supercomputers, Hunter and Herder, in December 2023 marked a significant moment in the history of high-performance computing (HPC) at the University of Stuttgart, charting a path for HLRS to reach exascale performance. This major leap in computing power will help to maintain the center’s status as a leading institute for high-performance computing in Europe.

At the same time, the announcement marks a noteworthy technological shift. Whereas HLRS has offered a predominantly CPU-based architecture for many years, Hunter and Herder will achieve their performance gains by incorporating large numbers of graphics processing units (GPUs). The new systems will make it possible for scientists and engineers who use numerical methods to run larger simulations faster than ever before. In addition, the shift to GPUs will expand HLRS’s ability to support new approaches involving artificial intelligence, machine learning, deep learning, and high-performance data analytics.

This transitional moment at HLRS is emblematic of a range of changes that are taking place in high-performance computing across the world. As the center navigates this new terrain, users can expect both exciting new capabilities and new challenges.

Why HPC is transitioning to GPUs

When Gordon Moore revised his eponymous Moore’s Law in 1975, he predicted that the number of components that could be packed onto an integrated circuit would double approximately every two years. For decades, the steady, rapid increase in computing power bore out this prediction. Today, however, the situation is different: Moore’s Law appears to be approaching its end.

In high-performance computing, most CPU-based supercomputers have been built using an architecture called x86, but this architecture is reaching its limits. Strategies such as adding more cores, shrinking component sizes, or increasing processor frequencies have either reached practical limits or will likely do so soon. Meanwhile, simply building larger supercomputers with even more CPUs would be financially and environmentally unsustainable because of the power and material resources they would require.

Computer manufacturers have been aware of this impending limit for some time and have been developing new architectures to meet ever-increasing demand for higher computing speeds. One of the most popular approaches shifts from CPUs to an architecture that uses graphics processing units as accelerators. Most of the world’s fastest supercomputers now incorporate GPUs, among them the Hewlett Packard Enterprise / Cray systems Frontier and Aurora, currently in first and second place on the Top500 list.

GPUs are simpler in construction than CPUs, but they provide a much faster and more energy-efficient way to execute large numbers of calculations in parallel. Because GPUs run at lower clock speeds, they require less power; they nevertheless outperform CPUs by packing many more cores onto each computing node, which also makes data transfer among cores much more efficient.
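
To make the difference concrete, the short sketch below, written in Python with the JAX library (a convenient illustration, not software used at HLRS), applies one arithmetic rule to an entire array at once. The same code runs unchanged on a CPU or, when one is present, on a GPU, where the compiled kernel is executed across thousands of cores in parallel.

# Illustrative sketch only, using the JAX library: one arithmetic rule applied
# to an entire array at once. JAX compiles this into a kernel that runs on
# whatever accelerator it finds, so the same code executes on CPU or GPU.
import jax
import jax.numpy as jnp

@jax.jit  # compile once; all elements are then processed in parallel
def scale_and_shift(field, a, b):
    # A stand-in for the data-parallel arithmetic at the heart of many solvers.
    return a * field + b

field = jnp.ones((2048, 2048))                  # placeholder simulation field
result = scale_and_shift(field, 2.0, 1.0)
print(jax.devices(), result.shape)              # shows which devices are in use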

The upcoming Hunter supercomputer, arriving in 2025, will be based on the AMD MI300A accelerated processing unit (APU), which combines CPU and GPU processors on a single chip with shared memory. Hunter will offer at least a 50% increase in speed over HLRS’s current flagship machine, Hawk, while the HPE Cray EX4000 system will consume only about 20% of the power that Hawk requires at peak performance. When it arrives in 2027, Herder will expand on Hunter’s GPU-accelerated concept in the form of a much larger exascale machine.

Once they enter production, Hunter and Herder will enable HLRS to support more powerful simulations. For scientific users, this will make it possible to investigate complex phenomena at higher accuracy, to simulate larger systems than in the past, or to more efficiently run repeated simulations that reveal effects of specific parameters or provide greater statistical power for data analysis. The combination of CPUs and GPUs on APU chips will also make it possible to integrate simulation and data analytics more efficiently than was possible in the past, permitting a number of exciting potential applications.

Opportunities at the intersection of simulation and artificial intelligence

In conventional high-performance computing, researchers run large-scale simulations, receive results as data, and then analyze those results for interesting features and patterns by visualizing or processing the data further. This post-processing is often very time-consuming, in many cases taking significantly longer than running the simulation itself. The challenge will only grow as larger supercomputers make it possible to generate even larger datasets.

Dr. Matthias Meinke at the RWTH Aachen University Institute of Aerodynamics (AIA) has used HLRS’s supercomputers for many years to run numerical simulations of turbulent flows. “The problems we want to investigate are getting more complex and will produce larger and larger amounts of data,” he says. “This means that it will be important to investigate how high-performance data analytics and artificial intelligence could help.”

Using APUs, Hunter could enable researchers to integrate artificial intelligence methods directly into their applications and workflows. Here, AI could screen simulation data in real time as it is generated, reducing the need for later post-processing.
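
What such an integrated workflow might look like is sketched below in simplified form. The solver and the feature detector are hypothetical placeholders written in Python/JAX, not the software that will run on Hunter; the point is only that analysis happens inside the simulation loop rather than after it.

# Hypothetical workflow sketch (Python/JAX): analysis runs inside the
# simulation loop instead of as a separate post-processing step.
import jax.numpy as jnp
from jax import random

def advance_one_step(state, key):
    # Placeholder "solver": adds small random fluctuations to the field.
    return state + 0.01 * random.normal(key, state.shape)

def detect_features(field, threshold=3.0):
    # Stand-in for an in-situ AI or analytics model: flag points that
    # deviate strongly from the field's mean.
    z = (field - field.mean()) / (field.std() + 1e-12)
    return jnp.abs(z) > threshold

state = jnp.zeros((256, 256))
key = random.PRNGKey(0)
for step in range(500):
    key, subkey = random.split(key)
    state = advance_one_step(state, subkey)
    if step % 100 == 0:                          # analyze in place, not afterwards
        mask = detect_features(state)
        print(f"step {step}: {int(mask.sum())} grid points flagged for closer study")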

Graduate student Xiang Xu of the University of Stuttgart’s Institute of Materials Science (IMW) foresees advantages of this approach. Recently, he has been using Hawk to perform molecular dynamics simulations of nickel aluminide crystals. This metal has properties that make it attractive for use in high-temperature environments like engines, and Xu’s research focuses on properties at the atomic level that can affect its brittleness. In addition to being able to simulate larger systems of atoms than is possible today, he says that Hunter and Herder will help to save significant time in analysis.

“GPU systems will not only enable larger simulations, but also speed up data post-processing and analysis.” 

— Xiang Xu, IMW, University of Stuttgart

“When doing molecular dynamics simulations we don’t need all of the data for all of the atoms, but are looking for evidence of dislocations, vacancies, or defects that have already been measured at the macro scale,” Xu says. Real-time data analytics tools, he anticipates, will help identify and isolate features of interest more quickly and make it possible to download and store just the data he needs.
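
A rough sketch of that kind of data reduction is shown below. The per-atom "disorder" metric and the threshold are invented for illustration and are not the analysis actually used at the IMW; the idea is simply that only the small fraction of atoms flagged as potentially defective needs to be stored or transferred.

# Hypothetical illustration (Python/JAX) of keeping only "interesting" atoms.
import jax.numpy as jnp
from jax import random

n_atoms = 1_000_000
positions = random.uniform(random.PRNGKey(0), (n_atoms, 3)) * 100.0     # fake MD snapshot
per_atom_metric = random.normal(random.PRNGKey(1), (n_atoms,))          # fake disorder score

# In a near-perfect crystal most atoms score close to the mean; strong outliers
# hint at vacancies, dislocations, or other defects worth keeping.
threshold = per_atom_metric.mean() + 3.0 * per_atom_metric.std()
defect_ids = jnp.where(per_atom_metric > threshold)[0]

compact_snapshot = positions[defect_ids]         # only this small subset is stored
print(f"keeping {defect_ids.size} of {n_atoms} atoms "
      f"({100.0 * defect_ids.size / n_atoms:.2f}% of the snapshot)")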

In addition, approaches using artificial intelligence could help to synthesize and analyze the massive datasets that accumulate at a lab or research institute over many years. AI could provide a global perspective that is easy to miss when researching specific problems or parameters, identifying patterns hidden in data that might not have seemed important in the original simulations but become interesting in the context of larger datasets. Used in this way, data accumulated over many years could become a rich resource for generating new hypotheses and making new discoveries.

Even as Dr. Meinke looks forward to exploring possibilities offered by data analytics approaches — in developing surrogate models of complex simulations, for example — he emphasizes that they should be seen as a complement to traditional methods, not a replacement for them. “We will need to be careful in terms of our expectations and to be critical in determining whether AI provides a general solution or only functions under specific conditions,” he cautions. Research using Hunter and Herder could help in testing these strategies.

Surrogate models and their challenges

Most of HLRS’s supercomputing resources go to scientists in academia with large grants of computing time. For engineers in industry, however, it is often not possible or even desirable to run simulations at such large scales. For this reason, researchers have been investigating how machine learning and artificial intelligence could be used to develop surrogate models of complex systems. Trained on data generated from first-principles simulations, surrogate models replicate the relevant features of complex systems accurately but in a simplified form that can run on a conventional desktop computer. Such models provide engineers in industry with practical tools that are firmly grounded in the best possible research.
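
The following sketch illustrates the principle in miniature, again in Python/JAX. The "simulation" is a toy function standing in for expensive first-principles runs, and the small neural network trained on its outputs is not any particular production surrogate; once trained, however, such a model can be evaluated almost instantly on an ordinary desktop machine.

# Hedged sketch of the surrogate-model idea: fit a small model to
# (input parameters -> simulation result) pairs, then evaluate it cheaply.
import jax
import jax.numpy as jnp
from jax import random

def expensive_simulation(x):
    # Placeholder for the result of a first-principles simulation at parameter x.
    return jnp.sin(3.0 * x) + 0.5 * x

# Training data: a modest number of costly simulation runs.
xs = jnp.linspace(-2.0, 2.0, 64).reshape(-1, 1)
ys = expensive_simulation(xs)

def init_params(key, width=32):
    k1, k2 = random.split(key)
    return {
        "w1": 0.5 * random.normal(k1, (1, width)), "b1": jnp.zeros(width),
        "w2": 0.5 * random.normal(k2, (width, 1)), "b2": jnp.zeros(1),
    }

def surrogate(params, x):
    # A small neural network that stands in for the expensive simulation.
    h = jnp.tanh(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]

def loss(params):
    return jnp.mean((surrogate(params, xs) - ys) ** 2)

params = init_params(random.PRNGKey(0))
grad_fn = jax.jit(jax.grad(loss))
for _ in range(2000):                            # plain gradient descent, for brevity
    grads = grad_fn(params)
    params = jax.tree_util.tree_map(lambda p, g: p - 0.05 * g, params, grads)

# The trained surrogate is now cheap to evaluate anywhere in the parameter range.
print("surrogate(1.3):", float(surrogate(params, jnp.array([[1.3]]))[0, 0]))
print("simulation(1.3):", float(expensive_simulation(1.3)))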

According to Prof. Bernhard Weigand, director of the University of Stuttgart’s Institute of Aerospace Thermodynamics (ITLR), the convergence of supercomputing power, new experimental techniques, and machine learning offers exciting new opportunities to develop better surrogate models. “Measurement technologies have improved in parallel with computational methods and technologies, making it possible to see many details that we couldn’t see numerically,” he says. “The exciting challenge for the coming years is to synchronize these developments, using machine learning to assimilate experimental data and numerical methods to develop better models.”

“Machine learning could help to assimilate experimental data and numerical methods to develop better models.”

— Bernhard Weigand, ITLR, University of Stuttgart

Such methods will also benefit from an emerging approach called physics-informed neural networks. Here, the training process incorporates fundamental physical laws so that the learned models remain grounded in physical reality. This approach ensures not only that the resulting surrogate models are faster than traditional HPC simulations, but also that the predictions they generate correspond to the real world and remain reliable across a wide range of situations.
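
The toy example below suggests how this works in practice. A small network is trained to satisfy a simple differential equation (du/dx = -u with u(0) = 1, standing in for a real physical law); the loss function penalizes violations of the equation itself, so the model is constrained by physics even where no data are available. The network architecture and training details are illustrative assumptions, not a recipe used at ITLR.

# Minimal physics-informed training sketch (Python/JAX) on a toy equation.
import jax
import jax.numpy as jnp
from jax import random

def init_params(key, width=32):
    k1, k2 = random.split(key)
    return {
        "w1": random.normal(k1, (width,)), "b1": jnp.zeros(width),
        "w2": random.normal(k2, (width,)) / jnp.sqrt(width), "b2": jnp.array(0.0),
    }

def u(params, x):
    # Network prediction u(x) for a single scalar coordinate x.
    h = jnp.tanh(x * params["w1"] + params["b1"])
    return jnp.dot(h, params["w2"]) + params["b2"]

def residual(params, x):
    # How badly the governing equation du/dx = -u is violated at x.
    dudx = jax.grad(u, argnums=1)(params, x)
    return dudx + u(params, x)

xs = jnp.linspace(0.0, 2.0, 50)

def loss(params):
    physics = jnp.mean(jax.vmap(lambda x: residual(params, x) ** 2)(xs))
    boundary = (u(params, 0.0) - 1.0) ** 2       # initial condition u(0) = 1
    return physics + boundary

params = init_params(random.PRNGKey(0))
grad_fn = jax.jit(jax.grad(loss))
for _ in range(3000):                            # plain gradient descent, for brevity
    grads = grad_fn(params)
    params = jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, params, grads)

print("PINN u(1.0):", float(u(params, 1.0)), "  exact exp(-1):", float(jnp.exp(-1.0)))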

Achieving this goal does not mean that high-performance computing in its traditional form will disappear. Rather, GPU-accelerated systems will be more critical than ever for running highly accurate simulations and providing advanced AI capabilities. Prof. Wolfgang Schröder, who heads the RWTH Aachen University Institute of Aerodynamics (AIA), sees great potential in AI methods but emphasizes that numerical methods will continue to be the gold standard for computational research. Moreover, AI approaches will have a cost. “When people say that in the future everything will be done using machine learning and AI, they assume that the necessary data are available. In computational fluid dynamics, this isn’t true,” Schröder says. “We need large amounts of computing time to generate enough data — in some cases we might need to run thousands of simulations to come up with a sustainable surrogate model. This will not happen by itself, and it is one reason why physics disciplines will need exascale systems like Herder.”

Training and user support will help in the transition to GPUs

For a variety of technical reasons, algorithms written for CPU-based architectures cannot simply be ported onto GPUs and be expected to run efficiently. Instead, users of HLRS’s supercomputers will, in many cases, need to look closely at how their codes are structured and written in order to ensure that they operate at high performance on the new hardware.
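
The contrast can be seen even in a toy example like the one below, where the same smoothing update is written twice in Python/JAX: once as an element-by-element loop of the kind many CPU codes rely on, and once as a whole-array operation that exposes the parallelism an accelerator needs. Neither version comes from an HLRS code; they only illustrate the kind of restructuring involved.

# Illustrative only: the same stencil update written in two styles.
import jax
import jax.numpy as jnp

def smooth_loop(field):
    # CPU-style thinking: visit one interior point at a time.
    out = field
    n = field.shape[0]
    for i in range(1, n - 1):
        out = out.at[i].set(0.5 * field[i] + 0.25 * (field[i - 1] + field[i + 1]))
    return out

@jax.jit
def smooth_vectorized(field):
    # Accelerator-friendly thinking: express the update on the whole array at
    # once, so every point can be computed in parallel.
    interior = 0.5 * field[1:-1] + 0.25 * (field[:-2] + field[2:])
    return field.at[1:-1].set(interior)

x = jnp.linspace(0.0, 1.0, 2_000) ** 2
assert jnp.allclose(smooth_loop(x), smooth_vectorized(x), atol=1e-6)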

“In recent years we have continually optimized our code to improve its performance on CPUs,” says ITLR scientist Matthias Ibach. “Having access to GPUs offers new opportunities and will be much more effective, but we need to make sure that we continue to utilize the full potential of the new system. If we continue to think in the same ways we have in the past, the performance of the new system could turn out to be worse than what we have now.”

HLRS’s user community will not be alone in making this transition. An important consideration in the center’s decision to contract with Hewlett Packard Enterprise for its next-generation systems is the manufacturer’s commitment to user support. In the coming years, HPE staff with expert knowledge of their systems, working in collaboration with the HLRS user support team, will assist HLRS’s users in this effort.

“We need large amounts of computing time to generate enough data to train AI models. This is one reason why physics disciplines will need exascale systems like Herder.”

— Wolfgang Schröder, AIA, RWTH Aachen

HLRS Director Prof. Michael Resch points out that this is not the first time that the HLRS community has had to migrate to a new technology. “Since its founding in 1996, HLRS has periodically needed to upgrade to new supercomputing architectures to stay at the cutting edge of high-performance computing,” he explains. “In this sense, the challenge that we are currently facing with GPUs is not new, although the specific challenge is real because many of our users have not utilized this type of system before. This is why Hunter is conceived as a transitional system. Supporting our users in making this jump is imperative for us.”

Over the past several years, HLRS has been laying the groundwork for the transition to GPUs by expanding its HPC training program to address skills that users will need. In collaboration with other HPC centers across Europe, HLRS has offered “bootcamp” courses introducing programming models that users can apply to port their codes onto GPUs. Additional courses have also focused on approaches for deep learning and artificial intelligence.

While some research groups are expected to find quickly that their codes can run much faster on GPUs, others might require more support. Methods that simply run clearly defined groups of repetitive calculations will have an easier time, for example, than algorithms that use object-oriented programming to track changing relationships among distinct objects, such as particles. Looking toward the coming years, Prof. Schröder explains, “There is a tension we have to accept between what is expected with respect to our research in fluid dynamics and what is expected in numerical analysis. It will be difficult, but we like such a challenge because we know that writing the next generation of ‘superpower code’ is necessary not just for us but also for HLRS.”

Reliable resources and new opportunities for industry

Whereas many academic scientists actively develop their own advanced codes, researchers in industry who use HPC for simulation typically rely on commercially produced software packages that they adapt for specific applications. In the near term, this could present some challenges in transitioning to GPUs. In some cases GPU-ready versions of these software packages are already available and will work well, but this is not universally the case.

For this reason, HLRS plans to continue offering its industrial users access to x86 nodes on its Vulcan cluster. This long-running, heterogeneous system has been updated over the years to offer industry a range of CPU processor generations. It will continue to support HLRS’s industrial user community, including small and medium-sized enterprises (SMEs), in running the simulations they need to innovate.

At the same time, the availability of GPUs on Hunter and Herder will open the doors to HLRS for new industrial user communities. Following the storm of public interest surrounding ChatGPT and generative AI, many companies are currently exploring how artificial intelligence and data analytics could support their activities, and some are already using applications designed to run on GPUs.

“GPUs offer new opportunities and will be much more effective, but we need to make sure that we utilize the full potential of the new system.”

— Matthias Ibach, ITLR, University of Stuttgart

According to HLRS Managing Director Dr. Bastian Koller, “We have been talking about artificial intelligence for a long time, but with Hunter and Herder, HLRS will gain powerful new platforms to support it. Many users from industry who know what they want to do say they don’t need massive systems; even access to a relatively modest number of processors could move their work forward. The benefit of HLRS’s approach is that we can work together closely with our technology vendors on a small scale in order to find solutions for our users’ specific problems.”

Ready for the future

With Hunter and Herder, HLRS will first and foremost continue to offer its user community world-class tools and services for cutting-edge computational research. Although the transition will present challenges, the ability to perform larger-scale simulations while also having access to powerful AI and data analytics capabilities will offer a flexible platform for science, engineering, public administration, and other communities.

“Because so much is changing right now in high-performance computing, it’s hard to predict exactly how our users will utilize the new systems,” says Prof. Resch. “We are very confident, though, that Hunter and Herder put HLRS on a good course for the future. As our users make the transition to exascale computing with us, we look forward to seeing the exciting new applications and discoveries that will become possible.”

— Christopher Williams