High Performance Computing

Just like the first machines 60 years ago, the most powerful computers on the planet fill large rooms, only they are a trillion times faster. Achieving this performance in real-world applications requires special techniques and intimate knowledge of HPC.

Performance of the fastest supercomputer in the world, from Top500.org

The performance of the largest computers in the world has been increasing steadily over many decades. These machines consist of many thousands of nodes, each with the computing power of a large workstation, connected by a high-performance interconnect such as a fat tree or Dragonfly network. Today, large systems easily reach a petaflop in benchmarks. Within the next 5-8 years, supercomputers are expected to breach the exaflop barrier: 10^18 floating point operations per second.
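
To give a feeling for what such numbers mean, here is a rough back-of-the-envelope sketch of aggregate peak performance. All node counts, core counts, clock rates and FLOPs-per-cycle figures below are illustrative assumptions, not the specifications of any particular machine:

```c
/* Back-of-the-envelope estimate of aggregate peak performance.
 * All numbers below are illustrative assumptions, not vendor specs. */
#include <stdio.h>

int main(void)
{
    const double nodes          = 5000.0; /* assumed number of nodes              */
    const double cores_per_node = 48.0;   /* assumed cores per node               */
    const double clock_ghz      = 2.5;    /* assumed clock frequency [GHz]        */
    const double flops_per_cyc  = 32.0;   /* assumed FLOPs/cycle (wide SIMD + FMA)*/

    double peak = nodes * cores_per_node * clock_ghz * 1e9 * flops_per_cyc;

    printf("Aggregate peak: %.2f PFlop/s\n", peak / 1e15);
    /* ~19 PFlop/s for these numbers; an exaflop system needs ~50x more. */
    return 0;
}
```

Real applications reach only a fraction of this theoretical peak, which is exactly the gap discussed below.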

Cray Inc. is currently developing the first exaflop system for the Department of Energy in the USA. Much of the performance gain in recent years has been driven by accelerators, but HPC may become more diverse:

"In the next years, I expect that we will see a 'cambian explosion' of system architectures in high performance computing"

- P. Mendygral

Thus, codes have to be designed to run on many architectures, which for us demands open standards and excludes vendor-specific techniques like CUDA.
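
As a minimal sketch of what directive-based open standards look like, the loop below uses OpenMP, which is supported by many compilers and architectures. It illustrates the portability argument only; it is not code taken from WOMBAT:

```c
/* Minimal sketch of directive-based parallelism via OpenMP, an open
 * standard supported by many compilers and architectures. This is an
 * illustration of the portability argument, not WOMBAT source code. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {
        a[i] = (double)i;
        b[i] = 1.0 / (i + 1.0);
    }

    /* The same loop compiles and runs on any OpenMP-capable system;
     * no vendor-specific kernel language is required. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + 2.0 * b[i];

    printf("ran with up to %d threads, c[1] = %f\n",
           omp_get_max_threads(), c[1]);
    return 0;
}
```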

However, real-world applications rarely even approach this level of performance. In particular, cosmological simulations are inherently computationally imbalanced due to clustering, which makes good performance hard to achieve. Moreover, astrophysicists are usually not trained in HPC and software development.

The Cray XC40 supercomputer "Hazel Hen" at HLRS Stuttgart, Germany. It is a main target machine for our project.

The requirements of next-generation codes can be roughly summarized as: 

  1. Single core performance: The code approaches a significant fraction of the peak floating point performance of the processors used.

  2. Node performance: The code is able to use multi-core processors efficiently. The operation speed increases linearly as more cores are added, until hardware limitations are exposed.

  3. Multi-node performance: Code efficiency is maintained when thousands of nodes are combined. Despite communication overhead, the overwhelming majority of the work remains parallel (see the sketch after this list).
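
The sketch below illustrates the three levels in a generic hybrid MPI + OpenMP setup. This is one common model, assumed here for illustration only, not necessarily the combination used in WOMBAT:

```c
/* Sketch of the three performance levels named above, using a hybrid
 * MPI + OpenMP model (assumed here for illustration):
 *   1. single core : a simple, vectorizable inner loop
 *   2. node        : OpenMP threads across the cores of one node
 *   3. multi-node  : MPI ranks, one (or a few) per node
 * Compile e.g. with: mpicc -fopenmp -O3 levels.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    static double u[N], flux[N];
    for (int i = 0; i < N; i++) u[i] = (double)i;

    double local_sum = 0.0;

    /* Level 2: threads across the cores of a node.
     * Level 1: the inner loop is a contiguous, branch-free stream
     * that the compiler can vectorize. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 1; i < N - 1; i++) {
        flux[i] = 0.5 * (u[i + 1] - u[i - 1]);   /* toy stencil update */
        local_sum += flux[i];
    }

    /* Level 3: combine results across nodes with a single reduction,
     * rather than frequent global communication. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks, %d threads/rank, checksum %.3e\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```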

The challenge for exascale computations is to excel at all of these points. Good multi-node performance in particular demands inherently local algorithms that avoid global inter-node communication. Many techniques used in current cosmological codes still rely on global algorithms. For example, virtually all cosmological codes use Fourier transform (FFT) techniques to compute gravitational interactions; these involve global data exchanges that require inter-node communication and do not run efficiently on more than a few thousand nodes.
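
The sketch below shows only the communication pattern behind this limitation: in a slab-decomposed distributed FFT, the transform requires a data transpose in which every rank exchanges data with every other rank. The actual 1D transforms and any particular FFT library are omitted, and the buffer size is illustrative:

```c
/* Conceptual sketch of why FFT-based gravity solvers need global
 * communication: with a slab decomposition, the 3D transform requires a
 * data transpose in which every rank exchanges data with every other
 * rank (an all-to-all). The 1D FFTs themselves are omitted here; only
 * the communication pattern is shown. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank holds a slab of the mesh; 'chunk' doubles go to every
     * other rank during the transpose (illustrative size). */
    const int chunk = 1024;
    double *send = malloc((size_t)chunk * nranks * sizeof(double));
    double *recv = malloc((size_t)chunk * nranks * sizeof(double));
    for (int i = 0; i < chunk * nranks; i++) send[i] = (double)rank;

    /* The global transpose: every rank talks to every rank. This step
     * is what keeps FFT-based solvers from scaling to very large node
     * counts. */
    MPI_Alltoall(send, chunk, MPI_DOUBLE,
                 recv, chunk, MPI_DOUBLE, MPI_COMM_WORLD);

    free(send);
    free(recv);
    MPI_Finalize();
    return 0;
}
```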

In this project, we are developing the fluid code WOMBAT to fulfill all of these requirements, even for the largest simulations. The reason is not simply compute time: scalability is the crucial numerical ingredient that enables the solution of the modeling problem posed by modern radio interferometers. Efficient computation on more nodes means more available memory, and thus higher-resolution simulations with more concurrent physics modules. Good single-node performance means that we can afford high fidelity in the crucial MHD solver, like our current WENO solver.
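
A rough memory-scaling estimate illustrates the point about memory and resolution. The number of variables per cell and the memory per node below are illustrative assumptions, not WOMBAT parameters; the point is that doubling the grid resolution multiplies the required memory, and hence the minimum node count, by eight:

```c
/* Rough memory-scaling sketch behind "more nodes = more memory = higher
 * resolution". All numbers are illustrative assumptions, not WOMBAT
 * parameters. Link with -lm. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double nvar     = 8.0;    /* assumed MHD variables per cell  */
    const double bytes    = 8.0;    /* double precision                */
    const double node_mem = 128e9;  /* assumed memory per node [bytes] */

    for (int N = 1024; N <= 8192; N *= 2) {
        double total = pow((double)N, 3) * nvar * bytes;  /* grid state only */
        double nodes = ceil(total / node_mem);
        printf("%5d^3 cells: %8.2f TB -> >= %6.0f nodes\n",
               N, total / 1e12, nodes);
    }
    /* Each doubling of the resolution needs 8x the memory, and
     * therefore 8x the minimum number of nodes. */
    return 0;
}
```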
