by Julius Donnert

Our first task during the hackathon at the WUG meeting last month (see the talks here) was tuning WENO-Wombat on a single Intel Broadwell core at Cray. I used the pretty awesome Cray tools to get performance data:

CrayPat output for a WENO5 run with Wombat.

We reached 20% of double-precision (DP) peak on the CPU, or roughly 5 GFLOP/s. The rather high D2 cache miss rate indicates that there is likely some room for optimization. Most of the time is spent in the WENO5 routine and the eigenvectors (60%), and only 12% in memory operations:

CrayPat profile of the WENO solver

A short search in the runtime profile revealed that the bulk of the compute time is spent in two loops, which transform the state vectors and the fluxes via the left eigenvectors into the characteristic system:

Optimization annotations of the Cray Compiler to the hottest loops in the WENO solver. Every loop must vectorize to reach even a fraction of peak performance on modern CPUs.

Cray's Fortran compiler auto-vectorizes the loops, of course, but the assembly shows quite a few register spills, likely because the loop uses too many arrays at the same time. We played with the loop iterations, but could not get higher performance out of the code. Maybe this is a case for a compiler engineer ...
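To make the operation in those two hot loops concrete, here is a minimal sketch of a projection onto left eigenvectors, written in Python purely for illustration. It is not Wombat's Fortran code: the function name, the data layout and the toy 2x2 matrix are all assumptions, and the real solver would use the MHD eigensystem instead.

```python
def project_to_characteristic(left_eigvecs, stencil):
    """Project every vector in `stencil` onto the rows of `left_eigvecs`.

    left_eigvecs: n x n matrix as a list of rows, one row per left eigenvector.
    stencil: list of n-component vectors (states or fluxes), one per zone.
    Returns one n-component projected vector per zone.
    """
    n = len(left_eigvecs)
    projected = []
    for vec in stencil:              # loop over the zones in the stencil
        w = [0.0] * n
        for k in range(n):           # loop over the eigenvector rows
            acc = 0.0
            for m in range(n):       # dot product of row k with the zone vector
                acc += left_eigvecs[k][m] * vec[m]
            w[k] = acc
        projected.append(w)
    return projected


# Toy 2x2 matrix (hypothetical, NOT the MHD eigensystem):
# row 0 averages the two components, row 1 takes half their difference.
toy_left = [[0.5, 0.5],
            [0.5, -0.5]]
toy_states = [[1.0, 3.0], [2.0, 2.0]]
toy_result = project_to_characteristic(toy_left, toy_states)
print(toy_result)
```

Even in this toy form, the innermost statement reads one matrix entry and one vector entry per multiply-add. With a larger eigensystem and several stencil arrays live at once, it is easy to see how a compiler could run short of vector registers.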

Our PRACE allocation (preparatory access) on the HLRS "Hazel Hen" supercomputer in Germany finally started, so I was able to run the first performance tests of the new WENO5 solver. I ran a weak scaling test. Here the computational problem grows as the machine grows: one starts with a small problem on a workstation-sized part of the supercomputer and then moves towards larger parts of the machine with equally increased problem size. This is the regime described by Gustafson's law, the weak-scaling counterpart of Amdahl's law, and the test exposes the increase in overhead (communication, imbalance ...) as more and more nodes work on the problem. As the parallel work per node stays the same, any increase in the non-parallel part will show up in the run time.
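As a rough sketch of how such a test is usually summarized: weak-scaling efficiency is the run time on the smallest configuration divided by the run time on N nodes, with the work per node held fixed. The function name and the timings below are made up for illustration; they are not measurements from Hazel Hen.

```python
def weak_scaling_efficiency(runtimes):
    """Return {nodes: T(smallest run) / T(nodes)} for a weak-scaling series.

    runtimes: dict mapping node count -> wall-clock time per step,
    measured with the problem size per node held constant.
    """
    base_nodes = min(runtimes)
    t_base = runtimes[base_nodes]
    return {n: t_base / t for n, t in sorted(runtimes.items())}


# Hypothetical timings in seconds (illustration only, not real data).
example_runtimes = {1: 10.0, 64: 10.5, 4096: 12.5}
efficiency = weak_scaling_efficiency(example_runtimes)
for nodes, eff in efficiency.items():
    print(f"{nodes:5d} nodes: efficiency {eff:.2f}")
```

An efficiency of 1.0 means the run time did not grow at all with the machine size; anything below that is overhead from communication, imbalance and other non-parallel work.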

Here is the weak scaling of the current WENO-Wombat master branch:

It's important to remember that the left of the graph represents a workstation-class problem, while the right is 4.5 billion resolution elements on 96,000+ cores, using half of a Top 50 supercomputer (Hazel Hen is currently number 27).

For the test we used a rather small problem, and the machine was not dedicated to the run, i.e. network traffic from other users slowed down our simulation. Nonetheless, WENO-Wombat scales very efficiently to large computers and problems. At 4096 nodes, the simulation already approaches our target resolution for a cosmological run, with 4096^3 zones of WENO MHD. The throughput for Broadwell/Haswell CPUs seems to be 0.5 million zones per second per node for the WENO solver. So we achieved a performance of ~150 TFLOP/s on Hazel Hen, which should be about 5% of its peak performance. This peak is reached only in synthetic benchmarks, so for a real-world application, 5% is quite good - we achieved 20% on a single core.
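A back-of-the-envelope check of what these numbers imply for the target run, as a small Python sketch. It assumes the quoted throughput means zone updates per second per node and that it would hold unchanged at the 4096^3 target size; the variable names are mine.

```python
# Numbers quoted in the text above.
zones_per_sec_per_node = 0.5e6   # WENO5 throughput per node
nodes = 4096
target_zones = 4096 ** 3         # target resolution of the cosmological run

# Assumption: throughput is in zone updates per second and does not
# change with problem size.
aggregate_zones_per_sec = zones_per_sec_per_node * nodes
seconds_per_update = target_zones / aggregate_zones_per_sec

print(f"aggregate throughput: {aggregate_zones_per_sec:.3e} zones/s")
print(f"one update of all {target_zones:.3e} zones: {seconds_per_update:.1f} s")
```

Under these assumptions the 4096 nodes together update about 2 billion zones per second, so one full update of the 4096^3 grid would take roughly half a minute.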

In terms of throughput, Wombat's second-order TVD code is 10 times faster than WENO5, but the solution is of course much worse. I will discuss this trade-off a bit more in an upcoming proceedings paper that will be submitted to the journal "Galaxies" next week. Look out for it on arXiv ...

J