On Weak Scaling

by Julius Donnert

Our first task during the hackathon at the WUG meeting last month (see the talks here) was tuning WENO-Wombat on a single Intel Broadwell core at Cray. I used the pretty awesome Cray tools to get performance data:

CrayPat output for a WENO5 run with Wombat.


We got 20% of DP peak on the CPU, or roughly 5 GFLOP/s. The rather high D2 cache miss rate indicates that there is likely some room for optimization. Most of the time is spent in the WENO5 routine and the eigenvectors (60%), and only 12% in memory operations:

CrayPat profile of the WENO solver


A short search in the runtime profile revealed that the bulk of the compute time is spent in two loops: the transformation of the state vectors and the fluxes into the characteristic system via the left eigenvectors:

Optimization annotations of the Cray Compiler to the hottest loops in the WENO solver. Every loop must vectorize to reach even a fraction of peak performance on modern CPUs.


Cray's Fortran compiler auto-vectorizes the loops of course, but the assembly shows quite a few spills, likely because the loop uses too many arrays at the same time. We played with the loop iterations, but could not get more performance out of the code. Maybe this is a case for a compiler engineer ...
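To give an idea of what these loops look like, here is a minimal sketch of a characteristic projection (array names and layout are illustrative, not Wombat's actual code):

    ! Minimal sketch of a characteristic projection w = L q, the kind of
    ! loop that dominates the WENO5 profile. The zone index i is unit
    ! stride, so the compiler can vectorize the innermost loop.
    subroutine char_project(n, nvar, q, evl, w)
       implicit none
       integer, intent(in)  :: n, nvar
       real(8), intent(in)  :: q(n, nvar)         ! state vectors
       real(8), intent(in)  :: evl(n, nvar, nvar) ! left eigenvectors
       real(8), intent(out) :: w(n, nvar)         ! characteristic variables
       integer :: i, j, k

       w = 0.0d0
       do j = 1, nvar
          do k = 1, nvar
             do i = 1, n                          ! unit-stride, vectorizable
                w(i,j) = w(i,j) + evl(i,j,k) * q(i,k)
             end do
          end do
       end do
    end subroutine char_project

With nvar x nvar eigenvector components live per zone next to the state and flux arrays, it is easy to see where the register pressure comes from.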

Our PRACE allocation (preparatory access) on the HLRS "Hazel Hen" supercomputer in Germany finally started, so I was able to run first performance tests of the new WENO5 solver. I ran a weak scaling test. Here the computational problem grows with the machine - one starts with a small problem on a workstation-sized part of the supercomputer and then moves to larger parts of the machine with a correspondingly larger problem. Following Gustafson's law, this test exposes the increase in overhead (communication, imbalance ...) as more and more nodes work on the problem. As the parallel portion of the work per node stays the same, the increase in the non-parallel part will show up in the run time.
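In numbers: with the work per node held fixed, the weak-scaling efficiency on N nodes is simply the runtime ratio

\[
E(N) = \frac{T_1}{T_N}, \qquad T_N \approx T_\mathrm{compute} + T_\mathrm{overhead}(N),
\]

so ideal weak scaling means a flat runtime, E(N) = 1, and any growth of the overhead term drags the efficiency below one.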

Here is the weak scaling of the current WENO-Wombat master branch: 

Weak scaling of the current WENO-Wombat master branch on Hazel Hen.

It's important to remember that the left of the graph represents a workstation-class problem, while the right is 4.5 billion resolution elements on 96,000+ cores, using half of a top 50 supercomputer (Hazel Hen is currently number 27).

For the test we used a rather small problem, and the machine was not dedicated to the run, i.e. network traffic from other users slows down our simulation. Nonetheless, WENO-Wombat scales very efficiently to large computers and problems. At 4096 nodes, the simulation already approaches our target resolution for a cosmological run, with 4096^3 zones of WENO MHD. The throughput on Broadwell/Haswell CPUs seems to be 0.5 million zones per second per node for the WENO solver. So we achieved a performance of ~150 TFLOP/s on Hazel Hen, which should be about 5% of its peak performance. That peak is reached only in synthetic benchmarks, so for a real-world application 5% is quite good - we achieved 20% on a single core.
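A quick back-of-envelope check of these numbers (the per-zone FLOP count is inferred from the quoted figures, not measured):

\[
4096\,\mathrm{nodes} \times 0.5 \times 10^{6}\,\frac{\mathrm{zones}}{\mathrm{s\,node}} \approx 2 \times 10^{9}\,\frac{\mathrm{zones}}{\mathrm{s}},
\qquad
\frac{150 \times 10^{12}\,\mathrm{FLOP/s}}{2 \times 10^{9}\,\mathrm{zones/s}} \approx 7 \times 10^{4}\,\frac{\mathrm{FLOP}}{\mathrm{zone\ update}},
\]

which seems a plausible cost for a WENO5 MHD zone update with characteristic decomposition.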

In terms of throughput, Wombat's second order TVD code is 10 times faster than WENO5, but the solution is of course much worse. I will discuss this trade-off a bit more in an upcoming proceedings paper that will be submitted to the journal "Galaxies" next week. Look out on arXiv ...

J

Advection

by Julius Donnert

This week we are holding the first Wombat User Group meeting at IRA Bologna. Before I start posting about it (also watch out on Twitter), here is another lovely test that pushes Eulerian methods - the advection of a density square. A square density distribution in hydrostatic equilibrium is advected across the grid at high speed (Mach 3), many hundred times. With Lagrangian methods like moving FV, MFV or SPH this test is of course trivial.
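The setup fits in a few lines; this is an illustrative sketch, not Wombat's exact initial conditions:

    ! Sketch of the square advection setup; values are illustrative.
    subroutine square_advection_ic(nx, ny, dx, dy, rho, p, vx, vy)
       implicit none
       integer, intent(in)  :: nx, ny
       real(8), intent(in)  :: dx, dy
       real(8), intent(out) :: rho(nx,ny), p(nx,ny), vx(nx,ny), vy(nx,ny)
       integer :: i, j
       real(8) :: x, y

       do j = 1, ny
          do i = 1, nx
             x = (i - 0.5d0)*dx
             y = (j - 0.5d0)*dy
             rho(i,j) = 1.0d0                       ! ambient gas
             if (abs(x - 0.5d0) < 0.15d0 .and. &
                 abs(y - 0.5d0) < 0.15d0) &
                rho(i,j) = 4.0d0                    ! dense square
             p(i,j)  = 1.0d0  ! uniform pressure -> hydrostatic equilibrium
             vx(i,j) = 3.0d0  ! uniform supersonic advection velocity
             vy(i,j) = 3.0d0
          end do
       end do
    end subroutine square_advection_ic

With Wombat, here is the result: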

As you can see, none of the methods does really well here. TVD completely loses the shape of the square. Standard WENO is not much better, mostly because WENO is only third order at critical points. An easy fix is available in the form of the WENO-Z method (Borges et al. 2008), which is also cheap to compute. Unfortunately, it is also less robust, so for actual applications it is not always clear whether WENO-Z is beneficial. In this test it is a huge improvement though.
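For reference, WENO-Z changes only the non-linear weights: with the usual smoothness indicators \beta_k, ideal weights d_k and a small \epsilon, Borges et al. 2008 define

\[
\tau_5 = |\beta_0 - \beta_2|, \qquad
\alpha_k = d_k \left( 1 + \frac{\tau_5}{\beta_k + \epsilon} \right), \qquad
\omega_k = \frac{\alpha_k}{\sum_m \alpha_m},
\]

which restores full fifth order at critical points at essentially no extra cost. Here is the L1 error over time: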

L1 error of the square advection test over time.

Can one do better? Yes, but it is computationally very expensive.

In discontinuous Galerkin methods, the scheme evolves a full n-th order polynomial in each zone in time. This leads to a significant improvement in the advection test. However, the timestep is reduced by a factor of the order of the scheme, and the scheme needs to store several coefficients per zone. A good way to think about DG is as a Fourier sub-grid method.
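The timestep penalty follows from the usual RKDG stability limit (Cockburn & Shu): for polynomials of degree N,

\[
\Delta t \lesssim \frac{C_\mathrm{CFL}}{2N + 1}\,\frac{\Delta x}{|\lambda_\mathrm{max}|},
\]

so a third order scheme (N = 2) runs at a fifth of the finite-volume timestep - which matches the CFL = 0.2 quoted below.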

Here are the results from such an implementation, the TENET code by Schaal et al. 2015, which needs CFL=0.2 (WENO runs at CFL=0.8!):

Square advection test with a third order Discontinuous Galerkin scheme from Schaal et al. 2015.


Clearly this is much better. A comparison with their L1 error shows that our WENO5-Z performs a little better than a second order DG scheme, and worse than a third order DG scheme, at a fraction of the cost. An honorable mention goes to the TVD scheme, which performs much better than AREPO's PLM scheme on a fixed mesh.

So for problems that are not strongly advection dominated, WENO-Z remains a good option. For advection it is much better than TVD or PPM, but not as good as DG. For a good recent review of high order Eulerian schemes, I can recommend Balsara 2017.

Finally, our PRACE project on the Cray XC40 "Hazel Hen" at HLRS started yesterday, so look forward to scaling plots in the coming weeks ...


The Implosion Test: Noise and the Illusion of Formal Correctness

by Julius Donnert

This is a fun one! The implosion test sets up an overdensity in a corner of a periodic box and lets the shocks run through the computational domain.
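For orientation, here is a sketch of implosion-type initial conditions in the spirit of Liska & Wendroff 2003 (the classic test puts an underdensity in the corner; the exact values in Wombat's setup differ):

    ! Sketch of implosion-type initial conditions: a diagonal
    ! discontinuity in one corner drives shocks through the whole box.
    ! Values are illustrative, not Wombat's exact setup.
    subroutine implosion_ic(nx, ny, dx, dy, rho, p, vx, vy)
       implicit none
       integer, intent(in)  :: nx, ny
       real(8), intent(in)  :: dx, dy
       real(8), intent(out) :: rho(nx,ny), p(nx,ny), vx(nx,ny), vy(nx,ny)
       integer :: i, j
       real(8) :: x, y

       do j = 1, ny
          do i = 1, nx
             x = (i - 0.5d0)*dx
             y = (j - 0.5d0)*dy
             if (x + y < 0.15d0) then   ! corner region
                rho(i,j) = 0.125d0
                p(i,j)   = 0.14d0
             else
                rho(i,j) = 1.0d0
                p(i,j)   = 1.0d0
             end if
             vx(i,j) = 0.0d0            ! gas starts at rest
             vy(i,j) = 0.0d0
          end do
       end do
    end subroutine implosion_ic

With our new WENO solver, the test looks like this at 200^2 zones: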

This test is interesting because the flow is highly non-linear. That means tiny differences in the algorithm and its implementation become apparent extremely quickly. In particular, the test exposes noise in the computation: tiny fluctuations in the fluid quantities around the "correct" solution. This is likely why Sijacki et al. 2012 chose it to compare a traditional SPH code with a moving mesh code. SPH performs pretty badly in this test because of its inherent sub-grid noise. Here is their result at t=0.7:

Implosion test at t = 0.7 from Sijacki et al. 2012.

AREPO performs quite well for a second order code. There is as much structure as in our WENO result, but also more noise. Here is the result from a second order grid code (TVD, courtesy of T. Jones): 

2D implosion test at t=0.7 with a (rather diffusive) second order TVD code (T. Jones, unpublished)


AREPO clearly has more structure at the same resolution (all tests are 200^2 zones). 

So which result is "correct"? WENO just looks like a sharper version of the TVD result, which is a good sign for the grid codes. Between AREPO and TVD/WENO, there are clear differences in the structure of the "blob" on the bottom left, e.g. in its arms.

We should re-compute the test at very high resolution and compare again. We expect that all codes should converge to the same "correct" solution - or do we?

So, here is WENO at 1024^2 (excuse the rotation):

Implosion test at 1024^2 with WENO.


The high resolution solution of the test shows some features from both codes; other features are completely different! AREPO seems to capture the blobs in the diagonal jet better than WENO, while with the interfaces it seems to be the other way around. However, the central RT instability is not captured correctly by either code. It grows faster at high resolution - that is the curse of non-linear solutions.

To make matters worse, if I had a 1024^2 solution from AREPO to show, it would again look different from the 1024^2 WENO solution above. So there is really no straightforward way to establish formal correctness in this problem!

In scientific applications, we never fully resolve the complex problems we are interested in. That means we will always miss some features in the flow. On the other hand, more small-scale structure in the solution does not mean it is more "correct". The fixed-mesh TVD result is a good representation of the high-resolution solution, even though it shows less structure than the moving mesh code.

Finally, this test was very helpful for us in finding an issue in the eigenvector calculation that injected noise at the level of 1e-8 into the density. That is the real meaning of the test. Here is what we were getting with one of the codes:

WENO5 with numerical noise in the density at the 1e-8 level at t = 0.7.


The solution is clearly not correct. The problem here was the compiler injecting a value of 1e-16 inside a square root computation, where it should be 0. Peter Mendygral might write another blog post about it, because this is really about HPC technology.
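The magnitude fits exactly: sqrt(1e-16) = 1e-8, the noise level we saw in the density. Here is a minimal sketch of the failure mode with a defensive clamp (illustrative, not Wombat's actual fix):

    ! Illustrative sketch of the failure mode, not Wombat's actual code:
    ! an expression that should be exactly zero evaluates to ~1e-16
    ! after optimization, and the square root turns that into ~1e-8 noise.
    program sqrt_noise
       implicit none
       real(8) :: x

       x = 1.0d-16             ! "should" be exactly 0
       print *, sqrt(x)        ! prints ~1.0e-8: the observed noise level

       ! defensive clamp: treat values below a tolerance as exact zero
       print *, sqrt(merge(x, 0.0d0, x > 1.0d-14))
    end program sqrt_noise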

I am taking away from this week-long exercise that finite volume schemes are incredibly sensitive to changes in the implementation - all the seemingly boring code comparison papers are well justified, and we should really be careful not to over-interpret our results.