LOFAR Paper Splash 2018 - A Cluster-Centric Digest

by Julius Donnert

Over the last few weeks, the LOFAR collaboration dropped a bunch (20+) of new papers on ArXiv - we wanted the data, now we've got them. This was part of the initial data release of the LOFAR Two-metre Sky Survey (LoTSS). It is worth taking a closer look at a few of these new observations related to diffuse cluster emission. The images are spectacular and unveil the complexity of extra-galactic radio emission - a field that was once a niche in astronomy.

Before you click your way through the gallery of observations, let's appreciate that every single image was reduced from >17 TB of complex radio data by a PhD student in Leiden, Bologna or Hamburg. Radio astronomy has a steep learning curve - this is rocket science ...

So thank you to the students for the hard work and the frustration that it takes to gnaw on the data until the noise level is down at the micro-Jansky level, the bright sources clean beautifully and the flux scale is accurate to a few percent. This is hard …

This is a fantastic new glimpse into the faint radio sky at 150 MHz. Click on the image to go to ADS and read the paper.

Of course many other papers were part of the LOFAR splash, some of them very technical, others focusing on radio galaxies (some of them always active), quasars, AGN, blazars, recombination lines, famous fields, redshifts, optical counterparts, reionization and higher-frequency radio follow-ups with the Indian GMRT - pick your favorite ! The instrument pipeline, especially for the LBA (70 MHz), is still under heavy development, so the data quality will become even better over the next years.

As a final remark, let's remember that the first LoTSS data release covers only ~5 percent of the northern sky, so we will see some amazing science in the next years. Stay tuned !

Current sky coverage of LoTSS (Shimwell et al. 2018). Green is published, red is observed, yellow is scheduled, black is unobserved.

Quo Vadis Magnetic Dynamo ?

by Julius Donnert

Recently a colleague asked me: "Why are you actually doing this ? We have lots of cluster simulations, also with MHD. This push for resolution and fidelity, what will we get out of this ?" This is a very valid question - after all, our work costs taxpayer money.

My answer went somewhere along the following lines. We are seeing an incredible flood of radio observations, from instruments that each represent hundreds of millions of Euros/Dollars of public investment. JVLA, MWA and LOFAR are producing absolutely amazing data: very sensitive (micro-Jy), high-S/N, high-resolution (arcsec) observations of large parts of the sky. These observations show incredible detail and complexity. If we don't model the data, we will never understand them.

My personal poster child is the Sausage relic, because I have worked extensively on the object. Let's have a look at the current data: in the image below you see an "old" GMRT map of the relic at 610 MHz (van Weeren et al. 2010) above more recent maps from the JVLA (di Gennaro et al. 2018) and LOFAR (Hoang et al. 2017).

The Sausage Relic. Top: at 610 MHz with the GMRT (van Weeren et al. 2010). Middle: with the JVLA at 1.5 GHz (di Gennaro et al. 2018). Bottom: with LOFAR HBA at 150 MHz (Hoang et al. 2017). Notice how the relic gets "thicker" with decreasing frequency, because the lifetime of cosmic-ray electrons is longer at lower energies.
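To put a rough number on that (my own back-of-the-envelope scaling, not a result quoted from the papers above): an electron emitting synchrotron radiation at frequency ν in a magnetic field B cools on a timescale

    t_{\rm cool} \propto B^{-3/2}\, \nu^{-1/2}

(ignoring inverse-Compton losses off the CMB), so for the same field the electrons LOFAR sees at 150 MHz live roughly \sqrt{10} \approx 3 times longer than those the JVLA sees at 1.5 GHz - hence the broader emission region downstream of the shock.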

Isn’t that amazing ? This is a 2 Mpc shock monster at the virial radius of a galaxy cluster resolved at 5 kpc ! The GMRT observation 8 years ago was very good already, but JVLA and LOFAR are even better. That dot in the white box in the middle image, that’s actually the JVLA beam. And more data is incoming: Dr. Andra Stroe (CfA) has a JVLA project running on the Sausage to get data above 10 GHz - absolutely exciting stuff !

In the JVLA data the relic dissolves into filaments: Are these flux tubes ? Or is the shock intrinsically irregular ? Is this just MHD we are seeing, or do we need more plasma physics ?

Using a cosmological simulation, we could find out ! But we need a simulation that actually matches the performance of the radio interferometers: shocks resolved to <5 kpc, i.e. a net resolution of 1 kpc at the cluster outskirts. Nothing we have comes close at the moment, even though computers are fast enough and can handle the storage and memory requirements.

But the data will get even better, because SKA is coming in strong for rotation measures. Its precursors ASKAP and MeerKAT, together with the VLASS survey, will improve the data from ~10 to ~40 RM sources in nearby clusters. SKA itself will observe a few hundred sources in nearby clusters (Bonafede et al. 2015). SKA will likely observe RMs in filaments (Giovannini et al. 2015) and even polarization in radio haloes, allowing Faraday tomography studies to resolve the Alfvén scale, where the magnetic cascade turns over (Govoni et al. 2015). These data directly test the amplification mechanism for large-scale magnetic fields, the turbulent magnetic dynamo. Plasma theorists are thrilled to find the Alfvén scale in clusters, because it can rule out a basic model for the ICM plasma.

This is extremely hard to simulate accurately. Not necessarily because simulations have to match observations reaching kpc resolution, but because turbulence and magnetic field growth need accurate, expensive MHD fluid algorithms to be resolved close to the grid scale. Second-order MHD algorithms behave like honey for weak motions near the grid resolution and dissipate velocity and magnetic energy into heat. So we have to do better.

To make matters worse, the growth rate, i.e. the speed with which the turbulent dynamo amplifies magnetic fields, depends on resolution as well. It is set by the smallest eddy turnover time in the system (Beresnyak et al. 2016). Resolution-adaptive simulations might bias magnetic field growth toward high-density regions and can introduce a density dependence into field distributions. Existing fancy galaxy formation techniques might get the field intrinsically wrong, because they are all adaptive. I am writing "might", because at the moment there is no unbiased Eulerian simulation to compare to, so we can't be sure ! We are not even sure how Lagrangian galaxy formation codes behave in idealized dynamo simulations …
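To make the resolution dependence explicit (a textbook small-scale dynamo scaling, my shorthand rather than a result from any particular paper): for a Kolmogorov-like velocity cascade the eddy turnover time shrinks toward small scales,

    \tau_{\rm eddy}(\ell) \sim \ell / v_\ell \propto \ell^{2/3}, \qquad \Gamma_{\rm dynamo} \sim 1/\tau_{\rm eddy}(\ell_{\rm min}),

so the growth rate is dominated by the smallest eddies the code can sustain. In a simulation that scale sits near the (effective) dissipation scale of the numerical method, which is why the measured growth rate keeps changing with resolution - and, in adaptive codes, with the local refinement level.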

If you're interested in the nitty-gritty details, I've led a review article about magnetic field growth in clusters and the large-scale structure, where we discuss these issues at length: Donnert, Vazza, Brueggen, zuHone SSRv 2018.

Finally, this is why high-order cosmological simulations with WOMBAT are the obvious next step. We have the problem, we have the computers for it, all we need are the codes. If you look at the current lineup of supercomputers, we can do these simulations. Performance will reach PFlops and the data are going to be in the PByte regime, just as the LOFAR survey produces a PByte a year (Shimwell, priv. comm.).

So let's go for it - the simulation community will have to rewrite all their codes for the exa-scale anyway !

UPDATE: Our latest paper was featured on HPCwire for MHD at the exa-scale.

Wombat Design Principles - Part 1: Parallelism

by Pete Mendygral

Parallel programming at high performance has never been easy.  HPC researchers and domain scientists go to great lengths to achieve "bare metal" performance from modern HPC systems.  We all look forward to a day when we can write code simply as the mathematics dictates and rely on a sophisticated software stack to figure out and optimize everything: parallelism, dependencies, data layout, etc.  While that day is not yet here, there are powerful tools available to us that allow for productivity and performance if only we start with a broader view of the problem we are trying to solve.

This is the first of a multipart blog where I will give more context and motivation for the design principles in Wombat.  Starting an HPC code effort from scratch is not very common, and I hope these entries are useful to others thinking about new ways to solve their scientific/HPC problems.  In this first part I introduce the approach to parallelism in Wombat and how we map it to the common building blocks of modern HPC systems.

Find the Parallelism

An ideal place to start thinking about programming any HPC application is with the question, "Where is the parallelism?"  For a code like Wombat, the answer is rather easy given the suite of physics and solvers we wish to incorporate.  Everything can be decomposed in the spatial dimensions to expose parallelism: MHD, gravity, dark matter, cosmic rays, and passive particles.  The computational units that leverage parallelism are vector units, cores, and CPUs.  Our job is to identify a spatial decomposition that maps well to these computational units (measured by a high fraction of the possible performance).

Map the Parallelism to the Hardware

The available computational units are packaged into a hierarchy.  Vector units are controlled by cores.  Cores are packaged into CPUs, typically with one or more levels of private cache memory.  CPUs are packaged into nodes, typically with coherent addressable memory.  A node might have more than one CPU with some relatively high-speed link between them.  With these three levels in the hierarchy, it is natural to envision a multi-level spatial decomposition that maps to the hierarchy.  We can roughly outline the type of spatial decomposition and the relevant memory constraint that the computational units impose:

  • Largest spatial sub-domain: CPUs have distinct NUMA (Non-Uniform Memory Access) regions.  A decomposition sub-domain that spans more than one NUMA region may require explicit communication or incur a heavy performance penalty.  Spatial sub-domains should not extend beyond a NUMA region.
  • Intermediate spatial sub-domain: Cores typically have high-performance cache memory (e.g., private level 1, private level 2, and likely shared level 3 caches) that is significantly smaller than the memory available at the CPU NUMA level.  Intermediate spatial sub-domains should not extend beyond a single core's share of the level 3 cache.
  • Smallest spatial sub-domain: Vector units use vector registers as input and output.  Vector registers can hold anywhere from two to eight double precision values depending on the CPU architecture (for example, 128-bit SSE holds two, the 256-bit AVX of Intel Sandy Bridge holds four, and the 512-bit AVX-512 of Intel Knights Landing holds eight), all of which get updated in a single cycle of the vector unit.  For vector units to be efficient they require hundreds or perhaps thousands of vector registers' worth of data, but that data should exist in the controlling core's L1 (level 1) or L2 (level 2) cache.
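As a concrete illustration of the cache constraint on the intermediate level (the numbers are mine and purely illustrative, not Wombat's actual defaults): the memory footprint of a cubic sub-domain of N zones per side carrying n_var double-precision variables is roughly

    M \approx N^3 \times n_{\rm var} \times 8\ {\rm bytes},

so with n_var = 8 conserved MHD fields an 18^3 patch (16^3 active zones plus one ghost layer on each side) needs about 0.37 MB and fits comfortably in a typical per-core cache budget, while a 64^3 patch already needs ~17 MB and spills out of cache.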

Implementing the Constraints

If this is starting to sound complicated, it is because it is complicated.  To make matters worse, not all CPU architectures are the same.  We can outline some general rules for our decomposition, but the same code running on two different HPC systems may require very different tuning for the size of each level of the decomposition.  What can be done to make this easier?

The good news is that in practice, with some planning, it is not too hard to design an implementation that respects these constraints but is also adaptable to specific HPC architectures.  The first thing to do is decide on a parallel programming model for each level of the decomposition hierarchy.  I will defer the discussion of our specific choices to later entries in this series.  For now I will just list them:

  • Largest spatial sub-domains are managed by individual MPI ranks
  • Intermediate spatial sub-domains are operated on by OpenMP threads within an MPI rank
  • Smallest spatial sub-domains are operated on by vector units controlled by the programming language (Fortran) and compiler.  A respectable Fortran compiler will have auto-vectorizing capabilities so long as the programmer writes reasonable code.

The largest and intermediate sub-domains are explicitly programmed in Wombat's data structures.  The smallest sub-domains are explicitly programmed by the loop structures in any of the solvers in Wombat.  The figure below shows a two-dimensional example of how the largest and intermediate decomposition levels are organized in Wombat.

[Figure: two-dimensional layout of the largest and intermediate decomposition levels in Wombat (decomp.png)]

Since we do not know a priori what size any of these levels should be without knowledge of the target HPC architecture, we must design Wombat to be flexible.  A proper optimizing compiler will know how best to generate vector instructions from our loop structures that define the smallest decomposition.  The largest and intermediate sub-domains, however, must be optimized on the target architecture through experimentation.
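To make the three-level mapping concrete, here is a minimal Fortran sketch (an illustration with made-up names, not Wombat's actual data structures): each MPI rank would own a set of patches, an OpenMP loop distributes those patches across threads, and the innermost zone loop is kept simple and contiguous so the compiler can auto-vectorize it.

    ! Illustrative three-level decomposition sketch - not Wombat's real code.
    ! Level 1 (MPI): each rank owns its own block of patches; the MPI setup is
    !                omitted so the example compiles with just a Fortran
    !                compiler and OpenMP (e.g. gfortran -O3 -fopenmp).
    ! Level 2 (OpenMP): threads work on independent patches.
    ! Level 3 (vector): the inner zone loop is branch-free and contiguous,
    !                   so the compiler can auto-vectorize it.
    program decomposition_sketch
       use omp_lib
       implicit none

       integer, parameter :: npatch = 64      ! patches owned by this rank
       integer, parameter :: nzone  = 18**3   ! zones per patch (incl. ghosts)
       real(8), parameter :: gam = 5.0d0/3.0d0
       real(8), allocatable :: rho(:,:), eint(:,:), pres(:,:)
       integer :: ip, iz

       allocate(rho(nzone, npatch), eint(nzone, npatch), pres(nzone, npatch))
       rho  = 1.0d0
       eint = 1.5d0

       ! Level 2: distribute patches over OpenMP threads
       !$omp parallel do private(iz) schedule(static)
       do ip = 1, npatch
          ! Level 3: simple zone loop -> auto-vectorizable
          do iz = 1, nzone
             pres(iz, ip) = (gam - 1.0d0) * rho(iz, ip) * eint(iz, ip)
          end do
       end do
       !$omp end parallel do

       print *, 'threads:', omp_get_max_threads(), ' pres(1,1) =', pres(1,1)
    end program decomposition_sketch

The point of this structure is that the two outer levels (how many patches a rank owns, how large each patch is) stay as tunable parameters, while the innermost level is left to the compiler - which is exactly where the architecture-specific experimentation mentioned above comes in.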

Side Constraints

There are two side constraints that should be highlighted, as they directly impact decisions on the sizes of the largest and intermediate sub-domains.  The first is load imbalance, and the second is the efficient interaction of our parallel programming models, specifically MPI and OpenMP.

Load imbalance comes in two forms.  The first is algorithmically induced; examples are mesh refinement and dark matter.  The second is hardware induced, coming from non-uniformly performing hardware.  The effect of either form is similar.

Mesh refinement creates new, smaller spatial elements that carry with them extra work for a process.  If processes are not uniformly handed new work, the overall parallel execution will be limited by the rate at which the most overloaded process can complete its work.  This can be resolved by re-balancing the work among processes so that it is again even.  A better way to handle this is to reduce the likelihood that it happens in the first place.  For cosmology simulations this can be achieved by noting that at some scale the simulation box is approximately uniform.  If we can make our largest sub-domains as large as possible (still constrained by the hardware), it becomes a lot less likely that any one process is heavily loaded compared to the rest.
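One way to see why bigger sub-domains help (a hand-waving Poisson-statistics argument, not a measurement from Wombat): if the regions that trigger refinement land roughly at random with a mean of n per rank, the relative work scatter between ranks goes like

    \sigma_W / \langle W \rangle \sim 1/\sqrt{n},

so doubling the sub-domain edge length raises n by a factor of 8 and cuts the expected imbalance by nearly a factor of 3, before any explicit re-balancing is needed.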

The problem with making our largest sub-domains as large as possible is that traditionally MPI+OpenMP programming models and library implementations are not designed to work well together.  Increasing the size of the largest sub-domains also means making MPI ranks "fatter", with more OpenMP threads using the cores within a CPU.  All of the usual rules associated with Amdahl's Law apply, and for lots of OpenMP threads to be efficient we need nearly all of the code in Wombat to be multi-threaded, including MPI communication.  When we started writing Wombat, all MPI libraries implemented thread safety with a global thread lock.  This forces MPI calls to serialize across OpenMP threads, which dramatically limits scaling to large numbers of threads.  This problem was a major focus of our initial research into Wombat's design and led to new features in Cray MPICH and Open MPI.  In a later blog post I will describe this problem and its resolution in more detail.