Wombat Design Principles - Part 1: Parallelism

by Pete Mendygral

Parallel programming at high performance has never been easy.  HPC researchers and domain scientists go to great lengths to achieve "bare metal" performance from modern HPC systems.  We all look forward to a day when we can write code simply as the mathematics dictates and rely on a sophisticated software stack to figure out and optimize everything: parallelism, dependencies, data layout, and so on.  While that day is not yet here, there are powerful tools available to us that allow for both productivity and performance if only we start with a broader view of the problem we are trying to solve.

This is the first of a multipart blog series in which I will give more context and motivation for the design principles in Wombat.  Starting an HPC code effort from scratch is not very common, and I hope these entries are useful to others thinking about new ways to solve their scientific/HPC problems.  In this first part I introduce the approach to parallelism in Wombat and how we map it to the common building blocks of modern HPC systems.

Find the Parallelism

An ideal place to start thinking about programming any HPC application is with the question, "Where is the parallelism?"  For a code like Wombat, the answer is rather easy given the suite of physics and solvers we wish to incorporate.  Everything can be decomposed in the spatial dimensions to expose parallelism: MHD, gravity, dark matter, cosmic rays, and passive particles.  The computational units that leverage parallelism are vector units, cores, and CPUs.  Our job is to identify a spatial decomposition that maps well to these computational units (measured by achieving a high percentage of the possible performance).

Map the Parallelism to the Hardware

The available computational units are packaged into a hierarchy.  Vector units are controlled by cores.  Cores are packaged into CPUs, typically with one or more levels of private cache memory.  CPUs are packaged into nodes, typically with coherent addressable memory.  A node might have more than one CPU with some relatively high-speed link between them.  With these three levels in the hierarchy, it is natural to envision a multi-level spatial decomposition that maps to the hierarchy.  We can roughly outline the type of spatial decomposition and the relevant memory constraint that each computational unit imposes:

  • Largest spatial sub-domain: CPUs have distinct NUMA (Non-Uniform Memory Access) regions.  A decomposition sub-domain that spans more than one NUMA region may require explicit communication or incur a heavy performance penalty.  Spatial sub-domains should not extend beyond a NUMA region.
  • Intermediate spatial sub-domain: Cores typically have high performance cache memory (e.g., private level 1, private level 2, and likely shared level 3 caches) that is significantly smaller than the memory available at the CPU NUMA level.  Intermediate spatial sub-domains should not extend beyond a single core's share of the level 3 cache.
  • Smallest spatial sub-domain: Vector units use vector registers as input and output.  Vector registers can hold anywhere from two to eight double precision values depending on the CPU architecture (for example, Intel Sandy Bridge processors support 256 bit AVX instructions and Intel Knights Landing processors support 512 bit AVX-512 instructions), all of which get updated in a single cycle of the vector unit.  For vector units to be efficient they require hundreds or perhaps thousands of vector registers worth of data, but that data should exist in the controlling core's L1 (level 1) or L2 (level 2) cache.  A rough back-of-the-envelope sizing sketch follows this list.
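
To make the cache constraints concrete, here is a minimal back-of-the-envelope sketch (not Wombat source) that estimates how large a cubic patch of zones can be before its state spills out of a core's cache.  The variable count and cache sizes are assumptions chosen purely for illustration; the real numbers depend on the solver and the target CPU.

  program patch_size_sketch
    implicit none
    ! All numbers below are illustrative assumptions, not Wombat's actual values.
    integer, parameter :: nvars         = 8        ! e.g. density, energy, velocity, magnetic field
    integer, parameter :: bytes_per_val = 8        ! double precision
    real,    parameter :: l2_bytes      = 256.0e3  ! assumed 256 KB private L2 per core
    real,    parameter :: l3_bytes      = 2.5e6    ! assumed ~2.5 MB L3 share per core
    integer :: n_l2, n_l3

    ! Edge length (in zones) of the largest cubic patch whose state fits in each cache level.
    n_l2 = int((l2_bytes / real(nvars * bytes_per_val))**(1.0/3.0))
    n_l3 = int((l3_bytes / real(nvars * bytes_per_val))**(1.0/3.0))

    print *, 'largest patch edge fitting in L2 (zones):', n_l2
    print *, 'largest patch edge fitting in L3 (zones):', n_l3
  end program patch_size_sketch

With these assumed numbers a patch of roughly 15 zones on a side fits in L2 and roughly 33 zones on a side fits in L3.  The exact values matter less than the point that the patch size is a tunable knob driven by cache capacity.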

Implementing the Constraints

If this is starting to sound complicated, it is because it is complicated.  To make matters worse, not all CPU architectures are the same.  We can outline some general rules for our decomposition, but the same code running on two different HPC systems may require very different tuning for the size of each level of decomposition.  What can be done to make this easier?

The good news is that, in practice, with some planning it is not too hard to design an implementation that honors these constraints but is also adaptable to specific HPC architectures.  The first thing to do is decide on a parallel programming model for each level of the decomposition hierarchy.  I will defer the discussion of our specific choices to later entries in this series.  For now I will just list them:

  • Largest spatial sub-domains are managed by individual MPI ranks
  • Intermediate spatial sub-domains are operated on by OpenMP threads within an MPI rank
  • Smallest spatial sub-domains are operated on by vector units controlled by the programming language (Fortran) and compiler.  A respectable Fortran compiler will have auto-vectorizing capabilities so long as the programmer writes reasonable code.  A minimal sketch of how these three levels nest follows this list.
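
As a concrete illustration, here is a minimal sketch (not Wombat source, with invented names and sizes) of how the three levels can nest in an MPI+OpenMP Fortran code: one MPI rank per largest sub-domain, OpenMP threads dividing the intermediate patches among themselves, and a simple unit-stride inner loop the compiler can auto-vectorize.

  program nested_parallelism_sketch
    use mpi
    implicit none
    ! Invented sizes for illustration: each rank owns npatch patches of nx zones each.
    integer, parameter :: npatch = 64, nx = 32768
    real(8), allocatable :: patch(:,:)
    integer :: ierr, rank, ip, i

    call MPI_Init(ierr)
    ! One MPI rank per largest sub-domain.
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    allocate(patch(nx, npatch))
    patch = real(rank, 8)

    ! OpenMP threads share the loop over intermediate patches.
    !$omp parallel do private(i)
    do ip = 1, npatch
       ! The innermost loop is simple and unit-stride so the compiler can vectorize it.
       do i = 1, nx
          patch(i, ip) = 2.0d0 * patch(i, ip) + 1.0d0
       end do
    end do
    !$omp end parallel do

    if (rank == 0) print *, 'done, sample value:', patch(1, 1)
    call MPI_Finalize(ierr)
  end program nested_parallelism_sketch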

The largest and intermediate sub-domains are explicitly programmed in Wombat's data structures.  The smallest sub-domains are explicitly programmed in the loop structures of Wombat's solvers.  The figure below shows a two-dimensional example of how the largest and intermediate decomposition levels are organized in Wombat.

[Figure decomp.png: a two-dimensional example of how the largest and intermediate decomposition levels are organized in Wombat]
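
In code, the same two levels might be expressed with something like the following derived types.  This is a hedged sketch with invented names, not Wombat's actual data structures; the point is that a rank-owned sub-domain holds a collection of patches whose sizes are set at run time.

  module decomposition_types_sketch
    implicit none

    ! Intermediate sub-domain: one thread works on one of these at a time.
    type :: patch_t
       real(8), allocatable :: state(:,:,:,:)   ! (nvar, nx, ny, nz) zones
    end type patch_t

    ! Largest sub-domain: owned by a single MPI rank.
    type :: domain_t
       integer :: zones(3)                       ! total zones in this rank's sub-domain
       type(patch_t), allocatable :: patches(:)  ! intermediate patches within it
    end type domain_t
  end module decomposition_types_sketch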

Since we do not know a priori what size any of these levels should be without knowledge of the target HPC architecture, we must design Wombat to be flexible.  A proper optimizing compiler will know how best to generate vector instructions from the loop structures that define the smallest decomposition.  The largest and intermediate sub-domains, however, must be optimized on the target architecture through experimentation.
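
One way to provide that flexibility is to make the decomposition sizes runtime inputs rather than compile-time constants.  The sketch below reads them from a Fortran namelist; the file name, group name, and variable names are hypothetical and only illustrate the idea, not Wombat's actual parameter format.

  ! Hypothetical parameter input; names and file are invented for illustration.
  subroutine read_decomposition_sizes(domain_zones, patch_zones)
    implicit none
    integer, intent(out) :: domain_zones(3)  ! largest sub-domain owned by one MPI rank
    integer, intent(out) :: patch_zones(3)   ! intermediate patch worked on by one thread
    integer :: unit_no
    namelist /decomposition/ domain_zones, patch_zones

    ! Defaults a user can override per machine without recompiling.
    domain_zones = (/ 256, 256, 256 /)
    patch_zones  = (/ 32, 32, 32 /)

    open(newunit=unit_no, file='wombat.par', status='old', action='read')
    read(unit_no, nml=decomposition)
    close(unit_no)
  end subroutine read_decomposition_sizes

A sample input file would then contain a &decomposition group with those two entries, and tuning for a new machine becomes an exercise in editing that file rather than rebuilding the code.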

Side Constraints

There are two side constraints that should be highlighted as they directly impact decisions on the sizes of the largest and intermediate sub-domains.  The first is load imbalance and the second is efficient interaction of our parallel programming models, specifically MPI and OpenMP.

Load imbalance comes in two forms.  The first is algorithmically induced, for example by mesh refinement or by dark matter particles clustering in some regions of the domain.  The second is hardware induced, arising when nominally identical hardware does not perform uniformly.  The effect of either form is similar.

Mesh refinement creates new, smaller spatial elements that carry with them extra work for a process.  If processes are not uniformly handed new work, the overall parallel execution will be limited by the rate at which the most overloaded process can complete its work.  This can be resolved by re-balancing the work among processes so that it is again even.  A better way to handle it is to reduce the likelihood that it happens in the first place.  For cosmology simulations this can be achieved by noting that, at some scale, the simulation box is approximately uniform.  If we make our largest sub-domains as large as possible (still constrained by the hardware), it becomes much less likely that any one process is heavily loaded compared to the rest; roughly speaking, the larger the volume each process owns, the smaller the relative fluctuation in its share of the refined regions.

The problem with making our largest sub-domains as large as possible is that, traditionally, MPI+OpenMP programming models and library implementations have not been designed to work well together.  Increasing the size of the largest sub-domains also means making MPI ranks "fatter", with more OpenMP threads using the cores within a CPU.  All of the usual rules associated with Amdahl's Law apply, and for a large number of OpenMP threads to be efficient we need nearly all of the code in Wombat to be multi-threaded, including MPI communication.  When we started writing Wombat, all MPI libraries implemented thread safety with a global thread lock.  This forces MPI calls to serialize across OpenMP threads, which dramatically limits scaling to large numbers of threads.  This problem was a major focus of our initial research into Wombat's design and led to new features in Cray MPICH and Open MPI.  In a later blog post I will describe this problem and its resolution in more detail.
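
In the meantime, here is a minimal sketch (not Wombat source) of the kind of communication pattern that stresses MPI thread safety: every OpenMP thread makes its own MPI calls, so a library that protects itself with a single global lock serializes those calls even though the application code is fully threaded.  The buffer size and the simple ring exchange are invented for illustration.

  program threaded_mpi_sketch
    use mpi
    !$ use omp_lib
    implicit none
    integer, parameter :: n = 1024
    integer :: ierr, provided, rank, nranks, right, left, tid
    real(8) :: sendbuf(n), recvbuf(n)
    integer :: status(MPI_STATUS_SIZE)

    ! Request full thread support so each OpenMP thread may call MPI directly.
    call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
    if (provided < MPI_THREAD_MULTIPLE) call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)
    right = mod(rank + 1, nranks)
    left  = mod(rank - 1 + nranks, nranks)

    !$omp parallel private(tid, sendbuf, recvbuf, status, ierr)
    tid = 0
    !$ tid = omp_get_thread_num()
    sendbuf = real(rank, 8)
    ! Each thread exchanges its own data, tagged by thread id so the messages
    ! pair up thread-for-thread between neighboring ranks.
    call MPI_Sendrecv(sendbuf, n, MPI_REAL8, right, tid, &
                      recvbuf, n, MPI_REAL8, left,  tid, &
                      MPI_COMM_WORLD, status, ierr)
    !$omp end parallel

    call MPI_Finalize(ierr)
  end program threaded_mpi_sketch

With a global lock inside the library, the per-thread MPI_Sendrecv calls above execute one at a time; letting them proceed concurrently is the kind of capability the Cray MPICH and Open MPI work mentioned above was after.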