Tuesday 29 November 2016

Performance of DOLFIN on Intel Knights Landing

Recently, Intel released the next generation of its Xeon Phi processor, code-named Knights Landing (KNL). The hardware is currently shipping to developers who can pony up the cash...

There are several articles floating around with a lot of information and insight into KNL (here is one good example). Being a student at Boston University, I got access to XSEDE, and eventually an account at the Texas Advanced Computing Center (TACC) at the University of Texas at Austin. Coincidentally(!), they have Intel Xeon Phi 7250 compute nodes available for use on their cluster, Stampede.

Each KNL node available at Stampede has 68 cores and 4 hardware threads per core. Each node also has 112GB of memory, of which 96GB is DDR4 (the remaining 16GB is the on-package MCDRAM discussed below). I have run the weak-scaling Poisson solver with DOLFIN (source code available on Chris' BitBucket page). The result is shown below.

Though there is a certain amount of noise, the run time remains fairly constant, showing that the program scales well in parallel. The main challenge we are now facing is memory: with 112GB shared between 68 cores, each core gets a maximum of roughly 1.65GB. This restricts the number of degrees of freedom (DOFs) we can allocate per core before the 68 of them together exhaust the available memory. Consequently, the global problem size is also constrained, which currently makes it hard to solve very large problems.

In the graph above, as I got closer to the maximum number of cores per node, I had to reduce the number of DOFs per core (from the original 640,000 to roughly 300,000). This may well have affected the results, but since running these jobs took a significant amount of time, I decided not to throw any of them away...
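For reference, each data point came from a batch job roughly along these lines. This is only a sketch: the queue name, time limit and script name are placeholders rather than the exact setup (the real solver source is on Chris' BitBucket page, as mentioned above).

#!/bin/bash
#SBATCH -J poisson-weak      # job name
#SBATCH -N 1                 # one KNL node
#SBATCH -n 68                # one MPI rank per core
#SBATCH -p normal            # queue name will differ per system
#SBATCH -t 01:00:00

# ibrun is TACC's MPI launcher; on other systems this would be mpirun or srun.
ibrun python poisson_weak_scaling.py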

Casting that cheat aside, the obvious direction to look at is the Multi-Channel DRAM (MCDRAM). Knights Landing integrates this form of high-bandwidth memory, which can greatly enhance performance. The default configuration uses MCDRAM as a cache, but we can also expose the DDR4 and the MCDRAM as two distinct Non-Uniform Memory Access (NUMA) nodes. Basically, we can then specify what gets stored in MCDRAM (better performance, but limited to 16GB per node) while keeping the rest in DDR4 (worse performance, but 96GB per node). This is referred to as flat mode, and it could hold potential for some memory-allocation 'hacks'.
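As a rough illustration (not something I have tried on Stampede yet), in flat mode the MCDRAM appears as its own NUMA node, so numactl can steer allocations into it. The NUMA node number and the executable name below are assumptions; check numactl --hardware first.

numactl --hardware                       # list the NUMA nodes and their sizes
numactl --membind=1 ./poisson_solver     # bind all allocations to MCDRAM (often node 1 in flat/quadrant)
numactl --preferred=1 ./poisson_solver   # prefer MCDRAM, fall back to DDR4 once the 16GB is full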

Another thing to look at would be the different cluster modes. Each cluster mode can loosely be understood as a way of organising how a KNL core accesses the data it needs, and different cluster modes may improve the run time and the feasible problem size for our parallel algorithm. There is, however, a reason why I couldn't explore this aspect any further...

At the time of writing, I have managed to render several Stampede nodes non-responsive after submitting a batch of weak-scaling Poisson benchmark jobs. The sysadmin over there has decided to put my account on hold (today...) until he can figure out what was going on... I truly hope I have provided him with all the information needed to reduce my punishment...

Thursday 17 November 2016

Cray CC for FEniCS

Cray CC is C++11 compliant…

I never thought this was going to happen, but I just randomly checked the latest Cray C++ compiler, and bar a few minor details, it is now C++11 compliant! When you're using an HPC machine, it is always nice to be able to use the manufacturer's own compiler - I guess it gives you some kind of satisfaction, and a (probably misplaced) sense that any optimisations will be targeted to the hardware. Building FEniCS with Cray CC is now at least possible. How does it work?

Well, I just installed all the pure Python modules of FEniCS as usual (see blog posts passim), and then loaded the PrgEnv-cray environment. Now the fun begins. First of all, the Cray C++ compiler CC prints a lot of nonsense to the console, so we need to suppress that with -h msglevel_4, which is supposed to silence "warnings" and only print "errors". Something is a bit off here, as it still prints "Warning 11709" (quite a lot). OK, so add -h nomessage=11709 as well.
Everything is running inside CMake, so it’s just a question of adding a line like:

 -DCMAKE_CXX_FLAGS="-h std=c++11 -h msglevel_4 -h nomessage=11709"

Hmm… another problem… the FEniCS developers have used restrict as a function name, and it is a reserved word for Cray CC. So - I’ll have to rename it to something else. Anyway, if we do that, it all kind-of works. It is very slow though. It might take gcc 10 minutes to plough through the complete FEniCS dolfin build, but it takes Cray CC about an hour and a half… I’m sure it’s working hard on optimising… and at the end of it… you get: libdolfin.a and it is huge. Oh, wait, I wanted a shared library…

So now we can set export CRAYPE_LINK_TYPE=dynamic and try again, but linking will fail unless you also enable -h pic (position-independent code). And that causes another problem with Eigen3, which has some inline asm that fails to compile. OK - add options, add options…

-DCMAKE_CXX_FLAGS="-DEIGEN_NO_CPUID -h pic -h std=c++11 -h msglevel_4 -h nomessage=11709"
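Putting the pieces together, the reconfigure ends up looking roughly like this (a sketch: exporting CXX=CC so that CMake picks up the Cray compiler wrapper is my habit rather than a documented requirement):

export CRAYPE_LINK_TYPE=dynamic   # build shared libraries instead of a static libdolfin.a
export CXX=CC                     # point CMake at the Cray compiler wrapper
cmake -DCMAKE_CXX_FLAGS="-DEIGEN_NO_CPUID -h pic -h std=c++11 -h msglevel_4 -h nomessage=11709" ..
make -j 4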

I tried this on two systems, one with Cray CC 8.4, another with CC 8.5 - I guess there must have been some bug fixes in between, because the 8.4 build crashed a few times with segmentation faults (never great in a compiler).

CC-2116 crayc++: INTERNAL  
  "/opt/cray/cce/8.4.1/CC/x86-64/lib/ccfe" was terminated due to receipt of signal 013:  Segmentation fault.

Well, the CC 8.5 install completed. I still have to check whether it actually works… next time.

Written with StackEdit.

Monday 12 September 2016

Latest stable DOLFIN on Cray systems... HOWTO...

Compiling on Cray… again

I’m just doing a bit of an update on how to install DOLFIN on Cray systems, as the question seems to keep coming up. Maybe one day, we’ll do away with this by using containers everywhere [http://arxiv.org/abs/1608.07573] but I’m not holding my breath, well, at least not until 2017.

Let’s get started. First off, it’s probably best to switch to the gnu programming environment: module swap PrgEnv-intel PrgEnv-gnu (assuming the default is intel). With luck, your Cray will have a lot of nice modules already available, such as cmake, boost, swig, eigen3. If you are really lucky, they will also be up to date. If not, then you’ll either have to compile them yourself, or bang on to your sysadmin until they install them for you. Cray helpfully package PETSc, ParMETIS and SCOTCH as cray-petsc and cray-tpsl. Also, you will find cray-hdf5-parallel. Let’s load all these modules.

module load boost/1.59
module load python/2.7.9
module load eigen3/3.2.0
module load cmake/3.3.2
module load swig/3.0.5
module load git
module load cray-tpsl
module load cray-hdf5-parallel
module load cray-petsc/3.7.2.0

Maybe it’s enough? We also need to set a few environment variables, if they are not already set.

export SCOTCH_DIR=$CRAY_TPSL_PREFIX_DIR
export PARMETIS_DIR=$CRAY_TPSL_PREFIX_DIR
export EIGEN_DIR=$EIGEN3_DIR/include

Maybe I should also say that if you want JIT Python working on the compute nodes, then all these packages need to work on the compute nodes too. Some systems only install packages like cmake, or even the compilers, on the login nodes.
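A quick and dirty check is to grab a one-node interactive job and run something like the line below. Treat it as a sketch, since the launcher (srun here, aprun on ALPS-based systems) depends on your setup:

srun -n 1 bash -c 'which cc CC cmake swig && python -c "import ffc, ufl"'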

Now download all the FEniCS code. You can either get them as tar.gz files from bitbucket:

wget https://bitbucket.org/fenics-project/dolfin/downloads/dolfin-2016.1.0.tar.gz

or just use git to get the latest version:

git clone https://git@bitbucket.org/fenics-project/dolfin

Repeat for instant, fiat, ffc, dijitso and ufl.

The Python packages (instant, fiat, ffc, dijitso and ufl) can all be simply installed with: python setup.py install --prefix=/your/install/dir. You may have to create some directories and set $PYTHONPATH to get them to install; either way, you will need to set $PYTHONPATH and adjust $PATH before compiling DOLFIN (a sketch of this follows the check below). A good test is to see if ffc is working:

chris@Cray.loginNode> ffc
*** FFC: Missing file.
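For reference, the environment setup behind that check looks roughly like this; the install prefix is whatever you chose above, and the python2.7 part of the path matches the python module loaded earlier:

export PYTHONPATH=/your/install/dir/lib/python2.7/site-packages:$PYTHONPATH
export PATH=/your/install/dir/bin:$PATH
cd ffc
python setup.py install --prefix=/your/install/dir   # repeat for instant, fiat, dijitso and ufl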

OK. DOLFIN itself needs a working MPI C++ compiler, so it is a bit harder to get right. It also uses cmake to configure. Go into the DOLFIN folder and make a "build" directory: mkdir build ; cd build. Classic cmake. For Cray HPC, you are likely to want a few options though. Here is my cmake line:

cmake -DCMAKE_INSTALL_PREFIX=/your/install/dir \
      -DDOLFIN_SKIP_BUILD_TESTS=ON \
      -DDOLFIN_AUTO_DETECT_MPI=false \
      -DDOLFIN_ENABLE_UNIT_TESTS=false \
      -DDOLFIN_ENABLE_BENCHMARKS=false \
      -DDOLFIN_ENABLE_VTK=false \
      -DDOLFIN_ENABLE_OPENMP=false \
      -DCMAKE_BUILD_TYPE=Developer \
      -DDOLFIN_DEPRECATION_ERROR=true \
      ..
make -j 4
make install

Because it links with MPI, it is unlikely that any DOLFIN demos will work on the login nodes, so you’ll have to compile them and submit to a queue. Good luck!
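A skeleton job script for one of the compiled demos might look like the sketch below; the launcher (aprun here, srun on native SLURM systems), core count and demo name are all assumptions that depend on your machine:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --time=00:10:00

# aprun is the ALPS launcher found on many Cray systems.
aprun -n 48 ./demo_poisson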

Written with StackEdit.

Wednesday 16 March 2016

Using docker on HPC

NERSC have a new project called “shifter” which allows docker images to run on their HPC systems (edison.nersc.gov and cori.nersc.gov).

docker

  • What is docker?
    Docker is a virtualisation system, but unlike traditional VM systems, which emulate a complete machine, docker uses “containers” which are more lightweight, and share the kernel of the host OS. It’s probably better explained on the docker website, but basically, that’s it.
    Docker images can be layered, i.e. you build up new images on the basis of existing images. For example, a base image might be a plain Ubuntu installation, and then you might add a layer with some extra packages installed, then another layer with your application software installed.

  • Why do I want it?
    In a word: consistency. Suppose I have a complex package with loads of dependencies (e.g. FEniCS!) - it is much easier to define a docker image with everything specified exactly than to rely on installation instructions which will probably fail, depending on the particular weirdness of the machine you are trying to install on (a quick sketch of the usual pull-and-run workflow follows this list).
    That is a real advantage for HPC systems, which are notoriously diverse and difficult to build on.
    Only one problem: I can’t run docker on HPC. Not until now.
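To make that concrete, on an ordinary workstation the whole FEniCS stack comes down to a pull and a run, something like the sketch below (the image name is the FEniCS project's public image at the time of writing; check their container docs if it has moved). The question is how to get the same thing on an HPC machine.

docker pull quay.io/fenicsproject/stable                   # grab the prebuilt image
docker run -ti -v $(pwd):/home/fenics/shared \
           quay.io/fenicsproject/stable                    # share the current directory with the container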

shifter

So the guys at NERSC in Berkeley, CA have come up with a system to load docker images on HPC nodes. It is also a repo on github.

Cool.

Actually, it looks really complex, and I’m glad I didn’t have to figure it out - but what is it like to use?

First you have to pull your image from Docker Hub onto the HPC machine.

module load shifter
shifterimg -v pull docker:image_name

This will result in some output like:

Message: {
  "ENTRY": "MISSING", 
  "ENV": "MISSING", 
  "WORKDIR": "MISSING", 
  "groupAcl": [], 
  "id": "MISSING", 
  "itype": "docker", 
  "last_pull": "MISSING", 
  "status": "INIT", 
  "system": "cori", 
  "tag": [], 
  "userAcl": []
}

You can keep on running “shifterimg -v pull” and “status” will cycle through
“INIT”, “PULLING”, “CONVERSION” and finally “READY”. I’ve no idea where it stores the image, that is a mystery…
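If you get bored of re-running it by hand, a small polling loop does the job; this is only a sketch, and the grep pattern assumes the JSON output shown above:

while ! shifterimg -v pull docker:image_name 2>&1 | grep -q '"status": "READY"'; do
    sleep 30    # wait and try again until the conversion finishes
done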

OK, so now the image is “READY”, what next? Well, as on all HPC systems, you have to submit to the queue. I’m not going to repeat what is on the NERSC page, but they don’t show any examples of how to run with MPI. This is possible as follows in a SLURM script:

#!/bin/bash
#SBATCH --image=docker:repo:image:latest
#SBATCH --nodes=2
#SBATCH -p debug
#SBATCH -t 00:05:00
shifter --image=docker:repo:image:latest mpirun -n 32 python demo.py

… which runs a 32-core MPI demo nicely on 2 nodes.

Of course, one of the nice things about this is that because the docker image contains a complete local filesystem, there is no penalty for loading Python libraries in parallel (a problem seen on most HPC systems).