Tuesday, 29 November 2016

Performance of DOLFIN on Intel Knights Landing

Intel has recently released the next generation of its Xeon Phi processor, code-named Knights Landing (KNL). The hardware is currently shipping to developers who can pony up the cash...

There are several articles floating around with a lot of information and insights into KNL (here is one good example). Being a student at Boston University, I got access to XSEDE, and eventually an account with the Texas Advanced Computing Center (TACC) at the University of Texas at Austin. Coincidentally(!), they have Intel Xeon Phi 7250 compute nodes available on their cluster, Stampede.

Each KNL node on Stampede has 68 cores, with 4 hardware threads per core, and 112GB of memory, of which 96GB is DDR4 (the rest is the on-package MCDRAM discussed below). I ran the weak-scaling Poisson solver with DOLFIN (source code available on Chris' BitBucket page); the jobs were submitted roughly as sketched below, and the result follows.
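
For reference, Stampede uses SLURM with TACC's ibrun launcher, so each data point was a job along these lines; the queue name, rank count and script name are placeholders rather than my exact setup:

#!/bin/bash
#SBATCH -J poisson-weak        # job name
#SBATCH -N 1                   # one KNL node
#SBATCH -n 64                  # MPI ranks (up to 68 physical cores available)
#SBATCH -p normal              # queue name is an assumption; check the KNL queues
#SBATCH -t 01:00:00

# poisson_weak.py stands in for the weak-scaling Poisson script
ibrun python poisson_weak.py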

Though there is a certain amount of noise, the run time remains fairly constant, showing that the program scales well in parallel. The main challenge we are now facing is memory: with 112GB shared between 68 cores, each core gets at most roughly 1.65GB. This restricts the number of degrees of freedom (DOFs) we can allocate per core before the 68 cores together exhaust the available memory. Consequently, the global problem size is also constrained, which makes it hard to solve very large problems at the moment.

In the graph above, as I got closer to the maximum number of cores per node, I had to reduce the number of DOFs per core (from the original 640,000 down to roughly 300,000). This may well have affected the results, but since running these jobs took a significant amount of time, I decided not to throw any of them away...

Casting that cheat aside, the obvious direction to look at is the Multi-Channel DRAM (MCDRAM). Knights Landing integrates this form of high-bandwidth memory on package, and it can greatly enhance performance. The default configuration uses MCDRAM as a cache, but it can also be exposed as a separate Non-Uniform Memory Access (NUMA) node alongside the DDR4. In that configuration we can choose what gets stored in MCDRAM (better performance, but limited to 16GB per node) while keeping the rest in DDR4 (worse performance, but 96GB per node). This is referred to as flat mode, and it could hold potential for some memory-allocation 'hacks'; a rough sketch of how you would target the MCDRAM is shown below.
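
In flat mode the MCDRAM typically shows up as its own NUMA node (often node 1, but check first); this is just a sketch of the idea, not something I have benchmarked here, and my_solver is a placeholder:

# Inspect the NUMA layout; in flat mode one node should list roughly 16GB of MCDRAM
numactl -H

# Force all allocations into the MCDRAM node (assumed to be node 1 here);
# allocations fail once the 16GB is exhausted
numactl --membind=1 ./my_solver

# Or prefer MCDRAM but fall back to DDR4 when it fills up
numactl --preferred=1 ./my_solver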

Another thing to look at would be the different cluster modes. A cluster mode can loosely be understood as a policy for how a KNL core reaches the data it needs across the chip. Different cluster modes may improve the run time and the feasible problem size for our parallel algorithm. There is, however, a reason why I couldn't explore this any further...

At the time of writing, I have managed to make several Stampede nodes unresponsive after submitting a batch of weak-scaling Poisson jobs for benchmarking. The sysadmin over there has put my account on hold (today...) until he can figure out what was going on... I truly hope the information I have provided will reduce my punishment...

Thursday, 17 November 2016

Cray CC for FEniCS

Cray CC is C++11 compliant…

I never thought this was going to happen, but I just randomly checked the latest Cray C++ compiler, and bar a few minor details, it is now C++11 compliant! When you’re using an HPC machine, it is always nice to be able to use the manufacturer’s own compiler - I guess it gives you some kind of satisfaction, and a (probably misplaced) sense that any optimisations will be targeted to the hardware. Building FEniCS with Cray CC is at least now possible. How does it work?

Well, I just installed all the pure Python modules of FEniCS as usual (see blog posts passim), and then loaded the PrgEnv-cray environment. Now the fun begins. First of all, the Cray C++ compiler CC prints a lot of nonsense to the console, so we need to suppress that with -h msglevel_4, which is supposed to hide “warnings” and only print “errors”. Something is a bit wrong with this, as it still prints “Warning 11709” (quite a lot). OK, so add -h nomessage=11709.
Everything is running inside CMake, so it’s just a question of adding a line like:

 -DCMAKE_CXX_FLAGS="-h std=c++11 -h msglevel_4 -h nomessage=11709"

Hmm… another problem… the FEniCS developers have used restrict as a function name, and it is a reserved word for Cray CC. So - I’ll have to rename it to something else. Anyway, if we do that, it all kind of works. It is very slow though. It might take gcc 10 minutes to plough through the complete FEniCS dolfin build, but it takes Cray CC about an hour and a half… I’m sure it’s working hard on optimising… and at the end of it… you get libdolfin.a, and it is huge. Oh, wait, I wanted a shared library…

So now we can set export CRAYPE_LINK_TYPE=dynamic and try again, but it will fail at link time unless you enable -h pic (position independent code). And that in turn causes another problem with Eigen3, which has some inline asm that fails to compile. OK - add options, add options…

-DCMAKE_CXX_FLAGS="-DEIGEN_NO_CPUID -h pic -h std=c++11 -h msglevel_4 -h nomessage=11709"

I tried this on two systems, one with Cray CC 8.4, another with CC 8.5 - I guess there must be some bug fixes, because the 8.4 build crashed a few times with Segmentation Faults (never great in a compiler).

CC-2116 crayc++: INTERNAL  
  "/opt/cray/cce/8.4.1/CC/x86-64/lib/ccfe" was terminated due to receipt of signal 013:  Segmentation fault.

Well, the CC 8.5 install completed. I have still to check if it actually works… next time.

Written with StackEdit.

Monday, 12 September 2016

Latest stable DOLFIN on Cray system... HOWTO...

Compiling on Cray… again

I’m just doing a bit of an update on how to install DOLFIN on Cray systems, as the question seems to keep coming up. Maybe one day, we’ll do away with this by using containers everywhere [http://arxiv.org/abs/1608.07573] but I’m not holding my breath, well, at least not until 2017.

Let’s get started. First off, it’s probably best to switch to the gnu programming environment: module swap PrgEnv-intel PrgEnv-gnu (assuming the default is intel). With luck, your Cray will have a lot of nice modules already available, such as cmake, boost, swig, eigen3. If you are really lucky, they will also be up to date. If not, then you’ll either have to compile them yourself, or bang on to your sysadmin until they install them for you. Cray helpfully package PETSc, ParMETIS and SCOTCH as cray-petsc and cray-tpsl. Also, you will find cray-hdf5-parallel. Let’s load all these modules.

module load boost/1.59
module load python/2.7.9
module load eigen3/3.2.0
module load cmake/3.3.2
module load swig/3.0.5
module load git
module load cray-tpsl
module load cray-hdf5-parallel
module load cray-petsc/3.7.2.0

Maybe it’s enough? We also need to set a few environment variables, if they are not already set.

export SCOTCH_DIR=$CRAY_TPSL_PREFIX_DIR
export PARMETIS_DIR=$CRAY_TPSL_PREFIX_DIR
export EIGEN_DIR=$EIGEN3_DIR/include

Maybe I should also say that if you want JIT compilation from Python working on the compute nodes, then all these packages need to work on the compute nodes too. Some systems only install packages like cmake, or even the compilers, on the login nodes. A quick sanity check is sketched below.
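
One rough way to check (the exact launcher depends on your system; aprun is the Cray one, and this has to be run from inside a batch or interactive job):

# From a compute node, confirm the toolchain is actually visible there
aprun -n 1 which python cmake swig cc CC
aprun -n 1 cmake --version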

Now download all the FEniCS code. You can either get them as tar.gz files from bitbucket:

wget https://bitbucket.org/fenics-project/dolfin/downloads/dolfin-2016.1.0.tar.gz

or just use git to get the latest version:

git clone https://git@bitbucket.org/fenics-project/dolfin

Repeat for instant, fiat, ffc, dijitso and ufl.

The Python packages (instant, fiat, ffc, dijitso and ufl) can all be installed simply with python setup.py install --prefix=/your/install/dir. You may have to create some directories and set $PYTHONPATH to get them to install. Either way, you will need to set $PYTHONPATH and adjust $PATH before compiling DOLFIN (a sketch of the whole step follows the example below). A good test is to see whether ffc is working:

chris@Cray.loginNode> ffc
*** FFC: Missing file.
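
For completeness, the install-and-environment step looks roughly like this; the prefix is a placeholder, the Python version matches the module loaded above, and the package order is just one that happens to work:

export PREFIX=/your/install/dir
mkdir -p $PREFIX/lib/python2.7/site-packages
export PYTHONPATH=$PREFIX/lib/python2.7/site-packages:$PYTHONPATH
export PATH=$PREFIX/bin:$PATH

# install each pure-Python FEniCS package into the same prefix
for pkg in ufl instant dijitso fiat ffc; do
    cd $pkg && python setup.py install --prefix=$PREFIX && cd ..
done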

OK. DOLFIN itself needs a working MPI C++ compiler, so it is a bit harder to get right. It also uses cmake to configure. Go into the DOLFIN folder and make a “build” directory: mkdir build ; cd build. Classic cmake. For Cray HPC, you are likely to want a few options though. Here is my cmake line:

cmake -DCMAKE_INSTALL_PREFIX=/your/install/dir \
      -DDOLFIN_SKIP_BUILD_TESTS=ON \
      -DDOLFIN_AUTO_DETECT_MPI=false \
      -DDOLFIN_ENABLE_UNIT_TESTS=false \
      -DDOLFIN_ENABLE_BENCHMARKS=false \
      -DDOLFIN_ENABLE_VTK=false \
      -DDOLFIN_ENABLE_OPENMP=false \
      -DCMAKE_BUILD_TYPE=Developer \
      -DDOLFIN_DEPRECATION_ERROR=true \
      ..
make -j 4
make install

Because it links with MPI, it is unlikely that any DOLFIN demos will work on the login nodes, so you’ll have to compile them and submit to a queue (a rough sketch follows). Good luck!
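
For the C++ demos, the flow is roughly the following; the paths, executable name and aprun arguments are assumptions based on a typical dolfin 2016.1 source tree and a PBS-style Cray queue, so adjust to taste:

# build one of the demos against the freshly installed dolfin
source /your/install/dir/share/dolfin/dolfin.conf   # sets the paths cmake needs
cd dolfin/demo/documented/poisson/cpp
cmake . && make

# then, inside a batch script submitted to the queue, something like:
aprun -n 24 ./demo_poisson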

Written with StackEdit.

Wednesday, 16 March 2016

Using docker on HPC

NERSC have a new project called “shifter” which allows docker images to run on their HPC systems (edison.nersc.gov and cori.nersc.gov)

docker

  • What is docker?
    Docker is a virtualisation system, but unlike traditional VM systems, which emulate a complete machine, docker uses “containers” which are more lightweight, and share the kernel of the host OS. It’s probably better explained on the docker website, but basically, that’s it.
    Docker images can be layered, i.e. you build up new images on the basis of existing images. For example, a base image might be a plain Ubuntu installation, and then you might add a layer with some extra packages installed, then another layer with your application software installed.

  • Why do I want it?
    In a word: consistency. Suppose I have a complex package, with loads of dependencies (e.g. FEniCS!) - it is much easier to define a docker image with everything specified exactly, rather than some installation instructions which will probably fail, depending on the particular weirdness of the machine you are trying to install on.
    That is a real advantage for HPC systems, which are notoriously diverse and difficult to build on.
    Only one problem: I can’t run docker on HPC. Not until now.

shifter

So the guys at NERSC in Berkeley, CA have come up with a system to load docker images on HPC nodes. It is also available as a repo on GitHub.

Cool.

Actually, it looks really complex, and I’m glad I didn’t have to figure it out - but what is it like to use?

First you have to download your image from Docker Hub onto the HPC machine.

module load shifter
shifterimg -v pull docker:image_name

This will result in some output like:

Message: {
  "ENTRY": "MISSING", 
  "ENV": "MISSING", 
  "WORKDIR": "MISSING", 
  "groupAcl": [], 
  "id": "MISSING", 
  "itype": "docker", 
  "last_pull": "MISSING", 
  "status": "INIT", 
  "system": "cori", 
  "tag": [], 
  "userAcl": []
}

You can keep on running “shifterimg -v pull” and “status” will cycle through
“INIT”, “PULLING”, “CONVERSION” and finally “READY”. I’ve no idea where it stores the image, that is a mystery…

OK, so now the image is “READY”, what next? Well, as on all HPC systems, you have to submit to the queue. I’m not going to repeat what is on the NERSC page, but they don’t show any examples of how to run with MPI. This is possible as follows in a SLURM script:

#!/bin/bash
#SBATCH --image=docker:repo:image:latest
#SBATCH --nodes=2
#SBATCH -p debug
#SBATCH -t 00:05:00
shifter --image=docker:repo:image:latest mpirun -n 32 python demo.py

… which runs a 32-core MPI demo nicely on 2 nodes.

Of course, one of the nice things about this is that, because the docker image contains a complete local filesystem, there is none of the penalty for loading Python libraries in parallel that you see on most HPC systems.

Wednesday, 29 April 2015

Build FEniCS 1.5.0 on ARCHER

New build of DOLFIN-1.5.0 on Cray XC30

First steps

After a few people have asked me how to do it, I’m providing a complete run-down of building DOLFIN on Cray XC30. It’s not really that difficult.
The first point is to make sure you use gcc. It is probably possible to build with Intel, but the Cray C++ compiler is a non-starter here.

module swap PrgEnv-cray PrgEnv-gnu

Making modules for swig, cmake etc.

For a completely clean build, I’m going to download and install a few dependencies, which are often not found on HPC machines, or if they are, are out of date and installed in the wrong place.
Usually, I like to keep my source files in one folder, say src, and install into another folder, say packages, and then define modules in another folder, modules. Below, I will use ... to indicate some path or other, yours will be different.

For example, let’s install pcre (needed by swig):

cd src
wget ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-8.37.tar.bz2
tar xf pcre-8.37.tar.bz2
cd pcre-8.37
./configure --prefix=/work/.../packages/pcre-8.37
make
make install

I’ll also make a module file in /work/.../modules/pcre/8.37 like this:

#%Module -*- tcl -*-
##
## modulefile
##
proc ModulesHelp { } {

  puts stderr "\tAdds pcre 8.37 to your environment.\n"
}

module-whatis "adds pcre 8.37 to your environment"

set               root                 /work/.../packages/pcre-8.37
prepend-path      PATH                 $root/bin
prepend-path      CPATH                $root/include
prepend-path      LIBRARY_PATH         $root/lib
prepend-path      LD_LIBRARY_PATH      $root/lib
prepend-path      MANPATH              $root/share/man

After that, I can just do

module use /work/.../modules
module load pcre/8.37

and pcre will be on my PATH.

I am just going to repeat that process for swig, cmake, boost and eigen, which are all essential for building and running DOLFIN. Mostly these are easy to build and install, using cmake or configure, but I’ll just pause briefly on boost, as it can be a bit more painful.

Building boost for DOLFIN

Usually, ./bootstrap.sh works fine, and creates project-config.jam for gcc, correctly.
However, boost will take forever to compile if we use this, so it is a good idea to limit the number of libraries. I usually edit project-config.jam until it looks like this:

# Boost.Build Configuration
# Automatically generated by bootstrap.sh

import option ;
import feature ;

# Compiler configuration. This definition will be used unless
# you already have defined some toolsets in your user-config.jam
# file.
if ! gcc in [ feature.values <toolset> ]
{
    using gcc : 4.9 : : <compileflags>-std=c++11 ; 
}

project : default-build <toolset>gcc ;

# List of --with-<library> and --without-<library>
# options. If left empty, all libraries will be built.
# Options specified on the command line completely
# override this variable.
libraries = --with-filesystem --with-program_options --with-timer --with-chrono --with-system --with-thread --with-iostreams --with-serialization ;

# These settings are equivivalent to corresponding command-line
# options.
option.set prefix : /work/.../packages/boost-1.55.0 ;
option.set exec-prefix : /work/.../packages/boost-1.55.0 ;
option.set libdir : /work/.../packages/boost-1.55.0/lib ;
option.set includedir : /work/.../packages/boost-1.55.0/include ;

# Stop on first error
option.set keep-going : false ;

Now you can do ./b2 and ./b2 install and it should only take a few minutes(!)

Python packages

So much for preliminaries. Now let’s install the Python packages. I tend to just lump these together and install them in one directory, let’s say fenics-1.5.0. The example below uses ufl; repeat for ffc, fiat and instant. Other dependencies, such as sympy and plex, can be installed in the same way.

cd ufl-1.5.0
python setup.py install --prefix=/work/.../packages/fenics-1.5.0

etc. etc. etc. Then create a module file to set PATH, PYTHONPATH and so on for them (or just export the variables directly, as sketched below).
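
If you don’t want to write a module file straight away, the bare environment variables amount to something like this (the prefix and Python version are whatever you used above):

export FENICS_PREFIX=/work/.../packages/fenics-1.5.0
export PATH=$FENICS_PREFIX/bin:$PATH
export PYTHONPATH=$FENICS_PREFIX/lib/python2.7/site-packages:$PYTHONPATH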

Ready to build DOLFIN

> module avail 

--------------------------- /work/.../modules/ ---------------------------
boost/1.57.0 cmake/3.2.2  eigen/3.2.4  fenics/1.5.0 pcre/8.37    swig/3.0.5

Now I will load all these modules, and try to build DOLFIN.

cd src
tar xf dolfin-1.5.0.tar.bz2
cd dolfin-1.5.0
mkdir build
cd build
cmake ..
make

That works! However, DOLFIN can be built with various optional packages.

Some are more optional than others. On an HPC system, we need good scalable solvers, which are provided by PETSc; without PETSc, dolfin on HPC doesn’t make much sense. PETSc is available as a system package on Cray. ParMETIS and SCOTCH are also really useful, as is HDF5. They are all available from Cray:

module load cray-petsc/3.5.3.0
module load cray-tpsl/1.4.4
export SCOTCH_DIR=$CRAY_TPSL_PREFIX_DIR
export PARMETIS_DIR=$CRAY_TPSL_PREFIX_DIR
module load cray-hdf5-parallel/1.8.13

The dolfin cmake configuration will try to test these libraries, but the tests will fail on the login nodes (because of MPI linking). So it is necessary to add some extra flags to cmake:

cmake -DDOLFIN_SKIP_BUILD_TESTS=true \
      -DDOLFIN_AUTO_DETECT_MPI=false \
      -DCMAKE_INSTALL_PREFIX=/work/.../packages/dolfin-1.5.0 \
      ..

make

make install

I also make a module for dolfin, which I can load with the module command.
Now, let’s try it:

xxxxx@eslogin006:~> python
Python 2.7.6 (default, Mar 10 2014, 14:13:45) 
[GCC 4.8.1 20130531 (Cray Inc.)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from dolfin import *
>>> mesh = UnitSquareMesh(2,2)
[Wed Apr 29 14:05:32 2015] [unknown] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(547): 
MPID_Init(203).......: channel initialization failed
MPID_Init(579).......:  PMI2 init failed: 1 
Aborted

Well, this is just normal behaviour on the login nodes. Any attempt to use MPI will cause a crash.
A typical batch script might look like this:

#PBS -l select=1
#PBS -N example.py
#PBS -l walltime=0:5:0

# Switch to current working directory
cd $PBS_O_WORKDIR

module use /work/.../modules
module load fenics/1.5.0
module load dolfin/1.5.0

cd /work/.../example

# Run the parallel program
aprun -n 12 -N 12 -S 6 -d 1 python example.py

and that does work!

Written with StackEdit.

Wednesday, 15 October 2014

Why does python take so long to load?

Python with HPC/FEniCS

[Figure: run time of a simple Poisson solver at increasing core counts, split into total time and compute time]

Here are some plots for a simple Poisson solver, using Python. The red bar is the total run time, and the right-hand bar is the actual computation. So, there is a problem - by the time the core count goes above 1000, we are spending more time just executing “from dolfin import *” than actually computing.
Let’s look at the module imports:

python -v -c "from dolfin import *" 2>&1 | grep -e "\.py" | grep instant | wc -l

etc.
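
Doing the same for each package - the grep patterns are crude, so treat the counts as approximate:

# capture the verbose import log once, then count .py files per package
python -v -c "from dolfin import *" 2> imports.log
for mod in six instant FIAT dolfin numpy ffc ufl sympy; do
    printf "%s: " $mod
    grep -e "\.py" imports.log | grep -ci "$mod"
done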

Here is the breakdown in number of .py files per module:

six        5
instant   20
FIAT      46
dolfin    70
numpy    146
ffc      154
python   166
ufl      202
sympy    319
TOTAL   1128

Now, 1128 files might be OK to load on one machine, but multiplied by 1152 cores, that makes about 1.3M file accesses. From what I understand, Lustre (used on many HPC systems) is not really optimised for this kind of access. Nor is NFS.

Fortunately, there is a partial remedy available. Python natively supports imports from a .zip file, so we can just zip up all our files (I can certainly do that for all the FEniCS packages, though maybe not for python itself) and do something like:
export PYTHONPATH=~/fenics.zip:$PYTHONPATH
then each core will only read a single file (the .zip) for instant, FIAT, dolfin, ufl and ffc, decompress it, and load the modules from memory.
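
Building the zip is along these lines (the path is a placeholder, and only the pure-Python parts can go in, since compiled .so extension modules have to stay on disk - see the caveats below):

# collect the pure-Python FEniCS packages into a single archive
cd /work/.../fenics/lib/python2.7/site-packages
zip -r ~/fenics.zip ufl ffc FIAT instant
export PYTHONPATH=~/fenics.zip:$PYTHONPATH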

Testing out this idea on ARCHER showed that it does work. I compressed FEniCS and sympy (the two worst offenders), and it reduced the load time on 768 cores from 71s to 39s.

Some caveats: you can’t zip up shared object .so files. Python uses dlopen() to access these, so they have to be separate files. Also, though we can bring down the load time, the fundamental scaling is bad, so it will probably be necessary, at some point, to simply reduce the number of machines doing the reading and distribute the file contents using MPI. It would be nice to be able to use a FUSE virtual filesystem, or maybe mount an HDF5 file as a filesystem and use MPI-IO to access it…

Any comments, most welcome!

Written with StackEdit.

Wednesday, 9 July 2014

Memory Profiling

Memory Profiling with python

One of the biggest challenges in scaling up Finite Element calculations to HPC scale is memory management. At present, the UK national supercomputer ARCHER has 64GB for 24 cores on each node, which works out as 2.66GB per core.

This tends to occur:

[NID 01724] 2014-07-09 11:43:44 Apid 9128228: initiated application termination
[NID 01724] 2014-07-09 11:43:46 Apid 9128228: OOM killer terminated this process.
Application 9128228 exit signals: Killed

quite often, and it can be difficult to find out why.

One useful tool is memory_profiler. You decorate the functions of interest with @profile, and it dumps a line-by-line account of memory usage when execution finishes (that is, if you are not killed by the OOM killer first). OK, so it doesn’t tell you what is going on inside a solve (C code), but it can give you a rough idea of how the memory usage stacks up.
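
A minimal way to drive it, assuming the script already has @profile on the function of interest (the script name is a placeholder):

# install it into your own prefix if it is not already on the system
pip install --user memory_profiler

# run the script through the profiler; the line-by-line report is printed
# when the decorated function returns
python -m memory_profiler stokes_demo.py

Part of the resulting report: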

    16                                 # Load mesh and subdomains
    17   50.074 MiB    9.906 MiB       mesh = Mesh("meshes/sphere-box-50.xdmf")
    18                             #    mesh = refine(mesh)
    19                             
    20                                 # Define function spaces
    21   50.621 MiB    0.547 MiB       V = VectorFunctionSpace(mesh, "CG", 2)
    22   51.016 MiB    0.395 MiB       Q = FunctionSpace(mesh, "CG", 1)
    23   60.406 MiB    9.137 MiB       W = V * Q
    24                             
    25                             
    26   60.512 MiB    0.105 MiB       vzero = Constant((0, 0, 0))
    27   60.594 MiB    0.082 MiB       bc0 = DirichletBC(W.sub(0), vzero, boundary0)
    28                                 
    29                                 # Inflow boundary condition for velocity
    30   60.688 MiB    0.094 MiB       inflow = Expression(("(1.0 + cos(x[1]*pi))*(1.0 + cos(x[2]*pi))", "0.0", "0.0"))
    31   60.688 MiB    0.000 MiB       bc1 = DirichletBC(W.sub(0), inflow, boundary1)
    32                             
    33                                 # Boundary condition for pressure at outflow
    34   60.688 MiB    0.000 MiB       zero = Constant(0)
    35   60.688 MiB    0.000 MiB       bc2 = DirichletBC(W.sub(1), zero, boundary2)
    36                                 
    37                                 # Collect boundary conditions
    38   60.688 MiB    0.000 MiB       bcs = [bc0, bc1, bc2]
    39                             
    40                                 # Define variational problem
    41   60.688 MiB    0.000 MiB       (u, p) = TrialFunctions(W)
    42   60.688 MiB    0.000 MiB       (v, q) = TestFunctions(W)
    43   60.688 MiB    0.000 MiB       f = Constant((0, 0, 0))
    44   60.688 MiB    0.000 MiB       a = (inner(grad(u), grad(v)) - div(v)*p + q*div(u))*dx
    45   60.688 MiB    0.000 MiB       L = inner(f, v)*dx
    46                                 
    47                                 # Compute solution
    48   60.988 MiB    0.301 MiB       w = Function(W)
    49  100.309 MiB   39.125 MiB       solve(a == L, w, bcs)
    50                             
    51                                 # # Split the mixed solution using a shallow copy
    52  100.320 MiB    0.012 MiB       (u, p) = w.split()
    53                             
    54                                 # Save solution in VTK format
    55  100.336 MiB    0.016 MiB       ufile_pvd = File("velocity.xdmf")
    56  100.828 MiB    0.492 MiB       ufile_pvd << u
    57  100.828 MiB    0.000 MiB       pfile_pvd = File("pressure.xdmf")
    58  100.961 MiB    0.133 MiB       pfile_pvd << p
    59                             
    60                                 # Split the mixed solution using deepcopy
    61                                 # (needed for further computation on coefficient vector)
    62  101.293 MiB    0.332 MiB       (u, p) = w.split(True)
    63  101.328 MiB    0.035 MiB       unorm = u.vector().norm("l2")
    64  101.328 MiB    0.000 MiB       pnorm = p.vector().norm("l2")
    65                             
    66  101.328 MiB    0.000 MiB       if (MPI.rank(mesh.mpi_comm()) == 0):
    67  103.934 MiB    0.000 MiB           print "Norm of velocity coefficient vector: %.15g" % unorm
    68  103.934 MiB    0.000 MiB           print "Norm of pressure coefficient vector: %.15g" % pnorm
    69  103.961 MiB    0.027 MiB           list_timings()

This is just one core out of the 24 used in this job. What you can see is the memory usage in the second column, and the change per line in the third column. At approximately 100MB per core, this represents only 0.1GB × 24 / 64GB ≈ 3.75% of the node’s memory. What is not displayed is the transient memory used inside the solver. OOM conditions always happen in the solver, so it is critical to use the right backend to get the best performance. With an LU solver (as above) you run out of memory at less than 1M cells.

Written with StackEdit.