Tuesday, 29 November 2016

Performance of DOLFIN on Intel Knights Landing

Intel recently released the next generation of its Xeon Phi processor, code-named Knights Landing (KNL). The hardware is currently shipping to developers who can pony up the cash...

There are several articles floating around with a lot of information and insights into KNL (here is one good example). Being a student at Boston University, I got access to XSEDE, and eventually an account with the Texas Advanced Computing Center (TACC) at the University of Texas at Austin. Coincidentally(!), they have Intel Xeon Phi 7250 compute nodes available for use on their cluster, Stampede.

Each KNL node at Stampede has 68 cores and 4 hardware threads per core. The node includes 112GB of memory, of which 96GB is DDR4 and the remaining 16GB is MCDRAM. I have run the weak-scaling Poisson solver with DOLFIN (source code available on Chris' BitBucket page). The result is shown below.

Though there is a certain amount of noise, the run time remains fairly constant, showing that the program scales well in parallel. The main challenge we are now facing is memory constraints - with 112GB shared among 68 cores, each core has a maximum of roughly 1.65GB (rounded up). This restricts the number of degrees of freedom (DOF) we can allocate per core before the 68 of them exhaust the available memory. Consequently, the global problem size is also constrained, which currently makes it hard to solve very large problems.

In the graph above, as I got closer to the maximum number of cores per node, I had to reduce the number of DOF per core (from 640,000 originally to roughly 300,000). This may well have affected the results. However, since running these jobs took a significant amount of time, I decided not to throw any of the results away...
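The memory budget and problem sizes quoted above can be checked with a few lines of arithmetic. This is only a rough sketch in Python: it assumes the node's memory is split evenly between cores, which ignores the operating system and any per-process overhead, and the function names are made up for illustration.

```python
# Back-of-the-envelope memory and problem-size budget for one KNL node.
# The figures are the ones quoted in the post; the even split between
# cores is a simplifying assumption.

NODE_MEMORY_GB = 112   # 96 GB DDR4 + 16 GB MCDRAM
CORES_PER_NODE = 68


def memory_per_core_gb(node_memory_gb=NODE_MEMORY_GB, cores=CORES_PER_NODE):
    """Memory available to each core if the node is shared evenly."""
    return node_memory_gb / cores


def global_dofs(dofs_per_core, cores=CORES_PER_NODE):
    """Global problem size in a weak-scaling run (fixed work per core)."""
    return dofs_per_core * cores


if __name__ == "__main__":
    print(f"memory per core: {memory_per_core_gb():.2f} GB")
    for dofs in (640_000, 300_000):
        print(f"{dofs} DOF/core -> {global_dofs(dofs)} DOF on a full node")
```

On a full node, dropping from 640,000 to 300,000 DOF per core more than halves the global problem, which is why the last few points in the graph are not strictly comparable with the rest.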

Setting that cheat aside, one direction worth exploring is the available Multi-Channel DRAM (MCDRAM). Knights Landing integrates this form of high-bandwidth memory, which can greatly enhance performance. The default configuration uses MCDRAM as cache, but MCDRAM and DDR4 can also be exposed as two distinct Non-Uniform Memory Access (NUMA) nodes. Basically, we can specify what is stored in MCDRAM (better performance, but limited to 16GB per node) while keeping everything else in DDR4 (worse performance, but each node has 96GB). This mode is referred to as Flat mode, and it could hold potential for some memory allocation 'hack'.
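In Flat mode, the usual way to steer a whole process towards MCDRAM is the standard Linux `numactl` tool. The sketch below only builds the command line, in Python. It assumes that DDR4 appears as NUMA node 0 and MCDRAM as NUMA node 1, which should be confirmed with `numactl --hardware` on the node itself, and the executable name is hypothetical.

```python
import shlex

# Assumption: in Flat mode DDR4 shows up as NUMA node 0 and MCDRAM as
# NUMA node 1. Other cluster modes can expose more nodes, so check with
# `numactl --hardware` before relying on this number.
MCDRAM_NODE = 1


def numactl_command(app_args, strict=False, node=MCDRAM_NODE):
    """Prefix a command so that its allocations are steered to MCDRAM.

    strict=False uses --preferred: allocate from MCDRAM first and fall
    back to DDR4 once it is full.
    strict=True uses --membind: allocations may not spill over to DDR4.
    """
    policy = "--membind" if strict else "--preferred"
    return ["numactl", f"{policy}={node}", *app_args]


if __name__ == "__main__":
    # "./demo_poisson" is a hypothetical executable name.
    print(shlex.join(numactl_command(["./demo_poisson"])))
```

With only 16GB of MCDRAM per node, the `--preferred` policy is probably the safer starting point for a benchmark that already sits close to the memory limit.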

Another thing to look at would be the different cluster modes. Each cluster mode can simply be understood as a way for KNL's cores to quickly access the data they need. Again, different cluster modes may improve the running time and the available problem size for our parallel algorithm. And there is a reason why I couldn't explore this aspect any further...

At the time of writing, I have successfully caused several nodes at Stampede to become unresponsive after submitting several weak-scaling Poisson benchmark jobs. The sysadmin over there has decided to put my account on hold (today...) until he can figure out what was going on... I truly hope I have given him all the information he needs, and that it will reduce my punishment...

Thursday, 17 November 2016

Cray CC for FEniCS

Cray CC is C++11 compliant…

I never thought this was going to happen, but I just randomly checked the latest Cray C++ compiler, and bar a few minor details, it is now C++11 compliant! When you’re using an HPC machine, it is always nice to be able to use the manufacturer’s own compiler - I guess it gives you some kind of satisfaction, and a (probably misplaced) sense that any optimisations will be targeted to the hardware. Building FEniCS with Cray CC is at least now possible. How does it work?

Well, I just installed all the pure python modules of FEniCS as usual (see blog posts passim), and then loaded the PrgEnv-cray environment. Now, the fun begins. First of all, the Cray C++ compiler CC prints a lot of nonsense to the console, so we need to suppress that with -h msglevel_4, which is supposed to stop “warnings” and only print “errors”. Something is a bit wrong with this, as it still prints “Warning 11709” (quite a lot). OK, so -h nomessage=11709.
Everything is running inside CMake, so it’s just a question of adding a line like:

 -DCMAKE_CXX_FLAGS="-h std=c++11 -h msglevel_4 -h nomessage=11709"

Hmm… another problem… the FEniCS developers have used restrict as a function name, and it is a reserved word for Cray CC. So - I’ll have to rename it to something else. Anyway, if we do that, it all kind-of works. It is very slow though. It might take gcc 10 minutes to plough through the complete FEniCS dolfin build, but it takes Cray CC about an hour and a half… I’m sure it’s working hard on optimising… and at the end of it… you get: libdolfin.a and it is huge. Oh, wait, I wanted a shared library…

So now we can set export CRAYPE_LINK_TYPE=dynamic and try again, but it will fail linking unless you enable -h PIC (position independent code). And that now causes another problem with Eigen3 which has some inline asm which fails. OK - add options, add options…

-DCMAKE_CXX_FLAGS="-DEIGEN_NO_CPUID -h pic -h std=c++11 -h msglevel_4 -h nomessage=11709"
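With this many options it can help to keep them in one place. The following is just a sketch in Python that assembles the flags and the environment variable quoted above; the helper names are made up, and the comment on `-DEIGEN_NO_CPUID` reflects my reading of what that define is for.

```python
import os

# Flags quoted in the post, in the same order.
CRAY_CXX_FLAGS = [
    "-DEIGEN_NO_CPUID",    # assumed to skip Eigen's cpuid inline asm
    "-h pic",              # position independent code, for the shared library
    "-h std=c++11",
    "-h msglevel_4",       # errors only
    "-h nomessage=11709",  # silence the warning that still gets through
]


def cmake_command(source_dir):
    """CMake argument list carrying the Cray CC flags."""
    return ["cmake", "-DCMAKE_CXX_FLAGS=" + " ".join(CRAY_CXX_FLAGS), source_dir]


def build_environment(base_env=None):
    """Copy of the environment that asks the Cray wrappers for dynamic linking."""
    env = dict(os.environ if base_env is None else base_env)
    env["CRAYPE_LINK_TYPE"] = "dynamic"
    return env
```

Something like `subprocess.run(cmake_command("path/to/dolfin"), env=build_environment(), check=True)` would then run the configure step, where `path/to/dolfin` is a placeholder for wherever the source lives. Passing the arguments as a list avoids the shell quoting around the flags.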

I tried this on two systems, one with Cray CC 8.4, another with CC 8.5 - I guess there must be some bug fixes, because the 8.4 build crashed a few times with Segmentation Faults (never great in a compiler).

CC-2116 crayc++: INTERNAL  
  "/opt/cray/cce/8.4.1/CC/x86-64/lib/ccfe" was terminated due to receipt of signal 013:  Segmentation fault.

Well, the CC 8.5 install completed. I still have to check if it actually works… next time.
