Tuesday 29 November 2016

Performance of DOLFIN on Intel Knights Landing

Recently, Intel released the next generation of Xeon Phi processors, code-named Knights Landing (KNL). The hardware is currently shipping to developers who can pony up the cash...

There are several articles floating around with a lot of information and insights into KNL (here is one good example). Being a student at Boston University, I got access to XSEDE, and eventually an account with the Texas Advanced Computing Center at the University of Texas at Austin. Coincidentally(!), they have Intel Xeon Phi 7250 compute nodes available for use on their cluster, Stampede.

Each KNL node available at Stampede has 68 cores, each with 4 hardware threads. A node has 112GB of memory in total, of which 96GB is DDR4 (the remaining 16GB is the MCDRAM discussed below), plus a local solid state drive (SSD). I have run the weak-scaling Poisson solver with DOLFIN (source code available on Chris' BitBucket page). The result is shown below.
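For reference, the solve itself is essentially the standard Poisson problem. Below is a minimal sketch of that kind of solve in the DOLFIN Python API; the mesh size, element degree, right-hand side and solver options here are illustrative assumptions, not the exact settings of the benchmark on Chris' BitBucket page.

```python
# Minimal Poisson solve sketch with the legacy DOLFIN Python API.
from dolfin import (UnitCubeMesh, FunctionSpace, TrialFunction, TestFunction,
                    DirichletBC, Constant, Function, Expression,
                    inner, grad, dx, solve)

mesh = UnitCubeMesh(64, 64, 64)             # refine to control DOFs per rank
V = FunctionSpace(mesh, "Lagrange", 1)

u, v = TrialFunction(V), TestFunction(V)
f = Expression("10*exp(-(pow(x[0]-0.5, 2) + pow(x[1]-0.5, 2)))", degree=2)
a = inner(grad(u), grad(v)) * dx
L = f * v * dx

bc = DirichletBC(V, Constant(0.0), "on_boundary")

u_h = Function(V)
# Launched under MPI (e.g. `ibrun python poisson.py`), PETSc distributes the
# CG + algebraic multigrid solve across the ranks.
solve(a == L, u_h, bc,
      solver_parameters={"linear_solver": "cg", "preconditioner": "hypre_amg"})
```

For a weak-scaling run, the mesh is refined as ranks are added so that the DOF count per MPI rank stays roughly constant; if the solver scales well, the run time should then stay flat.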

Though there is a certain amount of noise, the run time remains fairly constant, showing that the program scales well in parallel. The main challenge we now face is memory: with 112GB shared by 68 cores, each core gets a maximum of roughly 1.65GB. This restricts the number of degrees of freedom (DOFs) we can allocate per core before the 68 cores together exhaust the available memory. Consequently, the global problem size is also constrained, which currently makes it hard to solve very large problems.
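To put that bound in numbers, here is a back-of-the-envelope sketch; the ~5 kB-per-DOF figure is purely a hypothetical assumption to illustrate the calculation, not a measured value from these runs.

```python
# Rough memory budget per core on a Stampede KNL node.
mem_per_node_gb = 112      # 96 GB DDR4 + 16 GB MCDRAM
cores_per_node = 68

mem_per_core_gb = mem_per_node_gb / cores_per_node
print(f"memory per core: {mem_per_core_gb:.2f} GB")        # ~1.65 GB

# Hypothetical footprint per DOF (matrix rows, vectors, mesh, solver overhead).
assumed_bytes_per_dof = 5_000
max_dofs_per_core = mem_per_core_gb * 1e9 / assumed_bytes_per_dof
print(f"rough DOF ceiling per core: {max_dofs_per_core:,.0f}")
```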

In the graph above, as I approached the maximum number of cores per node, I had to reduce the number of DOFs per core (from the original 640,000 to roughly 300,000). As expected, this may have affected the results. However, since running these jobs took a significant amount of time, I decided not to throw away any of the results...

Casting that cheat aside, a promising direction is to make use of the available Multi-Channel DRAM (MCDRAM). Knights Landing integrates this form of high-bandwidth memory, which greatly enhances performance. The default configuration uses MCDRAM as cache, but we can also expose MCDRAM as a separate Non-Uniform Memory Access (NUMA) node, distinct from DDR4. In that case we can specify which allocations go into MCDRAM (better performance, but limited to 16GB per node) while keeping the rest in DDR4 (lower bandwidth, but 96GB per node). This configuration is referred to as flat mode, and it could hold potential for some 'hack' of memory allocation.
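One possible way to try that hack, sketched below under the assumption that the node is booted in flat mode and MCDRAM shows up as NUMA node 1 (the usual layout), is to launch each MPI rank under numactl. The solver script name is hypothetical.

```python
# Minimal launcher sketch for flat mode (assumption: MCDRAM is NUMA node 1).
import subprocess

subprocess.run(
    ["ibrun",                               # TACC's MPI launcher
     "numactl", "--preferred=1",            # prefer MCDRAM, spill to DDR4 when full
     "python", "poisson_weak_scaling.py"],  # hypothetical benchmark script
    check=True,
)
```

With --preferred, allocations go to MCDRAM until the 16GB fills up and then spill over to DDR4; --membind=1 would instead fail once MCDRAM is exhausted, which makes --preferred the safer choice for problems larger than 16GB.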

Another thing to look at would be the different cluster modes. A cluster mode can loosely be understood as the way a KNL core finds and fetches the data it needs across the chip. Different cluster modes may improve the run time and the feasible problem size for our parallel algorithm. There is, however, a reason why I couldn't explore this aspect any further...

At the time of writing, I have managed to leave several nodes at Stampede unresponsive after submitting a batch of weak-scaling Poisson jobs for benchmarking. The sysadmin over there decided to put my account on hold (today...) until he can figure out what was going on... I truly hope I have provided him with all the information needed to reduce my punishment...

1 comment:

  1. I know this is an old post, but I thought I'd ask - do you have any hints for using FEniCS successfully on Stampede2? We run it on the local cluster using a Singularity image built from the Docker image; on Stampede2, it fails when compiling the problem because PETSc is not available even though I've loaded PETSc modules; it looks like it's trying to use /bin/cc (which turns out to be gcc 4.8.5) for the root compiler too, which is pretty much guaranteed not to be a good compiler for Xeon Phi nodes...
