Wednesday 15 October 2014

Why does python take so long to load?

Python with HPC/FEniCS

[Plot: run time vs. core count for a simple Poisson solver]

Here are some plots for a simple Poisson solver written in Python. The red bar is the total run time, and the right-hand bar is the actual computation. So there is a problem - by the time the core count goes above 1000, we are spending more time just loading Python (“from dolfin import *”) than actually computing.
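For reference, the import overhead itself is easy to time per rank; here is a rough sketch using mpi4py (only for illustration - the plots above were produced with a different harness):

# Sketch: time "from dolfin import *" on each rank and report the slowest.
import time
from mpi4py import MPI

t0 = time.time()
from dolfin import *
t1 = time.time()

times = MPI.COMM_WORLD.gather(t1 - t0, root=0)
if MPI.COMM_WORLD.rank == 0:
    print "slowest import took %.1f s" % max(times)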
Let’s look at the module imports:

python -v -c "from dolfin import *" 2>&1 | grep -e "\.py" | grep instant | wc -l

etc.
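Much the same breakdown can be pulled out from inside Python by inspecting sys.modules after the import - a rough sketch (it counts modules actually loaded rather than files touched on disk, so the numbers differ slightly from the grep above):

# Sketch: count imported Python modules per top-level package.
import collections
import sys

from dolfin import *

counts = collections.Counter()
for name, mod in sys.modules.items():
    path = getattr(mod, "__file__", "") or ""
    if path.endswith((".py", ".pyc")):
        counts[name.split(".")[0]] += 1

for pkg, n in sorted(counts.items(), key=lambda item: item[1]):
    print pkg, n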

Here is the breakdown in number of .py files per module:

six           5
instant      20
FIAT         46
dolfin       70
numpy       146
ffc         154
python      166
ufl         202
sympy       319
TOTAL      1128

Now, 1128 files might be OK to load on one machine, but when multiplied across 1152 cores, that makes about 1.3M file accesses. From what I can understand, Lustre (used on many HPC systems) is not really optimised for this kind of access. Nor is NFS.

Fortunately, there is a partial remedy available. Python natively supports importing from a .zip file, so if we just zip up all our modules (I can certainly do that for all the FEniCS packages, though maybe not for python itself) and do something like:
export PYTHONPATH=~/fenics.zip:$PYTHONPATH
then each core only reads a single file for instant, FIAT, dolfin, ufl and ffc (the .zip file), decompresses it, and loads the modules from memory.
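Building the archive is simple enough with the standard zipfile module; a minimal sketch, assuming the FEniCS packages all sit under a single site-packages prefix (the path below is just an example):

# Sketch: bundle the pure-Python parts of FEniCS into one zip for zipimport.
# Compiled .so extension modules are deliberately skipped - they cannot be
# imported from a zip (see the caveats below).
import os
import zipfile

prefix = "/work/e319/shared/packages/fenics-1.4.0/lib/python2.7/site-packages"
packages = ["instant", "FIAT", "dolfin", "ufl", "ffc"]

zf = zipfile.ZipFile(os.path.expanduser("~/fenics.zip"), "w", zipfile.ZIP_DEFLATED)
for pkg in packages:
    for root, dirs, files in os.walk(os.path.join(prefix, pkg)):
        for name in files:
            if name.endswith(".py"):
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, prefix))
zf.close()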

Testing out this idea on ARCHER showed that it does work. I compressed FEniCS and sympy (the two worst offenders), and it reduced the load time on 768 cores from 71s to 39s.

Some caveats: you can’t zip up shared object (.so) files. Python uses dlopen() to access these, so they have to remain separate files. Also, though we can bring down the load time, the fundamental scaling is still bad, so at some point it will probably be necessary to reduce the number of machines doing the reading and distribute the file contents using MPI. It would be nice to be able to use a FUSE virtual filesystem, or maybe mount an HDF5 file and access it with MPI-IO…
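As a sketch of that MPI idea (nothing I have implemented yet - the paths and details below are made up): rank 0 reads the archive once from the parallel filesystem, broadcasts the bytes, and every rank then imports from a node-local copy.

# Sketch only: one reader on the shared filesystem; everyone else receives
# the bytes over MPI and imports from node-local /tmp. The compiled .so
# parts of dolfin still have to come from an ordinary directory.
import os
import sys
from mpi4py import MPI

comm = MPI.COMM_WORLD

data = None
if comm.rank == 0:
    with open(os.path.expanduser("~/fenics.zip"), "rb") as f:
        data = f.read()
data = comm.bcast(data, root=0)

# One copy per rank stops ranks on the same node clobbering each other;
# a smarter version would write just one copy per node.
local_zip = "/tmp/fenics_%d.zip" % comm.rank
with open(local_zip, "wb") as f:
    f.write(data)

sys.path.insert(0, local_zip)
from dolfin import *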

Any comments, most welcome!

Written with StackEdit.

Wednesday 9 July 2014

Memory Profiling

Memory Profiling with python

One of the biggest challenges in scaling up Finite Element calculations to HPC scale is memory management. At present, the UK national supercomputer ARCHER has 64GB for 24 cores on each node, which works out as 2.66GB per core.

This tends to occur:

[NID 01724] 2014-07-09 11:43:44 Apid 9128228: initiated application termination
[NID 01724] 2014-07-09 11:43:46 Apid 9128228: OOM killer terminated this process.
Application 9128228 exit signals: Killed

quite often, and it can be difficult to find out why.

One useful tool is memory_profiler. You decorate functions with @profile and it dumps a line-by-line account of memory usage when execution finishes (that is, if you’re not killed by the OOM killer first). OK, so it doesn’t tell you what is going on inside a solve (C code), but it can give you a rough idea of how memory usage stacks up.
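The setup is minimal - a sketch (the script name and mesh here are made up; the real script profiled below is a Stokes solver):

# Sketch: decorate the function of interest, then run
#   python -m memory_profiler stokes.py
# to get a line-by-line memory table like the one below.
from dolfin import *
from memory_profiler import profile

@profile
def main():
    mesh = UnitCubeMesh(16, 16, 16)
    V = VectorFunctionSpace(mesh, "CG", 2)
    Q = FunctionSpace(mesh, "CG", 1)
    W = V * Q
    # ... boundary conditions, forms, solve, output ...

if __name__ == "__main__":
    main()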

    16                                 # Load mesh and subdomains
    17   50.074 MiB    9.906 MiB       mesh = Mesh("meshes/sphere-box-50.xdmf")
    18                             #    mesh = refine(mesh)
    19                             
    20                                 # Define function spaces
    21   50.621 MiB    0.547 MiB       V = VectorFunctionSpace(mesh, "CG", 2)
    22   51.016 MiB    0.395 MiB       Q = FunctionSpace(mesh, "CG", 1)
    23   60.406 MiB    9.137 MiB       W = V * Q
    24                             
    25                             
    26   60.512 MiB    0.105 MiB       vzero = Constant((0, 0, 0))
    27   60.594 MiB    0.082 MiB       bc0 = DirichletBC(W.sub(0), vzero, boundary0)
    28                                 
    29                                 # Inflow boundary condition for velocity
    30   60.688 MiB    0.094 MiB       inflow = Expression(("(1.0 + cos(x[1]*pi))*(1.0 + cos(x[2]*pi))", "0.0", "0.0"))
    31   60.688 MiB    0.000 MiB       bc1 = DirichletBC(W.sub(0), inflow, boundary1)
    32                             
    33                                 # Boundary condition for pressure at outflow
    34   60.688 MiB    0.000 MiB       zero = Constant(0)
    35   60.688 MiB    0.000 MiB       bc2 = DirichletBC(W.sub(1), zero, boundary2)
    36                                 
    37                                 # Collect boundary conditions
    38   60.688 MiB    0.000 MiB       bcs = [bc0, bc1, bc2]
    39                             
    40                                 # Define variational problem
    41   60.688 MiB    0.000 MiB       (u, p) = TrialFunctions(W)
    42   60.688 MiB    0.000 MiB       (v, q) = TestFunctions(W)
    43   60.688 MiB    0.000 MiB       f = Constant((0, 0, 0))
    44   60.688 MiB    0.000 MiB       a = (inner(grad(u), grad(v)) - div(v)*p + q*div(u))*dx
    45   60.688 MiB    0.000 MiB       L = inner(f, v)*dx
    46                                 
    47                                 # Compute solution
    48   60.988 MiB    0.301 MiB       w = Function(W)
    49  100.309 MiB   39.125 MiB       solve(a == L, w, bcs)
    50                             
    51                                 # # Split the mixed solution using a shallow copy
    52  100.320 MiB    0.012 MiB       (u, p) = w.split()
    53                             
    54                                 # Save solution in VTK format
    55  100.336 MiB    0.016 MiB       ufile_pvd = File("velocity.xdmf")
    56  100.828 MiB    0.492 MiB       ufile_pvd << u
    57  100.828 MiB    0.000 MiB       pfile_pvd = File("pressure.xdmf")
    58  100.961 MiB    0.133 MiB       pfile_pvd << p
    59                             
    60                                 # Split the mixed solution using deepcopy
    61                                 # (needed for further computation on coefficient vector)
    62  101.293 MiB    0.332 MiB       (u, p) = w.split(True)
    63  101.328 MiB    0.035 MiB       unorm = u.vector().norm("l2")
    64  101.328 MiB    0.000 MiB       pnorm = p.vector().norm("l2")
    65                             
    66  101.328 MiB    0.000 MiB       if (MPI.rank(mesh.mpi_comm()) == 0):
    67  103.934 MiB    0.000 MiB           print "Norm of velocity coefficient vector: %.15g" % unorm
    68  103.934 MiB    0.000 MiB           print "Norm of pressure coefficient vector: %.15g" % pnorm
    69  103.961 MiB    0.027 MiB           list_timings()

This is just one core out of the 24 in this job. The second column shows the total memory usage and the third column the change at each line. At approximately 100MB per core, this represents only 0.1*24/64 = 3.75% of the node’s memory. What is not displayed is the transient memory used inside the solver. OOM conditions always happen in the solve, so it is critical to use the right backend to get the best performance. With an LU solver (as above) you run out of memory with fewer than 1M cells.
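For example, switching away from the default direct LU solve is just a matter of passing solver parameters to solve() - a sketch (whether this particular Krylov/preconditioner combination actually behaves well on a mixed Stokes system is another matter):

# Sketch: request an iterative Krylov solver instead of a direct LU solve,
# avoiding the large transient memory of the factorisation.
solve(a == L, w, bcs,
      solver_parameters={"linear_solver": "gmres",
                         "preconditioner": "amg"})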

Written with StackEdit.

Tuesday 24 June 2014

A quick note on compilers

crayc++ and icpc vs g++

I was just on ARCHER again this afternoon, and I thought I’d try recompiling dolfin with the Intel or Cray compilers. Unfortunately, neither of them is very C++11 compliant. So, for icpc, lots of complaints about std::unique_ptr etc.

rc/dolfin-1.4.0/dolfin/adaptivity/ErrorControl.cpp(479): error: namespace "std" has no member "unique_ptr"
      std::unique_ptr<DirichletBC> e_bc;
           ^

Well, this has now been fixed in icpc 14.0.2, apparently, so I’ll just have to
wait for that to appear on ARCHER, I guess. It is already on Darwin, so we can test Intel there…

crayc++ is considerably further behind the curve…

CC-135 crayc++: ERROR File = 
rc/dolfin-1.4.0/dolfin/common/NoDeleter.h, Line = 42
  The namespace "std" has no member "shared_ptr".

    std::shared_ptr<T> reference_to_no_delete_pointer(T& r)
         ^

Someone from Cray once told me not to bother with crayc++ - it is always playing catch-up…

Written with StackEdit.

FEniCS 1.4.0 on ARCHER - Part 2

Running a job with FEniCS on ARCHER

Yesterday, I compiled FEniCS 1.4.0 on ARCHER, and it built without a problem once I’d set up the PETSc library and disabled Trilinos. The configuration looks like this:

-- The following optional packages were found:
-- -------------------------------------------
-- (OK) OPENMP
-- (OK) MPI
-- (OK) PETSC
-- (OK) SCOTCH
-- (OK) PARMETIS
-- (OK) ZLIB
-- (OK) PYTHON
-- (OK) HDF5
-- (OK) QT
-- 
-- The following optional packages were not found:
-- -----------------------------------------------
-- (**) PETSC4PY
-- (**) SLEPC
-- (**) TRILINOS
-- (**) UMFPACK
-- (**) CHOLMOD
-- (**) PASTIX
-- (**) CGAL
-- (**) SPHINX
-- (**) VTK
-- 

I should probably support slepc and petsc4py, but that can wait until I’ve done some initial testing.

> module load fenics/1.4.0
> aprun -n 12 python demo_poisson.py 
Number of global vertices: 9261
Number of global cells: 48000
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling DOLFIN just-in-time (JIT) compiler, this may take some time.
Calling DOLFIN just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Calling FFC just-in-time (JIT) compiler, this may take some time.
Solving linear variational problem.
Solving linear variational problem.
Solving linear variational problem.
Solving linear variational problem.
Solving linear variational problem.
Solving linear variational problem.
Solving linear variational problem.
Solving linear variational problem.
Solving linear variational problem.
Solving linear variational problem.
Solving linear variational problem.
Solving linear variational problem.

Summary of timings                                |  Average time  Total time  Reps
-----------------------------------------------------------------------------------
Apply (PETScMatrix)                               |    0.00047588  0.00095177     2
Apply (PETScVector)                               |    0.00021639   0.0010819     5
Assemble cells                                    |     0.0019845    0.003969     2
Assemble exterior facets                          |    0.00033498  0.00033498     1
Build mesh number mesh entities                   |    4.0531e-06  4.0531e-06     1
Build sparsity                                    |      0.021424    0.042848     2
Compute local dual graph                          |       0.01228     0.01228     1
Compute non-local dual graph                      |     0.0087841   0.0087841     1
Delete sparsity                                   |    2.1458e-06  4.2915e-06     2
DirichletBC apply                                 |      0.002965    0.002965     1
DirichletBC compute bc                            |     0.0021579   0.0021579     1
DirichletBC init facets                           |     0.0020189   0.0020189     1
Generate Box mesh                                 |       0.78396     0.78396     1
HDF5: reorder vertex values                       |     0.0002284  0.00045681     2
HDF5: write mesh to file                          |       0.13826     0.13826     1
Init MPI                                          |     0.0020611   0.0020611     1
Init PETSc                                        |       0.10989     0.10989     1
Init dof vector                                   |       0.11122     0.11122     1
Init dofmap                                       |       0.10349     0.10349     1
Init dofmap from UFC dofmap                       |      0.061792    0.061792     1
Init tensor                                       |    0.00020909  0.00041819     2
LU solver                                         |       0.12104     0.12104     1
PARALLEL 2: Distribute mesh (cells and vertices)  |     0.0057669   0.0057669     1
PARALLEL 3: Build mesh (from local mesh data)     |      0.061399    0.061399     1
PETSc LU solver                                   |       0.12081     0.12081     1
Partition graph (calling SCOTCH)                  |       0.28882     0.28882     1
SCOTCH graph ordering                             |    0.00022697  0.00022697     1
build LocalMeshData                               |       0.11821     0.11821     1
compute connectivity 0 - 2                        |    0.00058699  0.00058699     1
compute connectivity 2 - 3                        |    0.00032806  0.00032806     1
compute entities dim = 2                          |      0.021483    0.021483     1
Application 8828173 resources: utime ~664s, stime ~24s, Rss ~275324, inblocks ~3158559, outblocks ~423320

Well, that seems to work.

Written with StackEdit.

Monday 23 June 2014

Compiling FEniCS 1.4.0 on ARCHER

Compiling FEniCS 1.4.0 on ARCHER

So, today, I decided: why not? Let’s compile FEniCS 1.4.0 on ARCHER. What can possibly go wrong?
Before attempting anything, I switched to the gnu compilers. I suppose some day I should try the Cray compiler – but why risk it, when I know gcc works?

module swap PrgEnv-cray PrgEnv-gnu

Then I thought: maybe we should try with Cray PETSc etc. - after all, it is there, nicely set up - maybe even optimised for the machine? I need a load of other things - some of which I have compiled myself, some of which are already there…
module use /work/e319/shared/modules
module load eigen/3.2.0
module load python/2.7.6 
module load numpy/1.8.0
module load ply/3.4
module load boost/1.55
module load swig/2.0.10           # needed on the compute nodes
module load cmake/2.8.12.2        # yes, my own version - available on the compute nodes
module load scientific-python/2.8 # needed by FIAT
module load cray-hdf5-parallel/1.8.12
module load cray-petsc/3.4.3.1
export PETSC_DIR=$CRAY_PETSC_PREFIX_DIR
module load cray-trilinos/11.6.1.0
export TRILINOS_DIR=$CRAY_TRILINOS_PREFIX_DIR
module load cray-tpsl/1.4.0
export SCOTCH_DIR=$CRAY_TPSL_PREFIX_DIR
export PARMETIS_DIR=$CRAY_TPSL_PREFIX_DIR

Now, to get it all working, first of all, we need ffc working. That is mostly python, so fairly easy to fix up:-

wget https://bitbucket.org/fenics-project/ffc/downloads/ffc-1.4.0.tar.gz
tar xf ffc-1.4.0.tar.gz
cd ffc-1.4.0
python setup.py install --prefix=/work/e319/shared/packages/fenics-1.4.0

which can also be repeated for instant, ufl and fiat.

Now, on to dolfin. In cmake, it is essential to disable the build tests: -DDOLFIN_SKIP_BUILD_TESTS=true, and the MPI autodetection: -DDOLFIN_AUTO_DETECT_MPI=false.

CMake doesn’t know that Cray PETSc has some crazy names for its libraries, so we have to tell it somewhere. I just hack a line into cmake/modules/FindPETSc.cmake :-

set(PETSC_LIB_BASIC "-lcraypetsc_gnu_real")

But now, there is a problem with Trilinos. This looks more serious.

CMake Error at /opt/cray/trilinos/11.6.1.0/GNU/48/sandybridge/lib/cmake/Trilinos/TrilinosTargets.cmake:412 (message):
  The imported target "teuchoscore" references the file

     "/opt/cray/trilinos/11.6.1.0/GNU/48/sandybridge/lib/libteuchoscore.so.11.6.1"

OK, maybe we can manage without Trilinos for the moment…

module rm trilinos

Now it compiles… but does it work? More later…

Written with StackEdit.