Wednesday, 9 July 2014

Memory Profiling

Memory Profiling with Python

One of the biggest challenges in scaling up Finite Element calculations to HPC scale is memory management. At present, the UK national supercomputer ARCHER has 64GB of memory shared between the 24 cores on each node, which works out at about 2.67GB per core.

When a node runs out of memory, the job is killed with output like this:

[NID 01724] 2014-07-09 11:43:44 Apid 9128228: initiated application termination
[NID 01724] 2014-07-09 11:43:46 Apid 9128228: OOM killer terminated this process.
Application 9128228 exit signals: Killed

This happens quite often, and it can be difficult to find out why.

One useful tool is memory_profiler. You decorate the functions of interest with @profile, and it dumps a line-by-line account of memory usage when execution finishes (that is, if you are not killed by the OOM killer first). It does not tell you what is going on inside a solve (which is C code), but it can give you a rough idea of how memory usage stacks up.
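As a minimal sketch of the decorator in use, here is a toy script that has nothing to do with finite elements. The function build_vectors and its contents are made up for illustration, and the fallback no-op decorator is an assumption of mine so that the script still runs on machines where memory_profiler is not installed.

```python
# Fall back to a no-op decorator so the script still runs where
# memory_profiler is not installed.
try:
    from memory_profiler import profile
except ImportError:
    def profile(func):
        return func


@profile
def build_vectors(n):
    """Allocate two lists so the profiler has something to report."""
    a = [float(i) for i in range(n)]   # first allocation
    b = [2.0 * x for x in a]           # second allocation
    return sum(b)


if __name__ == "__main__":
    print(build_vectors(100000))
```

If memory_profiler is available, each decorated function gets a report with one row per source line. The report for the actual finite element script looks like this: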

    16                                 # Load mesh and subdomains
    17   50.074 MiB    9.906 MiB       mesh = Mesh("meshes/sphere-box-50.xdmf")
    18                             #    mesh = refine(mesh)
    20                                 # Define function spaces
    21   50.621 MiB    0.547 MiB       V = VectorFunctionSpace(mesh, "CG", 2)
    22   51.016 MiB    0.395 MiB       Q = FunctionSpace(mesh, "CG", 1)
    23   60.406 MiB    9.137 MiB       W = V * Q
    26   60.512 MiB    0.105 MiB       vzero = Constant((0, 0, 0))
    27   60.594 MiB    0.082 MiB       bc0 = DirichletBC(W.sub(0), vzero, boundary0)
    29                                 # Inflow boundary condition for velocity
    30   60.688 MiB    0.094 MiB       inflow = Expression(("(1.0 + cos(x[1]*pi))*(1.0 + cos(x[2]*pi))", "0.0", "0.0"))
    31   60.688 MiB    0.000 MiB       bc1 = DirichletBC(W.sub(0), inflow, boundary1)
    33                                 # Boundary condition for pressure at outflow
    34   60.688 MiB    0.000 MiB       zero = Constant(0)
    35   60.688 MiB    0.000 MiB       bc2 = DirichletBC(W.sub(1), zero, boundary2)
    37                                 # Collect boundary conditions
    38   60.688 MiB    0.000 MiB       bcs = [bc0, bc1, bc2]
    40                                 # Define variational problem
    41   60.688 MiB    0.000 MiB       (u, p) = TrialFunctions(W)
    42   60.688 MiB    0.000 MiB       (v, q) = TestFunctions(W)
    43   60.688 MiB    0.000 MiB       f = Constant((0, 0, 0))
    44   60.688 MiB    0.000 MiB       a = (inner(grad(u), grad(v)) - div(v)*p + q*div(u))*dx
    45   60.688 MiB    0.000 MiB       L = inner(f, v)*dx
    47                                 # Compute solution
    48   60.988 MiB    0.301 MiB       w = Function(W)
    49  100.309 MiB   39.125 MiB       solve(a == L, w, bcs)
    51                                 # # Split the mixed solution using a shallow copy
    52  100.320 MiB    0.012 MiB       (u, p) = w.split()
    54                                 # Save solution in VTK format
    55  100.336 MiB    0.016 MiB       ufile_pvd = File("velocity.xdmf")
    56  100.828 MiB    0.492 MiB       ufile_pvd << u
    57  100.828 MiB    0.000 MiB       pfile_pvd = File("pressure.xdmf")
    58  100.961 MiB    0.133 MiB       pfile_pvd << p
    60                                 # Split the mixed solution using deepcopy
    61                                 # (needed for further computation on coefficient vector)
    62  101.293 MiB    0.332 MiB       (u, p) = w.split(True)
    63  101.328 MiB    0.035 MiB       unorm = u.vector().norm("l2")
    64  101.328 MiB    0.000 MiB       pnorm = p.vector().norm("l2")
    66  101.328 MiB    0.000 MiB       if (MPI.rank(mesh.mpi_comm()) == 0):
    67  103.934 MiB    0.000 MiB           print "Norm of velocity coefficient vector: %.15g" % unorm
    68  103.934 MiB    0.000 MiB           print "Norm of pressure coefficient vector: %.15g" % pnorm
    69  103.961 MiB    0.027 MiB           list_timings()

This is just one core out of the 24 in this job. The second column shows the memory usage, and the third column shows the change from the previous line. At approximately 100MB per core, this represents only 0.1*24/64 = 3.75% of the memory on a node. What is not displayed is the transient memory used inside the solver. OOM conditions always happen in the solver, so it is critical to use the right backend to get the best performance. With an LU solver (as above), you run out of memory with fewer than 1M cells.
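One way to get a handle on the transient memory is to ask the operating system for the peak resident size of the process, which includes allocations made inside C code. The sketch below uses the standard library resource module (Unix only). The helper names are my own, and it assumes the 64GB per node can be treated as 64*1024 MiB, so the percentage differs slightly from the rough figure above.

```python
import resource
import sys

NODE_MEMORY_GB = 64.0   # memory per ARCHER node, from the text
CORES_PER_NODE = 24     # cores per ARCHER node, from the text


def per_core_budget_mib():
    """Memory available to each core if the node is shared evenly."""
    return NODE_MEMORY_GB * 1024.0 / CORES_PER_NODE


def peak_rss_mib():
    """High-water mark of resident memory for this process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


def budget_fraction(used_mib):
    """Fraction of the per-core budget taken up by used_mib."""
    return used_mib / per_core_budget_mib()


if __name__ == "__main__":
    # A temporary allocation stands in for the solver's transient memory
    scratch = [0.0] * (5 * 10**6)
    del scratch
    peak = peak_rss_mib()
    print("peak RSS: %.1f MiB (%.1f%% of the per-core budget)"
          % (peak, 100.0 * budget_fraction(peak)))
```

Calling peak_rss_mib() straight after the solve on each rank gives the high-water mark including the solver, which the line-by-line report cannot show.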

