Memory Profiling with Python
One of the biggest challenges in scaling up finite element calculations to HPC scale is memory management. At present, the UK national supercomputer ARCHER has 64GB of memory for the 24 cores on each node, which works out at about 2.67GB per core.
Jobs are quite often killed with output like this:
[NID 01724] 2014-07-09 11:43:44 Apid 9128228: initiated application termination
[NID 01724] 2014-07-09 11:43:46 Apid 9128228: OOM killer terminated this process.
Application 9128228 exit signals: Killed
and it can be difficult to find out why.
One useful tool is memory_profiler. You decorate the functions of interest with @profile, and it dumps a line-by-line account of memory usage when execution finishes (that is, if you’re not killed by the OOM killer first). It doesn’t tell you what is going on inside a solve (which runs in C code), but it can give you a rough idea of how memory usage stacks up.
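As a minimal sketch of how the decorator is applied (the function `build_arrays` and its workload are hypothetical and not taken from the job profiled here), something like the following should work. The fallback to a no-op decorator is my own addition, so that the script still runs where memory_profiler is not installed.

```python
# Use memory_profiler's decorator if available; otherwise fall back to a
# no-op so the script still runs (unprofiled). The fallback is an assumption
# added for convenience, not part of memory_profiler itself.
try:
    from memory_profiler import profile
except ImportError:
    def profile(func):
        return func


@profile
def build_arrays(n=200000):
    """Hypothetical workload: allocate two lists so memory changes per line."""
    a = [0.0] * n                      # first allocation
    b = [float(i) for i in range(n)]   # second, larger allocation
    total = sum(a) + sum(b)
    del b                              # released memory shows as a negative increment
    return total


if __name__ == "__main__":
    print(build_arrays())
```

With the explicit import, running the script with plain `python` prints the report when the decorated function returns. Running it as `python -m memory_profiler script.py` also works, and in that mode the import can be left out.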
16 # Load mesh and subdomains
17 50.074 MiB 9.906 MiB mesh = Mesh("meshes/sphere-box-50.xdmf")
18 # mesh = refine(mesh)
19
20 # Define function spaces
21 50.621 MiB 0.547 MiB V = VectorFunctionSpace(mesh, "CG", 2)
22 51.016 MiB 0.395 MiB Q = FunctionSpace(mesh, "CG", 1)
23 60.406 MiB 9.137 MiB W = V * Q
24
25
26 60.512 MiB 0.105 MiB vzero = Constant((0, 0, 0))
27 60.594 MiB 0.082 MiB bc0 = DirichletBC(W.sub(0), vzero, boundary0)
28
29 # Inflow boundary condition for velocity
30 60.688 MiB 0.094 MiB inflow = Expression(("(1.0 + cos(x[1]*pi))*(1.0 + cos(x[2]*pi))", "0.0", "0.0"))
31 60.688 MiB 0.000 MiB bc1 = DirichletBC(W.sub(0), inflow, boundary1)
32
33 # Boundary condition for pressure at outflow
34 60.688 MiB 0.000 MiB zero = Constant(0)
35 60.688 MiB 0.000 MiB bc2 = DirichletBC(W.sub(1), zero, boundary2)
36
37 # Collect boundary conditions
38 60.688 MiB 0.000 MiB bcs = [bc0, bc1, bc2]
39
40 # Define variational problem
41 60.688 MiB 0.000 MiB (u, p) = TrialFunctions(W)
42 60.688 MiB 0.000 MiB (v, q) = TestFunctions(W)
43 60.688 MiB 0.000 MiB f = Constant((0, 0, 0))
44 60.688 MiB 0.000 MiB a = (inner(grad(u), grad(v)) - div(v)*p + q*div(u))*dx
45 60.688 MiB 0.000 MiB L = inner(f, v)*dx
46
47 # Compute solution
48 60.988 MiB 0.301 MiB w = Function(W)
49 100.309 MiB 39.125 MiB solve(a == L, w, bcs)
50
51 # # Split the mixed solution using a shallow copy
52 100.320 MiB 0.012 MiB (u, p) = w.split()
53
54 # Save solution in VTK format
55 100.336 MiB 0.016 MiB ufile_pvd = File("velocity.xdmf")
56 100.828 MiB 0.492 MiB ufile_pvd << u
57 100.828 MiB 0.000 MiB pfile_pvd = File("pressure.xdmf")
58 100.961 MiB 0.133 MiB pfile_pvd << p
59
60 # Split the mixed solution using deepcopy
61 # (needed for further computation on coefficient vector)
62 101.293 MiB 0.332 MiB (u, p) = w.split(True)
63 101.328 MiB 0.035 MiB unorm = u.vector().norm("l2")
64 101.328 MiB 0.000 MiB pnorm = p.vector().norm("l2")
65
66 101.328 MiB 0.000 MiB if (MPI.rank(mesh.mpi_comm()) == 0):
67 103.934 MiB 0.000 MiB print "Norm of velocity coefficient vector: %.15g" % unorm
68 103.934 MiB 0.000 MiB print "Norm of pressure coefficient vector: %.15g" % pnorm
69 103.961 MiB 0.027 MiB list_timings()
This is just one core out of the 24 in this job. The first column is the line number, the second column shows the memory usage after that line has run, and the third column shows the change. At approximately 100MB per core, this represents only 0.1*24/64 = 3.75% of the node's memory. What is not displayed is the transient memory used by the solver. OOM conditions always happen in the solver, so it is critical to use the right backend to get the best performance. With an LU solver (as above) you run out of memory with fewer than 1M cells.
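Since the line-by-line report misses transient allocations, one possible complement (a different technique from memory_profiler, using only the standard library `resource` module on Unix) is to read the peak resident set size of the process after the solve. The sketch below also reproduces the node-fraction arithmetic above. The helper names `peak_rss_mib` and `node_fraction` are made up for illustration, and the `bytearray` simply stands in for a solve.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        return peak / (1024.0 * 1024.0)
    return peak / 1024.0


def node_fraction(per_core_gb, cores=24, node_gb=64.0):
    """Fraction of node memory used if every core uses per_core_gb."""
    return per_core_gb * cores / node_gb


if __name__ == "__main__":
    before = peak_rss_mib()
    # Hypothetical transient allocation standing in for a solve.
    scratch = bytearray(50 * 1024 * 1024)
    del scratch
    after = peak_rss_mib()
    print("peak RSS before: %.1f MiB, after: %.1f MiB" % (before, after))
    # Treat 1024 MiB as roughly 1GB for this back-of-envelope estimate.
    print("fraction of node at this peak: %.2f%%"
          % (100 * node_fraction(after / 1024.0)))
```

Because the peak is a high-water mark, it still reflects memory that was freed before the measurement, which a per-line snapshot cannot show. It only helps, of course, if the process survives long enough to print it.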