SIGSEGV fault for large problems

4 years 9 months ago #2402 by gcdiwan
Dear NGSolve Developers,

I am having issues with the direct solver when dealing with problems with ndof ~ 2-10 million. My problem involves an inhomogeneous Dirichlet BC and I use the technique given in Sec. 1.3 of the documentation, namely:
Code:
u, v = fes.TnT()
a = BilinearForm(fes, symmetric=True)
a += grad(u)*grad(v)*dx
f = LinearForm(fes)
gfu = GridFunction(fes)
gfu.Set(ubar, definedon=BND)
# with TaskManager():
a.Assemble()
f.Assemble()
res = gfu.vec.CreateVector()
res.data = f.vec - a.mat * gfu.vec
gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="sparsecholesky") * res

The Slurm script I use to launch NGSolve:
Code:
#!/bin/bash
#SBATCH --job-name=ngs
#SBATCH -N 4
#SBATCH --ntasks 96
#SBATCH --ntasks-per-node=24
#SBATCH --ntasks-per-core=1
#SBATCH --mem=24gb

# Load ngsolve_mpi module
module load apps/ngsolve_mpi

mpirun ngspy script.py

However, the Slurm script returns the following message for one of the nodes:
Code:
-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should not have.
It is likely that your MPI job will now either abort or experience performance degradation.

  Local host:  node16
  System call: unlink(2) /tmp/openmpi-sessions-1608000011@node16_0/19664/1/2/vader_segment.node16.2
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------

The exact same piece of code of course runs fine when the system is smaller; it is only with a refined mesh for the same geometry that I run into these errors. I have tried refining both outside NGSolve (using gmsh and then reading the refined mesh into NGSolve) and inside (i.e. reading a coarse mesh and then refining it with NGSolve's Refine), but both end up in errors. Could you comment on the likely cause of the problem? Thank you in advance for your help.
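For reference, the in-NGSolve refinement I mean is essentially the following sketch (the gmsh import via netgen.read_gmsh, the filename and the number of refinements are placeholders):
Code:
from netgen.read_gmsh import ReadGmsh
from ngsolve import Mesh

# read a coarse gmsh mesh and refine it uniformly inside NGSolve
ngmesh = ReadGmsh("coarse.msh")    # placeholder filename
mesh = Mesh(ngmesh)
for _ in range(2):                 # number of uniform refinements is illustrative
    mesh.Refine()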
4 years 9 months ago #2403 by gcdiwan
Trying with UMFPACK, I get an out-of-memory error: UMFPACK V5.7.4 (Feb 1, 2016): ERROR: out of memory.
NGSolve also writes an error message for the line where I call UMFPACK:

Traceback (most recent call last):
  File "script.py", line 107, in <module>
    gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="umfpack") * res
netgen.libngpy._meshing.NgException: UmfpackInverse: Symbolic factorization failed.
4 years 9 months ago #2406 by matthiash
The solver is running out of memory; 24 GB is probably not enough for millions of unknowns with a direct solver. For larger problems you might want to switch to an iterative solver instead; have a look at the documentation about preconditioners:

ngsolve.org/docu/latest/i-tutorials/unit.../preconditioner.html
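Roughly along these lines (just a sketch, reusing fes and ubar from your first post; the preconditioner type and keyword names such as maxsteps may vary slightly between NGSolve versions):
Code:
from ngsolve import *

u, v = fes.TnT()
a = BilinearForm(fes, symmetric=True)
a += grad(u)*grad(v)*dx
c = Preconditioner(a, "bddc")    # register before assembly; "local", "multigrid", ... are alternatives
f = LinearForm(fes)

gfu = GridFunction(fes)
gfu.Set(ubar, definedon=BND)

with TaskManager():
    a.Assemble()
    f.Assemble()

res = gfu.vec.CreateVector()
res.data = f.vec - a.mat * gfu.vec

# iterative solve replacing a.mat.Inverse(...)
inv = CGSolver(a.mat, c.mat, printrates=True, maxsteps=1000)
gfu.vec.data += inv * res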

Best,
Matthias
4 years 9 months ago #2408 by lkogler
Also, "sparsecholesky" and "umfpack" do not work with parallel matrices - they are for local matrices only!

For parallel matrices use "mumps" as a direct solver (if you have configured NGSolve with it).

(You can also use "masterinverse" for small problems - then the master proc gathers the entire matrix and inverts it by itself).
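For example, with the setup from the first post (sketch only; assumes an NGSolve build configured with MUMPS and reuses a, fes, res and gfu from that code):
Code:
# distributed direct factorization with MUMPS
inv = a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="mumps")
gfu.vec.data += inv * res

# alternative for small problems: gather the full matrix on the master rank
# inv = a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="masterinverse")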
4 years 9 months ago #2410 by gcdiwan
I had the wrong information on how much RAM I can use (hence the 24G in the earlier script). We have a 15-node cluster with 256 GB of RAM on each node. Requesting 6 nodes (each with 48 parallel threads) and 150 GB of RAM per node (leaving some for the OS), I still could not get the code running with MUMPS. With some 900 GB of RAM across the 6 nodes requested, I can't understand why it still struggles to solve. I even reduced the problem size considerably: from the previous 9M dofs, I now have 5.2M dofs.
Code:
#SBATCH --job-name=ngs_phi
#SBATCH -N 6
#SBATCH --mem=150G
#SBATCH --nodelist="node[11,16-20]"

The Slurm output file has:
Code:
[node16:71446] *** An error occurred in MPI_Comm_rank
[node16:71446] *** reported by process [694681601,1]
[node16:71446] *** on communicator MPI_COMM_WORLD
[node16:71446] *** MPI_ERR_COMM: invalid communicator
[node16:71446] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node16:71446] ***    and potentially your MPI job)

whereas the Slurm error file has:
Code:
[node11:93553] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node11:93553] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
4 years 9 months ago #2411 by gcdiwan

matthiash wrote: For larger problems you might want to switch to an iterative solver instead, have a look at the documentation about preconditioners:

I am trying to solve high-frequency Helmholtz problems and I have not tried the iterative solvers yet, as I don't know enough about the suitability of the preconditioners available in NGSolve.