SIGSEGV fault for large problems

4 years 9 months ago #2402 by gcdiwan
Dear NGSolve Developers,

I am having issues with the direct solver when dealing with problems with ndof ~ 2-10 million. My problem involves an inhomogeneous Dirichlet BC and I use the technique given in Sec. 1.3 of the documentation, namely:
Code:
u, v = fes.TnT()
a = BilinearForm(fes, symmetric=True)
a += grad(u)*grad(v)*dx
f = LinearForm(fes)
gfu = GridFunction(fes)
gfu.Set(ubar, definedon=BND)
# with TaskManager():
a.Assemble()
f.Assemble()
res = gfu.vec.CreateVector()
res.data = f.vec - a.mat * gfu.vec
gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="sparsecholesky") * res

The Slurm script I use to launch NGSolve:
Code:
#!/bin/bash
#SBATCH --job-name=ngs
#SBATCH -N 4
#SBATCH --ntasks 96
#SBATCH --ntasks-per-node=24
#SBATCH --ntasks-per-core=1
#SBATCH --mem=24gb

# Load ngsolve_mpi module
module load apps/ngsolve_mpi

mpirun ngspy script.py

However, the Slurm script returns the following message for one of the nodes:
Code:
-------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should not have.
It is likely that your MPI job will now either abort or experience performance degradation.

  Local host:  node16
  System call: unlink(2) /tmp/openmpi-sessions-1608000011@node16_0/19664/1/2/vader_segment.node16.2
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------

The exact same piece of code of course runs fine when the system is smaller; it is only with a refined mesh for the same geometry that I run into these errors. I have tried refining both outside NGSolve (using gmsh and then reading the refined mesh into NGSolve) and inside (i.e. reading a coarse mesh and then refining it with NGSolve's Refine), but both end up in errors. Could you comment on the likely cause of the problem? Thank you in advance for your help.
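For reference, the in-NGSolve refinement I mean is essentially the following sketch (the gmsh import via netgen.read_gmsh, the filename and the number of refinements are placeholders):
Code:
from netgen.read_gmsh import ReadGmsh
from ngsolve import Mesh

# read a coarse gmsh mesh and refine it uniformly inside NGSolve
ngmesh = ReadGmsh("coarse.msh")    # placeholder filename
mesh = Mesh(ngmesh)
for _ in range(2):                 # number of uniform refinements is illustrative
    mesh.Refine()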
4 years 9 months ago #2403 by gcdiwan
Trying with UMFPACK, I get an out-of-memory error: UMFPACK V5.7.4 (Feb 1, 2016): ERROR: out of memory.
NGSolve also writes an error message for the line where I call UMFPACK:

Traceback (most recent call last):
  File "script.py", line 107, in <module>
    gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="umfpack") * res
netgen.libngpy._meshing.NgException: UmfpackInverse: Symbolic factorization failed.
4 years 9 months ago #2406 by matthiash
The solver is running out of memory; 24 GB is probably not enough for millions of unknowns with a direct solver. For larger problems you might want to switch to an iterative solver instead; have a look at the documentation about preconditioners:

ngsolve.org/docu/latest/i-tutorials/unit.../preconditioner.html
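Roughly along these lines (just a sketch, reusing fes and ubar from your first post; the preconditioner type and keyword names such as maxsteps may vary slightly between NGSolve versions):
Code:
from ngsolve import *

u, v = fes.TnT()
a = BilinearForm(fes, symmetric=True)
a += grad(u)*grad(v)*dx
c = Preconditioner(a, "bddc")    # register before assembly; "local", "multigrid", ... are alternatives
f = LinearForm(fes)

gfu = GridFunction(fes)
gfu.Set(ubar, definedon=BND)

with TaskManager():
    a.Assemble()
    f.Assemble()

res = gfu.vec.CreateVector()
res.data = f.vec - a.mat * gfu.vec

# iterative solve replacing a.mat.Inverse(...)
inv = CGSolver(a.mat, c.mat, printrates=True, maxsteps=1000)
gfu.vec.data += inv * res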

Best,
Matthias
4 years 9 months ago #2408 by lkogler
Also, "sparsecholesky" and "umfpack" do not work with parallel matrices - they are for local matrices only!

For parallel matrices use "mumps" as a direct solver (if you have configured NGSolve with it).

(You can also use "masterinverse" for small problems - then the master proc gathers the entire matrix and inverts it by itself).
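For example, with the setup from the first post (sketch only; assumes an NGSolve build configured with MUMPS and reuses a, fes, res and gfu from that code):
Code:
# distributed direct factorization with MUMPS
inv = a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="mumps")
gfu.vec.data += inv * res

# alternative for small problems: gather the full matrix on the master rank
# inv = a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="masterinverse")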
4 years 9 months ago #2410 by gcdiwan
I had the wrong information on how much RAM I can use (hence the 24G in the earlier script). We have a 15-node cluster with 256 GB of RAM on each node. Requesting 6 nodes (each with 48 parallel threads) and 150 GB of RAM per node (leaving some for the OS), I still could not get the code running with MUMPS. With some 900 GB of RAM across the 6 nodes requested, I can't understand why it still struggles to solve. I even reduced the problem size considerably: from the previous 9M dofs, I now have 5.2M dofs.
Code:
#SBATCH --job-name=ngs_phi
#SBATCH -N 6
#SBATCH --mem=150G
#SBATCH --nodelist="node[11,16-20]"

The Slurm output file has:
Code:
[node16:71446] *** An error occurred in MPI_Comm_rank
[node16:71446] *** reported by process [694681601,1]
[node16:71446] *** on communicator MPI_COMM_WORLD
[node16:71446] *** MPI_ERR_COMM: invalid communicator
[node16:71446] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node16:71446] ***    and potentially your MPI job)

whereas the Slurm error file has:
Code:
[node11:93553] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node11:93553] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
4 years 9 months ago #2411 by gcdiwan

matthiash wrote: For larger problems you might want to switch to an iterative solver instead, have a look at the documentation about preconditioners:

I am trying to solve high-frequency Helmholtz problems and I have not tried the iterative solvers yet, as I don't know enough about the suitability of the preconditioners available in NGSolve.