SIGSEGV fault for large problems
4 years 9 months ago #2402
by gcdiwan
SIGSEGV fault for large problems was created by gcdiwan
Dear NGSolve Developers;
I am having issues with the direct solver for problems with ndof ~ 2-10 million. My problem involves an inhomogeneous Dirichlet BC and I use the technique given in Sec 1.3 of the documentation, namely:
Code:
from ngsolve import *   # fes (the FE space) and ubar (the Dirichlet data) are set up earlier in the script
u, v = fes.TnT()
a = BilinearForm(fes, symmetric=True)
a += grad(u)*grad(v)*dx
f = LinearForm(fes)            # source term omitted in this excerpt
gfu = GridFunction(fes)
gfu.Set(ubar, definedon=BND)   # set the inhomogeneous Dirichlet data on the boundary
with TaskManager():
    a.Assemble()
    f.Assemble()
    # homogenize: move the known boundary values to the right-hand side
    res = gfu.vec.CreateVector()
    res.data = f.vec - a.mat * gfu.vec
    gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="sparsecholesky") * res
The slurm script I use to launch NGSolve:
Code:
#!/bin/bash
#SBATCH --job-name=ngs
#SBATCH -N 4
#SBATCH --ntasks 96
#SBATCH --ntasks-per-node=24
#SBATCH --ntasks-per-core=1
#SBATCH --mem=24gb
#Load ngsolve_mpi module
module load apps/ngsolve_mpi
mpirun ngspy script.py
However, the slurm job returns the following message for one of the nodes:
Code:
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.
Local host: node16
System call: unlink(2) /tmp/openmpi-sessions-1608000011@node16_0/19664/1/2/vader_segment.node16.2
Error: No such file or directory (errno 2)
--------------------------------------------------------------------------
The exact same piece of code of course runs fine when the system is smaller; it is only when I use a refined mesh for the same geometry that I run into the errors. I have tried refining both outside NGSolve (using gmsh and then reading the refined mesh into NGSolve) and inside (i.e. reading a coarse mesh and then refining it with NGSolve's refine), but both end up in errors. Could you comment on the likely cause of the problem? Thank you in advance for your help.
4 years 9 months ago #2403
by gcdiwan
Replied by gcdiwan on topic SIGSEGV fault for large problems
Trying with umfpack, I get an out-of-memory error: UMFPACK V5.7.4 (Feb 1, 2016): ERROR: out of memory.
NGSolve also writes an error message for the line where I call umfpack:
Traceback (most recent call last):
File "script.py", line 107, in <module>
gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="umfpack") * res
netgen.libngpy._meshing.NgException: UmfpackInverse: Symbolic factorization failed.
4 years 9 months ago #2406
by matthiash
Replied by matthiash on topic SIGSEGV fault for large problems
The solver is running out of memory; 24GB is probably not enough for millions of unknowns using a direct solver. For larger problems you might want to switch to an iterative solver instead; have a look at the documentation about preconditioners:
ngsolve.org/docu/latest/i-tutorials/unit.../preconditioner.html
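A minimal sketch of that approach, continuing the Poisson setup from the first post. The preconditioner type ("bddc") and the CG parameters are illustrative assumptions, and keyword names may differ slightly between NGSolve versions:
Code:
# iterative solve with a registered preconditioner
# (assumes fes, a, f, gfu, ubar set up as in the first post, before a.Assemble())
c = Preconditioner(a, "bddc")     # or "multigrid", "local", ...
a.Assemble()                      # the preconditioner is updated during assembly
f.Assemble()
res = gfu.vec.CreateVector()
res.data = f.vec - a.mat * gfu.vec
du = gfu.vec.CreateVector()
from ngsolve.solvers import CG
CG(mat=a.mat, pre=c.mat, rhs=res, sol=du, maxsteps=500, printrates=True)
gfu.vec.data += du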
Best,
Matthias
4 years 9 months ago #2408
by lkogler
Replied by lkogler on topic SIGSEGV fault for large problems
Also, "sparsecholesky" and "umfpack" do not work with parallel matrices - they are for local matrices only!
For parallel matrices use "mumps" as a direct solver (if you have built NGSolve with it).
(You can also use "masterinverse" for small problems - then the master proc gathers the entire matrix and inverts it by itself.)
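A minimal sketch of the suggested change, reusing the matrices from the first post (requires an NGSolve build with MUMPS support):
Code:
# direct solve with a parallel-capable factorization
inv = a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="mumps")
# alternative for small problems: gather and invert on the master rank
# inv = a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="masterinverse")
gfu.vec.data += inv * res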
4 years 9 months ago #2410
by gcdiwan
Replied by gcdiwan on topic SIGSEGV fault for large problems
I had the wrong information on how much RAM I can use (the reason for setting 24G in the earlier script). We have a 15-node cluster with 256GB of RAM on each node. Requesting 6 nodes (each with 48 parallel threads) and 150GB of RAM per node (leaving some for the OS), I still could not get the code running with MUMPS. With some 900GB of RAM across the 6 nodes requested, I can't understand why it still struggles to solve. I even reduced the problem size considerably: from the previous 9m dofs, I now have 5.2m dofs.
The SBATCH settings I changed:
Code:
#SBATCH --job-name=ngs_phi
#SBATCH -N 6
#SBATCH --mem=150G
#SBATCH --nodelist="node[11,16-20]"
Slurm output file has:
Code:
[node16:71446] *** An error occurred in MPI_Comm_rank
[node16:71446] *** reported by process [694681601,1]
[node16:71446] *** on communicator MPI_COMM_WORLD
[node16:71446] *** MPI_ERR_COMM: invalid communicator
[node16:71446] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node16:71446] *** and potentially your MPI job)
whereas the error file from slurm has:
Code:
[node11:93553] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node11:93553] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
4 years 9 months ago #2411
by gcdiwan
Replied by gcdiwan on topic SIGSEGV fault for large problems
matthiash wrote: For larger problems you might want to switch to an iterative solver instead, have a look at the documentation about preconditioners:
I am trying to solve high-frequency Helmholtz problems and I have not tried the iterative solvers yet, as I don't know enough about the suitability of the preconditioners available in NGSolve.
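For reference, a minimal sketch of what an iterative Helmholtz solve could look like. The complex space, the boundary name, the wavenumber k, and the GMRES call are all illustrative assumptions rather than the poster's actual setup; plain CG is not suitable for the indefinite Helmholtz operator, which is why GMRES is used here:
Code:
# hypothetical Helmholtz setup: mesh, k (wavenumber) and ubar (boundary data) assumed to exist,
# "outer" is an assumed boundary name
fes = H1(mesh, order=3, complex=True, dirichlet="outer")
u, v = fes.TnT()
a = BilinearForm(fes)
a += (grad(u)*grad(v) - k*k*u*v)*dx
f = LinearForm(fes)
a.Assemble()
f.Assemble()
gfu = GridFunction(fes)
gfu.Set(ubar, definedon=BND)
res = gfu.vec.CreateVector()
res.data = f.vec - a.mat * gfu.vec
du = gfu.vec.CreateVector()
from ngsolve.solvers import GMRes
GMRes(A=a.mat, b=res, x=du, freedofs=fes.FreeDofs(), maxsteps=500, printrates=True)
gfu.vec.data += du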