SIGSEGV fault for large problems
4 years 9 months ago #2412
by lkogler
Replied by lkogler on topic SIGSEGV fault for large problems
This does not look like a memory issue.
Sorry for asking, but are you properly distributing the mesh in the beginning of your script?
If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.
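A quick way to rule out a wrong MPI installation is a bare script that does nothing but report ranks. This is only a minimal sketch, using the same mpi_world communicator that appears later in this thread:
Code:
# check_mpi.py -- run with e.g. `mpirun -np 4 python3 check_mpi.py`
# If mpirun belongs to a different MPI than the one NGSolve was built with,
# each process typically reports rank 0 of size 1 (or the run aborts).
from ngsolve import mpi_world

print("rank", mpi_world.rank, "of", mpi_world.size)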
4 years 9 months ago #2413
by gcdiwan
Replied by gcdiwan on topic SIGSEGV fault for large problems
lkogler wrote: This does not look like a memory issue.
Sorry for asking, but are you properly distributing the mesh in the beginning of your script?
If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.
Ok - how do you distribute the mesh properly when it's generated by an external tool such as Gmsh?
Here is my complete code (well, almost - except for the script that writes the Gmsh file):
Code:
#!/usr/bin/env python
# coding: utf-8
#
from ngsolve import *
from netgen.csg import *
import numpy as np
import sys
import math
import time
import csv
import os
from GMRes_v2 import GMResv2
#
import subprocess
import multiprocessing
#
from netgen.read_gmsh import ReadGmsh
# initialise mpi
comm = mpi_world
rank = comm.rank
nproc= comm.size
PI = np.pi
# ************************************
# problem data:
freq = float(sys.argv[1])
polOrder = int(sys.argv[2])
elmperlam = int(sys.argv[3])
# ************************************
# geometrical params:
x0 = -0.011817
y0 = -38.122429
z0 = -0.004375
# ************************************
cspeed = 343e3 # in mm/s
waveno = 2.0*PI*freq / cspeed
wavelength = 2.0*PI/waveno
helem = wavelength / elmperlam
dpml = 2.0*wavelength
radcomp = 27.5 + 4.0*wavelength # radius of sensor plus 4 wavelengths
Rext = radcomp
rpml = radcomp - dpml
# ************************************
meshfilename = '../../meshes/model.msh'
# import the Gmsh file to a Netgen mesh object
mesh = ReadGmsh(meshfilename)
mesh = Mesh(mesh)
print('mesh1 ne: ', mesh.ne)
mesh.Refine()
mesh.Refine()
print('mesh2 ne: ', mesh.ne)
if (rank==0):
    print(mesh.GetBoundaries())
    print ("num vol elements:", mesh.GetNE(VOL))
    print ("num bnd elements:", mesh.GetNE(BND))
print('add pml..')
mesh.SetPML(pml.Radial(origin=(x0,y0,z0), rad=rpml, alpha=1j), definedon="air")
ubar = exp (1J*waveno*x)
fes = H1(mesh, complex=True, order=polOrder, dirichlet="sensor_srf")
if (rank==0):
    print('ndof = ', fes.ndof)
u = fes.TrialFunction()
v = fes.TestFunction()
print("rank "+str(rank)+" has "+str(fes.ndof)+" of "+str(fes.ndofglobal)+" dofs!")
mesh.GetMaterials()
start = time.time()
gfu = GridFunction (fes)
gfu.Set (ubar, definedon='sensor_srf')
a = BilinearForm (fes, symmetric=True)
a += SymbolicBFI (grad(u)*grad(v) )
a += SymbolicBFI (-waveno*waveno*u*v)
f = LinearForm (fes)
from datetime import datetime
with TaskManager():
    # create threads and assemble
    print('cpus: ', multiprocessing.cpu_count() )
    a.Assemble()
    f.Assemble()
res = gfu.vec.CreateVector()
res.data = f.vec - a.mat * gfu.vec
end = time.time()
if (rank==0):
    print('tassm: ', end - start)
start = time.time()
print("solve started: ", datetime.now().strftime("%H:%M:%S") )
gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="mumps") * res
end = time.time()
print("solve ended: ", datetime.now().strftime("%H:%M:%S") )
if (rank==0):
    print('tsolve: ', end - start)
4 years 9 months ago #2415
by gcdiwan
Replied by gcdiwan on topic SIGSEGV fault for large problems
lkogler wrote: If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.
I am not sure that is the case, as I load the ngsolve parallel build specifically by calling:
Code:
module load apps/ngsolve_mpi
There is a serial ngsolve build available, but I don't think that's being invoked.
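For what it's worth, one way to confirm which build actually gets imported at run time is something along these lines (just a sketch; the prints are purely diagnostic):
Code:
import sys
import ngsolve

# Print, from every rank, which Python and which NGSolve install the job really uses.
print("python :", sys.executable)
print("ngsolve:", ngsolve.__file__)
print("ranks  :", ngsolve.mpi_world.size)  # stays 1 if a serial build (or the wrong MPI) is picked up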
4 years 9 months ago #2416
by lkogler
Replied by lkogler on topic SIGSEGV fault for large problems
Something like this:
Code:
if mpi_world.rank == 0:
    ngmesh = ReadGmsh(meshfilename)
    if mpi_world.size > 1:
        ngmesh.Distribute(mpi_world)
else:
    ngmesh = netgen.meshing.Mesh.Receive(mpi_world)
mesh = Mesh(ngmesh)
I have not personally tested it with a mesh loaded from Gmsh, but it should work.
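Applied to the script posted above, the mesh import would then look roughly like this (an untested sketch that keeps the same mesh file path); the rest of the script can keep using mesh as before:
Code:
import netgen.meshing
from netgen.read_gmsh import ReadGmsh
from ngsolve import Mesh, mpi_world

meshfilename = '../../meshes/model.msh'
# Rank 0 reads the Gmsh file and distributes the pieces; all other ranks
# receive their part of the mesh instead of reading the file themselves.
if mpi_world.rank == 0:
    ngmesh = ReadGmsh(meshfilename)
    if mpi_world.size > 1:
        ngmesh.Distribute(mpi_world)
else:
    ngmesh = netgen.meshing.Mesh.Receive(mpi_world)
mesh = Mesh(ngmesh)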
4 years 9 months ago #2418
by gcdiwan
Replied by gcdiwan on topic SIGSEGV fault for large problems
I still encounter the SIGSEGV fault with MUMPS despite distributing the mesh. Just for the sake of completeness, I tried the mpi_poisson.py script in master/py_tutorials/mpi/ with MUMPS (both with and without the preconditioning), and it still fails with the same error. This probably tells me something is wrong in the parallel build with MUMPS. mpi_poisson.py works with sparsecholesky, however.
Code:
u.vec.data = a.mat.Inverse(V.FreeDofs(), inverse="mumps") * f.vec # use MUMPS parallel inverse
4 years 8 months ago - 4 years 8 months ago #2435
by lkogler
Replied by lkogler on topic SIGSEGV fault for large problems
When you use "sparsecholesky" with a parallel matrix, NGSolve reverts to a different inverse type that works with parallel matrices (I believe "masterinverse") without telling you (not very pretty, I know), which is why mpi_poisson.py works.
Errors like this:
Code:
[node16:71446] *** An error occurred in MPI_Comm_rank
usually indicate a problem with the installation, or with how the job is started.
Are you using the MUMPS built with NGSolve or a separate MUMPS install? We have had issues with MUMPS 5.1 for larger problems; upgrading to 5.2 resolved those.
If you use a separate MUMPS install, you have to make sure that MUMPS and NGSolve have been built with the same MPI libraries.
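In the meantime, the parallel-safe fallback can be requested explicitly, roughly like this (an untested sketch based on the "masterinverse" hint above; the exact string accepted may depend on the NGSolve version):
Code:
# Same solve as in mpi_poisson.py, but asking for the distributed master inverse instead of MUMPS.
u.vec.data = a.mat.Inverse(V.FreeDofs(), inverse="masterinverse") * f.vec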
Last edit: 4 years 8 months ago by lkogler.