SIGSEGV fault for large problems

lkogler
Premium Member

5 years 6 months ago #2412 by lkogler

Replied by lkogler on topic SIGSEGV fault for large problems

This does not look like a memory issue.

Sorry for asking, but are you properly distributing the mesh in the beginning of your script?

If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.
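
One way to check for competing MPI installations is to list every launcher on the PATH and to look at which MPI shared libraries the Python process has actually loaded. The following is only a rough sketch using the standard library; the helper names `find_executables` and `loaded_mpi_libraries` are made up for illustration, and the second one assumes Linux (it reads /proc/self/maps) and an MPI whose library file name contains "libmpi".

```python
import os
import shutil

def find_executables(name):
    """Return every distinct executable called `name` found along PATH, in search order."""
    found = []
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            real = os.path.realpath(candidate)
            if real not in found:
                found.append(real)
    return found

def loaded_mpi_libraries(maps_path="/proc/self/maps"):
    """List MPI shared libraries mapped into this process (Linux only)."""
    libs = set()
    try:
        with open(maps_path) as maps:
            for line in maps:
                fields = line.split()
                path = fields[-1] if fields else ""
                if "libmpi" in os.path.basename(path):
                    libs.add(path)
    except OSError:
        pass  # not Linux, or /proc is unavailable
    return sorted(libs)

if __name__ == "__main__":
    for launcher in ("mpirun", "mpiexec"):
        print(launcher, "->", find_executables(launcher))
    print("first mpirun on PATH:", shutil.which("mpirun"))
    # Import ngsolve before this call to see which libmpi it pulled in.
    print("MPI libraries loaded:", loaded_mpi_libraries())
```

If more than one launcher shows up, or the loaded libmpi does not live next to the mpirun that started the job, that would be consistent with the "wrong MPI" situation described above.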

gcdiwan
Topic Author
Junior Member

5 years 6 months ago #2413 by gcdiwan

Replied by gcdiwan on topic SIGSEGV fault for large problems

lkogler wrote: This does not look like a memory issue.

Sorry for asking, but are you properly distributing the mesh in the beginning of your script?

If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.

Ok - how do you distribute the mesh properly when it's generated by an external tool such as gmsh?

Here is my complete code (well, almost: everything except the script that writes the gmsh file):

Code:

#!/usr/bin/env python
# coding: utf-8
#
from ngsolve import *
from netgen.csg import *
import numpy as np
import sys
import math
import time
import csv
import os
from GMRes_v2 import GMResv2
#
import subprocess
import multiprocessing
#
from netgen.read_gmsh import ReadGmsh
# initialise mpi
comm = mpi_world
rank = comm.rank
nproc= comm.size
PI = np.pi;

# ************************************
# problem data:
freq = float(sys.argv[1])
polOrder = int(sys.argv[2])
elmperlam = int(sys.argv[3])


# ************************************
# geometrical params:
x0 = -0.011817;
y0 = -38.122429; 
z0 = -0.004375;
# ************************************
cspeed = 343e3 # in mm/s
waveno = 2.0*PI*freq / cspeed
wavelength = 2.0*PI/waveno
helem = wavelength / elmperlam
dpml = 2.0*wavelength
radcomp = 27.5 + 4.0*wavelength # radius of sensor plus 4 wavelengths
Rext = radcomp
rpml = radcomp - dpml

# ************************************
meshfilename = '../../meshes/model.msh'
# import the Gmsh file to a Netgen mesh object
mesh = ReadGmsh(meshfilename)
mesh = Mesh(mesh)
print('mesh1 ne: ', mesh.ne)
mesh.Refine()
mesh.Refine()
print('mesh2 ne: ', mesh.ne)

if (rank==0):
    print(mesh.GetBoundaries());
    print ("num vol elements:", mesh.GetNE(VOL))
    print ("num bnd elements:", mesh.GetNE(BND))
    print('add pml..')
mesh.SetPML(pml.Radial(origin=(x0,y0,z0), rad=rpml, alpha=1j), definedon="air")
ubar = exp (1J*waveno*x)
fes = H1(mesh, complex=True, order=polOrder, dirichlet="sensor_srf")
if (rank==0):
    print('ndof = ', fes.ndof)
u = fes.TrialFunction()
v = fes.TestFunction()
print("rank "+str(rank)+" has "+str(fes.ndof)+" of "+str(fes.ndofglobal)+" dofs!")
mesh.GetMaterials()

start = time.time()
gfu = GridFunction (fes)
gfu.Set (ubar, definedon='sensor_srf')
a = BilinearForm (fes, symmetric=True)
a += SymbolicBFI (grad(u)*grad(v) )
a += SymbolicBFI (-waveno*waveno*u*v)
f = LinearForm (fes)
from datetime import datetime
with TaskManager():
    # create threads and assemble
    print('cpus: ', multiprocessing.cpu_count() )
    a.Assemble()
    f.Assemble()
    res = gfu.vec.CreateVector()
    res.data = f.vec - a.mat * gfu.vec
    end = time.time()
    if (rank==0):
        print('tassm: ', end - start)
    start = time.time()
    print("solve started: ", datetime.now().strftime("%H:%M:%S") )
    gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="mumps") * res
    end = time.time()
    print("solve ended: ", datetime.now().strftime("%H:%M:%S") )
    if (rank==0):
        print('tsolve: ', end - start)

gcdiwan
Topic Author
Junior Member

5 years 6 months ago #2415 by gcdiwan

Replied by gcdiwan on topic SIGSEGV fault for large problems

lkogler wrote: If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.

I am not sure if that is the case as I load the ngsolve parallel build specifically by calling:

Code:

module load apps/ngsolve_mpi

There is also a serial ngsolve build available, but I don't think that's the one being invoked.
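
A quick way to confirm which build Python picks up is to print where the `ngsolve` package would be imported from. This is a minimal sketch using only the standard library; `module_origin` is a hypothetical helper name, and the check only shows the install location, not whether that build was compiled with MPI.

```python
import importlib.util
import sys

def module_origin(name):
    """Return the file a module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

if __name__ == "__main__":
    print("python interpreter:", sys.executable)
    print("ngsolve imported from:", module_origin("ngsolve"))
```

Running this through the same launcher and job script as the real computation should show whether the path belongs to the parallel module or to the serial build.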

lkogler
Premium Member

5 years 6 months ago #2416 by lkogler

Replied by lkogler on topic SIGSEGV fault for large problems

Something like this:

Code:

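from ngsolve import *
import netgen.meshing
from netgen.read_gmsh import ReadGmsh

if mpi_world.rank == 0:
    # only the master rank reads the gmsh file
    ngmesh = ReadGmsh(meshfilename)
    if mpi_world.size > 1:
        ngmesh.Distribute(mpi_world)
else:
    # all other ranks receive their part of the mesh
    ngmesh = netgen.meshing.Mesh.Receive(mpi_world)
mesh = Mesh(ngmesh)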

I have not personally tested it with a mesh loaded from gmsh, but it should work.

gcdiwan
Topic Author
Junior Member

5 years 6 months ago #2418 by gcdiwan

Replied by gcdiwan on topic SIGSEGV fault for large problems

I still encounter the SIGSEGV fault with MUMPS despite distributing the mesh. For completeness, I also tried the mpi_poisson.py script in master/py_tutorials/mpi/ with MUMPS (both with and without the preconditioning):

Code:

u.vec.data = a.mat.Inverse(V.FreeDofs(), inverse="mumps") * f.vec  # use MUMPS parallel inverse

and it still fails with the same error. This suggests that something is wrong with MUMPS in the parallel build. However, mpi_poisson.py does work with sparsecholesky.

lkogler
Premium Member

5 years 6 months ago - 5 years 6 months ago #2435 by lkogler

Replied by lkogler on topic SIGSEGV fault for large problems

When you use "sparsecholesky" with a parallel matrix, NGSolve silently falls back to a different inverse type that works with parallel matrices (I believe "masterinverse"). It does not tell you about this (not very pretty, I know), which is why mpi_poisson.py works.

Errors like this:

[node16:71446] *** An error occurred in MPI_Comm_rank

usually indicate a problem with the installation, or with how the job is started.

Are you using the MUMPS built with NGSolve or a separate MUMPS install? We have had issues with MUMPS 5.1 for larger problems. Upgrading to 5.2 resolved those.

If you use a separate MUMPS install, you have to make sure that MUMPS and NGSolve have been built with the same MPI libraries.
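
To compare the MPI libraries that two builds link against, one option is to run `ldd` on both shared objects and compare the resolved libmpi paths. The sketch below assumes Linux, that both MUMPS and NGSolve are available as shared libraries (a statically linked MUMPS will not show up this way), and that the MPI library file name contains "libmpi". The function names are made up for illustration, and you have to supply the library paths of your own installation.

```python
import subprocess

def mpi_dependencies(ldd_output):
    """Extract {library name: resolved path} for MPI libraries from `ldd` output."""
    deps = {}
    for line in ldd_output.splitlines():
        name, arrow, rest = line.strip().partition(" => ")
        if arrow and "libmpi" in name:
            # `rest` looks like "/path/libmpi.so.40 (0x00007f...)" or "not found"
            deps[name] = rest.rsplit(" (", 1)[0]
    return deps

def ldd(path):
    """Run `ldd` on a shared library and return its text output (Linux only)."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True, check=True)
    return result.stdout

def same_mpi(lib_a, lib_b):
    """True if both libraries resolve to the same, non-empty set of MPI libraries."""
    deps_a = mpi_dependencies(ldd(lib_a))
    deps_b = mpi_dependencies(ldd(lib_b))
    return bool(deps_a) and deps_a == deps_b
```

Calling `same_mpi` with the path of the NGSolve shared library and the path of the MUMPS shared library returns False when the two resolve to different MPI libraries, or when neither links against MPI dynamically.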

Last edit: 5 years 6 months ago by lkogler.
