SIGSEGV fault for large problems

2 years 4 months ago #2412 by lkogler
This does not look like a memory issue.

Sorry for asking, but are you properly distributing the mesh in the beginning of your script?

If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.
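One quick way to look for the "multiple MPI installations" situation described above is to list every MPI launcher visible on the search path. The following is only a minimal sketch using the Python standard library; the helper name `find_mpi_launchers` is made up for illustration, and it assumes the launchers are called `mpirun` or `mpiexec`.

```python
import os


def find_mpi_launchers(path_env, names=("mpirun", "mpiexec")):
    """Return all executables named like an MPI launcher on a PATH string, in search order."""
    found = []
    for directory in path_env.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        for name in names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                found.append(candidate)
    return found


if __name__ == "__main__":
    # The first entry printed is the launcher the shell would pick up.
    for launcher in find_mpi_launchers(os.environ.get("PATH", "")):
        print(launcher)
```

If more than one directory shows up, the launcher that starts the job may not belong to the MPI that NGSolve was built against.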

2 years 4 months ago #2413 by gcdiwan

lkogler wrote: This does not look like a memory issue.

Sorry for asking, but are you properly distributing the mesh in the beginning of your script?

If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.


Ok - how do you distribute the mesh properly when it's generated by an external tool such as gmsh?

Here is my complete code (well, almost: everything except the script that writes the gmsh file):
Code:
#!/usr/bin/env python
# coding: utf-8
#
from ngsolve import *
from netgen.csg import *
import numpy as np
import sys
import math
import time
import csv
import os
from GMRes_v2 import GMResv2
#
import subprocess
import multiprocessing
#
from netgen.read_gmsh import ReadGmsh

# initialise mpi
comm = mpi_world
rank = comm.rank
nproc = comm.size
PI = np.pi
# ************************************
# problem data:
freq = float(sys.argv[1])
polOrder = int(sys.argv[2])
elmperlam = int(sys.argv[3])
# ************************************
# geometrical params:
x0 = -0.011817
y0 = -38.122429
z0 = -0.004375
# ************************************
cspeed = 343e3  # in mm/s
waveno = 2.0*PI*freq / cspeed
wavelength = 2.0*PI/waveno
helem = wavelength / elmperlam
dpml = 2.0*wavelength
radcomp = 27.5 + 4.0*wavelength  # radius of sensor plus 4 wavelengths
Rext = radcomp
rpml = radcomp - dpml
# ************************************
meshfilename = '../../meshes/model.msh'
# import the Gmsh file to a Netgen mesh object
mesh = ReadGmsh(meshfilename)
mesh = Mesh(mesh)
print('mesh1 ne: ', mesh.ne)
mesh.Refine()
mesh.Refine()
print('mesh2 ne: ', mesh.ne)
if (rank==0):
    print(mesh.GetBoundaries())
    print("num vol elements:", mesh.GetNE(VOL))
    print("num bnd elements:", mesh.GetNE(BND))
print('add pml..')
mesh.SetPML(pml.Radial(origin=(x0,y0,z0), rad=rpml, alpha=1j), definedon="air")
ubar = exp(1J*waveno*x)
fes = H1(mesh, complex=True, order=polOrder, dirichlet="sensor_srf")
if (rank==0):
    print('ndof = ', fes.ndof)
u = fes.TrialFunction()
v = fes.TestFunction()
print("rank "+str(rank)+" has "+str(fes.ndof)+" of "+str(fes.ndofglobal)+" dofs!")
mesh.GetMaterials()
start = time.time()
gfu = GridFunction(fes)
gfu.Set(ubar, definedon='sensor_srf')
a = BilinearForm(fes, symmetric=True)
a += SymbolicBFI(grad(u)*grad(v))
a += SymbolicBFI(-waveno*waveno*u*v)
f = LinearForm(fes)
from datetime import datetime
with TaskManager():
    # create threads and assemble
    print('cpus: ', multiprocessing.cpu_count())
    a.Assemble()
    f.Assemble()
    res = gfu.vec.CreateVector()
    res.data = f.vec - a.mat * gfu.vec
end = time.time()
if (rank==0):
    print('tassm: ', end - start)
start = time.time()
print("solve started: ", datetime.now().strftime("%H:%M:%S"))
gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="mumps") * res
end = time.time()
print("solve ended: ", datetime.now().strftime("%H:%M:%S"))
if (rank==0):
    print('tsolve: ', end - start)

2 years 4 months ago #2415 by gcdiwan

lkogler wrote: If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.


I am not sure if that is the case as I load the ngsolve parallel build specifically by calling:
Code:
module load apps/ngsolve_mpi
There is also a serial ngsolve build available, but I don't think that's being invoked.
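Whether the parallel or the serial build is actually picked up can be checked from inside Python rather than assumed from the `module load` line. A minimal sketch, assuming only the standard library; `module_origin` is a hypothetical helper name, and it reports where a package would be imported from without importing it:

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Run this inside the job script, after the module load, so that it sees
    # the same environment as the MPI job. The printed path shows which
    # ngsolve installation the interpreter resolves to.
    print("ngsolve resolves to:", module_origin("ngsolve"))
```

If the printed path points into the serial installation, the environment module is not taking effect for the job.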

2 years 4 months ago #2416 by lkogler
Something like this:
Code:
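import netgen.meshing
from netgen.read_gmsh import ReadGmsh

if mpi_world.rank == 0:
    # only the master rank reads the Gmsh file
    ngmesh = ReadGmsh(meshfilename)
    if mpi_world.size > 1:
        # partition the mesh and send the pieces to the other ranks
        ngmesh.Distribute(mpi_world)
else:
    # all other ranks receive their part of the mesh
    ngmesh = netgen.meshing.Mesh.Receive(mpi_world)
mesh = Mesh(ngmesh)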

I have not personally tested it with a mesh loaded from gmsh, but it should work.

2 years 4 months ago #2418 by gcdiwan
I still encounter the SIGSEGV fault with mumps despite distributing the mesh. For the sake of completeness, I also tried the mpi_poisson.py script in master/py_tutorials/mpi/ with mumps (both with and without the preconditioning):
Code:
u.vec.data = a.mat.Inverse(V.FreeDofs(), inverse="mumps") * f.vec # use MUMPS parallel inverse
and it still fails with the same error. This probably tells me something's wrong with mumps in the parallel build. mpi_poisson.py works with sparsecholesky, however.

2 years 3 months ago - 2 years 3 months ago #2435 by lkogler
When you use "sparsecholesky" with a parallel matrix, NGSolve silently falls back to a different inverse type that works with parallel matrices (I believe "masterinverse"). That is not very pretty, I know, but it is why mpi_poisson.py works.

Errors like this:

[node16:71446] *** An error occurred in MPI_Comm_rank

usually indicate a problem with the installation, or with how the job is started.

Are you using the MUMPS built with NGSolve or a separate MUMPS install? We have had issues with MUMPS 5.1 for larger problems. Upgrading to 5.2 resolved those.

If you use a separate MUMPS install, you have to make sure that MUMPS and NGSolve have been built with the same MPI libraries.
Last edit: 2 years 3 months ago by lkogler.
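
On the point about MUMPS and NGSolve needing the same MPI libraries: one way to compare them on Linux is to run `ldd` on each shared library and look at which MPI library each one resolves to. The sketch below only parses the text that `ldd` prints; `mpi_libs_from_ldd` is a hypothetical helper, and it assumes the MPI libraries have names starting with `libmpi`.

```python
import re


def mpi_libs_from_ldd(ldd_output):
    """Extract the resolved paths of MPI shared libraries from the text printed by `ldd`."""
    libs = set()
    for line in ldd_output.splitlines():
        # A typical line looks like:
        #   libmpi.so.40 => /some/prefix/lib/libmpi.so.40 (0x00007f...)
        match = re.match(r"\s*(libmpi\S*)\s+=>\s+(\S+)", line)
        # Unresolved entries read "libmpi.so => not found"; skip those.
        if match and match.group(2) != "not":
            libs.add(match.group(2))
    return libs
```

Feeding it the `ldd` output for the NGSolve shared library and for the MUMPS library (their locations depend on the installation) gives two sets of paths. If the sets differ, the two were linked against different MPI installations.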
