Netgen GUI fails to start when libngsolve loaded and MPI=ON (ALT Linux)

6 years 5 months ago - 6 years 5 months ago #551 by nickel
Hi,

I recently ran into a problem when trying to run the Netgen GUI built with Open MPI (GitHub v6.2.1804):

[host-68.localdomain:12429] *** An error occurred in MPI_comm_size
[host-68.localdomain:12429] *** on communicator MPI_COMM_WORLD
[host-68.localdomain:12429] *** MPI_ERR_COMM: invalid communicator
[host-68.localdomain:12429] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
...
Full output:
Code:
[user@host-68 ~]$ /usr/lib64/openmpi-compat/bin/mpirun -np 4 netgen
NETGEN-6.2-dev
Developed by Joachim Schoeberl at
2010-xxxx Vienna University of Technology
2006-2010 RWTH Aachen University
1996-2006 Johannes Kepler University Linz
Including OpenCascade geometry kernel
Running MPI - parallel using 4 processors
MPI-version = 2.1
optfile ./ng.opt does not exist - using default values
togl-version : 2
OCC module loaded
loading ngsolve library
NGSolve-........-..-..
Using Lapack
[host-68.localdomain:12429] *** An error occurred in MPI_comm_size
[host-68.localdomain:12429] *** on communicator MPI_COMM_WORLD
[host-68.localdomain:12429] *** MPI_ERR_COMM: invalid communicator
[host-68.localdomain:12429] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 12429 on
node host-68.localdomain exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[user@host-68 ~]$



However, if libngsolve is not loaded, netgen starts fine (the GUI works).

Code:
[user@host-68 ~]$ /usr/lib64/openmpi-compat/bin/mpirun -np 4 netgen
NETGEN-6.2-dev
Developed by Joachim Schoeberl at
2010-xxxx Vienna University of Technology
2006-2010 RWTH Aachen University
1996-2006 Johannes Kepler University Linz
Including OpenCascade geometry kernel
Running MPI - parallel using 4 processors
MPI-version = 2.1
optfile ./ng.opt does not exist - using default values
togl-version : 2
OCC module loaded
loading ngsolve library
cannot load ngsolve
error: couldn't load file "libngsolve.so": libngsolve.so: cannot open shared object file: No such file or directory
[user@host-68 ~]$


Any hints toward a solution?
Last edit: 6 years 5 months ago by nickel.
6 years 4 months ago #592 by ddrake
Hi,

This sounds like libgomp needs to be added to the preload path. Maybe this will help. First, locate the library:

find / -name libgomp.so.1 2>&1 | grep -v "Permission denied"

Then, in the directory where the netgen binary is installed, look for the small text file ngspy.

Edit that file, inserting the path to libgomp into the preload path so it looks something like this:

LD_PRELOAD=$LD_PRELOAD:/act/openmpi-2.0/gcc-7.2.0/lib/libmpi.so:/opt/intel/mkl/lib/intel64/libmkl_core.so:/opt/intel/mkl/lib/intel64/libmkl_gnu_thread.so:/opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so:/opt/intel/mkl/lib/intel64/libmkl_blacs_openmpi_lp64.so:/usr/lib64/libgomp.so.1 /home/ddrake/common/install/bin/python3 $*
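If it helps, here is a rough sketch of building that colon-separated list one entry at a time instead of editing the long line by hand. The function name append_preload is made up for illustration and is not part of Netgen or ngspy. The path is just the one from my system, so substitute whatever find reports on yours.

```shell
# Hypothetical helper (not part of Netgen/ngspy): append one library path
# to a colon-separated preload list, without adding a leading ':' when
# the list starts out empty.
append_preload() {
    current="$1"   # existing list, possibly empty
    lib="$2"       # library path to append
    if [ -z "$current" ]; then
        printf '%s\n' "$lib"
    else
        printf '%s:%s\n' "$current" "$lib"
    fi
}

# Example path taken from the line above; adjust to your own system.
preload=$(append_preload "" "/usr/lib64/libgomp.so.1")
echo "$preload"
```

The resulting string would then go after LD_PRELOAD= in ngspy. Whether libgomp alone is enough on ALT Linux is a guess on my part.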

Best,

Dow