Building NGSolve with HYPRE support

4 years 9 months ago - 4 years 9 months ago #2335 by JanWesterdiep
Hello!

I want to experiment with some different preconditioners. The documentation mentions the "hypre" and "hypre_ams" preconditioners. I found in a semi-recent GitHub commit how to configure cmake with hypre (`cmake -DUSE_HYPRE ../ngsolve-src`).

Pulling the most recent NGSolve version from GitHub (resulting in version NGSolve-6.2.2001-6-g8bbe2629), this command runs fine and downloads hypre, but upon running the subsequent `make`, I get the following error:
Code:
[ 62%] Building CXX object comp/CMakeFiles/ngcomp.dir/bddc.cpp.o
ngsolve/ngsolve-src/comp/bddc.cpp:317:28: error: reference to 'MPI_Op' is ambiguous
AllReduceDofData (weight, MPI_SUM, fes->GetParallelDofs());
^
/usr/local/include/mpi.h:1130:40: note: expanded from macro 'MPI_SUM'
#define MPI_SUM OMPI_PREDEFINED_GLOBAL(MPI_Op, ompi_mpi_op_sum)
^
/usr/local/include/mpi.h:406:27: note: candidate found by name lookup is 'MPI_Op'
typedef struct ompi_op_t *MPI_Op;
^
/Applications/Netgen.app/Contents/Resources/include/core/mpi_wrapper.hpp:265:15: note: candidate found by name lookup is 'ngcore::MPI_Op'
typedef int MPI_Op;
^
ngsolve/ngsolve-src/comp/bddc.cpp:317:28: error: static_cast from 'void *' to 'ngcore::MPI_Op' (aka 'int') is not allowed
AllReduceDofData (weight, MPI_SUM, fes->GetParallelDofs());
^~~~~~~
/usr/local/include/mpi.h:1130:17: note: expanded from macro 'MPI_SUM'
#define MPI_SUM OMPI_PREDEFINED_GLOBAL(MPI_Op, ompi_mpi_op_sum)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/include/mpi.h:381:47: note: expanded from macro 'OMPI_PREDEFINED_GLOBAL'
#define OMPI_PREDEFINED_GLOBAL(type, global) (static_cast<type> (static_cast<void *> (&(global))))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 errors generated.
Any idea how to get around this compile error?
Last edit: 4 years 9 months ago by JanWesterdiep.
4 years 9 months ago #2336 by lkogler
It looks like you did not turn on MPI (-DUSE_MPI=ON). This is needed by hypre.

We should throw an error when MPI is turned off and hypre is turned on.
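As a rough illustration of the check suggested above, the following shell sketch refuses a HYPRE build without MPI and otherwise prints the configure command. The function name `ngsolve_cmake_flags` is made up for this example, and writing the HYPRE switch as `-DUSE_HYPRE=ON` is an assumption by analogy with `-DUSE_MPI=ON`; it is not how NGSolve's own CMake scripts implement such a check.

```shell
#!/bin/sh
# Hypothetical helper (not part of NGSolve): reject HYPRE without MPI,
# otherwise print the two cmake switches discussed in this thread.
ngsolve_cmake_flags() {
    use_mpi="$1"    # ON or OFF
    use_hypre="$2"  # ON or OFF
    if [ "$use_hypre" = "ON" ] && [ "$use_mpi" != "ON" ]; then
        echo "error: USE_HYPRE=ON requires USE_MPI=ON" >&2
        return 1
    fi
    # Assumption: -DUSE_HYPRE takes ON/OFF in the same way as -DUSE_MPI.
    echo "-DUSE_MPI=$use_mpi -DUSE_HYPRE=$use_hypre"
}

# Print the configure command for an MPI + HYPRE build;
# it would be run from the build directory next to ngsolve-src.
echo "cmake $(ngsolve_cmake_flags ON ON) ../ngsolve-src"
```

Calling `ngsolve_cmake_flags OFF ON` prints the error message to stderr and returns a non-zero status, so a wrapper script can stop before `make` ever reaches the ambiguous `MPI_Op` error.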
4 years 9 months ago #2343 by JanWesterdiep
Hey! Yes, perfect, that worked :-)

I installed NGSolve from source with HYPRE and MPI support on two machines, Ubuntu 18 and macOS 10.14.
Whenever I run *anything*, I get the following error:
Code:
$ netgen navierstokes.py
NETGEN-6.2-dev
Developed by Joachim Schoeberl at
2010-xxxx Vienna University of Technology
2006-2010 RWTH Aachen University
1996-2006 Johannes Kepler University Linz
Including MPI version 3.1
Problem in Tk_Init:
result = no display name and no $DISPLAY environment variable
optfile ./ng.opt does not exist - using default values
togl-version : 2
no OpenGL
loading ngsolve library
NGSolve-6.2.2001-11-g0929bd80
Using Lapack
Including sparse direct solver UMFPACK
Running parallel using 1 thread(s)
(should) load python file 'navierstokes.py'
loading ngsolve library
NGSolve-6.2.2001-11-g0929bd80
Using Lapack
Including sparse direct solver UMFPACK
Running parallel using 1 thread(s)
Caught SIGSEGV: segmentation fault

I realize this is probably very difficult to debug, also for you, so I have a more general question: is there any way of getting a stack trace at this point? Running this through valgrind produces another very mysterious error:
Code:
$ valgrind !!
valgrind netgen navierstokes.py
==16773== Memcheck, a memory error detector
==16773== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==16773== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==16773== Command: netgen navierstokes.py
==16773==
vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0x7D 0x28 0xEF 0xC0 0x83 0xFE 0x8 0xB8
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==16773== valgrind: Unrecognised instruction at address 0x5f81400.
==16773==    at 0x5F81400: __mutex_base (std_mutex.h:68)
==16773==    by 0x5F81400: mutex (std_mutex.h:94)
==16773==    by 0x5F81400: netgen::BlockAllocator::BlockAllocator(unsigned int, unsigned int) (optmem.cpp:20)
==16773==    by 0x5D68679: __static_initialization_and_destruction_0 (localh.cpp:27)
==16773==    by 0x5D68679: _GLOBAL__sub_I_localh.cpp (localh.cpp:800)
==16773==    by 0x4010732: call_init (dl-init.c:72)
==16773==    by 0x4010732: _dl_init (dl-init.c:119)
==16773==    by 0x40010C9: ??? (in /lib/x86_64-linux-gnu/ld-2.27.so)
==16773==    by 0x1: ???
==16773==    by 0x1FFF00047E: ???
==16773==    by 0x1FFF000485: ???
==16773== Your program just tried to execute an instruction that Valgrind
==16773== did not recognise. There are two possible reasons for this.
==16773== 1. Your program has a bug and erroneously jumped to a non-code
==16773== location. If you are running Memcheck and you just saw a
==16773== warning about a bad jump, it's probably your program's fault.
==16773== 2. The instruction is legitimate but Valgrind doesn't handle it,
==16773== i.e. it's Valgrind's fault. If you think this is the case or
==16773== you are not sure, please let us know and we'll try to fix it.
==16773== Either way, Valgrind will now raise a SIGILL signal which will
==16773== probably kill your program.
Caught SIGILL: illegal instruction
==16773==
==16773== HEAP SUMMARY:
==16773== in use at exit: 2,916 bytes in 62 blocks
==16773== total heap usage: 87 allocs, 25 frees, 881,786 bytes allocated
==16773==
==16773== LEAK SUMMARY:
==16773== definitely lost: 0 bytes in 0 blocks
==16773== indirectly lost: 0 bytes in 0 blocks
==16773== possibly lost: 160 bytes in 2 blocks
==16773== still reachable: 2,756 bytes in 60 blocks
==16773== suppressed: 0 bytes in 0 blocks
==16773== Rerun with --leak-check=full to see details of leaked memory
==16773==
==16773== For counts of detected and suppressed errors, rerun with: -v
==16773== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

In the meantime, I will try to check out the NGSolve repo at the commit where HYPRE support was first introduced, and see if that fixes any of my problems.

Thank you for your continued support :-)
4 years 9 months ago #2344 by lkogler
This valgrind error happens when valgrind does not know some instruction. That happens occasionally on newer hardware and might not be related to the segfault you are getting. Could you try gdb instead?

Also, could you try "ngspy navierstokes.py"? It looks like there is some error with the GUI.

Best,
Lukas
4 years 9 months ago #2345 by JanWesterdiep
Hey Lukas, yeah I figured but am more familiar with valgrind than GDB. Thanks for the proposal, I will try to learn how to use it.

`ngspy navierstokes.py` runs :-)

Now the interesting stuff begins: when I take the preconditioner example from ngsolve.org/docu/latest/i-tutorials/unit.../preconditioner.html and run it with some builtin preconditioner like "local" or "h1amg", it runs fine. When I run it with "hypre", I get the following error:
Code:
$ ngspy precond_test.py
Generate Mesh from spline geometry
Boundary mesh done, np = 8
CalcLocalH: 8 Points 0 Elements 0 Surface Elements
Meshing domain 1 / 1
load internal triangle rules
Surface meshing done
Edgeswapping, topological
Smoothing
Split improve
Combine improve
Smoothing
Edgeswapping, metric
Smoothing
Split improve
Combine improve
Smoothing
Edgeswapping, metric
Smoothing
Split improve
Combine improve
Smoothing
Update mesh topology
Update clusters
assemble VOL element 6/6
assemble VOL element 6/6
Setup Hypre preconditioner
Traceback (most recent call last):
  File "precond_test.py", line 61, in <module>
    print(SolveProblem(levels=5, precond="hypre"))
  File "precond_test.py", line 41, in SolveProblem
    a.Assemble()
netgen.libngpy._meshing.NgException: std::bad_cast in Assemble BilinearForm 'biform_from_py'

Unfortunately, the exception seems to be caught by Python or something else before the process exits, because running it through GDB produces no stack trace. Any clues?
4 years 9 months ago - 4 years 9 months ago #2346 by lkogler
In the shell, run:
"gdb python3"
In gdb, run:
"set breakpoint pending on"
"break RangeException"
"run navierstokes.py"

I cannot reproduce this error with the newest Netgen/NGSolve version.
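The gdb steps above can also be run non-interactively. This is only a sketch: the command file name `ngs_debug.gdb` is made up, and the `catch throw` line is an addition not mentioned in the thread. It stops at the first C++ exception thrown (which would include a `std::bad_cast`), so it may also stop at exceptions that are handled harmlessly.

```shell
#!/bin/sh
# Script to debug; navierstokes.py is the file named in the thread.
script="${1:-navierstokes.py}"

# Write the gdb commands to a file (file name is hypothetical).
cat > ngs_debug.gdb <<'EOF'
set breakpoint pending on
break RangeException
catch throw
run
backtrace
EOF

# Run gdb in batch mode if it is installed and the script exists;
# otherwise only print the command that would be used.
if command -v gdb >/dev/null 2>&1 && [ -f "$script" ]; then
    gdb -batch -x ngs_debug.gdb --args python3 "$script"
else
    echo "would run: gdb -batch -x ngs_debug.gdb --args python3 $script"
fi
```

In batch mode gdb prints the backtrace at the first stop and then exits, which is convenient for pasting the trace into a forum post.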

Are you, by any chance, running on a computer with AVX512? And which compiler are you using? We recently ran into issues with gcc 9.2 and AVX512, but now this combination should throw an error.
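One way to answer both questions from a terminal is sketched below. It assumes that Linux lists CPU flags in `/proc/cpuinfo` and that an Intel Mac reports them under the `machdep.cpu.leaf7_features` sysctl key; the function name `has_avx512` is made up for this example, and `c++` may not be the compiler that was actually used for the build.

```shell
#!/bin/sh
# Report whether the CPU advertises AVX-512.
# Linux: look in /proc/cpuinfo; otherwise try the macOS sysctl key (assumption).
has_avx512() {
    if [ -r /proc/cpuinfo ]; then
        grep -qi 'avx512' /proc/cpuinfo
    else
        sysctl -n machdep.cpu.leaf7_features 2>/dev/null | grep -qi 'avx512'
    fi
}

if has_avx512; then
    echo "AVX512: yes"
else
    echo "AVX512: no"
fi

# Print the version line of the default C++ compiler, if one is on PATH.
if command -v c++ >/dev/null 2>&1; then
    c++ --version | head -n 1
fi
```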
Last edit: 4 years 9 months ago by lkogler.