Building NGSolve with HYPRE support

1 year 8 months ago - 1 year 8 months ago #2335 by JanWesterdiep
Hello!

I want to experiment with some different preconditioners. The documentation mentions the "hypre" and "hypre_ams" preconditioners somewhere, and a semi-recent GitHub commit showed me how to configure CMake with hypre (`cmake -DUSE_HYPRE ../ngsolve-src`).

Pulling the most recent NGSolve version from GitHub (resulting in version NGSolve-6.2.2001-6-g8bbe2629), this command runs fine and downloads hypre, but upon running the subsequent `make`, I get the following error:
[ 62%] Building CXX object comp/CMakeFiles/ngcomp.dir/bddc.cpp.o
ngsolve/ngsolve-src/comp/bddc.cpp:317:28: error: reference to 'MPI_Op' is ambiguous
        AllReduceDofData (weight, MPI_SUM, fes->GetParallelDofs());
                                  ^
/usr/local/include/mpi.h:1130:40: note: expanded from macro 'MPI_SUM'
#define MPI_SUM OMPI_PREDEFINED_GLOBAL(MPI_Op, ompi_mpi_op_sum)
                                       ^
/usr/local/include/mpi.h:406:27: note: candidate found by name lookup is 'MPI_Op'
typedef struct ompi_op_t *MPI_Op;
                          ^
/Applications/Netgen.app/Contents/Resources/include/core/mpi_wrapper.hpp:265:15: note: candidate found by name lookup is 'ngcore::MPI_Op'
  typedef int MPI_Op;
              ^
ngsolve/ngsolve-src/comp/bddc.cpp:317:28: error: static_cast from 'void *' to 'ngcore::MPI_Op' (aka 'int') is not allowed
        AllReduceDofData (weight, MPI_SUM, fes->GetParallelDofs());
                                  ^~~~~~~
/usr/local/include/mpi.h:1130:17: note: expanded from macro 'MPI_SUM'
#define MPI_SUM OMPI_PREDEFINED_GLOBAL(MPI_Op, ompi_mpi_op_sum)
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/include/mpi.h:381:47: note: expanded from macro 'OMPI_PREDEFINED_GLOBAL'
#define OMPI_PREDEFINED_GLOBAL(type, global) (static_cast<type> (static_cast<void *> (&(global))))
                                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 errors generated.
Any idea how to get around this compile error?

1 year 8 months ago #2336 by lkogler
It looks like you did not turn on MPI (-DUSE_MPI=ON). This is needed by hypre.

We should throw an error when MPI is turned off and hypre is turned on.

1 year 8 months ago #2343 by JanWesterdiep
Hey! Yes, perfect, that worked :-)

I installed NGSolve from source with HYPRE and MPI support on two machines, Ubuntu 18 and macOS 10.14.
Whenever I run *anything*, I get the following error:
$ netgen navierstokes.py
NETGEN-6.2-dev
Developed by Joachim Schoeberl at
2010-xxxx Vienna University of Technology
2006-2010 RWTH Aachen University
1996-2006 Johannes Kepler University Linz
Including MPI version 3.1
Problem in Tk_Init:
result = no display name and no $DISPLAY environment variable
optfile ./ng.opt does not exist - using default values
togl-version : 2
no OpenGL
loading ngsolve library
NGSolve-6.2.2001-11-g0929bd80
Using Lapack
Including sparse direct solver UMFPACK
Running parallel using 1 thread(s)
(should) load python file 'navierstokes.py'
loading ngsolve library
NGSolve-6.2.2001-11-g0929bd80
Using Lapack
Including sparse direct solver UMFPACK
Running parallel using 1 thread(s)
Caught SIGSEGV: segmentation fault

I realize this is probably very difficult to debug, even for you, so I have a more general question: is there any way of getting a stack trace at this point? Running this through valgrind produces another very mysterious error:
$ valgrind !!
valgrind netgen navierstokes.py
==16773== Memcheck, a memory error detector
==16773== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==16773== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==16773== Command: netgen navierstokes.py
==16773==
vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0x7D 0x28 0xEF 0xC0 0x83 0xFE 0x8 0xB8
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==16773== valgrind: Unrecognised instruction at address 0x5f81400.
==16773==    at 0x5F81400: __mutex_base (std_mutex.h:68)
==16773==    by 0x5F81400: mutex (std_mutex.h:94)
==16773==    by 0x5F81400: netgen::BlockAllocator::BlockAllocator(unsigned int, unsigned int) (optmem.cpp:20)
==16773==    by 0x5D68679: __static_initialization_and_destruction_0 (localh.cpp:27)
==16773==    by 0x5D68679: _GLOBAL__sub_I_localh.cpp (localh.cpp:800)
==16773==    by 0x4010732: call_init (dl-init.c:72)
==16773==    by 0x4010732: _dl_init (dl-init.c:119)
==16773==    by 0x40010C9: ??? (in /lib/x86_64-linux-gnu/ld-2.27.so)
==16773==    by 0x1: ???
==16773==    by 0x1FFF00047E: ???
==16773==    by 0x1FFF000485: ???
==16773== Your program just tried to execute an instruction that Valgrind
==16773== did not recognise.  There are two possible reasons for this.
==16773== 1. Your program has a bug and erroneously jumped to a non-code
==16773==    location.  If you are running Memcheck and you just saw a
==16773==    warning about a bad jump, it's probably your program's fault.
==16773== 2. The instruction is legitimate but Valgrind doesn't handle it,
==16773==    i.e. it's Valgrind's fault.  If you think this is the case or
==16773==    you are not sure, please let us know and we'll try to fix it.
==16773== Either way, Valgrind will now raise a SIGILL signal which will
==16773== probably kill your program.
Caught SIGILL: illegal instruction

==16773==
==16773== HEAP SUMMARY:
==16773==     in use at exit: 2,916 bytes in 62 blocks
==16773==   total heap usage: 87 allocs, 25 frees, 881,786 bytes allocated
==16773==
==16773== LEAK SUMMARY:
==16773==    definitely lost: 0 bytes in 0 blocks
==16773==    indirectly lost: 0 bytes in 0 blocks
==16773==      possibly lost: 160 bytes in 2 blocks
==16773==    still reachable: 2,756 bytes in 60 blocks
==16773==         suppressed: 0 bytes in 0 blocks
==16773== Rerun with --leak-check=full to see details of leaked memory
==16773==
==16773== For counts of detected and suppressed errors, rerun with: -v
==16773== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

In the meantime, I will try to check out the NGSolve repo at the commit where HYPRE support was first introduced, and see if that fixes any of my problems.

Thank you for your continued support :-)

1 year 8 months ago #2344 by lkogler
This valgrind error happens when valgrind does not know some instruction. That happens occasionally on newer hardware and might not be related to the segfault you are getting. Could you try gdb instead?
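
Short of a full gdb session, Python's standard `faulthandler` module can at least report which Python line was executing when the SIGSEGV arrived. The following is a minimal sketch, assuming the script is started by a plain Python interpreter (for example through `ngspy`). The helper name `enable_crash_traceback` and the log file name are illustrative and not part of NGSolve, and the output contains Python frames only, not the C++ stack.

```python
import faulthandler


def enable_crash_traceback(log_path="crash_trace.log"):
    """Write the Python traceback of all threads to log_path on a fatal signal.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS
    and SIGILL. The returned file object must stay open (and referenced)
    for as long as the handler is enabled.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Keep a reference so the file is not closed by garbage collection.
    _crash_log = enable_crash_traceback()
    # ... the rest of the script (hypothetically, the body of
    # navierstokes.py) would follow here.
```

The same handler can be switched on without editing the script by running `python3 -X faulthandler navierstokes.py`, which writes the traceback to stderr instead of a file.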

Also, could you try "ngspy navierstokes.py"? It looks like there is some error with the GUI.

Best,
Lukas

1 year 8 months ago #2345 by JanWesterdiep
Hey Lukas, yeah, I figured, but I am more familiar with valgrind than with GDB. Thanks for the suggestion; I will try to learn how to use it.

`ngspy navierstokes.py` runs :-)

Now the interesting stuff begins: when I take the preconditioner example from ngsolve.org/docu/latest/i-tutorials/unit.../preconditioner.html and run it with some builtin preconditioner like "local" or "h1amg", it runs fine. When I run it with "hypre", I get the following error:
$ ngspy precond_test.py
 Generate Mesh from spline geometry
 Boundary mesh done, np = 8
 CalcLocalH: 8 Points 0 Elements 0 Surface Elements
 Meshing domain 1 / 1
 load internal triangle rules
 Surface meshing done
 Edgeswapping, topological
 Smoothing
 Split improve
 Combine improve
 Smoothing
 Edgeswapping, metric
 Smoothing
 Split improve
 Combine improve
 Smoothing
 Edgeswapping, metric
 Smoothing
 Split improve
 Combine improve
 Smoothing
 Update mesh topology
 Update clusters
assemble VOL element 6/6
assemble VOL element 6/6
Setup Hypre preconditioner
Traceback (most recent call last):
  File "precond_test.py", line 61, in <module>
    print(SolveProblem(levels=5, precond="hypre"))
  File "precond_test.py", line 41, in SolveProblem
    a.Assemble()
netgen.libngpy._meshing.NgException: std::bad_cast
 in Assemble BilinearForm 'biform_from_py'

Unfortunately, the exception seems to be caught by Python or something before the process exits, because running it through GDB produces no stack trace. Any clues?

1 year 8 months ago - 1 year 8 months ago #2346 by lkogler
in the shell, run:
"gdb python3"
in gdb, run:
"set breakpoint pending on"
"break RangeException"
"run navierstokes.py"
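
The same session can be scripted so that it runs non-interactively. The sketch below builds a gdb command line from Python. It swaps the `break RangeException` breakpoint for gdb's generic `catch throw`, on the assumption that the goal is to stop where the `std::bad_cast` is thrown. The function name `gdb_command` is hypothetical. Note that `catch throw` stops at the first C++ exception of any type, so the first backtrace is not necessarily the one of interest.

```python
import shlex
import sys


def gdb_command(script, interpreter=sys.executable):
    """Build a gdb invocation that prints a backtrace at the first C++ throw."""
    return [
        "gdb", "-batch",                    # exit once the commands are done
        "-ex", "set breakpoint pending on",  # allow breakpoints in libraries loaded later
        "-ex", "catch throw",                # stop whenever a C++ exception is thrown
        "-ex", "run",
        "-ex", "backtrace",                  # print the stack at the stop
        "--args", interpreter, script,       # program to debug and its arguments
    ]


if __name__ == "__main__":
    # Print a command that can be pasted into a shell.
    print(shlex.join(gdb_command("precond_test.py")))
```

Building the argument list in one place keeps the gdb commands in a fixed order and makes it easy to pass the list to `subprocess.run` instead of printing it.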

I cannot reproduce this error with the newest Netgen/NGSolve version.

Are you, by any chance, running on a computer with AVX512? And which compiler are you using? We recently ran into issues with gcc 9.2 and AVX512, but now this combination should throw an error.
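
On Linux, one way to answer the AVX512 question is to look at the feature flags in `/proc/cpuinfo`. The sketch below does that. The function name `avx512_flags` is illustrative, and the approach does not cover macOS, which has no `/proc/cpuinfo`. As a side note, the unhandled instruction bytes in the valgrind output above begin with 0x62, which is the EVEX prefix used by AVX-512 encodings, so that output appears consistent with an AVX512-enabled build.

```python
from pathlib import Path


def avx512_flags(cpuinfo_text):
    """Return the sorted AVX-512 feature flags found in /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Only the plain "flags" lines list CPU features, one per core.
        if key.strip() == "flags":
            flags.update(f for f in value.split() if f.startswith("avx512"))
    return sorted(flags)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(avx512_flags(cpuinfo.read_text()) or "no AVX-512 flags reported")
    else:
        print("/proc/cpuinfo is not available on this system")
```

An empty result means the kernel reports no AVX-512 support for this CPU.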

© 2019 Netgen/NGSolve