Install NGSolve without root access

7 years 5 months ago #34 by matthiash

Guosheng Fu wrote: ../comp/libngcomp.so: undefined reference to `_ZNK6netgen8Ngx_Mesh26MultiElementTransformationILi2ELi2EDv4_dEEviiPKT1_mPS3_mS6_m'


This happens if you compile Netgen without AVX support but NGSolve with AVX support. Did you build Netgen separately (i.e. no superbuild) or play with the setting USE_NATIVE_ARCH?

Please attach the following files in your build directory (they contain all your build settings):
./CMakeCache.txt
./ngsolve/CMakeCache.txt
./netgen/CMakeCache.txt
./netgen/netgen/CMakeCache.txt

Best,
Matthias
7 years 5 months ago #35 by Guosheng Fu
Yeah, it was the linking issue.

Initially, I successfully installed ngsolve without MPI in a local folder, ~/netgen/inst/.

Then I tried to turn on MPI and install it in another folder, ~/netgen/inst-mpi/, which caused the issue.
The issue is that at the final stage, the compiler tries to link the old libraries in ~/netgen/inst/ rather than the ones in ~/netgen/inst-mpi/.

So I changed my install directory for the MPI version back to ~/netgen/inst/, and the installation works fine now. Is there a way to explicitly tell the compiler which libraries to link against? I don't understand why it searches the old folder...

Now, I tried to run the tutorial.
After adding "from mpi4py import *" to the Python tutorial file, I can run the code with the command
>> python3 adaptive.py
But it gives me a segmentation fault when I run, say,
>> mpirun -n 4 python3 adaptive.py
7 years 5 months ago #36 by lkogler
If you want to use mpi4py, you have to make sure that mpi4py and ngsolve both use the exact same MPI library; we have had issues with that in the past.

Also, I think you have to import mpi4py BEFORE netgen/ngsolve: on import, ngsolve checks whether MPI has already been initialized and initializes it if not, and I do not know if mpi4py likes it when someone else has already done that.
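A quick way to check both points is a tiny test script, roughly like this (just a sketch; the file name is made up, and Get_library_version needs a reasonably recent MPI/mpi4py):

# check_mpi.py -- mpi4py imported BEFORE netgen/ngsolve
from mpi4py import MPI                              # this import initializes MPI
print("MPI library:", MPI.Get_library_version())    # compare with the MPI used to build NGSolve
from ngsolve import *                               # ngsolve should find MPI already initialized
comm = MPI.COMM_WORLD
print("rank", comm.rank, "of", comm.size)

If "mpirun -np 4 ngspy check_mpi.py" prints four ranks and the MPI library you expect, the basic setup is probably fine.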

For ngsolve to work you do not necessarily need mpi4py (anymore).

Today there was an update to the netgen and ngsolve master branches which featured a bunch of MPI-related bugfixes. Those are probably ESSENTIAL!
There are now also a couple of mpi-tutorial files in "ngsolve/py_tutorials/mpi/".

You need to get the newest version of BOTH ngsolve and netgen.
Keep in mind that when you "git pull" in the ngsolve-directory, it will probably not update netgen yet, so go to "ngsolve/external_dependencies/netgen" and "git pull" there too!


Also, if you run into runtime library issues, use "ngspy" instead of "python3".
ngspy is just a wrapper around python3 which preloads a couple of libraries.
7 years 5 months ago #37 by lkogler
And about the linking issue:
The linker usually takes the first library of a given name that it can find, and if ~/netgen/inst/lib is in your LD_LIBRARY_PATH, it sometimes picks the wrong one.


Do you have environment-modules on your cluster?

If so, you could create one module "netgen" and one module "netgen-mpi" and only load one at any given time in order to properly separate them.
7 years 5 months ago #38 by Guosheng Fu
OK. So I updated the library and finally have the MPI version installed.
Previously, it was an MPI location issue: there are two MPI installations on the cluster, and the one located under the Python folder is not working properly. By the way, how do I specify the location of MPI in CMake? Something like -DMPI_ROOT=...?

Now I need to add a direct solver. I don't have any of umfpack/pardiso/mumps.

1) I tried to install umfpack as a local installation on my laptop using

"-DCMAKE_PREFIX_PATH= ~/netgen/SuiteSparse -DUSE_UMFPACK=ON"

but got a lengthy error at the final linking stage:
../fem/libngfem.so: undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::replace(unsigned long, unsigned long, char const*, unsigned long)@GLIBCXX_3.4.21'
../comp/libngcomp.so: undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std::char_traits<char>, std::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)@GLIBCXX_3.4.21'
../comp/libngcomp.so: undefined reference to `std::basic_ofstream<char, std::char_traits<char> >::basic_ofstream(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::_Ios_Openmode)@GLIBCXX_3.4.21'
libsolve.so: undefined reference to `VTT for std::__cxx11::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >@GLIBCXX_3.4.21'
....

2) A similar thing happens when I activate mumps with "-DUSE_MUMPS=ON"

3) I tried to add pardiso with MKL:

"-DUSE_MKL=ON
-DMKL_ROOT=/soft/intel/x86_64/12.1/8.273/composer_xe_2011_sp1.8.273/mkl/"

but got a compile error for ngsolve at the very beginning:
netgen/src/ngsolve/ngstd/taskmanager.cpp: In member function 'void ngstd::TaskManager::Loop(int)':
netgen/src/ngsolve/ngstd/taskmanager.cpp:345:32: error: 'mkl_set_num_threads_local' was not declared in this scope
mkl_set_num_threads_local(1);



Finally, I have a convergence issue with the provided preconditioner. Running with
>> mpirun -np 5 ngspy mpi_poission.py
(the bddc preconditioner)
I got the following diverging result:
assemble VOL element 6697/6697
assemble VOL element 6697/6697
create masterinverse
master: got data from 4
now build graph
n = 8507
now build matrix
have matrix, now invert
start order
order ........ 14952360 Bytes task-based parallelization (C++11 threads) using 1 threads
factor SPD ........
0 1.00669
1 0.940628
2 0.533298
3 0.540046
4 1.33798
5 1.05662

But without MPI, the method converges in 12 iterations.
When I replace the preconditioner with type "local", there is no convergence difference between the MPI and non-MPI versions. Is this to be expected?
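For reference, the only thing I change between the two runs is the preconditioner type; stripped down, it looks roughly like this (a rough serial sketch, not the exact tutorial code):

from mpi4py import MPI                 # imported before ngsolve, as suggested above
from ngsolve import *
from netgen.geom2d import unit_square

mesh = Mesh(unit_square.GenerateMesh(maxh=0.1))
fes = H1(mesh, order=2, dirichlet="bottom|right|top|left")
u, v = fes.TrialFunction(), fes.TestFunction()

a = BilinearForm(fes)
a += SymbolicBFI(grad(u) * grad(v))
f = LinearForm(fes)
f += SymbolicLFI(1 * v)

c = Preconditioner(a, "bddc")      # diverges for me under mpirun
# c = Preconditioner(a, "local")   # same convergence with and without MPI

a.Assemble()
f.Assemble()

gfu = GridFunction(fes)
inv = CGSolver(a.mat, c.mat, printrates=True, maxsteps=200)
gfu.vec.data = inv * f.vec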

Thanks in advance,
Guosheng
7 years 5 months ago #39 by matthiash
You really seem to encounter all the problems one could think of. First, let me point to a script I wrote for someone else to get NGSolve running on a cluster. I should have mentioned it before; maybe it helps:
data.asc.tuwien.ac.at/snippets/6

Now, step by step:

Guosheng Fu wrote: OK. So I updated the library and finally have the MPI version installed.
Previously, it was an MPI location issue: there are two MPI installations on the cluster, and the one located under the Python folder is not working properly. By the way, how do I specify the location of MPI in CMake? Something like -DMPI_ROOT=...?

Now I need to add a direct solver. I don't have any of umfpack/pardiso/mumps.

1) I tried to install umfpack as a local installation on my laptop using

"-DCMAKE_PREFIX_PATH= ~/netgen/SuiteSparse -DUSE_UMFPACK=ON"

but got a lengthy error at the final linking stage:
../fem/libngfem.so: undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::replace(unsigned long, unsigned long, char const*, unsigned long)@GLIBCXX_3.4.21'
../comp/libngcomp.so: undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std::char_traits<char>, std::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)@GLIBCXX_3.4.21'
../comp/libngcomp.so: undefined reference to `std::basic_ofstream<char, std::char_traits<char> >::basic_ofstream(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::_Ios_Openmode)@GLIBCXX_3.4.21'
libsolve.so: undefined reference to `VTT for std::__cxx11::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >@GLIBCXX_3.4.21'
....

Things like this happen when C++ libraries that were compiled with different compilers are linked together (e.g. Umfpack compiled with gcc 6 and NGSolve with gcc 4.8). I just saw that I am not passing CMAKE_CXX_COMPILER to the Umfpack subproject; I will fix that.

Anyway, Umfpack doesn't make sense in a parallel environment, so try to get MUMPS running instead.
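(Once MUMPS or one of the others is compiled in, picking it at runtime is just the inverse flag on the assembled matrix; a hypothetical fragment, assuming the matrix, space and vectors from the Poisson tutorial:)

# fragment only: assumes a, fes, f, gfu as set up in the Poisson tutorial
inv = a.mat.Inverse(fes.FreeDofs(), inverse="mumps")   # or "umfpack", "pardiso"
gfu.vec.data = inv * f.vec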

Guosheng Fu wrote: 2) A similar thing happens when I activate mumps with "-DUSE_MUMPS=ON"

Please give me the exact error message. Did you point to a prebuilt version of MUMPS? Otherwise it's built automatically with NGSolve (recommended approach).

Guosheng Fu wrote: 3) I tried to add pardiso with MKL:

"-DUSE_MKL=ON
-DMKL_ROOT=/soft/intel/x86_64/12.1/8.273/composer_xe_2011_sp1.8.273/mkl/"

but got a compile error for ngsolve at the very beginning:
netgen/src/ngsolve/ngstd/taskmanager.cpp: In member function 'void ngstd::TaskManager::Loop(int)':
netgen/src/ngsolve/ngstd/taskmanager.cpp:345:32: error: 'mkl_set_num_threads_local' was not declared in this scope
mkl_set_num_threads_local(1);

This function seems to be missing in your MKL version. Is there no newer version installed? (Yours is from 2011.) If not, you can just comment out those two function calls; they only affect shared-memory parallelization and are irrelevant with MPI.

Guosheng Fu wrote: Finally, I have a convergence issue with the provided preconditioner. Running with
>> mpirun -np 5 ngspy mpi_poission.py
(the bddc preconditioner)
I got the following diverging result:
assemble VOL element 6697/6697
assemble VOL element 6697/6697
create masterinverse
master: got data from 4
now build graph
n = 8507
now build matrix
have matrix, now invert
start order
order ........ 14952360 Bytes task-based parallelization (C++11 threads) using 1 threads
factor SPD ........
0 1.00669
1 0.940628
2 0.533298
3 0.540046
4 1.33798
5 1.05662

But without MPI, the method converges in 12 iterations.
When I replace the preconditioner with type "local", there is no convergence difference between the MPI and non-MPI versions. Is this to be expected?

Thanks in advance,
Guosheng


This seems to be a bug/missing feature. bddc calls 'masterinverse' at one point, which means the whole matrix is copied to the master rank and inverted there. This currently seems to work only for symmetrically stored matrices. Lukas is working on it.


I hope we can sort out all the issues before you lose your patience... :)

Best,
Matthias