Hi Philip,
That is quite the difficult problem, I must say! Unfortunately, I am afraid I have no easy solution for you.
I do have some suggestions, though.
The first thing you could try is hybrid parallelization. Only run one MPI-rank per node (or socket), but use multiple threads on every rank. Per default, NGSolve turns off shared mempory parallelization when run with MPI, but you can turn it back on from python with
That way, you would still have to load the global mesh on every rank to compute the integral, but you would have fewer ranks in total.
To speed up the computation of the integral itself, could you compute multiple integrals at the same time? Possibly, your CoefficientFunction could compute all integrals when it is called the first time in each timestep. Then you would only need to loop through elements once and compute integrals for all points you need.
Finally, if your mesh gets too fine for the hybrid parallelization approach, you could use MPI sub-communicators. Then, every rank that participates in the "main" computation could have a series of workers that partition one instance of the global mesh among them and are responsible for computing the integral only.
I think this would require a good amount of additional C++ coding. You would have to send the coefficients of the gridfunction to the worker nodes, and the integral values from workers to master ranks. This might be quite nasty to do efficiently.
I hope this helps you a bit.
Best,
Lukas