
Distributed Parameter Tests with NGSolve and dispy

As this may be useful for some people, I decided to write a short article about how one can do distributed parameter tests either on a cloud or on a cluster. As a test case, we will solve the Poisson equation for some 1000 different conductivities, automatically distributed onto all available computing nodes. Apart from NGSolve we will use dispy to manage the network communication.

Distributed computing using a cloud

This introduction works if all your computing nodes are in a local network. For use on a cluster, look a bit further down; if the nodes are in different networks, an explanation of how to use dispy in that setting is available in the dispy documentation. First, install both dispy and NGSolve on all nodes. Then execute the dispynode.py script (provided by dispy) on every node you want to use for your computation. Next, we write a compute function which will be executed on every node:

def compute(alpha):
    import ngsolve as ngs
    import pickle
    import socket
    import os

    from netgen.geom2d import unit_square
    # use only one thread per job; dispy runs one job per available core
    ngs.SetNumThreads(1)
    mesh = ngs.Mesh(unit_square.GenerateMesh(maxh=0.1))
    fes = ngs.H1(mesh,order=5,dirichlet="left|top")

    u,v = fes.TrialFunction(), fes.TestFunction()
    # Poisson problem on the unit square with conductivity alpha
    a = ngs.BilinearForm(fes)
    a += ngs.SymbolicBFI(alpha * ngs.grad(u)*ngs.grad(v))
    f = ngs.LinearForm(fes)
    f += ngs.SymbolicLFI(1 * v)
    u = ngs.GridFunction(fes,"u_"+str(alpha))
    with ngs.TaskManager():
        a.Assemble()
        f.Assemble()
        u.vec.data = a.mat.Inverse(fes.FreeDofs()) * f.vec
        print("computed solution")
    # pickle mesh and solution and send the file back to the master node
    directory = os.path.expanduser("~") + "/tmp"
    if not os.path.exists(directory):
        os.makedirs(directory)
    filename = directory + "/solution_"+str(alpha) + ".dat"
    with open(filename,"wb") as outfile:
        pickler = pickle.Pickler(outfile)
        pickler.dump([mesh,u])
    dispy_send_file(filename)
    host = socket.gethostname()
    return host

This function computes the solution for an input conductivity $\alpha$, pickles the solution and sends it back to the master node. On the master we will need to run some further dispy code which creates the jobs and submits them:

if __name__ == '__main__':
    import dispy
    import numpy as np
    # create a job cluster that distributes the compute function to all nodes
    cluster = dispy.JobCluster(compute)
    jobs = []
    # submit one job per conductivity value
    for alpha in np.linspace(1,200,1000):
        job = cluster.submit(alpha)
        job.id = alpha
        jobs.append(job)
    for job in jobs:
        # job() waits for the job to finish and returns its result (the hostname)
        host = job()
        if job.exception:
            print(job.exception)
        else:
            print('%s computed solution for alpha = %s in %s seconds' % (host, job.id,
                                                                         job.end_time-job.start_time))
    cluster.print_status()

Running dispynode.py on 4 computers in the network and then running simple_ngsolve.py, we get all the pickled solutions back, as well as a short report:

                           Node |  CPUs |    Jobs |    Sec/Job | Node Time Sec
------------------------------------------------------------------------------
 pc1                            |   128 |     264 |      7.884 |      2081.290
 pc2                            |    16 |     152 |      1.349 |       205.017
 pc3                            |    48 |     300 |      2.115 |       634.399
 pc4                            |    80 |     284 |      3.718 |      1055.902

Total job time: 3976.608 sec, wall time: 21.270 sec, speedup: 186.955

Note that we told NGSolve to run with only one thread using the line

ngs.SetNumThreads(1)

and let dispy start one job per available core. For larger computations it may be more efficient (or even necessary because of memory) to do the reverse and run one job per computer with many threads (have a look at the dispy documentation to see how to configure this).
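
The pickled solution files that each job sends back with dispy_send_file can then be loaded again on the master for post-processing. A minimal sketch, assuming the transferred files ended up in the current working directory (the filename pattern follows the compute function above):

import pickle
import ngsolve as ngs

alpha = 1.0  # one of the values from np.linspace(1, 200, 1000)
with open("solution_" + str(alpha) + ".dat", "rb") as infile:
    # the file contains the list [mesh, u] written by the compute function
    mesh, u = pickle.load(infile)

# u is an ordinary GridFunction again; for example, integrate it over the mesh
print(ngs.Integrate(u, mesh))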

 

Parameter Testing using a cluster (by Lukas Kogler)

If you have access to a cluster you might want to do your parameter tests there. The setup for a cluster is basically the same from the Python side; however, you will usually have to deal with some kind of workload manager or queue to start your computations.

We will give a short example of how to set this up on a cluster using the SLURM workload manager. All relevant files are in the attached zip archive.

We want one node to run the master script while all the other nodes wait for jobs using the dispynode.py script.

In this example, each node has 20 cores and we will start one instance of NGSolve per node, each using 20 threads.
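
Compared to the local-network example above, the only change needed in the compute function is the thread count; a minimal sketch (20 being the core count assumed here):

import ngsolve as ngs

# inside compute(): let the NGSolve TaskManager use all 20 cores of a node
# instead of a single thread
ngs.SetNumThreads(20)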

The job is submitted with "sbatch script_dispy", which contains the following:

#!/usr/bin/bash
#
#SBATCH --job-name=dispy_test
#SBATCH --partition=medium
#SBATCH --output=/home/lkogler/scripts/dispy/slurm_out
#SBATCH --ntasks 4
#SBATCH -N 4
#SBATCH --time=15:00
#SBATCH --cpus-per-task=20
#SBATCH --tasks-per-node=1

srun --multi-prog multi.conf

This tells SLURM that we want to start a job using 4 nodes (-N 4), running 1 task per node, with 20 CPUs assigned to each task.

--multi-prog tells SLURM that you are planning to execute different commands for different tasks. In this case, the list of commands is contained in the file multi.conf; task 0 is the master and tasks 1 to 3 wait for jobs:

0   python3 /home/lkogler/scripts/dispy/simple_ngsolve.py 
1-3 bash /home/lkogler/scripts/dispy/.pipe_it python3 /home/lkogler/scripts/dispy/dispynode.py --daemon --clean

By default, SLURM writes the output of all tasks into the same file, which can be specified with "--output=..." in the SLURM script.

We want each instance of NGSolve to write into its own file. For this, SLURM provides the environment variable SLURM_NODEID, which we can use to specify one output file per node in the attached file ".pipe_it":

#!/usr/bin/bash
# run the given command and redirect its output to a node-specific file
"$@" > /scratch/lkogler/output/out_node_$SLURM_NODEID

This is important because otherwise all of the processes would try to write to the same file at the same time; their output would be interleaved and the resulting file would be unreadable.

 

If you run into problems or need help getting started, leave a comment here or in the forum or contact us directly!

 

Attachments
simple_ngsolve.py [1.55Kb]
Uploaded Friday, 17 November 2017 by Christopher Lackner
dispy cluster.zip [796B]
Uploaded Monday, 20 November 2017 by Lukas Kogler