The Distributed ASCI Supercomputer 3

DAS-3 Job Execution


General

    Programs are started on the DAS-3 compute nodes using the Sun Grid Engine (SGE) batch queueing system. The SGE system reserves the requested number of nodes for the duration of a program run. It is also possible to reserve a number of hosts in advance, terminate running jobs, or query the status of current jobs. A full documentation set is available.

Job time policy on DAS-3

    The default run time for SGE jobs on DAS-3 is 15 minutes, which is also the maximum for jobs on DAS-3 during working hours. We do not want people to monopolize the clusters for a long time, since that makes interactively running short jobs on many or all nodes practically impossible.

    During daytime, DAS-3 is specifically NOT a cluster for doing production work. It is meant for people doing experimental research on parallel and distributed programming. Long jobs are only allowed during the night and in the weekend, when DAS-3 is regularly idle. In all other cases, you first have to ask permission in advance from das-sysadm@cs.vu.nl to make sure you are not causing too much trouble for other users. More information is on the DAS-3 Usage Policy page.

SGE

    SGE on DAS-3 is added to the environment of the current login session (by setting the PATH and other relevant environment variables) using one of the commands

    module load default-ethernet

    or

    module load default-myrinet

    (the latter only if the high-speed Myri-10G interconnect is to be used; it is not available on DAS-3/Delft).

    The most often used SGE commands are
    • qsub: submit a new job
    • qstat: ask status about current jobs
    • qdel: delete a queued job
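
    For illustration, a typical command sequence could look like this (the job script name and job ID are placeholders, taken from the example further below):
$ qsub cpi.sge          # submit the job script; SGE reports the assigned job ID
$ qstat -u $USER        # show the status of your own jobs (qw = waiting, r = running)
$ qdel 2591             # remove job 2591 from the queue, or terminate it if already running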

    Starting a program by means of SGE usually involves first creating an SGE job script, which takes care of setting up the proper environment, possibly copying some files, querying the nodes that are used, and starting the actual processes on the compute nodes that were reserved for the run. An MPI-based example can be found below.
    The SGE system should be the only way in which processes on the DAS compute nodes are invoked; it provides exclusive access to the reserved processors. This is vitally important for doing controlled performance experiments, which is one of the main uses of DAS-3.

    NOTE: People circumventing the reservation system will risk their account being blocked!


Temporary files in /local

    If the variable SGE_KEEP_TMPFILES is set to "no" in "$HOME/.bashrc", SGE will remove any files created by the user in the /local file system at the end of the job.
    If SGE_KEEP_TMPFILES is set to "yes", the user's files in /local will be left untouched.
    If SGE_KEEP_TMPFILES is not set (the default), only the user's JavaGAT sandbox directories named /local/.JavaGAT_SANDBOX.* will be deleted.
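
    For example, to make SGE always leave the user's files in /local untouched, the variable can be set in "$HOME/.bashrc" as follows (a minimal sketch using the variable described above):
# in $HOME/.bashrc: keep all user files in /local after the job finishes
export SGE_KEEP_TMPFILES=yes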

Prun user interface

    An alternative, often more convenient way to use SGE is via the prun user interface, originally developed for DAS. The advantage is that the prun command acts as a synchronous, shell-like interface. For DAS-3, the user interface is kept largely the same, but the node reservation is done by SGE; therefore, SGE- and prun-initiated jobs do not interfere with each other. Note that on DAS-3 a module command, as described above for SGE, should be used before invoking prun. See also the manual page for preserve.
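
    As a minimal sketch (the options used here are explained in the Prun examples further below):
$ module load default-ethernet      # or "default-myrinet"
$ prun -1 -v -np 2 /bin/echo hello  # reserve 2 nodes via SGE and run /bin/echo once on each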

SGE caveats

  • SGE does not enforce exclusive access to the reserved processors. It is not difficult to run processes on the compute nodes behind the reservation system's back. However, this harms your fellow users, and yourself when you are interested in performance information.
  • SGE does not alter the users' environment. In particular, this means that pre-set execution limits (such as memory_use, cpu_time, etc.) are not changed. We think this is the way it ought to be: if users want to change their environment, they should do so with the appropriate "ulimit" command in their .bashrc (for bash users) or .cshrc (for csh/tcsh users) files in their home directory; a short sketch follows this list.
  • NOTE: your .bash_profile (for bash users) or .login (for csh/tcsh users) is NOT executed within the SGE job, so be very careful with environment settings in your .bashrc/.cshrc.
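
    A minimal sketch of such settings for bash users, in "$HOME/.bashrc" (the specific limits are only illustrative, not recommended values):
ulimit -c 0          # do not write core dump files
ulimit -v 4194304    # limit virtual memory per process to 4 GB (value in KB)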

SGE/MPI example:

Here, we will discuss a simple parallel example application, using MPI on a single DAS-3 cluster. See the DAS-3 OpenMPI page for an example regarding multi-cluster Grid runs using OpenMPI.
  • Step 1: Inspect the source code:
$ cat cpi.c
#include "mpi.h"
#include <stdio.h>
#include <math.h>

static double
f(double a)
{
    return (4.0 / (1.0 + a*a));
}

int
main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int  namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    fprintf(stderr, "Process %d on %s\n", myid, processor_name);

    n = 0;
    while (!done) {
        if (myid == 0) {
            if (n == 0) {
                n = 100; /* precision for first iteration */
            } else {
                n = 0;   /* second iteration: force stop */
            }

            startwtime = MPI_Wtime();
        }

        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) {
            done = 1;
        } else {
            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double) i - 0.5);
                sum += f(x);
            }
            mypi = h * sum;

            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            if (myid == 0) {
                printf("pi is approximately %.16f, error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime - startwtime);
            }
        }
    }

    MPI_Finalize();

    return 0;
}

  • Step 2: Compile the code with MPICH:
$ module load default-ethernet   # or "default-myrinet"
$ which mpicc
/usr/local/Cluster-Apps/mpich/ge/gcc/64/1.2.7/bin/mpicc
$ mpicc -o cpi cpi.c

  • Step 3: Adapt the SGE job submission script to your needs. In general, running a new program using SGE only requires a few minor changes to an existing job script. A job script can be any regular shell script, but a few SGE-specific annotations in comments starting with "#$" are used to influence scheduling behavior. A script to start an MPI (MPICH) application could look like this:
$ cat cpi.sge
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N cpi-job
#$ -l h_rt=0:15:00
#$ -pe mpich 4

MYAPP=cpi
MYAPP_FLAGS=''

# The number of cores per node can be different on the various DAS sites.
# Here we will use every cpu core there is on the first compute node,
# and assume this will be the same on the other nodes in the same cluster:
cpuspernode=`cat /proc/cpuinfo | grep processor | wc -l`
ncpus=`expr $NSLOTS \* $cpuspernode`

export MPICH_PROCESS_GROUP=no
. /etc/profile.d/modules.sh
module add default-ethernet  # or "default-myrinet"

#$ -v MPI_HOME
#$ -v LD_LIBRARY_PATH

echo "Got $NSLOTS nodes."
echo "Original machines file:"
cat $TMPDIR/machines
for i in `cat $TMPDIR/machines`; do
    j=0
    while [ $j -lt $cpuspernode ]; do
        echo $i
        j=`expr $j + 1`
    done
done > $TMPDIR/machines2
echo "Using following machines file:"
cat $TMPDIR/machines2

cmd="$MPI_HOME/bin/mpirun -np $ncpus -machinefile \
$TMPDIR/machines2  $MYAPP $MYAPP_FLAGS"
echo Will run: $cmd
$cmd

    In this example, the first two lines specify that the shell "/bin/sh" is to be used for the execution of the job script. The first line is in fact sufficient on DAS-3, but the second line is included for backward compatibility with other SGE installations.

    The "-cwd" in line 3 requests execution of the script from the current working directory; without this, the job would be started from the user's home directory.

    The name of the SGE job is then explicitly set to "cpi-job" with "-N cpi-job"; otherwise the name of the SGE script itself would be used (i.e., "cpi.sge"). The name of the job also determines how the job's output files will be named.

    In lines five and six, we set the job parameters. Here the job is given a maximum of 15 minutes; if it takes longer than that, it will automatically be terminated by SGE. This is important since during working hours only short jobs are allowed on DAS-3. The phrase "-pe mpich 4" asks for a "parallel environment" called "mpich", to be run with 4 nodes. The parallel environment specification causes SGE to use wrapper scripts that know about the SGE job interfaces and can be used to conveniently start parallel applications of a particular kind, in this case MPICH.

    The script then sets the program to be run and the arguments to be passed. The application itself is started with MPICH's "mpirun" tool. The example job script starts one process for each processor core on each node. Since the number of cores per node depends on the DAS-3 cluster (it is 4 on DAS-3/VU and DAS-3/UvA, but 2 on DAS-3/Leiden, DAS-3/Delft and DAS-3/UvA-MN), it is determined dynamically by the SGE script, which runs on the first node allocated. To use only one or two MPI processes per node, explicitly set "cpuspernode" to 1 or 2, as in the sketch below.
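
    For instance, to force a single MPI process per node, the corresponding line in the job script above can be replaced as follows (a sketch; the rest of the script stays unchanged):
cpuspernode=1                           # fixed: one MPI process per node
ncpus=`expr $NSLOTS \* $cpuspernode`    # total number of MPI processes, here one per reserved node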

    Note: the mpirun command used should come from the same MPICH directory that was used during compilation, since the startup procedure may differ between the various MPICH versions supported on DAS-3.
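
    A quick way to verify this is to check that mpicc and mpirun resolve to the same MPICH directory after loading the module (the paths shown are the ones from the example above):
$ which mpicc mpirun
/usr/local/Cluster-Apps/mpich/ge/gcc/64/1.2.7/bin/mpicc
/usr/local/Cluster-Apps/mpich/ge/gcc/64/1.2.7/bin/mpirun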

  • Step 4: Check how many nodes there are and whether they are available using "qstat -f", or use "preserve -llist" to quickly check the current node usage:
$ qstat

$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@node301.beowulf.cluster  BIP   0/1       0.00     lx26-amd64    
----------------------------------------------------------------------------
all.q@node302.beowulf.cluster  BIP   0/1       0.00     lx26-amd64    
----------------------------------------------------------------------------
all.q@node303.beowulf.cluster  BIP   0/1       0.00     lx26-amd64    
----------------------------------------------------------------------------
all.q@node304.beowulf.cluster  BIP   0/1       0.00     lx26-amd64    
[...]

$ preserve -llist
Mon Oct  9 15:29:39 2006

id      user            start           stop            state   nhosts  hosts
2590    versto          10/09   15:29   10/09   15:44   r       4       node308 node320
node354 node366

    Note: DAS-3 compute nodes will generally be in the SGE queue called "all.q", the default host queue containing all nodes. Depending on the site, other nodes for special purposes may be available as well, but they will typically be put in a different queue to prevent their accidental use.
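
    The queues configured on a particular site can be listed with SGE's qconf command, for example (on a site with only the default queue, just "all.q" will be shown):
$ qconf -sql      # show the names of all configured cluster queues
all.q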

  • Step 5: Submit the SGE job and check its status until it has completed:
$ qsub cpi.sge 
Your job 2591 ("cpi-job") has been submitted
$ qstat -u $USER
job-ID  prior   name     user     state submit/start at     queue    slots ja-task-ID 
-------------------------------------------------------------------------------------
  2591  0.00000 cpi-job  versto   qw    10/09/2006 15:31:14              4        
$ preserve -llist
Mon Oct  9 15:31:24 2006

id      user            start           stop            state   nhosts  hosts
2591    versto          10/09   15:31   10/09   15:46   r       4       node322 node353
node359 node368

$ preserve -llist
Mon Oct  9 15:35:02 2006

$

    Note that besides the concise "preserve -llist" output, there are many ways to get specific information about running jobs using SGE's native qstat command. To get an overview of the available options, run "qstat -help".
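
    For example, detailed information on a single job can be requested as follows (using the job ID from the example above):
$ qstat -j 2591         # show detailed scheduling and resource information for job 2591
$ qstat -f -u $USER     # full queue listing, restricted to your own jobs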

  • Step 6: Examine the standard output and standard error files for job with ID 2591:
$ cat cpi-job.o2591 
Got 4 nodes.
Original machines file:
node353
node322
node359
node368
Using following machines file:
node353
node353
node322
node322
node359
node359
node368
node368
Will run: /usr/local/Cluster-Apps/mpich/ge/gcc/64/1.2.7/bin/mpirun -np 8 -machinefile
/tmp/2591.1.all.q/machines2 cpi
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.007812

$ cat cpi-job.e2591 
Process 0 on node353.beowulf.cluster
Process 1 on node353.beowulf.cluster
Process 2 on node322.beowulf.cluster
Process 3 on node322.beowulf.cluster
Process 5 on node359.beowulf.cluster
Process 4 on node359.beowulf.cluster
Process 6 on node368.beowulf.cluster
Process 7 on node368.beowulf.cluster


Prun/MPI example:

    Using the Prun user interface on top of SGE can often be more convenient, as the following examples show.
    Note carefully the difference between the "-1", "-2" and "-4" options (explicitly starting one, two or four processes per node, with -np specifying the number of nodes) and the default behavior: starting two processes per node, with -np specifying the total number of processes.
$ module load default-ethernet   # or "default-myrinet"
$ prun -v -1 -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4
Reservation number 2592: Reserved 4 hosts for 900 seconds 
Run on 4 hosts for 960 seconds from Mon Oct  9 15:50:08 2006
: node355/0 node334/0 node338/0 node305/0 
Process 0 on node355.beowulf.cluster
Process 1 on node334.beowulf.cluster
Process 2 on node338.beowulf.cluster
Process 3 on node305.beowulf.cluster
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.003906

$ prun -v -2 -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4
Reservation number 2593: Reserved 4 hosts for 900 seconds 
Run on 4 hosts for 960 seconds from Mon Oct  9 15:50:37 2006
: node321/0 node321/1 node349/0 node349/1 node302/0 node302/1 node356/0 node356/1 
Process 0 on node321.beowulf.cluster
Process 1 on node349.beowulf.cluster
Process 2 on node302.beowulf.cluster
Process 3 on node356.beowulf.cluster
Process 4 on node321.beowulf.cluster
Process 5 on node349.beowulf.cluster
Process 6 on node302.beowulf.cluster
Process 7 on node356.beowulf.cluster
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.003906

$ prun -v -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4
Reservation number 2594: Reserved 4 hosts for 900 seconds 
Run on 4 hosts for 960 seconds from Mon Oct  9 15:50:52 2006
: node350/0 node350/1 node301/0 node301/1 
Process 0 on node350.beowulf.cluster
Process 1 on node301.beowulf.cluster
Process 2 on node350.beowulf.cluster
Process 3 on node301.beowulf.cluster
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.003906
    The generic Prun/SGE script /usr/local/sitedep/reserve.sge/sge-script is used to start the application, similar to the SGE example above. The script also uses a number of environment variables that are provided by Prun.


A few more Prun examples:

    The following examples show that commands which do not require pre-execution coordination can easily be started with Prun, without the help of a startup script. For more details, see the prun manual page.
$ prun -1 -v /bin/echo 2 hello
Reservation number 2595: Reserved 2 hosts for 900 seconds 
Run on 2 hosts for 960 seconds from Mon Oct  9 16:01:37 2006
: node341/0 node323/0 
0 2 hello
1 2 hello

$ prun -1 -v -no-panda /bin/echo 2 hello
Reservation number 2596: Reserved 2 hosts for 900 seconds 
Run on 2 hosts for 960 seconds from Mon Oct  9 16:01:52 2006
: node318/0 node343/0 
hello
hello

$ prun -1 -v -np 2 /bin/echo hello
Reservation number 2597: Reserved 2 hosts for 900 seconds 
Run on 2 hosts for 960 seconds from Mon Oct  9 16:02:38 2006
: node317/0 node365/0 
hello
hello

