DAS-3 Job Execution
General
Programs are started on the DAS-3 compute nodes using the
Sun Grid Engine (SGE)
batch queueing system. The SGE system reserves the requested number of
nodes for the duration of a program run. It is also possible to reserve a
number of hosts in advance, terminate running jobs, or query the status of
current jobs. A full
documentation set is available.
Job time policy on DAS-3
The default run time for SGE jobs on DAS-3 is 15 minutes,
which is also the maximum for jobs on DAS-3
during working hours. We do not want people to monopolize
the clusters for a long time, since that makes interactively
running short jobs on many or all nodes practically impossible.
During daytime, DAS-3 is specifically NOT a cluster
for doing production work. It is meant for people doing experimental
research on parallel and distributed programming.
Long jobs are only allowed during the night and in the weekend,
when DAS-3 is regularly idle.
In all other cases, you must ask permission in advance from
das-sysadm@cs.vu.nl
to make sure you are not causing too much trouble for other users.
More information is on the DAS-3 Usage Policy
page.
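The run time limit is requested at job submission time; for instance, the
job script in the SGE/MPI example below asks for the 15-minute daytime
maximum through an "h_rt" resource limit (a minimal sketch; the full script
is shown further down):
#$ -l h_rt=0:15:00    # hard run-time limit of 15 minutes (the daytime maximum)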
SGE
SGE on DAS-3 is imported into the development environment
of the current login session (by setting the PATH and
other appropriate environment variables) using one of the commands
module load default-ethernet
or
module load default-myrinet
(the latter only
if the high-speed Myri-10G interconnect is to be used,
which is not available on DAS-3/Delft).
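For example, a typical login-node session might look like this (a minimal
sketch; the exact path printed by "which" differs per site):
$ module load default-ethernet    # or: module load default-myrinet
$ which mpicc                     # verify that the module's tools are now on the PATH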
The most often used SGE commands are:
- qsub: submit a new job
- qstat: query the status of current jobs
- qdel: delete a queued job
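A typical interaction with these commands might look as follows (a minimal
sketch; "myjob.sge" and the job ID are placeholders):
$ qsub myjob.sge      # submit the job script
$ qstat -u $USER      # check the status of your own jobs
$ qdel 2591           # delete the job with ID 2591 from the queue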
Starting a program by means of SGE usually involves the creation
of a SGE job script first, which takes care of setting up
the proper environment, possibly copying some files, querying
the nodes that are used, and starting the actual processes
on the compute nodes that were reserved for the run.
An MPI-based example can be found below.
The SGE system should be the only way in which processes on the DAS compute
nodes are invoked; it provides exclusive access to the reserved processors.
This is vitally important for doing controlled performance experiments,
which is one of the main uses of DAS-3.
NOTE: People circumventing the reservation system will risk their account
being blocked!
Temporary files in /local
If variable SGE_KEEP_TMPFILES is set to "no" in "$HOME/.bashrc",
at the end of the job any files created by the user in the /local
file system will be removed by SGE.
If SGE_KEEP_TMPFILES is set to "yes", the user's files in /local will
be left untouched.
If SGE_KEEP_TMPFILES is not set (as is the default), only the user's JavaGAT
sandbox directories named /local/.JavaGAT_SANDBOX.* will be deleted.
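For example, to have SGE clean up all of your files in /local after each
job, a line like the following could be added to "$HOME/.bashrc" (a minimal
sketch, assuming the variable is exported so the job environment sees it):
export SGE_KEEP_TMPFILES=no    # remove this user's files in /local when the job ends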
Prun user interface
An alternative, often more convenient way to use SGE is via the
prun user interface.
The advantage is that the prun command acts as a synchronous,
shell-like interface, which was originally developed for DAS.
For DAS-3, the user interface is kept largely the same,
but the node reservation is done by SGE.
Therefore, SGE- and prun-initiated jobs don't interfere with each other.
Note that on DAS-3 one of the module commands mentioned above for SGE
should be issued before invoking prun.
See also the manual pages for prun and
preserve.
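A minimal prun session could thus look like this (a sketch based on the
examples further down; see the Prun examples below for the output):
$ module load default-ethernet    # or: module load default-myrinet
$ preserve -llist                 # check the current node usage first
$ prun -1 -v /bin/echo 2 hello    # reserve 2 nodes, one process per node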
SGE caveats
-
SGE does not enforce exclusive access to the reserved processors.
It is not difficult to run processes on the compute nodes behind
the reservation system's back. However, this harms your fellow users, and
yourself when you are interested in performance information.
-
SGE does not alter the users' environment.
In particular, this means that pre-set execution limits
(such as memory_use, cpu_time, etc.) are not changed.
We think this is the way it ought to be:
if users want to change their environment, they should do so
with the appropriate "ulimit" command in their .bashrc (for bash users)
or .cshrc (for csh/tcsh users) files in their home directory;
a small example follows after this list.
-
NOTE: your .bash_profile (for bash users) or .login (for csh/tcsh users)
is NOT executed within the SGE job, so be very careful with
environment settings in your .bashrc/.cshrc.
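For example, bash users could add lines such as the following to
"$HOME/.bashrc" (a sketch; the limit values are hypothetical and should be
chosen to fit your own experiments):
ulimit -t 900         # CPU time limit in seconds (hypothetical value)
ulimit -v 4194304     # virtual memory limit in KB (hypothetical value)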
SGE/MPI example:
Here, we will discuss a simple parallel example application,
using MPI on a single DAS-3 cluster.
See the DAS-3 OpenMPI page
for an example regarding multi-cluster Grid runs using OpenMPI.
- Step 1: Inspect the source code:
$ cat cpi.c
#include "mpi.h"
#include <stdio.h>
#include <math.h>
static double
f(double a)
{
return (4.0 / (1.0 + a*a));
}
int
main(int argc, char *argv[])
{
int done = 0, n, myid, numprocs, i;
double PI25DT = 3.141592653589793238462643;
double mypi, pi, h, sum, x;
double startwtime = 0.0, endwtime;
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
MPI_Get_processor_name(processor_name, &namelen);
fprintf(stderr, "Process %d on %s\n", myid, processor_name);
n = 0;
while (!done) {
if (myid == 0) {
if (n == 0) {
n = 100; /* precision for first iteration */
} else {
n = 0; /* second iteration: force stop */
}
startwtime = MPI_Wtime();
}
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (n == 0) {
done = 1;
} else {
h = 1.0 / (double) n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs) {
x = h * ((double) i - 0.5);
sum += f(x);
}
mypi = h * sum;
MPI_Reduce(&mypi, & pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (myid == 0) {
printf("pi is approximately %.16f, error is %.16f\n",
pi, fabs(pi - PI25DT));
endwtime = MPI_Wtime();
printf("wall clock time = %f\n", endwtime - startwtime);
}
}
}
MPI_Finalize();
return 0;
}
- Step 2: Compile the code with MPICH:
$ module load default-ethernet # or "default-myrinet"
$ which mpicc
/usr/local/Cluster-Apps/mpich/ge/gcc/64/1.2.7/bin/mpicc
$ mpicc -o cpi cpi.c
- Step 3: Adapt the SGE job submission script to your needs. In general,
running a new program using SGE only requires a few minor
changes to an existing job script. A job script can be any regular
shell script, but a few SGE-specific annotations in comments
starting with "#$" are used to influence scheduling behavior.
A script to start an MPI (MPICH) application could look like this:
$ cat cpi.sge
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N cpi-job
#$ -l h_rt=0:15:00
#$ -pe mpich 4
MYAPP=cpi
MYAPP_FLAGS=''
# The number of cores per node can be different on the various DAS sites.
# Here we will use every cpu core there is on the first compute node,
# and assume this will be the same on the other nodes in the same cluster:
cpuspernode=`cat /proc/cpuinfo | grep processor | wc -l`
ncpus=`expr $NSLOTS \* $cpuspernode`
export MPICH_PROCESS_GROUP=no
. /etc/profile.d/modules.sh
module add default-ethernet # or "default-myrinet"
#$ -v MPI_HOME
#$ -v LD_LIBRARY_PATH
echo "Got $NSLOTS nodes."
echo "Original machines file:"
cat $TMPDIR/machines
for i in `cat $TMPDIR/machines`; do
    j=0
    while [ $j -lt $cpuspernode ]; do
        echo $i
        j=`expr $j + 1`
    done
done > $TMPDIR/machines2
echo "Using following machines file:"
cat $TMPDIR/machines2
cmd="$MPI_HOME/bin/mpirun -np $ncpus -machinefile \
    $TMPDIR/machines2 $MYAPP $MYAPP_FLAGS"
echo Will run: $cmd
$cmd
In this example, the first two lines specify the shell
"/bin/sh" to be used for the execution of the job script.
The first line is in fact sufficient on DAS-3, but the second line
is included for backward compatibility with other SGE installations.
The "-cwd" in line 3 requests execution of the script from the
current working directory; without this, the job would be started
from the user's home directory.
The name of the SGE job is then explicitly set to "cpi-job"
with "-N cpi-job"; otherwise the name of the SGE script itself
would be used (i.e., "cpi.sge"). The job name also
determines the names of the job's output files.
In lines five and six, we set the job parameters.
Here the job is given 15 minutes maximum; if it takes longer than that,
it will automatically be terminated by SGE. This is important
since during working hours only short jobs are allowed on DAS-3.
The phrase "-pe mpich 4" asks for a "parallel environment" called "mpich",
to be run with 4 nodes. The parallel environment specification
causes SGE to use wrapper script that know about the SGE job interfaces,
and which can be used to conveniently start parallel applications of
a particular kind, in this case MPICH.
The script then sets the program to be run and the arguments to
be passed to it. The application itself is started by
means of MPICH's "mpirun" tool.
The example job script starts one process for each processor core
on each node. Since the number of cores per node depends on
the cluster in DAS-3 (it is 4 on DAS-3/VU and DAS-3/UvA,
but 2 on DAS-3/Leiden, DAS-3/Delft and DAS-3/UvA-MN),
it is determined dynamically by the SGE script, which gets
run on the first node allocated.
To use only one or two MPI processes per node, explicitly set "cpuspernode"
to 1 or 2.
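For instance, to force one MPI process per node, the dynamic detection in
"cpi.sge" can simply be replaced by a fixed value (a minimal sketch):
cpuspernode=1                            # one MPI process per node
ncpus=`expr $NSLOTS \* $cpuspernode`     # total number of MPI processes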
Note: the mpirun command used should be from the same MPICH
directory that was used during compilation, since the startup procedure
may be different between the various versions of MPICH supported
on DAS-3.
- Step 4: Check how many nodes there are and their availability
using "qstat -f", or use "preserve -llist" to quickly
check the current node usage:
$ qstat
$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@node301.beowulf.cluster BIP 0/1 0.00 lx26-amd64
----------------------------------------------------------------------------
all.q@node302.beowulf.cluster BIP 0/1 0.00 lx26-amd64
----------------------------------------------------------------------------
all.q@node303.beowulf.cluster BIP 0/1 0.00 lx26-amd64
----------------------------------------------------------------------------
all.q@node304.beowulf.cluster BIP 0/1 0.00 lx26-amd64
[...]
$ preserve -llist
Mon Oct 9 15:29:39 2006
id user start stop state nhosts hosts
2590 versto 10/09 15:29 10/09 15:44 r 4 node308 node320
node354 node366
Note: DAS-3 compute nodes will generally be in the SGE queue called
"all.q", the default host queue containing all nodes.
Depending on the site, other nodes for special purposes may be available
as well, but they will typically be put in a different queue to prevent
their accidental use.
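To see which queues are configured on a particular site, SGE's "qconf" tool
can be used, and "qstat -f" can be restricted to a single queue (a sketch;
the output differs per site):
$ qconf -sql          # list all configured queues
$ qstat -f -q all.q   # show only the hosts in the default queue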
- Step 5: Submit the SGE job and check its status until it has completed:
$ qsub cpi.sge
Your job 2591 ("cpi-job") has been submitted
$ qstat -u $USER
job-ID prior name user state submit/start at queue slots ja-task-ID
-------------------------------------------------------------------------------------
2591 0.00000 cpi-job versto qw 10/09/2006 15:31:14 4
$ preserve -llist
Mon Oct 9 15:31:24 2006
id user start stop state nhosts hosts
2591 versto 10/09 15:31 10/09 15:46 r 4 node322 node353
node359 node368
$ preserve -llist
Mon Oct 9 15:35:02 2006
$
Note that besides the consise "preserve -llist" output,
there are many ways to get specific information about the
running jobs using SGE's native qstat command.
To get an overview of the options available, run "qstat -help".
- Step 6: Examine the standard output and standard error files for
job with ID 2591:
$ cat cpi-job.o2591
Got 4 nodes.
Original machines file:
node353
node322
node359
node368
Using following machines file:
node353
node353
node322
node322
node359
node359
node368
node368
Will run: /usr/local/Cluster-Apps/mpich/ge/gcc/64/1.2.7/bin/mpirun -np 8 -machinefile
/tmp/2591.1.all.q/machines2 cpi
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.007812
$ cat cpi-job.e2591
Process 0 on node353.beowulf.cluster
Process 1 on node353.beowulf.cluster
Process 2 on node322.beowulf.cluster
Process 3 on node322.beowulf.cluster
Process 5 on node359.beowulf.cluster
Process 4 on node359.beowulf.cluster
Process 6 on node368.beowulf.cluster
Process 7 on node368.beowulf.cluster
Prun/MPI example:
Using the Prun user interface on top of SGE is often more convenient,
as the following examples show.
Note carefully the difference between the "-1", "-2" and "-4" options
(which explicitly start one, two or four processes per node,
with -np specifying the number of nodes)
and the default behavior: starting two processes per node,
with -np specifying the total number of processes.
$ module load default-ethernet # or "default-myrinet"
$ prun -v -1 -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4
Reservation number 2592: Reserved 4 hosts for 900 seconds
Run on 4 hosts for 960 seconds from Mon Oct 9 15:50:08 2006
: node355/0 node334/0 node338/0 node305/0
Process 0 on node355.beowulf.cluster
Process 1 on node334.beowulf.cluster
Process 2 on node338.beowulf.cluster
Process 3 on node305.beowulf.cluster
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.003906
$ prun -v -2 -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4
Reservation number 2593: Reserved 4 hosts for 900 seconds
Run on 4 hosts for 960 seconds from Mon Oct 9 15:50:37 2006
: node321/0 node321/1 node349/0 node349/1 node302/0 node302/1 node356/0 node356/1
Process 0 on node321.beowulf.cluster
Process 1 on node349.beowulf.cluster
Process 2 on node302.beowulf.cluster
Process 3 on node356.beowulf.cluster
Process 4 on node321.beowulf.cluster
Process 5 on node349.beowulf.cluster
Process 6 on node302.beowulf.cluster
Process 7 on node356.beowulf.cluster
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.003906
$ prun -v -sge-script /usr/local/sitedep/reserve.sge/sge-script `pwd`/cpi 4
Reservation number 2594: Reserved 4 hosts for 900 seconds
Run on 4 hosts for 960 seconds from Mon Oct 9 15:50:52 2006
: node350/0 node350/1 node301/0 node301/1
Process 0 on node350.beowulf.cluster
Process 1 on node301.beowulf.cluster
Process 2 on node350.beowulf.cluster
Process 3 on node301.beowulf.cluster
pi is approximately 3.1416009869231249, error is 0.0000083333333318
wall clock time = 0.003906
The generic Prun/SGE script /usr/local/sitedep/reserve.sge/sge-script
is used to start the application, similar to the SGE example above.
The script also uses a number of environment variables that are
provided by Prun.
A few more Prun examples:
The following examples show that starting commands that do not require
pre-execution coordination is quite easy using Prun, and can
be done without the help of a startup script.
For more details, see the
prun manual page.
$ prun -1 -v /bin/echo 2 hello
Reservation number 2595: Reserved 2 hosts for 900 seconds
Run on 2 hosts for 960 seconds from Mon Oct 9 16:01:37 2006
: node341/0 node323/0
0 2 hello
1 2 hello
$ prun -1 -v -no-panda /bin/echo 2 hello
Reservation number 2596: Reserved 2 hosts for 900 seconds
Run on 2 hosts for 960 seconds from Mon Oct 9 16:01:52 2006
: node318/0 node343/0
hello
hello
$ prun -1 -v -np 2 /bin/echo hello
Reservation number 2597: Reserved 2 hosts for 900 seconds
Run on 2 hosts for 960 seconds from Mon Oct 9 16:02:38 2006
: node317/0 node365/0
hello
hello