A concise guide to using SLURM (Simple Linux Utility for Resource Management) for high-performance computing clusters.
Before using SLURM commands, make sure to:
- Connect to your cluster's login node (usually via SSH)
- Have proper account permissions set up by your system administrator
- Know your cluster's partition names and available resources
ssh username@your-cluster-login-node
squeue # View all jobs in the queue
squeue -u username # View your jobs only
squeue -j jobid # View specific job
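If the default `squeue` columns truncate long job names, you can request a custom output format. The field codes below are standard `squeue` format specifiers (`%i` job ID, `%j` name, `%T` state, `%M` elapsed time, `%D` node count, `%R` reason or node list); adjust the widths to taste:

```bash
# Wider, easier-to-read listing of your own jobs
squeue -u username -o "%.10i %.30j %.8T %.10M %.6D %R"
```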
sinfo # Basic cluster information
sinfo -Nel # Detailed node information
For a more detailed and customized resource overview (credit: @rhfeiyang):
sinfo -O Nodehost:13,partition:.15,statecompact:.7,Gres:.30,GresUsed:.47,freemem:.10,memory:.10,cpusstate:.15
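Since that command is long to retype, one option is to wrap it in a shell alias (a sketch assuming bash and a `~/.bashrc`; the alias name `gpuinfo` is just a suggestion):

```bash
# Add to ~/.bashrc, then run `gpuinfo` for the detailed overview
alias gpuinfo='sinfo -O Nodehost:13,partition:.15,statecompact:.7,Gres:.30,GresUsed:.47,freemem:.10,memory:.10,cpusstate:.15'
```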
scancel jobid # Cancel a job
scancel -u username # Cancel all your jobs
scontrol show job jobid # Show job details
scontrol update JobId=jobid TimeLimit=10:00:00 # Modify time limit
scontrol update JobId=jobid Partition=newpartition # Change partition
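Two related `scontrol` subcommands that are often handy alongside the above are `hold` and `release`, which keep a pending job from starting and let it run again:

```bash
scontrol hold jobid     # Keep a pending job from being scheduled
scontrol release jobid  # Allow a held job to be scheduled again
```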
Basic CPU-only allocation:
salloc -N 1 -c 4 -t 2:00:00 --mem=8G
GPU allocation (adjust GPU type and count as needed):
# Single GPU
salloc -N 1 -c 8 --gres=gpu:1 --mem=32G -t 4:00:00
# Specific GPU type (check with `sinfo -O Gres:13`)
salloc -N 1 -c 8 --gres=gpu:v100:1 --mem=32G -t 4:00:00
salloc -N 1 -c 8 --gres=gpu:a100:1 --mem=64G -t 4:00:00
Then you can ssh into the allocated node with `ssh <node_name>` (the node name appears in the NODELIST column of `squeue -u username`).

`srun` allows you to run commands directly on allocated resources without needing an interactive shell.
# Run with GPU
srun -N 1 -c 8 --gres=gpu:1 --mem=32G -t 2:00:00 nvidia-smi
# Run Python script directly
srun -N 1 -c 8 --gres=gpu:1 --mem=32G -t 4:00:00 python train.py
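On clusters where direct SSH to compute nodes is disabled, `srun --pty` is a common alternative for getting an interactive shell on the allocated node (the resource values here are just examples):

```bash
# Interactive bash shell on a GPU node
srun -N 1 -c 8 --gres=gpu:1 --mem=32G -t 2:00:00 --pty bash
```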
The flags used above:

- `-N 1`: Number of nodes
- `-c 8`: Number of CPU cores
- `--gres=gpu:1`: Generic resource (1 GPU)
- `--mem=32G`: Memory allocation
- `-t 4:00:00`: Time limit (4 hours)
- `-p partition_name`: Specify partition
- `-A account_name`: Specify account (a combined example follows this list)
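Putting the partition and account flags together with the others (the partition and account names below are placeholders; substitute your cluster's values):

```bash
# GPU allocation on a named partition, charged to a specific account
salloc -N 1 -c 8 --gres=gpu:1 --mem=32G -t 4:00:00 -p gpu_partition -A your_account
```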
For batch (non-interactive) jobs, create a file with a `.slurm` or `.sh` extension:
#!/bin/bash
#SBATCH --job-name=my_job # Job name
#SBATCH --partition=gpu # Partition name
#SBATCH --account=your_account # Account name
#SBATCH --requeue # Requeue the job if preempted or a node fails
#SBATCH --nodes=1 # Number of nodes
#SBATCH --cpus-per-task=8 # CPU cores per task
#SBATCH --mem=32G # Memory per node
#SBATCH --time=05:00:00 # Time limit (5 hours)
#SBATCH --gres=gpu:1 # GPU allocation
#SBATCH --output=job_%j.out # Standard output (%j = job ID)
#SBATCH --error=job_%j.err # Standard error
#SBATCH --mail-type=END,FAIL # Email notifications
#SBATCH --mail-user=your@email.com # Email address
# Navigate to working directory
cd /path/to/your/project
# Activate your environment (replace with your own shell init file and environment name)
source /home/user/.bashrc
conda activate your_env
# Run your commands
python your_script.py
echo "Job completed at $(date)"
Submit the script with:

sbatch your_script.slurm
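For parameter sweeps, the same script structure extends to a job array. This is a minimal sketch: `train.py` and its `--lr` flag are hypothetical, while `$SLURM_ARRAY_TASK_ID`, `%A` (array job ID), and `%a` (task index) are provided by SLURM:

```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --partition=gpu
#SBATCH --array=0-3               # Four array tasks, indices 0..3
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --gres=gpu:1
#SBATCH --time=05:00:00
#SBATCH --output=sweep_%A_%a.out  # One log file per array task

# Pick a learning rate based on this task's array index
LRS=(0.1 0.01 0.001 0.0001)
python train.py --lr "${LRS[$SLURM_ARRAY_TASK_ID]}"
```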
| Option | Description | Example |
|---|---|---|
| `--job-name` | Job name | `--job-name=training` |
| `--partition` | Queue/partition | `--partition=gpu` |
| `--nodes` | Number of nodes | `--nodes=2` |
| `--ntasks` | Number of tasks | `--ntasks=4` |
| `--cpus-per-task` | CPUs per task | `--cpus-per-task=8` |
| `--mem` | Memory per node | `--mem=64G` |
| `--mem-per-cpu` | Memory per CPU | `--mem-per-cpu=4G` |
| `--time` | Time limit | `--time=12:00:00` |
| `--gres` | Generic resources | `--gres=gpu:2` |
| `--exclusive` | Exclusive node use | `--exclusive` |
| `--array` | Job arrays | `--array=1-10` |
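Options passed on the `sbatch` command line override the matching `#SBATCH` directives in the script, which is convenient for one-off changes (the `debug` partition below is a placeholder):

```bash
# Same script, shorter time limit and a different partition, without editing the file
sbatch --time=1:00:00 --partition=debug your_script.slurm
```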
- Request only the resources you need
- Use appropriate time limits
- Monitor your jobs with `squeue` and `sacct`
- Use meaningful job names
- Redirect output to files
- Set up email notifications for long jobs
- Start with interactive jobs for testing
- Use small test datasets first
- Check your scripts locally when possible
- Use job arrays for parameter sweeps
- Checkpoint long-running jobs
- Clean up temporary files
- Use `--requeue` for fault tolerance (a checkpoint/requeue sketch follows this list)
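The checkpoint-and-requeue pattern can be sketched as follows. It assumes a hypothetical `train.py` that writes checkpoints and can resume from them; `--signal=B:USR1@120` asks SLURM to send `USR1` to the batch shell two minutes before the time limit, and `scontrol requeue` puts the job back in the queue:

```bash
#!/bin/bash
#SBATCH --job-name=long_run
#SBATCH --requeue                 # Allow requeueing after preemption or node failure
#SBATCH --signal=B:USR1@120       # Send USR1 to this script 2 minutes before the time limit
#SBATCH --time=05:00:00
#SBATCH --output=long_run_%j.out

# On USR1, requeue this job so it restarts and resumes from the latest checkpoint
trap 'echo "Time limit approaching, requeueing"; scontrol requeue "$SLURM_JOB_ID"' USR1

# Run the workload in the background so the trap can fire, then wait for it
python train.py --resume-from checkpoints/latest &
wait
```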
# Check job details
scontrol show job jobid
# View job history
sacct -j jobid --format=JobID,JobName,State,ExitCode,Start,End
# Check node details
scontrol show node nodename
# View fair-share usage for your account
sshare -A account_name
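To check whether a finished job actually used the memory and CPU time it requested, `sacct` can report usage fields (the field names below are standard, though what is recorded depends on your site's accounting configuration):

```bash
# Resource usage of a completed job: peak memory, elapsed wall time, CPU time
sacct -j jobid --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,ReqMem,State
```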
- [Official SLURM Documentation](https://slurm.schedmd.com/documentation.html)
- Check your cluster's specific documentation for:
- Available partitions and QoS settings
- Installed software modules
- Local policies and limits
- Cluster-specific examples
Note: Replace placeholder values (account names, partition names, email addresses, etc.) with your cluster-specific information. Contact your system administrator for cluster-specific configurations.