SLURM Tutorial & Quick Reference

A concise guide to using SLURM (Simple Linux Utility for Resource Management) for high-performance computing clusters.

Table of Contents

  • Prerequisites
  • Common Commands
  • Resource Allocation and Job Execution
  • Batch Job Submission
  • Best Practices
  • Additional Resources

Prerequisites

Before using SLURM commands, make sure to:

  1. Connect to your cluster's login node (usually via SSH)
  2. Have proper account permissions set up by your system administrator
  3. Know your cluster's partition names and available resources
ssh username@your-cluster-login-node

Common Commands

Check Job Status

squeue                    # View all jobs in the queue (running and pending)
squeue -u username        # View a specific user's jobs
squeue -j jobid           # View a specific job
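
If the default columns aren't enough, squeue also accepts a custom output format via -o/--format; the field widths below are just one possible layout:

# Job ID, name, partition, state, elapsed time, node count, nodelist/reason
squeue -u $USER -o "%.10i %.20j %.10P %.8T %.10M %.6D %R"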

Check Cluster Resources

sinfo                     # Basic cluster information
sinfo -Nel                # Detailed node information

Custom Resource View

For a more detailed and customized resource overview (by @rhfeiyang):

sinfo -O Nodehost:13,partition:.15,statecompact:.7,Gres:.30,GresUsed:.47,freemem:.10,memory:.10,cpusstate:.15

Job Management

scancel jobid             # Cancel a job
scancel -u username       # Cancel all your jobs
scontrol show job jobid   # Show job details

Modify Submitted Jobs

scontrol update JobId=jobid TimeLimit=10:00:00     # Modify time limit (increases may require admin privileges)
scontrol update JobId=jobid Partition=newpartition # Change partition (pending jobs only)

Resource Allocation and Job Execution

Interactive Jobs (for debugging)

Basic CPU-only allocation:

salloc -N 1 -c 4 -t 2:00:00 --mem=8G

GPU allocation (adjust GPU type and count as needed):

# Single GPU
salloc -N 1 -c 8 --gres=gpu:1 --mem=32G -t 4:00:00

# Specific GPU type (check with `sinfo -O Gres:13`)
salloc -N 1 -c 8 --gres=gpu:v100:1 --mem=32G -t 4:00:00
salloc -N 1 -c 8 --gres=gpu:a100:1 --mem=64G -t 4:00:00

You can then SSH into the allocated node with ssh <node_name> (if your cluster permits direct SSH to compute nodes).
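
If you're not sure which node you were given, squeue lists it; and on clusters that block direct SSH to compute nodes, attaching a shell through srun with the allocation's job ID is a common alternative (the node name and job ID below are placeholders):

# Show the node(s) assigned to your jobs (last column)
squeue -u $USER -o "%.10i %.15j %.8T %N"

# Either SSH to the node printed above...
ssh node042

# ...or attach a shell to the existing allocation by job ID
# (some SLURM versions also need --overlap here)
srun --jobid=123456 --pty bash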

Direct Command Execution with srun

srun allows you to run commands directly on allocated resources without needing an interactive shell.

Common srun Options

# Run with GPU
srun -N 1 -c 8 --gres=gpu:1 --mem=32G -t 2:00:00 nvidia-smi

# Run Python script directly
srun -N 1 -c 8 --gres=gpu:1 --mem=32G -t 4:00:00 python train.py

Parameter Explanations

  • -N 1: Number of nodes
  • -c 8: CPU cores per task
  • --gres=gpu:1: Generic resource (1 GPU)
  • --mem=32G: Memory allocation
  • -t 4:00:00: Time limit (4 hours)
  • -p partition_name: Specify partition
  • -A account_name: Specify account
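
Putting these options together, a fuller invocation might look like this (the partition and account names are placeholders for your cluster's own):

# 1 node, 8 cores, 1 GPU, 32 GB memory, 4 hours, on a named partition and account
srun -N 1 -c 8 --gres=gpu:1 --mem=32G -t 4:00:00 -p gpu -A my_account python train.py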

Batch Job Submission

Create a SLURM Script

Create a file with a .slurm or .sh extension:

#!/bin/bash
#SBATCH --job-name=my_job          # Job name
#SBATCH --partition=gpu            # Partition name
#SBATCH --account=your_account     # Account name
#SBATCH --requeue                  # Requeue the job if it fails or is preempted
#SBATCH --nodes=1                  # Number of nodes
#SBATCH --cpus-per-task=8          # CPU cores per task
#SBATCH --mem=32G                  # Memory per node
#SBATCH --time=05:00:00            # Time limit (5 hours)
#SBATCH --gres=gpu:1               # GPU allocation
#SBATCH --output=job_%j.out        # Standard output (%j = job ID)
#SBATCH --error=job_%j.err         # Standard error
#SBATCH --mail-type=END,FAIL       # Email notifications
#SBATCH --mail-user=your@email.com # Email address


# Navigate to working directory
cd /path/to/your/project

# Activate your environment (replace with your own shell config and environment name)
source /home/user/.bashrc
conda activate your_env

# Run your commands
python your_script.py
echo "Job completed at $(date)"

Submit the Job

sbatch your_script.slurm
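
sbatch prints the assigned job ID on submission, which you can use to monitor the job and its output file (the ID below is only an example):

sbatch your_script.slurm          # prints something like "Submitted batch job 123456"
squeue -j 123456                  # check its state
tail -f job_123456.out            # follow the output file defined in the script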

Common SBATCH Options

Option            Description          Example
--job-name        Job name             --job-name=training
--partition       Queue/partition      --partition=gpu
--nodes           Number of nodes      --nodes=2
--ntasks          Number of tasks      --ntasks=4
--cpus-per-task   CPUs per task        --cpus-per-task=8
--mem             Memory per node      --mem=64G
--mem-per-cpu     Memory per CPU       --mem-per-cpu=4G
--time            Time limit           --time=12:00:00
--gres            Generic resources    --gres=gpu:2
--exclusive       Exclusive node use   --exclusive
--array           Job arrays           --array=1-10
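
As an illustration of --array, a minimal job-array script might look like the following; each task receives its index in SLURM_ARRAY_TASK_ID (the sweep.py script and its flag are hypothetical):

#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-10                # 10 tasks, indices 1..10
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=sweep_%A_%a.out    # %A = array job ID, %a = array task index

# Each task selects its own parameter setting via its array index (hypothetical script and flag)
python sweep.py --config-index "$SLURM_ARRAY_TASK_ID"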

Best Practices

1. Resource Planning

  • Request only the resources you need
  • Use appropriate time limits
  • Monitor your jobs with squeue and sacct

2. Job Management

  • Use meaningful job names
  • Redirect output to files
  • Set up email notifications for long jobs

3. Debugging

  • Start with interactive jobs for testing
  • Use small test datasets first
  • Check your scripts locally when possible

4. Efficiency Tips

  • Use job arrays for parameter sweeps (see the array-script example above)
  • Checkpoint long-running jobs so they can resume if interrupted
  • Clean up temporary files
  • Use --requeue for fault tolerance (see the sketch after this list)
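
A rough sketch of the checkpoint-and-requeue pattern; the checkpoint path and the --resume flag are hypothetical and depend entirely on your own code:

#!/bin/bash
#SBATCH --job-name=long_train
#SBATCH --requeue                   # let SLURM requeue the job on failure or preemption
#SBATCH --time=24:00:00
#SBATCH --output=train_%j.out

CKPT=checkpoints/latest.pt          # hypothetical checkpoint file written by train.py

if [ -f "$CKPT" ]; then
    python train.py --resume "$CKPT"   # resume after a requeue (hypothetical flag)
else
    python train.py
fi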

5. Common Troubleshooting

# Check job details
scontrol show job jobid

# View job history
sacct -j jobid --format=JobID,JobName,State,ExitCode,Start,End

# Check node details
scontrol show node nodename

# View fair-share usage for an account
sshare -A account_name

Additional Resources

  • Official SLURM Documentation: https://slurm.schedmd.com/
  • Check your cluster's specific documentation for:
    • Available partitions and QoS settings
    • Installed software modules
    • Local policies and limits
    • Cluster-specific examples

Note: Replace placeholder values (account names, partition names, email addresses, etc.) with your cluster-specific information. Contact your system administrator for cluster-specific configurations.
