Running Jobs with Slurm

Overview

All three clusters, Puma, Ocelote, and ElGato, use Slurm for resource management and job scheduling.

Additional Slurm Resources and Examples

Link

Description

Official SchedMD User Documentation

Official SchedMD user documentation. Includes detailed information on Slurm directives and commands.

PBS ⇔ Slurm Rosetta Stone

Table for converting some common PBS job directives to Slurm syntax.

HPC Quick Start

HPC Quick Start guide. If you have never submitted a batch job before, this is a great place to start.

Job Examples

Basic Slurm example scripts. Includes PBS scripts for comparison. 

Even More Job Examples!

Growing repository of example Slurm submission scripts

Intro to HPC

A recorded video presentation of our Intro to HPC workshop. Keep your eyes peeled for periodic announcements in the HPC listserv on upcoming live sessions!

Node Summary

Before submitting a Slurm script, you must know (or at least have a general idea of) the resources your job needs. This determines which type of node to request, how much memory to ask for, and other information you provide to the system via your batch script. A detailed list of Slurm batch flags is included below.

General Overview

Node Type

Description

Standard CPU Node

This is the general purpose node, which can (and should) be used by the majority of jobs. 

High Memory CPU Node

Similar to the standard nodes, but with significantly more RAM. There are only a few of them, and they should only be requested if you have tested your job on a standard node and found that its memory usage is too high. Both standard and high memory nodes share the same file system, so there is no advantage in terms of long-term storage, only active RAM usage.

GPU Node

Similar to the standard node, but with one or more GPUs available, depending on which cluster is in use.

Hardware Limitations by Node Type and Cluster

Please consult the following table when crafting Slurm submission scripts. Requesting more resources than are available on a given cluster and node type may lead to errors or delays.

Cluster  Node Type  N Nodes               N CPU/Node  RAM/CPU  CPU RAM/Node  N GPU/Node  RAM/GPU  GPU RAM/Node  Total N GPUs
Puma     Standard   236                   94          5 gb     470 gb        -           -        -             -
Puma     High Mem   3 standard, 2 buy-in  94          32 gb    3008 gb       -           -        -             -
Puma     GPU        8 standard, 7 buy-in  94          5 gb     470 gb        4           32 gb    128 gb        32 standard, 28 buy-in
Ocelote  Standard   400                   28          6 gb     168 gb        -           -        -             -
Ocelote  High Mem   1                     48          41 gb    1968 gb       -           -        -             -
Ocelote  GPU        46                    28          8 gb     224 gb        1           16 gb    16 gb         46
El Gato  Standard   130                   16          4 gb     62 gb         -           -        -             -

See here for example Slurm requests.
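
As a rough illustration of how the table above can guide a memory request, the sketch below multiplies a CPU count by the RAM/CPU ratio of the target node type. The helper name mem_for_cpus is hypothetical and not a cluster command; the ratios passed in are the ones listed in the table.

```shell
#!/bin/bash
# Hypothetical helper (not a cluster command): estimate a --mem value by
# multiplying the number of CPUs by the node type's RAM/CPU ratio in gb.
mem_for_cpus() {
    local ncpus=$1 gb_per_cpu=$2
    echo "$(( ncpus * gb_per_cpu ))gb"
}

# Puma standard node: 5 gb per CPU
mem_for_cpus 10 5    # 10 CPUs
mem_for_cpus 94 5    # a full 94-CPU node

# Ocelote standard node: 6 gb per CPU
mem_for_cpus 28 6    # a full 28-CPU node
```

The printed values could be used as the argument to --mem. Asking for more memory than CPUs * RAM/CPU on a given node type is the kind of request the table is meant to help you avoid.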

Other Job Limits

In addition to fitting your jobs within the constraints of our hardware, there are other limitations imposed by the scheduler to maintain fair use. 

  • Time Limit Per Job: A single job cannot run for more than 10 days (240 hours). Requesting more time than this will lead to your job being stuck in the queue indefinitely with reason code "QOSMaxWallDurationPerJobLimit" (see below for more reason codes). 

  • CPU Hours Per Group: The number of CPU hours used per job is subtracted from the PI's allocation. More info here

  • Active jobs, CPUs, GPUs, and Memory: To see the limits and usage of these items, log onto the cluster you wish to know more about, and type "job-limits <group>"
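
To make the time limit concrete, here is a minimal sketch that converts a runtime in whole hours to the DD-HH:MM:SS form used by --time and rejects anything over the 240-hour limit before submission. The function name format_walltime is hypothetical and exists only for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: turn whole hours into DD-HH:MM:SS for --time,
# refusing requests above the 10-day (240-hour) per-job limit.
format_walltime() {
    local hours=$1
    if (( hours > 240 )); then
        echo "error: ${hours} hours exceeds the 240-hour limit" >&2
        return 1
    fi
    printf '%d-%02d:00:00\n' $(( hours / 24 )) $(( hours % 24 ))
}

format_walltime 36     # one and a half days
format_walltime 240    # the maximum allowed
```

Checking the value locally like this is only a convenience; the scheduler remains the authority on what is accepted.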

Slurm and System Commands

Command

Purpose

Example(s)

Native Slurm Commands

sbatch

Submits a batch script for execution

sbatch script.slurm

srun

Runs parallel jobs. Can be used in place of mpirun/mpiexec. Can be used interactively as well as in batch scripts

srun -n 1 --mpi=pmi2 a.out

salloc

Requests a session to work on a compute node interactively

see: Interactive Sessions section below

squeue

Checks the status of pending and running jobs

squeue --job $JOBID
squeue --user $NETID

scancel

Cancel a running or pending job

scancel $JOBID
scancel -u $NETID

scontrol hold

Place a hold on a job to prevent it from being executed

scontrol hold $JOBID

scontrol release

Releases a hold placed on a job allowing it to be executed

scontrol release $JOBID

System Commands

va

Displays your group membership, your account usage, and CPU allocation. Short for "view allocation"

va

interactive

Shortcut for quickly requesting an interactive job. Use "interactive --help" to get full usage. 

interactive -a $GROUP_NAME

job-history

Retrieves a running or completed job's history in a user-friendly format

job-history $JOBID

seff

Retrieves a completed job's memory and CPU efficiency

seff $JOBID

past-jobs

Retrieves past jobs run by user. Can be used with option "-d N" to search for jobs run in the past N days.

past-jobs -d 5

job-limits

View your group's job resource limits and current usage.

job-limits $GROUP

nodes-busy

Display a visualization of nodes on a cluster and their usage

nodes-busy --help

system-busy

Display a text-based summary of a cluster's usage

system-busy

cluster-busy

Display a visualization of all three clusters' overall usage

cluster-busy --help
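
The commands in the table above are often chained together. The following is a minimal sketch, assuming a batch script named script.slurm (the same placeholder used in the table) sits in the current directory. It relies on sbatch printing a line of the form "Submitted batch job <id>"; the helper name extract_jobid is hypothetical. The submission step is skipped when sbatch is not available, so the snippet can be dry-run anywhere.

```shell
#!/bin/bash
# Hypothetical helper: pull the job ID (4th field) out of sbatch's
# "Submitted batch job <id>" confirmation line.
extract_jobid() {
    awk '/^Submitted batch job/ {print $4}'
}

if command -v sbatch >/dev/null 2>&1 && [ -f script.slurm ]; then
    # Submit the script and remember the job ID for later commands
    JOBID=$(sbatch script.slurm | extract_jobid)
    echo "Submitted job $JOBID"

    # Check the job's status in the queue
    squeue --job "$JOBID"

    # Uncomment to cancel the job
    # scancel "$JOBID"
else
    echo "sbatch or script.slurm not found; skipping submission" >&2
fi
```

Once the job has finished, the same ID can be passed to seff or job-history to review how the job used its resources.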

Batch Job Directives

Command 

Purpose

#SBATCH --account=group_name

Specify the account where hours are charged. Don't know your group name? Run the command "va" to see which groups you belong to

#SBATCH --partition=partition_name

Set the job partition. This determines your job's priority and the hours charged. See Job Partition Requests below for additional information

#SBATCH --time=DD-HH:MM:SS

Set the job's runtime limit in days, hours, minutes, and seconds. A single job cannot exceed 10 days or 240 hours.

#SBATCH --nodes=N

Allocate N nodes to your job.

For non-MPI enabled jobs, this should be set to "--nodes=1" to ensure access to all requested resources and prevent memory errors.

#SBATCH --ntasks=N
#SBATCH --cpus-per-task=M

--ntasks specifies the number of tasks (or processes) the job will run. For MPI jobs, this is the number of MPI processes. Most of the time, you can use --ntasks to specify the number of CPUs your job needs. However, in some cases you might run into issues. For example, see: Using Matlab

By default, you will be allocated one CPU per task. This can be increased by including the additional directive --cpus-per-task.

The number of CPUs a job is allocated is cpus-per-task * ntasks, or M*N.

#SBATCH --mem=Ngb

Select N gb of memory per node. If "gb" is not included, this value defaults to MB. Directives --mem and --mem-per-cpu are mutually exclusive.

#SBATCH --mem-per-cpu=Ngb

Select N GB of memory per CPU. Valid values can be found in the Node Types/Example Resource Requests section below. If "gb" is not included, this value defaults to MB.

#SBATCH --gres=gpu:N

Optional: (Ocelote and Puma only) Request N GPUs of any type.

#SBATCH --gres=gpu:volta:N

Optional: (Puma only) Request N V100 GPUs.

#SBATCH --gres=gpu:nvidia_a100_80gb_pcie_2g.20gb

Optional: (Puma only, Available 2/26/24) Request MIG GPU slice with 20 GB GPU memory. See MIG (Multi-Instance GPU) resources.

#SBATCH --constraint=hi_mem

Optional: Request a high memory node (Ocelote and Puma only).

#SBATCH --array=N-M

Submits an array job from indices N to M

#SBATCH --job-name=JobName

Optional: Specify a name for your job. This will not automatically affect the output filename.

#SBATCH -e output_filename.err
#SBATCH -o output_filename.out

Optional: Specify output filename(s). If -e is missing, stdout and stderr will be combined.

#SBATCH --open-mode=append

Optional: Append your job's output to the specified output filename(s). 

#SBATCH --mail-type=BEGIN|END|FAIL|ALL

Optional: Request email notifications. Beware of mail bombing yourself.

#SBATCH --mail-user=email@address.xyz

Optional: Specify email address. If this is missing, notifications will go to your UArizona email address by default.

#SBATCH --exclusive

Optional: Request exclusive access to node.

#SBATCH --export=VAR

Optional: Export a comma-delimited list of environment variables to a job. 

#SBATCH --export=all (default)

Optional: Export your working environment to your job.

#SBATCH --export=none

Optional: Do not export working environment to your job.
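
The directives above can be combined into a single batch script. The sketch below is one possible arrangement, not a recommended template: the job name, the account group_name, the partition name standard, and the output filenames are placeholders to replace with your own values. The body of the script simply reports the CPU total implied by --ntasks and --cpus-per-task. It reads SLURM_NTASKS and SLURM_CPUS_PER_TASK (a standard Slurm variable that is set when --cpus-per-task is given) and falls back to the requested values when run outside Slurm. The helper name total_cpus is hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --account=group_name
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=00-01:00:00
#SBATCH -o example_job.out
#SBATCH -e example_job.err

# Hypothetical helper: CPUs allocated = ntasks * cpus-per-task (N*M)
total_cpus() {
    echo $(( $1 * $2 ))
}

# Defaults mirror the directives above so the script also runs outside Slurm
NTASKS=${SLURM_NTASKS:-4}
CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-2}

echo "Tasks: $NTASKS, CPUs per task: $CPUS_PER_TASK"
echo "Total CPUs: $(total_cpus "$NTASKS" "$CPUS_PER_TASK")"
```

With these values the job asks for 4 * 2 = 8 CPUs at 5 gb each, which fits on a single Puma standard node. Because --mem-per-cpu is used, --mem is left out; the two are mutually exclusive.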

Slurm Environment Variables

Variable

Purpose

Example Value

$SLURM_ARRAY_JOB_ID

Job array's parent ID

399124

$SLURM_ARRAY_TASK_COUNT

Total number of subjobs in the array

4

$SLURM_ARRAY_TASK_ID

Job index number (unique for each job in the array)

1

$SLURM_ARRAY_TASK_MAX

Maximum index for the job array

7

$SLURM_ARRAY_TASK_MIN

Minimum index for the job array

1

$SLURM_ARRAY_TASK_STEP

Job array's index step size

2

$SLURM_CLUSTER_NAME

Which cluster your job is running on

elgato

$SLURM_CONF

Points to the Slurm configuration file

/var/spool/slurm/d/conf-cache/slurm.conf

$SLURM_CPUS_ON_NODE

Number of CPUs allocated to target node

3

$SLURM_GPUS_ON_NODE

Number of GPUs allocated to the target node

1

$SLURM_GPUS_PER_NODE

Number of GPUs per node. Only set if --gpus-per-node is specified

1

$SLURM_JOB_ACCOUNT

Account being charged

groupname

$SLURM_JOB_GPUS

The global GPU IDs of the GPUs allocated to the job. Only set in batch and interactive jobs.

0

$SLURM_JOB_ID

Your Slurm Job ID

399072

$SLURM_JOB_CPUS_PER_NODE

Number of CPUs per node. This can be a list if there is more than one node allocated to the job. The list has the same order as SLURM_JOB_NODELIST

3,1

$SLURM_JOB_NAME

The job's name

interactive

$SLURM_JOB_NODELIST

The nodes that have been assigned to your job

gpu[73-74]

$SLURM_JOB_NUM_NODES

The number of nodes allocated to the job

2

$SLURM_JOB_PARTITION

The job's partition

standard

$SLURM_JOB_QOS

The job's QOS/Partition

qos_standard_part

$SLURM_JOB_USER

The username of the person who submitted the job

netid

$SLURM_JOBID

Same as SLURM_JOB_ID, your Slurm Job ID

399072

$SLURM_MEM_PER_CPU

The memory/CPU ratio allocated to the job

4096

$SLURM_NNODES

Same as SLURM_JOB_NUM_NODES – the number of nodes allocated to the job

2

$SLURM_NODELIST

Same as SLURM_JOB_NODELIST, the nodes that have been assigned to your job

gpu[73-74]

$SLURM_NPROCS

The number of tasks allocated to your job

4

$SLURM_NTASKS

Same as SLURM_NPROCS, the number of tasks allocated to your job

4

$SLURM_SUBMIT_DIR

The directory where sbatch was used to submit the job

/home/u00/netid

$SLURM_SUBMIT_HOST

The hostname where sbatch was used to submit the job

wentletrap.hpc.arizona.edu

$SLURM_TASKS_PER_NODE

The number of tasks to be initiated on each node. This can be a list if there is more than one node allocated to the job. The list has the same order as SLURM_JOB_NODELIST

3,1

$SLURM_WORKING_CLUSTER

Valid for interactive jobs, will be set with remote sibling cluster's IP address, port and RPC version so that any sruns will know which cluster to communicate with.

elgato:foo:0000:0000:000
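
As an illustration of how several of these variables can be used together, here is a minimal sketch of an array job in which each subjob picks its own input file from SLURM_ARRAY_TASK_ID. The account and partition values, the input_N.txt naming scheme, and the helper input_for_task are all hypothetical. Each variable falls back to a placeholder so the script can be dry-run outside Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --account=group_name
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --array=1-4

# Fall back to placeholder values when the script is run outside Slurm
TASK_ID=${SLURM_ARRAY_TASK_ID:-1}
PARENT_ID=${SLURM_ARRAY_JOB_ID:-0}
SUBMIT_DIR=${SLURM_SUBMIT_DIR:-$PWD}
CLUSTER=${SLURM_CLUSTER_NAME:-unknown}

# Hypothetical naming scheme: one input file per array index,
# located in the directory the job was submitted from.
input_for_task() {
    local task_id=$1 dir=$2
    printf '%s/input_%d.txt\n' "$dir" "$task_id"
}

INPUT_FILE=$(input_for_task "$TASK_ID" "$SUBMIT_DIR")
echo "Array job $PARENT_ID, task $TASK_ID on $CLUSTER will read $INPUT_FILE"
```

With --array=1-4, Slurm runs four subjobs, and each one sees a different SLURM_ARRAY_TASK_ID between 1 and 4.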