Running Jobs with Slurm


Increased monthly allocations!

On March 1st, 2024, the monthly allocation for each group was increased to 150,000 CPU hours on Puma (previously 100,000) and 100,000 CPU hours on Ocelote (previously 70,000).


New GPUs on Ocelote!

We have recently added 22 new P100 GPUs to Ocelote. Need multiple GPUs on a node but finding Puma queue times too long? You can now request two GPUs per node on Ocelote using --gres=gpu:2.

Overview

All three clusters, Puma, Ocelote, and ElGato, use Slurm for resource management and job scheduling.

Additional Slurm Resources and Examples

| Link | Description |
| Official SchedMD User Documentation | Official SchedMD user documentation. Includes detailed information on Slurm directives and commands. |
| PBS ⇔ Slurm Rosetta Stone | Table for converting some common PBS job directives to Slurm syntax. |
| HPC Quick Start | HPC Quick Start guide. If you have never submitted a batch job before, this is a great place to start. |
| Job Examples | Basic Slurm example scripts. Includes PBS scripts for comparison. |
| Even More Job Examples! | Growing repository of example Slurm submission scripts. |
| Intro to HPC | A recorded video presentation of our Intro to HPC workshop. Keep your eyes peeled for periodic announcements in the HPC listserv on upcoming live sessions! |

Node Summary

Before submitting a Slurm script, you must know (or at least have a general idea of) the resources your job needs. This will tell you which type of node to request, how much memory, and other useful information that can be provided to the system via your batch script. A detailed list of Slurm batch flags is included below.

General Overview

| Node Type | Description |
| Standard CPU Node | The general purpose node, which can (and should) be used by the majority of jobs. |
| High Memory CPU Node | Similar to the standard nodes, but with significantly more RAM. There are only a few of them, and they should only be requested if you have tested your job on a standard node and found that its memory usage is too high. Both standard and high memory nodes share the same file system, so there is no advantage in terms of long term storage, only active RAM usage. |
| GPU Node | Similar to the standard node, but with one or more GPUs available, depending on which cluster is in use. |

Hardware Limitations by Node Type and Cluster

Please consult the following table when crafting Slurm submission scripts. Requesting more resources than are available on a given cluster and node type may lead to errors or delays.

| Cluster | Node Type | N Nodes | N CPU/Node | RAM/CPU | CPU RAM/Node | N GPU/Node | RAM/GPU | GPU RAM/Node | Total N GPUs |
| Puma | Standard | 236 | 94 | 5 gb | 470 gb | - | - | - | - |
| Puma | High Mem | 3 standard, 2 buy-in | 94 | 32 gb | 3008 gb | - | - | - | - |
| Puma | GPU | 8 standard, 7 buy-in | 94 | 5 gb | 470 gb | 4 | 32 gb | 128 gb | 32 standard, 28 buy-in |
| Ocelote | Standard | 400 | 28 | 6 gb | 168 gb | - | - | - | - |
| Ocelote | High Mem | 1 | 48 | 41 gb | 1968 gb | - | - | - | - |
| Ocelote | GPU | 46 | 28 | 8 gb | 224 gb | 1 | 16 gb | 16 gb | 46 |
| ElGato | Standard | 130 | 16 | 4 gb | 62 gb | - | - | - | - |

See the Node Types/Example Resource Requests section below for example Slurm requests.

Other Job Limits

In addition to fitting your jobs within the constraints of our hardware, there are other limitations imposed by the scheduler to maintain fair use. 

  • Time Limit Per Job: A single job cannot run for more than 10 days (240 hours). Requesting more time than this will leave your job stuck in the queue indefinitely with the reason code "QOSMaxWallDurationPerJobLimit" (see below for more reason codes).
  • CPU Hours Per Group: The number of CPU hours used by a job is deducted from the PI's allocation. More info here.
  • Active jobs, CPUs, GPUs, and Memory: To see these limits and your group's current usage, log onto the cluster you wish to know more about and run "job-limits <group>" (see the example below).
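
For example, before submitting a large batch of jobs you might check your group's remaining hours and resource limits. This is a minimal sketch; "mygroup" is a placeholder for your own group name:

$ va                      # view your group membership and allocation usage
$ job-limits mygroup      # view your group's CPU, GPU, memory, and active-job limits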


Slurm and System Commands

Native Slurm Commands

| Command | Purpose | Example(s) |
| sbatch | Submits a batch script for execution | sbatch script.slurm |
| srun | Runs parallel jobs. Can be used in place of mpirun/mpiexec, both interactively and in batch scripts | srun -n 1 --mpi=pmi2 a.out |
| salloc | Requests a session to work on a compute node interactively | see the Interactive Jobs section below |
| squeue | Checks the status of pending and running jobs | squeue --job $JOBID; squeue --user $NETID |
| scancel | Cancels a running or pending job | scancel $JOBID; scancel -u $NETID |
| scontrol hold | Places a hold on a job to prevent it from being executed | scontrol hold $JOBID |
| scontrol release | Releases a hold placed on a job, allowing it to be executed | scontrol release $JOBID |

System Commands

| Command | Purpose | Example(s) |
| va | Displays your group membership, your account usage, and CPU allocation. Short for "view allocation" | va |
| interactive | Shortcut for quickly requesting an interactive job. Use "interactive --help" for full usage | interactive -a $GROUP_NAME |
| job-history | Retrieves a running or completed job's history in a user-friendly format | job-history $JOBID |
| seff | Retrieves a completed job's memory and CPU efficiency | seff $JOBID |
| past-jobs | Retrieves past jobs run by the user. Can be used with the option "-d N" to search for jobs run in the past N days | past-jobs -d 5 |
| job-limits | Views your group's job resource limits and current usage | job-limits $GROUP |
| nodes-busy | Displays a visualization of the nodes on a cluster and their usage | nodes-busy --help |
| system-busy | Displays a text-based summary of a cluster's usage | system-busy |
| cluster-busy | Displays a visualization of all three clusters' overall usage | cluster-busy --help |
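
To illustrate how these commands fit together, below is a minimal sketch of a typical submit-and-monitor workflow; script.slurm is a placeholder batch script and 12345 a placeholder job ID:

$ sbatch script.slurm        # submit the job; Slurm replies with the new job ID
$ squeue --user $NETID       # check whether your jobs are pending or running
$ job-history 12345          # review a running or completed job's history
$ seff 12345                 # after completion, check memory and CPU efficiency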

Batch Job Directives

| Command | Purpose |
| #SBATCH --account=group_name | Specify the account where hours are charged. Don't know your group name? Run the command "va" to see which groups you belong to. |
| #SBATCH --partition=partition_name | Set the job partition. This determines your job's priority and the hours charged. See the Job Partition Requests section below for additional information. |
| #SBATCH --time=DD-HH:MM:SS | Set the job's runtime limit in days, hours, minutes, and seconds. A single job cannot exceed 10 days (240 hours). |
| #SBATCH --nodes=N | Allocate N nodes to your job. For non-MPI jobs, this should be set to "--nodes=1" to ensure access to all requested resources and prevent memory errors. |
| #SBATCH --ntasks=N | Specify the number of tasks (processes) the job will run. For MPI jobs, this is the number of MPI processes. Most of the time you can use ntasks to specify the number of CPUs your job needs, though in some odd cases you might run into issues (for example, see: Using Matlab). By default, one CPU is allocated per task; this can be increased with the additional directive --cpus-per-task. |
| #SBATCH --cpus-per-task=M | Allocate M CPUs to each task. The total number of CPUs a job is allocated is cpus-per-task * ntasks, or M * N. |
| #SBATCH --mem=Ngb | Request N GB of memory per node. If "gb" is not included, this value defaults to MB. The directives --mem and --mem-per-cpu are mutually exclusive. |
| #SBATCH --mem-per-cpu=Ngb | Request N GB of memory per CPU. Valid values can be found in the Node Types/Example Resource Requests section below. If "gb" is not included, this value defaults to MB. |
| #SBATCH --gres=gpu:N | Optional: (Ocelote and Puma only) Request N GPUs of any type. |
| #SBATCH --gres=gpu:volta:N | Optional: (Puma only) Request N V100 GPUs. |
| #SBATCH --gres=gpu:nvidia_a100_80gb_pcie_2g.20gb | Optional: (Puma only, available 2/26/24) Request a MIG GPU slice with 20 GB of GPU memory. See MIG (Multi-Instance GPU) resources. |
| #SBATCH --constraint=hi_mem | Optional: Request a high memory node (Ocelote and Puma only). |
| #SBATCH --array=N-M | Submit an array job with indices N through M. |
| #SBATCH --job-name=JobName | Optional: Specify a name for your job. This will not automatically affect the output filename. |
| #SBATCH -e output_filename.err; #SBATCH -o output_filename.out | Optional: Specify output filename(s). If -e is missing, stdout and stderr will be combined. |
| #SBATCH --open-mode=append | Optional: Append your job's output to the specified output filename(s). |
| #SBATCH --mail-type=TYPE | Optional: Request email notifications, where TYPE is one of BEGIN, END, FAIL, or ALL. Beware of mail bombing yourself. |
| #SBATCH --mail-user=email@address.xyz | Optional: Specify the email address for notifications. If this is missing, notifications will go to your UArizona email address by default. |
| #SBATCH --exclusive | Optional: Request exclusive access to a node. |
| #SBATCH --export=VAR | Optional: Export a comma-delimited list of environment variables to the job. |
| #SBATCH --export=all | Optional (default): Export your working environment to your job. |
| #SBATCH --export=none | Optional: Do not export your working environment to your job. |
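
Putting several of these directives together, the following is a minimal sketch of a complete batch script; the group name, module name, and program name are placeholders to replace with your own:

#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --account=YOUR_GROUP          # placeholder: run "va" to find your group name
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=01:00:00
#SBATCH -o %x_%j.out                  # output file named after the job name and job ID

# Load whatever modules your program needs, then run it.
# "mymodule" and "./my_program" are placeholders.
module load mymodule
./my_program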

Slurm Environment Variables

| Variable | Purpose | Example Value |
| $SLURM_ARRAY_JOB_ID | Job array's parent ID | 399124 |
| $SLURM_ARRAY_TASK_COUNT | Total number of subjobs in the array | 4 |
| $SLURM_ARRAY_TASK_ID | Job index number (unique for each job in the array) | 1 |
| $SLURM_ARRAY_TASK_MAX | Maximum index for the job array | 7 |
| $SLURM_ARRAY_TASK_MIN | Minimum index for the job array | 1 |
| $SLURM_ARRAY_TASK_STEP | Job array's index step size | 2 |
| $SLURM_CLUSTER_NAME | Which cluster your job is running on | elgato |
| $SLURM_CONF | Points to the Slurm configuration file | /var/spool/slurm/d/conf-cache/slurm.conf |
| $SLURM_CPUS_ON_NODE | Number of CPUs allocated to the target node | 3 |
| $SLURM_GPUS_ON_NODE | Number of GPUs allocated to the target node | 1 |
| $SLURM_GPUS_PER_NODE | Number of GPUs per node. Only set if --gpus-per-node is specified | 1 |
| $SLURM_JOB_ACCOUNT | Account being charged | groupname |
| $SLURM_JOB_GPUS | The global GPU IDs of the GPUs allocated to the job. Only set in batch and interactive jobs | 0 |
| $SLURM_JOB_ID | Your Slurm job ID | 399072 |
| $SLURM_JOB_CPUS_PER_NODE | Number of CPUs per node. This can be a list if more than one node is allocated to the job. The list has the same order as SLURM_JOB_NODELIST | 3,1 |
| $SLURM_JOB_NAME | The job's name | interactive |
| $SLURM_JOB_NODELIST | The nodes that have been assigned to your job | gpu[73-74] |
| $SLURM_JOB_NUM_NODES | The number of nodes allocated to the job | 2 |
| $SLURM_JOB_PARTITION | The job's partition | standard |
| $SLURM_JOB_QOS | The job's QOS/partition | qos_standard_part |
| $SLURM_JOB_USER | The username of the person who submitted the job | netid |
| $SLURM_JOBID | Same as SLURM_JOB_ID, your Slurm job ID | 399072 |
| $SLURM_MEM_PER_CPU | The memory/CPU ratio allocated to the job | 4096 |
| $SLURM_NNODES | Same as SLURM_JOB_NUM_NODES, the number of nodes allocated to the job | 2 |
| $SLURM_NODELIST | Same as SLURM_JOB_NODELIST, the nodes that have been assigned to your job | gpu[73-74] |
| $SLURM_NPROCS | The number of tasks allocated to your job | 4 |
| $SLURM_NTASKS | Same as SLURM_NPROCS, the number of tasks allocated to your job | 4 |
| $SLURM_SUBMIT_DIR | The directory where sbatch was used to submit the job | /home/u00/netid |
| $SLURM_SUBMIT_HOST | The hostname where sbatch was used to submit the job | wentletrap.hpc.arizona.edu |
| $SLURM_TASKS_PER_NODE | The number of tasks to be initiated on each node. This can be a list if more than one node is allocated to the job. The list has the same order as SLURM_JOB_NODELIST | 3,1 |
| $SLURM_WORKING_CLUSTER | Valid for interactive jobs; set with the remote sibling cluster's IP address, port, and RPC version so that any sruns know which cluster to communicate with | elgato:foo:0000:0000:000 |
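
As an illustration of how these variables can be used inside a job, here is a minimal sketch of an array job in which each subjob selects its own input file; the group name, program, and input naming scheme are placeholders:

#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --account=YOUR_GROUP          # placeholder
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
#SBATCH --array=1-4

# Each subjob receives a different SLURM_ARRAY_TASK_ID (1 through 4 here).
echo "Subjob $SLURM_ARRAY_TASK_ID of $SLURM_ARRAY_TASK_COUNT running on $SLURM_JOB_NODELIST"
./my_program input_${SLURM_ARRAY_TASK_ID}.txt    # placeholder program and inputs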

Slurm Reason Codes

Sometimes, if you check a pending job using squeue, a message will appear under the Reason column indicating why your job is not running. Some of these codes are non-intuitive, so a human-readable translation is provided below:

| Reason | Explanation |
| AssocGrpCpuLimit | This is a per-group limit on the number of CPUs that can be used simultaneously by all group members. Your job is not running because this limit has been reached. Check your group's limits using "job-limits <group_name>". |
| AssocGrpMemLimit | This is a per-group limit on the amount of memory that can be used simultaneously by all group members. Your job is not running because this limit has been reached. Check your group's limits using "job-limits <group_name>". |
| AssocGrpCPUMinutesLimit | Either your group is out of CPU hours or your job would exhaust your group's remaining CPU hours. |
| AssocGrpGRES | This is a per-group limit on the number of GPUs that can be used simultaneously by all group members. Your job is not running because this limit has been reached. Check your group's limits using "job-limits <group_name>". |
| Dependency | Your job depends on the completion of another job. It will wait in the queue until the target job completes. |
| QOSGrpCPUMinutesLimit | Your group's high priority or qualified hours allocation has been exhausted for the month. |
| QOSMaxWallDurationPerJobLimit | Your job's time limit exceeds the maximum allowed and the job will never run. Generally, the limit is 10 days (240 hours). To see an individual job's limits, run "job-limits <group_name>". |
| Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions | This very long message simply means your job is waiting in the queue until there is enough space for it to run. |
| Priority | Your job is waiting in the queue until there is enough space for it to run. |
| QOSMaxCpuPerUserLimit | This is a per-user limit on the number of CPUs that you can use simultaneously among all of your jobs. Your job is not running because this limit has been reached. Check your user limits using "job-limits <group_name>". |
| ReqNodeNotAvail, Reserved for maintenance | Your job's time limit overlaps with an upcoming maintenance window. Run "uptime_remaining" to see when the system will go offline. If you remove and resubmit your job with a shorter walltime that does not overlap with the maintenance window, it will likely run. Otherwise, it will remain pending until after the maintenance window. |
| Resources | Your job is waiting in the queue until the required resources are available. |
| ReqNodeNotAvail, UnavailableNodes:<long list of node names> | The nodes needed for the job are currently unavailable. This is most commonly seen when a node is down (for example, following maintenance or due to technical issues) or busy with running jobs. Once the node type is back online and able to accept jobs, your work should run. |

Job Partition Requests

| Partition | Slurm Directives | Details |
| standard | #SBATCH --account=<PI GROUP>; #SBATCH --partition=standard | Consumes your group's standard allocation. These jobs cannot be interrupted. |
| windfall | #SBATCH --partition=windfall | Does not consume your group's standard allocation. Jobs may be interrupted and restarted by higher-priority jobs. The --account flag needs to be omitted or an error will occur. |
| high_priority | #SBATCH --account=<PI GROUP>; #SBATCH --partition=high_priority; #SBATCH --qos=user_qos_<PI GROUP> | Available for groups who have purchased compute resources. |
| qualified | #SBATCH --account=<PI GROUP>; #SBATCH --partition=standard; #SBATCH --qos=qual_qos_<PI GROUP> | Available for groups that have submitted a special project request. |
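
For instance, moving a job between the standard and windfall partitions is just a matter of swapping the header lines above; a sketch with a placeholder group name:

# Standard submission (charged to your group's allocation, not preemptible):
#SBATCH --account=YOUR_GROUP
#SBATCH --partition=standard

# Windfall submission (not charged, but may be interrupted); note that --account is omitted:
#SBATCH --partition=windfall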

Slurm Output Filename Patterns

Slurm offers ways to make your job's output filenames customizable through the use of character replacements. A table is provided below as a guide with some examples. Variables may be used or combined as desired. Note: character replacements may also be used with other SBATCH directives such as error filename, input filename, and job name.

| Variable | Meaning | Example Slurm Directive(s) | Output |
| %A | A job array's main job ID | #SBATCH --array=1-2; #SBATCH -o %A.out; #SBATCH --open-mode=append | 12345.out |
| %a | A job array's index number | #SBATCH --array=1-2; #SBATCH -o %A_%a.out | 12345_1.out and 12345_2.out |
| %J | Job ID plus stepid | #SBATCH -o %J.out | 12345.out |
| %j | Job ID | #SBATCH -o %j.out | 12345.out |
| %N | Hostname of the first compute node allocated to the job | #SBATCH -o %N.out | r1u11n1.out |
| %u | Username | #SBATCH -o %u.out | netid.out |
| %x | Job name | #SBATCH --job-name=JobName; #SBATCH -o %x.out | JobName.out |
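
For example, to keep stdout and stderr in separate files named after the job name and job ID, a header like the following could be used; the filenames in the comments assume a hypothetical job named "analysis" that received job ID 12345:

#SBATCH --job-name=analysis
#SBATCH -o %x_%j.out      # -> analysis_12345.out
#SBATCH -e %x_%j.err      # -> analysis_12345.err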

Node Types/Example Resource Requests

Standard Nodes

| Cluster | Max CPUs | Mem/CPU | Max Mem | Sample Request Statement |
| ElGato | 16 | 4gb | 62gb | #SBATCH --nodes=1; #SBATCH --ntasks=16; #SBATCH --mem-per-cpu=4gb |
| Ocelote | 28 | 6gb | 168gb | #SBATCH --nodes=1; #SBATCH --ntasks=28; #SBATCH --mem-per-cpu=6gb |
| Puma | 94 | 5gb | 470gb | #SBATCH --nodes=1; #SBATCH --ntasks=94; #SBATCH --mem-per-cpu=5gb |

GPU Nodes

During the quarterly maintenance cycle on April 27, 2022, the ElGato K20s and Ocelote K80s were removed because they are no longer supported by Nvidia.

GPU jobs are requested using the generic resource (--gres) Slurm directive. In general, the directive to request N GPUs will be of the form --gres=gpu:N.

| Cluster | Max CPUs | Mem/CPU | Max Mem | Sample Request Statement |
| Ocelote | 28 | 8gb | 224gb | #SBATCH --nodes=1; #SBATCH --ntasks=28; #SBATCH --mem-per-cpu=8gb; #SBATCH --gres=gpu:1 |
| Puma | 94 | 5gb | 470gb | #SBATCH --nodes=1; #SBATCH --ntasks=94; #SBATCH --mem-per-cpu=5gb; #SBATCH --gres=gpu:1 |

Up to four GPUs may be requested on a single Puma GPU node with --gres=gpu:N, where N is 1, 2, 3, or 4.
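
Putting this together, below is a minimal sketch of a single-GPU batch job on Puma; the group name, module, and program are placeholders, and you would load whichever CUDA or application module your code actually needs:

#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --account=YOUR_GROUP          # placeholder
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=5gb
#SBATCH --gres=gpu:1                  # request one GPU on the node
#SBATCH --time=02:00:00

module load cuda_module               # placeholder: check "module avail" for what is installed
./my_gpu_program                      # placeholder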

High Memory Nodes

When requesting a high memory node, include both the --mem-per-cpu and --constraint=hi_mem directives.

| Cluster | Max CPUs | Mem/CPU | Max Mem | Sample Request Statement |
| Ocelote | 48 | 41gb | 2015gb | #SBATCH --nodes=1; #SBATCH --ntasks=48; #SBATCH --mem-per-cpu=41gb; #SBATCH --constraint=hi_mem |
| Puma | 94 | 32gb | 3000gb | #SBATCH --nodes=1; #SBATCH --ntasks=94; #SBATCH --mem-per-cpu=32gb; #SBATCH --constraint=hi_mem |

Total Job Memory vs. CPU Count

Interested in learning more about how memory and CPU count are related? Check out our YouTube video!

Job Memory and CPU Count are Correlated

The memory your job is allocated is dependent on the number of CPUs you request.

For example, on Puma standard nodes, you get 5G for each CPU you request, so a standard job using 4 CPUs gets 5G/CPU × 4 CPUs = 20G of total memory. Each node type has its own memory ratio, determined by its total memory divided by its total number of CPUs. A reference for all the node types, their memory ratios, and how to request each can be found in the Node Types/Example Resource Requests section above.
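
For instance, the following Puma request is allocated 20G of total memory, assuming no --mem or --mem-per-cpu directive overrides the standard 5G/CPU ratio:

#SBATCH --nodes=1
#SBATCH --ntasks=4        # 4 CPUs x 5G/CPU = 20G of total memory on a Puma standard node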

What Happens if My Memory and CPU Requests Don't Match?

Our systems are configured to try to help when your memory request does not match your CPU count.

For example, if you request 1 CPU and 470G of memory on Puma, the system will automatically scale up your CPU count to 94 to ensure you get your full memory requirement. This does not go the other way: if you request less memory than your CPU count would provide, no adjustments are made. If you omit the --mem flag entirely, the system will use the memory ratio of the standard nodes on that cluster.

Possible Problems You Might Encounter

  • Be careful when using the --mem-per-cpu flag. If you use a higher value than the standard node ratio, you may inadvertently wind up in the queue for a high memory node. On Puma, only three of these machines are available for standard jobs, and only one on Ocelote, so wait times are frequently longer than those for standard nodes. If you notice your job is in the queue much longer than you would expect, check it with job-history to ensure the memory ratio looks correct.
  • Stick to using --ntasks=N and --cpus-per-task=M to request N × M CPUs (see the sketch below). Using the flag -c N to request CPUs has been found to cause problems with memory requests and may inadvertently limit you to ~4MB of total memory.
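
A minimal sketch of a header that follows this advice, requesting 4 tasks with 2 CPUs each for 8 CPUs total (the threading line assumes your program is actually multithreaded):

#SBATCH --nodes=1
#SBATCH --ntasks=4            # N = 4 tasks
#SBATCH --cpus-per-task=2     # M = 2 CPUs per task, so 4 x 2 = 8 CPUs total

# For threaded codes, match the thread count to the allocation:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK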

Interactive Jobs

Want your session to start faster? Try one or both of the following:

  • Switch to ElGato. This cluster shares the same operating system, software, and file system as Puma, so your workflows are often portable across clusters. Ocelote and ElGato standard nodes have 28 and 16 CPUs, respectively, and are often less utilized than Puma, meaning much shorter wait times. Before you run the interactive command, type elgato to switch.
  • Use the account flag. By default, interactive requests a session using the windfall partition. Windfall is lower priority than standard, so these jobs take longer to get through the queue. Including the account flag switches your partition to standard. An example of this type of request:

    $ interactive -a YOUR_GROUP

When you are on a login node, you can request an interactive session on a compute node. This is useful for checking available modules, testing submission scripts, compiling software, and running programs directly from the command line. We have a built-in shortcut command that will allow you to quickly and easily request a session by simply entering: interactive

When you request a session, the full salloc command being executed will be displayed for verification/copying/editing/pasting purposes. For example:

(ocelote) [netid@junonia ~]$ interactive
Run "interactive -h for help customizing interactive use"
Submitting with /usr/local/bin/salloc --job-name=interactive --mem-per-cpu=4GB --nodes=1    --ntasks=1 --time=01:00:00 --account=windfall --partition=windfall
salloc: Pending job allocation 531843
salloc: job 531843 queued and waiting for resources
salloc: job 531843 has been allocated resources
salloc: Granted job allocation 531843
salloc: Waiting for resource configuration
salloc: Nodes i16n1 are ready for job
[netid@i16n1 ~]$ 

Notice in the example above how the command prompt changes once your session starts. When you're on a login node, your prompt will show "junonia" or "wentletrap". Once you're in an interactive session, you'll see the name of the compute node you're connected to. 

If no options are supplied to the command interactive, your job will automatically run using the windfall partition for one hour using one CPU. To use the standard partition, include the flag "-a" followed by your group's name. To see all the customization options:

(ocelote) [netid@junonia ~]$ interactive -h
Usage: /usr/local/bin/interactive [-x] [-g] [-N nodes] [-m memory per core] [-n ncpus per node] [-Q optional qos] [-t hh::mm:ss] [-a account to charge]

You may also create your own salloc commands using any desired Slurm directives for maximum customization.
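
For example, a custom request for a four-CPU, two-hour session on the standard partition might look like the following, with YOUR_GROUP as a placeholder for your own group name:

$ salloc --job-name=interactive --nodes=1 --ntasks=4 --mem-per-cpu=5GB --time=02:00:00 --account=YOUR_GROUP --partition=standard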

MPI Jobs

OpenMPI

For OpenMPI, the important environment variables are set by default, so you do not need to include them in your scripts.

Default OpenMPI variables
export SBATCH_GET_USER_ENV=1
export OMPI_MCA_btl_openib_cpc_include=rdmacm
export OMPI_MCA_btl_openib_if_include=bnxt_re1
export OMPI_MCA_btl_openib_rroce_enable=1
export OMPI_MCA_btl=vader,self,openib
export OMPI_MCA_oob_tcp_if_include=eth1

Intel MPI

For Intel MPI, these variables are set for you:

module unload openmpi3 gnu8

If you're using Intel MPI with mpirun and are getting errors, try replacing mpirun -np $NPROCESSES with:

srun -n $NPROCESSES --mpi=pmi2
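
For context, here is a minimal sketch of an MPI batch script that uses srun as the launcher; the group name, module names, and executable are placeholders and should match your own build environment:

#!/bin/bash
#SBATCH --job-name=mpi_example
#SBATCH --account=YOUR_GROUP          # placeholder
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=28
#SBATCH --time=00:30:00

module load gnu8 openmpi3             # placeholder modules; load the toolchain your code was built with
srun -n $SLURM_NTASKS --mpi=pmi2 ./a.out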

Parallel Work

To make proper use of a supercomputer, you will likely want to take advantage of many cores. Puma has 94 cores in each node available to Slurm. The exception is running hundreds or thousands of independent jobs using high throughput computing.

We have a training course which explains the concepts and terminology of parallel computing with some examples: Introduction to Parallel Computing.

This practical course on Parallel Analysis in R is also useful.