Jobs and Scheduling
There are a few reasons your job may not be running. Check below for some ideas on diagnosing the issue:
- Run squeue --job <jobid> and see if there is anything listed under "(REASON)". This may give an idea of why your job is stuck in the queue. We have a table in our SLURM documentation that describes what each Reason code means.
- Due to the number of HPC users, it may not always be possible to run a submitted job immediately. If there are insufficient resources available, your job will be queued and it may take up to a few hours for it to begin executing.
- Your group may have run out of standard hours. You can check your allocation using the command va.
- Your group or job has reached a resource usage limit (e.g., the number of GPUs that may be used concurrently by a group, or a job that has requested more than the 10-day maximum walltime). Try running job-limits <group_name> to see what limits you're subject to and whether any problem jobs are listed.
- You may be requesting a rare resource (e.g., 4 GPUs on a single node on Puma, or a high-memory node).
- If you are requesting a single GPU on Puma and are frustrated with the wait times, you might consider checking if Ocelote will work for your analyses. There are more GPU nodes available on that cluster with shorter wait times.
- If you are trying to run a job on a standard node and have been waiting for a very long time, try checking its status using job-history <jobid>. If you see Allocated RAM/CPU above 5 GB on Puma or above 6 GB on Ocelote, then you are queued for the high-memory node, which can have very long wait times. To queue for a standard node, cancel your job and check that your script requests the correct memory-to-CPU ratio (see the example header after this list).
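As referenced in the last point above, here is a minimal sketch of a batch script header that keeps the memory request at the standard-node ratio. The job name, account, partition, CPU count, and walltime are placeholders; adjust them for your own work:
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --account=<group_name>
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=01:00:00
With --mem-per-cpu=5gb on Puma (or 6gb on Ocelote), the allocated RAM/CPU stays at the standard-node ratio, so the job will not be queued for the high-memory node.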
If your job is in queue, sometimes SLURM will give you information on why it's not running. This may be for a number of reasons: for example, there may be an upcoming maintenance cycle, your group's allocation may be exhausted, you may have requested resources that surpass system limits, or the node type you've requested may be very busy running jobs. We have a list of reason codes in our Running Jobs With SLURM page that will give more comprehensive information on what these messages mean. If you don't see the reason code listed, contact our consultants.
You can check your group's remaining allocation using the command va. To see more information on your allotted hours and the different job queues, see: Allocation and Limits.
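As a quick illustration of pulling up the reason code mentioned above (the job ID is a placeholder), squeue's format options can print just the job ID, state, and reason:
squeue --job <jobid> --format="%i %T %r"
Here %i is the job ID, %T the job state, and %r the reason code, which you can then look up in the reason code table.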
There are a few reasons you might get out-of-memory errors:
- You're using -c <N> to request CPUs. Based on the way our scheduler is set up, this will reduce the memory allocation for your job to ~4MB. To solve this, change your CPU request by setting either --ntasks=<N> or --ntasks=1 --cpus-per-task=<N>.
- You may not have specified the number of nodes required for your job. For non-MPI workflows, if SLURM scatters your CPUs across multiple nodes, you will only have access to the resources on the executing node. Explicitly setting --nodes in your script should help, e.g.:
#SBATCH --nodes=1
- You may not have allocated enough memory to your job. Try running seff <jobid> to see your memory usage. You may consider using memory profiling techniques, allocating more CPUs, or using the high-memory node.
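If you want more detail than seff's summary, one complementary check is SLURM's accounting command sacct; the field list below is just one reasonable choice:
sacct --jobs <jobid> --format=JobID,State,Elapsed,ReqMem,MaxRSS
Comparing ReqMem (the memory you requested) with MaxRSS (the peak memory each job step actually used) will show whether the job simply needs a larger allocation.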
Yes, when you start an interactive session via the terminal or submit a batch job, the modules gnu8, openmpi3, and cmake are loaded by default. If you need to use intel, you'll want to unload openmpi3 and gnu8 first.
However, if you start a terminal in an interactive desktop session through Open OnDemand, no modules are loaded by default in that environment. To start, at a minimum you'll want to run the command:
module load ohpc
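Putting both situations together, here is a short sketch of the module commands (the module names are those mentioned above; anything beyond them is workflow-specific):
# In a batch job or terminal session where gnu8 and openmpi3 are loaded by default,
# switch to the Intel toolchain:
module unload openmpi3 gnu8
module load intel
# In a terminal opened from an Open OnDemand desktop session, load the base
# environment first, then any other modules your work requires:
module load ohpc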