Jobs and Scheduling
There are a few reasons your job may not be running. Check below for some ideas on diagnosing the issue:
- Run squeue --job <jobid> and see if there is anything listed under "(REASON)". This may give an idea of why your job is stuck in the queue. We have a table in our SLURM documentation that describes what each Reason code means.
- Due to the number of HPC users, it may not always be possible to run a submitted job immediately. If there are insufficient resources available, your job will be queued and it may take up to a few hours for it to begin executing.
- Your group may have run out of standard hours. You can check your allocation using the command va.
- Your group or job has reached a resource usage limit (e.g., the number of GPUs that may be used concurrently by a group, or a job that has requested more than the 10-day maximum walltime). Try running job-limits <group_name> to see what limits you're subject to and whether any problem jobs are listed.
- You may be requesting a rare resource (e.g., 4 GPUs on a single node on Puma, or a high-memory node).
- If you are requesting a single GPU on Puma and are frustrated with the wait times, you might consider checking if Ocelote will work for your analyses. There are more GPU nodes available on that cluster with shorter wait times.
- If you are trying to run a job on a standard node and have been waiting for a very long time, try checking its status using job-history <jobid>. If you see Allocated RAM/CPU above 5 GB on Puma or above 6 GB on Ocelote, then you are queued for the high-memory node, which can have very long wait times. To queue for a standard node, cancel your job and check that your script requests the correct memory-to-CPU ratio (see the example header after this list).
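As referenced in the last point above, here is a minimal sketch of a batch script header that keeps the memory request at the standard-node ratio. The job name, account, partition, CPU count, and walltime are placeholders; adjust them for your own work:
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --account=<group_name>
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=01:00:00
With --mem-per-cpu=5gb on Puma (or 6gb on Ocelote), the allocated RAM/CPU stays at the standard-node ratio, so the job will not be queued for the high-memory node.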
If your job is in queue, sometimes SLURM will give you information on why it's not running. This may be for a number of reasons: for example, there may be an upcoming maintenance cycle, your group's allocation may be exhausted, you may have requested resources that surpass system limits, or the node type you've requested may be very busy running jobs. We have a list of reason codes in our Running Jobs With SLURM page that will give more comprehensive information on what these messages mean. If you don't see the reason code listed, contact our consultants.
You can check your group's remaining allocation using the command va. To see more information on your allotted hours and the different job queues, see: Allocation and Limits.
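As a quick illustration of pulling up the reason code mentioned above (the job ID is a placeholder), squeue's format options can print just the job ID, state, and reason:
squeue --job <jobid> --format="%i %T %r"
Here %i is the job ID, %T the job state, and %r the reason code, which you can then look up in the reason code table.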
There are a few reasons you might get out-of-memory errors:
- You're using -c <N> to request CPUs. Based on the way our scheduler is set up, this will reduce the memory allocation for your job to ~4MB. To solve this, change your CPU request by setting either --ntasks=<N> or --ntasks=1 --cpus-per-task=<N>.
- You may not have specified the number of nodes required for your job. For non-MPI workflows, if SLURM scatters your CPUs across multiple nodes, you will only have access to the resources on the executing node. Explicitly setting --nodes in your script should help, e.g.:
#SBATCH --nodes=1
- You may not have allocated enough memory to your job. Try running seff <jobid> to see your memory usage. You may consider using memory profiling techniques, allocating more CPUs, or using the high-memory node.
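If you want more detail than seff's summary, one complementary check is SLURM's accounting command sacct; the field list below is just one reasonable choice:
sacct --jobs <jobid> --format=JobID,State,Elapsed,ReqMem,MaxRSS
Comparing ReqMem (the memory you requested) with MaxRSS (the peak memory each job step actually used) will show whether the job simply needs a larger allocation.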
Yes, when you start an interactive session via the terminal or submit a batch job, the modules gnu8, openmpi3, and cmake are loaded by default. If you need to use intel, you'll want to unload openmpi3 and gnu8 first.
However, if you start a terminal in an interactive desktop session through Open OnDemand, no modules are loaded by default in that environment. To start, at a minimum you'll want to run the command:
module load ohpc
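Putting both situations together, here is a short sketch of the module commands (the module names are those mentioned above; anything beyond them is workflow-specific):
# In a batch job or terminal session where gnu8 and openmpi3 are loaded by default,
# switch to the Intel toolchain:
module unload openmpi3 gnu8
module load intel
# In a terminal opened from an Open OnDemand desktop session, load the base
# environment first, then any other modules your work requires:
module load ohpc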