
Panel
borderColor#9c9fb5
bgColor#fcfcfc
titleColor#fcfcfc
titleBGColor#021D61
borderStylesolid
titleContent


Excerpt


Expand
titleQ. Why isn't my job running?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted


There are a few reasons your job may not be running. Check below for some ideas on diagnosing the issue (a combined command sketch follows the list):

  • Run squeue --job <jobid> and see if anything is listed under "(REASON)". This may give an idea of why your job is stuck in the queue. We have a table in our SLURM documentation that describes what each Reason code means.

  • Due to the number of HPC users, it may not always be possible to run a submitted job immediately. If there are insufficient resources available, your job will be queued and it may take up to a few hours for it to begin executing.

  • Your group may have run out of standard hours. You can check your allocation using the command va.

  • Your group/job has reached a resource usage limit (e.g., the number of GPUs that may be used concurrently by a group, or a job requesting more than the 10-day maximum walltime). Try running job-limits <group_name> to see what limits you're subject to and whether any problem jobs are listed.

  • You may be requesting a rare resource (e.g., 4 GPUs on a single node on Puma or a high memory node).
    • If you are requesting a single GPU on Puma and are frustrated with the wait times, you might consider checking if Ocelote will work for your analyses. There are more GPU nodes available on that cluster with shorter wait times.
    • If you are trying to run a job on a standard node and have been waiting for a very long time, check its status using job-history <jobid>. If you see an Allocated RAM/CPU ratio above 5 GB on Puma or above 6 GB on Ocelote, then you are queued for the high memory node, which can have very long wait times. To queue for a standard node, cancel your job and check that your script requests the correct memory-to-CPU ratio.
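
As a rough diagnostic sequence, the commands mentioned above can be run one after another. The job ID and group name are placeholders to replace with your own values:

Code Block
languagebash
themeMidnight
# See what the scheduler reports under (REASON) for a pending job
squeue --job <jobid>

# Check your group's remaining standard allocation
va

# Check which resource limits your group and jobs are subject to
job-limits <group_name>

# Review what was actually allocated to a job (e.g., the RAM/CPU ratio)
job-history <jobid>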




Expand
titleQ. Why do my jobs keep getting interrupted?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted
If your jobs keep stopping and restarting, it's likely because you are using Windfall. Windfall is a lower-priority queue and is subject to preemption by higher-priority jobs. Before submitting a job to Windfall, consider using your group's allotted monthly hours first: jobs using standard hours queue for a shorter time and will not be interrupted. You can check your group's remaining hours with the command va. For more information on your allotted hours and the different job queues, see: Allocation and Limits.
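
As a minimal sketch of choosing standard hours over Windfall, the batch directives below use the standard SLURM options --account and --partition; the account name is a placeholder, and the partition names follow the queues discussed here:

Code Block
languagebash
themeMidnight
#!/bin/bash
# Charge the job to your group's standard allocation (not subject to preemption)
#SBATCH --account=<your_group>
#SBATCH --partition=standard
# Fall back to windfall only when your standard hours are exhausted:
# #SBATCH --partition=windfall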



Expand
titleQ. Can I run programs on the login nodes?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted
No. Application software is not available on the login nodes; think of them as 'submit' nodes where you prepare and submit job scripts. To run or test your code interactively, start an interactive session on one of the system's compute nodes. Processes running on the login nodes are subject to termination if we think they are affecting other users.
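
As an illustration, a generic SLURM request for an interactive shell on a compute node might look like the line below; the account name, partition, and time limit are placeholders, and your site may also provide a wrapper command for this:

Code Block
languagebash
themeMidnight
# Request a one-hour interactive shell on a compute node instead of working on a login node
srun --partition=standard --account=<your_group> --ntasks=1 --time=01:00:00 --pty bash -i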



Expand
titleQ. Can I get root access to my compute nodes?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted
Unfortunately, that is not possible. The compute nodes get their image from the head node and must remain identical. If the software you need would normally require root access to install, you can usually install it locally in your account instead. See this example.
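
Two common patterns for installing into your own account are sketched below; the package name and install prefix are placeholders, and the exact steps depend on the software:

Code Block
languagebash
themeMidnight
# Install a Python package into your home directory rather than system-wide
pip install --user <package_name>

# Build a tool from source with an install prefix under your home directory
./configure --prefix=$HOME/software
make && make install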



Expand
titleQ. Can I ssh to compute nodes?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted
SLURM will let you ssh to nodes that are assigned to your running jobs, but not to other nodes.
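
For example, with the job ID and node name as placeholders:

Code Block
languagebash
themeMidnight
# List the node(s) assigned to your running job
squeue --job <jobid> --format="%N"

# ssh succeeds only for nodes that belong to one of your running jobs
ssh <nodename>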




Expand
titleQ. Why am I getting out of memory errors?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted

There are a few reasons you might get out of memory errors:

  • You're using -c <N> to request CPUs. Based on the way our scheduler is set up, this will reduce the memory allocation for your job to ~4 MB. To solve this, change your CPU request by setting either --ntasks=<N> or --ntasks=1 --cpus-per-task=<N>.
  • You may not have specified the number of nodes required for your job. For non-MPI workflows, if SLURM scatters your CPUs across multiple nodes, you will only have access to the resources on the executing node. Explicitly setting --nodes in your script should help, e.g.:

    Code Block
    languagebash
    themeMidnight
    #SBATCH --nodes=1


  • You may not have allocated enough memory to your job. Try running seff <jobid> to see your job's memory usage. Consider memory-profiling your code, allocating more CPUs, or using the high memory node. A combined example request is shown below.
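
Putting the points above together, a resource request for a non-MPI job might look like the sketch below. The CPU and memory values are illustrative only, and --mem is a standard SLURM option shown here as one way to make the request explicit; scale the numbers to your cluster's memory-per-CPU ratio:

Code Block
languagebash
themeMidnight
#!/bin/bash
# Keep all CPUs on one node for non-MPI workflows
#SBATCH --nodes=1
# Request CPUs via --ntasks/--cpus-per-task rather than -c alone
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
# Illustrative memory request; match your cluster's per-CPU ratio
#SBATCH --mem=20gb

# After the job completes, check how much memory it actually used:
#   seff <jobid>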



Expand
titleQ. Why shouldn't I use Windfall with OnDemand?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted
Windfall jobs can be preempted by higher-priority jobs. Each OnDemand session creates an interactive job on a node, and being preempted in the middle of that session is disruptive; a desktop session would end the same way. Windfall can still be used if you do not have enough standard time left. Consider, though, that a one-hour session using a single compute core consumes only 1 CPU hour out of your group's 100,000 hours.



Expand
titleQ. My interactive session has been disconnected, can I return to it?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted
No, unfortunately once an interactive job ends it is no longer accessible. This applies both to OOD sessions and to those started from the command line. We recommend using the standard partition rather than Windfall when running interactive jobs to prevent preemption.