
Panel
borderColor#9c9fb5
bgColor#fcfcfc
titleColor#fcfcfc
titleBGColor#021D61
borderStylesolid
titleContent


Excerpt


Expand
titleQ. Why isn't my job running?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted


There are a few reasons your job may not be running. Check below for some ideas on diagnosing the issue (a combined command sketch follows the list):

  • Run squeue --job <jobid> and see if anything is listed under "(REASON)". This may give an idea of why your job is stuck in the queue. We have a table in our SLURM documentation that describes what each Reason code means.

  • Due to the number of HPC users, it may not always be possible to run a submitted job immediately. If there are insufficient resources available, your job will be queued and it may take up to a few hours for it to begin executing.

  • Your group may have run out of standard hours. You can check your allocation using the command va.

  • Your group/job has reached a resource usage limit (e.g., the number of GPUs that may be used concurrently by a group, or a job requesting more than the 10-day maximum walltime). Try running job-limits <group_name> to see what limits you're subject to and whether any problem jobs are listed.

  • You may be requesting a rare resource (e.g., 4 GPUs on a single node on Puma or a high memory node).
    • If you are requesting a single GPU on Puma and are frustrated with the wait times, you might consider checking if Ocelote will work for your analyses. There are more GPU nodes available on that cluster with shorter wait times.
    • If you are trying to run a job on a standard node and have been waiting for a very long time, check its status using job-history <jobid>. If you see an Allocated RAM/CPU ratio above 5 GB on Puma or above 6 GB on Ocelote, then you are queued for the high memory node, which can have very long wait times. To queue for a standard node, cancel your job and check that your script requests the correct memory-to-CPU ratio.
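
As a rough diagnostic sequence, the commands mentioned above can be run one after another. The job ID and group name are placeholders to replace with your own values:

Code Block
languagebash
themeMidnight
# See what the scheduler reports under (REASON) for a pending job
squeue --job <jobid>

# Check your group's remaining standard allocation
va

# Check which resource limits your group and jobs are subject to
job-limits <group_name>

# Review what was actually allocated to a job (e.g., the RAM/CPU ratio)
job-history <jobid>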




Expand
titleQ. Why do my jobs keep getting interrupted?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted
If your jobs keep stopping and restarting, it's likely because you are using Windfall. Windfall is a lower-priority queue and is subject to preemption by higher-priority jobs. Before submitting a job to Windfall, consider using your group's allotted monthly hours first: jobs using standard hours queue for a shorter time and will not be interrupted. You can check your group's remaining hours with the command va. For more information on your allotted hours and the different job queues, see: Allocation and Limits.
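
As a minimal sketch of choosing standard hours over Windfall, the batch directives below use the standard SLURM options --account and --partition; the account name is a placeholder, and the partition names follow the queues discussed here:

Code Block
languagebash
themeMidnight
#!/bin/bash
# Charge the job to your group's standard allocation (not subject to preemption)
#SBATCH --account=<your_group>
#SBATCH --partition=standard
# Fall back to windfall only when your standard hours are exhausted:
# #SBATCH --partition=windfall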



Expand
titleQ. Can I run programs on the login nodes?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted
No. Application software is not available on the login nodes; think of them as 'submit' nodes where you prepare and submit job scripts. To run or test your code interactively, start an interactive session on one of the system's compute nodes. Processes running on the login nodes are subject to termination if we think they are affecting other users.
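
As an illustration, a generic SLURM request for an interactive shell on a compute node might look like the line below; the account name, partition, and time limit are placeholders, and your site may also provide a wrapper command for this:

Code Block
languagebash
themeMidnight
# Request a one-hour interactive shell on a compute node instead of working on a login node
srun --partition=standard --account=<your_group> --ntasks=1 --time=01:00:00 --pty bash -i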



Expand
titleQ. Can I get root access to my compute nodes?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted
Unfortunately, that is not possible. The compute nodes get their image from the head node and must remain identical. If the software you need would normally require root access to install, you can usually install it locally in your account instead. See this example.
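
Two common patterns for installing into your own account are sketched below; the package name and install prefix are placeholders, and the exact steps depend on the software:

Code Block
languagebash
themeMidnight
# Install a Python package into your home directory rather than system-wide
pip install --user <package_name>

# Build a tool from source with an install prefix under your home directory
./configure --prefix=$HOME/software
make && make install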



Expand
titleQ. Can I ssh to compute nodes?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted
SLURM will let you ssh to nodes that are assigned to your running jobs, but not to other nodes.
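
For example, with the job ID and node name as placeholders:

Code Block
languagebash
themeMidnight
# List the node(s) assigned to your running job
squeue --job <jobid> --format="%N"

# ssh succeeds only for nodes that belong to one of your running jobs
ssh <nodename>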




Expand
titleQ. Why am I getting out of memory errors?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted

There are a few reasons you might get out of memory errors:

  • You're using -c <N> to request CPUs. Based on the way our scheduler is set up, this will reduce the memory allocation for your job to ~4 MB. To solve this, change your CPU request by setting either --ntasks=<N> or --ntasks=1 --cpus-per-task=<N>.
  • You may not have specified the number of nodes required for your job. For non-MPI workflows, if SLURM scatters your CPUs across multiple nodes, you will only have access to the resources on the executing node. Explicitly setting --nodes in your script should help, e.g.:

    Code Block
    languagebash
    themeMidnight
    #SBATCH --nodes=1


  • You may not have allocated enough memory to your job. Try running seff <jobid> to see your job's memory usage. Consider memory-profiling your code, allocating more CPUs, or using the high memory node. A combined example request is shown below.
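
Putting the points above together, a resource request for a non-MPI job might look like the sketch below. The CPU and memory values are illustrative only, and --mem is a standard SLURM option shown here as one way to make the request explicit; scale the numbers to your cluster's memory-per-CPU ratio:

Code Block
languagebash
themeMidnight
#!/bin/bash
# Keep all CPUs on one node for non-MPI workflows
#SBATCH --nodes=1
# Request CPUs via --ntasks/--cpus-per-task rather than -c alone
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
# Illustrative memory request; match your cluster's per-CPU ratio
#SBATCH --mem=20gb

# After the job completes, check how much memory it actually used:
#   seff <jobid>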



Expand
titleQ. Why shouldn't I use Windfall with OnDemand?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted
Windfall jobs can be preempted by higher-priority jobs. Each OnDemand session creates an interactive job on a node, and being preempted in the middle of that session is disruptive; a desktop session would end the same way. Windfall can still be used if you do not have enough standard time left. Consider, though, that a one-hour session using a single compute core consumes only 1 CPU hour out of your group's 100,000 hours.



Expand
titleQ. My interactive session has been disconnected, can I return to it?


Panel
borderColor#07105b
bgColor#f6f6f6
titleColor#00084d
titleBGColor#d9dae5
borderStyledotted
No, unfortunately once an interactive job ends it is no longer accessible. This applies both to OOD sessions and to those started from the command line. We recommend using the standard partition rather than Windfall when running interactive jobs to prevent preemption.