A collection of frequently asked questions and their solutions.



Account Access

Q. How do I create an account?

A. A step-by-step guide is available on our Account Creation page.

Q. Why can't I log in?

A. You haven't created an account yet.
A. Your account isn't sponsored yet.
A. You aren't using two-factor authentication (NetID+).
A. You need to wait 15 minutes. If you just created your account, it takes time before you can log in.
A. You're trying to connect using ssh NetID@login.hpc.arizona.edu. This will not work. Instead, use: ssh NetID@hpc.arizona.edu.
A. You're using NetID@hpc.arizona.edu or NetID@email.arizona.edu as your username in PuTTY. Instead, use only your NetID.

Q. Why can't I enter my password in my terminal?

A. Linux systems do not display keystrokes while you enter your password, which can make it look like the ssh client is frozen. Even though nothing appears to be happening, the system is still registering your input. To proceed, type your password at the prompt and press Enter.

Q. I get this message: You do not appear to have registered for an HPC account

A. You need to wait for the request to propagate through the University systems. This can take an hour or longer.

Q. I try to connect through SSH but am unable to log in. I am getting a "permission denied" message.

A. You need an HPC account - go to account.arizona.edu.  Once you've done that you should just need to wait about 15 minutes before you can log in.  If your PI hasn't already added you to their group you'll need to wait for that as well.

Q. My HPC account is set up but I can’t log in - incorrect password

A. You need to wait about 15 minutes after your account is approved for the account to be available.
A. You must enroll in NetID+. Depending on the application you use to log in, you may not get the typical NetID+/DUO menu of options, or an error message indicating this is your problem.

Q. I've forgotten my password, how can I reset it?

A. HPC uses the same NetID login credentials as all UA services. If you need to reset your NetID password you can do so using the NetID portal: https://netid-portal.iam.arizona.edu/

Q. How do I add members to my HPC research group?

A. The easiest method is for PIs to use the Portal under the Manage Time tab.

Q. I'm leaving the university/not affiliated with the university, can I maintain/receive access to HPC?

A. Yes, if you are a former university affiliate or campus collaborator participating in research, you may register as a Designated Campus Colleague (DCC). Once your DCC status has been approved, you will receive a NetID+ which you may use to create an HPC Account. If you already have an HPC Account, no further action is required.





General Computing / Scheduling

Q. Why isn't my job running?

A. There are a few reasons your job may not be running:

  1. Due to the number of HPC users, it may not always be possible to run a submitted job immediately. If there are insufficient resources available, your job will be queued and it may take up to a few hours for it to begin executing.
  2. Your group has run out of standard hours. You can check your allocation using the command va.

Q. Why aren't common commands working?

A. Perhaps your shell is not set to Bash. If you already had another account before joining HPC, that profile will carry over, while a brand new account will always default to bash. If your shell is not set to Bash, contact our consultants so that they can reset it for you.

Q. Why do my jobs keep getting interrupted?

A. If your jobs keep stopping and restarting, it's likely because you are using Windfall. Windfall is considered lower priority and is subject to preemption by higher priority jobs. Before submitting a job to Windfall, consider using your group's allotted monthly hours first. Jobs using Standard hours will queue for a shorter period of time and will not be interrupted. You can check your group's remaining hours using the command va. To see more information on your allotted hours and the different job queues, see: Allocation and Limits.

Q. Can I run programs on the login nodes?

A. No. Software to run applications is not available on the login nodes. To run/test your code interactively, start an interactive session on one of the system's compute nodes. Processes running on the head node are subject to being terminated if we think they are affecting other users. Think of these as 'submit' nodes where you prepare and submit job scripts.

Q. Can I get root access to my compute nodes?

A. That is not possible.  The compute nodes get their image from the head node and have to remain the same.  If you need to install software that needs root access, for example, you can install the software locally in your account.  See this example.

Q. Can I ssh to compute nodes?

A. Slurm will let you ssh to nodes that are assigned to your job, but not to others.

Q. I accidentally deleted files, can I get them back?

A. Unfortunately, backups are not made on HPC. To avoid data loss:

  • Make frequent backups, ideally in three places and two formats. Helpful information on making backups can be found on our page Transferring Data.
  • Use rm and rm -r with caution as these commands cannot be undone! Consider using rm -i when removing files/directories. The -i flag will prompt you to manually confirm file removals to make really sure they can be deleted.
  • You can open a support ticket to request assistance.  Files that are deleted are not removed from the storage array immediately, but don't wait more than a few days.

Q. Why do I get out of memory errors?

A. You may see out of memory or oom-kill in your output. By default, Slurm will spread your job across nodes but only use the memory on the first node. Add this to your Slurm script:

Code Block
languagebash
themeMidnight
#SBATCH --nodes=1





Software/Modules

Q. Are any software modules loaded by default?

A. Yes, when you start an interactive terminal session or submit a batch script, the modules ohpc, gnu8, openmpi3, and cmake are automatically loaded. If your code uses Intel compilers, you will want to manually unload gnu8 and openmpi3 to prevent conflicts.

The exception: If you are working in a terminal in an Open OnDemand interactive desktop session, nothing is loaded by default and you will need to manually load any necessary modules.

Q. How do I install this R package/Why can't I install this R package?

A. R installations can sometimes be frustrating. We have instructions for how to set up a usable R environment, how to diagnose and troubleshoot problems, and steps to help with known troublesome packages documented in our Using and Customizing R Packages section.

Q. How do I install Python packages?

A. You can install python packages locally using either a virtual environment or a local conda environment. See our documentation on using Python for instructions on how to set these up.

Q. I have been using an older version of Singularity and now it is not available.

A. The current version of Singularity is 3.7.4. Prior versions have been removed because only the latest one is considered secure. Notify the consultants if you need help with the transition to the current version. Singularity is installed on the operating systems of all compute nodes, so it does not need to be loaded with a module.

Q. What executables are available when I load a module?

A. Load the module, find the path to the executable by checking the $PATH variable, then list the contents.  For example:

Code Block
languagebash
themeMidnight
module load lammps
echo $PATH
ls /opt/ohpc/pub/apps/lammps/3Mar20/bin
lmp_mpi
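
The lookup above can also be scripted. The following is a minimal sketch, not an official HPC utility: list_path_executables is a hypothetical helper name, and it assumes the loaded module adds a directory whose path contains the module name (as in the lammps path above) to $PATH.

```shell
# List executables found in every $PATH directory whose path contains a keyword.
# Hypothetical helper for illustration; not part of the module system.
list_path_executables() {
    local keyword="$1" dir f
    local IFS=':'
    for dir in $PATH; do
        case "$dir" in
            *"$keyword"*)
                for f in "$dir"/*; do
                    # Print only regular files that are executable
                    if [ -f "$f" ] && [ -x "$f" ]; then
                        basename "$f"
                    fi
                done
                ;;
        esac
    done
}
```

After running module load lammps, calling list_path_executables lammps should print the names of the executables in the module's bin directory, provided the assumption about the directory name holds.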

Q. Why am I getting "command: module not found"?

A. There are three possible reasons:

  1. You are not in an interactive session. Unlike ElGato and Ocelote, modules are not available on the Puma login nodes. You may request an interactive session by using the command interactive.
  2. Your shell is not set to bash. If this is the case, contact our consultants so that they can reset it for you.
  3. You have modified or deleted your ~/.bashrc. If this is the case, open (if the file exists) or create and open (if the file is missing) the file .bashrc in your home directory and add the lines:

    Code Block
    languagebash
    themeMidnight
    if [ -f /etc/bashrc ]; then
            . /etc/bashrc
    fi
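
The repair in step 3 can be applied without opening an editor. This is a minimal sketch under stated assumptions: ensure_global_bashrc_sourced is a hypothetical helper name, and it treats any existing mention of /etc/bashrc in the file as proof that the lines are already present.

```shell
# Append the lines that source /etc/bashrc to a bashrc file, creating the
# file if needed. Does nothing if the file already mentions /etc/bashrc.
ensure_global_bashrc_sourced() {
    local rc_file="${1:-$HOME/.bashrc}"
    # Create the file if it is missing
    [ -f "$rc_file" ] || touch "$rc_file"
    if ! grep -q '/etc/bashrc' "$rc_file"; then
        printf '%s\n' \
            'if [ -f /etc/bashrc ]; then' \
            '        . /etc/bashrc' \
            'fi' >> "$rc_file"
    fi
}
```

Calling ensure_global_bashrc_sourced with no argument targets ~/.bashrc. Log out and back in (or start a new shell) for the change to take effect.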


Q. How can I maximize my software performance on Puma?

A. If you are able to compile your software, you can take advantage of most of the AMD Zen architecture by using the following compiler flags:

Compiler    Arch-Specific    Arch-Favorable
GCC 9       -march=znver2    -mtune=znver2
LLVM 9      -march=znver2    -mtune=znver2

Neither of these compiler versions (GCC 9 or LLVM 9) is available on Puma, so you will have to build one first. If you use GCC 8.3, you can set znver1 instead.
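
As an illustration of choosing between the two flag sets, here is a small sketch. zen_flags_for_gcc is a hypothetical helper name, and the mapping covers only the two cases named above (GCC 9 uses znver2, GCC 8.3 uses znver1). Treating every version below 9 like 8.3 is an assumption made for brevity.

```shell
# Print architecture flags for a given GCC version string (e.g. 8.3 or 9.2.0).
# Assumption: versions below 9 are handled like GCC 8.3 (znver1).
zen_flags_for_gcc() {
    local major="${1%%.*}"
    if [ "$major" -ge 9 ]; then
        echo "-march=znver2 -mtune=znver2"
    else
        echo "-march=znver1 -mtune=znver1"
    fi
}

# Example use (my_program.c is a placeholder source file):
#   gcc -O2 $(zen_flags_for_gcc "$(gcc -dumpversion)") -o my_program my_program.c
```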

Q. I have an application that runs on Windows and uses GPUs.

A. AWS has been used successfully for Windows software with GPU needs. It’s easy to set up, cost effective, and very scalable. Amazon also has a cloud credit for research program available:
https://aws.amazon.com/government-education/research-and-technical-computing/cloud-credit-for-research/

Q. Is the Intel compiler faster than GCC on Puma?

A. Intel compilers are optimized for Intel processors. There is some debate around the concept of unfair CPU dispatching in Intel compilers. By default, software on the HPC clusters is built with GCC (on Puma it is GCC 8.3).  This is in keeping with our preference for community software.

Q. How do I access Gaussian or GaussView?

A. You need to belong to a special group called g03. You can ask the HPC consultants to add you.

Q. How do I take advantage of the Distributed capability of Ansys?

A. Ansys has the Distributed capability built in to increase performance. Ansys uses the Intel compiler and so uses Intel MPI.  By default, we load OpenMPI, so you will need to do this: 

Code Block
module unload gnu8 openmpi3
module load intel
module load ansys




General Data Transfer and Storage

Q. Do you allow users to NFS mount their own storage onto the compute nodes?

A. No. We NFS mount storage across all compute nodes so that data is available independent of which compute nodes are used.  See this section for how to transfer data.

Q. I can't transfer my data to HPC with an active account. What's wrong?

A. After creating your HPC Account, your home directory will not be created until you log in for the first time. Without your home directory, you will not be able to transfer your data to HPC. If you are struggling and receiving errors, sign into your account either using the CLI through the bastion or logging into OnDemand and then try again.

If you are using something like SCP and are receiving errors, make sure your hostname is set to filexfer.hpc.arizona.edu (not hpc.arizona.edu).

Q. I accidentally deleted files, can I retrieve them?

A. Unfortunately, no. Backups are not made and anything deleted is permanently erased. It is impossible for us to recover it.  We recommend data be backed up elsewhere, preferably in three places and two formats. We also recommend including the -i flag when executing the rm command:

Code Block
languagebash
themeMidnight
rm -i <filename> # for single-file deletions
# or
rm -i -R <directory> # for deleting full directories

When run, this forces you to manually confirm the deletion of each file. It may be cumbersome, but it can prevent massive headaches in the future!

Q. What do these Globus errors mean?

A. Endpoint too busy: This is most commonly seen when users are transferring directories to Google Drive. This is because Google has user limits restricting the number of files that can be transferred per unit time. When many files are being transferred at once, that limit may be exceeded. Globus will automatically hold the transfer until the limit is reset, at which point it will continue. One way to avoid this is to archive your work prior to the transfer (e.g. in .tar.gz form). Archiving will also speed up your transfers considerably, sometimes by orders of magnitude.

A. Fatal FTP Response, PATH_EXISTS: Globus is finicky about the destination endpoint. If you get this error, check to see whether duplicate files/directories exist at the destination. This can happen frequently with Google Drive as multiple files/directories can exist in the same location with the same name. If duplicates exist, try moving, removing, or renaming them and reinitiate the transfer.

Q. I am getting “Authentication failed” errors when performing file transfers on HPC. It used to work. What has changed?

A. In our last maintenance update on July 20th, one of the changes was to ensure HIPAA compliance on the Data Transfer Nodes (DTNs).

This change included the insertion of required text:

Authorized uses only. All activity may be monitored and reported.

This change breaks SCP (scp) activity, not in all cases but frequently with WinSCP, FileZilla, and from a terminal. Terminal activity will likely still work from Linux or macOS.

The solution is to stop using SCP. SCP is considered outdated, inflexible, and not readily fixed. We recommend using more modern protocols like SFTP and rsync.

PuTTY supports SFTP with the “PSFTP” command.

For FileZilla, in the toolbar, click on Edit and Settings, then click on SFTP.

For Cyberduck, choose SFTP in the dropdown for protocols.
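
For terminal users, an rsync transfer can be sketched as follows. This is an illustration rather than a supported tool: push_to_hpc and the NETID and DRY_RUN variables are hypothetical names introduced here, and the destination host is the filexfer.hpc.arizona.edu node mentioned earlier on this page. By default the function only prints the command so it can be checked before anything is copied.

```shell
# Build an rsync command that copies a local path to HPC.
# The command is only printed unless DRY_RUN=0 is set.
# Set NETID to your own NetID; "NetID" is a placeholder.
push_to_hpc() {
    local src="$1" dest="$2"
    local netid="${NETID:-NetID}"
    # -a preserves permissions and timestamps, -v lists files as they are sent
    set -- rsync -av "$src" "${netid}@filexfer.hpc.arizona.edu:${dest}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, with NETID set to your NetID, push_to_hpc ./results '~/results' prints the command that would be run. Repeat the call with DRY_RUN=0 to start the transfer. Both paths are placeholders.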





xdisk

Q. Why am I getting xdisk emails?

A. xdisk is a temporary storage space available to your research group. When it's close to its expiration date, notifications will be sent to all members of your group. For detailed information on xdisk allocations, see: Storage.

Q. Why am I getting "/xdisk allocations can only be authorized by principal investigators"?

A. xdisks are managed by your group's PI by default. This means if you want to request an xdisk or modify an existing allocation (e.g., extending the time limit or increasing the storage quota), you will need to consult your PI. Your PI may either perform these actions directly or, if they want to delegate xdisk management to a group member, they may do so by following the instructions under Delegating xdisk Management Rights.

Q. How can we modify our xdisk allocation?

A. To modify your allocation's time limit or storage quota, your PI can either do so through the Web Portal under the Storage tab, or via the command line. If your PI would like to delegate management rights to a group member, they may follow the instructions under Delegating xdisk Management Rights. Once a group member has received management rights, they may manage the allocation through our web portal.

Q. Why am I getting "xdisk: command not found"?

A. If you're getting errors using xdisk commands in a terminal session, check that you are on a login node. If you are on the bastion host (hostname: gatekeeper), are in an interactive session, or are on the filexfer node, you won't be able to check or modify your xdisk. When you are on a login node, your terminal prompt should show the hostname "junonia" or "wentletrap". You can also check your hostname using the command:

Code Block
languagebash
themeMidnight
$ hostname
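
The hostname check can be wrapped in a small test. The sketch below uses only the two login node names given above. is_login_node is a hypothetical helper name, and stripping a domain suffix is a precaution in case hostname returns a fully qualified name.

```shell
# Return success if the given host (default: the current host) is a login node.
is_login_node() {
    local host="${1:-$(hostname)}"
    # Drop any domain suffix, e.g. host.example.edu -> host
    case "${host%%.*}" in
        junonia|wentletrap) return 0 ;;
        *) return 1 ;;
    esac
}

if is_login_node; then
    echo "On a login node: xdisk commands should be available."
else
    echo "Not on a login node: xdisk commands are not expected to work here."
fi
```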

Q. Why am I getting errors when trying to extend my allocation?

A. If you're trying to extend your group's allocation but are seeing something like:

Code Block
languagebash
themeMidnight
(puma) [netid@junonia ~]$ xdisk -c expire -d 1
invalid request_days: 1

for every value you enter, your xdisk has likely reached its maximum time limit. To check, go to portal.hpc.arizona.edu, click Manage XDISK, and look at the box next to Duration. If you see 300, your allocation cannot be extended further. You will need to back up your data to external storage (e.g., a local machine, lab server, or cloud service). Once your xdisk has expired (either by reaching its limit or through manual deletion), you can immediately create a new allocation and restore your data. Detailed xdisk information can be found on our Storage page. You may also want to look at our page on Transferring Data.

Q. Can we keep our xdisk allocation for more than 300 days? 

A. No, once an xdisk has reached its time limit it will expire. It's a good idea to start preparing for this early by making frequent backups and paying attention to xdisk expiration emails. 

Q. What happens when our xdisk allocation expires? 

A. Once an xdisk expires, all the associated data are deleted. Deleted data are non-retrievable since HPC is not backed up. It's advised to keep frequent backups of your data on different platforms, for example a local hard drive or a cloud-based service like Google Drive, or (even better) both!

Q. What's the best way to backup/transfer our data before our xdisk expires? 

A. Before your group's xdisk expires, you'll want to make an external backup of anything you need to keep. External storage options include personal computers, lab servers, external hard drives, or cloud services such as Google Drive or AWS. 

If you're moving large quantities of data, Globus is a great option. We have instructions in our storage documentation for setting up and using this software.

We strongly recommend making archives (.tar, .zip files, etc.) of large directories prior to transferring them off the system. In general, transfer software struggles with moving many small files and performs much more efficiently moving fewer large files. You will get better transfer speeds (sometimes by orders of magnitude) if you compress your files prior to transferring them. This can be done on our filexfer node, which is designed for large file management operations (hostname: filexfer.hpc.arizona.edu).
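
A minimal sketch of the archiving step, assuming GNU or BSD tar is available. archive_dir is a hypothetical helper name, and the archive is written next to the source directory.

```shell
# Create <dir>.tar.gz next to <dir> and print the archive path on success.
archive_dir() {
    local src="${1%/}"
    local archive="${src}.tar.gz"
    # -C switches to the parent directory so the archive stores relative paths
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Read the archive back to confirm it is intact before relying on it
    tar -tzf "$archive" > /dev/null && echo "$archive"
}
```

For example, archive_dir /path/to/results (a placeholder path) produces /path/to/results.tar.gz, which can then be transferred as a single file.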

Q. Can I preserve source file modification times when I transfer files to AWS S3 storage? 

A. S3 is an object store, not a filesystem. The local path is just a key, and the value is the content of the file. Getting time stamps on folders is futile as a result of how S3 works.  Getting timestamps other than "when the file was last changed in S3" for files can also be problematic depending on the tools one uses.

Again, we strongly recommend making archives (.tar, .zip files, etc.) of large directories prior to transferring them off the system.

Q. Once our xdisk expires, can we request a new one?

A. Yes, a new xdisk may be requested immediately after the old partition expires. Data, however, may not be transferred directly from the old partition to the new one. 

Q. Can a PI have more than one xdisk active at a time?

A. No, only one xdisk may be active per PI at a given time. 





Interactive Sessions

Q. Why shouldn't I use Windfall with OnDemand?

A. Windfall jobs can be preempted by a higher priority queue. Each session creates an interactive job on a node, and it is unsatisfactory to be dumped in the middle of that session. A desktop session would have the same unpleasant result. Windfall can be used if you do not have enough standard time left. Consider, though, that a one hour session using one compute core only takes up 1 CPU hour out of your group's 100,000 hours.

Q. My interactive session has been disconnected, can I return to it?

A. No, unfortunately when an interactive job ends it is no longer accessible. This applies to both OOD sessions and those accessed via the command line. 

Q. How do I access custom Python packages from an OOD Jupyter session?

A. Instructions on accessing custom packages are under Accessing Custom Packages from a Jupyter Session in our documentation on Using and Installing Python.






Errors

Q. Bad UID for job execution

A. This happens most frequently for new users. It takes a while to propagate new accounts to all the right places, so try again after a coffee break. If the error persists or occurs in other circumstances, open a support ticket with hpc-consult.

Q. My job fails with an out of memory error

A. You may be running your job across multiple nodes. This will happen if you don't specify the number of nodes you need. When your job is divided among multiple nodes, the executing node may not have enough allocated memory and will fail when its limit is exceeded. To force the job to stay on one node, add the following line to your script:

Code Block
languagebash
themeMidnight
#SBATCH --nodes=1
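
For context, here is where that line sits in a complete batch script. This is a trimmed sketch: the job name, task count, and time limit are placeholder values, and any account or partition options your group needs are left out. SLURM_JOB_NUM_NODES is set by Slurm inside a running job, and the script falls back to 1 when run outside of Slurm.

```shell
#!/bin/bash
# Placeholder job name
#SBATCH --job-name=single_node_example
# Keep every task on one node so all requested memory is on that node
#SBATCH --nodes=1
# Placeholder task count and time limit
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

# Report how many nodes the job actually received
nodes="${SLURM_JOB_NUM_NODES:-1}"
echo "Job is using ${nodes} node(s); this script is running on $(uname -n)"
```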

Q. OOD Desktop failure with "Could not connect to session bus: failed to connect to socket /tmp/dbus-"

A. This is most commonly seen with users who have Anaconda or Miniconda initialized in their accounts. Some options for resolving this issue:


Option 1: If you'd like to temporarily remove Anaconda from your environment, open the file ~/.bashrc (a hidden file in your home directory) and comment out everything between:

Code Block
languagebash
themeMidnight
# >>> conda initialize >>>
...
# <<< conda initialize <<<

Additionally, if you have any lines that look like:

Code Block
languagebash
themeMidnight
export PATH=/path/to/anaconda/or/miniconda/bin:$PATH

comment these out as well. Then try starting your Desktop session again. You may uncomment these lines to add Anaconda back into your environment.
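
If you prefer not to edit the file by hand, the conda block can be commented out with sed. This is a sketch under stated assumptions: comment_out_conda_init is a hypothetical helper name, and it expects the marker lines to appear exactly as conda writes them, with a leading "# ". It does not touch any separate export PATH lines, which still need to be commented out manually.

```shell
# Comment out the conda initialize block in a bashrc file.
# A backup copy is saved as <file>.bak before the file is changed.
comment_out_conda_init() {
    local rc_file="${1:-$HOME/.bashrc}" tmp
    cp "$rc_file" "${rc_file}.bak" || return 1
    tmp="$(mktemp)"
    # Prefix "# " to every line from the opening marker to the closing marker
    sed '/^# >>> conda initialize >>>/,/^# <<< conda initialize <<</s/^/# /' \
        "$rc_file" > "$tmp" && cat "$tmp" > "$rc_file"
    rm -f "$tmp"
}
```

To undo the change, copy the .bak file back over the original.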

Option 2: For a more permanent solution, you can run the command:

Code Block
languagebash
themeMidnight
conda config --set auto_activate_base false

This will prevent conda from auto-activating when you first log in and allow you to have more control over your environment. When you'd like to activate Anaconda, run conda activate.

Q. bin/bash^M: bad interpreter: No such file or directory

A. Scripts created in a Windows environment and transferred to HPC retain hidden carriage returns (^M). You can convert your Windows file to Unix format with:

Code Block
languagebash
themeMidnight
$ dos2unix <filename>
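
If dos2unix is not available where you are working, the same cleanup can be done with standard tools. A minimal sketch, with strip_carriage_returns as a hypothetical helper name. Note that it deletes every carriage return in the file, not only those at line ends, which is normally what you want for scripts.

```shell
# Remove all carriage return characters from a file in place.
strip_carriage_returns() {
    local file="$1" tmp
    tmp="$(mktemp)"
    # Writing back with cat keeps the original file's permissions
    tr -d '\r' < "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}
```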


Q. ReqNodeNotAvail, Reserved for maintenance

A. When HPC services are going down for maintenance, any jobs that are submitted with a requested runtime that overlaps with that maintenance period will not run until the systems are back online.



Miscellaneous 

Q. Why is my terminal being weird (e.g., CTRL+A puts me in the middle of my command prompt)?

A. When you log into HPC, the variable $PROMPT_COMMAND is set to display your current cluster (e.g., (puma)). This can cause formatting problems for some users. If you'd prefer to modify your $PS1 instead, you can add the following to your ~/.bashrc:

Code Block
languagebash
themeMidnight
if [ -n "${PROMPT_COMMAND}" -a -r /usr/local/bin/slurm-selector.sh ]; then
  SavePS1=${PS1}
  Cur_Cluster=$(eval ${PROMPT_COMMAND} 2>/dev/null)
  PS1="${Cur_Cluster}${SavePS1}"
  unset PROMPT_COMMAND
  for c in puma ocelote elgato; do
     alias ${c}="PS1=\"(${c}) ${SavePS1}\"; . /usr/local/bin/slurm-selector.sh ${c}; unset PROMPT_COMMAND"
  done
  unset Cur_Cluster SavePS1
fi