## What is HPC?

If you're looking to tackle complex problems or speed up your data analyses, HPC might be just what you need! HPC is an acronym for High Performance Computing and is often used interchangeably with supercomputing. As a UArizona affiliate, you can be sponsored by a faculty member (faculty members can sponsor themselves) to receive free access to our three supercomputers: Puma, Ocelote, and ElGato. These are clusters of computers that are housed in the lower level of the UITS building and are available for your analyses. If you're interested in a more in-depth overview of what a supercomputer is, see our page Supercomputing In Plain English.
## Node Summary

Before submitting a Slurm script, you must know (or at least have a general idea of) the resources needed for your job. This will tell you which type of node to request, how much memory, and other useful information that can be provided to the system via your batch script. A detailed list of Slurm batch flags is included below.
### Hardware Limitations by Node Type and Cluster

Please consult the following table when crafting Slurm submission scripts. Requesting resources greater than those available on a given cluster and node type may lead to errors or delays.
See here for example Slurm requests.

### Other Job Limits

In addition to fitting your jobs within the constraints of our hardware, there are other limitations imposed by the scheduler to maintain fair use.
## Login nodes

Once you connect to the bastion host, you'll notice it prompts you to type
Note that the hostname has now changed. Where it used to say gatekeeper, it should now say either wentletrap or junonia. These are the HPC login nodes. A login node is a small computer that serves as a staging area where you can perform housekeeping, edit scripts, and submit your work to run on the system. It's essential to know that the login nodes are not where your analyses are run. Instead, the cluster's compute nodes are where the real work is done.
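If you're ever unsure which machine you've landed on, the standard `hostname` command will tell you:

```shell
# Print the name of the machine you are currently connected to;
# on a UArizona HPC login node this would be wentletrap or junonia.
hostname
```

Checking this before running anything is a quick way to avoid accidentally doing heavy work on a login node.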
## Compute nodes

### What are compute nodes?

In contrast to the login nodes, compute nodes are high-performance machines designed for computationally intensive applications that require a significant amount of processing power and memory. For example, a standard compute node on Puma has 94 CPUs available and 470 GB of RAM!

### Different clusters, different compute nodes

Each of our supercomputers has its own cluster of compute nodes with different resources available. When you first log into HPC, your environment is configured to submit jobs to our largest (and busiest) cluster, Puma. You can see which cluster you're targeting by looking at the beginning of your command line prompt. To switch your target cluster, use one of the following shortcuts:

For this tutorial, let's switch clusters by entering the shortcut
### How do you actually access a compute node?

To connect to a compute node, you will need to use the job scheduler. A scheduler, in our case SLURM, is software that will find and reserve resources on a cluster's compute nodes as space becomes available. Resources include things like memory, CPUs, and GPUs that you want to reserve for personal use for a specified period of time. You can use the job scheduler to request two types of jobs: interactive and batch. We will cover both of these in the sections below.

### Are there any limitations to what I can request?

Yes, there are some limits to what you can request. One of the important limits to understand is your group's CPU hours allocation. Allocations are a way of being "charged" to use HPC resources. Each group gets:
These allocations are automatically refreshed on the first day of each month. Once your group's allocation runs out, you will need to wait until it is refreshed before using that allocation again. Your group's account is charged for each request made to the scheduler, in other words for each job you submit. The amount charged is the number of CPUs requested multiplied by the number of hours reserved. For example, if you submit a job that requests 5 CPUs for 10 hours, your group will be charged 50 CPU hours. If your job ends early, you will be refunded any unused time. To see your group's allocation, use the command
Make a note of your group's name for later.
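The charging rule above is simple multiplication, so you can sanity-check an estimated charge in the shell before submitting. Using the numbers from the example in the text:

```shell
# Estimated CPU-hour charge = CPUs requested x hours reserved
cpus=5
hours=10
echo "Estimated charge: $(( cpus * hours )) CPU hours"   # prints 50
```

Remember that unused time is refunded if the job finishes early, so this is an upper bound on what your group is actually charged.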
## Interactive jobs

Let's start with a basic interactive job to get a feel for things.

### Starting a session

To connect to a compute node to work interactively, use the command
By default, when you use the command

That's it! You're now connected to a compute node (in my case, gpu66) and are ready to run some work.

### Let's check out some software

Software packages are not available on the login nodes but are available on the compute nodes. Now that we're connected to one, we can see what's available. Software on HPC comes installed as modules. Modules make it easy to load and unload software from your environment. This allows hundreds of packages to be available on the same system without dependency or versioning conflicts. It's always good practice to specify which version of the software you need when loading to ensure a stable environment. You can view and load software modules using the command
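The exact commands weren't preserved in this copy of the page. Assuming the standard Environment Modules/Lmod `module` tool (and an illustrative Python version number), a typical module session looks like:

```shell
module avail python      # list the Python modules installed on the system
module load python/3.8   # load a specific version (version shown is illustrative)
module list              # confirm which modules are currently loaded
```

Loading a pinned version rather than a bare `module load python` is what the text means by specifying a version for a stable environment.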
Try running

### Benefits of interactive sessions

Interactive sessions are excellent development environments. When connected to a compute node, some things you can do are:
### Drawbacks of interactive sessions

Interactive sessions are great testing and development environments, but may not be optimally suited for certain types of analyses. Some issues that may arise include:

What's a good solution to deal with these challenges? The answer: batch jobs!
## Batch jobs

### The basics

Batch jobs are a way of submitting work to run on HPC without the need to be present. This means you can log out of the system, turn off your computer, and walk away without your work being interrupted. It also means you can submit multiple (up to 1000) jobs to run simultaneously! Running these sorts of jobs requires two steps:
Let's try creating our first job now using the outline provided above.

### Creating the sample code

Let's start by creating a simple Python script that we'll run in batch.
Now open the file in your favorite text editor (for example, nano or vim) and add:
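The script's contents were lost from this copy of the page; a minimal version consistent with the output described later (a greeting that includes the machine's hostname) might be:

```python
import socket

# Print a greeting that includes the name of the machine running the code.
# When run as a batch job, this hostname will differ from the node you
# were connected to interactively.
print(f"Hello world! I am running on host {socket.gethostname()}")
```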
Then save and exit. If we run this interactively, we'll see
### Creating the batch script

Now, let's make a new file called hello_world.slurm
Now open it in your favorite text editor.

#### Step 1: Add the shebang and SBATCH directives
A comprehensive list of all the options you can specify in your batch script can be found on our Running Jobs with SLURM page. In this example, we'll stick with some of the basics:
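The directive list itself didn't survive in this copy. A typical minimal header for a job like this one might look as follows; the account value is a placeholder for your group's name, and the partition name is an assumption about this site's setup:

```shell
#!/bin/bash
#SBATCH --job-name=hello_world   # a name to identify the job in the queue
#SBATCH --account=YOUR_GROUP     # placeholder: the group name you noted earlier
#SBATCH --partition=standard     # assumed partition name
#SBATCH --ntasks=1               # a single task
#SBATCH --cpus-per-task=1        # one CPU, so the charge is 1 CPU x wall time
#SBATCH --time=00:05:00          # five-minute wall-time limit
```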
#### Step 2: Add your code instructions

After the SBATCH directives, we'll add the instructions for executing our code to the same file.
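The original lines weren't preserved here; assuming a module-based Python installation and the filename used above, this section of the script might read:

```shell
# Load the software the script needs (version shown is illustrative)
module load python/3.8

# Run the script; anything it prints is captured in the job's output file
python3 hello_world.py
```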
Now save and exit.
### Submitting the job
The next step is to submit your job request to the scheduler. To do this, you’ll use the command sbatch. This will place your job in line for execution and will return a job ID. This job ID can be used to check your job’s status with squeue, cancel your job with scancel, and get your job’s history with job-history. A more comprehensive look at job commands can be found in our documentation on monitoring your jobs. Let’s run our script and check its status (substitute your own job ID below where relevant):
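Put together, submission and a status check might look like this (the job ID shown is illustrative):

```shell
sbatch hello_world.slurm
# Submitted batch job 1234567
squeue -j 1234567
```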
You can see its state is PD (for pending) which means it’s waiting to be executed by the system. Its state will go to R when it’s running and when the job has completed running, squeue will return a blank line. Let’s check the contents of our file with cat. If your run was successful, you should see:
Note that the hostname in this run is different from the hostname of the computer we're connected to. This is because it's a separate job from our interactive session, and so it may run on any other applicable machine on the cluster.
## Additional resources

That's it! You've now successfully run both a batch and an interactive job on HPC. To continue learning about HPC, our online documentation has a lot more information that can help get you started, for example FAQs, additional SBATCH directives, information on HPC storage, and file transfers. Other great resources include: virtual office hours every Wednesday from 2:00-4:00pm, consultation services offered through ServiceNow, an examples GitHub page with sample jobs, and a YouTube channel with training videos.