Maintenance

Overview

  • The command uptime_remaining displays the amount of time remaining before the systems are taken offline for maintenance.
  • If one of your jobs is held with the Reason code ReqNodeNotAvail, Reserved for maintenance, its requested walltime overlaps an upcoming maintenance period. Run uptime_remaining to see when the systems will be taken offline.


Most maintenance is performed during regular hours with no interruption to service. System-wide maintenance is usually planned ahead of time and scheduled for Wednesdays from 8AM to 5PM, with at least 10 days' notice. These maintenance windows are planned to occur four times per year.

During these maintenance windows, UITS may drain the queues of running jobs and suspend access to the cluster in order to carry out HPC maintenance.

The maintenance notification will describe the nature and extent (partial or full) of the interruption to HPC services.

Batch queues will also be modified prior to scheduled downtimes to hold jobs that request more wallclock time than remains before the shutdown. Held jobs will be released to run once maintenance concludes.
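
The hold rule above comes down to a simple comparison: a job can start only if its requested wallclock time ends before the shutdown begins. The following Python sketch illustrates that comparison. It is not the scheduler's actual implementation. The function names are hypothetical, the dates are made up for illustration, and the walltime is assumed to be written as D-HH:MM:SS or HH:MM:SS.

```python
from datetime import datetime, timedelta


def parse_walltime(walltime: str) -> timedelta:
    """Parse a walltime string of the assumed form [D-]HH:MM:SS."""
    days = 0
    if "-" in walltime:
        # Optional leading day count, e.g. "2-00:00:00".
        day_part, walltime = walltime.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def fits_before_maintenance(now: datetime,
                            maintenance_start: datetime,
                            walltime: str) -> bool:
    """Return True if a job started now would finish before the shutdown."""
    return now + parse_walltime(walltime) <= maintenance_start


# Illustrative values only: a Monday morning, with a hypothetical
# maintenance window opening on Wednesday at 8AM.
now = datetime(2024, 1, 8, 9, 0)
maintenance_start = datetime(2024, 1, 10, 8, 0)

print(fits_before_maintenance(now, maintenance_start, "24:00:00"))    # can run
print(fits_before_maintenance(now, maintenance_start, "2-00:00:00"))  # would be held
```

In this example 47 hours remain before the shutdown, so the 24-hour request fits and the 48-hour request does not. In practice, requesting a shorter walltime that still covers the work is one way to let a job start before a maintenance window instead of waiting until it ends.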

Emergency Maintenance

Unavoidable (emergency) downtime may occur at almost any time. Such events are rare, and great effort is made to avoid them. When emergency maintenance is needed, the UITS unit responsible for the affected item will give users as much notice as possible and work to resolve the fault as quickly as possible.

Any emergency outages will be announced via email through the hpc-announce@list.arizona.edu mailing list.