Notices

 

Cluster Upgrade

Update 23rd Sept: the controller has been upgraded and job submission is enabled again; node availability will be intermittent while cluster components are upgraded one by one.
Please be patient while updates take place before and after 1st October 2025.
Apologies for the inconvenience; in the meantime, please use lab PCs or consult your supervisor about alternative GPU resources.

October 2025 planned disruption: new head node addresses (gpucluster2 and gpucluster3 may become inaccessible), external access via Zscaler Unified Access or from DoC Shell servers, reorganisation of partitions (see Step 6b), relaxation of some limits, and Fairshare priority management. Guide steps will be updated with the new information; please bookmark this page.

 

Introduction

What is Slurm and the GPU Cluster?

Slurm is an open-source job scheduling system for Linux that manages compute resources, in this case the department's GPU resources.

Using Slurm commands such as 'sbatch' and 'salloc', your scripts are executed on our pool of NVIDIA GPU Linux servers. Typical workloads include CUDA-based parallel computing for deep learning, machine learning and large language models (LLMs), using frameworks such as PyTorch, TensorFlow or JAX, among others.

Read this guide to learn how to:

  • connect to the submission host server and submit a test script
  • start an interactive job (connect directly to a GPU, reserved exclusively for you, for a limited time)
  • compose a shell script that uses shared storage, a Python environment, CUDA and your Python scripts (a brief sketch follows below)
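
To give a flavour of what is covered, the sketch below shows roughly what a minimal batch script might look like. The partition name, resource requests and paths are placeholders rather than the cluster's actual values; the steps later in this guide give the correct settings.

  #!/bin/bash
  # Minimal illustrative Slurm batch script. Partition, resources and paths
  # are placeholders; substitute the values given later in this guide.
  #SBATCH --job-name=test-job
  #SBATCH --partition=gpu            # placeholder partition name
  #SBATCH --gres=gpu:1               # request one GPU
  #SBATCH --time=01:00:00            # wall time limit of one hour
  #SBATCH --output=slurm-%j.out      # log file; %j expands to the job ID

  source /path/to/venv/bin/activate  # placeholder path to your Python environment
  python my_script.py                # placeholder script name

Such a script is submitted with 'sbatch', while 'salloc' (for example 'salloc --gres=gpu:1 --time=01:00:00') requests an interactive session instead; both are covered step by step below.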

Alternate layout (better markup): GitLab Pages version

Before you begin

Some familiarity with Department of Computing systems is desirable before using the GPU cluster:

  • logging in to DoC Lab PCs, especially NVIDIA GPU-equipped PCs (DoC Lab PCs)
  • remotely connecting to lab PCs and DoC Shell servers from a Linux/Mac/Windows Terminal (Shell server guide)
  • composing bash scripts (examples are provided below; explaining shell scripting is beyond the scope of this guide)
  • Python and virtual environments (pip, conda) (Python environments guide): users are expected to have a grounding in Python and its associated tools and frameworks, as modern GPU code relies on such knowledge (a brief sketch follows below)
  • Linux command line interface (Terminal, CLI)

Tip: make sure you have tested your Python scripts on your own device or a DoC Lab PC with a GPU before using the GPU cluster. Prior testing will help flag errors in your scripts before you use sbatch.
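
For reference, a typical environment-preparation workflow might look like the sketch below; the environment location, package names and script name are placeholders, so follow the Python environments guide for the recommended setup.

  # Illustrative only: environment path, packages and script name are placeholders
  python3 -m venv ~/venvs/myproject        # create a virtual environment
  source ~/venvs/myproject/bin/activate    # activate it
  pip install --upgrade pip
  pip install torch                        # install whichever framework your scripts need
  python my_script.py                      # quick local test before submitting to the cluster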


Follow Nuri's Guide for an introduction to using Linux in the Department of Computing.

Step by step

General Comments

ICT GPGPU resources

ICT, the central college IT services provider, has approximately one hundred CX1 cluster nodes with GPUs installed. These computational resources can be selected and used through PBS Pro job specifications.

Students cannot request access to this resource themselves, but project supervisors can apply on their students' behalf for access to the ICT GPU resources run by the Research Computing Service team:

Research Computing Service
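
For orientation, a PBS Pro GPU job specification looks broadly like the sketch below; the resource selections and values are assumptions for illustration only, so consult the Research Computing Service documentation for the correct directives.

  #!/bin/bash
  # Illustrative PBS Pro job script: resource values are assumptions;
  # see the Research Computing Service documentation for correct settings.
  #PBS -l select=1:ncpus=4:mem=24gb:ngpus=1
  #PBS -l walltime=24:00:00

  cd $PBS_O_WORKDIR        # run from the directory the job was submitted from
  python my_script.py      # placeholder script name

Jobs of this kind are submitted with 'qsub' rather than 'sbatch'.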

Other resources

If you do not need a GPU for your computation then please do not use the GPU Cluster. You could end up inconveniencing users who do need a GPU. Please instead consider:

Fair Usage Policy

The following policies are in effect on the GPU Cluster:

      • A job that runs for more than three days will be automatically terminated. This walltime restriction applies to everyone; configure checkpoints with your Python framework so that training can be resumed (see the sketch after this list).
      • Late 2025: FairShare scores will give higher priority to low-consumption users than to those requesting larger amounts of resources per job.
      • As with all departmental resources, any non-academic use of the GPU cluster is strictly prohibited.
      • Any users who violate this policy will be banned from further usage of the cluster and will be reported to the appropriate departmental and college authorities.
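
Checkpointing itself is configured inside your Python framework (for example PyTorch's torch.save and torch.load); the sketch below only illustrates how a job script might resume from an existing checkpoint, where the checkpoint path and the '--resume' option of the hypothetical train.py are placeholders.

  #!/bin/bash
  # Illustrative resume logic: the checkpoint path and the train.py --resume
  # option are hypothetical; your framework's checkpoint API does the real work.
  #SBATCH --gres=gpu:1
  #SBATCH --time=72:00:00              # stay within the three-day walltime limit

  CKPT=checkpoints/latest.pt
  if [ -f "$CKPT" ]; then
      python train.py --resume "$CKPT"   # continue from the last checkpoint
  else
      python train.py                    # start a fresh run
  fi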