Getting Started
Update: new queues - training; AMD7-A100-T; AMD7-A100-R (research)
Run up to four more scripts in addition to the normal two; 80GB A100 GPUs enabled (sbatch only)
Update 19/12/24
Please note: as of Thursday 19th December 2024, offsite/external password-based SSH authentication is not possible on the head node servers. Logging in from a Doc Lab PC works as before. Please visit the Shell Server guide on how to set up public / private key authentication (you must be onsite at Imperial in order to create SSH keys for external remote access) |
Introduction
What is Slurm and the GPU Cluster?
Slurm is an open-source job-scheduling system for Linux that manages compute resources - in this case, the department's GPU resources.
Using Slurm commands such as 'sbatch' and 'salloc', your scripts (for example, CUDA-based parallel computing for deep learning, machine learning and large language models (LLMs), using frameworks such as PyTorch, TensorFlow or JAX) are executed on our pool of NVIDIA GPU Linux servers.
Read this guide to learn how to:
- connect to the submission host server and submit a test script
- start an interactive job (connect directly to a GPU, reserved exclusively for you, for a limited time)
- compose a shell script that uses shared storage, a python environment, CUDA and your python scripts
Before you begin
Some familiarity with Department of Computing systems is desirable before using the GPU cluster:
- logging in to DoC Lab PCs, especially Nvidia GPU-equipped PCs (Doc Lab PCs)
- remotely connecting to lab PCs and Doc Shell servers from a Linux/Mac/Windows Terminal (Shell server guide)
- composing bash scripts (examples are provided below - it is beyond the scope of this guide to explain shell scripting)
- Python and virtual environments (pip, conda) (Python environments guide) - users are expected to have a grounding in Python and its associated tools and frameworks, as modern GPU code relies on such knowledge
- Linux command line interface (Terminal, CLI)
Tip: make sure you have tested your Python scripts on your own device or a DoC lab PC with a GPU before using the GPU cluster. Prior testing will help flag errors in your scripts before you use sbatch.
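For example, a quick sanity check you might run on a GPU lab PC before submitting anything (this sketch assumes PyTorch is installed in your environment; the import will differ for other frameworks, and my_training_script.py is a placeholder for your own code):
# on a GPU lab PC, inside your activated Python environment
nvidia-smi                                                    # confirm a GPU is visible
python3 -c "import torch; print(torch.cuda.is_available())"   # should print True
python3 my_training_script.py                                 # placeholder for your own code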
Follow Nuri's Guide for an introduction to using Linux in the Department of Computing.
Step by step
- 1a. Quick Start (submit from a DoC Lab PC)
- 1b. Quick Start (externally from a personal device)
- 1c. Quick Start (interactive shell using 'salloc')
- 2. Store your datasets under /vol/bitbucket
- 3. Creation of a Python virtual environment for your project (example)
- 4. Using CUDA (add to a script)
- 5. Example submission script
- 6. Connect to a submission host to send jobs
- 6b. GPU types
- Frequently Asked Questions
1a. Quick Start (submit from a DoC Lab PC)
Open a Terminal window on a lab PC (Ubuntu/macOS; on Windows 10/11 use PowerShell's built-in ssh, or WSL/WSL2), and type the following commands:
ssh gpucluster2.doc.ic.ac.uk
# or ssh gpucluster3.doc.ic.ac.uk
sbatch /vol/bitbucket/shared/slurmseg.sh
In this example, a user first logs into a Slurm head node (gpucluster2.doc.ic.ac.uk via ssh) and then submits a pre-existing script using the sbatch command. The output will be stored, by default, in the user's home directory (~/), with the filename slurm-XYZ.out where XYZ is the job number (or in whichever directory the user happens to run the sbatch command from).
If you have a bash script ready, replace /vol/bitbucket/shared/slurmseg.sh with the full path to your own script.
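For example, after submitting you might check progress and read the output like this (the job number 123456 is illustrative):
sbatch /vol/bitbucket/shared/slurmseg.sh   # prints e.g. "Submitted batch job 123456"
squeue --me                                # check whether the job is pending (PD) or running (R)
cat ~/slurm-123456.out                     # read the output once the job has finished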
1b. Quick Start (externally from a personal device)
Reminder: as of Thursday 19th December 2024, offsite/external password-based SSH authentication is not possible on the head node servers (see the update note above); you must set up public/private key authentication, and SSH keys for external remote access can only be created while onsite at Imperial.
If connecting from your own personal computer or device, make sure you have set up your SSH keys:
ssh gpucluster2.doc.ic.ac.uk sbatch /vol/bitbucket/shared/slurmseg.sh
gpucluster2.doc.ic.ac.uk and gpucluster3.doc.ic.ac.uk are now accessible from outside the college network, but if for some reason they are not accessible, use shell[1-5].doc.ic.ac.uk as a JumpHost:
ssh -J shell5.doc.ic.ac.uk gpucluster2.doc.ic.ac.uk
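If you connect this way often, you could add a JumpHost entry to your SSH client configuration so the hop happens automatically. A minimal ~/.ssh/config sketch (abc123 is a placeholder for your own college username):
# ~/.ssh/config on your personal device (abc123 is a placeholder username)
Host gpucluster2.doc.ic.ac.uk gpucluster3.doc.ic.ac.uk
    User abc123
    # hop via a DoC shell server when off-site
    ProxyJump abc123@shell5.doc.ic.ac.uk
With this in place, 'ssh gpucluster2.doc.ic.ac.uk' from your own device will route through shell5 without you typing -J each time.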
1c. Quick Start (interactive shell using 'salloc')
This 'interactive' method allows you to work as if you were using a terminal prompt on a lab PC with a GPU (for a maximum of four days).
Connect to gpucluster2 or gpucluster3 from a lab PC, or externally from your own device (remember to set up SSH keys as mentioned in Step 1b), then use 'salloc' to request your CPU, RAM and GPU resources.
Reminder: relinquish interactive jobs when you have finished so that other users can use the GPUs; too many idle interactive jobs reduce the efficiency and effectiveness of the service.
ssh gpucluster2.doc.ic.ac.uk
salloc --gres=gpu:1
salloc will drop you straight into your allocated node, as indicated by the shell prompt, e.g. myaccount@gpuvm01.
Run 'nvidia-smi' to show your allocated GPU. You can now commence writing your scripts and debugging with an Nvidia GPU
If you prefer to connect to your node later, for example in VSCode, use '--no-shell'
salloc --gres=gpu:1 --no-shell
Make a note of your allocated node, or run 'squeue --me' to see your existing jobs. Read the Shell server guide on how to use shell1-5 or gpucluster2/3 as jumphosts to connect directly to your node externally.
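A minimal sketch of the '--no-shell' workflow (the node name gpuvm01 is illustrative; use whichever node squeue reports for your job):
# on gpucluster2/3: request a GPU without opening a shell
salloc --gres=gpu:1 --no-shell
# find the node allocated to you (NODELIST column)
squeue --me
# connect to that node and confirm the GPU assigned to your job
ssh gpuvm01.doc.ic.ac.uk
nvidia-smi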
Connect to your running job via ssh
You can ssh directly to the node hosting your GPU, as long as your job is running in the queue. Type the following command either on the head node (gpucluster2 or gpucluster3) or in your running session (e.g. gpuvm01):
squeue --me
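The output will look something like the following (job ID, job name, partition and node are illustrative):
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12345     gpgpu interact username  R       5:23      1 gpuvm01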
Make a note of your node from the NODELIST column; the user in this case would type: ssh username@gpuvm01.doc.ic.ac.uk
You can also connect directly using IDEs such as VSCode - remember to run salloc first and find your node name. If reconnecting, ssh to gpucluster2 or 3 first and then ssh to your allocated node (or connect directly from a lab PC, or via the VPN).
2. Store your datasets under /vol/bitbucket
There is a department-wide network share, /vol/bitbucket, for data and virtual environment storage. Create your personal folder as follows:
mkdir -p /vol/bitbucket/${USER}
Read the detailed Python Virtual Environments guide for best practice in using /vol/bitbucket and creating virtual environments.
Tip: shared folders such as /vol/bitbucket or your home directory /homes/username are vital for getting your scripts running on the remote GPU cluster nodes. On your own laptop or computer you would store files on local storage, but for the GPU cluster make sure you copy all necessary files to shared storage, so your scripts can access them regardless of which server they run on.
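For example, to copy a dataset from your own device into your /vol/bitbucket folder you could go via a shell server. A sketch (abc123 and mydataset are placeholders for your username and data folder):
# run on your own laptop/PC
rsync -avP ./mydataset/ abc123@shell1.doc.ic.ac.uk:/vol/bitbucket/abc123/mydataset/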
3. Creation of a Python virtual environment for your project (example)
Here are some examples of how you might use /vol/bitbucket in the course of a GPU cluster project.
Please note: use a lab PC to prepare your Python environment; avoid running 'pip' or 'git' commands when logged in to gpucluster2.doc.ic.ac.uk or gpucluster3.doc.ic.ac.uk, or you may encounter 'out of space' errors.
Installation of Python Virtual Environment:
# connect to a random lab PC - remember to use a lab PC to create envs, use pip and git
ssh shell1.doc.ic.ac.uk
/vol/linux/bin/sshtolab
cd /vol/bitbucket/${USER}
python3 -m virtualenv /vol/bitbucket/${USER}/myvenv
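Once created, you might activate the environment and install the packages your project needs (the packages below are only an illustration):
# still on a lab PC, with the environment created as above
source /vol/bitbucket/${USER}/myvenv/bin/activate
pip install --upgrade pip
pip install torch numpy                                # placeholders: install whatever your project requires
python3 -c "import torch; print(torch.__version__)"   # quick sanity check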
Again, consult the Python Virtual Environment guide for more about managing virtual environments in your account.
There is a 'base' read-only environment, located at /vol/bitbucket/starter, with PyTorch and TensorFlow pre-installed using 'pip'; it may suffice when you first submit jobs. Enable it in scripts using 'source /vol/bitbucket/starter/bin/activate'.
Follow the previous steps when you need to create an environment using your specific required pip/conda packages.
4. Using CUDA (add to a script)
Most GPU jobs will make use of the Nvidia CUDA tool-kit. Multiple versions of the tool-kit are available under /vol/cuda (a network share), in numbered sub-directories for the different versions. If you need to use CUDA, please consult the README under any one of those directories.
Suppose that you want to use CUDA tool-kit version 12.0.0; add the following line(s) to your submission script:
If your shell is bash; note the initial dot-space (.␣)
. /vol/cuda/12.0.0/setup.sh
OR if your shell is (t)csh
source /vol/cuda/12.0.0/setup.csh
The script will set up your unix path to access commands such as nvcc.
If you are using frameworks such as TensorFlow, PyTorch or Caffe, make sure you have chosen a compatible version of the Nvidia CUDA tool-kit. For example, PyTorch comes in CPU and GPU flavours, and in builds for different versions of CUDA - sourcing the matching CUDA distribution from /vol/cuda will help reduce errors in your output.
Reminder: calling CUDA or installing CUDA libraries in your virtual environment does not guarantee successful GPU initialisation for your job. Always check your Python code and test on lab PCs, to increase the likelihood of correct GPU usage.
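As a quick check that your environment and the CUDA tool-kit line up, you could run something like the following after sourcing the setup script (a sketch; it assumes PyTorch is installed in your virtual environment):
. /vol/cuda/12.0.0/setup.sh
nvcc --version                 # confirms which CUDA tool-kit version is on your PATH
# print the CUDA build PyTorch was compiled against and whether a GPU is visible
python3 -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"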
5. Example submission script
Here is a template you can copy into a shell script to get started. Please adjust any paths that point to folders you have created.
IMPORTANT: This example assumes you have followed the previous steps and installed a Python environment using virtualenv as directed (extra lines may be needed if you use miniconda - check the example script below). Please adjust paths if you have an existing Python environment, or if you already load your environment in ~/.bashrc (note: sbatch does not load ~/.bashrc; source it as in the example script). Do not remove the leading '#' from the #SBATCH lines - keep them as below, and make sure the #SBATCH directives come directly after #!/bin/bash.
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --mail-type=ALL # required to send email notifications
#SBATCH --mail-user=<your_username> # required to send email notifications - please replace <your_username> with your college login name or email address
export PATH=/vol/bitbucket/${USER}/myvenv/bin/:$PATH
# the above path could also point to a miniconda install
# if using miniconda, uncomment the line below
# source ~/.bashrc
source activate
source /vol/cuda/12.0.0/setup.sh
/usr/bin/nvidia-smi
uptime
Remember to make your script executable (run this command in a shell, do not include it in your script):
chmod +x <script_name>.sh
Please note: environment variables from ~/.bashrc or ~/.cshrc are not loaded by sbatch-submitted scripts; source them as in the preceding script. Your script can access your own home directory, your /vol/bitbucket folder and shared volumes such as /vol/cuda.
Reminder: running sbatch scripts does not guarantee successful GPU initialisation for your job (nvidia-smi only confirms which GPU is assigned for your job). Always check your Python code and test on lab PCs, to increase the likelihood of correct GPU usage.
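Once the template runs cleanly, replace the trailing nvidia-smi/uptime commands with your own workload. A minimal sketch, assuming a hypothetical training script at /vol/bitbucket/${USER}/myproject/train.py:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --output=myjob%j.out   # optional: customise the output filename (see the FAQ)
export PATH=/vol/bitbucket/${USER}/myvenv/bin/:$PATH
source activate
source /vol/cuda/12.0.0/setup.sh
# train.py and the myproject folder are placeholders for your own code
python3 /vol/bitbucket/${USER}/myproject/train.py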
6. Connect to a submission host to send jobs
gpucluster2.doc.ic.ac.uk and gpucluster3.doc.ic.ac.uk are submission hosts for the GPU cluster, from where you run the sbatch command to send your scripts to the remote GPU host servers.
Here is an example of the steps involved in submitting your script as a Slurm job:
- Connect to a Slurm submission host (see Step 1b for connecting from your own laptop):
ssh gpucluster2.doc.ic.ac.uk
# or ssh gpucluster3.doc.ic.ac.uk
- Change to an appropriate directory on the host:
# this directory may already exist after Step 2
mkdir -p /vol/bitbucket/${USER}
cd /vol/bitbucket/${USER}
Now try running an example job. A simple shell-script has been created for this purpose. You can view the file with less, more or view. You can use the sbatch command to submit that shell-script to run it as a Slurm job on a GPU host:
sbatch /vol/bitbucket/shared/slurmseg.sh
If you have composed your own script, in your bitbucket folder for example, enter:
cd /vol/bitbucket/${USER}
sbatch /path_to_script/my_script.sh
Replace '/path_to_script/my_script.sh' with your actual script path and name.
- You can invoke the squeue command to see information on running jobs:
squeue
The results of sbatch will be written to the directory where the command was invoked, e.g. /vol/bitbucket/${USER}. The output filename takes the form slurm-XYZ.out - for example:
less slurm-XYZ.out
where XYZ is a unique Slurm job number. Visit the FAQ below to find out how to customise the job output name
6b. GPU types
The GPU hosts (or nodes) each contain:
Partition name (taught/research) | GPU | CPU | Qty
---|---|---|---
gpgpu / resgpu | Tesla A40 48GB | AMD EPYC | 7
gpgpuB / resgpuB | Tesla A30 24GB (from Apr 2025) | AMD EPYC | 20
gpgpuC / resgpuC | Tesla T4 16GB | Intel | 16
gpgpuD / resgpuD | Tesla T4 16GB | Intel | 26
AMD7-A100-T / AMD7-A100-R | Tesla A100 80GB | AMD EPYC 7/9 | 12
a16gpu / a16resgpu | Tesla A16 16GB | AMD EPYC 7 | 28
training | General purpose queue for additional scripts | |
For example, to target a T4 GPU (taught students):
sbatch --partition gpgpuC /path/to/script.sh
*Research/PhD users are automatically assigned the 'resgpu' versions of the above partitions, but from a smaller pool; please make use of the college HPC cluster for more resources.
If you request specific GPUs, e.g. the 48GB ones, expect to wait a while; use 'squeue --me --start' for an estimated start time. Consider whether your script really needs 48GB, or whether 24GB or 16GB will suffice.
Update: in addition to the two jobs described above, up to four more scripts can be submitted and run in the 'training' partition, like so:
sbatch --partition training --gres=gpu:1 /path/to/script
Rules: no interactive jobs, only one GPU per job, currently 16GB devices are available.
Frequently Asked Questions
- What GPU cards are installed on the GPU hosts?
Answer: Nvidia Tesla A30 (24GB RAM, split into 12GB instances), Tesla T4 (16GB RAM), Tesla A40 (48GB RAM) and Tesla A100 (80GB, split into 10GB instances).
- What are the general platform characteristics of the GPU hosts?
Answer: 24-core/48-thread Intel Xeon CPUs with 256GB RAM, and AMD EPYC 7702P 64-core CPUs.
- How do I see what Slurm jobs are running?
Answer: invoke any one of the following commands on gpucluster:
# List all your current Slurm jobs in brief format
squeue
# List all your current Slurm jobs in extended format.
squeue -l
Please run man squeue on gpucluster for additional information.
- How do I delete a Slurm job?
Answer: First, run squeue to get the Slurm job ID from the JOBID column, then run:
scancel <job ID>
You can only delete your own Slurm jobs.
- How many GPU hosts are there?
Answer: As of July 2023, there are nine host GPU servers, with eight running DoC Cloud GPU nodes.
- How do I analyse a specific error in the Slurm output file/e-mail after running a Slurm job?
Answer: If the reason for the error is not apparent from your job’s output, make a post on the Edstem CSG board , including all relevant information – for example:
- the context of the Slurm command that you are running. That is, what are you trying to achieve and how have you gone about achieving it? Have you created a Python virtual environment? Are you using a particular server or deep-learning framework?
- the Slurm script/command that you have used to submit the job. Please include the full paths to the scripts if they live under /vol/bitbucket
- what you believe should be the expected output.
- the details of any error message displayed. You would be surprised at how many forget to include this.
- I receive no output from a Slurm job. How do I go about debugging that?
Answer: This is an open-ended question. Please first confirm that your Slurm job does indeed generate output when run interactively. You may be able to use one of the 'gpu01-36' interactive lab computers to perform an interactive test. If you still need assistance, please follow the advice in the preceding FAQ entry.
- How do I customise my job submission options?
Answer: Add a Slurm comment directive to your job script – for example:
# To request 1 or more GPUs (default is 1):
#SBATCH --gres=gpu:1
# To request a 48GB Tesla A40 GPU:
#SBATCH --partition gpgpu
# Please note, there are only a few 48GB GPUs available, interactive jobs are not permitted
# For other GPUs, refer to 6b. GPU types, including the research equivalents of the above
# To receive email notifications
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your_username>
#Customise job output name
#SBATCH --output=<your_job_name>%j.out
- How do I run a job interactively?
Answer: Use srun and specify a GPU and any other resources, e.g. for a bash shell:
srun --pty --gres=gpu:1 bash
Update: use 'salloc' as detailed in Step 1c.
- I need a particular software package to be installed on a GPU host.
Answer: Have you first tried installing the package in a Python virtual environment or in your own home directory with the command:
pip install --user <packagename>
If the above options do not work, then make a post on the Edstem CSG board with details of the package that you would like to be installed on the GPU server(s). Please note: CSG can only install standard Ubuntu packages, and only if doing so does not conflict with any existing package or functionality on the GPU servers.
- My job is stuck in queued status, what does this mean?
Answer: This could be because all GPUs are in use. PD status occurs if you are already running two jobs; your job will run (R) when one of your previous tasks completes. (QOSMaxGRESPerUser) means you are using your maximum of two GPUs at any one time.
- When will my job start?
Run:
squeue --me --start
An estimate is listed based on the maximum runtime of current jobs, but your job may start sooner if jobs finish before their maximum end time, or 'walltime'.
- What are the CUDA compute capabilities for each GPU?
Please consult the NVIDIA Compatibility Index for more information.
The cluster GPUs support the following levels:
sm75 (T4), sm80 (A30, A100), sm86 (A40, A16)
These should be considered when, for example, using older versions of PyTorch and receiving 'not supported' Python errors.
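To check the compute capability of the GPU your job was actually given, something like the following could be added to a job script or run in an interactive session (a sketch; it assumes PyTorch is installed in your environment):
python3 -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"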
General Comments
The following policies are in effect on the GPU Cluster:
- Users can have only two running jobs (taught students); all other jobs will be queued until one of the two running jobs completes.
- An additional four 1-GPU jobs can be submitted using the 'training' partition, for a total of six across the cluster
- A job that runs for more than four days will be automatically terminated - this is a walltime restriction for taught students. Configure checkpoints with your Python framework so that training can resume (see the sketch after this list).
- As with all departmental resources, any non-academic use of the GPU cluster is strictly prohibited.
- Any users who violate this policy will be banned from further usage of the cluster and will be reported to the appropriate departmental and college authorities.
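A minimal sketch of a checkpoint-aware submission script for working within the walltime, assuming a hypothetical train.py that saves to and can resume from a checkpoint file (the paths and the --resume/--checkpoint-out flags are placeholders for whatever your framework provides):
#!/bin/bash
#SBATCH --gres=gpu:1
export PATH=/vol/bitbucket/${USER}/myvenv/bin/:$PATH
source activate
source /vol/cuda/12.0.0/setup.sh
CKPT=/vol/bitbucket/${USER}/myproject/checkpoint.pt
if [ -f "$CKPT" ]; then
    # a previous run hit the walltime: continue from the saved checkpoint
    python3 /vol/bitbucket/${USER}/myproject/train.py --resume "$CKPT"
else
    python3 /vol/bitbucket/${USER}/myproject/train.py --checkpoint-out "$CKPT"
fi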
ICT, the central college IT services provider, has approximately one-hundred CX1 cluster nodes which have GPUs installed. It is possible to select and use these computational resources through PBS Pro job specifications.
Students cannot request access to this resource, but project supervisors can apply - on behalf of their students - for access to ICT GPU resources run by the Research Computing Service team.
Other resources
If you do not need a GPU for your computation then please do not use the GPU Cluster. You could end up inconveniencing users who do need a GPU. Please instead consider:
- The departmental DoC Condor service
- The departmental batch servers:
Long Running Processes guide
Long Running Processes PDF link