A guide on how to submit GPU-enabled scripts to the departmental GPU cluster

UPDATED October 2019: Now using Slurm


1. What is Slurm and the GPU cluster?

Slurm is an open-source task-scheduling system used to manage the departmental GPU cluster. The GPU cluster is a pool of NVIDIA GPUs that can be used for machine learning with popular frameworks such as PyTorch and TensorFlow, or with any CUDA-based code. This guide shows you how to submit your GPU-enabled scripts to this shared resource.

2. Quick Start

Open a terminal session (Ubuntu/macOS; on Windows 10, use the ssh client built into PowerShell) and type the following commands:

ssh gpucluster.doc.ic.ac.uk
sbatch /vol/bitbucket/shared/slurmseg.sh

These commands first log you into the remote cluster controller (gpucluster.doc.ic.ac.uk) and then submit a pre-existing example script. The output will be stored in the root of your ~/ home directory, with a filename of the form slurm-{xyz}.out, where {xyz} is the Slurm job number.
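
Once the job has finished, you can inspect its output file from your home directory. For example (12345 below is just a placeholder for the job number that sbatch reports when you submit):

ls ~/slurm-*.out
less ~/slurm-12345.out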

Follow the next steps to learn how to prepare your own scripts for submission.

3. Store your GPU data under /vol/bitbucket

CSG have created a network share named /vol/bitbucket. Create your personal folder as follows:

mkdir -p /vol/bitbucket/${USER}

You can now store your training data and virtual environments here (storing virtual environments in your home folder is not advisable, as you may exceed your home quota).

Storing the output of the jobs that you run on the GPU cluster

Submit your Slurm job(s) from the appropriate directory – that is, /vol/bitbucket/${USER} – and configure them to write any output data to a location under that same directory at run time. By default, Slurm stores a job's standard output in the directory from which the job was submitted. It is sufficient, therefore, to submit a Slurm job on gpucluster after having run:

cd /vol/bitbucket/${USER}

Please be aware of these guidelines when using /vol/bitbucket:

  • The contents of /vol/bitbucket are not backed up. If you store critical data under /vol/bitbucket, please ensure that you keep a back-up of that data somewhere else.

  • It is not intended for the storage of long-term data.

  • There are currently no individual user quotas on /vol/bitbucket. Please do not abuse this: if /vol/bitbucket becomes full, all end-users will be affected and only a member of CSG will be able to resolve the situation (see the disk-usage example after this list).

  • If you use /vol/bitbucket, you must only store content under a top-level folder corresponding to your user-name or one of your group-names (for a group project).
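
To keep an eye on your own usage and on the free space remaining on the share, you can run, for example:

# how much space your own folder is using
du -sh /vol/bitbucket/${USER}

# how much free space is left on the whole share
df -h /vol/bitbucket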

4. Creation of a Python virtual environment for your project (example)

Here are some examples of how one might use /vol/bitbucket in the course of a GPU cluster project.

Installation of Miniconda:

cd /vol/bitbucket/${USER}
curl -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh -b -p /vol/bitbucket/${USER}/miniconda3
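
If you prefer a dedicated conda environment rather than the base one, you can create that under /vol/bitbucket too. This is only a sketch – the environment path, name and Python version below are examples, not requirements:

export PATH=/vol/bitbucket/${USER}/miniconda3/bin/:$PATH
# create a project-specific environment under /vol/bitbucket
conda create --prefix /vol/bitbucket/${USER}/envs/myproject python=3.7
# activate it (use this in place of the plain 'source activate' shown below)
source activate /vol/bitbucket/${USER}/envs/myproject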

Installation of a machine learning framework (python3):

# load your Python environment if not loaded already
# (following the example steps above; adjust for your own folder names)
export PATH=/vol/bitbucket/${USER}/miniconda3/bin/:$PATH
source activate
pip --version   # check your version of pip; pip3 is used automatically if your environment is Python 3.x
pip install tensorflow-gpu
pip install torch torchvision

Include the 'export' and environment-activation lines in your submission script, but avoid adding 'pip install' lines. Any modules you install into your environment during an (Ubuntu) desktop or SSH session will be available when you submit a job, as long as your script makes the same environment set-up calls.
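
To confirm that your environment can actually see a GPU, you can add a quick check to a job you submit (gpucluster itself has no GPU, so this must run via sbatch or srun on a GPU host). The one-liner below assumes you installed PyTorch as above:

# prints True when a CUDA device is visible to PyTorch
python -c "import torch; print(torch.cuda.is_available())"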

5. Using CUDA

Most GPU jobs will make use of the Nvidia CUDA tool-kit. Multiple versions of this tool-kit are available under the network share /vol/cuda, in numbered sub-directories, one per version. If you need to use CUDA, please consult the README under any one of those directories.

Suppose that you want to use CUDA tool-kit version 10.0.130; add the following line to your submission script:

If your shell is bash – note the initial dot-space (.␣):

. /vol/cuda/10.0.130/setup.sh

OR, if your shell is (t)csh:

source /vol/cuda/10.0.130/setup.csh

The script will set up your Unix PATH so that you can access commands such as nvcc.
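
After sourcing the setup script you can quickly confirm that the tool-kit is on your PATH, for example:

which nvcc        # should point somewhere under /vol/cuda/10.0.130
nvcc --version    # prints the CUDA compiler version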

If you are using frameworks such as TensorFlow, PyTorch and Caffe, please make sure that you set up a compatible version of the Nvidia CUDA tool-kit following the instructions above before you install and configure them.

6. Example Submission Script

Here is a template you can copy to a shell script to get started. Please adjust any paths that may point to folders you have created.

IMPORTANT: This example assumes you have followed the previous steps and installed a Python environment as directed. Please adjust the paths if you have your own Python environment, or if you already load your environment in ~/.bashrc. Do not remove the # signs; keep them exactly as shown below, and make sure the #SBATCH directives come directly after #!/bin/bash.

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --mail-type=ALL
# required to send email notifications
#SBATCH --mail-user=<your_username>
# required to send email notifications - please replace <your_username> with your college login name or email address
export PATH=/vol/bitbucket/${USER}/miniconda3/bin/:$PATH
source activate
source /vol/cuda/10.0.130/setup.sh
TERM=vt100 # or TERM=xterm

/usr/bin/nvidia-smi
uptime
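
In a real job you would typically replace (or follow) the nvidia-smi and uptime lines above with a call to your own program. The path below is purely hypothetical – adjust it to wherever your code lives:

python /vol/bitbucket/${USER}/my_project/train.py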


Remember to make your script executable (run this command in a shell, do not include it in your script):

chmod +x <script_name>.sh

Please note: whenever your job is submitted, any environment variables from ~/.bashrc or ~/.cshrc are also loaded, meaning your scripts run in the context of your user profile. Your script can access your own home directory, your /vol/bitbucket folder and shared volumes such as /vol/cuda.

7. Connect to gpucluster.doc.ic.ac.uk to submit jobs

Slurm is an open-source task-scheduler that CSG have installed on the server gpucluster and a number of GPU hosts. gpucluster.doc.ic.ac.uk is the main controller for the cluster, and you submit your compute jobs from gpucluster. The GPU hosts each contain a high-end graphics card – for example, an Nvidia GeForce GTX Titan Xp or an Nvidia Tesla. You cannot access the GPU hosts directly; instead, you submit your Slurm jobs through gpucluster, and gpucluster schedules your jobs to run on an available GPU.

Here is an example of the steps involved in submitting a Slurm job on gpucluster:

    1. Connect to the slurm controller:

      ssh gpucluster.doc.ic.ac.uk

    2. Change to an appropriate directory on gpucluster:

      # this directory may already exist after Step 3
      mkdir -p /vol/bitbucket/${USER}

      cd /vol/bitbucket/${USER}

    3. Now try running an example job. A simple shell-script has been created for this purpose. You can view the file with less, more or view. You can use the sbatch command to submit that shell-script so that it runs as a Slurm job on a GPU host:

      sbatch /vol/bitbucket/shared/slurmseg.sh

      If you have composed your own script in your bitbucket folder, for example, enter:

      cd /vol/bitbucket/${USER}
      sbatch my_script.sh

      Substitute 'my_script.sh' with your actual script name.
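
      When the job is accepted, sbatch reports the job number in a line similar to the following (the number itself will differ):

      Submitted batch job 12345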

    4. You can invoke the squeue command to see information on running jobs:

      squeue

    5. The results of sbatch will be written to the directory where the command was invoked, e.g. /vol/bitbucket/${USER}. By default, the output filename is derived from the Slurm job number – for example:

      less slurm-XYZ.out

      where XYZ is a unique Slurm job number. Visit the FAQ below to find out how to customise the job output name.

Please note: the server gpucluster is not to be used for computation. Please do not attempt to SSH and then run resource-intensive processes on gpucluster itself. The server only has one role:

  • Allow end-users to submit Slurm jobs to GPU-equipped servers using sbatch.

Note in particular that gpucluster does not have an Nvidia CUDA-capable card in it. This is deliberate. Do not be surprised if you SSH to gpucluster, set up a virtual environment and then, when you run a test on gpucluster itself, see an error message similar to the following:

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

8. Frequently Asked Questions

      1. What graphics cards are installed on the GPU hosts?
        Answer: Nvidia Tesla (12 GB RAM) and Nvidia GeForce GTX Titan Xp (12 GB RAM)

      2. What are the general platform characteristics of the GPU hosts?
        Answer: 24-core/48-thread Intel Xeon CPUs with 256 GB RAM.

      3. How do I see what Slurm jobs are running?
        Answer: invoke any one of the following commands on gpucluster:

        # List all your current Slurm jobs in brief format
        squeue
        # List all your current Slurm jobs in extended format.
        squeue -l

        Please run man squeue on gpucluster for additional information.
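
        To list only your own jobs, you can also run, for example:

        squeue -u ${USER}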

      4. How do I delete a Slurm job?
        Answer: First, run squeue to get the Slurm job ID from the JOBID column, then run:

        scancel <job ID>

        You can only delete your own Slurm jobs.

      5. How many GPU hosts are there?
        Answer: As of July 2019, there are two GPU servers with multiple GPUs. Only one GPU may be used per job.

      6. How do I analyse a specific error in the Slurm output file/e-mail after running a Slurm job?
        Answer: If the reason for the error is not apparent from your job’s output, then you need to e-mail doc-help@imperial.ac.uk, including all relevant information – for example:
        • the context of the Slurm command that you are running. That is, what are you trying to achieve and how have you gone about achieving it? Have you created a Python virtual environment? Are you using a particular server or deep learning framework?
        • the Slurm script/command that you have used to submit the job. Please include the full paths to the scripts if they live under /vol/bitbucket.
        • what you believe should be the expected output.
        • the details of any error message displayed. You would be surprised at how many forget to include this.

      7. I receive no output from a Slurm job. How do I go about debugging that?
        Answer: This is an open-ended question. Please first confirm that your Slurm job does indeed generate output when run interactively. You may be able to use one of the 'gpu01-29' interactive lab computers to perform an interactive test. If you still need assistance, please follow the advice in the preceding FAQ entry (number 6).

      8. How do I customise my job submission options?
        Answer: Add a Slurm comment directive to your job script – for example:

        # To request 1 or more GPUs (default is 1):
        #SBATCH --gres=gpu:1

        # To receive email notifications
        #SBATCH --mail-type=ALL
        #SBATCH --mail-user=<your_username>

        # Customise the job output name
        #SBATCH --output=<your_job_name>%j.out
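
        In the output name, %j is replaced by the Slurm job number; for example, the (hypothetical) directive below would produce files named like mytraining-12345.out:

        #SBATCH --output=mytraining-%j.out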

      9. How do I run a job interactively?
        Answer: Use srun, specifying a GPU and any other resources you need, e.g. for a bash shell:

        srun --pty --gres=gpu:1 bash
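
        You can request additional resources with standard Slurm options; for example (the CPU and memory figures below are only illustrative, and any site limits still apply):

        srun --pty --gres=gpu:1 --cpus-per-task=4 --mem=16G bash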

      10. I need a particular software package to be installed on a GPU host.
        Answer: Have you first tried installing the package in a Python virtual environment or in your own home directory with the command:

        pip install --user <packagename>

        If the above options do not work then please e-mail doc-help@imperial.ac.uk with details of the package that you would like to be installed on the GPU server(s). Please note: CSG are only able to install standard Ubuntu packages if doing so does not conflict with any existing package or functionality on all the GPU servers.

      11. My job is stuck in queued status. What does this mean?
        Answer: This could be because all GPUs are in use. Your job will show PD (pending) status if you are already running two jobs, and will change to R (running) when one of your previous jobs completes. The reason (QOSMaxGRESPerUser) means you are already using your maximum of two GPUs at any one time.
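
        You can see the reason Slurm reports for a pending job in the (REASON) column of squeue, for example:

        squeue -u ${USER} -l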

General Comments

Fair Usage Policy

The following policies are in effect on the GPU Cluster:

      • Users can have only two running jobs (taught students); all other jobs will be queued until one of the two running jobs completes.
      • A job that runs for more than four days will be automatically terminated - this is a walltime restriction for taught students. Configure checkpoints with your Python framework so that you can resume training after a restart.
      • As with all departmental resources, any non-academic use of the GPU cluster is strictly prohibited.
      • Any users who violate this policy will be banned from further usage of the cluster and will be reported to the appropriate departmental and college authorities.

ICT GPGPU resources

ICT, the central college IT services provider, has approximately one hundred CX1 cluster nodes with GPUs installed. It is possible to select and use these computational resources through PBS Pro job specifications.

Students cannot request access to this resource, but project supervisors can apply – on behalf of their students – for access to ICT GPU resources run by the Research Computing Service team:

Research Computing Service

Other resources

If you do not need a GPU for your computation then please do not use the GPU Cluster. You could end up inconveniencing users who do need a GPU. Please instead consider: