Genome informatics

The informatics activities of the Facility include the establishment and maintenance of high-performance computing (HPC) infrastructure; the development of HPC workflows for genomics data analysis; and data management.

Analysis Workflows

Our analysis workflows are built around established and well-maintained bioinformatics tools and resources.

Quality Control

We use FastQC for raw read quality control and Picard Tools for mapped read quality control.

Sequencing Read Mapping

Genomic sequencing reads are mapped to the reference sequence using BWA-MEM. The spliced read mapper TopHat2 is used to map reads from transcriptome sequencing.
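As a rough illustration of the two mapping steps above, the sketch below assembles command lines for BWA-MEM and TopHat2 without running them. The helper names, file names and thread counts are hypothetical, and the options shown (`-t` for BWA-MEM, `-p` and `-o` for TopHat2) are only the basic ones; the parameters the Facility actually uses are not stated here.

```python
import shlex
from typing import List


def _check_reads(reads: List[str]) -> None:
    # One FASTQ file for single-end data, two for paired-end data.
    if not 1 <= len(reads) <= 2:
        raise ValueError("expected one or two FASTQ files")


def bwa_mem_command(reference: str, reads: List[str], threads: int = 16) -> List[str]:
    """Build a BWA-MEM command for genomic reads (SAM is written to stdout)."""
    _check_reads(reads)
    return ["bwa", "mem", "-t", str(threads), reference, *reads]


def tophat2_command(index_prefix: str, reads: List[str], out_dir: str,
                    threads: int = 16) -> List[str]:
    """Build a TopHat2 command for transcriptome reads (spliced mapping)."""
    _check_reads(reads)
    return ["tophat2", "-p", str(threads), "-o", out_dir, index_prefix, *reads]


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(shlex.join(bwa_mem_command("ref.fa", ["s1_R1.fq", "s1_R2.fq"])))
    print(shlex.join(tophat2_command("ref_index", ["rna_R1.fq", "rna_R2.fq"], "tophat_out")))
```

Building the command as a list rather than a single string keeps file names with spaces intact when the list is later passed to something like `subprocess.run`.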

Variant Calling and Annotation

For alignment post-processing, variant calling and variant quality filtering we use the Genome Analysis Toolkit (GATK). Copy number variation is assessed with ExomeDepth. Variants are annotated with ANNOVAR. Our annotation includes variant information from the Human Gene Mutation Database (HGMD) Professional.

Differential Expression Analysis and Transcriptome Reconstruction

Our differential expression analysis is based on HTSeq for read counting and DESeq for differential expression calling. Differentially expressed gene sets are tested for functional enrichment using DAVID and GSEA.

Cancer Mutation Detection and Annotation

We use MuTect for point mutation calling and Somatic Indel Detector (GATK version 2.3-9) for detection of short somatic insertions/deletions. Larger genomic rearrangements are detected with CREST. The annotation of somatic mutations is carried out with Oncotator.

Metagenomic Profiling

We maintain analysis pipelines, based on IMSA and RINS, for identifying and classifying exogenous sequences against a host genomic background in transcriptome sequencing data.

Methylation Profiling

Our MeDIP-seq analysis pipeline uses MEDIPS for differential methylation calling.


The Facility works closely with the Imperial HPC Service and Data Centre to implement and maintain the HPC infrastructure required for the analysis of large-scale genomics datasets. On a compute cluster, we have dedicated access to 320 CPU cores across 20 HPC nodes, each with 16 cores and 128 GB of memory. We also have access to an SGI Altix shared-memory system with a total of 384 CPU cores and 5 TB of memory. Our data is stored on secure, backed-up storage at the Imperial Data Centre. In total, we have access to more than 150 TB of storage.
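The cluster figures above can be checked with a little arithmetic. The sketch below assumes that the 16 cores and 128 GB of memory are per node and that a job never spans more than one node. The `NodePool` class and the example job sizes are hypothetical and do not describe the scheduler actually in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodePool:
    nodes: int
    cores_per_node: int
    mem_gb_per_node: int

    @property
    def total_cores(self) -> int:
        return self.nodes * self.cores_per_node

    @property
    def total_mem_gb(self) -> int:
        return self.nodes * self.mem_gb_per_node

    def max_concurrent_jobs(self, cores_per_job: int, mem_gb_per_job: int) -> int:
        """Jobs that fit at once, limited per node by cores or memory, whichever runs out first."""
        per_node = min(self.cores_per_node // cores_per_job,
                       self.mem_gb_per_node // mem_gb_per_job)
        return self.nodes * per_node


# Figures taken from the paragraph above (per-node reading is an assumption).
CLUSTER = NodePool(nodes=20, cores_per_node=16, mem_gb_per_node=128)

if __name__ == "__main__":
    print(CLUSTER.total_cores)                 # 20 * 16 = 320
    print(CLUSTER.total_mem_gb)                # 20 * 128 = 2560
    print(CLUSTER.max_concurrent_jobs(4, 32))  # 4 per node -> 80
    print(CLUSTER.max_concurrent_jobs(8, 96))  # memory-bound: 1 per node -> 20
```

Under these assumptions the memory request, not the core count, is often what limits how many jobs run side by side, as the last example shows.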