Genome Informatics

The informatics activities of the Facility include the establishment and maintenance of high-performance computing (HPC) infrastructure; the development of HPC workflows for genomics data analysis; and data management.

Analysis Workflows

Our analysis workflows are built around established and well-maintained bioinformatics tools and resources.

Quality Control

We use FastQC for raw read quality control and Picard Tools for mapped read quality control.

Sequencing Read Mapping

Genomic sequencing reads are mapped to the reference sequence using BWA-MEM. The spliced read mapper TopHat2 is used to map reads from transcriptome sequencing.
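
As a rough illustration of the two mapping steps above, the following sketch assembles the command lines that could be passed to BWA-MEM and TopHat2 from a Python wrapper. The helper function names, file names, output directory and thread counts are hypothetical placeholders and do not describe the Facility's actual pipeline settings. Only the basic argument order of `bwa mem` and `tophat2` follows the tools' documented usage.

```python
from typing import List, Optional


def bwa_mem_command(reference: str, reads_1: str,
                    reads_2: Optional[str] = None,
                    threads: int = 1) -> List[str]:
    """Build a `bwa mem` command for genomic reads.

    BWA-MEM writes SAM records to standard output, so the caller is
    expected to redirect or pipe the output when running the command.
    """
    cmd = ["bwa", "mem", "-t", str(threads), reference, reads_1]
    if reads_2 is not None:
        # Second mate file for paired-end data.
        cmd.append(reads_2)
    return cmd


def tophat2_command(bowtie2_index: str, reads_1: str,
                    reads_2: Optional[str] = None,
                    output_dir: str = "tophat_out",
                    threads: int = 1) -> List[str]:
    """Build a `tophat2` command for spliced mapping of RNA-seq reads."""
    cmd = ["tophat2", "-o", output_dir, "-p", str(threads),
           bowtie2_index, reads_1]
    if reads_2 is not None:
        cmd.append(reads_2)
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(bwa_mem_command("ref.fa", "dna_1.fq", "dna_2.fq", threads=8)))
    print(" ".join(tophat2_command("ref_index", "rna_1.fq", "rna_2.fq", threads=8)))
```

Returning the command as a list of arguments rather than a single string means it can be handed to `subprocess.run` without shell quoting issues.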

Variant Calling and Annotation

For post-processing, variant calling and variant quality filtering, we use the Genome Analysis Toolkit (GATK). Copy number variation is assessed with ExomeDepth. Variants are annotated with ANNOVAR. Our annotation includes variant information from the Human Gene Mutation Database (HGMD) Professional.

Differential Expression Analysis and Transcriptome Reconstruction

Our differential expression analysis is based on HTSeq for read counting and DESeq for differential expression calling. Differentially expressed gene sets are tested for functional enrichment using DAVID and GSEA.

Cancer Mutation Detection and Annotation

We use MuTect for point mutation calling and Somatic Indel Detector (GATK version 2.3-9) for detection of short somatic insertions/deletions. Larger genomic rearrangements are detected with CREST. The annotation of somatic mutations is carried out with Oncotator.

Metagenomic Profiling

We maintain analysis pipelines, based on IMSA and RINS, for identifying and classifying exogenous sequences against a host genomic background in transcriptome sequencing data.

Methylation Profiling

Our MeDIP-seq analysis pipeline uses MEDIPS for differential methylation calling.


The Facility works closely with the Imperial HPC Service and Data Centre to implement and maintain the HPC infrastructure required for the analysis of large-scale genomics datasets. We have dedicated access to 320 CPU cores on a compute cluster, spread across 20 HPC nodes, each with 16 cores and 128GB of memory. In addition, we have access to an SGI Altix shared memory system with a total of 384 CPU cores and 5TB of memory. Our data is stored on secure, backed-up storage at the Imperial Data Centre. In total, we have access to more than 150TB of storage.
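
The cluster figures above can be checked with some simple arithmetic. The sketch below assumes that the 16 cores and 128GB of memory apply to each of the 20 nodes and that a job cannot span nodes. The class name, the derived totals and the example job size are illustrative only and are not figures published by the Facility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterAllocation:
    """Dedicated compute allocation, assuming identical nodes."""
    nodes: int
    cores_per_node: int
    memory_gb_per_node: int

    @property
    def total_cores(self) -> int:
        return self.nodes * self.cores_per_node

    @property
    def total_memory_gb(self) -> int:
        return self.nodes * self.memory_gb_per_node

    @property
    def memory_gb_per_core(self) -> float:
        return self.memory_gb_per_node / self.cores_per_node

    def max_concurrent_jobs(self, cores_per_job: int, memory_gb_per_job: int) -> int:
        """Jobs that fit at once if each job must run on a single node."""
        per_node = min(self.cores_per_node // cores_per_job,
                       self.memory_gb_per_node // memory_gb_per_job)
        return per_node * self.nodes


cluster = ClusterAllocation(nodes=20, cores_per_node=16, memory_gb_per_node=128)

if __name__ == "__main__":
    print(cluster.total_cores)         # 20 * 16 = 320, matching the stated core count
    print(cluster.total_memory_gb)     # 20 * 128 = 2560 (derived, assumes per-node memory)
    print(cluster.memory_gb_per_core)  # 128 / 16 = 8.0
    # Hypothetical job needing 4 cores and 48GB: memory limits each node to 2 jobs.
    print(cluster.max_concurrent_jobs(cores_per_job=4, memory_gb_per_job=48))  # 2 * 20 = 40
```

Under these assumptions, memory-hungry jobs are limited by memory per node rather than by core count, which is the kind of workload the shared memory system is better suited to.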


Human Gene Mutation Database (HGMD) Professional

HGMD Professional is a curated database of human inherited disease mutations. It is widely used in genetics research to facilitate the prioritisation and interpretation of variants. We have a departmental licence for HGMD Pro, enabling us to include HGMD annotations in our variant annotation service. In addition, we have a licence for the online version of the database, which allows individual investigators to query the database in order to analyse candidate genes for disease linkage and predisposition. Please contact us if you would like to access HGMD Pro.

UCSC Genome Browser Mirror

The UCSC Genome Browser is a widely used interactive website for browsing the genome sequences of human and model organisms. Sequence information is integrated with a comprehensive catalogue of annotations, including variant information, phenotype data, and gene regulation data from ENCODE, making it an ideal tool for interrogating variant, expression and ChIP data. We maintain a local mirror of the UCSC Genome Browser for the human reference genome sequence, allowing our users to browse their data sets in the context of genomic information without having to upload information to external servers.