Software

Resources

Licenses

The project has licenses for a number of software packages and databases for use by members of the project.

INGENUITY PATHWAY ANALYSIS

Unlimited datasets, 4 concurrent users, 5 years

IPA is a web-based software application for the analysis, integration, and interpretation of data derived from ‘omics experiments, such as RNAseq, small RNAseq, microarrays (including miRNA and SNP), metabolomics, proteomics, and small-scale experiments that generate gene and chemical lists. Its analysis and search tools help uncover the significance of data and identify new targets or candidate biomarkers within the context of biological systems.

PARTEK

1 floating license, 3 years

Partek Genomics Suite is a desktop analysis package with built-in workflows for a variety of genomic applications, which guide researchers through every step of the analysis process. Integrated workflows include: (1) Gene expression, miRNA expression, Exon expression, Copy number, Allele specific copy number, Loss of Heterozygosity (LOH), Association, Trio and ChIP-chip for microarray data; (2) RNA-Seq, miRNA-Seq, ChIP-Seq, DNA-Seq and Methylation for next generation sequencing data.

SPOTFIRE

5 licenses, 5 years

Spotfire is a visualization tool that uses in-memory processing and good user interface design to develop highly interactive displays of data. Business intelligence reporting is intended to improve business and financial analytics. Spotfire's interactive and highly visual analytical environment helps achieve this with a self-configuring visual data analysis environment that lets users query, visualize and explore data in real time.

MATLAB

Distributed computing (128 cores), 3 years, and toolboxes

MATLAB (matrix laboratory) is a numerical computing environment and programming language. It allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages. Optional packages allow access to symbolic computing abilities, graphical multi-domain simulation and model-based design for dynamic and embedded systems.

KEGG

Non-commercial license, 5 years

KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.

USEARCH

40 cores, perpetual

USEARCH is a unique sequence analysis tool with thousands of users world-wide. It offers search and clustering algorithms that are often orders of magnitude faster than BLAST.

Virtual machines

Open Nebula

The Open Nebula cluster is currently used to run a variety of infrastructure services for UK Med-Bio, including the help-desk software (RT) and web access gateways. In addition, it lets us provide VMs that host specific services on behalf of UK Med-Bio, available both internally and externally. It is also used to offer sand-boxed development and pre-production environments for our software developers.

Examples of current internal-use hosted VMs include:

Externally available resources hosted on our Open Nebula cluster include:

  • CorrMapper, written by Dr Daniel Homola in the Department of Surgery and Cancer, helps to explore, integrate and visualise the data of complex biological studies. It requires one or two OMICS datasets, along with a metadata table, which holds clinically or biologically relevant information about these samples.

A full list of services is available from the Tools and services webpage.

Galaxy

We will shortly be making available our own local Galaxy server VM. Galaxy is a widely used workflow and pipeline system, developed for bioinformatics applications. Workflows can be created, re-run and shared via a web interface, and the open source project has a very active user community. You can find out more about the Galaxy project itself from its site. Initially, we will focus on common NGS workflows (e.g. for primary RNA-Seq analyses) but can add specific software and additional workflows from the Galaxy ToolShed as requested.

The server has been configured to support large-scale analyses as follows:

  • support for multi-threading jobs
  • a large selection of centrally-maintained reference databases available
  • use of Pulsar to push jobs requiring large compute resources (RAM, number of cores) out to our nodes on the HPC cluster
  • mounting of additional scratch working space
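
The routing idea in the list above (large jobs pushed out to HPC nodes, smaller ones kept local) can be sketched as a simple threshold rule. This is only an illustration: the thresholds, the `JobRequest` type and the destination names `local` and `pulsar_hpc` are hypothetical, and the code is not Galaxy's or Pulsar's own API or our actual configuration.

```python
from dataclasses import dataclass

# Hypothetical limits for what the Galaxy VM runs itself; the real values
# used on the server are not stated on this page.
LOCAL_MAX_CORES = 8
LOCAL_MAX_RAM_GB = 32.0


@dataclass(frozen=True)
class JobRequest:
    """Resources a tool run asks for (illustrative type, not a Galaxy class)."""
    cores: int
    ram_gb: float


def choose_destination(job: JobRequest) -> str:
    """Return a destination label: HPC via Pulsar for large jobs, else local."""
    if job.cores > LOCAL_MAX_CORES or job.ram_gb > LOCAL_MAX_RAM_GB:
        return "pulsar_hpc"  # hypothetical destination name
    return "local"           # hypothetical destination name


if __name__ == "__main__":
    print(choose_destination(JobRequest(cores=4, ram_gb=16)))
    print(choose_destination(JobRequest(cores=32, ram_gb=256)))
```

A job that exceeds either limit is sent to the HPC nodes, so a single-core job with a very large memory footprint is routed the same way as a many-core job.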

If you are interested in developing and running your own Galaxy workflows, or want a private instance for a specific project, please get in touch. We can offer a ‘vanilla’ Galaxy VM suitable for development of specific project workflows, particularly if they require other systems such as Docker.

Other servers

Other servers are currently in use as standalone interactive work nodes, both for guaranteed fast turnaround of specific jobs and for profiling large jobs (cores, RAM, runtime) before resources are allocated for repeated runs on HPC resources.
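
As a rough illustration of the kind of profiling mentioned above, the following sketch runs a command once and reports its runtime, CPU time, average core usage and peak memory using only the Python standard library. It assumes a Unix-like work node (the `resource` module is not available on Windows); the function name `profile_command` and the returned field names are hypothetical, not part of any tool we provide.

```python
import resource
import subprocess
import sys
import time


def profile_command(cmd):
    """Run cmd (a list of arguments) and return basic resource figures."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # CPU time (user + system) consumed by the child process(es) of this run.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return {
        "exit_code": completed.returncode,
        "wall_seconds": wall,
        "cpu_seconds": cpu,
        # CPU time divided by wall time approximates the cores kept busy.
        "avg_cores": cpu / wall if wall > 0 else 0.0,
        # Peak resident set size of the largest child waited for so far in
        # this Python process: kilobytes on Linux, bytes on macOS.
        "peak_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Example: profile a trivial Python one-liner.
    print(profile_command([sys.executable, "-c", "sum(range(10**6))"]))
```

Figures from a single run like this are only a starting point for an HPC resource request; memory and runtime often vary with input size, so it is worth profiling a representative dataset.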

Shared databases

Many types of analysis require access to reference databases and curated datasets. These are frequently large, requiring considerable download times and disk space, and typically need to be indexed in a way appropriate to the intended analysis method. Indexing alone will often take over 12 hours for a mammalian genome, and the whole process can take 2-3 days. At present, this work is generally repeated by every researcher whose analysis needs the reference datasets.

To help reduce the overhead associated with obtaining and preparing reference datasets, we are establishing a centralised collection of reference datasets which are available to researchers across the college. These are accessible on UK Med-Bio, HPC and BDSG systems, and remotely via NFS and HTTP.

Databases are automatically updated weekly where appropriate.

The following datasets are currently available:

  • Illumina iGenomes (reference database indexed for BWA, bowtie and bowtie2 searches; FASTA format chromosome files; annotations as e.g. GTF files). A full list of the datasets available through iGenomes is available at http://support.illumina.com/sequencing/sequencing_software/igenome.html. Of these, we currently hold:
    • Human GRCh37 (Ensembl annotation)
    • Human GRCh38 (NCBI annotation)
    • Mouse GRCm38 (Ensembl annotation)
    • C. elegans WS220 (Ensembl annotation)
    • E. coli K12 DH10B (Ensembl annotation)
  • Human genome resources: additional reference datasets related to the human genome
    • 1000 genomes variant calls
    • GATK resource bundle (including dbSNP, 1000 genomes, HapMap and gold standard indels, preformatted for use with the Genome Analysis ToolKit (GATK))
  • KEGG pathway databases - available to registered users only
  • BLAST databases:
    • NCBI NR
    • NCBI NT
    • RefSeq Genomic
    • RefSeq RNA
    • RefSeq protein
    • UniProt Swiss-Prot
    • UniProt TrEMBL
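
As an illustration of how the iGenomes holdings listed above might be referenced from a script, the sketch below builds paths following the directory layout in which iGenomes is commonly distributed (species / source / build, then `Sequence` and `Annotation` sub-directories). The root directory `/path/to/shared/igenomes` is a placeholder, since the actual mount point is not given here, and the helper `igenomes_path` is hypothetical; check the layout against the local copy before relying on it.

```python
from pathlib import Path

# Placeholder root: substitute the real location of the shared collection.
REFERENCE_ROOT = Path("/path/to/shared/igenomes")

# Relative locations in the commonly distributed iGenomes layout
# (an assumption about the local copy, not confirmed by this page).
RESOURCE_PATHS = {
    "bwa": "Sequence/BWAIndex/genome.fa",
    "bowtie": "Sequence/BowtieIndex/genome",
    "bowtie2": "Sequence/Bowtie2Index/genome",
    "fasta": "Sequence/WholeGenomeFasta/genome.fa",
    "gtf": "Annotation/Genes/genes.gtf",
}


def igenomes_path(species, source, build, kind, root=REFERENCE_ROOT):
    """Return the expected path of one resource for a given genome build."""
    if kind not in RESOURCE_PATHS:
        raise ValueError(f"unknown resource kind: {kind!r}")
    return Path(root) / species / source / build / RESOURCE_PATHS[kind]


if __name__ == "__main__":
    # Bowtie2 index prefix for Human GRCh37 with Ensembl annotation.
    print(igenomes_path("Homo_sapiens", "Ensembl", "GRCh37", "bowtie2"))
```

Pointing an aligner at a path built this way avoids each researcher downloading and indexing their own copy of the genome.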

The current holdings are a minimal set that we believe to be frequently used at present. If you require access to a reference dataset not listed here, please contact medbio-help@imperial.ac.uk and let us know your requirements.