Variation within genomic sequences can be exhibited as a phenotype in an individual. Such variation underlies many of the differences observed between members of a species, but is also associated with many diseases. Some of these are fairly readily identifiable and follow distinct lines of heredity, whereas others follow much weaker relationships. There are four types of variation typically considered:
- SNVs - Single nucleotide variants (also known as SNPs - single nucleotide polymorphisms)
- INDELs - Insertions and Deletions
- SVs - Structural Variations, i.e. large scale reorganisations
- CNVs - Copy Number Variations
Of these, SNVs are the most commonly researched, and consequently more tools are available for identifying them than for other variant types. Many tools can identify more than one type of variant, but SNVs and short INDELs require different approaches from larger scale rearrangements, so methods designed for the former are not appropriate for the latter.
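As a rough illustration of how these categories relate to the alleles reported by a variant caller, the sketch below labels a REF/ALT allele pair by the size of the change. The function name and the 50 bp boundary between INDELs and SVs are assumptions made for this example, not definitions from any tool discussed here, and CNVs cannot be recognised from a single allele pair in this way.

```python
def classify_variant(ref, alt, sv_min_length=50):
    """Label one REF/ALT allele pair by the size of the change.

    sv_min_length is an assumed cut-off separating short INDELs from
    larger structural changes; adjust it to suit your own analysis.
    """
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    length_change = abs(len(ref) - len(alt))
    if length_change == 0:
        # Same length but more than one base: a multi-base substitution.
        return "substitution"
    if length_change >= sv_min_length:
        return "SV"
    return "INDEL"


labels = [
    classify_variant("A", "G"),             # single base change
    classify_variant("AT", "A"),            # one base deleted
    classify_variant("A", "A" + "T" * 60),  # 60 bases inserted
]
```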
Variant Identification Software
Many online tutorials for NGS variant identification use samtools, a set of utilities for interacting with SAM/BAM format alignments, in conjunction with bcftools. These offer a straightforward method for identifying both SNVs and INDELs. Variants will typically require a considerable degree of filtering based on quality and depth to reduce the number of false-positive calls, which is typically considerably higher for INDELs than for SNVs. Samtools/bcftools work on both haploid and diploid organisms, although the allele frequencies reported are based on diploids.
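The kind of quality and depth filtering described above can be sketched in a few lines of Python operating on uncompressed VCF text. This is a minimal illustration rather than a replacement for a dedicated filtering tool: the function names are hypothetical, the cut-offs (QUAL of at least 30, DP of at least 10) are arbitrary assumptions, and the example records are made up.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=35;MQ=60' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info


def passes_filter(vcf_line, min_qual=30.0, min_depth=10):
    """Return True if a VCF data line meets the assumed QUAL and DP cut-offs."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]  # column 6 of a VCF record is QUAL
    if qual == ".":
        return False  # missing quality: treat as failing
    info = parse_info(fields[7])  # column 8 is INFO
    depth = int(info.get("DP", 0))
    return float(qual) >= min_qual and depth >= min_depth


def filter_vcf_lines(lines, min_qual=30.0, min_depth=10):
    """Keep header lines plus the data lines that pass the filter."""
    return [
        line for line in lines
        if line.startswith("#") or passes_filter(line, min_qual, min_depth)
    ]


# Made-up records: one good SNV, one low-quality INDEL, one low-depth SNV.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50.0\t.\tDP=35",
    "chr1\t200\t.\tAT\tA\t12.0\t.\tDP=40",
    "chr1\t300\t.\tC\tT\t60.0\t.\tDP=4",
]
filtered = filter_vcf_lines(example)
```

Only the record at position 100 survives alongside the two header lines. In practice, appropriate thresholds depend on sequencing depth and on the caller used, and INDELs may warrant stricter values than SNVs.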
The Genome Analysis Toolkit (GATK) was developed by the Broad Institute for the 1000 Genomes project and provides a comprehensive suite of tools for variant analysis. It includes sophisticated tools for filtering variant calls, which result in a low false positive rate and correspondingly high positive predictive value. Completing an analysis using GATK can be a daunting prospect, however. A Best Practices Guide is available which describes the various measures to take to ensure the best variant calls are obtained. These include additional preprocessing stages such as marking duplicate reads, realignment of reads around INDELs (requiring some prior knowledge of INDELs in the organism in question) and recalibration of base quality scores based on empirical measurement. Two variant calling algorithms are available: the more conventional UnifiedGenotyper, which is purely alignment based and targeted at diploid organisms, and the HaplotypeCaller, which carries out de-novo assembly around variant loci and is capable of working with non-diploid species. Following variant calling, a machine-learning based system can be used to attempt to reduce false positive calls. Very good results can be obtained using GATK, although it takes considerable effort to run and requires existing genetic resources. A bundle of appropriately formatted databases (e.g. dbSNP, HapMap) is provided for human variant analysis, and is available on BSS resources.
Most variant callers rely solely on read alignment to the reference genome; however, this makes variant identification susceptible to errors in read mapping around divergent sequence regions or INDELs. Platypus combines read alignment with local de-novo assembly around variant loci, which can cope better with divergent regions since it is not dependent upon alignment to the reference. It can also carry out population-level calling, where the occurrence of variants in closely related samples can be used to help inform identification in regions of low coverage, for example. Overall, variant calls with Platypus show high sensitivity and specificity, comparable with those obtained with GATK, but without requiring the same degree of preprocessing or reference datasets, and the analysis is run in a single stage. As with samtools, Platypus can be used with haploid organisms, although it reports diploid allele frequencies.
Isaac is a combined read aligner and variant caller from Illumina which is optimised to make full use of modern compute resources, enabling considerable speed improvements in both alignment and variant calling when compared to BWA and GATK. The aligner outputs a sorted, duplicate-marked BAM file directly, whereas conventional methods require the large alignments to be read and written multiple times. Variant calling uses a Bayesian method that computes probabilities over diploid genotypes. The cost of the high-speed alignment is the algorithm's memory usage: 48 GB of RAM is required to hold the indexed human genome in memory whilst carrying out the alignment.
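To give a feel for what computing probabilities over diploid genotypes involves, here is a deliberately simplified sketch. It is not Isaac's actual model: it assumes a single biallelic site, a binomial model of read counts, a fixed per-read error rate and flat priors, all of which are assumptions chosen for this example.

```python
from math import comb


def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior probability of each diploid genotype given read counts.

    Simplified binomial model (an assumption for illustration only):
    the chance that a read shows the alternate allele is error_rate for
    0/0, 0.5 for 0/1 and 1 - error_rate for 1/1.
    """
    if priors is None:
        priors = {"0/0": 1 / 3, "0/1": 1 / 3, "1/1": 1 / 3}  # flat prior
    alt_prob = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    total_reads = ref_count + alt_count

    unnormalised = {}
    for genotype, p in alt_prob.items():
        likelihood = (
            comb(total_reads, alt_count) * p**alt_count * (1 - p) ** ref_count
        )
        unnormalised[genotype] = likelihood * priors[genotype]

    total = sum(unnormalised.values())
    return {gt: value / total for gt, value in unnormalised.items()}


# 12 reads support the reference allele and 10 support the alternate.
posteriors = genotype_posteriors(ref_count=12, alt_count=10)
best = max(posteriors, key=posteriors.get)
```

With roughly balanced read support, the heterozygous genotype `0/1` receives almost all of the posterior probability. Real callers also account for base and mapping qualities, which this sketch ignores.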
Breakdancer is capable of detecting insertions, deletions, inversions and translocations, both inter-chromosomal and intra-chromosomal. It requires aligned reads in a BAM file which have been tagged with read groups, and uses two separate algorithms for detection of short INDELs and larger scale reorganisations. Breakdancer can be used to identify structural variations in tumour/normal pairs from cancer samples and segregating variations in populations.
Pindel can detect a range of structural variations including deletions, insertions, inversions and tandem duplications. It requires paired reads mapped against a reference genome in a BAM file. It first identifies read pairs potentially associated with rearrangements: those whose reads contain INDELs, and those where only one read of the pair is mapped, since the unmapped read may span a breakpoint. Using the mapped read as a known starting point, it then searches for the location where the unmapped read should go, splitting it across the breakpoint.
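The split-read idea can be illustrated with a toy example. The sketch below is a naive exhaustive search for a simple deletion using exact string matching; it is not Pindel's own search algorithm, and the function name, the `min_anchor` parameter and the sequences are all invented for illustration.

```python
def find_deletion_split(reference, read, search_start, search_end, min_anchor=3):
    """Try to place `read` in the reference as two pieces flanking a deletion.

    Searches only within reference[search_start:search_end], standing in
    for the window around the mapped mate. Returns a tuple of
    (deletion_start, deletion_length), or None if no split placement exists.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = reference.find(left, search_start, search_end)
        while left_pos != -1:
            left_end = left_pos + split
            # The right piece must start at least one base after the left piece.
            right_pos = reference.find(right, left_end + 1, search_end)
            if right_pos != -1:
                return left_end, right_pos - left_end
            left_pos = reference.find(left, left_pos + 1, search_end)
    return None


reference = "GATTACAGGCTTCCATGAACGTTAGC"
# A read from a sample lacking reference[10:16] ("TTCCAT"):
read = reference[4:10] + reference[16:22]
result = find_deletion_split(reference, read, 0, len(reference))
# result is (10, 6): a 6 bp deletion starting at 0-based position 10.
```

Restricting the search to a window near the mapped read is what keeps this kind of search tractable on a real genome; here the window is simply the whole toy reference.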