Read alignment is one of the most common processes applied to high-throughput sequence data, being one of the first stages required for many different types of analysis. The number of available read alignment algorithms continues to grow, with over 70 separate packages now available, which makes selecting the best one something of a minefield. The majority of these packages have been made available since 2008, evolving as sequencing technology and algorithmic methods have matured.
The goal of read alignment is to map comparatively short sequencing reads efficiently to a large reference genome, identifying the 'correct' genomic locus from which each read originated whilst tolerating errors in the sequence reads and aligning correctly around variations between the reference genome and the sequenced sample. Some types of analysis add further complications, such as RNA-Seq, where reads may be split across long introns, or bisulphite sequencing for methylation analysis; these can benefit from aligners targeted at the analysis in question.
General Purpose Aligners
The Burrows-Wheeler Aligner (BWA) uses an FM-index of the reference genome, allowing the presence of a query sequence within the reference sequence to be determined in time linear in the query length. Both single and paired read libraries are supported through three different algorithms targeted at differing read lengths. Short reads (<100bp) should use the bwa-backtrack algorithm, in which each read file of a library is first aligned separately against the reference genome to identify the optimal mapping of each read. Alignments are then generated in a second step for either single-ended or paired-read libraries, the latter also requiring the association between the paired reads to be established. Longer reads are supported by the bwa-sw algorithm, which carries out an initial seed alignment using the FM-index, followed by Smith-Waterman alignment extension. Finally, the most recent bwa-mem algorithm also supports longer reads (up to 1Mb in length), but produces local alignments and offers speed and accuracy improvements over bwa-sw.
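The FM-index lookup described above can be illustrated with a toy backward search. This is a minimal sketch, not BWA's implementation: the function names (bwt_from_text, build_fm_index, backward_search, find_exact) are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored uncompressed, so it is only suitable for very small inputs. It handles exact matches only.

```python
from collections import Counter


def bwt_from_text(text):
    """Return (suffix array, BWT) of text + '$', built by naive sorting."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character for each sorted suffix is the one preceding it.
    bwt = "".join(text[i - 1] for i in sa)
    return sa, bwt


def build_fm_index(bwt):
    """Return the C table and an uncompressed occurrence table for a BWT."""
    counts = Counter(bwt)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total  # number of characters smaller than ch
        total += counts[ch]
    # occ[ch][i] is the number of times ch appears in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return c_table, occ


def backward_search(query, c_table, occ, n):
    """Return the half-open suffix-array interval of rows prefixed by query."""
    lo, hi = 0, n
    for ch in reversed(query):  # one step per query character
        if ch not in c_table:
            return 0, 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_exact(query, reference):
    """Return sorted 0-based positions of exact matches of query."""
    sa, bwt = bwt_from_text(reference)
    c_table, occ = build_fm_index(bwt)
    lo, hi = backward_search(query, c_table, occ, len(bwt))
    return sorted(sa[lo:hi])


print(find_exact("ACG", "TACGACGT"))  # [1, 4]
```

The loop in backward_search performs one table lookup per query character regardless of the reference size, which is where the linear dependence on query length comes from. Real aligners add compressed occurrence tables and inexact matching on top of this idea.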
Similar to BWA, Bowtie and Bowtie2 use an FM-index to identify reads matching the reference sequence quickly and within a small memory footprint. Bowtie2 is the more recent version of the algorithm and supports reads up to thousands of bases long, whereas the earlier Bowtie 1 releases are optimal for reads up to 50 bases in length. Bowtie2 also supports gapped alignment and local alignment of reads, and outputs a range of mapping qualities when producing SAM format output, whereas Bowtie 1 would produce SAM output incompatible with some downstream tools. Bowtie2 should be used in preference to Bowtie 1 except where reads shorter than 50bp are being mapped.
RNA-Seq Specific Aligners
TopHat is a wrapper around bowtie and bowtie2 which provides enhancements appropriate for RNA-Seq experiments, where reads are derived from mRNA sequences and consequently may span introns. TopHat does not require prior knowledge of transcript structures: it first maps the reads against the reference genome to identify potential exon locations, uses these to infer potential splice junctions, and then remaps reads directly to these junctions to confirm their locations.
STAR is an aligner specifically intended for use with RNA-Seq reads, supporting spliced alignment of reads up to full-transcript length. STAR carries out its alignments in two phases: seed mappings are first identified using a suffix-array algorithm, and full alignments are then generated by joining these seeds. If genomic annotations are available, they can be used to increase the sensitivity of splice site detection. STAR performs considerably faster than TopHat, although at the expense of memory usage, requiring ~32GB to map against the human genome.
General purpose aligners provide excellent results for SNV/short INDEL identification, for which they report only the best mapping for each read. Analysis of structural variation, however, requires all mappings for a read to be returned. While general purpose aligners can report alignments in this manner, their performance suffers badly as a result. mrsFAST-ultra was developed particularly for this usage. It includes optimisations to distinguish between sequence variants and sequencing errors, referring to a SNP database so that known variations can be accounted for during the alignment process; this allows more variant loci within a read and increases the proportion of reads which can be mapped. When reporting multi-mapping reads it reportedly performs ~6x faster than bowtie2, in addition to being more sensitive.
Methylation of cytosine to form 5-methylcytosine has been established to play an important role in processes such as gene regulation and genomic imprinting, and is a heritable characteristic. Pre-treatment of DNA with bisulphite prior to sequencing converts non-methylated cytosine residues to uracil (which is converted to thymine by subsequent PCR), but does not affect 5-methylcytosine, allowing methylated cytosines to be distinguished from non-methylated cytosines through sequencing. Bismark uses bowtie to align reads against in silico bisulphite-converted versions of the reference sequence, and allows the origin strand of the reads to be identified. Following alignment, it can produce a report on the methylation status of the cytosine bases in the reference.
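The principle of aligning bisulphite-treated reads can be sketched as follows. This is a simplified illustration, not Bismark's implementation: the function names (c_to_t, call_methylation) are hypothetical, only the forward strand is considered, and exact string matching stands in for a real aligner. Both the read and the reference are converted C to T so that they can be matched irrespective of methylation state; the original bases are then compared at each reference cytosine.

```python
def c_to_t(seq):
    """Fully bisulphite-convert a sequence in silico (forward strand only)."""
    return seq.replace("C", "T")


def call_methylation(read, reference):
    """Locate a bisulphite read and call methylation at reference cytosines.

    Returns (position, calls), or None if the converted read does not
    occur exactly in the converted reference.
    """
    pos = c_to_t(reference).find(c_to_t(read))
    if pos < 0:
        return None
    calls = []
    for offset, base in enumerate(read):
        if reference[pos + offset] != "C":
            continue
        if base == "C":
            # Cytosine survived bisulphite treatment, so it was protected.
            calls.append((pos + offset, "methylated"))
        elif base == "T":
            # Cytosine was converted (C to U to T), so it was unprotected.
            calls.append((pos + offset, "unmethylated"))
    return pos, calls


reference = "AACGTCCGATT"
read = "ACGTTCGA"  # derived from reference[1:9]; the C at position 5 was converted
print(call_methylation(read, reference))
# (1, [(2, 'methylated'), (5, 'unmethylated'), (6, 'methylated')])
```

A read from the opposite strand would show G to A changes instead, which is why a practical tool has to consider more than one converted form of the reads and the reference.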
Long Read Alignment
Conventional NGS read aligners are either inefficient at, or incapable of, aligning the extremely long reads generated by PacBio's sequencing technology, and are not sufficiently sensitive to handle the high error rate present in the reads. BLASR is capable of efficiently aligning such reads to reference genomes by carrying out the alignment in a number of stages. First, short exact matches are determined using either a suffix array or a Burrows-Wheeler transform FM-index. A sparse dynamic programming algorithm is then used to carry out a rough alignment around these short exact matches, which is in turn used to guide a detailed alignment, again using dynamic programming.
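The first two stages can be illustrated with a toy seed-and-chain sketch. This is an assumption-laden simplification rather than BLASR's algorithm: the function names (find_anchors, best_chain) are hypothetical, a plain k-mer dictionary stands in for the suffix array or FM-index, and the chaining score simply counts anchors with no gap penalty. The dynamic programming is "sparse" in the sense that it runs over the anchors only, not over every cell of the read-by-reference matrix.

```python
def find_anchors(read, reference, k):
    """Return (read_pos, ref_pos) pairs for every shared exact k-mer."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    anchors = []
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], []):
            anchors.append((j, i))
    return anchors


def best_chain(anchors):
    """Return the longest chain of anchors increasing in both coordinates.

    Quadratic-time dynamic programming over anchors only.
    """
    anchors = sorted(anchors)
    if not anchors:
        return []
    score = [1] * len(anchors)
    prev = [-1] * len(anchors)
    for b in range(len(anchors)):
        for a in range(b):
            colinear = (anchors[a][0] < anchors[b][0]
                        and anchors[a][1] < anchors[b][1])
            if colinear and score[a] + 1 > score[b]:
                score[b] = score[a] + 1
                prev[b] = a
    # Trace back from the best-scoring anchor.
    end = max(range(len(anchors)), key=score.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return chain[::-1]


# The anchor (3, 2) is inconsistent with the others and is left out.
print(best_chain([(0, 10), (5, 15), (3, 2), (9, 20)]))
# [(0, 10), (5, 15), (9, 20)]
```

The resulting chain marks out a narrow region of the reference in which the final, detailed dynamic programming alignment can be carried out, avoiding a full comparison of a long, error-prone read against the whole genome.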