Sequence QC Software
Many tools are available for assessing and manipulating sequence files to ensure they meet the desired quality criteria. These typically allow:
- Assessment of a number of quality metrics, frequently with graphical outputs
- Trimming of bases/reads falling below required quality thresholds
- Trimming of residual adapter/linker sequences from sequence reads
- A combination of the above.
The packages discussed below are amongst the best for carrying out these tasks, and are all available on codon.
One of the most versatile tools for sequence quality assessment is FastQC, which can be run either as a command-line tool or through a graphical interface. FastQC can assess multiple fastq files in the same run, making bulk analysis easy, and produces a separate report for each fastq file. Each metric assessed is assigned a rating of 'pass', 'warning' or 'fail'. These thresholds are somewhat arbitrary (although they can be modified at runtime), and there can be valid reasons for perfectly good sequence to receive a 'warning' or 'fail' assessment from FastQC. Key criteria assessed include:
- Per-base sequence quality
- Per-sequence GC content
- Sequence duplication levels
- Adapter content
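To make the first of these metrics concrete, the following is a minimal Python sketch of how a mean per-base quality profile can be computed from fastq records. It is not FastQC's implementation; the function names are illustrative, and it assumes simple four-line fastq records with Phred+33 encoded qualities.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record fastq stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a trailing blank line)
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header.strip()[1:], seq, qual


def per_base_mean_quality(records, offset=33):
    """Return the mean Phred score at each read position.

    Assumes Phred+33 encoding; reads may differ in length.
    """
    totals, counts = [], []
    for _name, _seq, qual in records:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two toy reads: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
example = StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII##\n")
means = per_base_mean_quality(read_fastq(example))
```

A drop in the mean towards the 3' end of the profile, as in the toy data here, is the pattern that quality trimming is intended to address.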
An alternative to FastQC is PRINSEQ, which assesses many similar metrics but can also assess some additional factors. Specifically, it can provide a more detailed assessment of duplicated sequences, including reads which are identical to the 5' or 3' end of a longer sequence (applicable to e.g. 454/IonTorrent data). PRINSEQ can also carry out an assessment of sequence complexity.
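The distinction between exact duplicates and 5' duplicates can be illustrated with a short Python sketch. This is only an illustration of the idea, not PRINSEQ's own algorithm, and the function name is hypothetical. It counts reads that repeat an earlier read exactly, and distinct reads that are identical to the start of a longer read.

```python
def count_duplicates(seqs):
    """Return (exact_duplicates, five_prime_duplicates) for a list of reads.

    exact_duplicates: reads identical to a read seen earlier.
    five_prime_duplicates: distinct reads that are a proper prefix
    (i.e. match the 5' end) of a longer distinct read.
    """
    seen = set()
    exact = 0
    for s in seqs:
        if s in seen:
            exact += 1
        else:
            seen.add(s)

    # After sorting, any read that is a prefix of another read is
    # immediately followed by a read that starts with it.
    unique = sorted(seen)
    prefix = sum(1 for a, b in zip(unique, unique[1:]) if b.startswith(a))
    return exact, prefix


reads = ["ACGT", "ACGT", "ACG", "ACGTTT", "GGG"]
exact, prefix = count_duplicates(reads)
```

In this toy input there is one exact duplicate (the second "ACGT") and two 5' duplicates ("ACG" and "ACGT", each matching the start of a longer read). 3' duplicates could be found in the same way by reversing each read first.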
Tools for trimming sequence reads may be targeted at quality trimming, removal of residual sequencing adapter sequence or a combination of both quality and adapter trimming.
Sickle is a quality trimmer which uses an adaptive window, sized according to the length of the read, to trim the read from the point where the average score within the window falls below a specified threshold. A minimum read length can also be defined, and reads which fall below this length after trimming are discarded. Sickle supports paired reads: if one read of a pair is discarded, the second read of the pair is written to a separate file of singleton reads so that the correct pairing of the reads in the fastq files is not disrupted.
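The windowed trimming and singleton handling described above can be sketched in Python as follows. This is a simplified illustration rather than Sickle's actual code: it trims only the 3' end, and the window fraction (10% of the read length), the default thresholds and the function names are assumptions made for the example.

```python
def window_trim(seq, qual, threshold=20, min_length=20,
                offset=33, window_fraction=0.1):
    """Cut a read at the first window whose mean quality is below threshold.

    Returns (seq, qual) after trimming, or None if the trimmed read is
    shorter than min_length. Assumes Phred+33 qualities; the window size
    scales with read length (window_fraction is an illustrative choice).
    """
    scores = [ord(c) - offset for c in qual]
    window = max(1, int(len(scores) * window_fraction))
    cut = len(scores)
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < threshold:
            cut = start  # trim from the start of the failing window
            break
    if cut < min_length:
        return None
    return seq[:cut], qual[:cut]


def trim_pair(read1, read2, **kwargs):
    """Trim both mates and report where the survivors should be written.

    Returns ("paired", [r1, r2]), ("single", [survivor]) or ("discarded", []).
    """
    trimmed = [window_trim(*read1, **kwargs), window_trim(*read2, **kwargs)]
    survivors = [t for t in trimmed if t is not None]
    status = {2: "paired", 1: "single", 0: "discarded"}[len(survivors)]
    return status, survivors


good = ("A" * 30, "I" * 25 + "#" * 5)   # quality drops in the last 5 bases
bad = ("C" * 30, "#" * 30)              # low quality throughout
status, survivors = trim_pair(good, bad)
```

Here the second mate is discarded, so the surviving first mate would be written to the singletons file rather than to the paired output.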
Cutadapt can remove a range of adapter sequences, including 5' or 3' adapters which may be anchored to the ends of reads, and supports paired-read trimming and the removal of multiple adapters from reads. Removal of low-quality sequence from the end of reads and of regions consisting of Ns is also supported.
Trimmomatic combines quality and adapter trimming, including support for paired reads, and is targeted at Illumina sequences. Quality trimming is through a sliding window algorithm, cutting the read where the average score falls below a specified threshold, and can trim both the 5' and 3' ends of the read.
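As an illustration of 3' adapter removal, the sketch below scans a read for the adapter, allowing a partial adapter at the very end of the read and a small proportion of mismatches. It is not Cutadapt's algorithm (which is alignment-based and also handles insertions and deletions); the function name, the default parameters and the example adapter sequence are assumptions chosen for the example.

```python
def trim_3prime_adapter(seq, adapter, min_overlap=3, max_mismatch_rate=0.1):
    """Remove a 3' adapter (and everything after it) from a read.

    The adapter may be truncated by the end of the read, provided at
    least min_overlap bases overlap. Only substitutions are tolerated.
    """
    for start in range(len(seq) - min_overlap + 1):
        overlap = min(len(adapter), len(seq) - start)
        mismatches = sum(
            1 for a, b in zip(seq[start:start + overlap], adapter) if a != b
        )
        if mismatches <= int(overlap * max_mismatch_rate):
            return seq[:start]  # keep only the sequence before the adapter
    return seq


adapter = "AGATCGGAAGAGC"  # example adapter sequence, used here for illustration
full = trim_3prime_adapter("ACGTACGTAC" + adapter, adapter)
partial = trim_3prime_adapter("ACGTACGTAC" + "AGATC", adapter)
untouched = trim_3prime_adapter("ACGTACGTACGT", adapter)
```

The minimum overlap is a trade-off: a small value removes more partial adapters but also trims more reads whose last few bases match the adapter by chance.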
Trim Galore! is a script which combines quality and adapter trimming using Cutadapt with quality assessment through FastQC, providing an easy-to-use, single port of call for sequence preprocessing. Note that FastQC is run after trimming, so it reports on the quality-trimmed data. We also recommend running FastQC on the unprocessed reads to ensure there is a good understanding of any quality issues affecting a dataset before proceeding with analysis.
There is limited software available to help identify contamination of a sequencing library; however, the following tools may be useful.
FastQ Screen allows your reads to be aligned against a series of sequence databases to determine the likely source of the sequences in the library. These sequence databases can include organism-specific databases, E. coli, vector sequences and sequencing controls such as PhiX. It is necessary to first create bowtie or bowtie2 indices for each sequence database, then set up a configuration file defining the databases to be searched. A summary and histogram of the composition of the library, based upon these alignments, is produced.
While not strictly a contamination screen, CookieCutter can be used to separate sequences matching a provided reference fasta file from a dataset. This can be useful when carrying out studies such as pathogen sequencing, where there is contamination of the sequencing library with reads from the host genome.
The GC content of a sequence library can provide evidence of contamination or the presence of sequence from multiple organisms. A GC plot created with, for example, FastQC would be expected to show a smooth, unimodal distribution. The existence of shoulders, or in more extreme cases a bimodal distribution, could be indicative of the presence of sequence reads from an organism with a different GC content.
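The idea of checking a GC distribution for more than one mode can be sketched in Python as follows. This is a rough heuristic for illustration only and is not how FastQC evaluates its GC module; the function names, the bin width and the peak threshold are arbitrary assumptions, and real data would usually need smoothing before peaks are counted.

```python
def gc_percent(seq):
    """Return the GC percentage of a read, ignoring N and other ambiguous bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(seqs, bin_width=5):
    """Count reads per GC bin (bin 0 covers 0 to <5%, and so on)."""
    bins = [0] * (100 // bin_width + 1)
    for s in seqs:
        gc = gc_percent(s)
        if gc is not None:
            bins[int(gc // bin_width)] += 1
    return bins


def count_peaks(bins, min_fraction=0.05):
    """Count local maxima holding at least min_fraction of all reads."""
    total = sum(bins)
    padded = [0] + list(bins) + [0]
    peaks = 0
    for i in range(1, len(padded) - 1):
        is_peak = padded[i] > padded[i - 1] and padded[i] >= padded[i + 1]
        if is_peak and padded[i] >= min_fraction * total:
            peaks += 1
    return peaks


# Toy library: reads at 20% GC mixed with reads at 80% GC.
library = ["AATTAATTGC"] * 10 + ["GGCCGGCCAT"] * 10
peaks = count_peaks(gc_histogram(library))
```

More than one peak, as in this toy mixture, would be a prompt to investigate further (for example with FastQ Screen) rather than proof of contamination.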