De-novo assembly is the process of assembling sequencing reads into contiguous sections of sequence without reference to existing assembled data. While this was a common process with traditional Sanger sequencing data, the move to short-read high-throughput sequencing technologies posed much greater computational challenges and necessitated the development of new assembly algorithms capable of handling large numbers of short reads. As the sequencing platforms have matured, read lengths have increased, leading to increased contiguity of assemblies, and to further changes in the algorithms to cope with continuing technological advancements.
Assembly algorithm types
Two major classes of assembly algorithm are in use today, which vary in their applicability according to the kind of sequencing data available.
Overlap-layout-consensus (OLC)/string graph assemblers
The overlap-layout-consensus (OLC) class of assemblers initially identifies overlaps between pairs of reads, which are used to build a graph of the relationships between the reads. This stage is very computationally intensive, with the complexity increasing with the number of reads involved in the assembly, making the approach untenable for outputs from Illumina sequencers, for example, where an assembly would involve millions of short reads. Once a graph is constructed, obtaining an assembly requires finding a path through the graph which visits each node once and only once, termed a Hamiltonian path. Determining such a path is far from straightforward, hence the 'layout' stage of the process works to simplify the initial graph, condensing regions which unambiguously originate from the same genomic locus (i.e. a sub-graph of nodes with many related connections) up to the point where the path diverges into multiple possible paths (indicating the end of a non-repetitive sequence). These sub-graphs give rise to contig sequences, which may be classed as unitigs (which can be unambiguously assembled) or repeat contigs (which are connected to an unusually high number of other contigs, or have an excessively high coverage of sequence reads). Unitigs are then associated with other unitigs into scaffolds, generally on the basis of read-pair relationships between them. The final 'consensus' stage involves walking the contig sub-graphs and deriving the consensus sequence of the reads involved in each. String graph assemblers differ slightly in that they simplify the initial overlap graph through the removal of transitive edges, which contain redundant information.
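The overlap stage can be illustrated with a minimal sketch. The `overlap` and `overlap_graph` helpers below are hypothetical: real assemblers use indexing structures and tolerate mismatches, rather than this naive all-against-all exact-match scan, which is precisely the quadratic cost described above.

```python
from itertools import permutations

def overlap(a, b, min_length=3):
    """Return the length of the longest suffix of `a` matching a
    prefix of `b` (at least `min_length` bases), or 0 if none."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def overlap_graph(reads, min_length=3):
    """Build an overlap graph: edge (a, b) labelled with the overlap
    length. The all-against-all comparison is O(n^2) in the number of
    reads -- the cost that makes OLC untenable for millions of reads."""
    return {(a, b): olen
            for a, b in permutations(reads, 2)
            if (olen := overlap(a, b, min_length))}

reads = ["TTACGT", "ACGTGC", "GTGCAA"]
print(overlap_graph(reads))
# {('TTACGT', 'ACGTGC'): 4, ('ACGTGC', 'GTGCAA'): 4}
```

The two edges found here form an unambiguous chain, i.e. the kind of sub-graph that the layout stage would condense into a single unitig.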
de-Bruijn graph assemblers
de-Bruijn graph assemblers break reads down into sub-sequences of length k (termed k-mers) which, similarly to OLC assemblers, are used to construct a graph; in this case, however, the nodes in the graph represent k-mers and the edges connect adjacent k-mers which overlap by k-1 bases. The reads are not directly represented in the graph, but are implied by paths through it. Since this structure is based upon exact identity between k-mers, sequencing errors cause divergent paths (i.e. bubbles) in the graph which reduce the lengths of the paths that can be extracted. Assemblers typically track the k-mer coverage of each node, permitting the graph to be cleaned by removing low-coverage bubbles and 'tips'. de-Bruijn graph based assemblers are typically very memory-intensive, although various techniques are being employed to reduce their memory usage. They also have the downside that, since the graph is based on sub-strings of the reads, contiguity inherent in the original sequence is lost during the assembly process. Some assemblers will use the full length of the read to validate paths through the graph.
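A minimal sketch of de-Bruijn graph construction and coverage-based cleaning follows. The `de_bruijn` and `drop_tips` helpers and the `min_cov` threshold are illustrative assumptions, not any particular assembler's implementation; real assemblers also handle reverse complements and use compact k-mer encodings.

```python
from collections import Counter, defaultdict

def de_bruijn(reads, k):
    """Build a de-Bruijn graph: nodes are k-mers, an edge joins two
    k-mers that overlap by k-1 bases; coverage supports later cleaning."""
    coverage = Counter()
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            coverage[kmer] += 1
            if i:
                edges[read[i - 1:i - 1 + k]].add(kmer)
    return edges, coverage

def drop_tips(edges, coverage, min_cov=2):
    """Remove low-coverage nodes (tips/bubbles typically caused by
    sequencing errors) before extracting paths."""
    bad = {kmer for kmer, c in coverage.items() if c < min_cov}
    return {u: {v for v in vs if v not in bad}
            for u, vs in edges.items() if u not in bad}

# Two identical reads plus one carrying a substitution 'error':
reads = ["ACGTC", "ACGTC", "ACGAC"]
edges, cov = de_bruijn(reads, k=3)
```

Here the erroneous read creates a divergent branch (ACG -> CGA) supported by only one k-mer, which `drop_tips` removes, restoring a single unambiguous path.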
Choice of assembler
The choice of assembler for a project should be based upon a number of factors.
- Sequencing technology: short reads (<100 bp) are really only appropriate for de-Bruijn graph assemblers. Illumina MiSeq instruments are now generating reads of sufficient length that use of OLC-type assemblers is becoming an option.
- Genome size and complexity: all assemblers are capable of assembling simple prokaryotic genomes (with varying degrees of success), however some are not capable of assembling larger genomes, which may be due to, for example, excessive memory requirements or difficulties in handling heterozygous polyploid genomes.
- Source of sequencing data: de-novo assembly was initially targeted at genomic sequences, but in recent years methods have been developed for de-novo assembly of cDNA/transcriptome sequences, and also for metagenomic analysis. Both of these pose their own challenges.
Genomic OLC assemblers
The Celera assembler was one of the first OLC assemblers, and has continued to evolve with the addition of new algorithms targeted at Roche 454 sequences (CABOG) and PacBio long reads (PBcR). It was used in the early assemblies of Drosophila and in Celera's private human genome initiative. It is ideal for use on assemblies incorporating traditional Sanger sequence reads as well as 'intermediate' length next-generation sequence reads such as 454, and is one of the few options currently available for long-molecule assembly.
MIRA is capable of performing assemblies from a wide range of sequence types, including Sanger, 454, IonTorrent, Illumina and PacBio (although it is currently restricted to the shorter PacBio CCS reads, or long CLR reads which have already been error corrected). It is intended to provide assemblies requiring minimal downstream finishing. MIRA is capable of producing extremely good assemblies, however it has an extremely large number of run-time parameters to allow tuning of the assembly process, and while these allow improvements to assemblies to be made, they can also have a considerably detrimental effect if not used with caution.
The string graph assembler (SGA) uses a modified approach to conventional OLC assembly, as described in the 'Assembly algorithm types' section above. It makes use of an FM-index to greatly accelerate the initial identification of read overlaps, making the OLC approach more tenable for assemblies consisting of large numbers of reads. It has considerably lower memory overheads than a de-Bruijn graph based assembler. It has been a consistently good performer in the Assemblathon comparisons, and is appropriate for longer high-throughput sequencing reads, i.e. >200 bp.
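The transitive edge removal that turns an overlap graph into a string graph can be sketched as below. This is a deliberately simplified, hypothetical `transitive_reduction` that ignores overlap lengths and read orientations, both of which a real implementation such as SGA must track.

```python
def transitive_reduction(edges):
    """Drop transitive edges: if A->B and B->C exist, a direct edge
    A->C carries only redundant information and can be removed."""
    reduced = {u: set(vs) for u, vs in edges.items()}
    for u, vs in edges.items():
        for v in vs:
            for w in edges.get(v, ()):
                reduced[u].discard(w)
    return reduced

# Reads A, B, C overlap in a chain; A also overlaps C directly.
g = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(transitive_reduction(g))
# {'A': {'B'}, 'B': {'C'}, 'C': set()}
```

After reduction only the chain A -> B -> C remains, which is the unambiguous path a string graph assembler would emit as a contig.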
de-Bruijn genome assemblers
Velvet came to prominence as the first assembler capable of producing assemblies from very short, early next-generation sequence reads (e.g. 25 bp), but also gained algorithms to handle longer (e.g. 454) reads to scaffold contig sequences together. It is an extremely popular assembler for certain usages (e.g. prokaryotic assembly pipelines), however it has extremely high memory requirements. It has been somewhat superseded in recent years by newer algorithms which improve on various aspects of the assembly process.
The SPAdes assembler was designed with single-cell and prokaryotic sequences in mind, and is designed to cope with uneven coverage in the input data. It incorporates an initial read error correction phase to reduce the sequencing errors present in the input reads, before building a de-Bruijn graph which utilises multiple sizes of k-mer to overcome the limitations of using a single size of k. In benchmarks it has been demonstrated to produce some of the highest-quality contigs of available assemblers, while producing similar or better contig N50 values.
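The trade-off that motivates multiple k-mer sizes can be demonstrated with a small sketch: a short repeat shared between two contexts creates a branch at small k but is resolved at larger k (the `branch_nodes` helper is illustrative, not SPAdes' actual multisized graph construction).

```python
from collections import defaultdict

def branch_nodes(reads, k):
    """Count k-mer nodes with more than one outgoing edge. Small k
    keeps the graph connected at low coverage but leaves repeats
    ambiguous; larger k resolves them at the cost of needing longer
    exact matches."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            edges[read[i:i + k]].add(read[i + 1:i + 1 + k])
    return sum(1 for vs in edges.values() if len(vs) > 1)

# The 4 bp sequence "ACGT" recurs in two different contexts:
reads = ["TTACGTGG", "CCACGTAA"]
print(branch_nodes(reads, 3))  # 1 ambiguous branch at k=3
print(branch_nodes(reads, 5))  # 0 branches at k=5
```

A multi-k assembler effectively gets the benefit of both graphs, using small-k connectivity in low-coverage regions and large-k specificity around repeats.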
The SOAPdenovo assembler is appropriate for larger, more complex genome assemblies, and has been used for a number of high-profile assembly projects, including the first mammalian genome assembled entirely from short reads. It is more memory efficient than some other assemblers, requiring ~150 GB of RAM to assemble a human genome, and incorporates algorithms to help with error correction.
ABySS is also capable of assembling mammalian-sized genomes from short reads, however it has been built around MPI parallelisation to permit large assemblies to be carried out utilising a number of machines simultaneously. It can make use of paired k-mers, consisting of two k-mers separated by a fixed distance, which are functionally equivalent to a single large k-mer spanning the length of the k-mer pair. Some benchmarks have indicated that it produces assemblies with fewer errors than SOAPdenovo, however this must not be taken as a hard rule.
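The paired k-mer idea can be sketched as follows; the `paired_kmers` helper is a hypothetical illustration of the concept rather than ABySS's implementation, which operates on read pairs and a compacted graph.

```python
def paired_kmers(read, k, gap):
    """Extract paired k-mers: two k-mers separated by `gap` bases.
    Each pair constrains the assembly like a single (2k + gap)-mer,
    but requires only k exact bases at each end, so it stays robust
    to errors in the gap and avoids indexing very long k-mers."""
    span = 2 * k + gap
    return [(read[i:i + k], read[i + k + gap:i + span])
            for i in range(len(read) - span + 1)]

print(paired_kmers("ACGTACGTAC", k=3, gap=2))
# [('ACG', 'CGT'), ('CGT', 'GTA'), ('GTA', 'TAC')]
```

Because each pair spans 2k + gap bases of the original molecule, paired k-mers can bridge repeats longer than k that would otherwise collapse in the graph.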
Transcriptome assemblers
Reconstruction of transcripts from cDNA sequencing can be a useful tool in assisting with determining the gene complement of an organism without carrying out a genome assembly, and can also provide additional data to support genomic assemblies, such as indications of splice variants. It must be borne in mind that only genes being actively expressed during sampling will be represented in the sequence data; consequently, if a representation of the full transcript repertoire is required then many libraries must be sequenced from different tissues and experimental conditions. Transcriptome assembly faces additional challenges to genome assembly, since the latter typically assumes that uniform coverage of a genome is represented within a sequencing library, and uses read coverage as an indication of repetitive contigs; in a transcriptome, however, the coverage will be variable and correlated with the expression of the transcript.
The Trinity RNASEQ package provides a pipeline for de-novo transcript reconstruction from Illumina sequences. It can be run as a purely de-novo system, or be provided with an existing genome assembly which can be used to guide the transcript assembly process. It consists of three phases (hence its name): the first, Inchworm, assembles unique cDNA transcripts. These are then clustered during the second phase, Chrysalis, which constructs de-Bruijn graphs for each cluster and distributes the sequence reads across the various graphs. Finally, the Butterfly stage walks the de-Bruijn graphs to output the resulting transcript sequences.
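The Inchworm-style first phase can be caricatured as greedy extension through a k-mer count table; the `greedy_extend` helper below is a loose sketch under that assumption (real Inchworm also extends leftwards, seeds from the most abundant k-mer, and removes used k-mers globally).

```python
def greedy_extend(seed, kmer_counts, k):
    """Greedily extend a seed rightwards by repeatedly appending the
    base whose resulting k-mer has the highest remaining count,
    consuming each k-mer as it is used."""
    counts = dict(kmer_counts)
    contig = seed
    while True:
        suffix = contig[-(k - 1):]
        best = max((counts.get(suffix + b, 0), b) for b in "ACGT")
        if best[0] == 0:          # no supported extension remains
            return contig
        contig += best[1]
        counts[suffix + best[1]] = 0   # consume the k-mer

# Hypothetical 3-mer counts from a small read set:
counts = {"ACG": 2, "CGT": 2, "GTT": 1, "TTG": 1}
print(greedy_extend("ACG", counts, 3))  # extends to "ACGTTG"
```

Picking the highest-count extension at each step is what makes the phase robust to the uneven coverage of transcript libraries: abundant transcripts dominate the count table and are reconstructed first.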
Trans-ABySS is a modification of the ABySS de-Bruijn graph assembler designed to cope with the varied coverage inherent in transcript libraries. It carries out assemblies with multiple k-mer sizes, in both single-ended and paired modes, then merges the resulting contig sequences by aligning the contigs from each assembly in an all-against-all manner to identify reciprocal best hits. It is also capable of producing quantification of expression levels based on alignment to a reference genome.
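Reciprocal best hit identification can be sketched with a hypothetical `reciprocal_best_hits` helper operating on precomputed alignment scores (Trans-ABySS itself derives these from contig alignments, not a score table).

```python
def reciprocal_best_hits(scores):
    """Given alignment scores[(a, b)] between contigs of two
    assemblies, keep only pairs where each contig is the other's
    single best-scoring match."""
    def best(x, axis):
        hits = {k[1 - axis]: s for k, s in scores.items() if k[axis] == x}
        return max(hits, key=hits.get) if hits else None
    return {(a, b) for (a, b) in scores
            if best(a, 0) == b and best(b, 1) == a}

# a* contigs from one k-mer assembly, b* contigs from another:
scores = {("a1", "b1"): 90, ("a1", "b2"): 50,
          ("a2", "b2"): 80, ("a2", "b1"): 95}
print(reciprocal_best_hits(scores))
# {('a2', 'b1')}
```

Requiring the relationship to hold in both directions filters out one-sided matches caused by fragmented or chimeric contigs in either assembly.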
Metagenome assemblers
Assembly of metagenomic sequences, derived from sequencing of environmental samples consisting of a large number of unknown organisms, poses additional difficulties over genome assembly. Firstly, as is the case with transcriptome assembly, the assumption of uniform coverage made by genome assemblers is invalid, since the coverage of each genome will be dependent upon the abundance of that organism within the sample. Secondly, association of similar sequences from different organisms gives a very high chance of producing chimeric sequences.
IDBA-UD is a de-Bruijn graph based assembler designed to avoid the problems associated with variable depth in a metagenome population by iterating through various values of k, removing short and low-coverage contigs on each iteration. Paired reads are then aligned against the contigs to improve low-coverage regions. An assessment of metagenome assemblers by the BSS has indicated that IDBA-UD produced one of the lowest numbers of chimeric contigs of the assessed tools, while having significantly better contig N50 sizes.
MetaVelvet is an extension of the Velvet assembler which separates the initial de-Bruijn graph into separate sub-graphs, then builds sequences from these in isolation. While producing longer contigs than stand-alone Velvet with metagenome data, in our hands it produced considerably shorter contig N50 values than IDBA-UD.