Genome assembly primer


Our genetic heritage, as well as that of all living organisms, is encoded in a set of DNA molecules called chromosomes.  Each such molecule can be represented as a string of just four letters: A, C, T, G; representing the four basic molecules (called nucleotides or bases) that compose the DNA.   In most organisms DNA is structured as two molecules twisted around each other into a double-helix and held together by bonds between complementary nucleotides A-T and C-G.   The two molecules (also called strands) have a natural orientation defined by their chemical structure and are twisted together in opposite orientations.  Both strands, thus, contain the same information and the sequence of one strand can be obtained from the sequence of the other strand by reverse complementation, namely by reversing it's sequence and then replacing each nucleotide with its complement (replacing each A with a T, each G with a C and so on).  This characteristic of DNA is essential to life as it allows the replication of the DNA molecule -- an important element of our ability to pass genetic information to our offspring. 


Table of Contents



Shotgun sequencing and assembly


The process through which scientists decode the DNA sequence of an organism is called sequencing.  In 1975 Frederick Sanger developed the basic sequencing technology that is still widely used today.  While this technology has been continuously improved over the past 30 years, we can only decode between 1,000 and 2,000 base-pairs of DNA at a time -- a significant limitation given that even the simplest viruses contain tens of thousands of base-pairs, bacteria contain millions, and mammalian genomes contain billions of base-pairs.  To overcome this limitation, scientists have developed a technique called shotgun sequencing whereby the  DNA sequence of an organism is sheared into a large number of small fragments (Figure 1), the ends of the fragments are sequenced (Figure 2), then the resulting sequences are joined together using a computer program called an assembler (Figure 3).

shotgun fragments
Figure 1. Original DNA is broken into a collection of fragments

sequenced shotgun fragments
Figure 2.  The ends of each fragment (drawn in green) are sequenced

assembled shotgun reads
Figure 3.  The sequence reads are assembled together based on sequence similarity


Assembly statistics


The assembler relies on the basic assumption that two sequence reads (two strings of letters produced by the sequencing machine) that share a same string of letters originated from the same place in the genome (Figure 3).  Using such overlaps between the sequences, the assembler can join the sequences together in a manner similar to solving a jigsaw puzzle.  It is important to note that the shotgun sequencing process is inherently "wasteful" as, due to the randomness of the shearing process, assembly is only possible once enough sequences are generated to cover the genome 8 to 10 times.  Intuitively, this phenomenon can be understood by thinking of a sidewalk as it begins to rain.  As raindrops fall randomly across the sidewalk, dry spots persist for quite a while, corresponding to regions of the genome that are not represented in the set of shotgun reads.  Mathematically, this phenomenon was modeled by Eric Lander and Michael Waterman in 1988.  They examined the correlation between the oversampling of the genome (also called coverage) and the number of contiguous pieces of DNA (commonly called contigs) that can be re-constructed by an idealized assembly program.  Figure 4 shows a plot of the Lander-Waterman equation for a genome of 1Mbp (mega base pairs = 1,000,000 base pairs).  Between 8 and 10-fold coverage the model predicts that most of the genome will be assembled into a small number of contigs (approx. 5 for a 1Mbp genome). 

lander waterman plot
Figure 4.  Lander-Waterman estimation of number of contigs w.r.t. genome coverage


Assembly challenges: non-random shearing and repeats


Ideally, an assembly program should produce one contig for every chromosome of the genome being sequenced.  In all but the simplest cases, however, many contigs are produced due to a combination of factors.  Even at 8-10 fold coverage, there is a non-zero probability that some portion of the genome remains unsequenced.  More importantly, however, the distribution of the sheared fragments along the genome cannot be modeled as a perfect Poisson process.  Sanger sequencing requires many copies of each fragment in order for the sequencing chemistry to be possible.  Each shotgun fragment must be cloned, a procedure usually performed by inserting the fragment into the cell of the Escherichia coli bacterium (called a vector) and allowing this bacterium to grow, thereby replicating the fragment as E. coli replicates its own genome.  In most genomes, certain regions are toxic to the E. coli bacterium, leading to the presence of gaps in the coverage.

The ability of an assembly program to produce a single contig is also limited by regions of the genome that occur in multiple near-identical copies throughout the genome (repeats).  The reads originating from different copies of a repeat appear identical to the assembler and cause assembly errors.  A simple example is shown in Figure 5, where the assembler incorrectly collapses the two copies of repeat A leading to the creation of two contigs instead of one (Figure 6).

repeats

Figure 5. Two copies of a repeat along a genome.  The reads colored in red and those colored in yellow appear identical to the assembly program.

collapsed repeat

Figure 6.  Genome mis-assembled due to a repeat.  The assembly program incorrectly combined the reads from the two copies of the repeat leading to the creation of two separate contigs


Scaffolding


The contigs produced by an assembly program can be ordered and oriented along a chromosome using additional information contained in the shotgun data.  In most sequencing projects, the sizes of the fragments generated through the shotgun process are carefully controlled, thus providing a link between the sequence reads generated from the ends of a same fragment (called paired ends or mate pairs).  In a typical shotgun project, multiple libraries -- collections of fragments of similar sizes -- are usually generated, providing the assembler with additional constraints: within the assembly the paired end reads must be placed at a distance consistent with the size of the library from which they originate and must be oriented towards each other.  Within an assembly each read is assigned an orientation corresponding to the DNA strand from which the read was generated.  The constraints provided by mate pairs lead to constraints on the relative order and orientation of the contigs (Figure 7).  The process through which the read pairing information is used to order and orient the contigs along a chromosome is called scaffolding


scaffold

Figure 7.  A scaffold of 3 contigs (the thick arrows) held together by mate pairs.  Thin lines connect the paired ends.


Finishing


The ultimate goal of any sequencing project is to determine every single base-pair of the original set of chromosomes.  As we described above, rarely is an assembly program able to reconstruct a single piece of DNA per chromosome, leading to gaps in the reconstruction of the genome.  These gaps are filled in through directed sequencing experiments in a process called finishing or gap closure.  At this stage in the sequencing project, additional laboratory experiments and extensive manual curation are performed to validate the correctness of the final assembly, leading to a high-quality reconstruction of the original genome.


Assembly algorithms


The many assembly programs available to researchers differ in the details of their implementation and of the algorithms employed, however they all primarily fall into four general categories:

Greedy assemblers
- The first assembly programs followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other.  An example is shown in Figure 8, where the assembler joins, in order,  reads 1 and 2 (overlap = 200 bp), then reads 3 and 4 (overlap  = 150 bp), then reads 2 and 3 (overlap = 50 bp) thereby creating a single contig from the four reads provided in the input.  One disadvantage of the simple greedy approach is that because local information is considered at each step, the assembler can be easily confused by complex repeats, leading to mis-assemblies.

greedy assembly

Figure 8.  Greedy assembly of four reads.

Overlap-layout-consensus - The relationships between the reads provided to an assembler can be represented as a graph, where the nodes represent each of the reads and an edge connects two nodes if the corresponding reads overlap.  The assembly problem thus becomes the problem of identifying a path through the graph that contains all the nodes - a Hamiltonian path (Figure 9).  This formulation allows researchers to use techniques developed in the field of graph theory in order to solve the assembly problem.  An assembler following this paradigm starts with an overlap stage during which all overlaps between the reads are computed and the graph structure is computed.  In a layout stage, the graph is simplified by removing redundant information.  Graph algorithms are then used to determine a layout (relative placement) of the reads along the genome.  In a final consensus stage, the assembler builds an alignment of all the reads covering the genome and infers, as a consensus of the aligned reads, the original sequence of the genome being assembled.

overlap layout consensus assembly
Figure 9.  Overlap graph for a bacterial genome.  The thick edges in the picture on the left (a Hamiltonian cycle) correspond to the correct layout of the reads along the genome (figure on the right).  The remaining edges represent false overlaps induced by repeats (exemplified by the red lines in the figure on the right)

Eulerian path  -  Eulerian path approaches are based on early attempts to sequence genomes through a technique called sequencing by hybridization.  In this technique, instead of generating a set of reads, scientists identified all strings of length k (k-mers) contained in the original genome.   While this experimental method did not produce a viable alternative to Sanger sequencing, it led to the development of an elegant approach to sequence assembly.  This approach, also based on a graph-theoretic model, breaks up each read into a collection of overlapping k-mers.  Each k-mer is represented in a graph as an edge connecting two nodes corresponding to its k-1 bp prefix and suffix respectively.  It is easy to see that, in the graph containing the information obtain from all the reads, a solution to the assembly problem corresponds to a path in the graph that uses all the edges - an Eulerian path.  One advantage of the Eulerian approach is that repeats are immediately recognizable while in an overlap graph they are more difficult to identify. (figures to be added)

Align-layout-consensus -  As more and more genomes become available in public databases, it is increasingly the case that a completed genome exists that is closely related to the genome being assembled.  The assembly problem thus becomes easier as the relative placement of reads can be inferred from their alignment to the related genome (or reference), in a process called comparative assembly.  Thus, the overlap stage of assembly (often one of the most computationally intensive assembly tasks) is replaced by an alignment step.  The layout stage is also greatly simplified due to the additional constraints provided by the alignment to the reference.

BAC-by-BAC (hierarchical) sequencing - In order to avoid some of the complexity involved in assembling large genomes, scientists developed a hierarchical approach.  First, the genome is broken up into a collection of large fragments (between 40 and 200 kbp) called Bacterial Artificial Chromosomes or BACs.  The BACs location along the genome is then mapped using specialized laboratory experiments.  A minimal tiling path of BACs is chosen such that each base in the genome is covered by at least one BAC, and the overlap between BACs is minimized.  Each BAC is then sequenced through the standard shotgun method, the resulting assemblies being combined into an assembly for each chromosome using the information provided by the tiling paths (Figure 10).


bac by bac assembly

Figure 10. BAC-by-BAC approach.  The long lines represent individual BACs.  The minimal tiling path is represented by thick lines.  Each BAC in the tiling path is then sequenced through the shotgun method.

Notable Assembly Programs

  • Allpaths-LG - Probably the best assembler for large genomes. Requires specific 'recipe' comprising multiple library sizes
  • Abyss - Popular assembler for large genomes. Versions for transcriptome analysis also available
  • Velvet - One of the first widely-used implementations of the deBruijn graph approach
  • Spades - Among the best assemblers for bacterial data, particularly for uneven coverage datasets
  • Masurca - Assembler developed at UMD extending upon Celera Assembler. Among the best performing assemblers
  • metAmos - Assembly and analysis package for metagenomic data, integrates most other assemblers
  • iMetAmos - MetAmos workflow for isolate genome assembly. Automatically selects best assembler and assembly parameters
  • Valet - validation pipeline for genome and metagenome assembly

Key Publications