Genome sequence assembly primer
Our genetic heritage, as well as that of all living organisms, is
encoded in a set of DNA molecules called
chromosomes. Each such molecule can be represented as a string of
just four letters: A, C, T, G; representing the four basic molecules
(called nucleotides or bases) that compose the
DNA. In most organisms DNA is structured as two molecules
twisted around each other into a double-helix and held together
by bonds between complementary nucleotides A-T and C-G. The
two molecules (also called strands)
have a natural orientation defined by their chemical structure and are
twisted together in opposite orientations. Both strands, thus,
contain the same information and the sequence of one strand can be
obtained from the sequence of the other strand by reverse complementation, namely by
reversing it's sequence and then replacing each nucleotide with its
complement (replacing each A with a T, each G with a C and so
on). This characteristic of DNA is essential to life as it allows
the replication of the DNA molecule -- an important element of our
ability to pass genetic information to our offspring.
Table of Contents
Shotgun sequencing and assembly
The process through which scientists decode the DNA sequence of an
organism is called sequencing. In 1975 Frederick Sanger
developed the basic sequencing technology that is still widely used
today. While this technology has been continuously improved over
the past 30 years, we can only decode between 1,000 and 2,000
base-pairs of DNA at a time -- a significant limitation given that even
the simplest viruses contain tens of thousands of base-pairs, bacteria
contain millions, and mammalian genomes contain billions of
base-pairs. To overcome this limitation, scientists have
developed a technique called shotgun sequencing whereby
the DNA sequence of an organism is sheared into a large number
of small fragments (Figure 1), the ends of the fragments are sequenced
(Figure 2), then the resulting sequences are joined together using a
computer program called an assembler (Figure 3).
Figure 1. Original DNA is broken into
a collection of fragments
Figure 2. The ends of each
fragment (drawn in green) are sequenced
Figure 3. The sequence reads are
assembled together based on sequence similarity
Assembly statistics
The assembler relies on the basic assumption that two sequence
reads
(two strings of letters produced by the sequencing machine) that share
a same string of letters originated from the same place in the genome
(Figure 3). Using such overlaps
between the sequences, the assembler can join the sequences together in
a manner similar to solving a jigsaw puzzle. It is important to
note that the shotgun sequencing process is inherently "wasteful" as,
due to the randomness of the shearing process, assembly is only
possible once enough sequences are generated to cover the genome 8 to
10 times. Intuitively, this phenomenon can be understood by
thinking of a sidewalk as it begins to rain. As raindrops fall
randomly across the sidewalk, dry spots persist for quite a while,
corresponding to regions of the genome that are not represented in the
set of shotgun reads. Mathematically, this phenomenon was modeled
by Eric Lander
and Michael
Waterman in 1988. They examined the correlation between the
oversampling of the genome (also called coverage) and the number of
contiguous pieces of DNA (commonly called contigs) that can be
re-constructed by an idealized assembly program. Figure 4 shows a
plot of the Lander-Waterman equation for a genome of 1Mbp (mega base
pairs = 1,000,000 base pairs). Between 8 and 10-fold coverage the
model predicts that most of the genome will be assembled into a small
number of contigs (approx. 5 for a 1Mbp genome).
Figure 4. Lander-Waterman
estimation of number of contigs w.r.t. genome coverage
Assembly challenges: non-random shearing and repeats
Ideally, an assembly program should produce one contig for every
chromosome of the genome being sequenced. In all but the simplest
cases, however, many contigs are produced due to a combination of
factors. Even at 8-10 fold coverage, there is a non-zero
probability that some portion of the genome remains unsequenced.
More importantly, however, the distribution of the
sheared fragments along the genome cannot be modeled as a perfect
Poisson process. Sanger sequencing requires many copies of each
fragment in order for the sequencing chemistry to be possible.
Each shotgun fragment must be cloned, a procedure usually
performed by inserting the fragment into the cell of the Escherichia coli bacterium
(called a vector) and allowing this
bacterium to grow, thereby replicating the fragment as E. coli replicates its own
genome. In most genomes, certain regions are toxic to the E. coli bacterium, leading to the
presence of gaps in the coverage.
The ability of an assembly program to produce a single contig is also
limited by regions of the genome that occur in multiple near-identical
copies throughout the genome (repeats).
The reads originating from different copies of a repeat appear
identical to the assembler and cause assembly errors. A simple
example is shown in Figure 5, where the assembler incorrectly collapses
the two copies of repeat A leading to the creation of two contigs
instead of one (Figure 6).
Figure 5. Two copies of a repeat along
a genome. The reads colored in red and those colored in yellow
appear identical to the assembly program.
Figure 6. Genome mis-assembled
due to a repeat. The assembly program incorrectly combined the
reads from the two copies of the repeat leading to the creation of two
separate contigs
Scaffolding
The contigs produced by an assembly program can be ordered and
oriented
along a chromosome using additional information contained in the
shotgun data. In most sequencing projects, the sizes of the
fragments generated through the shotgun process are carefully
controlled, thus providing a link between the sequence reads generated
from the ends of a same fragment (called paired ends or mate pairs). In a typical
shotgun project, multiple libraries
-- collections of fragments of similar sizes -- are usually
generated, providing the assembler with additional constraints: within
the assembly the paired end reads must be placed at a distance
consistent with the size of the library from which they originate and
must be oriented towards each other. Within an assembly each read
is assigned an orientation corresponding to the DNA strand from which
the read was generated. The constraints provided by mate pairs
lead to constraints on the relative order and orientation of the
contigs (Figure 7). The process through which the read pairing
information is used to order and orient the contigs along a chromosome
is called scaffolding.
Figure 7. A scaffold of 3
contigs (the thick arrows) held together by mate pairs. Thin
lines connect the paired ends.
Finishing
The ultimate goal of any sequencing project is to determine every
single base-pair of the original set of chromosomes. As we
described above, rarely is an assembly program able to reconstruct a
single piece of DNA per chromosome, leading to gaps in the
reconstruction of the genome. These gaps are filled in through
directed sequencing experiments in a process called finishing or gap closure. At this stage in
the sequencing project, additional laboratory experiments and extensive
manual curation are performed to validate the correctness of the final
assembly, leading to a high-quality reconstruction of the original
genome.
Assembly algorithms
The many assembly programs available to researchers differ in the
details of their implementation and of the algorithms employed, however
they all primarily fall into four general categories:
Greedy assemblers - The first assembly programs followed a
simple but effective strategy in which the assembler greedily joins
together the reads that are most similar to each other. An
example is shown in Figure 8, where the assembler joins, in
order, reads 1 and 2 (overlap = 200 bp), then reads 3 and 4
(overlap = 150 bp), then reads 2 and 3 (overlap = 50 bp) thereby
creating a single contig from the four reads provided in the
input. One disadvantage of the simple greedy approach is that
because local information is considered at each step, the assembler can
be easily confused by complex repeats, leading to mis-assemblies.
Figure 8. Greedy assembly of
four reads.
Overlap-layout-consensus - The
relationships between the reads provided to an assembler can be
represented as a graph, where the nodes represent each of the reads and
an edge connects two nodes if the corresponding reads overlap.
The assembly problem thus becomes the problem of identifying a path
through the graph that contains all the nodes - a Hamiltonian path (Figure
9). This formulation allows researchers to use techniques
developed in the field of graph theory in order to solve
the assembly problem. An assembler following this paradigm starts
with an overlap stage during
which all overlaps between the reads are computed and the graph
structure is computed. In a layout
stage, the graph is simplified by removing redundant information.
Graph algorithms are then used to determine a layout (relative
placement) of the reads along the genome. In a final consensus stage, the assembler
builds an alignment of all the reads covering the genome and infers, as
a consensus of the aligned reads, the original sequence of the genome
being assembled.
Figure 9. Overlap graph for a
bacterial genome. The thick edges in the picture on the left (a
Hamiltonian cycle) correspond to the correct layout of the reads along
the genome (figure on the right). The remaining edges represent
false overlaps induced by repeats (exemplified by the red lines in the
figure on the right)
Eulerian path -
Eulerian path approaches are based on early attempts to sequence
genomes through a technique called sequencing
by hybridization. In this technique, instead of generating
a set of reads, scientists identified all strings of length k (k-mers)
contained in the original genome. While this experimental
method did not produce a viable alternative to Sanger sequencing, it
led to the development of an elegant approach to sequence
assembly. This approach, also based on a graph-theoretic model,
breaks up each read into a collection of overlapping k-mers. Each
k-mer is represented in a graph as an edge connecting two nodes
corresponding to its k-1 bp prefix and suffix respectively. It is
easy to see that, in the graph containing the information obtain from
all the reads, a solution to the assembly problem corresponds to a path
in the graph that uses all the edges - an Eulerian path. One advantage
of the Eulerian approach is that repeats are immediately recognizable
while in an overlap graph they are more difficult to identify. (figures
to be added)
Align-layout-consensus -
As more and more genomes become available in public databases, it is
increasingly the case that a completed genome exists that is closely
related to the genome being assembled. The assembly problem thus
becomes easier as the relative placement of reads can be inferred from
their alignment to the related genome (or reference), in a process called comparative assembly. Thus,
the overlap stage of assembly (often one of the most computationally
intensive assembly tasks) is replaced by an alignment step. The
layout stage is also greatly simplified due to the additional
constraints provided by the alignment to the reference.
BAC-by-BAC (hierarchical) sequencing
- In order to avoid some of the complexity involved in assembling large
genomes, scientists developed a hierarchical approach. First, the
genome is broken up into a collection of large fragments (between 40
and 200 kbp) called Bacterial Artificial Chromosomes
or BACs. The BACs
location along the genome is then mapped using specialized laboratory
experiments. A minimal tiling
path of BACs is chosen such that each base in the genome is
covered by at least one BAC, and the overlap between BACs is
minimized. Each BAC is then sequenced through the standard
shotgun method, the resulting assemblies being combined into an
assembly for each chromosome using the information provided by the
tiling paths (Figure 10).
Figure 10. BAC-by-BAC approach.
The long lines represent individual BACs. The minimal tiling path
is represented by thick lines. Each BAC in the tiling path is
then sequenced through the shotgun method.
Representing assemblies
A description of how assemblies are represented in some of the most
common assembly format is presented here.
Notable assembly programs
TIGR Assembler -
Assembly program developed at the
Institute for Genomic Research (TIGR). This assembler was
used to generate the first sequence of a free living organism Haemophilus influenzae,
accomplishment reported in the journal Science in 1995.
phrap
- Assembly program developed at the University of Washington.
Despite its age, phrap is one of the most widely used assembly
programs, used throughout the years in the assembly of many bacterial
and eukaryotic genomes. Most notably, phrap was the main
workhorse in the public effort to sequence the human genome.
Celera Assembler - Assembly
program developed at Celera Genomics.
Celera Assembler demonstrated the applicability of the shotgun method
to the assembly of a whole eukaryotic genome by successfully assembling
the genome of the fruit fly Drosophila melanogaster.
Until this achievement, the assembly of large genomes was done using
BAC-by-BAC sequencing. Celera Assembler was a key element in the
successful assembly of the human genome by Celera Genomics and is
currently used in numerous bacterial and eukaryotic projects.
Arachne
- Program developed at the Broad
Institute of MIT, widely used in
genome projects both at the Broad Institute and other research
organizations. Arachne and Celera Assembler are arguably the best
assemblers available to the scientific community for the assembly of
large eukaryotic genomes.
Phusion - The main workhorse
for assembly at the Sanger Center,
one of the leading genomic centers in the world. Phusion relies
on phrap for low-level assembly and uses sophisticated algorithms to
reduce the complexity of the data provided to phrap and also to correct
mistakes in the resulting assemblies.
Atlas - Assembly program
developed at the Baylor College of
Medicine. Atlas is specifically designed to optimize the
assembly of BAC-by-BAC projects and uses a hybrid approach combining
some of the advantages of BAC-by-BAC sequencing and whole-genome
shotgun. Like Phusion, Atlas uses phrap as a low-level assembly
tool.
|
|