Genome Assembly (the next generation!)
Deciphering the genome of an organism is a computationally challenging
task, akin to solving a very large one-dimensional puzzle. This is
because we currently know of no way to read the letters of a genome
from start to finish. Various sequencing technologies do exist, however,
that can read short stretches of DNA (30-1000 bases), subject to some
experimental error. A common approach, called Whole Genome Shotgun
sequencing, is therefore to shred the DNA into small pieces, read the
sequence of each piece, and then use software to put the pieces
together into longer sequences.
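
The shred-and-reassemble idea can be illustrated with a toy example. The
sketch below is not how production assemblers work: it assumes error-free
reads, a genome without repeats, and reads taken at a fixed step rather
than at random. The function names (shred, overlap, greedy_assemble) and
the example sequence are made up for illustration.

```python
def shred(genome, read_len, step):
    """Cut the genome into overlapping reads at a fixed step (real shotgun reads are random)."""
    return [genome[i:i + read_len]
            for i in range(0, len(genome) - read_len + 1, step)]


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap until none overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: the assembly stays fragmented
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


genome = "ATGGCGTGCAATCCGTTAGC"          # made-up 20-base example sequence
reads = shred(genome, read_len=8, step=4)
print(greedy_assemble(reads))            # ['ATGGCGTGCAATCCGTTAGC']
```

With shorter reads or repeated sequence the overlaps become ambiguous or
vanish, and the same procedure returns several disconnected pieces
instead of one.
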
Recently, several new sequencing technologies (454, Illumina, SOLiD)
have been introduced that sequence faster and more cheaply, by several
orders of magnitude. The reads, however, can be very short (~30 bases
for some), and the genomes reconstructed from them are often highly
fragmented (thousands of pieces). A promising solution to this problem
is the use of ordered restriction maps (such as optical maps and
nanocode maps) to order the sequence fragments in a genome-wide map. In
recent work, we designed a robust system for "scaffolding" genomic
sequences onto such maps that can handle sequencing errors and detect
misassemblies (SOMA - Scaffolding using Optical Map Alignment). The
SOMA package (Nagarajan et al., 2008) is freely available and has been
used to scaffold nearly a dozen bacterial genomes (see also: Yersinia
genomes).
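
To make the idea of placing sequence fragments on an ordered restriction
map concrete, here is a minimal sketch. It is not SOMA's algorithm: it
simply digests a contig in silico, then slides the resulting fragment
sizes along the map and keeps the window with the smallest relative size
error. The function names, the 10% tolerance, and the fragment sizes are
all assumptions chosen for illustration. It also assumes no missing or
extra cut sites, which a real tool has to tolerate.

```python
def digest(seq, site):
    """Fragment lengths from cutting seq at the start of each occurrence of site.

    Simplification: real enzymes cut at a fixed offset inside the site.
    """
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def best_placement(contig_frags, map_frags, tolerance=0.1):
    """Return (offset, total_error) of the best-matching window on the map, or None.

    A window matches only if every fragment is within `tolerance`
    (relative size difference) of the corresponding map fragment.
    """
    n = len(contig_frags)
    if n == 0:
        return None
    best = None
    for offset in range(len(map_frags) - n + 1):
        window = map_frags[offset:offset + n]
        errs = [abs(c - m) / m for c, m in zip(contig_frags, window)]
        if max(errs) <= tolerance:
            total = sum(errs)
            if best is None or total < best[1]:
                best = (offset, total)
    return best


# Hypothetical map fragment sizes (bp) and the interior fragments of one contig.
# Only interior fragments are used: the two end fragments of a contig are
# truncated by the contig boundaries, not by restriction sites.
optical_map = [5200, 3100, 7800, 2400, 6100, 4000]
contig_interior = [7600, 2500, 6000]
placement = best_placement(contig_interior, optical_map)
print(placement[0])   # 2: the contig starts at the third map fragment
```

Contigs placed this way inherit an order and approximate spacing from
the map, which is the sense in which the map acts as a scaffold. A contig
whose fragments match no window within tolerance is a candidate for
closer inspection.
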