Assembly of data produced by novel sequencing technologies
CBCB faculty: Mihai Pop
For the past thirty years, the main method for sequencing DNA has been
Sanger sequencing, named after its developer, Frederick Sanger.
Despite continued advances -- modern automated sequencing machines can
sequence up to 2000 bp at a time -- this sequencing method has two
major limitations. First, the length of DNA that can be
reliably sequenced is bounded by the limited power of electrophoresis
to discriminate between fragment sizes.
Second, the DNA being sequenced must be present at a high enough
concentration before sequencing can proceed. Sanger sequencing is
therefore not applicable to small amounts of DNA,
such as the DNA contained in a single eukaryotic cell. Furthermore,
the techniques used to amplify the DNA to the concentrations necessary
for sequencing can induce mutations, thereby reducing the accuracy of
the sequencing process.
Scientists have been developing new sequencing technologies that have
the potential to overcome these limitations of Sanger sequencing.
Two main areas of research show the greatest promise:
- Nanopore technologies -
the DNA is "dragged" through a pore, one base at a time, allowing
a detection mechanism to identify each of the bases as it passes
through the pore.
- Pyrosequencing -
pyrosequencing harnesses the same mechanism organisms use to replicate
their DNA. Starting with a single strand of DNA, an enzyme
(called polymerase)
builds, base by base, a complementary strand, creating the
well-known double-helix structure. In pyrosequencing, the
polymerase reaction is modified to emit light as each base
gets incorporated. By controlling the addition of bases to the
reaction, scientists can decode the DNA from the pattern of the light
emitted.
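The decoding step in pyrosequencing can be sketched roughly as follows.
This is a minimal illustration, not the instrument's actual base-calling
software: it assumes that bases are added to the reaction in a fixed,
repeating order (the order "TACG" used here is an assumption), that the
light signal for each addition is roughly proportional to the number of
bases incorporated, and that rounding the signal gives the base count.
The function name and the sample signal values are hypothetical.

```python
def decode_flowgram(intensities, flow_order="TACG"):
    """Turn a list of light intensities into a DNA string.

    Each intensity corresponds to one addition ("flow") of a single
    base type; the flows cycle through flow_order (assumed order).
    """
    bases = []
    for i, signal in enumerate(intensities):
        base = flow_order[i % len(flow_order)]
        # Round to the nearest whole number of incorporated bases;
        # a slightly negative (noisy) signal is treated as zero.
        count = max(0, int(signal + 0.5))
        bases.append(base * count)
    return "".join(bases)


# Hypothetical signals for the flows T, A, C, G, T, A.
read = decode_flowgram([1.02, 0.05, 2.10, 0.90, 0.00, 1.10])
```

Under these assumptions, a signal of 3.4 and a signal of 3.6 for the same
flow decode to three and four bases respectively, so a small amount of
noise on a long run of identical bases changes the decoded sequence.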
The data produced by such sequencing technologies pose challenges to
assembly software. The patterns of sequencing errors
differ from those of Sanger sequencing. For example,
pyrosequencing-based methods are known to mis-estimate the number of
bases in homopolymers, stretches of DNA composed of repetitions of
a single base. Such issues require the development of new
software for assigning qualities to the sequence data and for error
correction. In addition, the sequence reads produced by novel
sequencing methods are generally short. So far, the longest reads
produced by the technology developed by 454
Life Sciences are approximately 150 bp long.
Current assembly algorithms rely on the longer reads produced by Sanger
sequencing, as well as on the information contained in mate-pairs,
which other sequencing technologies do not currently produce.
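To make the homopolymer problem mentioned above concrete, the sketch
below splits a read into runs of identical bases and compares two reads
with the run lengths ignored. It is only an illustration of the error
pattern, not the method used by our group or by any particular
assembler; the function names and the example reads are hypothetical.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Return the read as a list of (base, run_length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


def collapse_homopolymers(seq):
    """Reduce every run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(seq))


# Two hypothetical reads of the same region that disagree only in
# the lengths of their homopolymer runs.
read_a = "AAACGGT"
read_b = "AACGGGT"

runs_a = homopolymer_runs(read_a)
same_after_collapse = (
    collapse_homopolymers(read_a) == collapse_homopolymers(read_b)
)
```

The two reads differ when compared base by base, but they are identical
once runs are collapsed, which suggests that the disagreement may be a
run-length mis-estimate rather than a true difference in the DNA.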
Our group is currently involved in developing assembly algorithms that
take into account the specific characteristics of the sequences
produced by 454 Life Sciences.
This project is a collaboration with Robert Boissy from the Center for
Genomic Sciences.