Assembly of data produced by novel sequencing technologies

CBCB faculty: Mihai Pop

For the past thirty years, the main method for sequencing DNA has been a technique called Sanger sequencing, named after its developer, Frederick Sanger. Despite continued advances (modern automated sequencing machines can sequence up to 2000 bp at a time), this method has two major limitations. First, the length of DNA that can be reliably sequenced is limited, because electrophoresis can discriminate between fragment sizes only up to a point. Second, the DNA being sequenced must be present at a high enough concentration before sequencing can proceed. Sanger sequencing is therefore not applicable to small amounts of DNA, such as the DNA contained in a single eukaryotic cell. Furthermore, the techniques used to amplify the DNA to the concentrations necessary for sequencing can induce mutations, thereby reducing the accuracy of the sequencing process.

Scientists have been developing new sequencing technologies that have the potential to overcome these limitations. Two main areas of research show the greatest promise:

  • Nanopore technologies - the DNA is "dragged" through a pore, one base at a time, allowing a detection mechanism to identify each of the bases as it passes through the pore.
  • Pyrosequencing - this method harnesses the same mechanism organisms use to replicate their DNA. Starting with a single strand of DNA, an enzyme called polymerase builds a complementary strand, base by base, creating the well-known double-helix structure. In pyrosequencing, the polymerase reaction is modified to emit light as each base is incorporated. By controlling the order in which bases are added to the reaction, scientists can decode the DNA from the pattern of light emitted.
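
The decoding step in pyrosequencing can be illustrated with a minimal sketch. It assumes that bases are offered to the reaction in a fixed, repeating order (the order "TACG" used here is an assumption for illustration) and that each light signal has been normalized so that a value near 1.0 corresponds to one incorporated base. The function name and the simple rounding rule are hypothetical and are not taken from any instrument's software.

```python
def decode_flowgram(intensities, flow_order="TACG"):
    """Turn per-flow light intensities into the synthesized base sequence.

    intensities: one normalized light signal per flow (assumed ~1.0 per base).
    flow_order:  the repeating order in which bases are added (an assumption).
    """
    bases = []
    for i, signal in enumerate(intensities):
        base = flow_order[i % len(flow_order)]
        # Round to the nearest whole number of incorporated bases;
        # noise below 0.5 is treated as "no incorporation".
        count = max(0, int(signal + 0.5))
        bases.append(base * count)
    return "".join(bases)


# Flows: T, A, C, G, T, A, C, G
signals = [1.02, 0.05, 2.10, 0.00, 0.10, 0.97, 0.03, 3.20]
print(decode_flowgram(signals))
```

Because the length of a run of identical bases is inferred from the strength of a single light signal, a noisy signal (for example, 2.5) can round either way, so the decoded run length may be off by one.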

The data produced by such sequencing technologies pose challenges to assembly software. The patterns of sequencing errors differ from those of Sanger sequencing. For example, pyrosequencing-based methods are known to mis-estimate the number of bases in homopolymers - stretches of DNA composed of repetitions of a single base. Such issues require the development of new software for assigning quality values to the sequence data and for error correction. In addition, the sequence reads produced by novel sequencing methods are generally short. So far, the longest reads produced by the technology developed by 454 Life Sciences are approximately 150 bp. Current assembly algorithms rely on the longer reads produced by Sanger sequencing, and also on the information contained in mate-pairs, which other sequencing technologies do not currently produce.
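
To make the homopolymer problem concrete, the sketch below splits a read into runs of identical bases and checks whether two reads differ only in the lengths of those runs. This is one simple way to recognize the kind of error described above; it is an illustration under that assumption, not the method used by our group or by any particular assembler, and the function names are hypothetical.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Return the read as a list of (base, run_length) pairs."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def differ_only_in_run_lengths(read_a, read_b):
    """True if both reads have the same order of bases once each
    homopolymer run is collapsed to a single base."""
    collapsed_a = [base for base, _ in homopolymer_runs(read_a)]
    collapsed_b = [base for base, _ in homopolymer_runs(read_b)]
    return collapsed_a == collapsed_b


print(homopolymer_runs("AAACGGT"))
# The second read has one A too few and one G too many.
print(differ_only_in_run_lengths("AAACGGT", "AACGGGT"))
```

Two reads that pass this check may still come from different parts of a genome, so a comparison of this kind would only be one ingredient in error correction.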

Our group is currently involved in developing assembly algorithms that take into account the specific characteristics of the sequences produced by 454 Life Sciences. This project is a collaboration with Robert Boissy from the Center for Genomic Sciences.