Genome Assembly with Short Reads
The recent availability of
high-throughput sequencing technologies has "democratized" genome
sequencing by providing individual labs with a sequencing capacity
similar to what was previously only available at large genome centers.
Several companies have announced the availability of genome
sequencers capable of sequencing up to about 2Gbp of DNA in a single
run for costs as low as $1000. Machines produced by 454 Life Sciences/Roche and Solexa/Illumina are already in active use in many labs, and competing technologies from Applied Biosystems and Helicos have recently become available.
These
new technologies have several characteristics that complicate the
analysis of the resulting data using software tools originally
developed for the "traditional" Sanger sequencing technology. In
particular, the sequence reads are much shorter than those produced
through Sanger sequencing. The 454 technology currently generates reads
of approximately 250bp (compared to about 1000bp commonly achieved
through the Sanger method), while the other technologies generate reads
of just 30-40bp in length. Furthermore, the new sequencing
machines generate large amounts of data, up to several terabytes per
run, requiring the development of highly efficient software for
analyzing the resulting sequences.
Researchers at the CBCB are
actively involved in the development of new software tools and
algorithms for the analysis of the data generated by the new
technologies. Several software tools are already available for
this purpose, and we have been applying them to
several sequencing projects. The relevant software, along with
instructions on how to use it, is described below.
All our software is freely released, without restrictions, under the open-source Artistic License.
Comparative assembly with short reads
Comparative assembly refers to the assembly of a genome using the sequence of a
close relative as a reference, and is frequently referred to as
"templated assembly" or "resequencing". Our software, AMOScmp,
was originally developed in the context of Sanger data; however, with
small modifications it is directly applicable to short read sequencing
data.
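The idea behind templated assembly can be illustrated with a toy sketch. It assumes the reads have already been placed on the reference by a separate alignment step, represented here as hypothetical (offset, sequence) pairs, and it simply takes a majority vote in each reference column. It is not AMOScmp's implementation: it ignores insertions, deletions, and base qualities, and the function name `comparative_consensus` is made up for this illustration.

```python
from collections import Counter

def comparative_consensus(reference, placed_reads):
    """Majority-vote consensus of reads laid out against a reference.

    placed_reads: iterable of (offset, sequence) pairs, where offset is the
    0-based reference position of the read's first base (assumed to come
    from an aligner; gapless placement only).
    Columns with no read coverage are reported as 'N'.
    """
    columns = [Counter() for _ in reference]
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < len(columns):
                columns[pos][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )

# Toy data: the sequenced genome differs from the reference at position 4.
reference = "ACGTACGTAC"
reads = [(0, "ACGTT"), (2, "GTTCG"), (3, "TTCGT"), (5, "CGTAC")]
print(comparative_consensus(reference, reads))  # ACGTTCGTAC
```

The reference is used only to lay out the reads; the bases of the result come from the reads themselves, which is why the difference at position 4 shows up in the consensus.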
AMOScmp - The main page describing the AMOScmp package, also containing instructions on how to obtain and install the software.
Additional scripts useful in handling short read data:
* AMOScmp-shortRead
* AMOScmp-shortReads-alignmentTrimmed
Short tutorial on using AMOScmp with short read data
MUMmerGPU
- Extension of our genome aligner MUMmer allowing it to take advantage
of the specialized hardware available in many graphics cards to
dramatically speed up the alignment process. MUMmerGPU is
specifically targeted at the large volumes of data generated by new
generation sequencing machines.
Short
read mapper - Efficient read mapper for short read data based on
bit-wise operations on compressed reads. Initial tests indicate
our tool is several times faster than other aligners. (...coming
soon)
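Since the mapper itself is not yet released, the following is only a generic sketch of what "bit-wise operations on compressed reads" can look like, not a description of our tool. It assumes a 2-bit-per-base encoding (A=0, C=1, G=2, T=3) and counts mismatches between a read and each reference window with a single XOR and a bit count. All names (`pack`, `mismatches`, `map_read`) are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding

def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base (first base highest)."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value

def mismatches(packed_a, packed_b, length):
    """Count differing bases between two packed sequences of the same length."""
    diff = packed_a ^ packed_b
    # (4**length - 1) // 3 is the bit pattern 0101...01: the low bit of each slot.
    low_bits = ((1 << (2 * length)) - 1) // 3
    # A slot differs if either of its two bits differs; fold both onto the low bit.
    differing = (diff | (diff >> 1)) & low_bits
    return bin(differing).count("1")

def map_read(reference, read, max_mismatches=1):
    """Return (position, mismatch count) for every window within the threshold."""
    k = len(read)
    hits = []
    if k == 0 or len(reference) < k:
        return hits
    packed_read = pack(read)
    window_mask = (1 << (2 * k)) - 1
    window = pack(reference[:k])
    for pos in range(len(reference) - k + 1):
        if pos > 0:
            # Slide the window: shift in the next base, drop the oldest one.
            window = ((window << 2) | CODE[reference[pos + k - 1]]) & window_mask
        mm = mismatches(window, packed_read, k)
        if mm <= max_mismatches:
            hits.append((pos, mm))
    return hits

print(map_read("ACGTACGTTACG", "ACGTT"))  # [(0, 1), (4, 0)]
```

A real mapper would add an index to avoid scanning every window, and would pack reads into fixed-width machine words rather than Python integers; the sketch only shows the comparison step.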
De novo assembly with short reads
Minimus
- Lightweight assembler originally developed for the assembly of small
sets of reads. In conjunction with an efficient overlapper,
Minimus can be applied to large short-read datasets. Minimus
avoids mis-assembling repeats (a common challenge when analyzing
short-read data) by using a highly conservative assembly algorithm.
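One common way to be conservative about repeats, shown here as an assumption-laden sketch rather than as Minimus's actual code, is to join two reads only when their overlap is unambiguous: the first read has exactly one possible successor and the second has exactly one possible predecessor. Chains stop wherever the overlap graph forks, which is what happens at repeat boundaries. The input format and the function name `conservative_chains` are hypothetical.

```python
from collections import defaultdict

def conservative_chains(reads, overlaps):
    """Group reads into chains, joining only across unambiguous overlaps.

    reads: iterable of read ids.
    overlaps: iterable of (a, b) pairs meaning the end of a overlaps the start of b.
    An edge a -> b is followed only if b is a's sole successor and
    a is b's sole predecessor.
    """
    succ, pred = defaultdict(set), defaultdict(set)
    for a, b in overlaps:
        succ[a].add(b)
        pred[b].add(a)

    def unique_next(a):
        if len(succ[a]) == 1:
            (b,) = succ[a]
            if len(pred[b]) == 1:
                return b
        return None

    def unique_prev(b):
        if len(pred[b]) == 1:
            (a,) = pred[b]
            if len(succ[a]) == 1:
                return a
        return None

    chains, seen = [], set()
    for read in reads:
        if read in seen:
            continue
        # Walk backwards to the start of this read's chain (guarding against cycles).
        start, visited = read, {read}
        prev = unique_prev(start)
        while prev is not None and prev not in visited:
            start = prev
            visited.add(prev)
            prev = unique_prev(start)
        # Walk forwards, collecting reads until the path becomes ambiguous.
        chain = [start]
        seen.add(start)
        nxt = unique_next(start)
        while nxt is not None and nxt not in seen:
            chain.append(nxt)
            seen.add(nxt)
            nxt = unique_next(nxt)
        chains.append(chain)
    return chains

# Read "R" comes from a repeat: two reads lead into it and two lead out of it.
reads = ["r1", "r2", "R", "r3", "r4", "r5", "r6"]
overlaps = [("r1", "r2"), ("r2", "R"), ("r4", "R"),
            ("R", "r3"), ("R", "r5"), ("r5", "r6")]
print(conservative_chains(reads, overlaps))
```

In the toy data the repeat read "R" ends up in a chain of its own instead of being merged into either of its flanking contexts, trading contig length for correctness.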
Short
read overlapper - Efficient read overlapper for short read data based
on bit-wise operations. Currently the only general purpose
overlapper available for this task (...coming soon).
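As with the mapper, the overlapper is not yet released, so the following only illustrates the general technique under stated assumptions: reads are packed at 2 bits per base, and a suffix of one read is compared with a prefix of another using a mask and a shift. It finds exact overlaps only, and the names `pack` and `longest_overlap` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding

def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base (first base highest)."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value

def longest_overlap(read_a, read_b, min_overlap=3):
    """Length of the longest exact match between a suffix of read_a
    and a prefix of read_b, or 0 if it is shorter than min_overlap."""
    a, b = pack(read_a), pack(read_b)
    len_b = len(read_b)
    for k in range(min(len(read_a), len_b), min_overlap - 1, -1):
        suffix = a & ((1 << (2 * k)) - 1)   # last k bases of read_a
        prefix = b >> (2 * (len_b - k))     # first k bases of read_b
        if suffix == prefix:
            return k
    return 0

print(longest_overlap("ACGTTGCA", "TGCAGGAT"))  # 4 (shared "TGCA")
```

Each candidate length costs one mask, one shift, and one integer comparison. A practical overlapper would also need to tolerate sequencing errors and avoid comparing all pairs of reads.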
Selected Publications
- Mihai Pop and Steven L. Salzberg. Bioinformatics challenges of new sequencing technology. Trends in Genetics. 24(3):142-149. 2008
- Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics 8:474, 2007.
- Mihai Pop, Adam M. Phillippy, Arthur L. Delcher and Steven L. Salzberg. Comparative genome assembly. Briefings in Bioinformatics. 5(3), pp. 237-248, 2004.