Genome Sequence Assembly
CBCB faculty: Steven Salzberg, Jim Yorke, Art Delcher, Mihai Pop
CBCB students and staff: Adam Phillippy, Mike Schatz, Dan Sommer
The IPST genome assembly group
at the UMD Institute for Physical Science & Technology
1. Current Research
2. Assembly Software
3. Assembly Data
4. AMOS documentation project
Current Research
Although the assembly of bacterial genomes has become a routine task
at major sequencing centers, the assembly problem is far from solved.
Many new challenges are uncovered as scientists tackle diverse new
organisms. Furthermore, new sequencing technologies will change the
assumptions currently made about the characteristics of the data being
assembled.
Current sequencing technologies only allow us to "read" up to 1000-2000
bases of DNA at a time. To overcome this limitation, sequencing of
entire organisms is performed through a process called shotgun
sequencing, wherein the DNA is sheared into smaller fragments whose
ends are then sequenced. The reconstruction of the original DNA
sequence is handled by specialized computer programs called assemblers.
The output of an assembly program consists of a collection of
contiguous pieces (contigs); entire chromosomes are rarely
reconstructed into a single piece. An additional computer program, the
scaffolder, uses the information linking together sequencing reads from
the two ends of the same fragment to order and orient the contigs with
respect to each other along a chromosome.
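The reason assemblies end up as separate contigs can be illustrated
with a toy simulation. The sketch below is only an illustration under
simplifying assumptions: reads are modeled as intervals whose true
positions are known (a real assembler has to infer them from sequence
overlaps), and the function names and parameters are hypothetical, not
part of any tool described on this page.

```python
import random

def sample_reads(genome_len, n_reads, read_len, seed=0):
    """Pick random half-open (start, end) intervals, mimicking shotgun reads."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, genome_len - read_len)
        reads.append((start, start + read_len))
    return reads

def merge_into_contigs(reads, min_overlap=1):
    """Merge intervals that overlap by at least min_overlap bases.

    Each merged interval plays the role of a contig; the uncovered
    stretches between them are the gaps a scaffolder has to span.
    """
    contigs = []
    for start, end in sorted(reads):
        if contigs and contigs[-1][1] - start >= min_overlap:
            # The read overlaps the current contig enough: extend it.
            contigs[-1][1] = max(contigs[-1][1], end)
        else:
            # Too little (or no) overlap: start a new contig.
            contigs.append([start, end])
    return [tuple(c) for c in contigs]

reads = sample_reads(genome_len=10_000, n_reads=40, read_len=800)
contigs = merge_into_contigs(reads, min_overlap=40)
print(len(contigs), "contigs from", len(reads), "reads")
```

Lowering n_reads or raising min_overlap in this sketch tends to
fragment the result into more contigs, which mirrors why low-coverage
projects rarely yield a single piece per chromosome.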
Learn more about assembly in our assembly primer.
Despite continued advances in the development of assembly algorithms,
few tools are available that evaluate the correctness of the assemblies
generated.
With the exception of the few genomes that are manually curated by
experts during an expensive process called finishing, most genome data
is published as "draft" assemblies whose quality is uncertain.
Our group has been developing assembly validation tools that make use
of all available information about an assembly to determine its quality
and correct any misassemblies.
Metagenomics is a new field of research in which scientists analyze the
genomes of organisms recovered directly from the environment. Most
naturally occurring bacteria cannot be cultured and therefore cannot be
analyzed by traditional means.
Metagenomic studies overcome this limitation by providing a mechanism
for analyzing previously unknown organisms. They have a wide range of
applications, from environmental studies to human health.
For the past thirty years, the main method for sequencing DNA has been
a technique called Sanger sequencing. Despite continued advances, this
sequencing method has major limitations and remains prohibitively
costly for many genome projects.
Scientists have been developing new sequencing technologies that have
the potential to overcome these limitations, but the data produced by
such sequencing technologies pose new challenges to assembly software.
• Additional research areas
- Automatic finishing techniques
- Automatic sequencing error correction
- Handling of polymorphic data
- Repeat resolution
- Representation of assembly data in public databases
Software
AMOS
is a consortium committed to the development of open-source whole
genome assembly software. The project acronym (AMOS) represents our
primary goal: to produce A Modular, Open-Source whole genome assembler.
The main thrust of the AMOS project is to provide the scientific
community with an open standard that will enable active collaborations
in assembly research, by allowing researchers to concentrate on
specific assembly challenges without the need to implement a full
assembly program.
Several modules of the AMOS assembler are already available:
- Core libraries and API - API for handling and manipulating AMOS
messages, data-banks and internal assembly data structures such as
sequencing reads, contigs, scaffolds, etc.
- AMOScmp - comparative sequence assembler that allows users to
assemble one genome using another one as a reference
- AutoEditor - automatic correction of genome sequencing errors by
focused chromatogram reanalysis
- Bambus - hierarchical scaffolding package
- minimus - lightweight assembly tool for performing small assembly
tasks
AMOS needs and wants - This page contains a list of utilities that
would be handy to have but that we haven't yet managed to write. If you
wish to implement one of these, please let us know.
The Celera Assembler is the program used to assemble the human genome
at Celera Genomics in 2001. It was also used to assemble the mouse,
rat, fruit fly, mosquito, and several other bacterial and eukaryotic
genomes. It uses sophisticated string and graph algorithms based on the
overlap-layout-consensus assembly paradigm.
Follow the links to a short tutorial on running the Celera Assembler
and a guide for interpreting the results.
MUMmer
is a modular system for the rapid whole genome alignment of finished or
draft sequence. This package provides an efficient suffix tree library,
seed-and-extend alignment, SNP detection, repeat detection, and
visualization tools.
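The seed-and-extend idea can be sketched in a few lines. The example
below is not MUMmer's implementation: it swaps the suffix tree for a
plain k-mer hash index, does not require seeds to be unique, and extends
matches only while bases agree exactly, whereas a real aligner
tolerates mismatches and gaps. All function names and the choice of k
are assumptions made for illustration.

```python
from collections import defaultdict

def index_kmers(ref, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def seed_and_extend(ref, query, k=4):
    """Return maximal exact matches as (ref_start, query_start, length)."""
    index = index_kmers(ref, k)
    matches = set()
    for q in range(len(query) - k + 1):
        # Seed: an exact k-mer shared by query and reference.
        for r in index.get(query[q:q + k], []):
            r_lo, q_lo = r, q
            # Extend to the left while the bases agree.
            while r_lo > 0 and q_lo > 0 and ref[r_lo - 1] == query[q_lo - 1]:
                r_lo -= 1
                q_lo -= 1
            r_hi, q_hi = r + k, q + k
            # Extend to the right while the bases agree.
            while r_hi < len(ref) and q_hi < len(query) and ref[r_hi] == query[q_hi]:
                r_hi += 1
                q_hi += 1
            # Seeds inside the same match extend to the same triple,
            # so the set removes the duplicates.
            matches.add((r_lo, q_lo, r_hi - r_lo))
    return sorted(matches)

for ref_start, query_start, length in seed_and_extend("TTGACCATGGA", "CCGACCATCC"):
    print(ref_start, query_start, length)
```

Indexing the reference once and probing it with each query k-mer keeps
the seeding step roughly linear in the input sizes, which is the same
motivation behind using a suffix tree in the real package.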
The Slice Tools and the libSlice library can assess the quality of, and
manipulate, consensus bases as slices of underlying read data. The
Slice Tools use the libSlice library and the slicing methods to modify
multiple alignments and consensus sequences in various ways. The
architecture of the Slice Tools is centered around the Slice XML
format, which allows the output of one tool to become the input to
another, creating ad hoc assembly pipelines.
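A slice can be pictured as one column of the multiple alignment: all
the read bases that cover a single consensus position. The sketch below
illustrates that idea only. It does not use the libSlice API or the
Slice XML format, it ignores base quality values, and the function
names and the simple majority vote are assumptions for illustration.

```python
from collections import Counter

def build_slices(aligned_reads):
    """Turn (offset, sequence) read placements into per-column slices.

    Each slice is the list of read bases covering one consensus position.
    """
    length = max(offset + len(seq) for offset, seq in aligned_reads)
    slices = [[] for _ in range(length)]
    for offset, seq in aligned_reads:
        for i, base in enumerate(seq):
            slices[offset + i].append(base)
    return slices

def call_consensus(slices):
    """Return (consensus_base, agreement) for each slice.

    agreement is the fraction of bases in the slice that match the
    majority base, a crude stand-in for a quality score. Columns with
    no coverage are reported as ('N', 0.0).
    """
    calls = []
    for column in slices:
        if not column:
            calls.append(("N", 0.0))
            continue
        base, count = Counter(column).most_common(1)[0]
        calls.append((base, count / len(column)))
    return calls

placements = [(0, "ACGT"), (1, "CGTA"), (2, "GAAC")]
calls = call_consensus(build_slices(placements))
print("".join(base for base, _ in calls))
```

Columns where the agreement is well below 1.0 are the ones a tool would
flag as low quality or as candidate sequencing errors.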
The TIGR Assembler is the classic assembly tool developed by TIGR to
build a consensus sequence from smaller sequence fragments. TIGR
Assembler is comparable to Phrap and other greedy-algorithm-based
assemblers.
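The greedy strategy can be sketched as follows: repeatedly merge the
pair of sequences with the longest suffix-to-prefix overlap until no
sufficiently long overlap remains. This is a toy version under strong
assumptions (error-free reads, a single strand, no quality values) and
is not the algorithm of TIGR Assembler or Phrap; the function names and
the min_len threshold are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Greedily merge the best-overlapping pair until none is left."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # no remaining pair overlaps enough
        # Join the pair, keeping the overlapping bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]

reads = ["ATGGCGT", "GCGTACC", "TACCTTA", "CCCAAAG"]
for contig in greedy_assemble(reads, min_len=3):
    print(contig)
```

Because each merge is chosen locally, a greedy assembler can join reads
across a repeat incorrectly, which is one motivation for graph-based
approaches such as overlap-layout-consensus.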
Assembly Data
Assembly Benchmark Data
As part of our efforts to develop a new open-source genome assembler,
we are collecting a set of benchmark data to use in testing and
comparing our assembler to others.
In the interest of promoting progress in assembly development more
widely, we are making these benchmark sets freely available through
this site.
Although genome sequences are frequently published in final form, the
raw data underlying these genomes is almost never available. This data
may prove useful not only for testing assemblers, but also for
searching for polymorphisms and for answering other scientific
questions.
Production Assemblies
In addition to our research on developing novel assembly algorithms, we
commonly provide assistance to scientists performing sequencing
projects. This "production" aspect of our work is very important: it
provides scientists with better assemblies of their genomes, and it
provides us with valuable insights into the nature of the problems
encountered in practice.
These collaborations allow us to tailor our research to solving
problems of importance to the biological community.
Selected Publications
- M.C. Schatz, A.M. Phillippy, B. Shneiderman, S.L. Salzberg. Hawkeye: a visual analytics tool for genome assemblies. Genome Biology 8:R34, 2007.
- M. Roberts, B.R. Hunt, J.A. Yorke, R.A. Bolanos, A.L. Delcher. A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology 11(4), pp. 734-752, 2004.
- M. Roberts, W. Hayes, B.R. Hunt, S.M. Mount, J.A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), pp. 3363-3369, 2004.
- M. Pop, A. Phillippy, A.L. Delcher, S.L. Salzberg. Comparative genome assembly. Briefings in Bioinformatics 5(3), pp. 237-248, 2004.
- M. Pop. Shotgun sequence assembly. Advances in Computers vol. 60, M. Zelkowitz ed., June 2004.
- M. Pop, D. Kosack. Using the TIGR Assembler in shotgun-sequencing projects. In Bacterial Artificial Chromosomes vol. 1, S. Zhao and M. Stodolsky eds., Humana Press, pp. 279-294, March 2004.
- M. Pop, D.S. Kosack, S.L. Salzberg. Hierarchical scaffolding with Bambus. Genome Research 14(1), pp. 149-159, 2004.
- P. Gajer, M. Schatz, S.L. Salzberg. Automated correction of genome sequence errors. Nucleic Acids Research 32(2), pp. 562-569, 2004.
- M. Pop, S.L. Salzberg, M. Shumway. Genome Sequence Assembly: Algorithms and Issues. IEEE Computer 35(7), pp. 47-54, 2002. Copyright 2002 IEEE. Reproduced with permission from IEEE.