Genome Sequence Assembly

CBCB faculty: Steven Salzberg, Jim Yorke, Art Delcher, Mihai Pop
CBCB students and staff: Adam Phillippy, Mike Schatz, Dan Sommer
The IPST genome assembly group at the UMD Institute for Physical Science & Technology

1. Current Research
2. Assembly Software
3. Assembly Data
4. AMOS documentation project


Current Research


Although the assembly of bacterial genomes has become a routine task at major sequencing centers, the assembly problem is far from solved. Many new challenges are uncovered as scientists tackle diverse new organisms. Furthermore, new sequencing technologies will change the assumptions currently made about the characteristics of the data being assembled.

Current sequencing technologies only allow us to "read" up to 1000 - 2000 bases of DNA at a time. To overcome this limitation, entire organisms are sequenced through a process called shotgun sequencing, in which the DNA is sheared into smaller fragments whose ends are then sequenced. The original DNA sequence is reconstructed by specialized computer programs called assemblers. The output of an assembler is a collection of contiguous pieces (contigs); entire chromosomes are rarely reconstructed into a single piece. An additional computer program, the scaffolder, uses the information linking together the sequencing reads from the two ends of each fragment to order and orient the contigs with respect to each other along a chromosome.
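As a rough illustration of the shotgun process described above, the following Python sketch shears a toy genome into random fragments and reads both ends of each one. The function names, the fixed fragment length, and the toy genome are assumptions made for this example. Real libraries have variable fragment sizes, and real reads contain sequencing errors.

```python
import random

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def shotgun_reads(genome, n_fragments, fragment_len, read_len, seed=0):
    """Shear `genome` into random fragments and read both ends of each one.

    Returns a list of (forward_read, reverse_read) pairs. The reverse read
    comes from the opposite strand, so it is the reverse complement of the
    last `read_len` bases of the fragment.
    """
    rng = random.Random(seed)  # seeded so the toy example is repeatable
    pairs = []
    for _ in range(n_fragments):
        start = rng.randrange(len(genome) - fragment_len + 1)
        fragment = genome[start:start + fragment_len]
        forward = fragment[:read_len]
        reverse = reverse_complement(fragment[-read_len:])
        pairs.append((forward, reverse))
    return pairs


# Hypothetical toy genome; real genomes are millions of bases long.
genome = "ACGTTGCATGCCGATAGCTAGGCTTACGATCGATCGGATTACA"
pairs = shotgun_reads(genome, n_fragments=5, fragment_len=20, read_len=6)
for forward, reverse in pairs:
    print(forward, reverse)
```

Each pair records that its two reads came from the same fragment, a known distance apart and on opposite strands. This is the linking information a scaffolder relies on to order and orient contigs.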

Learn more about assembly in our assembly primer.

• Assembly Validation

Despite continued advances in the development of assembly algorithms, few tools are available that evaluate the correctness of the assemblies generated. With the exception of the few genomes that are manually curated by experts during an expensive process called finishing, most genome data is published as "draft" assemblies whose quality is uncertain. Our group has been developing assembly validation tools that make use of all available information about an assembly to determine its quality and correct any misassemblies.

• Metagenomics

Metagenomics is a new field of research in which scientists analyze the genomes of organisms recovered directly from the environment. Most naturally occurring bacteria cannot be cultured and therefore cannot be analyzed by traditional means. Metagenomic studies overcome this limitation, providing a mechanism for analyzing previously unknown organisms. They have a wide range of applications, from environmental studies to human health.

• Development of assembly algorithms for data produced by novel sequencing technologies

For the past thirty years, the main method for sequencing DNA has been a technique called Sanger sequencing. Despite continued advances, this sequencing method has major limitations and remains prohibitively costly for many genome projects. Scientists have been developing new sequencing technologies that have the potential to overcome these limitations, but the data produced by such sequencing technologies pose new challenges to assembly software.

• Additional research areas

  • Automatic finishing techniques
  • Automatic sequencing error correction
  • Handling of polymorphic data
  • Repeat resolution
  • Representation of assembly data in public databases


Assembly Software


AMOS
 
AMOS is a consortium committed to the development of open-source whole genome assembly software. The project acronym (AMOS) represents our primary goal -- to produce A Modular, Open-Source whole genome assembler. The main thrust of the AMOS project is to provide the scientific community with an open standard that will enable active collaborations in assembly research, by allowing researchers to concentrate on specific assembly challenges without the need to implement a full assembly program.

Several modules of the AMOS assembler are already available:

  • Core libraries and API - API for handling and manipulating AMOS messages, data-banks and internal assembly data structures such as sequencing reads, contigs, scaffolds, etc.
  • AMOScmp - comparative sequence assembler that allows users to assemble one genome using another one as a reference
  • AutoEditor - automatic correction of genome sequencing errors by focused chromatogram reanalysis
  • Bambus - hierarchical scaffolding package
  • minimus - lightweight assembly tool for performing small assembly tasks

AMOS needs and wants - This page contains a list of utilities that would be handy to have but that we haven't yet managed to write. If you wish to implement one of these, please let us know.




Celera Assembler

The Celera Assembler is the program used to assemble the human genome at Celera Genomics in 2001. It was also used to assemble the mouse, rat, fruit fly, and mosquito genomes, as well as several other bacterial and eukaryotic genomes. It uses sophisticated string and graph algorithms based on the overlap-layout-consensus assembly paradigm.

Follow the links to a short tutorial on running the Celera Assembler and a guide for interpreting the results.




MUMmer

MUMmer is a modular system for the rapid whole genome alignment of finished or draft sequence. This package provides an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools.
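To make the seed-and-extend idea concrete, here is a minimal Python sketch. It is not MUMmer's implementation: MUMmer finds its seeds with a suffix tree, whereas this example assumes a simple hash index of fixed-length k-mers and reports every maximal exact match without checking uniqueness. The function name and the toy sequences are hypothetical.

```python
from collections import defaultdict


def seed_and_extend(ref, query, k):
    """Find maximal exact matches between `ref` and `query`.

    Seeds are exact k-mer hits located through a hash index of `ref`.
    Each seed is extended left and right while the bases keep matching.
    Returns a sorted list of (ref_start, query_start, length) tuples.
    """
    # Index every k-mer of the reference by its start positions.
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)

    matches = set()  # seeds on the same diagonal collapse to one match
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            # Extend to the left of the seed.
            r_start, q_start = i, j
            while (r_start > 0 and q_start > 0
                   and ref[r_start - 1] == query[q_start - 1]):
                r_start -= 1
                q_start -= 1
            # Extend to the right of the seed.
            r_end, q_end = i + k, j + k
            while (r_end < len(ref) and q_end < len(query)
                   and ref[r_end] == query[q_end]):
                r_end += 1
                q_end += 1
            matches.add((r_start, q_start, r_end - r_start))
    return sorted(matches)


ref = "TTACGTACGGA"
query = "GGACGTACGTT"
print(seed_and_extend(ref, query, k=4))  # -> [(1, 5, 5), (2, 2, 7)]
```

A real aligner would also chain nearby matches and align the gaps between them. This sketch stops at the exact-match stage.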




Slice Tools

The Slice Tools and the libSlice library assess the quality of, and manipulate, consensus bases as slices of the underlying read data. The Slice Tools use libSlice and its slicing methods to modify multiple alignments and consensus sequences in various ways. Their architecture is centered on the Slice XML format, which allows the output of one tool to become the input to another, creating ad-hoc assembly pipelines.
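The sketch below illustrates one way to think about a slice, reading it as the column of read bases (with their quality values) that sits under a single consensus position. This reading, the function names, and the scoring rule (sum the quality values supporting each base, then subtract the conflicting support) are assumptions made for illustration. They do not reproduce the libSlice API or its actual consensus model.

```python
from collections import defaultdict


def call_slice(slice_):
    """Call a consensus base for one non-empty slice.

    `slice_` is a list of (base, quality) pairs, one per read covering the
    column. Returns (consensus_base, score), where the score is the summed
    quality supporting the winner minus the summed quality against it.
    """
    support = defaultdict(int)
    for base, quality in slice_:
        support[base] += quality
    # Sorting first makes ties resolve alphabetically and deterministically.
    best = max(sorted(support), key=support.get)
    conflict = sum(support.values()) - support[best]
    return best, support[best] - conflict


def consensus(slices):
    """Build a consensus string from a list of slices."""
    return "".join(call_slice(s)[0] for s in slices)


# Three hypothetical columns of a multiple alignment.
slices = [
    [("A", 30), ("A", 25), ("C", 10)],
    [("G", 20), ("T", 35)],
    [("C", 40)],
]
print(call_slice(slices[0]))  # -> ('A', 45)
print(consensus(slices))      # -> ATC
```

A low or negative score marks a column where the reads disagree, which is the kind of position a quality-assessment tool would flag for review.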




TIGR Assembler

The TIGR Assembler is the classic assembly tool developed by TIGR to build a consensus sequence from smaller sequence fragments. It is comparable to Phrap and other assemblers based on greedy algorithms.
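For readers unfamiliar with the greedy approach, the following Python sketch shows the basic idea: repeatedly merge the two sequences with the longest suffix-prefix overlap until no sufficiently long overlap remains. It is a toy under strong assumptions (error-free reads, a single strand, no repeats) and is not the code of the TIGR Assembler or Phrap. The function names and the `min_len` threshold are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedily merge reads into contigs by largest overlap first."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        # Find the ordered pair of contigs with the longest overlap.
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        # Merge the pair, keeping the overlapping bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["TGCAATC", "ATGGCGT", "GCGTGCA"]
print(greedy_assemble(reads))  # -> ['ATGGCGTGCAATC']
```

Because each merge is chosen locally, a greedy assembler can join reads incorrectly across repeated regions. That is one reason the source contrasts it with overlap-layout-consensus assemblers such as the Celera Assembler.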


Assembly Data


Assembly Benchmark Data
As part of our efforts to develop a new open source genome assembler, we are collecting a set of benchmark data to use in testing and comparing our assembler to others.  In the interests of promoting progress in assembly development more widely, we are making these benchmark sets freely available through this site.  Although genome sequences are frequently published in final form, the raw data underlying these genomes is almost never available.  This data may prove useful not only for testing assemblers, but also for searching for polymorphisms and for answering other scientific questions.



Production Assemblies
In addition to our research in developing novel assembly algorithms, we commonly provide assistance to scientists performing sequencing projects. This "production" aspect of our work is very important: it provides scientists with better assemblies of their genomes, and it gives us valuable insights into the nature of the problems encountered in practice. These collaborations allow us to tailor our research to solving problems of importance to the biological community.


Selected Publications


  • M.C. Schatz, A.M. Phillippy, B. Shneiderman, S.L. Salzberg. Hawkeye: a visual analytics tool for genome assemblies. Genome Biology 8:R34, 2007.
  • M. Roberts, B.R. Hunt, J.A. Yorke, R.A. Bolanos and A.L. Delcher. A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology. Vol. 11, No. 4: 734-752. 2004.
  • M. Roberts, W. Hayes, B.R. Hunt, S.M. Mount and J.A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics. 20(18):3363-3369; 2004.
  • M. Pop, A. Phillippy, A.L. Delcher and S.L. Salzberg. Comparative genome assembly. Briefings in Bioinformatics. 5(3), pp. 237-248, 2004.
  • M. Pop. Shotgun sequence assembly. Advances in Computers vol. 60, M. Zelkowitz ed. June 2004.
  • M. Pop, D. Kosack. Using the TIGR Assembler in shotgun-sequencing projects. in Bacterial Artificial Chromosomes vol. 1, S. Zhao and M. Stodolsky eds. Humana Press, pp. 279-294, March 2004.
  • M. Pop, D.S. Kosack, S.L. Salzberg. Hierarchical scaffolding with Bambus. Genome Research 14(1), pp. 149-159, 2004.
  • P. Gajer, M. Schatz, S.L. Salzberg. Automated correction of genome sequence errors. Nucleic Acids Research 32(2), pp. 562-569, 2004.
  • M. Pop, S. L. Salzberg, M. Shumway. Genome Sequence Assembly: Algorithms and Issues. IEEE Computer 35(7) 2002, pp. 47-54. Copyright 2002 IEEE. Reproduced with permission from IEEE.