Benchmark Data for Genome Assembly

A set of complete genomes along with all the underlying sequences, quality values, and ancillary data

As part of our efforts to develop a new open source genome assembler, we are collecting a set of benchmark data to use in testing and comparing our assembler to others.  In the interests of promoting progress in assembly development more widely, we are making these benchmark sets freely available through this site.  Although genome sequences are frequently published in final form, the raw data underlying these genomes is almost never available.  This data may prove useful not only for testing assemblers, but also for searching for polymorphisms and for answering other scientific questions.

These genomes are all available with no restrictions.  Others are free to use them for testing assemblers, for making biological discoveries, or for any other purpose, including redistribution. 

For each genome, the data collected here includes:

  1. the set of all sequence "reads" as generated by the automated sequencers.  These sequencing machines were primarily ABI 3700s, 377s, and 3730s.
  2. the quality values for all sequences
  3. the range of the high-quality sequence after trimming out vector and low-quality bases
  4. clone-mate information for sequences generated by paired-end sequencing
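
As a rough illustration of how these four data types fit together, the sketch below models a single read in Python.  The class name, the field names (clear_start, clear_end, mate_id), and the 0-based half-open clear-range convention are assumptions made for this example; they do not describe the actual layout of the downloadable files.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Read:
    """One sequencing read plus its ancillary data (hypothetical layout)."""
    read_id: str
    bases: str                     # raw base calls from the sequencer
    quals: List[int]               # one quality value per base
    clear_start: int               # start of high-quality range (0-based, inclusive)
    clear_end: int                 # end of high-quality range (exclusive)
    mate_id: Optional[str] = None  # paired-end mate, if the read has one

    def trimmed(self) -> str:
        """Return only the bases inside the high-quality (clear) range."""
        return self.bases[self.clear_start:self.clear_end]

    def trimmed_quals(self) -> List[int]:
        """Return the quality values for the same clear range."""
        return self.quals[self.clear_start:self.clear_end]


# A made-up read: two low-quality bases at each end, eight good bases inside.
read = Read(
    read_id="r1",
    bases="NNACGTACGTNN",
    quals=[2, 3, 30, 35, 40, 40, 38, 37, 36, 30, 4, 2],
    clear_start=2,
    clear_end=10,
    mate_id="r2",
)
print(read.trimmed())
```

Keeping the untrimmed bases and quality values alongside the clear range, rather than storing only the trimmed sequence, lets an assembler apply its own trimming policy if it prefers.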

To further assist those running assembler tests, the data for each genome is divided into four separate files, each containing all the data types listed above.  The files are first divided into the reads generated during the random sequencing ("shotgun") phase and those generated during the closure phase, which includes closing gaps and finishing regions with low coverage.  Each of these two files is further split in two by pulling out sequences that match the final assembly at 80% identity or more.  The remaining sequences should be a combination of contaminants and low-quality sequences, although others are welcome to dig through them.
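
The four-way split described above can be sketched as follows.  This is only an illustration of the partitioning logic: the bucket names, the (read id, phase, identity) tuple layout, and the use of None for a read with no alignment are all hypothetical and are not taken from the actual files.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

IDENTITY_CUTOFF = 80.0  # percent identity to the final assembly, as stated in the text


def bucket_name(phase: str, identity: Optional[float]) -> str:
    """Name the bucket for a read; identity is None when no alignment was found."""
    matched = identity is not None and identity >= IDENTITY_CUTOFF
    return f"{phase}.{'match' if matched else 'nomatch'}"


def partition(
    reads: Iterable[Tuple[str, str, Optional[float]]]
) -> Dict[str, List[str]]:
    """Group read ids by sequencing phase and by whether they match the assembly."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for read_id, phase, identity in reads:
        buckets[bucket_name(phase, identity)].append(read_id)
    return dict(buckets)


# Made-up reads: (read id, phase, best percent identity to the assembly).
example = [
    ("r1", "shotgun", 99.2),
    ("r2", "shotgun", None),   # no alignment at all
    ("r3", "closure", 80.0),   # exactly at the cutoff counts as a match
    ("r4", "closure", 61.5),
]
groups = partition(example)
for name in sorted(groups):
    print(name, groups[name])
```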

To map the individual reads to the genome, we used NUCmer, which is part of the MUMmer system, release 3 (Kurtz S et al., Genome Biology (2004), 5:R12.).  This allows us to map the reads to the genome extremely quickly, but on occasion a read that is just over 80% identical might be missed and put into the set of non-matching reads.
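
For readers who want to re-check borderline reads, the sketch below shows one common way to compute percent identity from a pairwise alignment: matching columns divided by total alignment columns.  This definition and the function name are assumptions for illustration; it is not necessarily the exact figure that NUCmer reports.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns where both sequences carry the same base.

    Both inputs are rows of one pairwise alignment, with '-' marking gaps,
    so they must have the same length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Made-up alignment: 10 columns, one mismatch and one gap, so 8 matches.
read_row = "ACGTACGTAC"
genome_row = "ACGTTCGT-C"
print(percent_identity(read_row, genome_row))
```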

The currently available genomes, and the benchmark data, can be downloaded by clicking on the links below.  We expect this set to grow over time; check this site for updates.  For more detailed information about any of these genomes, visit the CMR at http://www.tigr.org/CMR.

Brucella suis.  Funding source: DARPA and NIH. Brucella suis is a bacterial pathogen and potential bioterrorism agent that could be targeted against humans or livestock. Publication: I.T. Paulsen et al. (2002), "The Brucella suis genome reveals fundamental similarities between animal and plant pathogens and symbionts." Proc Natl Acad Sci U S A 99(20): 13148-53.  Its genome consists of two circular chromosomes with lengths 2,107,792 and 1,207,381 bp, with an overall GC content of 57.2%.
Click here to download the complete set of sequences and ancillary data.

Wolbachia sp. Funding source: NIH/NIAID. Wolbachia are endosymbiotic bacteria that live inside many invertebrate species; this genome is of a Wolbachia endosymbiont of Drosophila melanogaster.   Its genome is a single circular molecule of 1,267,782 bp, with overall GC content of 35.1%.  An interesting problem for assembly is the high percentage of contamination from Drosophila in the sequencing library, a consequence of the fact that the organism had to be grown inside flies to prepare its DNA for sequencing.  Those sequences are included in the data here.
Note: Thanks to Jonathan Eisen and Scott O'Neill for permission to post this data prior to publication of the genome.
Click here to download the complete set of sequences and ancillary data.

Shewanella oneidensis.   Funding source: Dept. of Energy (OBER). This is a metal ion-reducing bacterium that has great potential as a bioremediation agent to remove toxic metals from the environment.   Publication: J.F. Heidelberg et al., "Genome sequence of the dissimilatory metal ion-reducing bacterium Shewanella oneidensis." Nature Biotechnology 20:11 (2002), 1118-1123.  Its genome consists of a circular chromosome of 4,969,803 bp and a circular plasmid of 161,613 bp, with an overall GC content of 45.9%.
Click here to download the complete set of sequences and ancillary data.

Staphylococcus aureus COL.   Funding source: NIH/NIAID.  This bacterium is a major cause of food poisoning and of hospital-acquired infections. This genome has a GC-content of 32.8% and consists of a circular chromosome of 2,809,421 bp and a circular plasmid of 4,440 bp. Thanks to Steven Gill for permission to post this data prior to publication of the genome.
Click here to download the complete set of sequences and ancillary data.

Staphylococcus epidermidis RP62A.   Funding source: NIH/NIAID. This bacterium is a major cause of skin infections, in particular hospital-acquired infections.  This genome has a GC-content of 32.1% and consists of a 2,616,530 bp circular chromosome and a 27,310 bp circular plasmid. Thanks to Steven Gill for permission to post this data prior to publication of the genome.
Click here to download the complete set of sequences and ancillary data.