Benchmark Data for Genome Assembly
A set of complete genomes along with all the underlying sequences,
quality values, and ancillary data
As part of our efforts to develop a new open source genome
assembler, we are collecting
a set of benchmark data to use in testing and comparing our assembler
to others.
In the interests of promoting progress in assembly development
more
widely, we are making these benchmark sets freely available through
this site.
Although genome sequences are frequently published in final form,
the
raw data underlying these genomes is almost never available. This
data
may prove useful not only for testing assemblers, but also for
searching for
polymorphisms and for answering other scientific questions.
These genomes are all available with no restrictions. Others are
free
to use them for testing assemblers, for making biological discoveries,
or
for any other purpose, including redistribution.
For each genome, the data collected here includes:
- the set of all sequence "reads" as generated by the automated
sequencers. These sequencing machines were primarily ABI 3700s, 377s,
and 3730s.
- the quality values for all sequences
- the range of the high-quality sequence after trimming out vector
and
low-quality bases
- clone-mate information for sequences generated by paired-end
sequencing
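As an illustration of how the four data types above fit together, here is a minimal Python sketch of a per-read record. The class name, field names, and the 0-based half-open convention for the high-quality range are assumptions made for this example; they do not describe the actual layout of the downloadable files.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Read:
    """One sequencing read (hypothetical in-memory layout, not the file format)."""
    name: str
    bases: str                  # raw base calls from the sequencer
    quals: List[int]            # one quality value per base
    clear_start: int            # start of high-quality range (0-based, inclusive; assumed)
    clear_end: int              # end of high-quality range (0-based, exclusive; assumed)
    mate: Optional[str] = None  # name of the clone mate, if paired-end

    def trimmed(self) -> "Read":
        """Return a copy restricted to the high-quality range."""
        if len(self.bases) != len(self.quals):
            raise ValueError("bases and quality values differ in length")
        if not 0 <= self.clear_start <= self.clear_end <= len(self.bases):
            raise ValueError("high-quality range lies outside the read")
        s, e = self.clear_start, self.clear_end
        return Read(self.name, self.bases[s:e], self.quals[s:e], 0, e - s, self.mate)
```

An assembler test harness could call `trimmed()` on every read before assembly, so that vector and low-quality bases never reach the overlap stage.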
To further assist those running assembler tests, the data for
each genome
is divided into four separate files, each containing all the data types
listed
above. The files are first divided into the reads generated
during the random sequencing ("shotgun") phase and the closure phase,
which
includes closing gaps and finishing regions with low coverage.
Each
of these two files is further split in two by pulling out sequences
that match
the final assembly at 80% identity or more. The remaining
sequences
should be a combination of contaminants and low-quality sequences,
although
others are welcome to dig through them.
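The four-way split described above can be sketched as follows. This is only an illustration under stated assumptions: the phase labels, the bin names, and the input tuple layout are invented for the example, and a read with no alignment to the final assembly is represented by `None`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

IDENTITY_CUTOFF = 80.0  # percent identity threshold stated in the text


def bin_name(phase: str, identity: Optional[float]) -> str:
    """Pick one of four bins from the sequencing phase and the mapping result."""
    if phase not in ("shotgun", "closure"):
        raise ValueError(f"unknown phase: {phase!r}")
    # A read with no alignment at all (None) counts as non-matching.
    matched = identity is not None and identity >= IDENTITY_CUTOFF
    return f"{phase}.{'matching' if matched else 'nonmatching'}"


def split_reads(
    reads: Iterable[Tuple[str, str, Optional[float]]]
) -> Dict[str, List[str]]:
    """Group (name, phase, identity) tuples into the four bins."""
    bins: Dict[str, List[str]] = defaultdict(list)
    for name, phase, identity in reads:
        bins[bin_name(phase, identity)].append(name)
    return dict(bins)
```

For example, `split_reads([("a", "shotgun", 99.1), ("b", "closure", None)])` places read `a` in `shotgun.matching` and read `b` in `closure.nonmatching`.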
To map the individual reads to the genome, we used NUCmer,
which is part
of the MUMmer
system, release 3 (Kurtz S et al., Genome Biology (2004), 5:R12.). This
allows us to map the reads to the genome extremely quickly, but on
occasion a read that is just over 80% identical might be missed and
put into the set of non-matching reads.
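To make the 80% threshold concrete, the sketch below computes percent identity for a single gapped pairwise alignment as matching columns divided by total alignment columns. This definition and the function name are assumptions chosen for illustration; NUCmer's own identity calculation may differ in detail.

```python
def percent_identity(aligned_read: str, aligned_ref: str) -> float:
    """Percent identity of two aligned strings of equal length ('-' marks a gap)."""
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned strings must have the same length")
    if not aligned_read:
        raise ValueError("alignment is empty")
    # Count columns where both strings carry the same base; gap columns never match.
    matches = sum(
        1
        for a, b in zip(aligned_read.upper(), aligned_ref.upper())
        if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_read)
```

Under this definition, a read aligned as `ACG-T` against `ACGAT` has four matching columns out of five and sits exactly at the 80% cutoff, which is the kind of borderline case a fast mapper might miss.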
The currently available genomes, and the benchmark data, can
be downloaded
by clicking on the links below. We expect this set to grow over
time;
check this site for updates. For more detailed information about
any of these genomes, visit the CMR at http://www.tigr.org/CMR.
Brucella suis. Funding source: DARPA and NIH. Brucella suis is
a bacterial pathogen
and potential bioterrorism agent that could be targeted against humans
or
livestock. Publication:
I.T. Paulsen et al. (2002), "The Brucella suis genome reveals
fundamental similarities between animal and plant pathogens and
symbionts."
Proc Natl Acad Sci U S A 99(20): 13148-53. Its genome
consists
of two circular chromosomes with lengths 2,107,792 and 1,207,381 bp,
with
an overall GC content of 57.2%.
Click
here to download the complete set of sequences and ancillary data.
Wolbachia sp. Funding source: NIH/NIAID. Wolbachia are
endosymbiotic bacteria
that live inside many invertebrate species; this genome is of a
Wolbachia
endosymbiont of Drosophila melanogaster. Its genome
is a single circular molecule of 1,267,782 bp, with overall
GC content of 35.1%. An interesting problem for assembly is the
high
percentage of contamination from Drosophila in the sequencing library,
a
consequence of the fact that the organism had to be grown inside flies
to
prepare its DNA for sequencing. Those sequences are included in
the
data here.
Note: Thanks to Jonathan Eisen and Scott O'Neill for permission to post
this data prior
to publication of the genome.
Click
here to download the complete set of sequences and ancillary data.
Shewanella oneidensis. Funding source: Dept. of Energy (OBER).
This is a metal ion-reducing
bacterium that has great potential as a bioremediation agent to remove
toxic
metals from the environment. Publication:
J.F. Heidelberg et al., "Genome sequence
of the dissimilatory metal ion-reducing bacterium Shewanella
oneidensis."
Nature Biotechnology 20:11 (2002), 1118-1123. Its genome
consists
of a circular chromosome of 4,969,803 bp and a circular plasmid of
161,613
bp, with an overall GC content of 45.9%.
Click
here to download the complete set of sequences and ancillary data.
Staphylococcus aureus COL. Funding source: NIH/NIAID. This bacterium
is a major cause of food poisoning and of hospital-acquired infections.
This genome has a GC content of 32.8% and consists of a circular
chromosome of 2,809,421 bp and a circular plasmid of 4,440 bp. Thanks
to Steven Gill for permission to post this data prior to publication
of the genome.
Click
here to download the complete set of sequences and ancillary data.
Staphylococcus epidermidis RP62A. Funding source: NIH/NIAID. This
bacterium is a major cause of skin infections, in particular
hospital-acquired infections. This genome has a GC content of 32.1%
and consists of a 2,616,530 bp circular chromosome and a 27,310 bp
circular plasmid. Thanks to Steven Gill for permission to post this
data prior to publication of the genome.
Click
here to download the complete set of sequences and ancillary data.