We describe an application of de Bruijn graphs - a theoretical framework
underlying several modern genome assembly programs - to analyze the
global repeat structure of prokaryotic genomes. We provide the first
survey of the repeat structure of a large number of genomes, and make
publicly available the resulting simplified de Bruijn graphs. These
data provide an upper bound on the performance of genome assemblers for
de novo reconstruction of genomes across a wide range of read lengths
and/or insert sizes, thereby providing a benchmark for new software tools
developed in the context of next-generation sequencing data. Further,
we demonstrate that the majority of genes in prokaryotic genomes can
be reconstructed uniquely even if the genomes themselves cannot. The
non-reconstructible genes are overwhelmingly related to mobile elements
(transposons, IS elements, and prophages), indicating that it is at least
theoretically possible to reconstruct the protein-coding genes of an
organism using very short reads.
Here are the compressed and simplified de Bruijn graphs constructed using
the techniques described in the paper. The files are written in Graphviz
".dot" format. Each node label specifies a comma-separated list of offsets in
the genome, followed by a colon and the node length. For example, the graph below
has 3 nodes, with lengths 1425, 1346478, and 4305032. The sequence for node 1
occurs twice in the genome starting at offsets 2309674 and 36555581.
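
The label format described above can be read with a few lines of code. The
following is a minimal sketch, not part of the released data: it assumes a
label is a plain string of the form "offset,offset,...:length" with no other
fields, and the names NodeLabel and parse_node_label are hypothetical.

```python
from typing import List, NamedTuple


class NodeLabel(NamedTuple):
    offsets: List[int]  # genome offsets at which the node's sequence starts
    length: int         # node length


def parse_node_label(label: str) -> NodeLabel:
    """Parse 'offset,offset,...:length' (assumed layout) into its parts."""
    # Split on the last colon: offsets on the left, node length on the right.
    offsets_part, sep, length_part = label.strip().rpartition(":")
    if not sep:
        raise ValueError(f"no ':' found in label: {label!r}")
    offsets = [int(tok) for tok in offsets_part.split(",") if tok.strip()]
    return NodeLabel(offsets, int(length_part))


# Node 1 from the example in the text: two occurrences, length 1425.
node = parse_node_label("2309674,36555581:1425")
print(node.offsets, node.length)
```

The number of offsets in a label gives the number of times that node's
sequence occurs in the genome, so len(node.offsets) is 2 for node 1.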