Brugia Malayia
Art Delcher
January 2005
Estimating genome length
There are two factors that make it difficult to
estimate the length of the Brugia genome.
1. There are too many singleton reads.
Of the 176,099 singletons, 94,147 have no overlaps.
Of those 50,924 (54.1%) have GC-content >= 40%.
Of the 1,056,906 non-singleton reads, 50,088 (4.7%)
have GC-content >= 40%.
So the singletons without overlaps have a distinctly
different GC composition than the rest of the genome.
A plot of the GC content of reads in contigs, degenerates,
singletons, and singletons without overlaps is in
reads.gc.png (attached). The bump near 20% GC for
singletons and degenerates also looks suspicious.
I tried BLASTing a few of the high-GC singletons and found
31 matches to E.coli, and lots of matches to Ascaris (a
nematode) rRNA which may be similar to the Brugia rRNA, but
I don't think this accounts for anywhere near all the
singletons. For several of the reads the best BLAST matches
were to mouse sequence.
There are also a significant number of small (<=3 reads)
degenerate contigs with GC-content >= 40%.
I think there is either some kind of contaminant in the data,
or else the genome has distinct, high-GC regions that are
severely underrepresented in our shotgun sequencing.
Perhaps someone doing annotation could look into what
some of these sequences are.
2. Reads are not uniformly distributed in contigs.
If reads were sampled uniformly at random from the genome,
one would expect the coverage within contigs to peak
around the average coverage value. I.e., if the reads
cover the genome at 9x, then the depth of coverage in the
contigs should peak around 9x. The Brugia coverage
varies widely with large regions at very deep (>= 20x)
coverage, and the peak coverage of large contigs occurs
between 4x and 5x--well below the average coverage value.
File brg-tvg.coverage.png (attached) is a plot of the
coverage of large (>=5kb) contigs for Brugia, and also
for Trichomonas as a comparison. The average coverage of
Trichomonas is between 6x and 7x.
Another way to see this is in kmer frequencies. I counted
the number of occurrences of 22-mers in all the reads, and
plotted the percentage of distinct 22-mers that occurred once,
twice, 3 times, etc. One expects a spike for unique kmers
caused by sequencing error (any kmer containing a sequencing
error is likely to be unique), and then a peak near the
average coverage depth. File brg-tvg.kmerfreq.png (attached)
is a plot of the kmer frequency distribution for BRG and TVG.
TVG has the peak as expected; BRG has no peak.
Despite these problems, I tried to estimate the genome length
anyway.
1. From the assembly itself.
There are 70.7Mb in scaffolds and the total span is 77.5Mb. So
there are 6.8Mb in gaps. There are 17.5Mb in degenerates.
Using the degenerates to fill the gaps and then adding what's
left over would give a genome size of 88.2Mb. Adding the
singletons would make it way bigger--the total length of
singletons is 108Mb. I think the best estimate
from the assembly is between 80 and 90Mb.
2. From coverage of unique regions in the assembly.
Using frequent kmers to mask out repetitive regions of the
assembly we can compute the coverage of the remaining regions
and extrapolate this to estimate the genome length. Using
22-mers occurring >=40 times to indicate repeats, the
average coverage in remaining regions of contigs and degenerates
is 8.1x. If we divide the total length of all reads by
this coverage the implied genome length is 109Mb. If
we discount the singleton reads, the implied genome length
drops to 95Mb.
3. From the distribution of start positions of read overlaps.
Again masking out repetitive regions with frequent kmers
(this time 22-mers occurring >=30 times), I use a Poisson
distribution to model the number of overlaps that occur in a
given position of a read. The fit to the model isn't very
good, but the resulting genome length is 88Mb.
Overall best guess: somewhere around 90Mb, but could be much
larger if all those singletons really belong.
How final assembly was done
In the initial run of the Celera Assembler (CA), we observed a
substantial number of large unitigs with coverage depth that was
approximately double that of the majority of large unitigs. CA
treats high-coverage unitigs as potential repeats and allows
"surrogate" copies of them to be placed in multiple places in
the assembly. In our assembly the large, double-depth unitigs
typically occurred in exactly two places, on the ends of
contigs, with mate pairs indicating that the two occurrences
should instead be a single occurrence with respect to the
neighbouring contigs.
To force these large unitigs to have a single occurrence in the
assembly we identified all surrogate unitigs containing at least
50 reads, with at most two occurrences in the assembly, and all
these occurrences on the ends of contigs. The coverage value of these
unitigs was artificially reset to a value to force them to have
a unique occurrence and the contigging/scaffolding stages of CA
rerun.
Technical Notes:
Let S be the set of surrogate unitigs with >= 50 reads and <= 2
occurrences in the original assembly. Working in from
both ends of each contig, choose unitigs that are in S,
stopping at the first unitig not in S. Reset the
a-stat coverage statistic of the unitigs to 6.01 and
rerun the assembler starting at cgw.
|