Brugia Malayia

Art Delcher
January 2005


Estimating genome length


There are two factors that make it difficult to estimate the length of the Brugia genome.

1. There are too many singleton reads.

Of the 176,099 singletons, 94,147 have no overlaps. Of those 50,924 (54.1%) have GC-content >= 40%. Of the 1,056,906 non-singleton reads, 50,088 (4.7%) have GC-content >= 40%. So the singletons without overlaps have a distinctly different GC composition than the rest of the genome.

A plot of the GC content of reads in contigs, degenerates, singletons, and singletons without overlaps is in reads.gc.png (attached). The bump near 20% GC for singletons and degenerates also looks suspicious.

I tried BLASTing a few of the high-GC singletons and found 31 matches to E.coli, and lots of matches to Ascaris (a nematode) rRNA which may be similar to the Brugia rRNA, but I don't think this accounts for anywhere near all the singletons. For several of the reads the best BLAST matches were to mouse sequence.

There are also a significant number of small (<=3 reads) degenerate contigs with GC-content >= 40%.

I think there is either some kind of contaminant in the data, or else the genome has distinct, high-GC regions that are severely underrepresented in our shotgun sequencing. Perhaps someone doing annotation could look into what some of these sequences are.

2. Reads are not uniformly distributed in contigs.

If reads were sampled uniformly at random from the genome, one would expect the coverage within contigs to peak around the average coverage value. I.e., if the reads cover the genome at 9x, then the depth of coverage in the contigs should peak around 9x. The Brugia coverage varies widely with large regions at very deep (>= 20x) coverage, and the peak coverage of large contigs occurs between 4x and 5x--well below the average coverage value. File brg-tvg.coverage.png (attached) is a plot of the coverage of large (>=5kb) contigs for Brugia, and also for Trichomonas as a comparison. The average coverage of Trichomonas is between 6x and 7x.

Another way to see this is in kmer frequencies. I counted the number of occurrences of 22-mers in all the reads, and plotted the percentage of distinct 22-mers that occurred once, twice, 3 times, etc. One expects a spike for unique kmers caused by sequencing error (any kmer containing a sequencing error is likely to be unique), and then a peak near the average coverage depth. File brg-tvg.kmerfreq.png (attached) is a plot of the kmer frequency distribution for BRG and TVG. TVG has the peak as expected; BRG has no peak.

Despite these problems, I tried to estimate the genome length anyway.

1. From the assembly itself.

There are 70.7Mb in scaffolds and the total span is 77.5Mb. So there are 6.8Mb in gaps. There are 17.5Mb in degenerates. Using the degenerates to fill the gaps and then adding what's left over would give a genome size of 88.2Mb. Adding the singletons would make it way bigger--the total length of singletons is 108Mb. I think the best estimate from the assembly is between 80 and 90Mb.

2. From coverage of unique regions in the assembly.

Using frequent kmers to mask out repetitive regions of the assembly we can compute the coverage of the remaining regions and extrapolate this to estimate the genome length. Using 22-mers occurring >=40 times to indicate repeats, the average coverage in remaining regions of contigs and degenerates is 8.1x. If we divide the total length of all reads by this coverage the implied genome length is 109Mb. If we discount the singleton reads, the implied genome length drops to 95Mb.

3. From the distribution of start positions of read overlaps.

Again masking out repetitive regions with frequent kmers (this time 22-mers occurring >=30 times), I use a Poisson distribution to model the number of overlaps that occur in a given position of a read. The fit to the model isn't very good, but the resulting genome length is 88Mb.

Overall best guess: somewhere around 90Mb, but could be much larger if all those singletons really belong.



How final assembly was done


In the initial run of the Celera Assembler (CA), we observed a substantial number of large unitigs with coverage depth that was approximately double that of the majority of large unitigs. CA treats high-coverage unitigs as potential repeats and allows "surrogate" copies of them to be placed in multiple places in the assembly. In our assembly the large, double-depth unitigs typically occurred in exactly two places, on the ends of contigs, with mate pairs indicating that the two occurrences should instead be a single occurrence with respect to the neighbouring contigs.

To force these large unitigs to have a single occurrence in the assembly we identified all surrogate unitigs containing at least 50 reads, with at most two occurrences in the assembly, and all these occurrences on the ends of contigs. The coverage value of these unitigs was artificially reset to a value to force them to have a unique occurrence and the contigging/scaffolding stages of CA rerun.

Technical Notes:

Let S be the set of surrogate unitigs with >= 50 reads and <= 2 occurrences in the original assembly. Working in from both ends of each contig, choose unitigs that are in S, stopping at the first unitig not in S. Reset the a-stat coverage statistic of the unitigs to 6.01 and rerun the assembler starting at cgw.