Xanthomonas oryzae
Michael Schatz
Art Delcher
April 2005
Art ran an assembly of the data with a 1.5% error rate for the unitigger using
his tip of CA. This created two large scaffolds. He investigated this and
found there were mates off the ends of the scaffolds into a degenerate contig:
Left Scaffold <=> Degen <=> Right Scaffold
He found this by investigating the Contig link messages in the CA asm file,
but the exact technique is unknown to me. He then manually edited the A-stat
on the degenerate contig separating the two large scaffolds so that it would be
placed in the scaffold, and reran cgw. This created a single large scaffold
(as expected) with roughly 70 gaps.
Around this same time, I performed an assembly with Arachne. It performed
better than a default CA assembly but not as well as Art's. We sent this to
Broad, and they sent back a better assembly, but not as good as Art's.
I also began to investigate and correct a number of likely collapsed repeats.
These regions were identified by running '/local/asmg/Linux/bin/cavalidate
prefix' (part of the AMOS distribution), where prefix is the prefix to the
frg and asm files. This script is a wrapper for Mihai's asmQC mate happiness
validator, and my own scripts for finding correlated SNPs. The assembly is
converted into AMOS format for asmQC, and the features are written into the
bank as well. The features are then clustered into a
"prefix.suspicious.regions" file with contig range and aspects that are
suspicious.
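
The clustering step can be illustrated with a minimal sketch. This is not the
actual cavalidate code or its file format: the tuple layout, the feature kind
names, and the max_gap threshold are all hypothetical, chosen only to show how
nearby mate and SNP features on one contig might be merged into one region.

```python
from collections import defaultdict

def cluster_features(features, max_gap=100):
    """Merge nearby features on each contig into suspicious regions.

    features: iterable of (contig_id, start, end, kind) tuples (hypothetical layout).
    max_gap:  hypothetical threshold; features closer than this are merged.
    Returns a list of (contig_id, start, end, sorted_kinds).
    """
    by_contig = defaultdict(list)
    for contig, start, end, kind in features:
        by_contig[contig].append((start, end, kind))

    regions = []
    for contig in sorted(by_contig):
        current = None  # [start, end, set_of_kinds] for the open region
        for start, end, kind in sorted(by_contig[contig]):
            if current is not None and start <= current[1] + max_gap:
                # Close enough to the open region: extend it.
                current[1] = max(current[1], end)
                current[2].add(kind)
            else:
                # Too far away: close the open region and start a new one.
                if current is not None:
                    regions.append((contig, current[0], current[1], sorted(current[2])))
                current = [start, end, {kind}]
        if current is not None:
            regions.append((contig, current[0], current[1], sorted(current[2])))
    return regions
```

A region that collects more than one kind of evidence (for example both
unhappy mates and correlated SNPs) would be a stronger candidate for a
collapsed repeat than a region with only one.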
I found that I could correct the collapsed regions by performing a local
assembly of the reads and mates in the regions using run_CA with default
parameters. The strategy going in was to perform the local assembly, and
then stitch the local assembly back into the original contigs, if it was
successful. At some point, it occurred to me that it would be significantly
easier to replace the entire contig with a correct version, leading me to
experiment with adjusting the error rate setting on the unitigger. I
found that by performing a local assembly with the reads and mates in the
contig in question with a strict error rate, a corrected contig was created,
so there would be no need to stitch it back together.
The error rate determines how aggressively reads should be joined together
into a single unitig. As a rule of thumb bigger unitigs are better, but a
larger unitig has a higher potential for being an overcollapsed repeat. The
trick is to find an error rate setting which separates the repeat copies into
separate unitigs while not going so low that ordinary sequencing errors break
the contigs prematurely. I eventually found that a 0.3% error rate
(unitigger -e 0.003) was the balance for this genome. The error rate is
stored as a discrete value in 0.1% increments from 0.0 to 9.9. The standard
run_CA uses a 6% error rate; prior to this genome, no one had ever
gone lower than 1.5%. Art usually (always?) uses 1.5%.
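
As a small worked example of the arithmetic above, the sketch below snaps a
percent error rate to the 0.1% grid and converts it to the fraction passed on
the command line (0.3% becomes 0.003, as in the unitigger call above). The
function name is made up for illustration, and the integer "tenths" value is
only an assumption about how a discrete 0.0 to 9.9 setting could be
represented; it is not taken from the CA source.

```python
def error_rate_to_setting(percent):
    """Snap a percent error rate to the 0.1% grid (0.0 to 9.9).

    Returns (tenths, fraction): tenths is the number of 0.1% steps
    (a hypothetical integer representation), and fraction is the value
    written on the command line, e.g. 0.003 for 0.3%.
    """
    tenths = round(percent * 10)
    if not 0 <= tenths <= 99:
        raise ValueError("error rate must be between 0.0% and 9.9%")
    return tenths, tenths / 1000

# The three settings mentioned in these notes.
for pct in (0.3, 1.5, 6.0):
    tenths, fraction = error_rate_to_setting(pct)
    print(f"{pct}% -> step {tenths}, -e {fraction}")
```

Because the setting is discrete, values between grid points (say 0.34%) end up
at the nearest step, so there is nothing to gain from tuning more finely than
0.1%.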
The sweet spot error rate for a particular genome will vary by how repetitive
the genome is, how much difference there is between repeat copies, the depth
of coverage of the assembly, how well the trimming was performed, and how
"clean" the clear range is. There is an error correction module in CA before
unitigger that adjusts the error rate for a given overlap that is sensitive
to the depth of coverage, so it is thought that this genome at 10x benefitted
from this error correction. Art conjectured that its GC content was also
beneficial, but it is unknown what effect (if any) this really had.
From this work, I created a small number (5) of contigs that I was going to
replace in the assembly, which corrected collapsed repeats but otherwise
matched the original. This became very complicated as I tried to replace the
contigs: replacing the contigs was trivial, but I had to be sure to also fix
the surrogates, features, degenerates, and scaffold with the new contigs. I
got partially through this and decided as an experiment to run the assembly
globally with the strict error rate. My expectation was that a small number of
scaffolds (5-10) would form and the contigs would be more fragmented, and then
we would have to decide if it was easier to fix the new scaffold or push
through replacing contigs.
To my surprise, a single scaffold was formed with fewer gaps than ever, even
fewer than by using the AutoJoiner on the prior best. I aligned this to the
prior assembly and found that the new assembly resolved a number of
over-collapsed repeats, including some beyond those that I had fixed through
local assembly. The mates and correlated SNPs still indicate a few suspicious
spots, but the assembly is significantly better than any prior Xanthomonas
assembly.
Attempting to improve the assembly even further, Art and I inspected the
scaffold, including the reads and unitigs that could be placed inside the
gaps (CA file prefix.gapreads). We found some of the larger gaps did in
fact have reads that could be placed inside, but those reads were "trapped"
in degenerate contigs. We then "blasted" all of the reads in degenerate or
surrogate unitigs into singleton unitigs by creating the appropriate messages
for cgw. We then rescaffolded using the original placed unitigs and the
singleton unitigs. This was effective at reducing the mean size of the gaps
from 148bp to 81bp, and the bases in scaffold increased by 5kb, but it did
split one contig into two pieces (no gaps were closed in this process).
I then ran the CA backend with AutoJoiner on both the strict assembly and
the blasted assembly and found that it did best on the blasted assembly. I
ran AutoJoiner with a somewhat stricter criterion for joining contigs
together because of the high GC nature, but it still closed 15% of the gaps
with a mean gap size of 20bp.
The experiment with blasting apart the degenerate and surrogate unitigs is
interesting, but had little effect as it was done. I now have some scripts for
performing local assemblies based on the regions cavalidate detects, but we
need better tools for updating assemblies before that can really be used
effectively. Running Arachne on this dataset was useful experience, but in the
end the CA assembly was better, even after Manfred spent some time on it
(granted, not a lot).