Drosophila virilis
Art Delcher
Mihai Pop
Michael Schatz
Steven Salzberg
January 2005
This assembly was primarily carried out by a team at TIGR using the
Celera Assembler, with significant participation by groups from
the Venter Institute (VI) and the University of Maryland (UMD). This
file contains a high-level recipe describing what we did.
1. Trim all reads for quality using Lucy, throwing away "shorts" (less
than 64bp).
2. Trim all reads for vector using NUCmer (part of the MUMmer
package), throwing away vector-only reads.
3. Retrim all reads to remove linker sequences, using Mihai Pop's 8mer counting
program. Remove 8mers that are highly over-represented on the 5' end of
sequences.
4. Trim the output of (3) further using the UMD overlapper and retrimming
routines. For this step, these routines were used only to trim, not to extend reads.
5. Assemble using the Celera Assembler with "bubble smoothing" turned
on. This is not a default option because it sometimes crashes.
6. Picking up the assembly from step 5 at a checkpoint following
standard contig and scaffold construction using the cgw module,
we made two additional passes over the assembly. The first
(using the extendClearRanges module) attempted to close intra-scaffold gaps by extending
fragment clear ranges and allowing lower-quality alignments; 1,159 gaps
were closed in this fashion. The second (using resolveSurrogates) attempted
to uniquely place individual reads from surrogate unitigs, based on mate pairs;
27,735 fragments were placed. (Surrogate unitigs are repetitive contigs,
assumed to appear in more than one place in the final assembly. Although they
are used to construct the final consensus sequence, the reads contained within
them are initially not mapped to the consensus because they would appear in
more than one place.)
7. Run Mike Schatz's AutoJoiner to close gaps. This closed about 5.5% of the
intra-scaffold gaps that still remained after the steps above.
8. Recruit additional degenerate contigs to the assembly. Use NUCmer to compare
the 43,409 degenerate scaffolds (not normally included in the final assembly)
to the "real" contigs. From the set of degenerates that didn't match the
assembly up to this point, we identified 23 that were greater than 2000bp in
length and added them to the assembly. The largest was 13kb.
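The linker-removal idea in step 3 can be illustrated with a short Python sketch. This is not Mihai Pop's program, whose internals are not described here; it is a minimal illustration that assumes reads are plain strings, and the function names, the 5' window size, and the over-representation cutoff are all hypothetical choices made for the example.

```python
def overrepresented_5prime_kmers(reads, k=8, window=20, min_fraction=0.5):
    """Return k-mers seen in the first `window` bases of at least
    `min_fraction` of the reads (hypothetical cutoff, for illustration)."""
    hits = {}
    for read in reads:
        prefix = read[:window]
        # Use a set so each k-mer is counted at most once per read.
        seen = {prefix[i:i + k] for i in range(len(prefix) - k + 1)}
        for kmer in seen:
            hits[kmer] = hits.get(kmer, 0) + 1
    cutoff = min_fraction * len(reads)
    return {kmer for kmer, n in hits.items() if n >= cutoff}


def trim_5prime(read, bad_kmers, k=8, window=20):
    """Trim a read just past the last flagged k-mer found in its 5' window."""
    cut = 0
    for i in range(min(window, len(read)) - k + 1):
        if read[i:i + k] in bad_kmers:
            cut = i + k
    return read[cut:]


if __name__ == "__main__":
    # Toy reads that all begin with the same made-up 8 bp "linker".
    reads = [
        "GATCGGAA" + "ACGTACGTAC",
        "GATCGGAA" + "CCATTGACGT",
        "GATCGGAA" + "TTGACCAGTA",
    ]
    bad = overrepresented_5prime_kmers(reads, k=8, window=12)
    print(bad)
    print([trim_5prime(r, bad, k=8, window=12) for r in reads])
```

Counting each 8mer once per read keeps a single repetitive read from inflating the counts. A real run would need a cutoff tuned against the background 8mer frequency rather than a fixed fraction.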
Some overall statistics:
[Scaffolds]
TotalScaffolds 1186
TotalContigsInScaffolds 7939
TotalBasesInScaffolds 165 Mbp (approx)
MaxScaffoldBases 19890461
MaxScaffoldSpan 20387473
N50ScaffoldBases 8064686
TotalSpanOfScaffolds 179596666
[Top5Scaffolds=contigs,size,span]
388,19890461,20387473
346,16659257,17114172
189,13489559,13698929
500,11400460,11728403
158,9613644,9738782
[Contigs]
MaxContigSize 472205
N50ContigBases 69847
[BigContigs_greater_10000]
TotalBigContigs 2957
BigContigLength 149044481
[Top5Contigs=reads,bases]
7685,472205
6738,437639
6428,436539
6530,394065
5896,379600
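For readers unfamiliar with the N50 figures above, the following Python sketch shows one common definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total bases. It assumes N50 is taken relative to the summed lengths supplied; the assembler's own report may use a different denominator (for example, an estimated genome size), so this sketch is not guaranteed to reproduce the numbers listed here. The function name `n50` is illustrative.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths.

    Assumes the common definition based on the sum of the given lengths.
    """
    if not lengths:
        raise ValueError("n50() requires at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy example: total is 20, and 6 + 5 = 11 covers at least half.
    print(n50([2, 3, 4, 5, 6]))
```

Applied to the full list of scaffold or contig lengths from an assembly, the same function gives a figure comparable to N50ScaffoldBases or N50ContigBases, subject to the caveat about the denominator.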