Drosophila virilis

Art Delcher
Mihai Pop
Michael Schatz
Steven Salzberg
January 2005

This assembly was primarily carried out by a team at TIGR using the Celera Assembler, with significant participation by groups from the Venter Institute (VI) and the University of Maryland (UMD). This file contains a high-level recipe describing what we did.

1. Trim all reads for quality using Lucy, throwing away "shorts" (less than 64bp).

2. Trim all reads for vector using NUCmer (part of the MUMmer package), throwing away vector-only reads.

3. Retrim all reads to remove linker sequences, using Mihai Pop's 8mer counting program. Remove 8mers that are highly over-represented on the 5' end of sequences.

4. Trim the output of (3) further using the UMD overlapper and retrimming routines. For this step, these routines were used only to trim, not to extend reads.

5. Assemble using the Celera Assembler with "bubble smoothing" turned on. This is not a default option because it sometimes crashes.

6. Picking up the assembly from step 5 at a checkpoint following standard contig and scaffold construction using the cgw module, we made two additional passes over the assembly. The first (using the extendClearRanges module) attempted to close intra-scaffold gaps by extending fragment clear ranges and allowing lower-quality alignments; 1,159 gaps were closed in this fashion. The second (using resolveSurrogates) attempted to uniquely place individual reads from surrogate unitigs, based on mate pairs; 27,735 fragments were placed. (Surrogate unitigs are repetitive contigs that are assumed to appear in more than one place in the final assembly. Although they are used to construct the final consensus sequence, the reads they contain are initially not mapped to the consensus, because each read would map to more than one location.)

7. Run Mike Schatz's AutoJoiner to close gaps. This closed about 5.5% of the intra-scaffold gaps that still remained after the steps above.

8. Recruit additional degenerate contigs to the assembly. Use NUCmer to compare the 43,409 degenerate scaffolds (not normally included in the final assembly) to the "real" contigs. From the set of degenerates that didn't match the assembly up to this point, we identified 23 that were greater than 2000bp in length and added them to the assembly. The largest was 13kb.
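The linker retrimming in step 3 can be illustrated with a short sketch. This is not Mihai Pop's 8mer counting program; it is a minimal illustration of the idea, assuming that "highly over-represented" means an 8mer occurs in the 5' window of at least a given fraction of reads. The function names, the 30-base window, and the 0.5 threshold are all hypothetical choices made for this example.

```python
from collections import Counter


def count_prefix_kmers(reads, k=8, window=30):
    """Count, per k-mer, how many reads contain it in their first `window` bases."""
    counts = Counter()
    for seq in reads:
        prefix = seq[:window].upper()
        # Use a set so each k-mer is counted at most once per read.
        seen = {prefix[i:i + k] for i in range(len(prefix) - k + 1)}
        counts.update(seen)
    return counts


def overrepresented(counts, n_reads, min_fraction=0.5):
    """Return k-mers seen in at least `min_fraction` of reads (hypothetical cutoff)."""
    if n_reads == 0:
        return set()
    return {kmer for kmer, n in counts.items() if n / n_reads >= min_fraction}


def trim_five_prime(seq, bad_kmers, k=8, window=30):
    """Clip the read just past the last flagged k-mer found in its 5' window."""
    upper = seq.upper()
    limit = min(window, len(seq))
    cut = 0
    for i in range(limit - k + 1):
        if upper[i:i + k] in bad_kmers:
            cut = i + k
    return seq[cut:]


if __name__ == "__main__":
    # Toy reads (made up): a shared 8-base "linker" followed by distinct inserts.
    reads = ["GGCCAATT" + base * 16 for base in "ACGT"]
    bad = overrepresented(count_prefix_kmers(reads), len(reads))
    for read in reads:
        print(trim_five_prime(read, bad))
```

A per-read presence count is used here instead of a raw occurrence count so that a single low-complexity read cannot inflate a k-mer's score; a real trimmer would likely also compare against the k-mer's frequency elsewhere in the reads.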

Some overall statistics:

[Scaffolds]
TotalScaffolds 1186
TotalContigsInScaffolds 7939
TotalBasesInScaffolds 165 Mbp (approx)
MaxScaffoldBases 19890461
MaxScaffoldSpan 20387473
N50ScaffoldBases 8064686
TotalSpanOfScaffolds 179596666

[Top5Scaffolds=contigs,size,span]
388,19890461,20387473
346,16659257,17114172
189,13489559,13698929
500,11400460,11728403
158,9613644,9738782
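
The difference between a scaffold's span and its size gives a rough picture of how much of it is still gap. The sketch below works through that arithmetic for the five rows above. It assumes that "span" is the contig bases plus the estimated gap sizes, and that a scaffold with n contigs has n - 1 intra-scaffold gaps; the helper name gap_summary is hypothetical.

```python
# Rows copied from the [Top5Scaffolds=contigs,size,span] block.
TOP5 = [
    (388, 19890461, 20387473),
    (346, 16659257, 17114172),
    (189, 13489559, 13698929),
    (500, 11400460, 11728403),
    (158, 9613644, 9738782),
]


def gap_summary(contigs, size, span):
    """Return (estimated gap bases, number of gaps, mean estimated gap size).

    Assumes span = size + sum of estimated gap sizes, so the difference
    is a net estimate (individual gap estimates may be negative).
    """
    gap_bases = span - size
    n_gaps = contigs - 1
    mean_gap = gap_bases / n_gaps if n_gaps > 0 else 0.0
    return gap_bases, n_gaps, mean_gap


if __name__ == "__main__":
    for contigs, size, span in TOP5:
        gap_bases, n_gaps, mean_gap = gap_summary(contigs, size, span)
        print(f"{contigs} contigs: {gap_bases} gap bases in {n_gaps} gaps "
              f"(mean {mean_gap:.0f} bp, {100 * gap_bases / span:.2f}% of span)")
```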

[Contigs]
MaxContigSize 472205
N50ContigBases 69847

[BigContigs_greater_10000]
TotalBigContigs 2957
BigContigLength 149044481

[Top5Contigs=reads,bases]
7685,472205
6738,437639
6428,436539
6530,394065
5896,379600
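
For readers unfamiliar with the N50 values reported above (N50ScaffoldBases and N50ContigBases), the sketch below shows the usual definition: the length L such that sequences of length L or longer contain at least half of the total bases. Whether the assembler measured "half" against the total assembled bases or against an estimated genome size is not stated in this file, so the sketch assumes total assembled bases. The lengths in the example are made up, not taken from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Walk the lengths from longest to shortest and return the length at
    which the running total first reaches half of the overall total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("N50 is undefined for an empty list of lengths")


if __name__ == "__main__":
    # Toy contig lengths (hypothetical), total 54 bases.
    toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print("N50 of toy lengths:", n50(toy_lengths))
```

Comparing 2 * running against the total, rather than running against total / 2, keeps the arithmetic in integers and avoids rounding surprises.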