Drosophila pseudoobscura

Art Delcher
Mihai Pop
Steven Salzberg
July 2003

This improved assembly of Drosophila pseudoobscura was initially created in July 2003, using sequence data from the Baylor Human Genome Sequencing Center that we downloaded from the NCBI Trace Archive.

In July 2004, we discovered two errors in the data; correcting for them allowed us to improve the assembly significantly:

1. The trace data contained 34,317 ESTs that were erroneously listed as whole-genome shotgun data. In some cases these ESTs spanned introns, meaning that they would cause mis-assemblies. We did not fix our large contigs, but we identified and removed approximately 4,000 very small contigs and scaffolds (< 2 kb) that contained ESTs.

2. The trace data contained 12,439 BAC ends that were erroneously listed as belonging to short (2 kb) insert libraries. Of these, 7,364 had mates (meaning that there were 3,682 pairs). These BAC-end data allowed us to build dramatically larger scaffolds.

The original assembly was created by the Celera Assembler. This new assembly contains the contigs created by that assembler, but scaffolding was done by BAMBUS (M. Pop, D. Kosack, and S.L. Salzberg. Genome Research 14(2004), 149-159) using the original clone-mate information plus the newly-discovered BAC-end information. Thus we call this a "CABA" assembly.

The new assembly is much improved, with an N50 scaffold size more than double that of our original assembly. The statistics of the assembly, as well as comparisons to the Baylor freeze assembly from August 2003, can be found in the ".statistics" file in this directory. The files are:

The first file contains all the scaffolds, while the second contains the contigs in those scaffolds.

The "assembly.statistics" file contains a variety of quality control statistics, such as contig and scaffold N50 sizes, total sizes, largest contigs, number of contigs and scaffolds, etc.

Many scaffolds contain just one contig, while many others contain multiple contigs. For those multi-contig scaffolds, we have inserted 100 N's between the contigs to indicate the positions of the gaps.

The "chaff.contigs" file contains all the contigs that appear to be repetitive, low quality, or otherwise did not fit into scaffolds. We include these for completeness, and for those who want to mine every scrap of data from this assembly.

Finally, the "oo" file contains the order and orientation of all contigs within each scaffold. Each scaffold ID is followed by a list of its contigs with the letters "BE" or "EB" indicating forward and reverse orientation respectively.
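As a rough illustration of the conventions described above, the following Python sketch joins oriented contigs into a scaffold sequence with 100 N's at each gap, and computes an N50 value from a list of lengths. It is not the code used to produce these files. It assumes that "EB" means the contig sequence should be reverse-complemented, and it uses the usual definition of N50 (the length L such that sequences of length L or longer cover at least half of the total). The function names and the in-memory input format are made up for the example; the on-disk layout of the "oo" file is not parsed here.

```python
GAP = "N" * 100  # 100 N's between adjacent contigs, as stated in this README

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_scaffold(contigs):
    """Join contigs into one scaffold sequence.

    contigs: list of (sequence, orientation) in scaffold order, where
    orientation is "BE" (forward) or "EB" (reverse).
    Assumption: "EB" contigs are reverse-complemented before joining.
    """
    parts = []
    for seq, orientation in contigs:
        if orientation == "BE":
            parts.append(seq)
        elif orientation == "EB":
            parts.append(reverse_complement(seq))
        else:
            raise ValueError("unknown orientation: %r" % orientation)
    return GAP.join(parts)


def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

A single-contig scaffold comes back unchanged from build_scaffold, since there is no gap to fill. Whether the N's are counted in the scaffold sizes reported in the statistics file is not stated here, so the sketch leaves that choice to the caller.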



Assembly Procedure

First I removed all the short contigs and scaffolds that had ESTs on them from the assembly dp-j20 in the directory Newj-20. I got these by searching the 34,000+ ESTs against all contigs, finding the contigs that were hit, and then removing a contig if:
1. the contig was shorter than 2kb
2. the scaffold containing it was a single-contig scaffold.
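The removal rule above can be sketched in Python as follows. This is only an illustration, not the script that was actually run. It assumes that both conditions must hold for a contig to be removed (the list above does not say explicitly whether they are combined with "and" or "or"), and the function name, argument names and input dictionaries are hypothetical.

```python
from collections import Counter

MIN_CONTIG_LEN = 2000  # the 2 kb cutoff from the list above


def contigs_to_remove(contig_lengths, contig_scaffold, est_hit_contigs):
    """Return the set of EST-hit contigs that the rule would remove.

    contig_lengths:  dict of contig ID -> length in bases
    contig_scaffold: dict of contig ID -> ID of the scaffold containing it
    est_hit_contigs: iterable of contig IDs that at least one EST hit
    Assumption: a contig is removed only if it is BOTH shorter than 2 kb
    AND the only contig in its scaffold.
    """
    contigs_per_scaffold = Counter(contig_scaffold.values())
    removed = set()
    for contig in est_hit_contigs:
        is_short = contig_lengths[contig] < MIN_CONTIG_LEN
        is_singleton = contigs_per_scaffold[contig_scaffold[contig]] == 1
        if is_short and is_singleton:
            removed.add(contig)
    return removed
```

Under this reading, a short EST-hit contig that sits inside a multi-contig scaffold is kept, which matches the statement earlier in this file that the large contigs and scaffolds were not fixed.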

Then I made the files in this directory containing about 6000 scaffolds rather than the 10,000 in the original assembly, plus the associated "placed.fasta" file containing all contigs.

Then I pulled all the (newly discovered - July 2004) BAC end sequences from the Trace Archive, library/project ID OBBGP. These were in the original trace entries, so I'm told, but were not labeled as BAC ends. As a result they had different IDs, and we could not find them in the Celera Assembler records. Instead, I ran NUCMER to find out where they hit, like this:

 nucmer -b 20 -c 200 -l 50 -p ContigsVsBacEnds NoESTAsm.placed.fasta ../Inputs/bac_ends/allbacends.fasta

Due to memory constraints, I had to run this on an Alpha machine. We then needed to find the unambiguous tilings of reads to contigs.
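The command used for the tiling step is not recorded in these notes. As a rough illustration of what "unambiguous" could mean, the Python sketch below keeps only those reads that align to exactly one place. This is an assumption about the idea, not a description of the tool that was run; a real tiling step would likely also look at alignment coverage and identity. The function name and the tuple layout of the input are hypothetical.

```python
from collections import defaultdict


def unambiguous_placements(alignments):
    """Keep reads that have exactly one alignment.

    alignments: iterable of (read_id, contig_id, start, end) tuples
    Returns a dict of read_id -> (contig_id, start, end) for reads that
    align to a single location; reads with several hits are dropped.
    """
    hits_by_read = defaultdict(list)
    for read_id, contig_id, start, end in alignments:
        hits_by_read[read_id].append((contig_id, start, end))
    return {
        read_id: hits[0]
        for read_id, hits in hits_by_read.items()
        if len(hits) == 1
    }
```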

Creating mates:

sed 's/>//' ../Inputs/bac_ends/headers-with-mates | 
sed 's/name:[^ ]*//' | sed 's/mate_name:.*//' | 
sed 's/mate:/gnl|ti|/' | sed 's/ /       /' > BacEnds.mates

Used vi to add the line:

library	bac	40000	200000
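For readers who find the sed pipeline hard to follow, here is a Python sketch of what it appears to do: strip the leading ">", drop the name: and mate_name: fields, turn the mate: field into a second "gnl|ti|" ID, and separate the two IDs with what looks like a tab. The library line is copied from above. The header layout used in the example (">gnl|ti|100 name:A1 mate:200 mate_name:A2") is a guess based on the field names in the sed expressions, the function names are hypothetical, and putting the library line first is also an assumption, since these notes do not say where in the file it was added.

```python
LIBRARY_LINE = "library\tbac\t40000\t200000"  # copied from this README


def header_to_pair(header):
    """Turn one FASTA header into a (read_id, mate_id) pair, or None.

    Assumed header layout (hypothetical):
        >gnl|ti|100 name:A1 mate:200 mate_name:A2
    """
    tokens = header.lstrip(">").split()
    read_id = tokens[0]
    # Collect the key:value fields that follow the read ID.
    fields = dict(tok.split(":", 1) for tok in tokens[1:] if ":" in tok)
    mate = fields.get("mate")
    if mate is None:
        return None
    return read_id, "gnl|ti|" + mate


def mates_file_lines(headers):
    """Return the lines of a mates file: library line, then one pair per line."""
    lines = [LIBRARY_LINE]
    for header in headers:
        pair = header_to_pair(header)
        if pair is not None:
            lines.append("\t".join(pair))
    return lines
```

Like the sed pipeline, this sketch does not remove duplicates, so a pair whose two headers are both present in the input is listed twice, once in each direction.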

Generating linking data from prior CA assembly:

  /local/asmg/scratch/mihai/WGA/bin/ca2scaff -i ../Newj-20/dp-j20.asm -o dp-j20

Running bambus:

  goBambus -c ContigsVsBacEnds.ctg -m BacEnds.mates -x dp-j20.links -o try1