Drosophila pseudoobscura
Art Delcher
Mihai Pop
Steven Salzberg
July 2003
This improved assembly of Drosophila pseudoobscura was initially created in
July 2003, using sequence data from the Baylor Human Genome Sequencing
Center that we downloaded from the NCBI Trace Archive.
In July 2004, we discovered two errors in the data that allowed us to
improve the assembly significantly:
1. The trace data contained 34,317 ESTs that were erroneously
listed as whole-genome shotgun data. These ESTs in some cases spanned
introns, meaning that they would result in mis-assemblies. We did not
fix our large contigs, but we identified and removed approximately
4000 very small contigs and scaffolds (< 2kb) that contained ESTs.
2. The trace data contained 12,439 BAC ends that were erroneously
listed as belonging to short (2kb) insert libraries. Of these, 7,364
had mates (meaning that there were 3,682 pairs). This BAC end data
allowed us to build dramatically larger scaffolds.
The original assembly was created by the Celera Assembler. This new
assembly contains the contigs created by that assembler, but
scaffolding was done by BAMBUS (M. Pop, D. Kosack, and S.L. Salzberg.
Genome Research 14(2004), 149-159) using the original clone-mate
information plus the newly-discovered BAC-end information. Thus
we call this a "CABA" assembly.
The new assembly is much improved, with an N50 scaffold size more than
double that of our original assembly. The statistics of the assembly, as well
as comparisons to the Baylor freeze assembly from August of 2003, can
be found in the ".statistics" file in this directory. The files are:
The first file contains all the scaffolds, while the second contains
the contigs in those scaffolds. The "assembly.statistics" file
contains a variety of quality control statistics, such as contig and
scaffold N50 sizes, total sizes, largest contigs, number of contigs
and scaffolds, etc. Many scaffolds contain just one contig, while
many others contain multiple contigs. For those multi-contig
scaffolds, we have inserted 100 N's between the contigs to indicate
the positions of the gaps. The "chaff.contigs" file contains all the
contigs that appear to be repetitive, low quality, or otherwise did
not fit into scaffolds. We include these for completeness, and for
those who want to mine every scrap of data from this assembly.
Finally, the "oo" file contains the order and orientation of all
contigs within each scaffold. Each scaffold ID is followed by a list
of its contigs with the letters "BE" or "EB" indicating forward and
reverse orientation respectively.
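As an illustration of how the order/orientation information and the
100-N gaps fit together, here is a minimal sketch that rebuilds a
scaffold sequence from its contigs. It is not the script used to
produce these files. The function names (revcomp, gap, join_scaffold)
and the argument layout are assumptions made for this example, as is
the use of reverse complement for "EB" contigs; the exact layout of
the "oo" file is not reproduced here.

```shell
#!/bin/sh
# Reverse-complement a DNA string (A<->T, C<->G; other characters such as N are kept).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Print a run of N characters of the requested length (no trailing newline).
gap() {
    awk -v n="$1" 'BEGIN { for (i = 0; i < n; i++) printf "N" }'
}

# join_scaffold GAPLEN SEQ ORIENT [SEQ ORIENT ...]
# ORIENT is "BE" (forward) or "EB" (reverse), as in the "oo" file.
# Contigs are concatenated in the given order with GAPLEN N's between them.
join_scaffold() {
    gaplen=$1
    shift
    out=''
    first=1
    while [ "$#" -ge 2 ]; do
        s=$1
        orient=$2
        shift 2
        if [ "$orient" = "EB" ]; then
            s=$(revcomp "$s")
        fi
        if [ "$first" -eq 1 ]; then
            first=0
        else
            out="$out$(gap "$gaplen")"
        fi
        out="$out$s"
    done
    printf '%s\n' "$out"
}
```

For example, join_scaffold 100 "$contigA" BE "$contigB" EB would print
contig A, then 100 N's, then the reverse complement of contig B.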
Assembly Procedure
First I removed all the short contigs and scaffolds that had ESTs on them from
the assembly dp-j20 in the dir Newj-20. I got these by searching 34,000+ ESTs
against all contigs, finding those that hit, and then removing contigs if:
1. the contig was shorter than 2kb
2. the scaffold containing it was a single-contig scaffold.
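The removal rule above can be sketched as a small awk filter. This is
an illustration only, not the script that was actually run. It assumes
that both conditions must hold for a contig to be removed, that 2kb
means 2000 bases, and that the inputs are two hypothetical files: a
list of contig IDs hit by at least one EST, and a table with the
columns contig_id, length, scaffold_id, contigs_in_scaffold.

```shell
#!/bin/sh
# select_removable HITS TABLE
# HITS:  one contig ID per line (contigs matched by at least one EST)
# TABLE: whitespace-separated columns:
#        contig_id  length  scaffold_id  contigs_in_scaffold
# Prints the IDs of EST-hit contigs that are shorter than 2000 bases
# and sit alone in their scaffold.
select_removable() {
    awk -v max_len=2000 '
        NR == FNR { hit[$1] = 1; next }            # first file: remember EST-hit contigs
        ($1 in hit) && $2 < max_len && $4 == 1 {   # second file: apply both conditions
            print $1
        }
    ' "$1" "$2"
}
```

Usage would look like: select_removable est_hits.txt contig_table.txt > contigs_to_remove.txt
(both input file names are placeholders).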
Then I made the files in this directory containing about 6000 scaffolds rather
than the 10,000 in the original assembly, plus the associated "placed.fasta"
file containing all contigs.
Then I pulled all the (newly discovered - July 2004) BAC end sequences from
the Trace Archive, library/project ID OBBGP. These were in the original
trace entries, so I'm told, but were not labeled as BAC ends. So they had
different IDs and we cannot find them in the Celera Assembler records.
Instead, I ran NUCMER to find out where they hit, like this:
nucmer -b 20 -c 200 -l 50 -p ContigsVsBacEnds NoESTAsm.placed.fasta ../Inputs/bac_ends/allbacends.fasta
Due to memory constraints I had to run on an alpha.
Then we needed to find the unambiguous tilings of reads to contigs.
Creating mates:
sed 's/>//' ../Inputs/bac_ends/headers-with-mates |
sed 's/name:[^ ]*//' | sed 's/mate_name:.*//' |
sed 's/mate:/gnl|ti|/' | sed 's/ / /' > BacEnds.mates
Used vi to add the line:
library bac 40000 200000
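To make the effect of the sed pipeline above easier to follow, here is
a sketch that wraps the same substitutions in a function and applies
them to one header. The header layout in the usage note
(">gnl|ti|ID name:... mate:ID mate_name:...") is a hypothetical
example, not a verified Trace Archive format, and the final cleanup
step (squeezing repeated spaces and dropping a trailing one) is an
assumption about what the last sed in the pipeline was meant to do.

```shell
#!/bin/sh
# header_to_mates: read FASTA header lines on stdin, write one
# "read mate" pair per line on stdout.
# The first four substitutions are the ones from the pipeline above;
# the last sed is an assumed cleanup step (squeeze spaces, trim the end).
header_to_mates() {
    sed 's/>//' |
    sed 's/name:[^ ]*//' |
    sed 's/mate_name:.*//' |
    sed 's/mate:/gnl|ti|/' |
    sed 's/  */ /g; s/ $//'
}
```

With the hypothetical header ">gnl|ti|111 name:AAA mate:222 mate_name:BBB"
on stdin, the function prints "gnl|ti|111 gnl|ti|222".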
Generating linking data from prior CA assembly:
/local/asmg/scratch/mihai/WGA/bin/ca2scaff -i ../Newj-20/dp-j20.asm -o dp-j20
Running bambus:
goBambus -c ContigsVsBacEnds.ctg -m BacEnds.mates -x dp-j20.links -o try1