Running Celera Assembler
Celera Assembler was originally developed at Celera Genomics and is now an Open
Source project at
SourceForge. Celera Assembler was developed specifically to
enable the assembly of the human genome through the Whole Genome
Shotgun (WGS) method. Until scientists at Celera completed the
WGS sequencing and assembly of the fruit fly genome in 2000, most large
eukaryotic sequencing projects used the hierarchical BAC-by-BAC method
(see assembly primer). Gene Myers and his
team later completed the assembly of the human genome through the WGS
method, and reported the achievement in the journal Science in 2001.
For more information on Celera Assembler you can refer to the following
publications:
- Venter
JC, et al. The sequence of the
human genome. Science.
2001 Feb 16;291(5507):1304-51
- Myers
EW, et al. A
whole-genome assembly of Drosophila. Science.
2000 Mar 24;287(5461):2196-204.
Table of Contents
Prerequisites
Install Celera Assembler - obtainable from SourceForge as project wgs-assembler.
Install AMOS - obtainable from SourceForge as project AMOS.
Install Lucy
- obtainable from TIGR: ftp://ftp.tigr.org/pub/software/Lucy
Install the Perl scripts contained in the CAutils.tar.gz
file somewhere
in the executable path. Several of these scripts may need to be
modified as described throughout this document. For more
information contact Mihai Pop (mpop at umiacs umd edu)
The Perl module Statistics::Descriptive must be
installed for some of the scripts to correctly function.
Celera Assembler pipeline
Celera Assembler is modular system composed of multiple programs that
interact through well defined interfaces. The sequential order of
programs and their parameters can be tuned for the specific needs of
each assembly project. At the CBCB,
we define these pipelines in the format used by the runAmos
utility from the AMOS
package. This flexible format is easily readable and makes it
easy to quickly modify the pipeline for specific assembly needs.
The remainder of this document assumes you have already installed the
AMOS package.
Inputs
Conversion tool: tarchive2ca
Most communications between Celera Assembler programs is performed
through a specialized message-interchange format that contains
information about the main assembly objects being communicated.
Detailed documentation on this format is available within the software
package distributed through SourceForge (if you can run Celera
Assembler you also have the relevant documentation).
Additionally, you can easily parse the message format using the AMOS::AmosLib
Perl module. The inputs
to the assembler are provided in a "fragment" file, containing
information about the sequence and quality of the shotgun reads, as
well as library sizes and read pairing information. The AMOS
package includes a program (tarchive2ca)
that allows the creation of
Celera Assembler input from Trace Archive formatted
data.
tarchive2ca -o <out_prefix> [-c <clear_ranges>] [-l <libinfo>] fasta1 ... fastan
<out_prefix> - prefix for the output files <clear_ranges> - file containing clear ranges for the reads. If this file is provided, any sequence that does not appear in it is excluded from the output. <libinfo> - tab-delimited file of lib_id, mean, stdev fasta1 ... fastan - list of files to be converted. The program assumes that for each program called <file>.seq there is a <file>.qual and a <file>.xml.
Trace Archive download
From the Trace Archive ftp site you will need to download the following
files (assuming your genome name is "genome"):
fasta.genome.[nnn] qual.genome.[nnn] xml.genome.[nnn]
where [nnn] represents the partition number - each Trace Archive genome
is partitioned into files containing 500,000 sequences or less.
Simple run
Assuming a project with just 3 partitions (fasta.genome.001,
fasta.genome.002, fasta.genome.003), the simplest way to generate
Celera Assembler inputs is to run:
tarchive2ca -o genome fasta.genome.001 fasta.genome.002 fasta.genome.003
resulting in the creation of a single file: genome.frg. Note that
the qual.* and xml.* files must be present in the same directory as the
fasta.* files in order for tarchive2ca
to find them. Additionally, tarchive2ca
can unzip these files on the fly, allowing you to keep them compressed
and thus reduce disk usage.
Specifying trimming coordinates
Note, however, that running Celera Assembler directly on untrimmed
Trace Archive data is likely to result in an unsatisfactory
assembly. Celera Assembler is very sensitive to incorrect data
thus it is recommended you trim (clip) the shotgun reads for both
vector and quality. Tarchive2ca
accepts an optional parameter (-c) that indicates a file containing
trimming coordinates. This file contains one line per read
(identified by the name listed in the TRACE_NAME tag in the Trace
Archive XML file). The read identifier is followed by
space-delimited trimming coordinates. Note that any read not
present in this file will be excluded from the output. This
provides you with a simple way of excluding contaminant reads or reads
of poor overall quality. Assuming the trimming coordinates are
provided in the file genome.clr, the invocation of tarchive2ca is:
tarchive2ca -o genome -c genome.clr fasta.genome.001 fasta.genome.002 fasta.genome.003
Trimming the input data
Several assembly programs (such as phrap and Arachne)
perform their own trimming. If you regularly use such an assembly
program it is likely you do not have a specialized trimming pipeline
and therefore cannot generate the genome.clr file. To assist in
this task we provide the program runlucy
based on the trimming program Lucy.
Before using runlucy you must
edit the top few lines as follows:
The locations of the lucy
binary and of gzip are listed
here:
my $lucy = "/usr/local/bin/lucy"; my $gzip = "/usr/local/bin/gzip";
The vector sequences used in trimming are in the directory listed in
the $vecdir variable.
The path of this directory is relative to the directory you run lucy from (usually where the Trace
Archive files are stored):
my $vecdir = "Vectors";
The arrays @vecs and @splices contain the names of the
files containing the complete vector sequence and splice sites for the
vectors used in the project. Each vector file must be accompanied
by a splice site file containing the sequence within the vector that is
adjacent to the splice sites used in the project. In case your
project uses an adapter it should be included in the splice file.
my @vecs = ("pUC18Sma"); my @splices = ("pUC18Sma.splice");
The vector file must contain a single FASTA-formatted sequence
representing the entire sequencing vector. The splice file
contains 4 FASTA records corresponding to approximately 200 bp flanking
either side of the splice site, presented in both the forward and
reverse-complemented orientation.
Finally, you may modify the specific parameters provided to lucy by changing variable $lucyparms. For a description
of these parameters please refer to the original publication describing
Lucy: Hui-Hsien
Chou and Michael H. Holmes.Bioinformatics. 2001.
17(12):1093-1104.
my $lucyparms = "-error 0.052 0.10 -window 50 0.06 -bracket 10 0.10";
Once runlucy has been edited
as described above you can run it as follows:
runlucy fasta.genome.001 fasta.genome.002 fasta.genome.003
The output will be placed in the file OUT.CLR
which can then be provided to tarchive2ca:
tarchive2ca -o genome -c OUT.CLR fasta.genome.001 fasta.genome.002 fasta.genome.003
"Blind" vector trimming
In many cases the exact vector sequence is not known. This is the
case for many project where this information was not provided to the
Trace Archive or the UniVec
database either due to a simple omission or because a proprietary
vector was used. Even when the vector sequence is provided, it is
often the case that the adapters used in cloning are not reported,
leading to insufficient sequence trimming. It is possible,
however, to quickly identify the location where the vector sequence
ends and the usable sequence begins by identifying 8-mers that
are more highly represented within the first 100 bp of the untrimmed
reads than within the remainder of the reads. These 8-mers likely
correspond to vector sequence and should be trimmed.
Currently we are working on a more robust mechanism for identifying and
removing the vector sequence, however the simple procedure described
below generally produces satisfactory results, although it tends to
over-trim the sequences:
1. Identify all 8-mers within the sequence and report the number of
times it occurs at the beginning of a read as well as the number of
times it occurs at the end of a read:
kmers.pl fasta.genome.001 fasta.genome.002 fasta.genome.003 > kmers.all
2. Generate a list of 8-mers that occur more frequently at the
beginning than the end of a read (8-mers containing Ns are ignored):
awk '{if ($2 > $3) print $1}' kmers.all | grep -v 'N' > kmers.beg.list
3. Recompute the clear ranges from the genome.frg file:
trimfrg.pl genome.frg kmers.beg.list > genome.trim.frg
Running Celera Assembler
Once a .frg file is available you can run Celera Assembler by using one
of the following commands: runCA.prok,
runCA.euk, runCA.meta; depending on whether you are doing a
prokaryotic, eukaryotic, or metagenomic project. Note that it
might be necessary to edit these files in order to adapt them to your
specific environment. Below is the user-modifiable header
included in all these files. The paths to the locations of Celera
Assembler, runAmos and the support scripts need to be set to values
relevant to your system. The remaining values have already been
set to values reasonable for the type of project you are trying to
assemble.
#!/fs/sz-user-supported/Linux-i686/bin/runAmos -C
# this is where Celera Assembler is installed CAPATH = /fs/sz-user-supported/Linux-i686/CA3.06
# this is where support scripts are installed BIN = /fs/sz-user-supported/common/bin
# this is how much memory your machine has (used by the overlapper) # allowable values: 256MB, 1GB, 2GB, 4GB, 8GB MEM = 2GB
# this is how many overlapper threads you can allow # set it to number of processors on your machine NPROC = 2
# do bubble smoothing (say yes if you expect haplotype differences)? # 1 = yes, 0 = no BUBBLE = 0
# unitigger error rate (0.002 for good quality reads, 0.015 default) ERATE = 0.002
# minimum A-stat considered unique DNA # default 1, -20 used for metagenomic data ASTAT = 1
# rocks aggresivity (number of rounds of rock recruitment) # 4 is usually used. 5 is more aggressive ROCKS = 4
Assuming the genome.trim.frg
file you created with tarchive2ca
corresponds to a bacterial genome, you can run Celera Assembler with
the simple command:
runCA.prok genome.trim
or
runCA.prok genome.trim.frg
A summary of the execution will be reported on the screen.
Also, a detailed account of all commands run will be written into genome.trim.runAmos.log or genome.trim.frg.runAmos.log
depending on the specific invocation.
Outputs
Besides the log file, the execution should result in the creation of a
multitude of files. Among these files, genome.trim.qc and genome.trim.asm are the most
important. The first, contains a collection of statistics
computed from the output of the assembler. The various values
reported are described here.
Clean up
Celera Assembler tends to produce copious amounts of output. If
you don't plan to iteratively change parameters and restart the
assembly (this might be necessary for complex assemblies) you can
remove most of the files by running runCA.clean
with the same exact parameter used to run runCA.prok. Note that the .runAmos.log file will be
overwritten, thus you may want to copy it over to a different location
before running this command:
runCA.clean genome.trim.frg
Extract contig information
The program parsecasm.pl can
be used to convert the contig information from the genome.trim.asm file
into more usable formats. By simply calling:
parsecasm.pl genome.trim.asm
it generates the following files (note that genome.trim.frg must be present in
the same directory as genome.trim.asm
for this program to work):
genome.trim.placed.fasta
- multi-fasta file of all placed contigs, i.e. contigs that Celera
Assembler placed into scaffolds
genome.trim.place.fasta.qual
- phred quality values for the placed contigs
genome.trim.placed.contig
- GDE .align -like file for the placed contigs. This file can be
provided as input to Bambus
genome.trim.degenerate.fasta
- multi-fasta file of all degenerate
contigs, i.e. contigs that Celera Assembler could not place into
scaffolds
genome.trim.degenerate.fasta.qual
- phred quality values for the degenerate contigs
genome.trim.degenerate.contig
- GDE .align -like file for the degenerate contigs. This file can
be provided as input to Bambus
genome.trim.surrogate.fasta
- multi-fasta file of all surrogate
contigs, i.e. multiple alignments of reads that belong to regions
identified by CeleraAssembler to be repeats.
genome.trim.surrogate.fasta.qual
- phred quality values for the surrogate contigs
genome.trim.surrogate.contig
- GDE .align -like file for the surrogate contigs. This file can
be provided as input to Bambus
genome.trim.libs
- file describing Celera Assembler's re-estimation of library
sizes. If significantly different from the original input to
Celera Assembler you may want to re-run the assembly with the
re-estimated values. The library sizes are presented as (mean, stdev).
Extract scaffold information
The programs ca2scaff (from
the AMOS package) and scaff2fasta.pl can be used to
extract information about the scaffolds produced by Celera
Assembler. Ca2scaff
converts Celera Assembler output to a format similar to the output of Bambus:
ca2scaff -i genome.trim.asm -o genome.trim
produces:
genome.trim.sum
- summary of the scaffolds: scaffold_id
size span
genome.trim.oo
- order/orientation file. Formatted like a FASTA file, one
scaffold/record, each line within the record represents a contig in the
forward orientation (BE) or reverse orientation (EB).
genome.trim.gaps
- similar to the .oo file
however it only lists the estimated gap size between adjacent
contigs. Single-contig scaffolds are represented as empty FASTA
records.
genome.trim.links
- Bambus XML
input file. Can be used to bootstrap Bambus with the
scaffolds produced by Celera Assembler.
Scaff2fasta.pl can be used to
convert Celera Assembler output to a multi-FASTA file:
scaff2fasta.pl genome.trim.asm
produces the file genome.trim.scaffolds.fasta,
where each scaffold is represented by a FASTA record consisting of a
concatenation of the sequences of the contigs in the proper
orientation, separated by 100 N characters.
Option -gaps to scaff2fasta.pl yields a spacing
proportional to the actual gap sizes (a minimum of 100 Ns is added in
each gap that is shorter than 100 bp). Option -linker allows the user to provide
a sequence that will be used as a spacer between adjacent contigs.
Extract singleton information
Singletons are reads that have not been used to build the
assembly. Contaminant sequences, insufficiently trimmed reads, as
well as reads belonging to ubiquitous repeats are generally found as
singletons. To identify the singleton reads within the output of
Celera Assembler you can use the ca2ingletons.pl
program:
ca2singletons.pl -i genome.trim.asm -f genome.trim.frg -o genome.trim.singletons.fasta
In this simplest invocation, the output is a multi-FASTA file of the
entire sequence (untrimmed) of the singleton reads. Using
option -clear, the output
will be just the good quality portion of the singleton reads.
Option -list will simply
return a list of singleton IDs.
Convert to .ACE format
You can convert the output of Celera Assembler to .ACE format (similar
to the output of phrap,
viewable in consed) using the
ca2ace program from the AMOS package. Note, that
while we have been able to view these files in consed, we do not
guarantee that the files are perfectly formatted for use by the
consed/autofinish pipeline. Also note that surrogate contigs are
represented in these files as "fake" reads. The reads
corresponding to surrogate contigs are not placed in the assembly by
Celera Assembler and therefore this measure was necessary to avoid the
presence of coverage gaps within the contigs.
The conversion can be done simply by calling:
ca2ace genome.trim.asm
leading to the creation of a file genome.trim.ace.
AMOS Assembly Viewer
Unlike Consed or other ACE based viewer, the AMOS Assembly Viewer works
natively
from AMOS datatypes, and can display all relevant assembly information
in a single interactive
tool- including contigs and inserts within scaffolds, the
multiple-alignment of reads within
contigs, and even the chromatograms of the individual reads. To view a
Celera Assembler
assembly with the AMOS Assembly Viewer, you first must convert to AMOS
format, but then you
can immediately view the assembly.
toAmos -f genome.trim.frg -a genome.trim.asm -o genome.trim.afg bank-transact -m genome.trim.afg -b genome.bnk -c bankViewer genome.bnk
More Information is available on the AMOS Assembly Viewer
website.
|