Running Celera Assembler

Celera Assembler was originally developed at Celera Genomics and is now an Open Source project at SourceForge.  Celera Assembler was developed specifically to enable the assembly of the human genome through the Whole Genome Shotgun (WGS) method.  Until scientists at Celera completed the WGS sequencing and assembly of the fruit fly genome in 2000, most large eukaryotic sequencing projects used the hierarchical BAC-by-BAC method (see assembly primer).  Gene Myers and his team later completed the assembly of the human genome through the WGS method, and reported the achievement in the journal Science in 2001.

For more information on Celera Assembler you can refer to the following publications:
  • Venter JC, et al. The sequence of the human genome. Science. 2001 Feb 16;291(5507):1304-51
  • Myers EW, et al. A whole-genome assembly of Drosophila. Science. 2000 Mar 24;287(5461):2196-204.


Table of Contents


Prerequisites

Install Celera Assembler - obtainable from SourceForge as project wgs-assembler.
Install AMOS - obtainable from SourceForge as project AMOS.
Install Lucy - obtainable from TIGR: ftp://ftp.tigr.org/pub/software/Lucy
Install the Perl scripts contained in the CAutils.tar.gz file somewhere in the executable path.  Several of these scripts may need to be modified as described throughout this document.   For more information contact Mihai Pop (mpop at umiacs umd edu)
The Perl module Statistics::Descriptive must be installed for some of the scripts to correctly function.


Celera Assembler pipeline

Celera Assembler is modular system composed of multiple programs that interact through well defined interfaces.  The sequential order of programs and their parameters can be tuned for the specific needs of each assembly project.  At the CBCB, we define these pipelines in the format used by the runAmos utility from the AMOS package.  This flexible format is easily readable and makes it easy to quickly modify the pipeline for specific assembly needs.  The remainder of this document assumes you have already installed the AMOS package.



Inputs


Conversion tool: tarchive2ca

Most communications between Celera Assembler programs is performed through a specialized message-interchange format that contains information about the main assembly objects being communicated.  Detailed documentation on this format is available within the software package distributed through SourceForge (if you can run Celera Assembler you also have the relevant documentation).  Additionally, you can easily parse the message format using the AMOS::AmosLib Perl module.  The inputs to the assembler are provided in a "fragment" file, containing information about the sequence and quality of the shotgun reads, as well as library sizes and read pairing information.  The AMOS package includes a program (tarchive2ca) that allows the creation of Celera Assembler input from Trace Archive formatted data.

   tarchive2ca -o <out_prefix> [-c <clear_ranges>] [-l <libinfo>] fasta1 ... fastan

<out_prefix> - prefix for the output files
<clear_ranges> - file containing clear ranges for the reads. If this file
is provided, any sequence that does not appear in it is excluded
from the output.
<libinfo> - tab-delimited file of lib_id, mean, stdev
fasta1 ... fastan - list of files to be converted.
The program assumes that for each program called <file>.seq there
is a <file>.qual and a <file>.xml.

Trace Archive download

From the Trace Archive ftp site you will need to download the following files (assuming your genome name is "genome"):
fasta.genome.[nnn]
qual.genome.[nnn]
xml.genome.[nnn]
where [nnn] represents the partition number - each Trace Archive genome is partitioned into files containing 500,000 sequences or less.


Simple run

Assuming a project with just 3 partitions (fasta.genome.001, fasta.genome.002, fasta.genome.003), the simplest way to generate Celera Assembler inputs is to run:
tarchive2ca -o genome fasta.genome.001 fasta.genome.002 fasta.genome.003
resulting in the creation of a single file: genome.frg.  Note that the qual.* and xml.* files must be present in the same directory as the fasta.* files in order for tarchive2ca to find them.  Additionally, tarchive2ca can unzip these files on the fly, allowing you to keep them compressed and thus reduce disk usage.


Specifying trimming coordinates

Note, however, that running Celera Assembler directly on untrimmed Trace Archive data is likely to result in an unsatisfactory assembly.  Celera Assembler is very sensitive to incorrect data thus it is recommended you trim (clip) the shotgun reads for both vector and quality.  Tarchive2ca accepts an optional parameter (-c) that indicates a file containing trimming coordinates.  This file contains one line per read (identified by the name listed in the TRACE_NAME tag in the Trace Archive XML file).  The read identifier is followed by space-delimited trimming coordinates.  Note that any read not present in this file will be excluded from the output.  This provides you with a simple way of excluding contaminant reads or reads of poor overall quality.  Assuming the trimming coordinates are provided in the file genome.clr, the invocation of tarchive2ca is:
tarchive2ca -o genome -c genome.clr fasta.genome.001 fasta.genome.002 fasta.genome.003

Trimming the input data

Several assembly programs (such as phrap and Arachne) perform their own trimming.  If you regularly use such an assembly program it is likely you do not have a specialized trimming pipeline and therefore cannot generate the genome.clr file.  To assist in this task we provide the program runlucy based on the trimming program Lucy.  Before using runlucy you must edit the top few lines as follows:

The locations of the lucy binary and of gzip are listed here:
my $lucy = "/usr/local/bin/lucy";
my $gzip = "/usr/local/bin/gzip";
The vector sequences used in trimming are in the directory listed in the $vecdir variable.  The path of this directory is relative to the directory you run lucy from (usually where the Trace Archive files are stored):
my $vecdir = "Vectors";
The arrays @vecs and @splices contain the names of the files containing the complete vector sequence and splice sites for the vectors used in the project.  Each vector file must be accompanied by a splice site file containing the sequence within the vector that is adjacent to the splice sites used in the project.  In case your project uses an adapter it should be included in the splice file. 
my @vecs = ("pUC18Sma");
my @splices = ("pUC18Sma.splice");
The vector file must contain a single FASTA-formatted sequence representing the entire sequencing vector.  The splice file contains 4 FASTA records corresponding to approximately 200 bp flanking either side of the splice site, presented in both the forward and reverse-complemented orientation. 
Finally, you may modify the specific parameters provided to lucy by changing variable $lucyparms.  For a description of these parameters please refer to the original publication describing Lucy: Hui-Hsien Chou and Michael H. Holmes.Bioinformatics. 2001. 17(12):1093-1104.
my $lucyparms = "-error 0.052 0.10 -window 50 0.06 -bracket 10 0.10";
Once runlucy has been edited as described above you can run it as follows:
runlucy fasta.genome.001 fasta.genome.002 fasta.genome.003
The output will be placed in the file OUT.CLR which can then be provided to tarchive2ca:
tarchive2ca -o genome -c OUT.CLR fasta.genome.001 fasta.genome.002 fasta.genome.003

"Blind" vector trimming

In many cases the exact vector sequence is not known.  This is the case for many project where this information was not provided to the Trace Archive or the UniVec database either due to a simple omission or because a proprietary vector was used.  Even when the vector sequence is provided, it is often the case that the adapters used in cloning are not reported, leading to insufficient sequence trimming.  It is possible,  however, to quickly identify the location where the vector sequence ends and the usable sequence begins by identifying 8-mers that are more highly represented within the first 100 bp of the untrimmed reads than within the remainder of the reads.  These 8-mers likely correspond to vector sequence and should be trimmed. 

Currently we are working on a more robust mechanism for identifying and removing the vector sequence, however the simple procedure described below generally produces satisfactory results, although it tends to over-trim the sequences:

1. Identify all 8-mers within the sequence and report the number of times it occurs at the beginning of a read as well as the number of times it occurs at the end of a read:
kmers.pl fasta.genome.001 fasta.genome.002 fasta.genome.003 > kmers.all

2. Generate a list of 8-mers that occur more frequently at the beginning than the end of a read (8-mers containing Ns are ignored):
awk '{if ($2 > $3) print $1}' kmers.all | grep -v 'N' > kmers.beg.list

3. Recompute the clear ranges from the genome.frg file:
trimfrg.pl genome.frg kmers.beg.list > genome.trim.frg

Running Celera Assembler


Once a .frg file is available you can run Celera Assembler by using one of the following commands: runCA.prok, runCA.euk, runCA.meta; depending on whether you are doing a prokaryotic, eukaryotic, or metagenomic project.  Note that it might be necessary to edit these files in order to adapt them to your specific environment.   Below is the user-modifiable header included in all these files.  The paths to the locations of Celera Assembler, runAmos and the support scripts need to be set to values relevant to your system.  The remaining values have already been set to values reasonable for the type of project you are trying to assemble.
#!/fs/sz-user-supported/Linux-i686/bin/runAmos -C

# this is where Celera Assembler is installed
CAPATH = /fs/sz-user-supported/Linux-i686/CA3.06

# this is where support scripts are installed
BIN = /fs/sz-user-supported/common/bin

# this is how much memory your machine has (used by the overlapper)
# allowable values: 256MB, 1GB, 2GB, 4GB, 8GB
MEM = 2GB

# this is how many overlapper threads you can allow
# set it to number of processors on your machine
NPROC = 2

# do bubble smoothing (say yes if you expect haplotype differences)?
# 1 = yes, 0 = no
BUBBLE = 0

# unitigger error rate (0.002 for good quality reads, 0.015 default)
ERATE = 0.002

# minimum A-stat considered unique DNA
# default 1, -20 used for metagenomic data
ASTAT = 1

# rocks aggresivity (number of rounds of rock recruitment)
# 4 is usually used. 5 is more aggressive
ROCKS = 4

Assuming the genome.trim.frg file you created with tarchive2ca corresponds to a bacterial genome, you can run Celera Assembler with the simple command:
runCA.prok genome.trim
or
runCA.prok genome.trim.frg
A summary of the execution will be reported on the screen.   Also, a detailed account of all commands run will be written into genome.trim.runAmos.log or genome.trim.frg.runAmos.log depending on the specific invocation. 


Outputs


Besides the log file, the execution should result in the creation of a multitude of files.  Among these files, genome.trim.qc and genome.trim.asm are the most important.  The first, contains a collection of statistics computed from the output of the assembler.  The various values reported are described here


Clean up

Celera Assembler tends to produce copious amounts of output.  If you don't plan to iteratively change parameters and restart the assembly (this might be necessary for complex assemblies) you can remove most of the files by running runCA.clean with the same exact parameter used to run runCA.prok.  Note that the .runAmos.log file will be overwritten, thus you may want to copy it over to a different location before running this command:
runCA.clean genome.trim.frg

Extract contig information

The program parsecasm.pl can be used to convert the contig information from the genome.trim.asm file into more usable formats.  By simply calling:
parsecasm.pl genome.trim.asm
it generates the following files (note that genome.trim.frg must be present in the same directory as genome.trim.asm for this program to work):

genome.trim.placed.fasta            - multi-fasta file of all placed contigs, i.e. contigs that Celera Assembler placed into scaffolds
genome.trim.place.fasta.qual      - phred quality values for the placed contigs
genome.trim.placed.contig          - GDE .align -like file for the placed contigs.  This file can be provided as input to Bambus
genome.trim.degenerate.fasta     - multi-fasta file of all degenerate contigs, i.e. contigs that Celera Assembler could not place into scaffolds
genome.trim.degenerate.fasta.qual - phred quality values for the degenerate contigs
genome.trim.degenerate.contig   - GDE .align -like file for the degenerate contigs.  This file can be provided as input to Bambus
genome.trim.surrogate.fasta        - multi-fasta file of all surrogate contigs, i.e. multiple alignments of reads that belong to regions identified by CeleraAssembler to be repeats.
genome.trim.surrogate.fasta.qual - phred quality values for the surrogate contigs
genome.trim.surrogate.contig     - GDE .align -like file for the surrogate contigs.  This file can be provided as input to Bambus
genome.trim.libs                          - file describing Celera Assembler's re-estimation of library sizes.  If significantly different from the original input to Celera Assembler you may want to re-run the assembly with the re-estimated values.  The library sizes are presented as (mean, stdev).


Extract scaffold information

The programs ca2scaff (from the AMOS package) and scaff2fasta.pl can be used to extract information about the scaffolds produced by Celera Assembler.  Ca2scaff converts Celera Assembler output to a format similar to the output of Bambus:
ca2scaff -i genome.trim.asm -o genome.trim
produces:

genome.trim.sum       - summary of the scaffolds: scaffold_id size  span
genome.trim.oo          - order/orientation file.  Formatted like a FASTA file, one scaffold/record, each line within the record represents a contig in the forward orientation (BE) or reverse orientation (EB).
genome.trim.gaps       - similar to the .oo file however it only lists the estimated gap size between adjacent contigs.  Single-contig scaffolds are represented as empty FASTA records.
genome.trim.links      - Bambus XML input file.  Can be used to bootstrap Bambus with the scaffolds produced by Celera Assembler.

Scaff2fasta.pl can be used to convert Celera Assembler output to a multi-FASTA file:
scaff2fasta.pl genome.trim.asm
produces the file genome.trim.scaffolds.fasta, where each scaffold is represented by a FASTA record consisting of a concatenation of the sequences of the contigs in the proper orientation, separated by 100 N characters.

Option -gaps to scaff2fasta.pl yields a spacing proportional to the actual gap sizes (a minimum of 100 Ns is added in each gap that is shorter than 100 bp).  Option -linker allows the user to provide a sequence that will be used as a spacer between adjacent contigs.


Extract singleton information

Singletons are reads that have not been used to build the assembly.  Contaminant sequences, insufficiently trimmed reads, as well as reads belonging to ubiquitous repeats are generally found as singletons.  To identify the singleton reads within the output of Celera Assembler you can use the ca2ingletons.pl program:
ca2singletons.pl -i genome.trim.asm -f genome.trim.frg -o genome.trim.singletons.fasta
In this simplest invocation, the output is a multi-FASTA file of the entire sequence (untrimmed) of the singleton reads.   Using option -clear, the output will be just the good quality portion of the singleton reads.  Option -list will simply return a list of singleton IDs.


Convert to .ACE format

You can convert the output of Celera Assembler to .ACE format (similar to the output of phrap, viewable in consed) using the ca2ace program from the AMOS package.  Note, that while we have been able to view these files in consed, we do not guarantee that the files are perfectly formatted for use by the consed/autofinish pipeline.  Also note that surrogate contigs are represented in these files as "fake" reads.  The reads corresponding to surrogate contigs are not placed in the assembly by Celera Assembler and therefore this measure was necessary to avoid the presence of coverage gaps within the contigs. 

The conversion can be done simply by calling:
ca2ace genome.trim.asm 
leading to the creation of a file genome.trim.ace.


AMOS Assembly Viewer

Unlike Consed or other ACE based viewer, the AMOS Assembly Viewer works natively from AMOS datatypes, and can display all relevant assembly information in a single interactive tool- including contigs and inserts within scaffolds, the multiple-alignment of reads within contigs, and even the chromatograms of the individual reads. To view a Celera Assembler assembly with the AMOS Assembly Viewer, you first must convert to AMOS format, but then you can immediately view the assembly.
toAmos -f genome.trim.frg -a genome.trim.asm -o genome.trim.afg
bank-transact -m genome.trim.afg -b genome.bnk -c
bankViewer genome.bnk
More Information is available on the AMOS Assembly Viewer website.