Genome Assembly with Short Reads Tutorial
There are two AMOS pipelines specially designed to assemble short reads:
- AMOScmp-shortReads
- AMOScmp-shortReads-alignmentTrimmed
1. AMOScmp-shortReads
Pipeline designed for comparative assembly of short reads.
Differences compared to AMOScmp:
- uses a smaller nucmer alignment cluster size
- uses a smaller make-consensus alignment wiggle value
AMOScmp-shortReads accepts more input parameters than AMOScmp.
Defaults:
MINCLUSTER = 20
MINMATCH = 20
MINOVL = 5
MAXTRIM = 10
MAJORITY = 50
CONSERR = 0.06
ALIGNWIGGLE = 2
2. AMOScmp-shortReads-alignmentTrimmed
Pipeline designed for alignment-based trimming and assembly of short reads.
Differences compared to AMOScmp:
- trims the reads, based on their alignment to the reference, before the assembly
- uses a smaller nucmer alignment cluster size
- uses a smaller make-consensus alignment wiggle value
Trimming is performed in several processing steps after the reads have been aligned to the reference.
Steps:
- identify the zero coverage regions in the reference sequence (delta2cvg)
- extract the read clear ranges from the alignment file (delta2clr)
- extend the clear ranges of the reads adjacent to zero coverage regions (delta2clr)
- update the bank with the new clear ranges (updateClrRanges)
- update the alignment file with the new read lengths and clear ranges (updateDeltaClr)
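The first step above can be sketched in a few lines of shell. This is only an illustration of the idea, not the delta2cvg implementation: it assumes the alignment intervals have already been extracted as 1-based, inclusive "start end" pairs on a single reference of known length, and the function name zero_cov_regions is hypothetical.

```shell
# Hypothetical helper, not part of AMOS: print the zero coverage regions of a
# reference of length $1, given "start end" alignment intervals on stdin
# (assumed 1-based, inclusive coordinates on a single reference sequence).
zero_cov_regions() {
    sort -k1,1n -k2,2n | awk -v len="$1" '
        BEGIN { covered_to = 0 }              # rightmost covered position so far
        $1 + 0 > covered_to + 1 {             # gap before this interval
            print covered_to + 1, $1 - 1
        }
        $2 + 0 > covered_to { covered_to = $2 + 0 }
        END { if (covered_to < len) print covered_to + 1, len }
    '
}

# Example: three read alignments on a 100 bp reference.
printf '%s\n' '10 40' '30 60' '80 90' | zero_cov_regions 100
```

Reads whose alignments end next to one of the reported regions are the ones whose clear ranges would be extended in the third step.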
Three Perl scripts recently added to the AMOS package (release 2.0.8) are called by the AMOScmp-shortReads-alignmentTrimmed pipeline.
Their main purpose is to parse and update the nucmer alignment (delta) file.
Scripts:
- delta2cvg: computes the alignment coverage of the reference sequence(s)
- delta2clr: computes the minimum 5' and maximum 3' alignment coordinates of the aligned reads
- updateDeltaClr: shifts the alignment coordinates of a read to the left by its minimum 5' coordinate (for reads whose minimum 5' coordinate is > 0)
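The coordinate arithmetic behind delta2clr and updateDeltaClr can be illustrated with a small sketch. It does not parse a real delta file and is not the code of either script: it assumes the read-side alignment coordinates are already available as "read_id start end" lines, and the function names clear_ranges and shift_to_clear_start are hypothetical.

```shell
# Hypothetical helper, not part of AMOS: for each read, print the minimum 5'
# and the maximum 3' alignment coordinate (the idea behind delta2clr).
# Input on stdin: "read_id start end", one line per alignment.
clear_ranges() {
    awk '
        { s = $2 + 0; e = $3 + 0 }
        !($1 in lo) || s < lo[$1] { lo[$1] = s }
        !($1 in hi) || e > hi[$1] { hi[$1] = e }
        END { for (id in lo) print id, lo[id], hi[id] }
    ' | sort
}

# Hypothetical helper: subtract the minimum 5' coordinate of each read from
# all of its alignment coordinates (the idea behind updateDeltaClr).
shift_to_clear_start() {
    awk '
        { id[NR] = $1; s[NR] = $2 + 0; e[NR] = $3 + 0
          if (!($1 in lo) || s[NR] < lo[$1]) lo[$1] = s[NR] }
        END { for (i = 1; i <= NR; i++)
                  print id[i], s[i] - lo[id[i]], e[i] - lo[id[i]] }
    '
}

# Example: read r1 has two alignments, read r2 has one.
printf '%s\n' 'r1 3 20' 'r1 5 33' 'r2 0 36' | clear_ranges
printf '%s\n' 'r1 3 20' 'r1 5 33' 'r2 0 36' | shift_to_clear_start
```

In this example r1 is trimmed to the range 3..33 and its alignments move 3 positions to the left, while r2 already starts at 0 and is left unchanged.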
AMOScmp-shortReads-alignmentTrimmed also accepts more parameters than AMOScmp.
Defaults:
MINCLUSTER = 16
MINMATCH = 16
MINLEN = 24 # delta-filter -l 24
MINOVL = 5
MAXTRIM = 10
MAJORITY = 50
CONSERR = 0.06
ALIGNWIGGLE = 2
Input files:
Assuming that prefix is the name of the organism to assemble, two files are required:
- prefix.1con : reference sequence: a related organism sequence in FASTA format (complete or well assembled, usually downloaded from GenBank)
- prefix.afg : AMOS message file that contains read/fragment messages corresponding to each short read; it can be generated using the toAmos script
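Since both pipelines expect the two files to share the same prefix, a quick pre-flight check can catch a missing input before a run starts. The function check_inputs below is a hypothetical convenience wrapper, not an AMOS command; it only tests that the files exist and are non-empty, not that their contents are valid.

```shell
# Hypothetical pre-flight check, not part of AMOS: verify that both input
# files for a given prefix exist and are non-empty.
check_inputs() {
    prefix=$1
    rc=0
    for ext in 1con afg; do
        if [ ! -s "$prefix.$ext" ]; then
            echo "missing or empty: $prefix.$ext" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

A possible use is to chain it in front of a pipeline call, for example: check_inputs prefix && AMOScmp-shortReads prefix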
Examples:
$ toAmos -s prefix.seq -o prefix.afg # create an AMOS message file from short read FASTA sequences
$ toAmos -s prefix.seq -q prefix.qual -o prefix.afg # create an AMOS message file from short read FASTA sequences and qualities
$ AMOScmp-shortReads prefix # assemble reads (no trimming, default parameters)
$ AMOScmp-shortReads prefix -D MINCLUSTER=16 -D MINMATCH=16 # use a minimum alignment/cluster size of 16 bp
$ AMOScmp-shortReads prefix -D CONSERR=0.1 # use a consensus error of 0.1 (10%)
$ AMOScmp-shortReads-alignmentTrimmed prefix # assemble reads (alignment based trimming, default parameters)
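Several parameters can be overridden at once by repeating the -D option, as in the MINCLUSTER/MINMATCH example above. The sketch below only builds and prints such a command line so it can be reviewed before running; amos_cmdline is a hypothetical helper, and the override values are arbitrary illustrations rather than recommended settings.

```shell
# Hypothetical helper, not part of AMOS: print (do not run) a pipeline command
# line with one "-D KEY=VALUE" option per override.
# Usage: amos_cmdline PIPELINE PREFIX [KEY=VALUE ...]
amos_cmdline() {
    pipeline=$1
    prefix=$2
    shift 2
    cmd="$pipeline $prefix"
    for kv in "$@"; do
        cmd="$cmd -D $kv"
    done
    printf '%s\n' "$cmd"
}

# Example: raise the minimum alignment sizes of the trimming pipeline.
amos_cmdline AMOScmp-shortReads-alignmentTrimmed prefix MINCLUSTER=20 MINMATCH=20 MINLEN=30
```

The example prints: AMOScmp-shortReads-alignmentTrimmed prefix -D MINCLUSTER=20 -D MINMATCH=20 -D MINLEN=30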