Contig representation in several assembly formats
This page contains information on how several assembly programs
represent contigs as well as the multiple alignment of the reads
aligned to these contigs. We concentrate on the issue of an exact
representation (with no significant processing needed for display) of
the multiple alignment data. Some assembly programs provide just
hints as to the placement of each read. Generating a
multiple-alignment from such a representation requires a complex
process involving aligning each read to the consensus in order to
determine the exact placement.
All contig representations contain the following information:
- The DNA sequence of the consensus together with information on
the placement of "gaps" or "padding" needed to ensure the mapping
between a base in the consensus and all underlying reads. This
information is usually explicitly provided within assembler
output. Optionally, each base is assigned a quality value
indicating a phred-like probability that the particular base is
incorrect.
- The DNA sequence of each of the reads aligned to the consensus,
together with the placement of gaps within the reads. This
information is either explicit (by listing all the bases in the
sequence), or provided as a set of edits on the original sequence
provided to the assembler. The latter approach is more compact,
but it requires the files containing the input to the
assembler in addition to the assembler output.
- Information about the exact position where each read is located
within the contig. This information is usually a combination of a
representation of the changes made by the assembler to the clear range
of each read and the position where the alignment of the read starts,
as a coordinate along the consensus.
A simple example of such information is provided below. The
lower-case bases represent the section of the read outside of the clear
range (region not used by the assembler):
                  asm_start     asm_end
                      |            |
                     1    1    2    2    3    3
                5    0    5    0    5    0    5
consensus: ACAGGACTAGAGTTAC-CGAGCCGTAGAAATGTAAGT
read:             atgaGTTACCC-AGCCGTagtg
                           1    1    2
                      5    0    5    0
                      |            |
                 read_start     read_end
Coordinates asm_start and asm_end (11 and 24, respectively) are
calculated with respect to the coordinate system defined by the
consensus sequence and indicate the first and last base of the
alignment of the read to the consensus. Coordinates read_start
and read_end (5 and 18, respectively) represent the beginning and end
of the "clear range" of the read and are calculated as offsets from the
beginning of the read. The orientation (forward or reverse) of
the read in the contig is also provided, either by reversing one of the
two ranges, or by explicitly indicating the orientation. For
example, a reversed read would have either the (asm_start, asm_end)
range reported as (24, 11), or the (read_start, read_end) range
reported as (18, 5), or would be marked as "R"everse or "C"omplemented.
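The "reversed range" convention described above can be sketched in a few lines of Python. This is only an illustration: the function name normalize_range is hypothetical and does not belong to any assembler package.

```python
def normalize_range(start, end):
    """Return (low, high, is_reversed) for a range that may be listed high-to-low.

    A range whose first coordinate is larger than the second is taken to
    mean that the read is reverse complemented.
    """
    if start <= end:
        return start, end, False
    return end, start, True


# Forward placement from the example above.
print(normalize_range(11, 24))
# The same placement reported for a reversed read.
print(normalize_range(24, 11))
```

Formats that mark orientation explicitly (for example with "U"/"C") always keep the range in increasing order, so only the flag needs to be read.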
The coordinates are usually specified in one of two formats:
0-based:
The numbering starts at 0 and refers to the spaces between bases; each
base is referred to by the coordinate preceding it. Ranges
exclude their right end: [left, right) - the base numbered "left" is included in
the range but the one numbered "right" is not.
Example:
0 1 2 3 4 5 6 7
 A A C A G T A
The substring CAGT is represented by the range [2,6).
1-based:
The numbering starts at 1 and refers to the bases themselves.
Ranges are inclusive: [left, right] - both the base numbered "left" and
the base numbered "right" are included in the range.
Example:
1 2 3 4 5 6 7
A A C A G T A
The substring CAGT is represented by the range [3,6].
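As a minimal sketch of how the two conventions relate, the Python functions below convert a range from one to the other. The function names are made up for this illustration. Python slices happen to use the 0-based, right-exclusive convention, which makes the check easy.

```python
def zero_to_one_based(left, right):
    """Convert a 0-based [left, right) range to a 1-based [left, right] range."""
    return left + 1, right


def one_to_zero_based(left, right):
    """Convert a 1-based [left, right] range to a 0-based [left, right) range."""
    return left - 1, right


seq = "AACAGTA"

# 0-based range [2,6) selects CAGT; a Python slice uses the same convention.
assert seq[2:6] == "CAGT"

# The same substring in 1-based inclusive coordinates is [3,6].
assert zero_to_one_based(2, 6) == (3, 6)
assert one_to_zero_based(3, 6) == (2, 6)
```

Only the left coordinate changes; in both conventions the length of the range is right - left (0-based) or right - left + 1 (1-based).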
The specific representations chosen by different assembly packages
are described below. The fields provided in these files not
directly related to the representation of the contigs have been
omitted. In order to write a converter you will have to become
familiar with all other information provided by the different assembly
programs. The AMOS package already provides such converters:
From | To | Converter
Celera Assembler, .ace, .contig | AMOS | toAmos
AMOS | .ace | amos2ace
Celera Assembler | .ace | ca2ace
AMOS | .contig | bank2contig (no additional documentation)
Celera Assembler | .contig | ca2ta or parsecasm (no additional documentation)
.ACE | .contig | ace2contig (no additional documentation)
File format descriptions
- AMOS
- Celera Assembler
- .ACE (output by phrap and many other assemblers)
- .contig - TIGR Assembler's adaptation of the
GDE alignment format
AMOS
Example:
{CTG
iid:1
eid:1
seq:
CCTCTCCTGTAGAGTTCAACCGA-GCCGGTAGAGTTTTATCA
.
qlt:
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
.
{TLE
src:1027
off:0
clr:618,0
gap:
250 612
.
}
}
Description:
For a more general description of the AMOS file format see the
specification documents at http://amos.sourceforge.net/docs/specs/.
- The contigs are identified both by an internal identifier (IID,
of int_32 type) and by an external string identifier (EID).
- The consensus (provided in the seq:
field) contains all the necessary gaps.
- The consensus quality values (provided in the qlt: field) are provided for every
single base in the consensus sequence, including the gaps. Each
quality value (any value between 0 and 60 is allowed) is encoded as
the single character whose ASCII code is that of '0' plus
the quality value (qv = 0 is represented by '0',
while qv = 20 is represented by 'D').
- The alignment of each read to the consensus (the TLE record) refers to the original
read provided as input to the assembler through the src: field. The alignment
information consists of:
- off: field - 0-based
offset of the beginning of the read within the consensus
- clr: field - 0-based
range representing the aligned portion of the read, with respect
to the ungapped/unpadded read sequence (as provided in the input to the
assembler). If this range is increasing (left end is smaller than
right end) the read is aligned in the forward orientation, otherwise in
the reverse orientation.
- gap: field - 0-based
coordinates of all the gaps added to the read. The coordinates
are with respect to the ungapped clear range AFTER the read was reverse
complemented (if necessary). Each value is the number of clear-range
bases that precede the gap. For example, in the case of read
acgACA-T--AC (the lower case letters are outside the clear range), the
gap positions are: 3, 4, 4 (think of the gaps between the upper case
letters).
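The clr: and gap: fields described above are enough to rebuild the padded form of a read. The following Python sketch shows one way to do it; pad_read and reverse_complement are hypothetical helper names, not part of the AMOS package, and the code assumes that a reversed clr: range (left end larger than right end) covers the same 0-based interval as its forward counterpart.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def pad_read(read, clr, gaps):
    """Rebuild the padded, oriented clear range of a read.

    read -- the original (ungapped) read sequence
    clr  -- 0-based (begin, end) clear range; begin > end means reversed
    gaps -- gap positions relative to the oriented, ungapped clear range
    """
    begin, end = clr
    if begin <= end:
        clear = read[begin:end]
    else:
        clear = reverse_complement(read[end:begin])

    gaps_at = Counter(gaps)  # number of gaps to insert before each base
    padded = []
    for pos in range(len(clear) + 1):
        padded.append("-" * gaps_at[pos])
        if pos < len(clear):
            padded.append(clear[pos])
    return "".join(padded)


# The example from the text: acgACA-T--AC with gaps at 3, 4, 4.
print(pad_read("acgACATAC", (3, 9), [3, 4, 4]))
```

The same gap list applies unchanged to a reversed read, because the positions are counted after reverse complementing.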
Note: AMOS-formatted files can be easily parsed in Perl with the AMOS::AmosLib
module, and in C++ through the Message_t
class.
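For readers who do not use those libraries, the qlt: encoding described above is simple enough to decode by hand. The Python sketch below uses hypothetical function names and assumes the usual phred definition, in which a quality value q corresponds to an error probability of 10^(-q/10).

```python
def decode_qualities(qlt):
    """Turn a qlt: string into a list of integer quality values."""
    return [ord(ch) - ord("0") for ch in qlt]


def encode_qualities(values):
    """Turn integer quality values (0..60) into a qlt: string."""
    for value in values:
        if not 0 <= value <= 60:
            raise ValueError(f"quality value out of range: {value}")
    return "".join(chr(ord("0") + value) for value in values)


def error_probability(qv):
    """Phred-style probability that a base with quality qv is wrong."""
    return 10 ** (-qv / 10)


# Every base in the example contig has the quality character 'D'.
print(decode_qualities("DDDD"))
print(error_probability(20))
```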
Celera Assembler
Note: the AMOS file format was derived from the Celera Assembler format,
so much of the following should already be familiar.
Example:
{AFG
acc:(11170,50)
clr:29,864
}
{CCO
acc:(1047167071404,31870)
len:2160
cns:
CCTCTCCTGTAGAGTTCAACCGA-GCCGGTAGAGTTTTATCA
.
qlt:
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
.
npc:5
nou:2
{MPS
typ:R
mid:11170
src:
.
pos:2160,1307
dln:2
del:
250 612
.
}
}
- The contigs are identified by both an external identifier (first
number in the parentheses in the acc:
field) and an internal identifier (second number in the parentheses).
- The consensus (provided in the cns:
field) contains all the necessary gaps.
- The consensus quality values (provided in the qlt: field) are provided for every
single base in the consensus sequence, including the gaps. Each
quality value (any value between 0 and 60 is allowed) is
encoded as the single character whose
ASCII code is that of '0' plus the quality value (qv = 0 is
represented by '0', while qv = 20 is represented by 'D').
- The alignment of each read to the consensus (the MPS record) refers to the original
read provided as input to the assembler through the mid: field. The number of MPS records assigned to the contig
is provided in the npc:
field. The alignment information consists of:
- pos: field - 0-based
range within the consensus indicating the extent of the alignment of
the read. This range is
increasing for forward reads and decreasing for reversed reads.
- del: field - 0-based
coordinates of all the gaps added to the read. The coordinates
are with respect to the ungapped clear range AFTER
the read was reverse complemented (if necessary). For example, in
the
case of read acgACA-T--AC (the lower case letters are outside the clear
range), the gap positions are: 3, 4, 4 (think of the gaps between the
upper case letters). The field dln:
contains the number of gaps added to the read.
- clr: field in the AFG record - 0-based clear range as
recomputed by the assembler (the original read ID is listed in the acc: field). Currently this
information simply mirrors the clear range provided in the input to the
assembler; however, nothing guarantees that it will remain unchanged, so
the value in the AFG record should be taken into account.
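Because the two formats are so close, the fields above map almost directly onto the AMOS TLE fields. The Python sketch below illustrates that mapping under stated assumptions: the function name mps_to_tle is hypothetical, the offset is taken to be the smaller end of the pos: range, and orientation is carried over by reversing the clear range. It is not the logic of the toAmos converter, which should be consulted for the authoritative behavior.

```python
def mps_to_tle(pos, clr, dels):
    """Map Celera Assembler MPS/AFG values onto AMOS-style TLE fields.

    pos  -- (begin, end) from the MPS pos: field; begin > end means reversed
    clr  -- (begin, end) clear range from the matching AFG record
    dels -- list of gap positions from the MPS del: field
    """
    pos_begin, pos_end = pos
    is_reversed = pos_begin > pos_end

    # AMOS stores a single offset: the leftmost consensus coordinate.
    offset = min(pos_begin, pos_end)

    # AMOS encodes orientation by reversing the clear range.
    clr_begin, clr_end = clr
    tle_clr = (clr_end, clr_begin) if is_reversed else (clr_begin, clr_end)

    # Gap positions use the same convention in both formats.
    return {"off": offset, "clr": tle_clr, "gap": list(dels)}


# Values taken from the example records above.
print(mps_to_tle((2160, 1307), (29, 864), [250, 612]))
```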
Note (1) : Celera Assembler also provides the alignment of unitigs
(uniquely assemblable contigs) to the final contigs. This information
is essentially the same as that contained in the MPS records, except that
it is provided in UPS records, the number of
which is listed in the nou: field.
In addition, the lid: field
within the UPS records refers
to a UTG (unitig) record
earlier in the file, and the alignment coordinates as well as the
coordinates of the gaps are with respect to the ungapped consensus of
the unitig. As a result, transferring read alignments from
the UTG record to the
corresponding CCO record is
not trivial and may require the addition of gaps to the
contig consensus.
Note (2) : Celera Assembler-formatted files can be easily parsed in
Perl with the
AMOS::AmosLib
module, and in C++ through the
Message_t
class.
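When neither library is available, the record structure shown in the examples can be read with a small hand-written parser. The Python sketch below is a minimal illustration, not a replacement for AMOS::AmosLib or Message_t. It assumes one field per line, a record opened by a "{XXX" line and closed by a "}" line, and multi-line fields that start with a bare "name:" line and end with a line containing only ".". The function name parse_messages and the returned dictionary layout are choices made for this example.

```python
def parse_messages(lines):
    """Parse message records into nested dictionaries.

    Each record becomes {"type": ..., "fields": {...}, "children": [...]}.
    """
    top_level = []
    stack = []
    it = iter(lines)
    for raw in it:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("{"):
            record = {"type": line[1:], "fields": {}, "children": []}
            (stack[-1]["children"] if stack else top_level).append(record)
            stack.append(record)
        elif line == "}":
            stack.pop()
        else:
            key, _, value = line.partition(":")
            if value == "":
                # Multi-line field: collect lines up to the "." terminator.
                chunks = []
                for cont in it:
                    cont = cont.strip()
                    if cont == ".":
                        break
                    chunks.append(cont)
                value = "\n".join(chunks)
            stack[-1]["fields"][key] = value
    return top_level


SAMPLE = """
{CCO
acc:(1047167071404,31870)
len:2160
cns:
CCTCTCC
.
npc:1
{MPS
typ:R
mid:11170
src:
.
pos:2160,1307
dln:2
del:
250 612
.
}
}
"""

records = parse_messages(SAMPLE.splitlines())
mps = records[0]["children"][0]
print(records[0]["type"], mps["fields"]["pos"], mps["fields"]["del"])
```

All values are returned as strings; converting ranges such as "2160,1307" into integers is left to the caller.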
.ACE format
The .ACE format is produced by phrap as well as by most other
assemblers (including Arachne, TIGR Assembler, CAP, etc.)
Example:
CO 1 30502 510 273 U
CCTCTCC*GTAGAGTTCAACCGAAGCCGGTAGAGTTTTATCACCCCTCCC
BQ
20 20 20 20 20 20 20 20 20 20 20 20 20
AF TBEOG48.y1 C 1
BS 1 137 TBEOG48.y1
RD TBEOG48.y1 619 0 0
CCTCTCC*GTAGAGTTCAACCGAAGCCGGTAGAGTTTTATCACCCCTCCC
QA 1 619 1 619
- Contig identifier lines (starting with
CO) list the contig ID (1 in the example), the number of bases
(30502), the number of reads (510), and the number of "base segments" (273), as
well as whether the contig is in the forward orientation
(Uncomplemented) or reversed (Complemented). In general, the
output of an assembler has all contigs listed as "U".
- The consensus sequence is padded, the gaps being represented as
*s instead of dashes, and follows immediately after the CO line.
- Consensus quality values are provided for the bases alone, the
gaps not being represented. These quality values, in phred-like format,
follow immediately after the BQ
line.
- The AF lines (one per
aligned read) indicate whether the read is complemented
(C) or not (U), followed by a 1-based offset in the consensus
sequence. Note that the offset refers to the beginning of the
entire read in the alignment, not just the clear range. Thus the
read acaggATTGA will have an offset of 1 even though its alignment to
the consensus truly starts at position 6.
- The BS lines indicate
which read was used to calculate the consensus between the specified
coordinates. These lines can, in general, be ignored as they are
an artifact of the algorithms used to compute the consensus sequence.
- The sequence of each read is explicitly provided after each RD line. The sequence is
padded with *s and is already complemented if necessary.
- The QA line following
each read contains two 1-based ranges. The second range represents the
clear range of the read, with respect to the read sequence (padded and
potentially complemented) as provided in the RD record.
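Two small calculations follow from the description above: finding where the clear range of a read lands on the padded consensus, and lining up the unpadded BQ quality values with the padded consensus sequence. The Python sketch below shows both; the function names are hypothetical and the code assumes exactly the conventions stated in this section (1-based AF offset of the whole read, 1-based QA range on the padded read, "*" as the pad character).

```python
def ace_aligned_span(af_offset, clear_start, clear_end):
    """1-based padded consensus coordinates covered by a read's clear range.

    af_offset   -- offset from the AF line (start of the entire read)
    clear_start -- start of the clear range from the QA line
    clear_end   -- end of the clear range from the QA line
    """
    return af_offset + clear_start - 1, af_offset + clear_end - 1


def pad_qualities(consensus, qualities):
    """Spread BQ values over a padded consensus, using None for each pad."""
    values = iter(qualities)
    return [None if base == "*" else next(values) for base in consensus]


# Read acaggATTGA: offset 1, clear range 6..10 on the read.
print(ace_aligned_span(1, 6, 10))

# A padded consensus with one gap and three quality values.
print(pad_qualities("AC*G", [20, 30, 40]))
```

Keeping the pads as None rather than inventing a quality value leaves the choice (for example, the minimum of the neighbouring bases) to the caller.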
.contig format
The .contig format is a concatenation of the .align files produced by
TIGR Assembler. This format is a more concise representation of
the output of the assembler (reported in the verbose .asm file) and is
an extension of the GDE multiple alignment format.
Example:
##56487 19 1623 bases, 00000000 checksum.
TTAGACCCAGGAGAAG-CATAAAATTTTCAGAGCCATCTGATGTAGGAGGAAGTTATGAA
#000035230611N10F(0) [RC] 711 bases, 00000000 checksum. {720 10} <1 710>
TTAGACCCAGGAGAAG-CATAAAATTTTCAGAGCCATCTGATGTAGGAGGAAGTTATGAA
- Each contig is preceded by a header starting with ##, followed by
the contig identifier, number of reads aligned to it, and the number of
bases in the padded consensus. If generated by TIGR Assembler,
these records also contain an 8-digit checksum, however most converters
generate a blank checksum (it's not used by any code anyway).
- The contig sequence, listed after the "##" header, is padded with
the gap character.
- Each read aligned to the consensus is preceded by a header
starting with a single "#" character. The 0-based offset of the read
in the consensus is provided in parentheses. Within the
square brackets, the string "RC" indicates that the read was reverse
complemented, a fact also indicated by the decreasing clear
range within the braces ({720 10}). The clear range is
1-based with respect to the unpadded/ungapped
read sequence. Note that the low number is 10, meaning the first 9 bases (1-9) have been trimmed from the beginning (5' end) of the read.
There may also be bases trimmed at the end of the read (3' end) beyond base 720, but this format does not record how many.
Next, the 1-based coordinates of the read along the ungapped consensus are provided
within angle brackets (<1 710>). This header also contains a checksum
(largely ignored) and the number of bases that follow
it.
- After the read header, the aligned section of the read (the bases
within the clear range alone) is provided in padded form, and in the
correct orientation (complemented if necessary).
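The read header described above can be picked apart with a single regular expression. The Python sketch below is an illustration written against the one example header shown in this section. The pattern, the function name parse_read_header, and the assumption that forward reads carry empty square brackets are not taken from any specification of the format.

```python
import re

READ_HEADER = re.compile(
    r"^#(?P<name>[^(\s]+)\((?P<offset>-?\d+)\)"      # read name and 0-based offset
    r"\s+\[(?P<rc>RC)?\]"                            # [RC] if reverse complemented
    r"\s+(?P<length>\d+)\s+bases,"                   # number of bases that follow
    r"\s+(?P<checksum>\S+)\s+checksum\."             # checksum (largely ignored)
    r"\s+\{(?P<clr_a>\d+)\s+(?P<clr_b>\d+)\}"        # 1-based clear range
    r"\s+<(?P<asm_a>\d+)\s+(?P<asm_b>\d+)>"          # 1-based ungapped consensus range
)


def parse_read_header(line):
    """Return the fields of a .contig read header as a dictionary."""
    match = READ_HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not a read header: {line!r}")
    return {
        "name": match["name"],
        "offset": int(match["offset"]),
        "reversed": match["rc"] is not None,
        "length": int(match["length"]),
        "clear": (int(match["clr_a"]), int(match["clr_b"])),
        "asm": (int(match["asm_a"]), int(match["asm_b"])),
    }


header = "#000035230611N10F(0) [RC] 711 bases, 00000000 checksum. {720 10} <1 710>"
print(parse_read_header(header))
```

In the example the clear range spans bases 10 through 720, which is 711 bases and matches the length reported in the header.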
Note: the .contig format can be easily parsed in Perl using the AMOS::ParseFasta module as follows:
use AMOS::ParseFasta;
$pf = new AMOS::ParseFasta(\*STDIN, "#", "");
For more information run perldoc AMOS::ParseFasta.