Contig representation in several assembly formats

This page describes how several assembly programs represent contigs and the multiple alignment of the reads aligned to those contigs.  We concentrate on exact representations of the multiple alignment data, that is, representations that need no significant processing before display.  Some assembly programs provide just hints as to the placement of each read.  Generating a multiple alignment from such a representation is a complex process that requires aligning each read to the consensus in order to determine its exact placement.

All contig representations contain the following information:

  1. The DNA sequence of the consensus together with information on the placement of "gaps" or "padding" needed to ensure the mapping between a base in the consensus and all underlying reads.  This information is usually explicitly provided within assembler output.  Optionally, each base is assigned a quality value indicating a phred-like probability that the particular base is incorrect.
  2. The DNA sequence of each of the reads aligned to the consensus, together with the placement of gaps within the reads.  This information is either explicit (listing all the bases in the sequence) or provided as a set of edits on the original sequence given to the assembler.  The latter approach is more compact, but it requires the files containing the input to the assembler in addition to the assembler output.
  3. Information about the exact position where each read is located within the contig.  This information is usually a combination of the changes made by the assembler to the clear range of each read and the position where the alignment of the read starts, as a coordinate along the consensus.

A simple example of such information is provided below.  The lower-case bases represent the section of the read outside of the clear range (region not used by the assembler):

                   asm_start    asm_end
                      |            |
                     1    1    2    2    3    3
                5    0    5    0    5    0    5
consensus: ACAGGACTAGAGTTAC-CGAGCCGTAGAAATGTAAGT
read:             atgaGTTACCC-AGCCGTagtg
                           1    1    2
                      5    0    5    0
                      |            |
                 read_start    read_end

Coordinates asm_start and asm_end (11 and 24, respectively) are calculated with respect to the coordinate system defined by the consensus sequence and indicate the first and last base of the alignment of the read to the consensus.  Coordinates read_start and read_end (5 and 18, respectively) represent the beginning and end of the "clear range" of the read and are calculated as offsets from the beginning of the read.  The orientation (forward or reverse) of the read in the contig is also provided, either by reversing one of the two ranges or by explicitly indicating the orientation.  For example, a reversed read would have the (asm_start, asm_end) range reported as (24, 11), or the (read_start, read_end) range reported as (18, 5), or would be marked as "R"everse or "C"omplemented.
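The "reversed range" convention can be handled with a small helper.  The following is a minimal Python sketch, not code from any assembler; the function name normalize_range is an assumption made for illustration.  It orders the two ends of a range and reports whether the range was listed backwards, which is how the orientation is encoded when no explicit flag is given.

```python
def normalize_range(start, end):
    """Return (low, high, is_reversed) for a range that may be listed backwards.

    A range whose first coordinate is larger than its second marks a
    reverse-complemented read; the ends are swapped so that low <= high.
    """
    if start > end:
        return end, start, True
    return start, end, False


# Forward read: the range is increasing.
print(normalize_range(11, 24))
# Reversed read: the same span, listed as (24, 11).
print(normalize_range(24, 11))
```

The helper is independent of the coordinate system: swapping the ends is valid for both 0-based and 1-based ranges.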

The coordinates are usually specified in one of two formats:

0-based:

The numbering starts at 0 and refers to the spaces between bases; each base is referred to by the coordinate preceding it.  Ranges are half-open: [left, right) - the base numbered "left" is included in the range but the one numbered "right" is not.

Example:

0 1 2 3 4 5 6 7
 A A C A G T A
The string CAGT (the third through sixth bases) is represented by the range [2,6).

1-based:

The numbering starts at 1 and refers to the bases themselves.  Ranges are inclusive: [left, right] - both the base numbered "left" and the base numbered "right" are included in the range.

Example:

1 2 3 4 5 6 7
A A C A G T A
The string CAGT (the third through sixth bases) is represented by the range [3,6].
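Converting between the two conventions only changes the left coordinate.  The sketch below illustrates this in Python, whose slices happen to use the same 0-based, right-exclusive convention; the function names are assumptions chosen for this example.

```python
def zero_to_one_based(left, right):
    """Convert a 0-based [left, right) range to a 1-based [left, right] range."""
    return left + 1, right


def one_to_zero_based(left, right):
    """Convert a 1-based [left, right] range to a 0-based [left, right) range."""
    return left - 1, right


seq = "AACAGTA"

# Python slicing is 0-based and excludes the right end, like [2,6).
print(seq[2:6])

# The same substring expressed in 1-based inclusive coordinates.
left, right = zero_to_one_based(2, 6)
print(left, right)
print(seq[left - 1:right])
```

In both conventions the length of the range is easy to recover: right - left for the 0-based form, and right - left + 1 for the 1-based form.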



The specific representations chosen by different assembly packages are described below.  The fields provided in these files not directly related to the representation of the contigs have been omitted.  In order to write a converter you will have to become familiar with all other information provided by the different assembly programs.  The AMOS package already provides such converters:

From                             To        Converter
Celera Assembler, .ace, .contig  AMOS      toAmos
AMOS                             .ace      amos2ace
Celera Assembler                 .ace      ca2ace
AMOS                             .contig   bank2contig (no additional documentation)
Celera Assembler                 .contig   ca2ta or parsecasm (no additional documentation)
.ACE                             .contig   ace2contig (no additional documentation)


File format descriptions

  1. AMOS
  2. Celera Assembler
  3. .ACE (output by phrap and many other assemblers)
  4. .contig - TIGR Assembler's adaptation of the GDE alignment format


AMOS

Example:
{CTG
iid:1
eid:1
seq:
CCTCTCCTGTAGAGTTCAACCGA-GCCGGTAGAGTTTTATCA
.
qlt:
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
.
{TLE
src:1027
off:0
clr:618,0
gap:
250 612
.
}
}
Description:
For a more general description of the AMOS file format see the specification documents at http://amos.sourceforge.net/docs/specs/.
  • The contigs are identified both by an internal identifier (IID, of int_32 type) and by an external string identifier (EID).
  • The consensus (provided in the seq: field) contains all the necessary gaps.
  • The consensus quality values (provided in the qlt: field) are provided for every single base in the consensus sequence, including the gaps.  Each quality value (any value between 0 and 60 is allowed) is encoded as the single character whose ASCII code is the code of '0' plus the quality value (qv = 0 is represented by '0', qv = 20 by 'D', and qv = 40 by 'X').
  • The alignment of each read to the consensus (the TLE record) refers to the original read provided as input to the assembler through the src: field.  The alignment information consists of:
    • off: field - 0-based offset of the beginning of the read within the consensus
    • clr: field - 0-based range representing the aligned portion of the read,  with respect to the ungapped/unpadded read sequence (as provided in the input to the assembler).  If this range is increasing (left end is smaller than right end) the read is aligned in the forward orientation, otherwise in the reverse orientation.
    • gap: field - 0-based coordinates of all the gaps added to the read.  The coordinates are with respect to the ungapped clear range AFTER the read was reverse complemented (if necessary).  For example, in the case of read acgACA-T--AC (the lower case letters are outside the clear range), the gap positions are: 3, 4, 4 (think of the gaps between the upper case letters)
Note: AMOS-formatted files can be easily parsed in Perl with the AMOS::AmosLib module, and in C++ through the Message_t class.
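The way the clr: and gap: fields combine can be illustrated with a short Python sketch.  This is a minimal illustration under the rules stated above, not the AMOS implementation; the function names padded_read and reverse_complement are assumptions, and the raw read sequence is assumed to have been looked up separately through the src: field.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T only, case preserved)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def padded_read(raw_seq, clr, gaps, gap_char="-"):
    """Rebuild the padded, oriented clear range of a read.

    raw_seq -- ungapped read as given to the assembler
    clr     -- 0-based (begin, end) pair; begin > end means reversed
    gaps    -- 0-based gap positions relative to the ungapped clear
               range AFTER reverse complementing
    """
    begin, end = clr
    is_reversed = begin > end
    if is_reversed:
        begin, end = end, begin
    clear = raw_seq[begin:end]
    if is_reversed:
        clear = reverse_complement(clear)

    gaps = sorted(gaps)
    out = []
    next_gap = 0
    # Position i is the space just before clear[i]; a gap listed at i
    # is emitted before that base. Repeated positions give runs of gaps.
    for i in range(len(clear) + 1):
        while next_gap < len(gaps) and gaps[next_gap] == i:
            out.append(gap_char)
            next_gap += 1
        if i < len(clear):
            out.append(clear[i])
    return "".join(out)


# The example from the text: read acgACA-T--AC, gaps at 3, 4, 4.
print(padded_read("acgACATAC", (3, 9), [3, 4, 4]))
```

Gap positions larger than the clear range length are silently ignored in this sketch; a real converter would probably treat them as an error.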

Celera Assembler

Note: the AMOS file format was derived from the Celera Assembler format, so much of the following should already be familiar.

Example:
{AFG
acc:(11170,50)
clr:29,864
}
{CCO
acc:(1047167071404,31870)
len:2160
cns:
CCTCTCCTGTAGAGTTCAACCGA-GCCGGTAGAGTTTTATCA
.
qlt:
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
.
npc:5
nou:2
{MPS
typ:R
mid:11170
src:
.
pos:2160,1307
dln:2
del:
250 612
.
}
}
  • The contigs are identified by both an external identifier (first number in the parentheses in the acc: field) and an internal identifier (second number in the parentheses).
  • The consensus (provided in the cns: field) contains all the necessary gaps.
  • The consensus quality values (provided in the qlt: field) are provided for every single base in the consensus sequence, including the gaps.  Each quality value (any value between 0 and 60 is allowed) is encoded as the single character whose ASCII code is the code of '0' plus the quality value (qv = 0 is represented by '0', qv = 20 by 'D', and qv = 40 by 'X').
  • The alignment of each read to the consensus (the MPS record) refers to the original read provided as input to the assembler through the mid: field.  The number of MPS records assigned to the contig is provided in the npc: field.  The alignment information consists of:
    • pos: field - 0-based range within the consensus indicating the extent of the alignment of the read.  This range is increasing for forward reads and decreasing for reversed reads.
    • del: field - 0-based coordinates of all the gaps added to the read.  The coordinates are with respect to the ungapped clear range AFTER the read was reverse complemented (if necessary).  For example, in the case of read acgACA-T--AC (the lower case letters are outside the clear range), the gap positions are: 3, 4, 4 (think of the gaps between the upper case letters).  The field dln: contains the number of gaps added to the read.
    • clr: field in the AFG record - 0-based clear range as recomputed by the assembler (the original read ID is listed in the acc: field).  Currently this information simply mirrors the clear range provided in the input to the assembler; however, nothing guarantees that it will not change, so converters should take the value from the AFG record into account.
Note (1) : Celera Assembler also provides the alignment of unitigs (uniquely assemblable contigs) to the final contigs.  This information is essentially the same as that contained in the MPS records, except that it is provided in UPS records, the number of which is listed in the nou: field.  In addition, the lid: field within the UPS records refers to a UTG (unitig) record earlier in the file, and the alignment coordinates as well as the coordinates of the gaps are with respect to the ungapped consensus of the unitig.  For this reason, transferring read alignments from the UTG record to a corresponding CCO record is not trivial and may require the addition of gaps to the contig consensus.

Note (2) : Celera Assembler-formatted files can be easily parsed in Perl with the AMOS::AmosLib module, and in C++ through the Message_t class.
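The qlt: encoding shared by the AMOS and Celera Assembler formats (the ASCII code of '0' plus the quality value) can be decoded with a few lines of Python.  This is a minimal sketch; the function names are assumptions made for illustration.  Under this rule the character 'D' (ASCII 68) decodes to 20 and 'X' (ASCII 88) decodes to 40.

```python
def decode_quality(qlt):
    """Turn a qlt: string into a list of integer quality values."""
    return [ord(ch) - ord("0") for ch in qlt]


def encode_quality(values):
    """Turn integer quality values (0..60) into a qlt: string."""
    for value in values:
        if not 0 <= value <= 60:
            raise ValueError(f"quality value out of range: {value}")
    return "".join(chr(ord("0") + value) for value in values)


print(decode_quality("0DX"))
print(encode_quality([0, 20, 40]))
```

Because the qlt: string covers the gaps as well as the bases, the decoded list has the same length as the padded consensus.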

.ACE format

The .ACE format is produced by phrap as well as by most other assemblers (including Arachne, TIGR Assembler, CAP, etc.)

Example:

CO 1 30502 510 273 U
CCTCTCC*GTAGAGTTCAACCGAAGCCGGTAGAGTTTTATCACCCCTCCC

BQ
20 20 20 20 20 20 20 20 20 20 20 20 20

AF TBEOG48.y1 C 1

BS 1 137 TBEOG48.y1

RD TBEOG48.y1 619 0 0
CCTCTCC*GTAGAGTTCAACCGAAGCCGGTAGAGTTTTATCACCCCTCCC

QA 1 619 1 619

  • Contig header lines (starting with CO) list the contig ID (1 in the example), the number of bases (30502), the number of reads (510), and the number of "base segments" (273), as well as whether the contig is in the forward orientation (Uncomplemented) or reversed (Complemented).  In general, the output of an assembler has all contigs listed as "U".
  • The consensus sequence is padded, the gaps being represented as *s instead of dashes, and follows immediately after the CO line.
  • Consensus quality values are provided for the bases alone, the gaps not being represented. These quality values, in phred-like format follow immediately after the BQ line. 
  • The AF lines (one per aligned read) indicate whether the read is complemented (C) or not (U), followed by a 1-based offset in the consensus sequence.  Note that the offset refers to the beginning of the entire read in the alignment, not just the clear range.  Thus the read acaggATTGA will have an offset of 1 even though its alignment to the consensus (the clear range) truly starts at position 6.
  • The BS lines indicate which read was used to calculate the consensus between the specified coordinates.  These lines can, in general, be ignored as they are an artifact of the algorithms used to compute the consensus sequence.
  • The sequence of each read is explicitly provided after each RD line.  The sequence is padded with *s and is already complemented if necessary.
  • The QA line following each read contains two 1-based ranges. The second range represents the clear range of the read, with respect to the read sequence (padded and potentially complemented) as provided in the RD record. 
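Since the AF offset refers to the start of the entire padded read, the consensus coordinates of the clear range have to be derived by combining the AF and QA lines.  The Python sketch below shows one way to do this under the description above; the function names are assumptions, and the arithmetic assumes that both the AF offset and the QA clear range are 1-based padded coordinates.

```python
def parse_af(line):
    """Parse 'AF <read> <U|C> <offset>' into (name, is_complemented, offset)."""
    tag, name, orientation, offset = line.split()
    if tag != "AF" or orientation not in ("U", "C"):
        raise ValueError(f"not a valid AF line: {line!r}")
    return name, orientation == "C", int(offset)


def parse_qa(line):
    """Parse 'QA a b c d' into two 1-based ranges; the second is the clear range."""
    fields = line.split()
    if len(fields) != 5 or fields[0] != "QA":
        raise ValueError(f"not a valid QA line: {line!r}")
    a, b, c, d = (int(x) for x in fields[1:])
    return (a, b), (c, d)


def clear_range_in_consensus(af_offset, clear):
    """Map a 1-based clear range on the padded read to padded consensus coordinates."""
    start, end = clear
    return af_offset + start - 1, af_offset + end - 1


name, is_complemented, offset = parse_af("AF TBEOG48.y1 C 1")
_, clear = parse_qa("QA 1 619 1 619")
print(name, is_complemented, clear_range_in_consensus(offset, clear))

# The read acaggATTGA: offset 1, clear range 6..10 on the read.
print(clear_range_in_consensus(1, (6, 10)))
```

For the read acaggATTGA the result starts at consensus position 6, which matches the remark in the AF description.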

.contig format

The .contig format is a concatenation of the .align files produced by TIGR Assembler.  This format is a more concise representation of the output of the assembler (reported in the verbose .asm file) and is an extension of the GDE multiple alignment format.

Example:
##56487 19 1623 bases, 00000000 checksum.
TTAGACCCAGGAGAAG-CATAAAATTTTCAGAGCCATCTGATGTAGGAGGAAGTTATGAA
#000035230611N10F(0) [RC] 711 bases, 00000000 checksum. {720 10} <1 710>
TTAGACCCAGGAGAAG-CATAAAATTTTCAGAGCCATCTGATGTAGGAGGAAGTTATGAA
  • Each contig is preceded by a header starting with ##, followed by the contig identifier, number of reads aligned to it, and the number of bases in the padded consensus.  If generated by TIGR Assembler, these records also contain an 8-digit checksum, however most converters generate a blank checksum (it's not used by any code anyway).
  • The contig sequence, listed after the "##" header, is padded with the gap character.
  • Each read aligned to the consensus is preceded by a header starting with a single "#" character.  The 0-based offset of the read in the consensus is provided in parentheses.  Within the square brackets, the string "RC" indicates that the read was reverse complemented, a fact also indicated by the decreasing clear range within the braces ({720 10}).  The clear range is 1-based with respect to the unpadded/ungapped read sequence.  Note that the low number is 10, meaning the first 9 bases (1-9) have been trimmed from the beginning (5' end) of the read.  There may also be bases trimmed at the end of the read (3' end) beyond base 720, but this format does not record how many.  Next, the 1-based coordinates of the read along the ungapped consensus are provided within angle brackets (<1 710>).  This header also contains a checksum (largely ignored) and the number of bases in the sequence that follows it.
  • After the read header, the aligned section of the read (the bases within the clear range alone) is provided in padded form, and in the correct orientation (complemented if necessary). 
Note: the .contig format can be easily parsed in Perl using the AMOS::ParseFasta module as follows: $pf = new AMOS::ParseFasta(\*STDIN, "#", "");    For more information run perldoc AMOS::ParseFasta.
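For readers not using Perl, the read header can also be taken apart with a regular expression.  The following Python sketch is based only on the example header shown above; the pattern name, the function name, and the assumption that forward reads carry empty square brackets ("[]") are illustrative and should be checked against real files.

```python
import re

# Pattern for a read header such as:
# #000035230611N10F(0) [RC] 711 bases, 00000000 checksum. {720 10} <1 710>
READ_HEADER = re.compile(
    r"^#(?P<name>[^#\s(]+)\((?P<offset>\d+)\)\s+"
    r"\[(?P<rc>RC)?\]\s+"
    r"(?P<nbases>\d+)\s+bases,\s+(?P<checksum>\S+)\s+checksum\.\s+"
    r"\{(?P<clr_left>\d+)\s+(?P<clr_right>\d+)\}\s+"
    r"<(?P<asm_left>\d+)\s+(?P<asm_right>\d+)>"
)


def parse_read_header(line):
    """Parse a .contig read header line into a dictionary of its fields."""
    match = READ_HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not a read header: {line!r}")
    return {
        "name": match["name"],
        "offset": int(match["offset"]),          # 0-based, padded consensus
        "reversed": match["rc"] is not None,     # True when [RC] is present
        "nbases": int(match["nbases"]),
        "clear": (int(match["clr_left"]), int(match["clr_right"])),
        "asm": (int(match["asm_left"]), int(match["asm_right"])),
    }


header = "#000035230611N10F(0) [RC] 711 bases, 00000000 checksum. {720 10} <1 710>"
print(parse_read_header(header))
```

Contig headers start with "##" and are rejected by this pattern, so a caller can test for "##" first and fall back to parse_read_header for the remaining "#" lines.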