Interpreting Celera Assembler Output

Table of contents

  1. Surrogates
  2. Degenerate contigs
  3. N50 size
  4. .qc file example
  5. Glossary

Surrogates

Celera Assembler often identifies repetitive or ambiguous sections of the genome as surrogate contigs. These contigs are incorporated into one or more other contigs, but only at the consensus level: the reads belonging to a surrogate contig do not appear in the contigs that contain it. This is because the assembler has determined that the contig may represent several copies of a repeat (based on a high arrival rate, i.e. large depth of coverage). It places as many copies of the contig in the assembly as the mate links support, but there is no guarantee that all copies of the repeat will be found. It is also possible that a surrogate is not a repeat at all, but just a contig with unusually deep read coverage. Surrogates are uploaded into the database as contigs, but their assembly..comment field contains the string "CA_FREE". They should be ignored when running Bamboo and used only when trying to close the corresponding repeats.

Degenerate contigs

Celera Assembler fails to incorporate some contigs into scaffolds, generally on the basis of arrival rate statistics (low arrival rate, i.e. low coverage). These contigs frequently contain contaminants or poor-quality reads, and can generally be ignored when analyzing or closing the genome. A very low-coverage assembly, however, may have a significant portion of its output in degenerate contigs. Note that singleton reads are a separate category and are not counted as contigs at all. Degenerate contigs are uploaded into the database as contigs, but their assembly..comment field contains the string "CA_DEGEN".

N50 size

The N50 size of a set of entities (e.g., contigs or scaffolds) is the size of the largest entity E such that at least half of the total size of the entities is contained in entities of size E or larger. For example, given a collection of contigs with sizes 7, 4, 3, 2, 2, 1, and 1 kb (20 kb in total), the N50 size is 4 kb, because contigs of 4 kb or larger (7 + 4 = 11 kb) cover at least half of the total, 10 kb, while the 7 kb contig alone does not.
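
The N50 definition can be sketched in a few lines of Python. This is only an illustration, not the code used by castats.pl; the function name n50 and the optional genome_size argument (standing in for the effect of a user-supplied genome size) are assumptions made for this example.

```python
def n50(sizes, genome_size=None):
    """Return the N50 of a collection of sizes, or None if it cannot be reached.

    genome_size is a hypothetical parameter: when given, half of it is used
    as the target instead of half of the summed sizes.
    """
    ordered = sorted(sizes, reverse=True)
    total = genome_size if genome_size is not None else sum(ordered)
    target = total / 2
    running = 0
    for size in ordered:
        running += size
        # The first entity that pushes the running sum to the halfway
        # point is the N50 entity.
        if running >= target:
            return size
    return None  # empty input, or genome_size larger than twice the data


if __name__ == "__main__":
    print(n50([7, 4, 3, 2, 2, 1, 1]))  # 4
```

Sorting from largest to smallest and accumulating sizes means the loop stops at the largest entity that satisfies the definition. If the target is never reached (for example, with an estimated genome size much larger than the assembled bases), the sketch returns None rather than guessing.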

.qc file example

The following sample output was produced by castats.pl. Note that by default the estimated genome size is the same as the number of bases in scaffolds. You can specify a different genome size (used to calculate N50 sizes) by passing option -g to castats.pl.

EstimatedGenomeSize         18459625


Scaffolds
---------
TotalScaffolds 6408
TotalContigsInScaffolds 6870
MeanContigsPerScaffold 1.07
MinContigsPerScaffold 1
MaxContigsPerScaffold 4

TotalBasesInScaffolds 18459625
MeanBasesInScaffolds 2880
MinBasesInScaffolds 1019
MaxBasesInScaffolds 21154
N50ScaffoldBases 3386

TotalSpanOfScaffolds 18534790
MeanSpanOfScaffolds 2892
MinScaffoldSpan 1019
MaxScaffoldSpan 21775
IntraScaffoldGaps 462
MeanSequenceGapSize 162

Top 5 Scaffolds
: #contigs size span avgContig avgGap
-----------------------------------------------------
0 3 21154 21775 7051.33 310.50
1 3 20056 20105 6685.33 24.50
2 1 19870 19870 19870.00 0.00
3 4 18474 19378 4618.50 301.33
4 1 18045 18045 18045.00 0.00

Contigs
-------
TotalContigsInScaffolds 6870
TotalBasesInScaffolds 18459625
MeanContigSize 2686.99
MinContigSize 951
MaxContigSize 19870
N50ContigBases 3073

Big Contigs (>10000)
-----------------
TotalBigContigs 60
BigContigLength 738500
MeanBigContigSize 12308.33
MinBigContig 10025
MaxBigContig 19870
BigContigsPercentBases 4.00%

Small Contigs (<10000)
-----------------
TotalSmallContigs 6810
SmallContigLength 17721125
MeanSmallContigSize 2602.22
MinSmallContig 951
MaxSmallContig 9992
SmallContigsPercentBases 96.00%

Degenerate Contigs
------------------
TotalDegenContigs 10214
DegenContigLength 12400379
MeanDegenContigSize 1214.06
MinDegenContig 92
MaxDegenContig 10907
DegenPercentBases 67.18%

Top 5 Contigs: reads bases
--------------------------
0 22 19870
1 79 18045
2 45 17131
3 57 16413
4 64 16221

Surrogates
----------
NumSurrogates 541
SurrogateSize 842205
MinSurrogateSize 505
MaxSurrogateSize 8278
MeanSurrogateSize 1556.76
SDSurrogateSize 840.94

Mates
-----
ReadsWithNoMate 3573
ReadsWithBadMate 18
ReadsWithGoodMate 12604
ReadsWithUnusedMate 123326
TotalScaffoldLinks 1
MeanScaffoldLinkWeight 2.00

Reads
-----
TotalReads 139521
ReadsInContigs 39816
BigContigReads 2359
SmallContigReads 37457
DegenContigReads 30581
ReadsInSurrogates 2535
SingletonReads 66589

Coverage
--------
ContigsOnly 1.96
ContigsAndDegens 3.35
AllReads 6.34
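
For downstream scripting, a report laid out like the sample above can be read into a dictionary. The sketch below assumes that every statistic sits on its own line as a name followed by a single numeric value (optionally ending in %), and that everything else (section titles, dashed rules, the "Top 5" tables) can be skipped. The function name parse_qc is hypothetical and is not part of Celera Assembler.

```python
def parse_qc(text):
    """Collect 'Name value' lines from .qc-style text into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only two-field lines whose first field looks like a name;
        # this skips headings, rules, blank lines, and table rows.
        if len(fields) != 2 or not fields[0][0].isalpha():
            continue
        name, raw = fields
        try:
            stats[name] = float(raw.rstrip("%"))
        except ValueError:
            continue  # second field is not numeric, so not a statistic
    return stats


if __name__ == "__main__":
    excerpt = """\
Scaffolds
---------
TotalScaffolds 6408
MeanContigsPerScaffold 1.07

Top 5 Contigs: reads bases
--------------------------
0 22 19870

DegenPercentBases 67.18%
"""
    print(parse_qc(excerpt))
```

Names that appear in more than one section of the sample (TotalContigsInScaffolds and TotalBasesInScaffolds) carry the same value each time, so keeping the last occurrence loses nothing here.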


Glossary

EstimatedGenomeSize - the estimated genome size - used in computing N50 values. Usually equal to the number of bases in scaffolds, unless otherwise specified through option -g on the command line of castats.pl.
TotalScaffolds - the total number of scaffolds in the assembly.
TotalContigsInScaffolds - the total number of contigs that made it into scaffolds. Contigs that do not belong to scaffolds are called degenerate and generally can be ignored.
MeanContigsPerScaffold - the average number of contigs in a scaffold.
MinContigsPerScaffold - the minimum number of contigs in a scaffold.
MaxContigsPerScaffold - the maximum number of contigs in a scaffold.
TotalBasesInScaffolds - the sum of all contig sizes for the contigs in scaffolds.
MeanBasesInScaffolds - the average scaffold size. The size of a scaffold is the sum of the sizes of all contigs contained in that scaffold.
MinBasesInScaffolds - the minimum size of a scaffold.
MaxBasesInScaffolds - the maximum size of a scaffold.
N50ScaffoldBases - the N50 scaffold size.
TotalSpanOfScaffolds - the sum of all contig sizes and gaps in all scaffolds.
MeanSpanOfScaffolds - the average span of a scaffold.
MinScaffoldSpan - the minimum span of a scaffold.
MaxScaffoldSpan - the maximum span of a scaffold.
IntraScaffoldGaps - the number of sequencing gaps in all scaffolds.
MeanSequenceGapSize - the average size of a sequencing gap.
Top 5 Scaffolds - a listing of the 5 largest scaffolds. For each scaffold we report the number of contigs, size, and span as well as the average contig and average sequencing gap sizes.
MeanContigSize - the average contig size.
MinContigSize - the minimum contig size.
MaxContigSize - the maximum contig size.
N50ContigBases - the N50 contig size.
TotalBigContigs - the number of contigs bigger than 10kb.
BigContigLength - the sum of the sizes of all contigs bigger than 10kb.
MeanBigContigSize - the average size of the contigs over 10kb.
MinBigContig - the minimum contig size in contigs over 10kb.
MaxBigContig - the maximum contig size in contigs over 10kb. Should be the same as MaxContigSize.
BigContigsPercentBases - the percentage of TotalBasesInScaffolds contained in contigs over 10kb.
TotalSmallContigs - the number of contigs smaller than 10kb.
SmallContigLength - the sum of the sizes of all contigs smaller than 10kb.
MeanSmallContigSize - the average size of contigs under 10kb.
MinSmallContig - the minimum contig size in contigs under 10kb. Should be the same as MinContigSize.
MaxSmallContig - the maximum contig size in contigs under 10kb.
SmallContigsPercentBases - the percentage of TotalBasesInScaffolds contained in contigs under 10kb.
TotalDegenContigs - the number of degenerate contigs (contigs that do not appear in scaffolds).
DegenContigLength - the sum of the sizes of all degenerate contigs.
MeanDegenContigSize - the average size of degenerate contigs.
MinDegenContig - the minimum size of a degenerate contig.
MaxDegenContig - the maximum size of a degenerate contig.
DegenPercentBases - the ratio (as percentage points) between DegenContigLength and TotalBasesInScaffolds. Note that degenerate contigs are not counted as part of TotalBasesInScaffolds.
Top 5 Contigs - a listing of the 5 largest contigs. For each contig we report the number of reads and the size.
NumSurrogates - number of surrogates present in the assembly.
SurrogateSize - cumulative size of all surrogates in the assembly.
MinSurrogateSize - size of smallest surrogate.
MaxSurrogateSize - size of largest surrogate.
MeanSurrogateSize - mean size of a surrogate.
SDSurrogateSize - standard deviation of surrogate sizes assuming a normal distribution.
ReadsWithNoMate - number of reads (out of TotalReads) that did not have a mate.
ReadsWithBadMate - number of reads (out of TotalReads) that had a bad mate, i.e. a mate too far, too close, or with the incorrect orientation.
ReadsWithGoodMate - number of reads (out of TotalReads) that had a good mate.
ReadsWithUnusedMate - number of reads (out of TotalReads) whose mate was not used in the assembly.
TotalScaffoldLinks - number of links between scaffolds. These represent linking information currently conflicting with the existing scaffolds. The lower this number the better.
MeanScaffoldLinkWeight - average weight (# of mate pairs) of links between scaffolds.
TotalReads - the total number of reads included in the assembly.
ReadsInContigs - the number of reads that belong to contigs.
BigContigReads - number of reads that belong to contigs over 10kb in size.
SmallContigReads - number of reads that belong to contigs under 10kb in size.
DegenContigReads - number of reads in degenerate contigs.
ReadsInSurrogates - number of reads in surrogates: potentially repetitive or ambiguously placed contigs.
SingletonReads - number of reads that are neither in contigs, nor surrogates, nor degenerate contigs.
ContigsOnly - coverage (redundancy) of all contigs in scaffolds: length of all the reads in contigs or surrogates divided by the size of all scaffolds.
ContigsAndDegens - coverage of all contigs and degenerates: length of all the reads in contigs, surrogates, and degenerates divided by the size of all scaffolds and degenerates.
AllReads - coverage you paid for: length of all the reads divided by the size of the scaffolds.
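
Several glossary entries are derived from others, so a report can be sanity-checked by recomputing them. The sketch below recomputes a few values from the sample report. The relations used are inferred from the definitions above (and from the fact that a scaffold with n contigs has n - 1 gaps); they are not taken from the castats.pl source, and the function name derived_metrics is hypothetical.

```python
def derived_metrics(s):
    """Recompute a few .qc values from the primary counts in dict s."""
    gaps = s["TotalContigsInScaffolds"] - s["TotalScaffolds"]
    bases = s["TotalBasesInScaffolds"]
    return {
        "IntraScaffoldGaps": gaps,
        # Span minus bases is the total gap length across all scaffolds.
        "MeanSequenceGapSize": (s["TotalSpanOfScaffolds"] - bases) / gaps if gaps else 0.0,
        "BigContigsPercentBases": 100 * s["BigContigLength"] / bases,
        "DegenPercentBases": 100 * s["DegenContigLength"] / bases,
        # Reads partition into contigs, degenerates, surrogates, singletons.
        "TotalReads": s["ReadsInContigs"] + s["DegenContigReads"]
        + s["ReadsInSurrogates"] + s["SingletonReads"],
    }


# Values copied from the sample report above.
SAMPLE = {
    "TotalScaffolds": 6408, "TotalContigsInScaffolds": 6870,
    "TotalBasesInScaffolds": 18459625, "TotalSpanOfScaffolds": 18534790,
    "BigContigLength": 738500, "DegenContigLength": 12400379,
    "ReadsInContigs": 39816, "DegenContigReads": 30581,
    "ReadsInSurrogates": 2535, "SingletonReads": 66589,
}

if __name__ == "__main__":
    for name, value in derived_metrics(SAMPLE).items():
        print(name, round(value, 2))
```

For the sample, the recomputed gap count (462), percentages (4.00 and 67.18 after rounding), and read total (139521) match the report. The recomputed mean gap size is about 162.69, while the report shows 162, which suggests that value is truncated rather than rounded.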