Interpreting Celera Assembler Output
Table of contents
- Surrogates
- Degenerate contigs
- N50 size
- .qc file example
- Glossary
Surrogates
Repetitive or ambiguous sections of the genome are often identified
as surrogate contigs by Celera Assembler. These contigs are
incorporated into one or more other contigs, but only at the
consensus level. That means that the reads belonging to a surrogate
contig will not appear in the contigs that incorporate it. This is because
the assembler has determined that the contig may represent
different copies of a repeat (based on high arrival rate, i.e. large
depth of coverage). It will place in the assembly as many copies of
this contig as it can based on mate links, but there are no guarantees
that all copies of the repeat will be found. It is also
possible that the surrogate is not a repeat at all, but just a contig
with unusually deep read coverage. Surrogates are uploaded into the
database as contigs, but the assembly..comment field
contains the string "CA_FREE". They should be ignored when running
Bamboo and should only be used when trying to close the corresponding
repeats.
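As a minimal sketch of how the comment markers can be used to filter contigs, the Python snippet below separates ordinary contigs from surrogates ("CA_FREE") and from degenerate contigs ("CA_DEGEN", described in the next section). The record layout (dictionaries with `id` and `comment` keys) is hypothetical and only stands in for whatever the database query actually returns.

```python
def classify_contig(comment):
    """Classify a contig from its comment string (markers from this document)."""
    comment = comment or ""
    if "CA_FREE" in comment:
        return "surrogate"
    if "CA_DEGEN" in comment:
        return "degenerate"
    return "regular"


# Hypothetical records; real ones would come from the assembly database.
contigs = [
    {"id": "c1", "comment": ""},
    {"id": "c2", "comment": "CA_FREE"},
    {"id": "c3", "comment": "CA_DEGEN"},
]

# Keep only the contigs that are neither surrogates nor degenerates.
to_keep = [c["id"] for c in contigs if classify_contig(c["comment"]) == "regular"]
print(to_keep)
```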
Degenerate contigs
Some contigs fail to be incorporated into scaffolds by the Celera
Assembler, generally based on arrival rate statistics (low arrival
rate, or low coverage). These contigs frequently contain contaminants
or poor quality reads. They generally can be ignored when analyzing or
closing the genome. A very low-coverage assembly, however, may have a
significant portion of its output in degenerate contigs. Note that
singleton reads are a separate category, and not counted as contigs at
all. Degenerate contigs are uploaded into the database as contigs,
but the assembly..comment field contains the string
"CA_DEGEN".
N50 size
The N50 size of a set of entities (e.g., contigs or scaffolds) is the
size of the largest entity E such that at least half of the total
size of the entities is contained in entities at least as large as E. For
example, if we have a collection of contigs with sizes 7, 4, 3, 2, 2, 1,
and 1 kb (20 kb in total), the N50 size is 4 kb: the contigs of 4 kb or
more (7 + 4 = 11 kb) cover at least 10 kb, while the 7 kb contig alone
does not.
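The definition can be sketched in a few lines of Python. This is an illustration of the definition above, not the code used by castats.pl; the function name `n50` and the optional `genome_size` argument (mirroring the idea of an estimated genome size) are choices made for this example.

```python
def n50(sizes, genome_size=None):
    """Return the N50 of `sizes`, or None if they cover less than half the total.

    If `genome_size` is given it replaces the sum of the sizes as the total.
    """
    total = genome_size if genome_size is not None else sum(sizes)
    running = 0
    # Walk from the largest entity down until half of the total is covered.
    for size in sorted(sizes, reverse=True):
        running += size
        if 2 * running >= total:
            return size
    return None


contig_sizes_kb = [7, 4, 3, 2, 2, 1, 1]
print(n50(contig_sizes_kb))                  # example from the text
print(n50(contig_sizes_kb, genome_size=30))  # a larger assumed genome lowers the N50
```

With a genome size larger than the sum of the contigs, more (and smaller) contigs are needed to reach the halfway point, so the reported N50 drops.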
.qc file example
The following sample output was produced by castats.pl. Note that
by default the estimated genome size is the same as the number of bases
in scaffolds. You can specify a different genome size (used to
calculate N50 sizes) by passing option -g to castats.pl.
EstimatedGenomeSize 18459625

Scaffolds
---------
TotalScaffolds 6408
TotalContigsInScaffolds 6870
MeanContigsPerScaffold 1.07
MinContigsPerScaffold 1
MaxContigsPerScaffold 4

TotalBasesInScaffolds 18459625
MeanBasesInScaffolds 2880
MinBasesInScaffolds 1019
MaxBasesInScaffolds 21154
N50ScaffoldBases 3386

TotalSpanOfScaffolds 18534790
MeanSpanOfScaffolds 2892
MinScaffoldSpan 1019
MaxScaffoldSpan 21775
IntraScaffoldGaps 462
MeanSequenceGapSize 162

Top 5 Scaffolds: #contigs   size   span  avgContig  avgGap
-----------------------------------------------------
              0         3  21154  21775    7051.33  310.50
              1         3  20056  20105    6685.33   24.50
              2         1  19870  19870   19870.00    0.00
              3         4  18474  19378    4618.50  301.33
              4         1  18045  18045   18045.00    0.00

Contigs
-------
TotalContigsInScaffolds 6870
TotalBasesInScaffolds 18459625
MeanContigSize 2686.99
MinContigSize 951
MaxContigSize 19870
N50ContigBases 3073

Big Contigs (>10000)
-----------------
TotalBigContigs 60
BigContigLength 738500
MeanBigContigSize 12308.33
MinBigContig 10025
MaxBigContig 19870
BigContigsPercentBases 4.00%

Small Contigs (<10000)
-----------------
TotalSmallContigs 6810
SmallContigLength 17721125
MeanSmallContigSize 2602.22
MinSmallContig 951
MaxSmallContig 9992
SmallContigsPercentBases 96.00%

Degenerate Contigs
------------------
TotalDegenContigs 10214
DegenContigLength 12400379
MeanDegenContigSize 1214.06
MinDegenContig 92
MaxDegenContig 10907
DegenPercentBases 67.18%

Top 5 Contigs: reads  bases
--------------------------
            0     22  19870
            1     79  18045
            2     45  17131
            3     57  16413
            4     64  16221

Surrogates
----------
NumSurrogates 541
SurrogateSize 842205
MinSurrogateSize 505
MaxSurrogateSize 8278
MeanSurrogateSize 1556.76
SDSurrogateSize 840.94

Mates
-----
ReadsWithNoMate 3573
ReadsWithBadMate 18
ReadsWithGoodMate 12604
ReadsWithUnusedMate 123326
TotalScaffoldLinks 1
MeanScaffoldLinkWeight 2.00

Reads
-----
TotalReads 139521
ReadsInContigs 39816
BigContigReads 2359
SmallContigReads 37457
DegenContigReads 30581
ReadsInSurrogates 2535
SingletonReads 66589

Coverage
--------
ContigsOnly 1.96
ContigsAndDegens 3.35
AllReads 6.34
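For scripted use of these statistics, the following Python sketch collects the values into a dictionary. It assumes that each statistic sits on its own line as a name followed by a single value, as in the sample; the exact layout of a real .qc file may differ, and the function name `parse_qc` is made up for this example. Section headers, rule lines, and the "Top 5" tables are skipped.

```python
def parse_qc(text):
    """Collect `Name value` lines into a dict of ints and floats."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Statistic lines have exactly two fields and start with a letter;
        # rule lines and "Top 5" table rows do not.
        if len(fields) != 2 or not fields[0][0].isalpha():
            continue
        name, raw = fields
        raw = raw.rstrip("%")  # percentages are stored as plain numbers
        try:
            stats[name] = float(raw) if "." in raw else int(raw)
        except ValueError:
            continue  # two-word section headers such as "Degenerate Contigs"
    return stats


sample = """\
Scaffolds
---------
TotalScaffolds 6408
MeanContigsPerScaffold 1.07
Degenerate Contigs
------------------
DegenPercentBases 67.18%
"""
stats = parse_qc(sample)
print(stats)
```

Statistics that appear in more than one section (TotalContigsInScaffolds and TotalBasesInScaffolds in the sample) carry the same value each time, so keeping the last occurrence loses nothing.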
Glossary
EstimatedGenomeSize - the estimated genome size, used in computing N50 values. Usually equal to the number of bases in scaffolds, unless otherwise specified through option -g on the command line of castats.pl.
TotalScaffolds - the total number of scaffolds in the assembly.
TotalContigsInScaffolds - the total number of contigs that made it into scaffolds. Contigs that do not belong to scaffolds are called degenerate and generally can be ignored.
MeanContigsPerScaffold - the average number of contigs in a scaffold.
MinContigsPerScaffold - the minimum number of contigs in a scaffold.
MaxContigsPerScaffold - the maximum number of contigs in a scaffold.
TotalBasesInScaffolds - the sum of all contig sizes for the contigs in scaffolds.
MeanBasesInScaffolds - the average scaffold size. The size of a scaffold is the sum of the sizes of all contigs contained in that scaffold.
MinBasesInScaffolds - the minimum size of a scaffold.
MaxBasesInScaffolds - the maximum size of a scaffold.
N50ScaffoldBases - the N50 scaffold size.
TotalSpanOfScaffolds - the sum of all contig sizes and gaps in all scaffolds.
MeanSpanOfScaffolds - the average span of a scaffold.
MinScaffoldSpan - the minimum span of a scaffold.
MaxScaffoldSpan - the maximum span of a scaffold.
IntraScaffoldGaps - the number of sequencing gaps in all scaffolds.
MeanSequenceGapSize - the average size of a sequencing gap.
Top 5 Scaffolds - a listing of the 5 largest scaffolds. For each scaffold we report the number of contigs, size, and span as well as the average contig and average sequencing gap sizes.
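The scaffold statistics are related to one another, which gives a quick sanity check on a .qc file. The Python sketch below recomputes the derived values from the totals in the sample output above. The formulas follow from the glossary definitions; the use of integer division for the whole-number means is an inference from the sample values, not documented behaviour of castats.pl.

```python
# Totals taken from the sample .qc output.
total_scaffolds = 6408
total_contigs_in_scaffolds = 6870
total_bases_in_scaffolds = 18459625
total_span_of_scaffolds = 18534790

# A scaffold with k contigs has k - 1 gaps between them.
intra_scaffold_gaps = total_contigs_in_scaffolds - total_scaffolds

mean_contigs_per_scaffold = round(total_contigs_in_scaffolds / total_scaffolds, 2)

# Integer division matches the whole-number means in the sample (an inference).
mean_bases_in_scaffolds = total_bases_in_scaffolds // total_scaffolds
mean_span_of_scaffolds = total_span_of_scaffolds // total_scaffolds

# Span minus bases is the total gap length.
total_gap_length = total_span_of_scaffolds - total_bases_in_scaffolds
mean_sequence_gap_size = total_gap_length // intra_scaffold_gaps

print(intra_scaffold_gaps, mean_contigs_per_scaffold,
      mean_bases_in_scaffolds, mean_span_of_scaffolds, mean_sequence_gap_size)
```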
MeanContigSize - the average contig size.
MinContigSize - the minimum contig size.
MaxContigSize - the maximum contig size.
N50ContigBases - the N50 contig size.
TotalBigContigs - the number of contigs bigger than 10kb.
BigContigLength - the sum of the sizes of all contigs bigger than 10kb.
MeanBigContigSize - the average size of the contigs over 10kb.
MinBigContig - the minimum contig size in contigs over 10kb.
MaxBigContig - the maximum contig size in contigs over 10kb. Should be the same as MaxContigSize.
BigContigsPercentBases - the percentage of TotalBasesInScaffolds contained in contigs over 10kb.
TotalSmallContigs - the number of contigs smaller than 10kb.
SmallContigLength - the sum of the sizes of all contigs smaller than 10kb.
MeanSmallContigSize - the average size of contigs under 10kb.
MinSmallContig - the minimum contig size in contigs under 10kb. Should be the same as MinContigSize.
MaxSmallContig - the maximum contig size in contigs under 10kb.
SmallContigsPercentBases - the percentage of TotalBasesInScaffolds contained in contigs under 10kb.
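The big and small contig statistics amount to partitioning the scaffolded contigs at 10 kb and summarizing each part. A minimal Python sketch follows; the function name, the returned keys, and the treatment of a contig of exactly 10000 bases (counted as small here) are assumptions made for the example, since the glossary does not say how ties are handled.

```python
def split_by_size(contig_sizes, threshold=10000):
    """Summarize contigs above and at-or-below `threshold` bases."""
    total = sum(contig_sizes)
    big = [s for s in contig_sizes if s > threshold]
    small = [s for s in contig_sizes if s <= threshold]  # ties: an assumption

    def summarize(sizes):
        if not sizes:
            return None
        return {
            "count": len(sizes),
            "length": sum(sizes),
            "mean": sum(sizes) / len(sizes),
            "min": min(sizes),
            "max": max(sizes),
            # Share of all scaffolded bases, as a percentage.
            "percent_bases": 100.0 * sum(sizes) / total,
        }

    return summarize(big), summarize(small)


# Hypothetical contig sizes, in bases.
big, small = split_by_size([12000, 15000, 3000, 5000, 5000])
print(big)
print(small)
```

Because every scaffolded contig falls into exactly one of the two groups, the two percentages add up to 100, as they do in the sample output (4.00% and 96.00%).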
TotalDegenContigs - the number of degenerate contigs (contigs that do not appear in scaffolds).
DegenContigLength - the sum of the sizes of all degenerate contigs.
MeanDegenContigSize - the average size of degenerate contigs.
MinDegenContig - the minimum size of a degenerate contig.
MaxDegenContig - the maximum size of a degenerate contig.
DegenPercentBases - the ratio, expressed as a percentage, between DegenContigLength and TotalBasesInScaffolds. Note that degenerate contigs are not counted as part of TotalBasesInScaffolds, so this value can exceed 100%.
Top 5 Contigs - a listing of the 5 largest contigs. For each contig we report the number of reads and the size.
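Because degenerate contigs are not part of TotalBasesInScaffolds, DegenPercentBases compares two disjoint quantities rather than a part with its whole. The short Python check below reproduces the value from the sample output; rounding to two decimals is an assumption based on how the sample is printed.

```python
# Values taken from the sample .qc output.
degen_contig_length = 12400379
total_bases_in_scaffolds = 18459625

# Ratio of degenerate bases to scaffolded bases, as a percentage.
degen_percent_bases = round(100.0 * degen_contig_length / total_bases_in_scaffolds, 2)
print(degen_percent_bases)
```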
NumSurrogates - number of surrogates present in the assembly.
SurrogateSize - cumulative size of all surrogates in the assembly.
MinSurrogateSize - size of smallest surrogate.
MaxSurrogateSize - size of largest surrogate.
MeanSurrogateSize - mean size of a surrogate.
SDSurrogateSize - standard deviation of surrogate sizes assuming a normal distribution.
ReadsWithNoMate - number of reads (out of TotalReads) that did not have a mate.
ReadsWithBadMate - number of reads (out of TotalReads) that had a bad mate, i.e. a mate too far, too close, or with the incorrect orientation.
ReadsWithGoodMate - number of reads (out of TotalReads) that had a good mate.
ReadsWithUnusedMate - number of reads (out of TotalReads) whose mate was not used in the assembly.
TotalScaffoldLinks - number of links between scaffolds. These represent linking information currently conflicting with the existing scaffolds. The lower this number the better.
MeanScaffoldLinkWeight - average weight (# of mate pairs) of links between scaffolds.
TotalReads - the total number of reads included in the assembly.
ReadsInContigs - the number of reads that belong to contigs.
BigContigReads - number of reads that belong to contigs over 10kb in size.
SmallContigReads - number of reads that belong to contigs under 10kb in size.
DegenContigReads - number of reads in degenerate contigs.
ReadsInSurrogates - number of reads in surrogates: potentially repetitive or ambiguously placed contigs.
SingletonReads - number of reads that are neither in contigs, nor surrogates, nor degenerate contigs.
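In the sample output, the read counts partition TotalReads in two independent ways: by mate status and by where the read ended up. The Python sketch below checks both. Whether these identities hold for every assembly is an assumption; they are shown here only because they hold for the sample values.

```python
# Values taken from the sample .qc output.
qc = {
    "TotalReads": 139521,
    "ReadsWithNoMate": 3573,
    "ReadsWithBadMate": 18,
    "ReadsWithGoodMate": 12604,
    "ReadsWithUnusedMate": 123326,
    "ReadsInContigs": 39816,
    "BigContigReads": 2359,
    "SmallContigReads": 37457,
    "DegenContigReads": 30581,
    "ReadsInSurrogates": 2535,
    "SingletonReads": 66589,
}

# Partition by mate status.
by_mate_status = (qc["ReadsWithNoMate"] + qc["ReadsWithBadMate"]
                  + qc["ReadsWithGoodMate"] + qc["ReadsWithUnusedMate"])

# Partition by placement in the assembly.
by_placement = (qc["ReadsInContigs"] + qc["DegenContigReads"]
                + qc["ReadsInSurrogates"] + qc["SingletonReads"])

# Reads in scaffolded contigs split into big and small contigs.
in_contigs = qc["BigContigReads"] + qc["SmallContigReads"]

print(by_mate_status, by_placement, in_contigs)
```

A mismatch in any of the three sums on a real file would suggest either a truncated .qc file or a read category not covered by this glossary.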
ContigsOnly - coverage (redundancy) of all contigs in scaffolds: length of all the reads in contigs or surrogates divided by the size of all scaffolds.
ContigsAndDegens - coverage of all contigs and degenerates: length of all the reads in contigs, surrogates, and degenerates divided by the size of all scaffolds and degenerates.
AllReads - coverage you paid for: length of all the reads divided by the size of the scaffolds.
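The three coverage values can be written down directly from these definitions. The Python sketch below does so under stated assumptions: the total read length is already known per category (the .qc file reports read counts, not read lengths), "size of all scaffolds" is taken to mean the bases in scaffolds rather than the span, and all names and input numbers are hypothetical.

```python
def coverage(read_bases, scaffold_bases, degen_bases):
    """Compute the three coverage values from total read length per category.

    `read_bases` maps 'contigs', 'surrogates', 'degenerates' and 'singletons'
    to the summed length of the reads in that category.
    """
    in_contigs = read_bases["contigs"] + read_bases["surrogates"]
    in_contigs_and_degens = in_contigs + read_bases["degenerates"]
    all_reads = in_contigs_and_degens + read_bases["singletons"]
    return {
        "ContigsOnly": in_contigs / scaffold_bases,
        "ContigsAndDegens": in_contigs_and_degens / (scaffold_bases + degen_bases),
        "AllReads": all_reads / scaffold_bases,
    }


# Hypothetical totals, in bases.
result = coverage(
    {"contigs": 1800, "surrogates": 200, "degenerates": 500, "singletons": 1500},
    scaffold_bases=1000,
    degen_bases=250,
)
print(result)
```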