Genome Assembly Validation
CBCB faculty: Steven Salzberg, James Yorke, Art Delcher, Mihai Pop
CBCB students and staff: Adam Phillippy, Mike Schatz
Despite continued advances in the development of assembly
algorithms,
few tools are available that evaluate the correctness of the assemblies
generated. With the exception of the few genomes that are
manually curated by experts during an expensive process called finishing, most genome data is
published as "draft" assemblies whose quality is uncertain. The
only quality measure used on a large scale is the assignment of a phred quality score (log-probability
of error) to each base in the output of the assembler. The
assembly quality is then summarized by the number of Q20 bases (bases with a
phred score of 20 or higher), i.e. the portion of the genome where at
most 1 in 100 bases is expected to be incorrect. This localized measure cannot
be used to ascertain the quality of the long range connectivity of the
assembly nor the quality of the placement of reads along the
genome. The correctness of the long range connectivity of the
assembly is an essential prerequisite for any comparative genomic
studies, as mis-assemblies can lead to incorrect conclusions.
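As a concrete illustration of the Q20 measure, the standard phred definition relates a score Q to an error probability P by Q = -10 log10(P), so Q20 corresponds to P = 0.01. The following is a minimal sketch; the function names are illustrative and not part of any assembler's interface.

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by a phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score implied by an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def count_q20_bases(quals, threshold=20):
    """Number of bases whose phred score is at or above the threshold."""
    return sum(1 for q in quals if q >= threshold)
```

Note that the count says nothing about where the high-quality bases sit relative to one another, which is why it cannot capture long range connectivity.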
Our group has been developing assembly validation tools that make use
of all available information about the assembly. We are
developing both visual interfaces that enable the manual inspection of
assemblies and automated tools for detecting and correcting
mis-assemblies. We are exploring the use of varied sources of
information that provide clues regarding the correctness of
assemblies. Examples of such data are:
- Mate-pair information -
In most cases, shotgun reads are obtained by sequencing both ends of
DNA fragments whose approximate size is known. This information
constrains the placement of the reads within the assembly. In an
ideal assembly, all read pairs are placed in such a manner as to
satisfy the orientation and distance constraints imposed by the
sequencing library. Most types of mis-assemblies lead to
violations of these constraints. Our software tools identify such
constraint violations and attempt to characterize the specific type of
mis-assembly.
- Unused read information -
Not all reads provided as input to an assembler are used in the final
assembly. The unused reads, also called singletons, are often contaminants
or insufficiently trimmed reads from the genome. Mis-assemblies,
however,
also lead to the presence of unused reads, as they are inconsistent
with the chosen reconstruction of the genome. As an
example, the reads spanning the join point of two copies of a tandem
repeat are listed as singletons when the assembler incorrectly
collapses this repeat. By aligning the singletons to the contigs
produced by the assembler we can identify such misassemblies.
- Correlated polymorphisms -
Mis-assemblies are characterized by the incorrect placement of reads
within the assembly. Reads generated from different copies of the
same repeat are assembled together if the repeat copies are
sufficiently similar. Such situations can be identified by
examining differences between the reads that cover the mis-assembled
region. While differences between reads are expected due to
sequencing errors, such differences are usually uncorrelated, leading
to a very low probability that two overlapping reads have the same
sequencing error at exactly the same location. In the case of
mis-assemblies, however, the differences are correlated, providing a
recognizable signature.
- Experimental mapping data -
For some genomes, scientists perform mapping experiments that identify
the locations along the chromosomes of a set of markers. By
comparing these experimental maps with the in silico placement of the markers
along contigs, we can identify assembly errors highlighted by
differences between these maps.
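The comparison of an experimental map with the in silico marker placement can be sketched as an ordering problem: markers that fall outside the longest run that is increasing in both coordinate systems are candidates for assembly errors. This is only one possible formulation, not a description of our tools. The function name and the two marker-to-coordinate dictionaries are hypothetical, and the sketch assumes the contig is oriented in the same direction as the map.

```python
from bisect import bisect_left

def discordant_markers(map_pos, asm_pos):
    """Markers whose assembly order conflicts with the experimental map.

    map_pos, asm_pos: dicts mapping marker name -> coordinate.
    Returns the markers left out of the longest subsequence that is
    strictly increasing in assembly coordinate when sorted by map coordinate.
    """
    shared = sorted(set(map_pos) & set(asm_pos), key=lambda m: map_pos[m])
    values = [asm_pos[m] for m in shared]

    tails = []      # tails[k]: index of the smallest tail of a run of length k+1
    tail_vals = []  # assembly coordinates of those tails (kept sorted)
    prev = [-1] * len(values)  # back-pointers used to recover the run
    for i, v in enumerate(values):
        k = bisect_left(tail_vals, v)
        if k == len(tails):
            tails.append(i)
            tail_vals.append(v)
        else:
            tails[k] = i
            tail_vals[k] = v
        prev[i] = tails[k - 1] if k > 0 else -1

    # Walk the back-pointers from the end of the longest run.
    keep = set()
    i = tails[-1] if tails else -1
    while i != -1:
        keep.add(i)
        i = prev[i]
    return [m for idx, m in enumerate(shared) if idx not in keep]
```

For a contig placed in the reverse orientation, the assembly coordinates would need to be negated before calling the function.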
Projects
Assembly viewer
The AMOS Assembly
Viewer is a tool for interactively investigating a genomic assembly
at all levels, including the raw signal of the chromatograms, the
multiple alignment of reads in contigs, and contigs and inserts
placed along scaffolds. The goal was to empower the user by bringing
together
and displaying all relevant assembly information in a single
interactive tool.
The main window of the viewer displays the multiple
alignment of reads within contigs, and lets one view the bases of the
reads and the consensus sequence. The chromatogram signal and quality values
of the reads can optionally be displayed, as can the trimmed
unassembled portion of each read. One can quickly and easily navigate to
any position in any contig, or scan contigs for regions of disagreement
between the reads. Alternatively, the consensus sequence of a contig
can be searched by regular expression.
The Inserts view of the assembly shows how the contigs and
inserts are placed on the scaffold. It uses the library sizes to
categorize the "happiness" of each insert, that is, whether the
paired reads are correctly oriented and at the expected distance apart.
The threshold distance for a "happy" insert can be adjusted by setting the
maximum number of standard deviations by which an insert may differ from
the library mean. Details on any object displayed in the Inserts view can be
found by clicking on it. The mate of any unhappy read is highlighted by
right-clicking on the read.
The Inserts view also plots both the read and insert coverage at each position along
the scaffold, highlighting positions of low coverage or low linking
coverage. The viewer can also be used to highlight arbitrary features
along the scaffold. This functionality is currently used to highlight
regions of the genome where the assembly has a high density of
unhappy inserts or a high density of correlated SNPs.
Both are strong evidence of misassembly.
Another display is the Contig Graph, which shows all links present
between contigs, including those that conflict with the linearized
scaffold.
Finally, there are selectable tabular displays for reads within
contigs, contigs within scaffolds, and features within contigs for quick
navigation and summary statistics.
Please visit the AMOS
Assembly Viewer webpage for more information.
Detection of correlated mate-pair violations
Most common mis-assemblies can be identified from the pattern of
mate-pair violations within an assembly. Collapsed repeats lead
to mate-pairs that appear compressed with respect to estimated library
sizes. Overestimating the number of tandem repeat copies leads to
mate-pairs that appear stretched, while rearrangements and inversions
lead to mis-oriented mate-pairs. Such mate-pair violations,
however, also occur due to incorrect sizing of the DNA fragments or due
to incorrect mating information being provided to the assembler.
In order to prevent such errors from falsely indicating the presence of
mis-assemblies we restrict our analysis to clusters of mate-pairs that
all exhibit the same type of constraint violation. Before
analyzing the mate-pairs we also recompute the library sizes based on
the information contained in the assembly itself. The initial
size estimates provided by sequencing centers are frequently incorrect
due to limitations of the laboratory protocols.
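The two safeguards described above can be sketched in a few lines. This is a simplified illustration and not the method implemented in our software: the trimmed-mean estimator, the window size, the minimum cluster size, and all names are assumptions chosen for the example.

```python
import statistics

def reestimate_library(observed_sizes, trim_fraction=0.1):
    """Re-estimate a library's mean and standard deviation from the insert
    sizes observed in the assembly, discarding the extreme tails so that
    mis-assembled or mis-mated pairs do not distort the estimate."""
    sizes = sorted(observed_sizes)
    k = int(len(sizes) * trim_fraction)
    core = sizes[k:len(sizes) - k] if k else sizes
    return statistics.mean(core), statistics.stdev(core)

def cluster_violations(violations, window=2000, min_size=3):
    """Group violations of the same type that lie close together.

    violations: iterable of (position, kind) pairs.
    Returns (kind, first_pos, last_pos, count) for each cluster with at
    least min_size members; isolated violations are dropped.
    """
    by_kind = {}
    for pos, kind in sorted(violations):
        by_kind.setdefault(kind, []).append(pos)

    clusters = []
    for kind, positions in by_kind.items():
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[-1] <= window:
                current.append(pos)
            else:
                if len(current) >= min_size:
                    clusters.append((kind, current[0], current[-1], len(current)))
                current = [pos]
        if len(current) >= min_size:
            clusters.append((kind, current[0], current[-1], len(current)))
    return clusters
```

Requiring several agreeing violations trades sensitivity for specificity: a single mis-sized fragment or a mislabeled mate never triggers a report on its own.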
Our group has developed a flexible tool that allows users to examine
the patterns of mate-pair violations present in an assembly. This
program, asmQC, can also be
used to recompute library sizes based on assembly information.
AsmQC is distributed as part of the AMOS project and can be used to
generate both human readable output and AMOS bank features that can be
displayed in our assembly viewer.
Compression/expansion statistics for mate-pair violations
The problem of identifying compressed or expanded mate-pairs within an
assembly can be elegantly formalized in statistical terms. Within
an assembly we are interested in identifying regions containing
mate-pairs whose size distribution diverges from the overall
distribution. This formulation provides us with the ability to
identify smaller mis-assemblies than is possible by examining each
mate-pair in isolation. Jim Yorke and colleagues developed
software that calculates, for each position in the genome, the
deviation between the mean size of fragments spanning the position and
the overall mean size of all fragments (the C-E
statistic). Regions with a high C-E statistic are
statistically likely to represent misassemblies.
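A minimal sketch of such a statistic is shown below. It assumes the deviation is normalized as a z-score, i.e. the difference between the local and overall mean divided by the standard error of the local mean; that normalization, the function name, and the input layout are assumptions made for the example, not a description of the software mentioned above.

```python
import math

def ce_statistic(inserts, position, lib_mean, lib_sd):
    """Z-score of the mean size of the inserts spanning a position.

    inserts: list of (start, end) coordinates from a single library.
    Returns None when no insert spans the position.
    """
    spanning = [end - start for start, end in inserts if start <= position < end]
    n = len(spanning)
    if n == 0:
        return None
    local_mean = sum(spanning) / n
    # The standard error shrinks with sqrt(n), so many mildly short (or long)
    # inserts can be significant even when no single insert is an outlier.
    return (local_mean - lib_mean) / (lib_sd / math.sqrt(n))
```

Under this sign convention, strongly negative values suggest compression and strongly positive values suggest expansion.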
Detection of correlated polymorphisms
In an ideal assembly, at each base within a contig, all the reads
covering that base should agree with each other. In the case of
misassemblies, however, the assembler will frequently align together
reads originating from different repeat copies. In most cases repeat
copies differ from each other due to mutations acquired during evolution,
leading to differences (polymorphisms) between overlapping reads. Such
differences also occur due to sequencing errors. Due to the random nature
of sequencing errors, however, two reads seldom contain the same
difference at the same position by chance. We can, thus,
detect misassemblies by identifying correlated polymorphisms within the
multiple alignment of reads. The program findTcovSNPs (distributed as part of
the AMOS project) can be
used to identify such polymorphisms. In addition to identifying
simple differences between reads at each position in the multiple
alignment, findTcovSNPs also examines the phred quality value of the
disagreeing bases thereby providing the users with a measure of
confidence in the presence of a polymorphism. Regions where
multiple polymorphisms occur close to each other in the assembly are
flagged so that they can be examined in the assembly viewer.
This method is not always applicable as in certain cases polymorphisms
are inherent to the data being assembled. As an example, most
eukaryotic genomes contain two copies of each chromosome, each received
from one of the two parents. Such assemblies naturally contain
polymorphisms in regions where the two chromosome copies differ.
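The column-wise test for correlated polymorphisms can be sketched as follows. This is an illustration only and does not reproduce the logic of findTcovSNPs: the function name, the input layout (one list of base and quality pairs per alignment column), and the thresholds are assumptions made for the example.

```python
from collections import Counter

def correlated_snp_columns(columns, consensus, min_reads=2, min_qual=20):
    """Find alignment columns where several reads agree on a non-consensus base.

    columns: one list per alignment column, holding (base, phred_quality)
             for every read covering that column.
    consensus: consensus sequence, one character per column.
    Returns (column_index, base, supporting_reads) for each variant base
    supported by at least min_reads reads of quality >= min_qual.
    """
    hits = []
    for i, column in enumerate(columns):
        # Count only confident disagreements; low-quality mismatches are
        # more likely to be sequencing errors.
        counts = Counter(base for base, qual in column
                         if base != consensus[i] and qual >= min_qual)
        for base, n in counts.items():
            if n >= min_reads:
                hits.append((i, base, n))
    return hits
```

For a diploid genome, columns reported by such a test include true heterozygous sites, so the hits would need to be interpreted alongside other evidence such as mate-pair violations.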