Genome Assembly Validation

CBCB faculty: Steven Salzberg, James Yorke, Art Delcher, Mihai Pop
CBCB students and staff: Adam Phillippy, Mike Schatz

Despite continued advances in assembly algorithms, few tools are available to evaluate the correctness of the assemblies they generate. With the exception of the few genomes that are manually curated by experts during an expensive process called finishing, most genome data is published as "draft" assemblies whose quality is uncertain. The only quality measure used on a large scale is the assignment of a phred quality score (a scaled log-probability of error) to each base in the output of the assembler. Assembly quality is then summarized by the number of Q20 bases (bases with a phred score of 20 or higher), i.e. the portion of the genome where the estimated error rate is at most 1 in 100 bases. This localized measure cannot be used to assess the long-range connectivity of the assembly or the placement of reads along the genome. Correct long-range connectivity is an essential prerequisite for any comparative genomic study, as mis-assemblies can lead to incorrect conclusions.
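
The relationship between a phred score and an error probability can be made concrete with a small sketch. It uses the standard phred definition Q = -10 log10(P); the function names and the sample quality values are hypothetical and chosen only for illustration.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P to a phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def count_q20_bases(quals, threshold=20):
    """Count bases whose phred score meets the threshold (Q20 by default)."""
    return sum(1 for q in quals if q >= threshold)


# Hypothetical per-base quality values for a short consensus stretch.
example_quals = [10, 20, 35, 19]
print(phred_to_error_prob(20))         # error probability at Q20
print(count_q20_bases(example_quals))  # number of Q20 bases
```

Note that this count says nothing about whether the bases are in the right order or location, which is the limitation described above.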

Our group has been developing assembly validation tools that make use of all available information about the assembly. We are developing both visual interfaces that enable the manual inspection of assemblies and automated tools for detecting and correcting mis-assemblies. We are exploring the use of varied sources of information that provide clues regarding the correctness of assemblies. Examples of such data are:

  • Mate-pair information - In most cases, shotgun reads are obtained by sequencing both ends of DNA fragments whose approximate size is known. This information constrains the placement of the reads within the assembly. In an ideal assembly, all read pairs are placed in such a manner as to satisfy the orientation and distance constraints imposed by the sequencing library. Most types of mis-assemblies lead to violations of these constraints. Our software tools identify such constraint violations and attempt to characterize the specific type of mis-assembly.
  • Unused read information - Not all reads provided as input to an assembler are used in the final assembly. The unused reads, also called singletons, are often contaminants or insufficiently trimmed reads from the genome. Mis-assemblies, however, also lead to the presence of unused reads, as they are inconsistent with the chosen reconstruction of the genome. As an example, the reads spanning the join point of two copies of a tandem repeat are listed as singletons when the assembler incorrectly collapses this repeat. By aligning the singletons to the contigs produced by the assembler we can identify such misassemblies.
  • Correlated polymorphisms - Mis-assemblies are characterized by the incorrect placement of reads within the assembly. Reads generated from different copies of the same repeat are assembled together if the repeat copies are sufficiently similar. Such situations can be identified by examining differences between the reads that cover the mis-assembled region. While differences between reads are expected due to sequencing errors, such differences are usually uncorrelated, leading to a very low probability that two overlapping reads have the same sequencing error at the exact same location. In the case of mis-assemblies, however, such differences are correlated, providing a recognizable signature.
  • Experimental mapping data - For some genomes, scientists perform mapping experiments that identify the locations along the chromosomes of a set of markers. By comparing these experimental maps with the in silico placement of the markers along contigs, we can identify assembly errors highlighted by differences between these maps.

Projects

Assembly viewer

The AMOS Assembly Viewer is a tool for interactively investigating a genomic assembly at all levels, including the raw signal of the chromatograms, the multiple alignment of reads in contigs, and contigs and inserts placed along scaffolds. The goal was to empower the user by bringing together and displaying all relevant assembly information in a single interactive tool.

The main window of the viewer displays the multiple alignment of reads within contigs, and lets one view the bases of the reads and the consensus sequence. The chromatogram signal and quality values of the reads can optionally be displayed, as can the trimmed, unassembled portion of each read. One can quickly and easily navigate to any position in any contig, or scan contigs for regions of disagreement between the reads. Alternatively, the consensus sequence of a contig can be searched by regular expression.

The Inserts view of the assembly shows how the contigs and inserts are placed on the scaffold. It uses the library sizes to categorize the "happiness" of each insert, meaning it displays whether the paired reads are correctly oriented and at the expected distance apart. The threshold distance for a "happy" insert can be adjusted by setting the maximum number of standard deviations by which an insert may differ from the library mean. Details on any object displayed in the Inserts view can be found by clicking on it. The mate of any unhappy read is highlighted by right-clicking on the read.
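
The "happiness" test described above can be sketched in a few lines. This is an illustration only, not the viewer's code: the function name, the category labels, the default of 3 standard deviations, and the assumption that correctly oriented mates face each other ('+' on the left, '-' on the right) are all hypothetical.

```python
def classify_insert(left_strand, right_strand, insert_size,
                    lib_mean, lib_sd, max_sd=3.0):
    """Classify one insert as happy, compressed, stretched, or misoriented.

    left_strand / right_strand: '+' or '-' for the leftmost and rightmost
    read of the pair. Assumes (hypothetically) that a valid pair points
    inward, i.e. '+' on the left and '-' on the right.
    """
    if (left_strand, right_strand) != ("+", "-"):
        return "misoriented"
    # Number of library standard deviations away from the library mean.
    deviation = (insert_size - lib_mean) / lib_sd
    if deviation < -max_sd:
        return "compressed"
    if deviation > max_sd:
        return "stretched"
    return "happy"


# Hypothetical library: mean 3000 bp, standard deviation 300 bp.
for size in (3000, 1500, 4500):
    print(size, classify_insert("+", "-", size, 3000, 300))
```

Lowering `max_sd` makes the test stricter and flags more inserts; raising it tolerates more variation in fragment size.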

The Inserts view also plots both the read and insert coverage at each position along the scaffold, highlighting positions of low coverage or low linking coverage. The viewer can also be used to highlight arbitrary features along the scaffold. This functionality is currently used to highlight regions of the genome where the assembly has a high density of unhappy inserts, or a high density of correlated SNPs. Both are strong evidence of mis-assembly.

Another display is the Contig Graph, which shows all links present between contigs, including those that conflict with the linearized scaffold. Finally, there are selectable tabular displays of reads within contigs, contigs within scaffolds, and features within contigs for quick navigation and summary statistics.

Please visit the AMOS Assembly Viewer webpage for more information.


Detection of correlated mate-pair violations

Most common mis-assemblies can be identified from the pattern of mate-pair violations within an assembly. Collapsed repeats lead to mate-pairs that appear compressed with respect to the estimated library sizes. Over-estimates of the number of tandem repeat copies lead to mate-pairs that appear stretched, while rearrangements and inversions lead to mis-oriented mate-pairs. Such mate-pair violations, however, also occur due to incorrect sizing of the DNA fragments or due to incorrect mating information being provided to the assembler. To prevent such errors from falsely indicating the presence of mis-assemblies, we restrict our analysis to clusters of mate-pairs that all exhibit the same type of constraint violation. Before analyzing the mate-pairs, we also recompute the library sizes based on the information contained in the assembly itself. The initial size estimates provided by sequencing centers are frequently incorrect due to limitations of the laboratory protocols.

Our group has developed a flexible tool that allows users to examine the patterns of mate-pair violations present in an assembly. This program, asmQC, can also be used to recompute library sizes based on assembly information. AsmQC is distributed as part of the AMOS project and can be used to generate both human readable output and AMOS bank features that can be displayed in our assembly viewer.
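
The two steps described in this section, re-estimating library sizes from the assembly and keeping only clusters of like violations, might look roughly as follows. This is a minimal sketch and not asmQC's implementation: the function names, the `max_gap` and `min_cluster` parameters, and the input format (a list of `(position, kind)` tuples) are assumptions made for illustration.

```python
from statistics import mean, stdev


def reestimate_library(observed_sizes):
    """Recompute library mean and standard deviation from the sizes of
    inserts whose two reads were both placed in the assembly."""
    return mean(observed_sizes), stdev(observed_sizes)


def cluster_violations(violations, max_gap=2000, min_cluster=3):
    """Group violations of the same kind that lie close together.

    violations: list of (position, kind) tuples.
    Returns (kind, start, end, count) for every run of at least
    min_cluster same-kind violations separated by at most max_gap bases.
    """
    by_kind = {}
    for pos, kind in violations:
        by_kind.setdefault(kind, []).append(pos)

    clusters = []
    for kind, positions in by_kind.items():
        positions.sort()
        run = [positions[0]]
        for pos in positions[1:]:
            if pos - run[-1] <= max_gap:
                run.append(pos)
            else:
                if len(run) >= min_cluster:
                    clusters.append((kind, run[0], run[-1], len(run)))
                run = [pos]
        if len(run) >= min_cluster:
            clusters.append((kind, run[0], run[-1], len(run)))
    return sorted(clusters, key=lambda c: c[1])


# Hypothetical violations: three nearby compressed pairs plus two isolated ones.
example = [(100, "compressed"), (600, "compressed"), (900, "compressed"),
           (50000, "stretched"), (1200, "misoriented")]
print(cluster_violations(example))
```

Isolated violations, which are more likely to come from mis-sized fragments or bad mating information, are dropped; only the compressed cluster is reported.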


Compression/expansion statistics for mate-pair violations

The problem of identifying compressed or expanded mate-pairs within an assembly can be formalized in statistical terms. Within an assembly, we are interested in identifying regions containing mate-pairs whose size distribution diverges from the overall distribution. This formulation lets us identify smaller mis-assemblies than is possible by examining each mate-pair in isolation. Jim Yorke and colleagues developed software that calculates, for each position in the genome, the deviation between the mean size of the fragments spanning that position and the overall mean size of all fragments (the C-E statistic). Regions where the C-E statistic is large in magnitude are statistically likely to represent mis-assemblies.
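
One way to turn this description into a number is to divide the difference between the local and overall means by the standard error of the local mean. That normalization, the function name `ce_statistic`, and the half-open interval convention are assumptions made for this sketch; they are not taken from the software described above.

```python
import math


def ce_statistic(inserts, position, lib_mean, lib_sd):
    """Compression/expansion score at one position.

    inserts: list of (start, end) coordinates of placed fragments.
    Returns (local_mean - lib_mean) / (lib_sd / sqrt(n)), where n is the
    number of fragments spanning the position, or None if none span it.
    Negative values suggest compression, positive values expansion.
    """
    spanning = [end - start for start, end in inserts
                if start <= position < end]
    if not spanning:
        return None
    local_mean = sum(spanning) / len(spanning)
    standard_error = lib_sd / math.sqrt(len(spanning))
    return (local_mean - lib_mean) / standard_error


# Hypothetical data: four 2000 bp fragments from a 3000 +/- 300 bp library.
inserts = [(0, 2000), (100, 2100), (200, 2200), (300, 2300)]
print(ce_statistic(inserts, 500, 3000, 300))
```

Because the standard error shrinks as more fragments span a position, several mildly short fragments together can produce a strong signal even when none of them would be flagged individually.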


Detection of correlated polymorphisms

In an ideal assembly, at each base within a contig, all the reads covering that base should agree with each other. In the case of mis-assemblies, however, the assembler will frequently align together reads originating from different repeat copies. In most cases, repeat copies differ from one another due to mutations acquired during evolution, leading to differences (polymorphisms) between overlapping reads. Such differences also occur due to sequencing errors. Due to the random nature of sequencing errors, however, two reads seldom contain the same error at the same position. We can thus detect mis-assemblies by identifying correlated polymorphisms within the multiple alignment of reads. The program findTcovSNPs (distributed as part of the AMOS project) can be used to identify such polymorphisms. In addition to identifying simple differences between reads at each position in the multiple alignment, findTcovSNPs also examines the phred quality values of the disagreeing bases, thereby giving users a measure of confidence in the presence of a polymorphism. Regions where multiple polymorphisms occur close to each other in the assembly are flagged so that they can be examined in the assembly viewer.
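
The idea of a correlated polymorphism can be illustrated with a simple column-wise check. This sketch is not the findTcovSNPs algorithm: the function name, the `min_reads` and `min_qual` thresholds, and the representation of an alignment column as a list of `(base, quality)` tuples are assumptions made for illustration.

```python
from collections import Counter


def correlated_snp_columns(columns, min_reads=2, min_qual=20):
    """Return indices of alignment columns with a correlated disagreement.

    columns: list of columns, each a list of (base, phred_quality) tuples,
    one per read covering that position. A column is flagged when at least
    min_reads confident reads (quality >= min_qual) support a base other
    than the most common one.
    """
    flagged = []
    for idx, column in enumerate(columns):
        confident = [base for base, qual in column if qual >= min_qual]
        counts = Counter(confident)
        if len(counts) < 2:
            continue  # all confident reads agree
        # Support for the best-supported base other than the majority base.
        minor_count = counts.most_common(2)[1][1]
        if minor_count >= min_reads:
            flagged.append(idx)
    return flagged


# Hypothetical columns from a multiple alignment of four reads.
columns = [
    [("A", 40), ("A", 35), ("A", 30), ("A", 38)],  # full agreement
    [("A", 40), ("A", 35), ("C", 30), ("C", 38)],  # two reads share a difference
    [("G", 40), ("G", 35), ("G", 30), ("T", 38)],  # lone difference
    [("G", 40), ("G", 35), ("T", 10), ("T", 12)],  # low-quality differences
]
print(correlated_snp_columns(columns))
```

Only the second column is flagged: a lone disagreement is consistent with a random sequencing error, and low-quality disagreements are discounted.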

This method is not always applicable, as in certain cases polymorphisms are inherent to the data being assembled. As an example, most eukaryotic genomes contain two copies of each chromosome, one inherited from each parent. Such assemblies naturally contain polymorphisms in regions where the two chromosome copies differ.