Graphs to Diversity: extracting genomic variation from sequence graphs

People

Faculty: Mihai Pop, Carl Kingsford

Students:
Current: Geet Duggal, Rob Patro, Lee Mendelowitz,
Alumni: Sergey Koren, Grecia Lapizco-Encinas,Henry Lin, Joshua Wetzel, Niranjan Nagarajan, Saket Navlakha,

Also see the undergraduate summer research program initiated with support from this project. This research program is currently in it's fourth year and has included both undergraduate and High School students.

Funding

Our work is supported by the NSF through grant IIS-0812111 to Mihai Pop and Carl Kingsford

Sequencing as a tool for uncovering genome variation

Most formulations of the genome assembly problem make the implicit assumption that the data being sequenced represents a single piece of DNA, therefore, the task of the assembly process is to generate one single string that best explains the underlying read data. In practice, however, this assumption is often incorrect. The genomes of humans, as well as those of most complex organisms, contain two copies of each chromosome, one from each parent. While overall similar, these chromosomes contain many differences which need to be identified and characterized during the assembly process. Initial efforts at mapping the common patterns of variation in human populations (the HapMap project) have already led to the discovery of several genes involved in complex diseases such as diabetes.

Furthermore, advances in sequencing technologies have made possible the sequencing of bacteria directly from an environment (referred to as environmental sequencing or metagenomics), providing the basis for large-scale studies of microbial diversity. Microbial diversity is important to understand because, for example, bacteria use variation to adapt to changing environments or as a means to escape the immune system. The study of genomic variation is also particularly relevant in viruses, where, due to high mutation rates, homogeneous populations are rarely encountered. Virologists often refer to the concept of quasi-species to describe a heterogeneous mixture of rapidly mutating viral particles. Efforts to combat viruses, such as HIV, can be greatly helped by a better understanding of the structure of quasi-species in viral infections.

Prior to the availability of inexpensive high-throughput sequencing technologies, genomic variation was studied in a targeted fashion. The HapMap project mentioned above, for example, studied a single type of variation: single nucleotide polymorphisms (SNPs) a single base differences between homologous chromosomes. Studies of variation in bacterial populations have primarily relied on the manual inspection of the data. Initial software developed in this field focus primarily on assisting manual analysis rather than automated procedures for characterizing the bacterial diversity (see Strainer).

The focus of our research is to develop automated tools that can extract genomic variation from genome assembly graphs. We are interested in both graphs arising from the co-assembly of multiple bacterial genomes (e.g. arising from metagenomic studies), as well as in the assembly of polymorphic eukaryotic genomes. Furthermore, we are particularly interested in the impact of new high-throughput experimental techniques such as sequencing or mapping technologies, on our ability to characterize genomic variation within genomic assemblies.

Research sub-projects

Assembly graph analysis We developed a series of modules to be added to the AMOS assembly package for performing several analyses on assembly graphs. These include: (i) a new approach for detecting genomic repeats that marks as repeats nodes in the graph that are shared by many shortest paths instead of using coverage-based statistics; (ii) modules for identifying and reporting certain types of sub-graph motifs that are indicative of genome variation; (iii) a series of graph simplifications procedures that can be applied to assembly graphs in order to increase the size of contigs generated.

Parametric complexity of genome assembly We explored the effect of the interplay between read length and repeat size on the computational difficulty of assembling a genome. We revealed that, despite recent NP-hardness results, genome assembly can be performed in polynomial time for a wide range of repeat sizes, though in many cases the problem is under-constrained. This work provides insight into the apparent disconnect between two competing views of assembly complexity: biologists consider assembly difficult when genomic repeats are long (underconstrained problem solvable in polynomial time); while computer scientists find the problem difficult (NP-hard) when repeats are short (assembly is a constrained optimization problem).

Prokaryotic Genome Complexity We describe an application of de Bruijn graphs - a theoretical framework underlying several modern genome assembly programs - to analyze the global repeat structure of prokaryotic genomes. We provide the 1st survey of the repeat structure of a large number of genomes, and make publicly available the resulting simplified de Bruijn graphs.

What information is most useful to solve assembly problems? We are exploring ways in which information provided by high-throughput genomic experiments can be used to guide the genome assembly process. As subquestions we have asked: How useful are mate-pairs? Empirically it is widely accepted that mate-pair data provide useful information in the assembly process, yet this observation has not yet been thoroughly evaluated in a controlled setting. We are currently evaluating the "usefulness" of mate-pairs through simulations performed on a collection of whole-genome deBruijn graphs constructed from complete microbial genomes. How useful are optical maps? We have started developing new algorithms for utilizing mapping data within the assembly process. Optical mapping data can provide clues on how to resolve repeats, similar to the way mate-pair information is used. However, these data contain more information than mate-pairs as they can provide links across much longer distances (from hundreds of kbp up to whole chromosomes).

Understanding connectivity in graphs to speed up NP-hard graph problems We explore the notion of edge connectivity to better understand its theoretical properties and to use that increase understanding to create improved heuristic algorithms for NP-hard problems that work better on real-world graphs. In particular, we have characterized the "exactly 3-edge-connected graphs" by giving a synthesis for generating them all in the spirit of Tutte's synthesis of 3-vertex connected graphs.

Clustering graphs and other data to find natural modules We develop new graph clustering algorithms to find natural biological units in large graphs. We also apply the clustering ideas to more accurately determine OTU diversity in metagenomic populations and to label sequences with their OTU classification.

Solving clique problems via particle swarms We explore the use of particle swarm heuristics to solve NP-hard problems. We extend traditional real-valued particle swarms to handle combinatorial problems, in particular the minimum weighted clique problem.



Publications



Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation