Graphs to Diversity: extracting genomic variation from sequence
graphs
People
Faculty: Mihai Pop, Carl Kingsford
Students: Saket Navlakha,
Grecia Lapizco-Encinas, Sergey Koren
Also see the undergraduate summer
research program related to this project.
Funding
Our work is supported by the NSF through grant
IIS-0812111
to Mihai Pop and Carl Kingsford
Sequencing as a tool for uncovering genome variation
Most formulations of the genome assembly
problem make the implicit assumption that the data being sequenced
represents a single piece of DNA, therefore, the task of the assembly process
is to generate one single string that best explains the underlying read data.
In practice, however, this assumption is often incorrect. The genomes of
humans, as well as those of most complex organisms, contain two copies of each
chromosome, one from each parent. While overall similar, these chromosomes
contain many differences which need to be identified and characterized during
the assembly process. Initial efforts at mapping the common patterns of
variation in human populations (the HapMap
project) have already led to the discovery of several genes involved in
complex diseases such as diabetes.
Furthermore, advances in sequencing technologies have made possible the
sequencing of bacteria directly from an environment (referred to as
environmental sequencing or metagenomics), providing the basis for large-scale
studies of microbial diversity. Microbial diversity is important to understand
because, for example, bacteria use variation to adapt to changing environments
or as a means to escape the immune system. The study of genomic variation is
also particularly relevant in viruses, where, due to high mutation rates,
homogeneous populations are rarely encountered. Virologists often refer to the
concept of quasi-species to describe a heterogeneous mixture of rapidly
mutating viral particles. Efforts to combat viruses, such as HIV, can be
greatly helped by a better understanding of the structure of quasi-species in
viral infections.
Prior to the availability of inexpensive high-throughput sequencing
technologies, genomic variation was studied in a targeted fashion. The HapMap
project mentioned above, for example, studied a single type of variation:
single nucleotide polymorphisms (SNPs) a single base differences between
homologous chromosomes. Studies of variation in bacterial populations have
primarily relied on the manual inspection of the data. Initial software
developed in this field focus primarily on assisting manual analysis rather
than automated procedures for characterizing the bacterial diversity (see Strainer).
The focus of our research is to develop automated tools that can extract
genomic variation from genome assembly graphs. We are interested in both
graphs arising from the co-assembly of multiple bacterial genomes (e.g. arising
from metagenomic studies), as well as in the assembly of polymorphic eukaryotic
genomes. Furthermore, we are particularly interested in the impact of new
high-throughput experimental techniques such as sequencing or mapping
technologies, on our ability to characterize genomic variation within genomic
assemblies.
Research sub-projects
Assembly graph analysis We developed a series of modules to be
added to the AMOS assembly package for performing several analyses on assembly
graphs. These include: (i) a new approach for detecting genomic repeats that
marks as repeats nodes in the graph that are shared by many shortest paths
instead of using coverage-based statistics; (ii) modules for identifying and
reporting certain types of sub-graph motifs that are indicative of genome
variation; (iii) a series of graph simplifications procedures that can be
applied to assembly graphs in order to increase the size of contigs
generated.
Parametric complexity of genome assembly We explored the effect
of the interplay between read length and repeat size on the computational
difficulty of assembling a genome. We revealed that, despite recent NP-hardness
results, genome assembly can be performed in polynomial time for a wide range
of repeat sizes, though in many cases the problem is under-constrained. This
work provides insight into the apparent disconnect between two competing views
of assembly complexity: biologists consider assembly difficult when genomic
repeats are long (underconstrained problem solvable in polynomial time); while
computer scientists find the problem difficult (NP-hard) when repeats are short
(assembly is a constrained optimization problem)
How useful are mate-pairs? Empirically it is widely accepted that
mate-pair data provide useful information in the assembly process, yet this
observation has not yet been thoroughly evaluated in a controlled setting. We
are currently evaluating the "usefulness" of mate-pairs through simulations
performed on a collection of whole-genome deBruijn graphs constructed from
complete microbial genomes.
Understanding connectivity in graphs to speed up NP-hard graph problems
We explore the notion of edge connectivity to better understand its
theoretical properties and to use that increase understanding to create
improved heuristic algorithms for NP-hard problems that work better on
real-world graphs. In particular, we have characterized the "exactly
3-edge-connected graphs" by giving a synthesis for generating them all in the
spirit of Tutte's synthesis of 3-vertex connected graphs.
Clustering graphs and other data to find natural modules We
develop new graph clustering algorithms to find natural biological units in
large graphs. We also apply the clustering ideas to more accurately determine
OTU diversity in metagenomic populations and to label sequences with their OTU
classification.
Solving clique problems via particle swarms We explore the use
of particle swarm heuristics to solve NP-hard problems. We extend traditional
real-valued particle swarms to handle combinatorial problems, in particular the
minimum weighted clique problem.
Prokaryotic Genome Complexity We
describe an application of de Bruijn graphs - a theoretical framework
underlying several modern genome assembly programs - to analyze the global
repeat structure of prokaryotic genomes. We provide the 1st survey of the
repeat structure of a large number of genomes, and make publicly available the
resulting simplied de Bruijn graphs.
Publications
- S. Navlakha, J. White, N. Nagarajan, M. Pop, and C. Kingsford. Finding Biologically
Accurate Clusterings in Hierarchical Tree Decompositions Using the Variation of
Information. In Proceedings of RECOMB 2009. Lecture Notes in Computer
Science 5541, pages 400-417. doi:10.1007/978-3-642-02008-7.
- G. Lapizco-Encinas, C. Kingsford, and J. Reggia. A Cooperative Combinatorial
Particle Swarm Optimization for Side-chain Packing. In Proceedings of
IEEE Swarm Intelligence Symposium, 2009, pages 22-29.
doi:10.1109/SIS.2009.4937840.
- S. Navlakha, M. Schatz, and C. Kingsford. Revealing Biological Modules via
Graph Summarization. Presented at RECOMB-SB/RG/DREAM3 satellite
conference, 2008. Journal version in J. Comp. Biol. 16(2):253-264, 2009.
- N. Nagarajan, M. Pop.
Parametric complexity of sequence assembly: theory and applications
to next generation sequencing.
J. Comp. Biol. 16(7): 897-908, 2009.
- M. Pop. Genome assembly reborn: recent computational challenges.
Brief. Bioinf. 10(4):354-366. 2009.
Any opinions, findings, and conclusions or recommendations expressed
in this material are those of the author(s) and do not necessarily
reflect the views of the National Science Foundation
|