Graphs to Diversity: extracting genomic variation from sequence
Faculty: Mihai Pop, Carl
Current: Geet Duggal, Rob Patro, Lee
Nagarajan, Saket Navlakha,
Also see the undergraduate
summer research program initiated with support from this project. This research
program is currently in it's fourth year and has included both
undergraduate and High School students.
Our work is supported by the NSF
through grant IIS-0812111
to Mihai Pop and Carl Kingsford
Sequencing as a tool for uncovering genome
Most formulations of the genome
assembly problem make the implicit assumption that the data being
sequenced represents a single piece of DNA, therefore, the task of
the assembly process is to generate one single string that best
explains the underlying read data. In practice, however, this
assumption is often incorrect. The genomes of humans, as well as
those of most complex organisms, contain two copies of each
chromosome, one from each parent. While overall similar, these
chromosomes contain many differences which need to be identified and
characterized during the assembly process. Initial efforts at mapping
the common patterns of variation in human populations (the
HapMap project) have already led to the discovery of several
genes involved in complex diseases such as diabetes.
Furthermore, advances in sequencing technologies have made
possible the sequencing of bacteria directly from an environment
(referred to as environmental sequencing or metagenomics), providing
the basis for large-scale studies of microbial diversity. Microbial
diversity is important to understand because, for example, bacteria
use variation to adapt to changing environments or as a means to
escape the immune system. The study of genomic variation is also
particularly relevant in viruses, where, due to high mutation rates,
homogeneous populations are rarely encountered. Virologists often
refer to the concept of quasi-species to describe a heterogeneous
mixture of rapidly mutating viral particles. Efforts to combat
viruses, such as HIV, can be greatly helped by a better understanding
of the structure of quasi-species in viral infections.
Prior to the availability of inexpensive high-throughput
sequencing technologies, genomic variation was studied in a targeted
fashion. The HapMap project mentioned above, for example, studied a
single type of variation: single nucleotide polymorphisms (SNPs) a
single base differences between homologous chromosomes. Studies of
variation in bacterial populations have primarily relied on the
manual inspection of the data. Initial software developed in this
field focus primarily on assisting manual analysis rather than
automated procedures for characterizing the bacterial diversity (see
The focus of our research is to develop automated tools that can
extract genomic variation from genome assembly graphs. We are
interested in both graphs arising from the co-assembly of multiple
bacterial genomes (e.g. arising from metagenomic studies), as well as
in the assembly of polymorphic eukaryotic genomes. Furthermore, we
are particularly interested in the impact of new high-throughput
experimental techniques such as sequencing or mapping technologies,
on our ability to characterize genomic variation within genomic
Assembly graph analysis We developed a series of modules
to be added to the AMOS assembly package for performing several
analyses on assembly graphs. These include: (i) a new approach for
detecting genomic repeats that marks as repeats nodes in the graph
that are shared by many shortest paths instead of using
coverage-based statistics; (ii) modules for identifying and reporting
certain types of sub-graph motifs that are indicative of genome
variation; (iii) a series of graph simplifications procedures that
can be applied to assembly graphs in order to increase the size of
Parametric complexity of genome assembly We explored the
effect of the interplay between read length and repeat size on the
computational difficulty of assembling a genome. We revealed that,
despite recent NP-hardness results, genome assembly can be performed
in polynomial time for a wide range of repeat sizes, though in many
cases the problem is under-constrained. This work provides insight
into the apparent disconnect between two competing views of assembly
complexity: biologists consider assembly difficult when genomic
repeats are long (underconstrained problem solvable in polynomial
time); while computer scientists find the problem difficult (NP-hard)
when repeats are short (assembly is a constrained optimization
Prokaryotic Genome Complexity
We describe an application of de Bruijn graphs - a theoretical
framework underlying several modern genome assembly programs - to
analyze the global repeat structure of prokaryotic genomes. We
provide the 1st survey of the repeat structure of a large number of
genomes, and make publicly available the resulting simplified de
What information is most useful to solve assembly problems?
We are exploring ways in which
information provided by high-throughput genomic experiments can be
used to guide the genome assembly process. As subquestions we have
asked: How useful are mate-pairs? Empirically it is
widely accepted that mate-pair data provide useful information in the
assembly process, yet this observation has not yet been thoroughly
evaluated in a controlled setting. We are currently evaluating the
"usefulness" of mate-pairs through simulations performed on
a collection of whole-genome deBruijn graphs constructed from
complete microbial genomes. How useful are optical maps?
We have started developing new
algorithms for utilizing mapping data within the assembly process.
Optical mapping data can provide clues on how to resolve repeats,
similar to the way mate-pair information is used. However, these data
contain more information than mate-pairs as they can provide links
across much longer distances (from hundreds of kbp up to whole
Understanding connectivity in graphs to speed up NP-hard graph
problems We explore the notion of edge connectivity to better
understand its theoretical properties and to use that increase
understanding to create improved heuristic algorithms for NP-hard
problems that work better on real-world graphs. In particular, we
have characterized the "exactly 3-edge-connected graphs" by
giving a synthesis for generating them all in the spirit of Tutte's
synthesis of 3-vertex connected graphs.
Clustering graphs and other data to find natural modules
We develop new graph clustering algorithms to find natural biological
units in large graphs. We also apply the clustering ideas to more
accurately determine OTU diversity in metagenomic populations and to
label sequences with their OTU classification.
Solving clique problems via particle swarms We explore the
use of particle swarm heuristics to solve NP-hard problems. We extend
traditional real-valued particle swarms to handle combinatorial
problems, in particular the minimum weighted clique problem.
Henry C Lin, Steve
Goldstein, Lee Mendelowitz, Shiguo Zhou, Joshua Wetzel, David
C Schwartz, Mihai Pop. AGORA: Assembly Guided by Optical
Restriction Alignment BMC Bioinformatics
R. Patro, G. Duggal,
E. Sefer, H. Wang, D. Filippova, and C. Kingsford. The
Missing Models: A Data-Driven Approach for Learning How
Networks Grow. To appear in KDD 2012.
Rob Patro, Emre Sefer, Justin
Malin, Guillaume Marçais, Saket Navlakha, and Carl Kingsford.
Parsimonious Reconstruction of Network Evolution.
in Proceedings of WABI 2011
Guillaume Marçais and Carl
fast, lock-free approach for efficient parallel counting of
occurrences of k-mers. Bioinformatics 27 (6): 764-770, 2011.
Niranjan Nagarajan and Carl
Robust, Computational Identification of Influenza Reassortments via
Graph Mining. Nucleic Acids Research 39(6):e34, 2011.
Emre Sefer and Carl Kingsford.
Metric Labeling and Semi-metric Embedding for Protein Annotation
Prediction. In Proceedings of the 15th Annual international
conference on Research in Computational Molecular Biology (RECOMB),
pp. 392-407, 2011
Joshua Wetzel, Carl Kingsford, and
Mihai Pop. Assessing
the benefits of using mate-pairs to resolve repeats in de novo
short-read prokaryotic assemblies. BMC Bioinformatics 2011,
Saket Navlakha and Carl Kingsford.
Archaeology: Uncovering Ancient Networks from Present-day
Interactions. PLoS Computational Biology 7(4):e1001119.
J. R. White, S. Navlakha, N.
Nagarajan, M. Ghodsi, C. Kingsford, and M. Pop. Alignment
and clustering of phylogenetic markers - implications for microbial
diversity studies. BMC Bioinformatics 2010, 11:152. [highly
S. Navlakha and C. Kingsford. The
Power of Protein Interaction Networks for Associating Genes with
Diseases. Bioinformatics, 2010; doi:
B. Liu and M. Pop. Identifying
Differentially Abundant Metabolic Pathways in Metagenomic Datasets.
in Proceedings of ISBRA 2010. Springer LNBI 6053. 2010.
N. Nagarajan, C. Cook, M. Di
Bonaventura, H. Ge, A. Richards, K.A. Bishop-Lilly, R. DeSalle, T.D.
Read, M. Pop Finishing
genomes with limited resources: lessons from an ensemble of
microbial genomes. BMC Genomics, 11:242 2010 [highly
C. Kingsford, M. C. Schatz, and
Mihai Pop. Assembly
complexity of prokaryotic genomes using short reads. BMC
Bioinformatics11:21, 2010. (Top 10 most-viewed articles Jan/Feb
2010.) [highly accessed]
S. Navlakha, J. White, N.
Nagarajan, M. Pop, and C. Kingsford. Finding
Biologically Accurate Clusterings in Hierarchical Tree
Decompositions Using the Variation of Information. In
Proceedings of RECOMB 2009. Lecture Notes in Computer Science 5541,
pages 400-417. doi:10.1007/978-3-642-02008-7.
G. Lapizco-Encinas, C. Kingsford,
and J. Reggia. A
Cooperative Combinatorial Particle Swarm Optimization for Side-chain
Packing. In Proceedings of IEEE Swarm Intelligence
Symposium, 2009, pages 22-29. doi:10.1109/SIS.2009.4937840.
S. Navlakha, M. Schatz, and C.
Biological Modules via Graph Summarization. Presented at
RECOMB-SB/RG/DREAM3 satellite conference, 2008. Journal version in
J. Comp. Biol. 16(2):253-264, 2009.
N. Nagarajan, M. Pop. Parametric
complexity of sequence assembly: theory and applications to next
generation sequencing. J. Comp. Biol. 16(7): 897-908, 2009.
M. Pop. Genome
assembly reborn: recent computational challenges. Brief.
Bioinf. 10(4):354-366. 2009.
Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the author(s) and do not
necessarily reflect the views of the National Science Foundation