Assembly and Analysis Software for Exploring the Human Microbiome
People
Faculty: Mihai Pop, Steven Salzberg
Post-doctoral fellows: Niranjan Nagarajan, Arthur Brady
Students: Sergey Koren, Mohammad Ghodsi, Bo Liu, James White, Ted Gibbons
Funding
Our work is supported by the NIH through grant R01-HG-004885 to Mihai Pop.
Metagenomic assembly
The main challenge in metagenomic assembly arises from the
heterogeneous nature of metagenomic data. Most environments contain
an uneven representation of the member species, and furthermore, the
organisms in the environment frequently belong to clusters of closely
related strains whose genomes are largely similar but differ due to
mobile genetic elements and point mutations. These characteristics of
the data make it virtually impossible to construct a single assembly
of each organism present in a sample. Instead, many organisms will be
under-sampled and will be assembled in a highly fragmented form,
while groups of closely related organisms will end up assembled
together into a polymorphic structure that can be modeled as a
computational graph.
We are currently exploring several approaches for analyzing and
visualizing metagenomic assembly graphs, including procedures for
graph simplification, for detection of genomic polymorphisms (work
related to our research on the analysis of genomic
variation from assembly information), and new approaches for
repeat identification and resolution.
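One common way to simplify an assembly graph is to collapse unbranched chains of nodes so that only the branch points remain. What is left often exposes "bubbles", where closely related strains diverge and then rejoin. The following is a minimal Python sketch of that idea only; it is not the procedure used in AMOS or Bambus, and the function name and the toy edge list are hypothetical.

```python
from collections import defaultdict


def compress_linear_paths(edges):
    """Group the nodes of a directed graph into maximal unbranched chains.

    edges: iterable of (u, v) pairs. Returns a list of chains (lists of
    nodes); every node appears in exactly one chain.
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    nodes = set()
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
        nodes.update((u, v))

    seen = set()

    def starts_chain(node):
        # A node starts a chain unless it is the only exit of its only parent.
        if len(pred[node]) != 1:
            return True
        (parent,) = pred[node]
        return len(succ[parent]) != 1

    def walk(start):
        # Follow the chain while each step is a sole exit into a sole entry.
        path = [start]
        seen.add(start)
        current = start
        while len(succ[current]) == 1:
            (nxt,) = succ[current]
            if nxt in seen or len(pred[nxt]) != 1:
                break
            path.append(nxt)
            seen.add(nxt)
            current = nxt
        return path

    paths = [walk(n) for n in sorted(nodes) if starts_chain(n)]
    # Anything still unseen lies on an isolated cycle with no branch point.
    paths += [walk(n) for n in sorted(nodes) if n not in seen]
    return paths


# Hypothetical toy graph: a shared region A-B-C, two variants D and E,
# and a shared region F-G after the variants rejoin.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E"),
         ("D", "F"), ("E", "F"), ("F", "G")]
print(compress_linear_paths(edges))
# → [['A', 'B', 'C'], ['D'], ['E'], ['F', 'G']]
```

In the toy output, the two single-node chains D and E sit between the same pair of collapsed regions, which is the bubble pattern that a polymorphism detector would look for.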
Metagenomic gene finding
Gene finding in metagenomic data sets is complicated by the
fragmented nature of metagenomic assemblies, and by the fact that
many organisms are only poorly sampled, which can leave genes
fragmented and, because poorly sampled sequence has high error rates,
disrupted by frame-shifts. We are working on extensions of the Glimmer
gene finder to accommodate these characteristics of metagenomic data.
Metagenomic binning
We have developed a metagenomic binning program specifically
targeted at short DNA fragments (such as reads). This program, called
Phymm, uses the
Interpolated Markov Model framework from Glimmer to accurately
classify reads as short as 100bp. We are currently exploring whether
binning reads prior to assembly can improve the quality of
metagenomic analysis.
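To illustrate the general idea of classifying a read with interpolated Markov models, here is a toy Python sketch. It trains one model per reference sequence, blends predictions from several context lengths, and assigns a read to the best-scoring model. This is not Phymm's or Glimmer's implementation: the class name, the `trust` parameter, the count-based interpolation weight, and the toy sequences are all assumptions made for illustration, and only the bases A, C, G and T are handled.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ToyIMM:
    """Toy interpolated Markov model over DNA (illustrative only)."""

    def __init__(self, max_order=3, trust=5.0):
        self.max_order = max_order
        self.trust = trust  # hypothetical knob: larger values favor shorter contexts
        # counts[k][context][base]: times `base` followed the length-k `context`
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def train(self, sequence):
        for i, base in enumerate(sequence):
            for k in range(min(i, self.max_order) + 1):
                self.counts[k][sequence[i - k:i]][base] += 1

    def prob(self, context, base):
        # Order 0 with add-one smoothing, so no base ever has probability zero.
        c0 = self.counts[0][""]
        p = (c0[base] + 1) / (sum(c0.values()) + len(ALPHABET))
        # Blend in longer contexts, trusting each in proportion to its count.
        for k in range(1, min(len(context), self.max_order) + 1):
            seen = self.counts[k].get(context[-k:])
            if not seen:
                break  # context never observed: keep the shorter-order estimate
            n = sum(seen.values())
            lam = n / (n + self.trust)
            p = lam * seen[base] / n + (1 - lam) * p
        return p

    def log_score(self, read):
        return sum(
            math.log(self.prob(read[max(0, i - self.max_order):i], read[i]))
            for i in range(len(read))
        )


def classify(read, models):
    """Return the name of the model that gives `read` the highest log score."""
    return max(models, key=lambda name: models[name].log_score(read))


# Hypothetical toy "genomes"; real models would be trained on reference genomes.
genomes = {
    "taxon_AT": "ATATATTATAATATATTAAT" * 5,
    "taxon_GC": "GCGCGGCCGCGCGGGCCGCG" * 5,
}
models = {}
for name, seq in genomes.items():
    models[name] = ToyIMM()
    models[name].train(seq)

print(classify("ATATTATA", models))
```

The interpolation step is what matters for short reads: when a long context has been seen only rarely in the training sequence, the estimate falls back toward shorter contexts instead of relying on a handful of observations.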
Research sub-projects
Publications
M. Pop. Genome assembly reborn: recent computational challenges. Brief. Bioinf., 10(4):354-366, 2009.
Ben Langmead, Cole Trapnell, Mihai Pop, Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10:R25, 2009.
Ben Langmead, Michael C. Schatz, Jimmy Lin, Mihai Pop, Steven L. Salzberg. Searching for SNPs with cloud computing. Genome Biology, 10:R134, 2009.
A. Brady, S. L. Salzberg. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods, 6:673-6, 2009.
M. Ghodsi, M. Pop. Inexact local alignment search over suffix arrays. Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, 83-87, 2009.
James R. White, Niranjan Nagarajan, Mihai Pop. Statistical Methods for Detecting Differentially Abundant Features in Clinical Metagenomic Samples. PLoS Computational Biology, 5:e1000352, 2009.
Software
AMOS - open-source genome assembly framework. The assembly software developed in this project will be incorporated within AMOS. Packages currently available are:
ABBA - gene-boosted assembly - assembler that uses protein sequences to guide the assembly of gene fragments from metagenomic data.
AMOScmp - comparative (templated) assembler - used to build an assembly of metagenomic data against a reference genome. Extensions for handling short reads are described here and here.
Bambus 2 - scaffolder that can be used in the assembly of metagenomic data and can also identify certain types of intra-strain variation. A full metagenomic assembly pipeline will be available soon.
Minimus - conservative sequence assembler - mostly avoids co-assembling different organisms in metagenomic data at the cost of generating a much more fragmented assembly. Together with Bambus it will be the core of a full assembly pipeline.
Bowtie - fast sequence alignment tool for short-read (e.g. Illumina) sequences originally targeted at human resequencing projects. This tool can, however, be used to simultaneously map metagenomic reads to the full complement of reference genomes.
Crossbow - alignment and SNP calling tool that can run on a cloud computing platform running Hadoop (e.g. through the Amazon EC2 service). This tool can enable the analysis of large amounts of sequencing data that overpower local computing infrastructures.
Metastats - statistical analysis software for comparing metagenomic clinical samples.
Phymm - metagenomic binning software that maintains high accuracy for sequences as short as 100bp (e.g. single Illumina reads).