Assembly and Analysis Software for Exploring the Human Microbiome


Faculty: Mihai Pop, Steven Salzberg

Post-doctoral fellows: Niranjan Nagarajan, Arthur Brady

Students: Sergey Koren, Mohammad Ghodsi, Bo Liu, James White, Ted Gibbons


Our work is supported by the NIH through grant R01-HG-004885 to Mihai Pop.

Metagenomic assembly

The main challenge in metagenomic assembly arises from the heterogeneous nature of metagenomic data. Most environments contain an uneven representation of their member species. Furthermore, the organisms in an environment frequently belong to clusters of closely related strains whose genomes are largely similar but differ through mobile genetic elements and point mutations. These characteristics make it virtually impossible to construct a single assembly of each organism present in a sample. Instead, many organisms will be under-sampled and assembled in a highly fragmented form, while groups of closely related organisms will end up assembled together into a polymorphic structure that can be modeled as a graph.

We are currently exploring several approaches for analyzing and visualizing metagenomic assembly graphs, including procedures for graph simplification, for detection of genomic polymorphisms (work related to our research on the analysis of genomic variation from assembly information), and new approaches for repeat identification and resolution.
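
As a rough illustration of what graph simplification and polymorphism detection can mean in practice, the sketch below compresses unbranched chains of a directed graph into single paths and then reports "bubbles", that is, alternative paths sharing the same start and end node. It is a minimal sketch only: the edge-list input and the function names `compress_linear_paths` and `find_simple_bubbles` are assumptions made for this example and are not taken from AMOS or Bambus.

```python
from collections import defaultdict


def compress_linear_paths(edges):
    """Return maximal unbranched paths of a directed graph given as (u, v) pairs.

    Isolated cycles made only of one-in/one-out nodes are not reported
    in this sketch.
    """
    succ = defaultdict(list)
    pred = defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    def is_interior(node):
        # Exactly one way in and one way out: safe to merge into a longer path.
        return len(pred[node]) == 1 and len(succ[node]) == 1

    paths = []
    for node in sorted(nodes):
        if is_interior(node):
            continue  # paths only start at branching or terminal nodes
        for nxt in succ[node]:
            path = [node, nxt]
            while is_interior(path[-1]):
                path.append(succ[path[-1]][0])
            paths.append(path)
    return paths


def find_simple_bubbles(paths):
    """Group compressed paths by (start, end); keep groups with alternatives."""
    by_ends = defaultdict(list)
    for path in paths:
        by_ends[(path[0], path[-1])].append(path)
    return {ends: alts for ends, alts in by_ends.items() if len(alts) > 1}


# Hypothetical toy graph: two strain variants ("a" and "b") between "s" and "t".
edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t"), ("t", "u")]
paths = compress_linear_paths(edges)
bubbles = find_simple_bubbles(paths)
```

In this toy graph the two routes from `s` to `t` form one bubble, the kind of structure that closely related strains might produce in an assembly graph. Real assembly graphs also carry sequence, coverage, and orientation information that this sketch ignores.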

Metagenomic gene finding

Gene finding in metagenomic datasets is complicated by the fragmented nature of metagenomic assemblies and by the fact that many organisms are only poorly sampled. Poor sampling can leave genes fragmented and, because of the resulting high error rates, can introduce frame-shifts. We are working on extensions of the Glimmer gene finder to accommodate these characteristics of metagenomic data.
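
To make the fragmentation problem concrete, the sketch below scans a DNA fragment in all six reading frames and reports stop-free stretches, including stretches that run off the end of the fragment and therefore cannot be confirmed as complete genes. This is a simple open-reading-frame scan written for illustration, not Glimmer's method; the function name `open_frames` and the `min_len` parameter are assumptions of this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def open_frames(seq, min_len=60):
    """Yield (strand, frame, start, end) for stop-free stretches of >= min_len bases.

    Coordinates are relative to the scanned strand. A stretch whose end equals
    the last full codon of the fragment is open-ended: the gene may continue
    beyond the fragment.
    """
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    if i - start >= min_len:
                        yield (strand, frame, start, i)
                    start = i + 3
            # Trailing stretch with no stop codon before the fragment ends.
            end = frame + ((len(s) - frame) // 3) * 3
            if end - start >= min_len:
                yield (strand, frame, start, end)


# Hypothetical fragment: a short gene-like stretch followed by a stop codon.
fragment = "ATG" + "GCT" * 10 + "TAA" + "GCT" * 5
frames = list(open_frames(fragment, min_len=30))
```

A single inserted or deleted base would shift the remainder of a gene into a different frame, so a scan like this would report it as two shorter stretches in different frames. That is one reason sequencing errors make gene calls on poorly sampled organisms unreliable.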

Metagenomic binning

We have developed a metagenomic binning program specifically targeted at short DNA fragments (such as individual reads). This program, called Phymm, uses the Interpolated Markov Model framework from Glimmer to accurately classify reads as short as 100 bp. We are currently exploring whether binning reads prior to assembly can improve the quality of metagenomic analysis.
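
The general idea of model-based binning can be sketched as follows: train one sequence model per reference group, score each read under every model, and assign the read to the best-scoring group. The example below uses a plain fixed-order Markov model with add-one smoothing, which is a deliberate simplification and not the Interpolated Markov Model used by Phymm and Glimmer. All function names, the `order` parameter, and the toy training sequences are assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov_model(sequences, order=3):
    """Count how often each base follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return {"order": order,
            "counts": {ctx: dict(nxt) for ctx, nxt in counts.items()}}


def log_likelihood(model, read):
    """Log-probability of a read under the model, with add-one smoothing.

    Unseen contexts fall back to a uniform distribution over ALPHABET.
    """
    order, counts = model["order"], model["counts"]
    total = 0.0
    for i in range(order, len(read)):
        ctx = counts.get(read[i - order:i], {})
        seen = sum(ctx.values())
        total += math.log((ctx.get(read[i], 0) + 1) / (seen + len(ALPHABET)))
    return total


def classify(models, read):
    """Return the name of the model under which the read is most likely."""
    return max(models, key=lambda name: log_likelihood(models[name], read))


# Hypothetical toy references; real models would be trained on whole genomes.
models = {
    "at_rich": train_markov_model(["AT" * 50], order=2),
    "gc_rich": train_markov_model(["GC" * 50], order=2),
}
label = classify(models, "ATATATATAT")
```

An interpolated model differs from this sketch by combining predictions from several context lengths, weighted by how much training data supports each, which matters most when the read is short and the training data is sparse.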

Research sub-projects



  • AMOS - open-source genome assembly framework. The assembly software developed in this project will be incorporated within AMOS. Packages currently available are:

    • ABBA - gene-boosted assembly - assembler that uses protein sequences to guide the assembly of gene fragments from metagenomic data.

    • AMOScmp - comparative (templated) assembler - used to build an assembly of metagenomic data against a reference genome. Extensions for handling short reads are described here and here.

    • Bambus 2 - scaffolder that can be used in the assembly of metagenomic data and can also identify certain types of intra-strain variation. A full metagenomic assembly pipeline will be available soon.

    • Minimus - conservative sequence assembler - mostly avoids co-assembling different organisms in metagenomic data at the cost of generating a much more fragmented assembly. Together with Bambus it will be the core of a full assembly pipeline.

  • Bowtie - fast sequence alignment tool for short-read (e.g. Illumina) sequences, originally targeted at human resequencing projects. This tool can, however, also be used to map metagenomic reads simultaneously against the full complement of reference genomes.

  • Crossbow - alignment and SNP calling tool that can run on a cloud computing platform running Hadoop (e.g. through the Amazon EC2 service). This tool can enable the analysis of volumes of sequencing data that would overwhelm local computing infrastructure.

  • Metastats - statistical analysis software for comparing metagenomic clinical samples.

  • Phymm - metagenomic binning software that maintains high accuracy for sequences as short as 100 bp (e.g. single Illumina reads).

Also of interest: