About Phymm and PhymmBL

Metagenomics sequencing projects collect samples of DNA from uncharacterized environments that may contain hundreds or even thousands of species. One of the main challenges in analyzing a metagenome is the phylogenetic classification of raw sequence reads into groups representing the same or similar species. Such classification is a useful prerequisite for genome assembly and for analysis of the biological diversity present in a sample. The newest sequencing technologies have simultaneously made metagenomics easier, by making the sequencing process faster, and more difficult, by producing shorter reads than previous technologies. Methods for classifying sequences as short as 100 base pairs (bp) have until now been relatively inaccurate, requiring metagenomics projects to use older, long-read technologies. Phymm, a new classifier for metagenomics data that uses interpolated Markov models (IMMs) to taxonomically classify DNA sequences, can accurately classify reads as short as 100 bp. Its accuracy for short reads represents a significant leap forward over previous composition-based classification methods. PhymmBL, a method included in this distribution that combines analysis from both Phymm and BLAST, produces even higher accuracy.
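To give a rough feel for what composition-based classification with Markov models involves, here is a minimal Python sketch. It is not Phymm's implementation: it uses a single fixed-order Markov chain with add-one smoothing, whereas an IMM combines predictions from contexts of several lengths. The function names, the choice of order 3, and the toy "genomes" are all assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov_model(genome, order=3):
    """Count how often each base follows each length-`order` context."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(genome)):
        counts[genome[i - order:i]][genome[i]] += 1
    return counts

def log_likelihood(read, counts, order=3):
    """Log-probability of the read under the model, with add-one smoothing."""
    total = 0.0
    for i in range(order, len(read)):
        context_counts = counts.get(read[i - order:i], {})
        numerator = context_counts.get(read[i], 0) + 1
        denominator = sum(context_counts.values()) + len(ALPHABET)
        total += math.log(numerator / denominator)
    return total

def classify(read, models, order=3):
    """Return the label whose model assigns the read the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(read, models[label], order))

# Hypothetical toy "genomes"; real reference genomes are millions of bases long.
models = {
    "organism_A": train_markov_model("ACGT" * 50),
    "organism_B": train_markov_model("AATT" * 50),
}
label = classify("ACGTACGTACGT", models)
```

Each candidate organism gets its own model, and a read is assigned to whichever model explains its base composition best. Interpolating across several context lengths, as an IMM does, is a refinement of this idea that the sketch deliberately leaves out.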
 
 

Accuracy

A central challenge of metagenomic analysis is that samples frequently contain species that have never been sequenced. We therefore examined the performance of this system while progressively withholding data from organisms related to those from which the query reads were sampled. The table below summarizes predictive accuracy results from PhymmBL, the hybrid method incorporating information from both Phymm and BLAST.
 
For instance, consider the cell indexed by "Family excluded" and "Phylum". In that experiment, for each query read in our test set, all organisms belonging to the same family as the organism from which that read was sampled were excluded from consideration, so the best possible prediction is one made at the order level. Under those conditions, PhymmBL was able to predict the correct phylum of query reads 57.5% of the time, with a standard deviation (measured over 10 runs) of ± 0.6%.
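The exclusion rule described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not code from the Phymm distribution: the function names, the dictionary layout, and the placeholder taxon names (`fam_A`, `ord_B`, and so on) are all hypothetical.

```python
RANKS = ["species", "genus", "family", "order", "class", "phylum"]

def lineage(*names):
    """Build a rank -> taxon-name mapping from names listed in RANKS order."""
    return dict(zip(RANKS, names))

def allowed_references(query_lineage, reference_lineages, excluded_rank):
    """Names of reference organisms outside the query's clade at excluded_rank."""
    clade = query_lineage[excluded_rank]
    return [name for name, ref in reference_lineages.items()
            if ref[excluded_rank] != clade]

def best_possible_rank(excluded_rank):
    """Finest rank at which a correct prediction is still possible, or None."""
    i = RANKS.index(excluded_rank)
    return RANKS[i + 1] if i + 1 < len(RANKS) else None

# Hypothetical reference set with placeholder taxon names.
references = {
    "ref_1": lineage("sp_1", "gen_A", "fam_A", "ord_A", "cls_A", "phy_A"),
    "ref_2": lineage("sp_2", "gen_B", "fam_A", "ord_A", "cls_A", "phy_A"),
    "ref_3": lineage("sp_3", "gen_C", "fam_B", "ord_A", "cls_A", "phy_A"),
    "ref_4": lineage("sp_4", "gen_D", "fam_C", "ord_B", "cls_B", "phy_B"),
}
query = lineage("sp_0", "gen_A", "fam_A", "ord_A", "cls_A", "phy_A")

allowed = allowed_references(query, references, "family")
finest = best_possible_rank("family")
```

With the family excluded, only `ref_3` and `ref_4` remain available for comparison, and `ref_3` still shares the query's order, which is why an order-level prediction is the best that can be hoped for.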
 
Note that accuracy, as reported in the table below, is measured as the percentage of all 100-bp query reads in the test data that received a correct label; no reads are left unlabeled.
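As a concrete reading of this accuracy measure, the following Python sketch scores one run and then summarizes several runs as a mean and a standard deviation. The function names and toy labels are hypothetical, and the use of the sample standard deviation (rather than the population form) is an assumption, since the text does not say which was used.

```python
from statistics import mean, stdev

def percent_accuracy(predicted, truth):
    """Percentage of reads whose predicted label equals the true label.

    Every read counts toward the denominator, since no read is left unlabeled.
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("no reads to score")
    correct = sum(p == t for p, t in zip(predicted, truth))
    return 100.0 * correct / len(truth)

def summarize_runs(per_run_accuracy):
    """Mean and sample standard deviation across independent runs (assumed form)."""
    return mean(per_run_accuracy), stdev(per_run_accuracy)

# Hypothetical toy labels for four reads at a single clade level.
truth = ["phy_A", "phy_B", "phy_B", "phy_C"]
predicted = ["phy_A", "phy_B", "phy_A", "phy_C"]
accuracy = percent_accuracy(predicted, truth)
```

Calling `summarize_runs` on a list of ten per-run accuracies would yield a pair of numbers in the same "mean ± standard deviation" form used in the table.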
 
Please see the paper for details on these and other experiments. All synthetic test data used for the experiments described in the paper (10 sets of 100-bp reads, plus one set each containing reads of 200, 400, 800 and 1000 bp) can be downloaded here.
 


Experiment            Species      Genus        Family       Order        Class        Phylum
All matches allowed   95.4 ± 0.2   99.1 ± 0.1   99.7 ± 0.1   99.8 ± 0.1   99.9 ± 0.1   99.9 ± 0.0
Species excluded      ---          58.5 ± 0.6   63.7 ± 0.6   66.3 ± 0.6   71.0 ± 0.5   76.8 ± 0.8
Genus excluded        ---          ---          26.9 ± 0.6   33.0 ± 0.6   44.6 ± 0.6   63.4 ± 0.6
Family excluded       ---          ---          ---          19.3 ± 0.5   33.4 ± 0.5   57.5 ± 0.6
Order excluded        ---          ---          ---          ---          23.8 ± 0.5   53.2 ± 0.6
Class excluded        ---          ---          ---          ---          ---          43.5 ± 0.7

PhymmBL percent prediction accuracy and standard deviations for classification experiments with 100-bp reads and different clade levels excluded from comparison.

 

Obtaining the Software

This software is OSI Certified Open Source Software.   
 
The complete Phymm/PhymmBL system is available for download as either a gzipped tarball or a .zip file.

--> NOTE: users of gcc 4.4.x and above please use these links instead: gzipped tar | .zip file <--

After downloading, move the file into the directory in which you intend to store the Phymm/PhymmBL program files and any downloaded genomic data, then uncompress it by typing:
 
     tar zxvf phymmInstaller.tar.gz
 
The proper subdirectory structure will be created in your target directory, as will the installer script and a README file with instructions on building and using the system.
 
The software was developed and tested on a multi-core Linux system; it is expected to work properly on any Unix-like system which meets its system requirements (see the README for complete details).
 
PLEASE NOTE: Setup is particularly computationally intensive (more so than read analysis, once the system has been built): even on a relatively powerful server, you should expect a ground-up installation to take at least 24 hours.

 

References

  • A. Brady and S. L. Salzberg. Phymm and PhymmBL: Phylogenetic Classification of Metagenomic Data with Interpolated Markov Models. (submitted for publication)