About Phymm and PhymmBL
Metagenomics sequencing projects collect samples of DNA from uncharacterized
environments that may contain hundreds or even thousands of species. One of the
main challenges in analyzing a metagenome is phylogenetic classification of raw
sequence reads into groups representing the same or similar species. Such
classification is a useful prerequisite for genome assembly and for analysis of the
biological diversity present in a sample. The newest sequencing technologies have
simultaneously made metagenomics easier, by making the sequencing process faster,
and more difficult, by producing shorter read lengths than previous technologies.
Methods for classifying sequences as short as 100 base pairs (bp) have until now
been relatively inaccurate, requiring metagenomics projects to use older, long-read
technologies. Phymm, a new classifier for metagenomics data that uses interpolated
Markov models (IMMs) to taxonomically classify DNA sequences, can accurately classify
reads as short as 100 bp. Its accuracy on short reads is a significant improvement
over previous composition-based classification methods. PhymmBL, a method included
in this distribution that combines the results of Phymm and BLAST, achieves even
higher accuracy.
Accuracy
Because metagenomic samples frequently contain species that have never before been
sequenced, we examined the performance of this system while excluding progressively
more data from organisms related to those from which the query reads were sampled.
The table below summarizes predictive accuracy results from PhymmBL, the hybrid
method incorporating information from both Phymm and BLAST.
For instance, the information in the cell indexed by "Family excluded" and "Phylum"
means that when, for each query read in our test set, all organisms belonging
to the same family as the organism from which that read was sampled were
excluded from consideration -- i.e., when the best possible prediction is one
made at the order level -- PhymmBL was able to predict the correct phylum
of query reads 57.5% of the time, with a standard deviation (measured over 10
runs) of ± 0.6%.
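The exclusion scheme described above can be sketched in a few lines of Python. This is an illustration only, not the code used for the experiments: the function names, the dictionary layout for lineages, and the toy taxon labels (s1, g1, f1, ...) are all hypothetical.

```python
# Taxonomic ranks from most specific to most general, matching the table columns.
RANKS = ["species", "genus", "family", "order", "class", "phylum"]


def allowed_references(query_lineage, reference_lineages, excluded_rank):
    """Return the reference organisms that remain after clade-level exclusion.

    A reference is dropped when it shares the query's taxon at `excluded_rank`.
    Passing excluded_rank=None keeps everything ("All matches allowed").
    """
    if excluded_rank is None:
        return sorted(reference_lineages)
    query_taxon = query_lineage[excluded_rank]
    return sorted(
        name
        for name, lineage in reference_lineages.items()
        if lineage[excluded_rank] != query_taxon
    )


def best_possible_rank(excluded_rank):
    """Most specific rank at which a correct prediction is still possible."""
    if excluded_rank is None:
        return RANKS[0]
    position = RANKS.index(excluded_rank) + 1
    return RANKS[position] if position < len(RANKS) else None


# Hypothetical toy lineages; the taxon names are placeholders, not real organisms.
QUERY = {"species": "s1", "genus": "g1", "family": "f1",
         "order": "o1", "class": "c1", "phylum": "p1"}
REFERENCES = {
    "same_genus": {"species": "s2", "genus": "g1", "family": "f1",
                   "order": "o1", "class": "c1", "phylum": "p1"},
    "same_order": {"species": "s3", "genus": "g2", "family": "f2",
                   "order": "o1", "class": "c1", "phylum": "p1"},
    "same_phylum": {"species": "s4", "genus": "g3", "family": "f3",
                    "order": "o2", "class": "c2", "phylum": "p1"},
}

if __name__ == "__main__":
    # With the query's family excluded, only more distant relatives remain,
    # so the most specific label that can still be correct is the order.
    print(allowed_references(QUERY, REFERENCES, "family"))
    print(best_possible_rank("family"))
```

In this toy setup, excluding the query's family removes the reference that shares its genus (and therefore its family) while keeping the two more distant relatives, which mirrors the "Family excluded" row of the table.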
Note that accuracy, as reported in the table below, is measured as the percentage
of all 100-bp query reads in the test data that received a correct label;
no reads are left unlabeled.
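As a rough sketch of the accuracy measure just described, the snippet below scores one run as the percentage of all query reads with a correct label and then summarizes several runs as a mean and standard deviation. The function names are hypothetical, and the use of the sample standard deviation (rather than the population form) is an assumption; the paper should be consulted for the exact procedure.

```python
from statistics import mean, stdev


def accuracy_percent(predicted_labels, true_labels):
    """Percentage of reads whose predicted label matches the true label.

    Every read counts toward the denominator, since no reads are left unlabeled.
    """
    if len(predicted_labels) != len(true_labels):
        raise ValueError("each read needs exactly one prediction")
    if not true_labels:
        raise ValueError("at least one read is required")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return 100.0 * correct / len(true_labels)


def summarize_runs(per_run_accuracies):
    """Return (mean, sample standard deviation) over independent runs.

    Assumption: sample standard deviation; at least two runs are required.
    """
    return mean(per_run_accuracies), stdev(per_run_accuracies)


if __name__ == "__main__":
    # Hypothetical labels for four reads in a single run.
    predicted = ["p1", "p1", "p2", "p3"]
    truth = ["p1", "p1", "p1", "p3"]
    print(accuracy_percent(predicted, truth))  # 75.0
```

Calling `summarize_runs` on the ten per-run accuracies would give a figure in the same "mean ± standard deviation" form as the table entries.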
Please see the paper for details on these and other
experiments. All synthetic test data used for the experiments described
in the paper (10 sets of 100-bp reads, plus one set each containing reads
of 200, 400, 800 and 1000 bp) can be downloaded here.
Exclusion level     | Species    | Genus      | Family     | Order      | Class      | Phylum
--------------------|------------|------------|------------|------------|------------|-----------
All matches allowed | 95.4 ± 0.2 | 99.1 ± 0.1 | 99.7 ± 0.1 | 99.8 ± 0.1 | 99.9 ± 0.1 | 99.9 ± 0.0
Species excluded    | ---        | 58.5 ± 0.6 | 63.7 ± 0.6 | 66.3 ± 0.6 | 71.0 ± 0.5 | 76.8 ± 0.8
Genus excluded      | ---        | ---        | 26.9 ± 0.6 | 33.0 ± 0.6 | 44.6 ± 0.6 | 63.4 ± 0.6
Family excluded     | ---        | ---        | ---        | 19.3 ± 0.5 | 33.4 ± 0.5 | 57.5 ± 0.6
Order excluded      | ---        | ---        | ---        | ---        | 23.8 ± 0.5 | 53.2 ± 0.6
Class excluded      | ---        | ---        | ---        | ---        | ---        | 43.5 ± 0.7

PhymmBL percent prediction accuracy and standard deviations for classification
experiments with 100-bp reads and different clade levels excluded from comparison.
Obtaining the Software
This software is OSI Certified Open Source Software.
Click to download the complete Phymm/PhymmBL system as either a gzipped tarball or as a .zip file.
NOTE: users of gcc 4.4.x and above, please use these links instead: gzipped tar | .zip file
After downloading, move the downloaded file into a directory in which you
intend to store the Phymm/PhymmBL program files and any downloaded genomic
data, then uncompress it by typing
tar zxvf phymmInstaller.tar.gz
The proper subdirectory structure will be created in your target directory,
as will the installer script and a README
file with instructions on building and using the system.
The software was developed and tested on a multi-core Linux system; it is expected to work
properly on any Unix-like system which meets its system requirements (see the
README for complete details).
PLEASE NOTE: Setup is particularly computationally intensive (more so than read analysis,
once the system has been built): even on a relatively powerful server, you should expect
a ground-up installation to take at least 24 hours.
References
- A. Brady and S. L. Salzberg. Phymm and PhymmBL: Phylogenetic
Classification of Metagenomic Data with Interpolated Markov Models.
(submitted for publication)