Note on support
Until further notice, PhymmBL is provided as-is. I will continue to develop
the program (including incorporating useful suggestions and fixes sent in by users)
as I find time, but my current job doesn't provide a whole lot of leeway for that,
so it'll be sporadic and drawn-out.
Also until further notice, I'm suspending individual user support. I'm humbled and
gratified that people continue to use the program, but I'm not doing anyone any
favors by making them wait 3-4 months for an answer (roughly my current turnaround
time). If anyone would like to start a support forum for the software -- it's fully
open-source, so the more experienced programmers out there can probably help
stuck users at least as well as I can -- I thoroughly endorse the idea. I ask only
that you tell me about any such forums as they go live, so that I can point future
support requests there (and so I can make sure there aren't redundant forums popping up in
My apologies to any users who've emailed me between now and the last time I had some
extra time to devote to development and support (around September 2012). Your requests
are still in my queue, but I can't predict when I'll be able to attend to them.
Metagenomics sequencing projects collect samples of DNA from uncharacterized
environments that may contain hundreds or even thousands of species. One of the
main challenges in analyzing a metagenome is phylogenetic classification of raw
sequence reads into groups representing the same or similar species. Such
classification is a useful prerequisite for genome assembly and for analysis of the
biological diversity present in a sample. The newest sequencing technologies have
simultaneously made metagenomics easier, by making the sequencing process faster,
and more difficult, by producing shorter read lengths than previous technologies.
Methods for classifying sequences as short as 100 base pairs (bp) have until now
been relatively inaccurate, requiring metagenomics projects to use older, long-read
technologies. Phymm, a new classification approach for metagenomics data
which uses interpolated Markov models (IMMs) to taxonomically classify DNA sequences,
can accurately classify reads as short as 100 bp. Its accuracy for short reads represents a
significant leap forward over previous composition-based classification methods.
PhymmBL (rhymes with "thimble"), the hybrid classifier included in this
distribution which combines analysis from both Phymm and BLAST,
produces even higher accuracy.
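To make the idea of composition-based classification more concrete, here is a minimal Python sketch. It is not Phymm's implementation: it uses a single fixed-order Markov model per organism with add-one smoothing, whereas Phymm's IMMs interpolate across several model orders. The function names, the order k=3, and the toy training sequences are all illustrative assumptions.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov_model(genome, k=3):
    """Count (k-base context -> next base) transitions in a training sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(genome) - k):
        context = genome[i:i + k]
        next_base = genome[i + k]
        counts[context][next_base] += 1
    return counts

def log_likelihood(read, counts, k=3):
    """Log-probability of a read under the model, using add-one smoothing."""
    total = 0.0
    for i in range(len(read) - k):
        context = read[i:i + k]
        next_base = read[i + k]
        context_counts = counts.get(context, {})
        numerator = context_counts.get(next_base, 0) + 1
        denominator = sum(context_counts.values()) + len(ALPHABET)
        total += math.log(numerator / denominator)
    return total

def classify(read, models, k=3):
    """Return the label of the model that scores the read highest."""
    return max(models, key=lambda label: log_likelihood(read, models[label], k))

# Toy training data (hypothetical, far shorter than any real genome).
models = {
    "AT-rich organism": train_markov_model("ATATTAATATTTAAATATAT" * 20),
    "GC-rich organism": train_markov_model("GCGGCCGCGCCGGGCGCGCC" * 20),
}
at_label = classify("TTAATATTTAAATAT", models)
gc_label = classify("CCGCGCCGGGCGCGC", models)
```

Each read is assigned to whichever model gives it the highest log-likelihood. A real classifier would train one or more models per reference genome and would have to cope with ambiguous bases and much longer training sequences.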
PhymmBL v4.0 is the current stable release.
> New in v4.0 [2012.08.31.1640]:
- A fleet of across-the-board stability upgrades and minor code changes.
- Enabled detailed logging of all levels of operation for easier progress tracking & troubleshooting.
- Support has been enabled for the BLAST+ applications, making the BLAST portion of PhymmBL's processing pipeline significantly faster.
- A file descriptor/redirection complaint specific to Ubuntu has been identified and fixed. Thanks to David Kelley for first pointing it out.
- The setup process has been made substantially more robust against errors and inconsistencies in RefSeq's GenBank-encoded metadata and NCBI's taxonomic trees.
- A rare bug in the BLAST database-rebuilding components of several scripts has been identified and eliminated.
> New in v3.2 [2011.02.23.1546]:
- Custom genome data can now be added in batches instead of having to add
one organism at a time. See the README for details and instructions on
using the new batch mode.
> New in v3.1 [2010.10.18.1651]:
- Reconfigured raw Phymm output format to deliver a huge reduction in file size.
- Fixed a rare potential infinite loop in rebuildBlastDB.pl.
> New in v3.01 [2010.09.17.1300]:
- Fixed a minor bug in addCustomGenome.pl that occasionally resulted in the loss of taxonomic
metadata for new organisms.
> New in v3.0 [2010.06.25.1425]:
- Confidence scores are now listed in the PhymmBL results files, translating raw scores into
usable estimates of predictive accuracy. Please see the README for
an important discussion of how to interpret and work with these scores.
- Date stamps are now given in each phase of PhymmBL's terminal output to let users know how
long each phase of analysis has taken.
- ICM IDs are now listed in the raw Phymm output to allow for disambiguation between ICM scores
assigned by different ICMs within the same species.
> New in v2.03 [2010.06.11.1327]:
- Semicolons in species/strain names are now handled properly with respect to local database
- The database of known GenBank taxonomic-labeling inconsistencies has been updated.
- A workaround has been added for kernels that complain when the 'cat' command is passed
too many arguments, which can affect the construction of the local BLAST database. (If
you didn't see an error during setup, you don't have to worry about this.)
- A section has been added to the README with suggestions on incorporating mate-pair
information into your classification run.
> New in v2.02 [2010.06.07.1246]:
- A bug in addCustomGenome.pl preventing full assimilation of new genomes has been corrected.
If you have any of the 2.x versions, and you attempted to add your own genomic data, check your
PHYMM_DIR/.genomeData/.userAdded/ADDED_ORGANISM/ directory; if it doesn't contain any .icm files,
please redownload the Phymm installer and add your custom genomes again. You will
not need to regenerate the core RefSeq libraries or alter the main genomic
database in any way.
- A bug in the new-copy RefSeq download subroutine has been fixed. If you installed
one of the 2.x versions for the first time and no genome data appeared, this version will
correct the problem. Timeouts for RefSeq downloads have been extended, and
the interface for addCustomGenome.pl has been tweaked to make the taxonomic data entry
a little clearer.
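The directory check described in the first v2.02 item above can be automated. The following is a small sketch, assuming that each user-added organism has its own subdirectory under PHYMM_DIR/.genomeData/.userAdded/ and that its .icm files sit directly inside that subdirectory. The function name is hypothetical and is not part of the PhymmBL distribution.

```python
from pathlib import Path

def organisms_missing_icm_files(user_added_dir):
    """Return the names of organism subdirectories that contain no .icm files."""
    root = Path(user_added_dir)
    missing = []
    for organism_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Assumption: .icm files live directly inside the organism directory.
        if not any(organism_dir.glob("*.icm")):
            missing.append(organism_dir.name)
    return missing
```

Calling organisms_missing_icm_files() on your own .userAdded directory (with PHYMM_DIR replaced by your actual installation path) returns the organisms that would need to be added again. An empty list means every user-added organism has at least one .icm file.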
> New in v2.01 [2010.05.27.1634]:
- The README has been substantially expanded to include instructions on
parallelization, notes on interpreting PhymmBL's numeric scores, and several
other minor changes. Program code has not been changed. Thanks to Liam Elbourne
for helpful discussions.
> New in v2.0 [2010.05.25.1335]:
- A script has been added allowing users to add their own custom genomic sequence data to the local database.
The script takes new sequence data (as FASTA/multiFASTA files), adds them to the BLAST database,
and creates IMMs to model them. The user is polled to provide taxonomic data for each new organism.
- The setup script has been completely rewritten; users can now choose whether to download a completely new
copy of the RefSeq microbial database, or to update the existing local database with only RefSeq
sequences which have been added or have changed since the last install. (Genome content,
taxonomic information and model files for user-added organisms are stored separately from the RefSeq data,
so updates won't affect any custom content.)
- A script has been added to manually regenerate the local BLAST database.
- PhymmBL's combined scoring function was tweaked for the case of BLAST's E-value being reported
as "0.0", resulting in slightly better overall accuracy in this case.
- A collection of minor bugs and irritations has been fixed.
- Mac OS is now formally supported, but please see the README for a note on obtaining
wget, which isn't provided with the OS X suite of developer tools and is needed for setup to run
Because one of the main challenges of metagenomic analysis is the fact
that species are frequently encountered which have never before been sequenced,
we examined the performance of this system using increasingly less data from
organisms related to those from which query reads were sampled. The table below
summarizes predictive accuracy results from PhymmBL, the hybrid method incorporating
information from both Phymm and BLAST.
For instance, the information in the cell indexed by "Family excluded" and "Phylum"
means that when, for each query read in our test set, all organisms belonging
to the same family as the organism from which that read was sampled were
excluded from consideration -- i.e., when the best possible prediction is one
made at the order level -- PhymmBL was able to predict the correct phylum
of query reads 57.5% of the time, with a standard deviation (measured over 10
runs) of ± 0.6%.
Note that accuracy, as reported in the table below, is measured as the percentage
of all 100-bp query reads in the test data that received a correct label;
no reads are left unlabeled.
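As an illustration of how such figures are computed, the sketch below calculates percent accuracy for one run and then the mean and standard deviation across runs. The function names and toy labels are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption; the text above does not say which estimator was used.

```python
from statistics import mean, stdev

def percent_accuracy(predicted_labels, true_labels):
    """Percentage of reads whose predicted label matches the true label."""
    if not true_labels or len(predicted_labels) != len(true_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return 100.0 * correct / len(true_labels)

def summarize_runs(per_run_accuracies):
    """Mean and sample standard deviation across independent runs."""
    return mean(per_run_accuracies), stdev(per_run_accuracies)

# Toy phylum-level labels for four reads (hypothetical data).
predicted = ["Firmicutes", "Proteobacteria", "Firmicutes", "Actinobacteria"]
actual = ["Firmicutes", "Proteobacteria", "Proteobacteria", "Actinobacteria"]
run_accuracy = percent_accuracy(predicted, actual)
```

Because every read receives a label, the denominator is simply the total number of reads in the test set.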
Please see the paper for details on these and other
experiments. All synthetic test data used for the experiments described
in the paper (10 sets of 100-bp reads, plus one set each containing reads
of 200, 400, 800 and 1000 bp) can be downloaded here.
Clade excluded      | Species    | Genus      | Family     | Order      | Class      | Phylum
--------------------|------------|------------|------------|------------|------------|-----------
All matches allowed | 95.4 ± 0.2 | 99.1 ± 0.1 | 99.7 ± 0.1 | 99.8 ± 0.1 | 99.9 ± 0.1 | 99.9 ± 0.0
Species excluded    | -          | 58.5 ± 0.6 | 63.7 ± 0.6 | 66.3 ± 0.6 | 71.0 ± 0.5 | 76.8 ± 0.8
Genus excluded      | -          | -          | 26.9 ± 0.6 | 33.0 ± 0.6 | 44.6 ± 0.6 | 63.4 ± 0.6
Family excluded     | -          | -          | -          | 19.3 ± 0.5 | 33.4 ± 0.5 | 57.5 ± 0.6
Order excluded      | -          | -          | -          | -          | 23.8 ± 0.5 | 53.2 ± 0.6
Class excluded      | -          | -          | -          | -          | -          | 43.5 ± 0.7

Table: PhymmBL percent prediction accuracy and standard
deviations for classification experiments with 100-bp reads and different
clade levels excluded from comparison. Columns give the clade level at which
predictions were evaluated.
Obtaining the Software
This software is OSI
Certified Open Source Software.
Click to download the PhymmBL installation software as either a gzipped tarball or as a .zip file.
After downloading, move the downloaded file into a directory in which you
intend to store the PhymmBL program files and downloaded genomic
data, then uncompress it by typing:

    tar zxvf phymmbl_installer.tar.gz

PhymmBL's subdirectory structure will be created in your target directory,
as will the installer script and a README
file with instructions on building and using the system.
The software was developed and tested on a multi-core Linux system; it is expected to work
properly on any Unix-like system which meets its system requirements (see the
README for details, including an extra step Mac OS users
will need to take).
PLEASE NOTE: Setup is particularly computationally intensive: even on a
relatively powerful server, you should expect ground-up installation to take at
least 24 hours.
This work is supported in part by NIH grant R01-LM006845 to S.L. Salzberg.