Should I use Scimm or PhyScimm?
If your sequences are from a novel environment that is unlikely to be well represented in GenBank, use Scimm. If your environment is well studied and the organisms in it are likely to have sequenced relatives in GenBank, use PhyScimm.

How should I set the number of clusters k?
Properly determining the number of clusters is a difficult problem. For now, we ask the user to make an informed guess based on knowledge of the environment. Tools that look for informative genes on the sequences in order to estimate diversity, such as MetaPhyler, may be useful for this task, but this is an active area of research.

How long must my sequences be for accurate clustering?
In the experiments for the paper, we focused on 800 bp reads. Tests with 400 bp reads still produced reasonable results. We have not quantified Scimm's performance on any shorter sequences.

How should I run Scimm for a huge dataset?
The initial partitioning methods sample a subset of the data anyway, so the overall number of sequences will not affect that step. One option is to increase the value of the variable rsments_t in the script imm_cluster.py. By default, Scimm iterates until fewer than 0.05% of the sequences change clusters from one iteration to the next. This is fairly strict; a larger threshold such as 0.5% would decrease the number of computationally intensive iterations with only a slight decrease in accuracy.
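As a sketch of what that threshold controls (the variable name rsments_t comes from the text above; the surrounding logic here is a hypothetical illustration, not Scimm's actual code):

```python
def clusters_converged(num_reassigned, total_seqs, rsments_t=0.0005):
    """Stop iterating when the fraction of sequences that switched
    clusters in the last iteration drops below the threshold
    (0.05% by default; raising it to 0.005, i.e. 0.5%, ends the
    computationally intensive IMM iterations sooner)."""
    return num_reassigned / total_seqs < rsments_t
```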
Another option would be to cluster only a subset of the sequences and then "classify" the remaining reads into their highest-scoring cluster. We have not implemented this, but if you are interested, e-mail David Kelley.
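That unimplemented classification step could look roughly like this (a hypothetical sketch; score_imm stands in for scoring a read against a cluster's trained interpolated Markov model):

```python
def classify_read(read, cluster_imms, score_imm):
    """Assign a read to the cluster whose IMM scores it highest.
    cluster_imms maps cluster id -> trained model;
    score_imm(read, model) returns that model's score for the read."""
    return max(cluster_imms, key=lambda c: score_imm(read, cluster_imms[c]))
```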

How should I run Scimm to produce clusters fast?
To obtain the initial partitioning, CBCBCompostBin is faster than LikelyBin, so set --ls=0 to use only CBCBCompostBin. You can then speed up CBCBCompostBin by decreasing the size of the oligonucleotides it counts (set --cs=4) and decreasing the number of reads it uses (set --cn=2000). Finally, as described above for large datasets, you can increase rsments_t for the IMM clustering.
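Putting those speed-oriented settings together, an invocation might look like the following (the scimm.py script name and the -r/-k options are assumptions for illustration; only --ls, --cs, and --cn are taken from the text):

```python
import subprocess

# Flags from the text: skip LikelyBin (--ls=0), count shorter
# oligonucleotides in CBCBCompostBin (--cs=4), and sample fewer
# reads for it (--cn=2000).
cmd = ["python", "scimm.py", "-r", "reads.fa", "-k", "10",
       "--ls=0", "--cs=4", "--cn=2000"]
# subprocess.run(cmd, check=True)  # uncomment to actually launch Scimm
```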

How should I run Scimm to get the best clusters possible when computation time is of no concern?
First, increase the amount of work done in the initial partitioning; for example, set --ls=2, --ln=5000, --lo=4 and --cs=2, --cn=5000, --co=6. Then try clustering with a few different values of k, the number of clusters.