What is Scimm?

Scimm is an unsupervised clustering method intended for metagenomic DNA sequences. Every cluster is modeled by an interpolated Markov model (IMM), and a variant of the *k*-means iterative optimization algorithm is used to cluster the sequences in such a way that the likelihood of generating the sequences with the *k* IMMs is maximized.

Installation

First, download the source code. Then install Scimm and PhyScimm with the Python scripts `install_scimm.py` and `install_physcimm.py`. These will set up directories for the various software packages on which Scimm depends, and place the executables and scripts needed to run Scimm in a bin directory.

Note that installation of PhyScimm requires an installation of Phymm, which requires ~50 GB of space and ~24 hours to build an IMM for every genome in GenBank. See the Phymm homepage for details. If you already have Phymm installed, set the variable *prior_phymm_dir* at the top of `install_physcimm.py` to the Phymm installation path.

Scimm usage

The following is a detailed description of the options used to control the Scimm script:

Usage: scimm.py -s <sequence file> -k <# clusters> [options]*

| Arguments: | Description: |
| --- | --- |
| `-s` | FASTA file of sequences to be clustered. |
| `-k` | Number of clusters. |
| **Options:** | |
| `-p` | Number of processors to use for computation. |
| `--ls` | Number of initial partitionings to obtain using LikelyBin. Default=1. |
| `--ln` | Number of sequences to sample for each LikelyBin initial partitioning. Default=3000. |
| `--lt` | Number of parallel Markov chain Monte Carlo optimizations to perform within LikelyBin on each set of sequences. Running more optimizations reduces the dependence on the initial parameter setting. Default=2. |
| `--lo` | Order of the LikelyBin Markov model. Valid values are 2, 3, and 4; order 4 can be slow and I have encountered bugs with it. Default=3. |
| `--cs` | Number of initial partitionings to obtain using CBCBCompostBin. Default=1. |
| `--cn` | Number of sequences to sample for each CBCBCompostBin initial partitioning. Default=3000. |
| `--co` | Order of CBCBCompostBin oligonucleotides to count. Default=5. |
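
As a minimal sketch of how these flags combine, the helper below assembles a Scimm command line from Python. It assumes `scimm.py` is reachable on your PATH after installation. The function name `build_scimm_command` and the file name `reads.fa` are hypothetical and used only for illustration; only flags listed above are emitted.

```python
import shlex


def build_scimm_command(seq_file, k, processors=None,
                        likelybin_runs=None, compostbin_runs=None):
    """Assemble an argument list for scimm.py (hypothetical helper).

    Only flags documented above are used. Options left as None are
    omitted so that Scimm falls back to its own defaults.
    """
    cmd = ["scimm.py", "-s", seq_file, "-k", str(k)]
    if processors is not None:
        cmd += ["-p", str(processors)]
    if likelybin_runs is not None:
        cmd += ["--ls", str(likelybin_runs)]
    if compostbin_runs is not None:
        cmd += ["--cs", str(compostbin_runs)]
    return cmd


# Example: cluster reads.fa (an assumed file name) into 5 clusters
# on 4 processors, with 2 LikelyBin initial partitionings.
cmd = build_scimm_command("reads.fa", 5, processors=4, likelybin_runs=2)
print(" ".join(shlex.quote(arg) for arg in cmd))
# prints: scimm.py -s reads.fa -k 5 -p 4 --ls 2
```

The returned list can be handed to `subprocess.run(cmd, check=True)` if you prefer to launch Scimm from a Python driver script rather than from the shell.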

PhyScimm usage

The following is a detailed description of the options used to control the PhyScimm script:

Usage: physcimm.py -s <sequence file> [options]*

| Arguments: | Description: |
| --- | --- |
| `-s` | FASTA file of sequences to be clustered. |
| **Options:** | |
| `-p` | Number of processors to use for computation. |
| `-n` | Number of sequences to sample for each Phymm initial partitioning. Default=3000. |
| `--taxlevel` | Taxonomic level at which to cluster sequences using Phymm classifications. Default=family. |
| `--minbp_pct` | Minimum proportion of bases that must be assigned to a class for it to become a cluster. The purpose of this value is to eliminate clusters arising from Phymm misclassifications. Setting it higher filters out more incorrect clusters but may also remove some low-abundance species; setting it lower does the opposite. Default=0.01. |
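
To make the effect of `--minbp_pct` concrete, here is a small sketch of a proportion-of-bases filter. It is an illustration of the idea, not PhyScimm's actual code: the function name `filter_classes_by_bp`, the example counts, and the choice of `>=` at the threshold are all assumptions.

```python
def filter_classes_by_bp(bp_per_class, minbp_pct=0.01):
    """Keep classes whose share of all assigned bases reaches minbp_pct.

    bp_per_class maps a class label (for example a family name) to the
    number of bases assigned to it. Using >= at the boundary is an
    assumption; PhyScimm may treat the exact threshold differently.
    """
    total = sum(bp_per_class.values())
    if total == 0:
        return {}
    return {label: bp for label, bp in bp_per_class.items()
            if bp / total >= minbp_pct}


# Made-up base counts, for illustration only.
counts = {
    "Enterobacteriaceae": 600000,
    "Bacillaceae": 395000,
    "Vibrionaceae": 5000,  # 0.5% of all bases
}
print(sorted(filter_classes_by_bp(counts, minbp_pct=0.01)))
# prints: ['Bacillaceae', 'Enterobacteriaceae']
```

With the default of 0.01, the class holding 0.5% of the bases is dropped, which may be a misclassification or a genuinely rare family. Lowering the threshold to 0.001 would keep all three.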

Known Issues

**Matlab compostbin.py**

If you don't have Matlab or are encountering problems with the compostbin.py runs, such as never receiving your command prompt back after running Scimm, you can turn off compostbin without doing much harm. Just set "--cs 0". If you can handle the extra computation time, consider raising the number of LikelyBin runs with "--ls" in order to more widely sample the space of initial partitionings. Additionally, if you look at compostbin.py and understand why my Matlab script call fails on your machine, let me know!
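
If you launch Scimm from a Python driver script, one way to avoid waiting indefinitely on a hung run is to put a time limit on the child process. The sketch below is a generic wrapper built on the standard `subprocess` module, not part of Scimm; `run_with_timeout` is a hypothetical name, and the demonstration uses a stand-in child process that just sleeps so the snippet runs without Scimm installed.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeds timeout_s seconds.

    subprocess.run kills the direct child on timeout. Processes that the
    child itself started (such as a Matlab session) may survive and need
    to be cleaned up separately; that case is not handled here.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Stand-in for a hung run: a child process that sleeps for 5 seconds.
demo = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(demo, timeout_s=0.5))
# prints: None
```

A `None` result would be the cue to rerun with `--cs 0`, optionally with a larger `--ls` value, as described above.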