Jellyfish is a tool for counting k-mers in DNA sequences.
Jellyfish is a k-mer counter based on a multi-threaded hash table implementation.
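To make the term concrete, the sketch below counts k-mers in a single string with a plain Python Counter. It only illustrates what is being counted and is not Jellyfish's multi-threaded hash table; the function name count_kmers is hypothetical.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count every length-k substring (k-mer) of a DNA sequence."""
    sequence = sequence.upper()
    counts = Counter()
    # Slide a window of width k along the sequence, one base at a time.
    for i in range(len(sequence) - k + 1):
        counts[sequence[i:i + k]] += 1
    return counts

if __name__ == "__main__":
    for kmer, n in sorted(count_kmers("ACGTACGTAC", 4).items()):
        print(kmer, n)
```

A sequence shorter than k yields an empty result, since no window of width k fits.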
To count k-mers, use a command like:
jellyfish count -m 22 -o output -c 3 -s 10000000 -t 32 input.fasta
This counts the 22-mers in input.fasta using 32 threads. The counter field in the hash uses only 3 bits, and the hash has at least 10 million entries. If the size of the table is s=2^l and the maximum reprobe value is less than 2^r, then the memory usage per entry in the hash is 2k-l+r+1 bits (bits, not bytes).
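As a worked instance of the formula above, the following sketch computes l, r and the per-entry size for the example command. It assumes the table size is the smallest power of two that holds the requested number of entries, and it uses a maximum reprobe value of 62 purely for illustration; neither detail is stated on this page, and the helper names are hypothetical. The formula is applied exactly as written, so whether the counter field set by -c is accounted for separately is left open.

```python
def log2_table_size(min_entries):
    """Smallest l such that 2**l >= min_entries (assumed rounding rule)."""
    return max(min_entries - 1, 0).bit_length()

def reprobe_bits(max_reprobe):
    """Smallest r such that max_reprobe < 2**r."""
    return max_reprobe.bit_length()

def bits_per_entry(k, l, r):
    """Per-entry memory in bits, from the formula 2k - l + r + 1."""
    return 2 * k - l + r + 1

if __name__ == "__main__":
    k = 22
    l = log2_table_size(10_000_000)
    r = reprobe_bits(62)  # 62 is an illustrative value, not taken from this page
    per_entry = bits_per_entry(k, l, r)
    total_bytes = (2 ** l) * per_entry / 8
    print(f"l={l} r={r} bits/entry={per_entry} table~{total_bytes / 2**20:.1f} MiB")
```

Note that a larger table lowers the per-entry cost (the -l term) while raising the number of entries, so the total is what matters when sizing -s.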
To save space, the hash table supports variable-length counters, i.e. a k-mer occurring only a few times will use a small counter, while a k-mer occurring many times will use multiple entries in the hash. The -c option specifies the length of the small counter. The tradeoff is: a low value saves space per entry in the hash but increases the number of entries used, hence possibly requiring a larger hash. In practice, use a value for -c such that most of your k-mers require only 1 entry. For example, to count k-mers in a genome, where most of the sequence is unique, use -c1 or -c2. For sequencing reads, use a value for -c large enough to count up to twice the coverage.
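The rule of thumb for sequencing reads can be turned into a small calculation. The sketch assumes a c-bit counter field can hold values up to 2^c - 1, which this page does not state explicitly; counter_bits_for and suggested_c are hypothetical helpers.

```python
def counter_bits_for(max_count):
    """Smallest c such that a c-bit field can hold max_count.

    Assumes the field stores values in the range 0..2**c - 1.
    """
    return max(1, max_count.bit_length())

def suggested_c(coverage):
    """Apply the 'count up to twice the coverage' rule of thumb for reads."""
    return counter_bits_for(2 * coverage)

if __name__ == "__main__":
    for coverage in (10, 30, 100):
        print(coverage, suggested_c(coverage))
```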
When the orientation of the sequences in the input fasta file is not known, e.g. in sequencing reads, using --both-strands makes the most sense.
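When the strand is unknown, a k-mer and its reverse complement describe the same stretch of double-stranded DNA, so it is natural to tally them together. The sketch below does this by mapping each k-mer to a canonical representative, taken here as the lexicographically smaller of the two. That choice of representative is an assumption made for illustration and may differ from what --both-strands does internally.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]

def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

if __name__ == "__main__":
    for kmer in ("AAAAC", "GTTTT", "ACGT"):
        print(kmer, reverse_complement(kmer), canonical(kmer))
```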
Count k-mers in one or many fasta file(s). There is no restriction on the size of the fasta files, the number of sequences or the size of the sequences in the fasta files. However, the inputs must be files on disk and not pipes, because the files are memory mapped.
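Because of the memory-mapping requirement, it can be useful to check the inputs before starting a long run. The following is a minimal sketch of such a check, assuming the inputs are given as paths; non_regular_inputs is a hypothetical helper and is not part of Jellyfish.

```python
import os
import stat

def non_regular_inputs(paths):
    """Return the paths that are not regular files (pipes, directories, devices...)."""
    return [p for p in paths if not stat.S_ISREG(os.stat(p).st_mode)]

if __name__ == "__main__":
    import sys
    bad = non_regular_inputs(sys.argv[1:])
    if bad:
        print("not regular files:", ", ".join(bad))
```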
Display statistics or dump the full content of the hash table in an easily parsable text format.
By default, it displays the statistics in the header of the file. These are:
Create a histogram with the number of k-mers having a given count. Bucket i tallies the k-mers whose count c satisfies low+i*inc <= c < low+(i+1)*inc.
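The bucketing can be sketched as follows. It assumes each bucket spans inc consecutive count values, so a count c lands in bucket (c - low) // inc, and that counts below low are skipped. Skipping low counts and labeling each bucket by its low end point are assumptions made for illustration, and the helper names are hypothetical.

```python
from collections import Counter

def bucket_index(c, low, inc):
    """Index i of the bucket with low + i*inc <= c < low + (i+1)*inc."""
    return (c - low) // inc

def histogram(kmer_counts, low=1, inc=1):
    """Tally how many k-mers fall in each bucket, keyed by the bucket's low end point."""
    buckets = Counter()
    for c in kmer_counts:
        if c < low:
            continue  # assumed behavior: counts below `low` are not tallied
        i = bucket_index(c, low, inc)
        buckets[low + i * inc] += 1
    return dict(sorted(buckets.items()))

if __name__ == "__main__":
    print(histogram([1, 1, 2, 5, 6, 12], low=1, inc=5))
```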
query
Query a database created with jellyfish count. It reads k-mers from the standard input and writes the counts to the standard output. For example:
$ echo "AAAAA ACGTA" | jellyfish query database
AAAAA 12
ACGTA 3
Version
Version: 0.9 of 2010/10/1
Bugs
Copyright & License
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see
<http://www.gnu.org/licenses/>.
Authors
Guillaume Marcais
University of Maryland
gmarcais@umd.edu
Carl Kingsford
University of Maryland
carlk@umiacs.umd.edu