JELLYFISH - Fast, Parallel k-mer Counting for DNA

Guillaume Marçais (1), Carl Kingsford (2)

(1) Program in Applied Mathematics & Statistics, and Scientific Computation, University of Maryland, College Park
(2) Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park

Overview

JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA. A k-mer is a substring of length k, and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence. JELLYFISH can count k-mers an order of magnitude faster, and with an order of magnitude less memory, than other k-mer counting packages. It achieves this by using an efficient encoding of a hash table and by exploiting the "compare-and-swap" CPU instruction to increase parallelism.
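To make the definition of a k-mer concrete, here is a minimal sketch of naive k-mer counting in Python. It is not how JELLYFISH is implemented: it is single-threaded, uses an ordinary dictionary rather than a compact lock-free hash table, and the function name count_kmers is a hypothetical name chosen for illustration.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every length-k substring of `sequence` (naive, single-threaded).

    Illustrative only: JELLYFISH itself uses a compact, lock-free hash table.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    counts = Counter()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    for i in range(len(sequence) - k + 1):
        counts[sequence[i:i + k]] += 1
    return counts


counts = count_kmers("ACGTACGT", 3)
for kmer, count in sorted(counts.items()):
    print(kmer, count)
```

For the 8-base example sequence and k = 3 there are 6 overlapping 3-mers, of which ACG and CGT each occur twice.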

JELLYFISH is a command-line program that reads FASTA and multi-FASTA files containing DNA sequences. It outputs its k-mer counts in a binary format, which can be translated into a human-readable text format using the "jellyfish dump" command. See the documentation below for more details.
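For readers unfamiliar with the input format, the following sketch shows how a FASTA or multi-FASTA stream is laid out: each record starts with a header line beginning with ">", followed by one or more lines of sequence. This is an illustration in Python, not JELLYFISH's own parser; the function name read_fasta and the example records are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA or multi-FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines of one record may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record multi-FASTA input, held in memory for the example.
example = io.StringIO(">read1\nACGT\nACGT\n>read2\nTTGA\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, seq)
```

A multi-FASTA file is simply several such records concatenated, which is why the reader emits a record each time it meets a new header.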

If you use JELLYFISH in your research, please cite:

Guillaume Marcais and Carl Kingsford, A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics (2011) 27(6): 764-770 (first published online January 7, 2011) doi:10.1093/bioinformatics/btr011

Requirements

JELLYFISH runs on 64-bit Intel-compatible processors running Linux or FreeBSD (including Intel Macs). It requires GNU GCC to compile.

Download

The current version of JELLYFISH is 2.0 and it is now hosted at a new location. Users are encouraged to use 2.0, which contains a number of enhancements such as support for longer k-mer sizes and better dynamic memory management.

This page will no longer be updated with JELLYFISH versions. Please see the link above for future releases.

Older Versions

The last version of JELLYFISH 1 is 1.1.11.

Source code for JELLYFISH is freely available via the link below.

Change Log

Version 2.0 includes a large number of enhancements. See the JELLYFISH 2.0 homepage for more details.

Version 1.1.11 now compiles and runs on Windows 7 with Cygwin, provided that gcc, g++, make and diffutils are installed in Cygwin.

Version 1.1.10 has various bug fixes and minor changes.

Version 1.1.6 has the following changes:

Version 1.1.5 has better handling of invalid characters in the input sequence and compiles properly with gcc 4.7.0.

Version 1.1.4 is a bug-fix release. It fixes a segmentation fault on Mac OS X and works around non-standard Linux behavior with the special file /dev/fd/1.

Version 1.1.3 fixes an issue with the histogram computed from merged files. The histogram computed from a database obtained by merging multiple files (e.g. with jellyfish merge) was incorrect: only a subset of the k-mers was counted in the histogram. The merged file itself, however, is correct, and results from jellyfish stats or jellyfish dump are valid.

Version 1.1.2 fixes an issue with SSE instructions causing a segfault on some computers.

Version 1.1.1 fixes several minor bugs and 3 major bugs:

These bugs have been fixed and everybody is encouraged to upgrade.

Version 1.1 adds several features:

Version 1.0.2 is a bug fix release from 1.0.1 and everybody is encouraged to upgrade to 1.0.2.

Contact

For questions and comments write to gmarcais at umd.edu.

Funding

This work was supported by the National Science Foundation grants EF-0849899 and IIS-0812111. G.M. was supported by National Science Foundation grant DMS-0616585 and National Institutes of Health grant 1R01HG0294501.


Last modified: July 26, 2011.