1. How do I choose k?
We have two competing goals in choosing k. Smaller values of k provide greater discriminative power for identifying the location of errors in the reads and allow the algorithm to run faster and use less space. However, k cannot be so small that there is a high probability that one k-mer in the genome would match another k-mer in the genome after a single nucleotide substitution, because such occurrences confound error identification.
We recommend setting k such that the probability that a randomly selected k-mer from the space of (4^k)/2 (for odd k, considering reverse complements as equivalent) possible k-mers occurs in a random sequence of nucleotides the size of the sequenced genome is ~0.01. A genome of size G contains ~G k-mers, so this probability is ~2G/4^k; setting 2G/4^k = 0.01 gives 4^k = 200G. Thus, for a genome of size G, set k to the nearest integer such that k ~= log(200G)/log(4).
For a ~5 Mb genome such as E. coli, we set k to 15, and for the ~3 Gb human genome, we set k to 19.
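The arithmetic above is easy to script; a minimal sketch (the function name is illustrative, not part of Quake):

```python
import math

def choose_k(genome_size):
    """Pick k so that a random k-mer is expected to occur in a genome
    of the given size with probability ~0.01, i.e. solve
    2G / 4^k = 0.01, which gives k = log(200G) / log(4)."""
    return round(math.log(200 * genome_size) / math.log(4))

print(choose_k(5e6))  # ~5 Mb E. coli genome -> 15
```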
2. Can I use Quake for exome sequencing?
Absolutely! However, Quake's second step, where the coverage cutoff is chosen, may not function properly. I recommend running the steps yourself rather than using quake.py, skipping cov_model.py for step 2 and setting the cutoff by hand by examining the histogram of coverage. When you set the k-mer size, consider the size of the exome rather than the whole genome.
3. Can I use Quake for transcriptome sequencing?
I have not, but you could. Make sure you understand exactly what Quake is doing and be careful. The main problem is that a single-copy k-mer may have come from a real but rare transcript. Forget quake.py. Run the steps yourself, skipping cov_model.py for step 2 and setting the cutoff by hand, probably to a very low value like 1. Then use the -u option to output the reads that Quake could not correct; you will probably want to hang on to these because they could be real.
4. Can I use Quake for metagenomic sequencing?
Similar story to transcriptome sequencing. I have not, but you could; tread carefully.
5. How do I count q-mers using Hadoop?
Included in the Quake package are the programs count-qmers and reduce-qmers, which operate on streams of reads through standard input, making them appropriate for use on a Hadoop cluster via Hadoop streaming. In the src/ directory, there is an example bash script hadoop_count-qmers.sh which you can adapt to your cluster.
count-qmers works by hashing k-mer and count pairs with the C++ hash_map, which will require a lot of space for larger genomes. If your cluster nodes have limited memory, you may need to limit the memory used by count-qmers using the -l option. If your read files are unzipped in the HDFS, the size of a file chunk given to a map task will probably be small enough that you will not encounter any issues. However, if you have large zipped files presented to a single map task, you will definitely have to be careful to avoid thrashing.
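Conceptually, the hashing step is just a map from k-mer to count. A toy Python sketch of the same idea (the real count-qmers is C++ and accumulates quality-weighted q-mer counts rather than the plain occurrence counts shown here):

```python
from collections import defaultdict

def count_kmers(reads, k):
    # Hash each k-mer to its occurrence count, analogous to the
    # C++ hash_map in count-qmers (q-mers would add a quality
    # weight instead of 1 for each occurrence).
    counts = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return dict(counts)

print(count_kmers(["ACGTA", "CGTAC"], 3))  # CGT and GTA occur twice
```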
6. Can Quake handle my reads' quality values on scale XYZ?
We do our best to guess what scale the quality values are on by looking at a sample of reads. If there are ASCII values less than 64, we set the scale to 33; otherwise we set the scale to 64. If you notice that Quake is guessing wrong, specify the scale yourself with the -q option and let us know so we can fix it.
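The guess described above amounts to a simple scan of the quality characters; a sketch, with a hypothetical function name:

```python
def guess_quality_offset(quality_strings):
    """Guess the Phred ASCII offset from a sample of quality strings:
    any character below ASCII 64 implies the offset-33 (Sanger) scale;
    otherwise assume the offset-64 scale."""
    for qual in quality_strings:
        if any(ord(c) < 64 for c in qual):
            return 33
    return 64

print(guess_quality_offset(["IIIIHG#"]))  # '#' is ASCII 35 < 64 -> 33
print(guess_quality_offset(["hhhgff"]))   # all chars >= 64 -> 64
```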
7. How can I diagnose coverage cutoff problems?
cov_model.py generally does a fine job of deciding on an appropriate cutoff, but you can sanity check the results by examining the histogram of k-mer coverages. cov_model.py will randomly sample k-mers and their counts into the file kmers.txt, which the R scripts will look for. We included a program bin/kmer_hist.r that will draw the histogram for integer bins into kmer_hist.pdf and print the sampled counts for 1/10-increment bins into kmer_hist.bin10.txt.
Generally, the counts will be extremely high in the interval [0,1] but quickly subside after that. You want the cutoff to be greater than most of that distribution, which will include error k-mers, but not infringe on the normally distributed true k-mer distribution. If the cutoff is too high, true k-mers that randomly acquired low coverage will be untrusted, and reads with those k-mers will be removed from the set. This may badly fragment a genome assembly.
If you don't have deep coverage of the genome, cov_model.py may suggest a very small cutoff value. Here it may be useful to view a finer resolution histogram to make sure that there are enough error k-mers (and not too many true k-mers) less than your cutoff to make error correction worthwhile.
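If R is inconvenient, the same sanity check can be sketched in Python from the sampled counts (the function names are illustrative; check kmers.txt for the actual file format before parsing it):

```python
from collections import Counter

def coverage_histogram(counts):
    # Bin k-mer coverages into integer bins, analogous to the
    # integer-bin histogram drawn by kmer_hist.r.
    return Counter(int(c) for c in counts)

def check_cutoff(counts, cutoff):
    # Report how many sampled k-mers fall below the cutoff (likely
    # error k-mers) versus at or above it (likely true k-mers).
    below = sum(1 for c in counts if c < cutoff)
    return below, len(counts) - below

# Toy sample: a spike of low-coverage error k-mers plus a
# cluster of true k-mers around 40x coverage.
sample = [1, 1, 1, 2, 1, 40, 38, 42, 41, 39]
print(coverage_histogram(sample)[1])  # 4 singleton k-mers
print(check_cutoff(sample, 3))        # (5, 5)
```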
8. Can I save the trusted k-mer data structure for repeated use?
If you expect to run correct many times and have a large counts file, you can build the data structure used to store trusted k-mers offline.
build_bithash -k [k-mer size] -m [counts file] -c [cutoff] -o [bithash file]
correct -k [k-mer size] -f [fastq list file] -b [bithash file] -p 4
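Conceptually, the bithash is a bit array with one bit per possible k-mer, set when the k-mer's count reaches the cutoff. A rough Python sketch of that idea (not Quake's actual on-disk format, and without the reverse-complement handling the real structure would need):

```python
def kmer_index(kmer):
    # Encode A,C,G,T as base-4 digits to index into the bit array.
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    idx = 0
    for base in kmer:
        idx = idx * 4 + code[base]
    return idx

def build_bithash(counts, k, cutoff):
    # One bit per possible k-mer: 4^k bits total.
    bits = bytearray((4 ** k + 7) // 8)
    for kmer, count in counts.items():
        if count >= cutoff:
            i = kmer_index(kmer)
            bits[i // 8] |= 1 << (i % 8)
    return bits

def trusted(bits, kmer):
    i = kmer_index(kmer)
    return bool(bits[i // 8] & (1 << (i % 8)))

bh = build_bithash({"ACG": 12, "CGT": 1}, 3, 2)
print(trusted(bh, "ACG"), trusted(bh, "CGT"))  # True False
```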
9. What are my options if I don't have a large memory machine for large values of k?
Decreasing k to a feasible value is one option. For example, using 18-mers to correct human reads would mostly work, but you would make slightly more mis-corrections and wrongly keep more reads with errors.
Let me know if this is a problem for you because there are other data structures that we have considered implementing, such as a Bloom filter.
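For reference, the Bloom filter idea mentioned above trades a small false positive rate for much less memory than one bit per possible k-mer. A toy sketch, not part of Quake:

```python
import hashlib

class BloomFilter:
    # Toy Bloom filter: m bits, n hash functions derived from blake2b
    # with different salts. Membership tests can give false positives
    # but never false negatives.
    def __init__(self, m_bits=1 << 16, n_hashes=4):
        self.m = m_bits
        self.n = n_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _indices(self, item):
        for salt in range(self.n):
            h = hashlib.blake2b(item.encode(), salt=bytes([salt]))
            yield int.from_bytes(h.digest()[:8], "big") % self.m

    def add(self, item):
        for i in self._indices(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, item):
        return all(self.bits[i // 8] & (1 << (i % 8))
                   for i in self._indices(item))

bf = BloomFilter()
bf.add("ACGTACGTACGTACGTACG")
print("ACGTACGTACGTACGTACG" in bf)  # True (no false negatives)
```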