BlastReduce: High Performance Short Read Mapping with MapReduce

Michael C. Schatz


Next-generation DNA sequencing machines generate sequence data at an unprecedented rate, and traditional single-processor sequence alignment algorithms are struggling to keep pace. BlastReduce is a new parallel read mapping algorithm optimized for aligning sequence data from those machines to reference genomes, for use in a variety of biological analyses, including SNP discovery, genotyping, and personal genomics. It is modeled after the widely used BLAST sequence alignment algorithm, but uses the open-source Hadoop implementation of MapReduce to parallelize execution across multiple compute nodes. To evaluate its performance, BlastReduce was used to map next-generation sequence data to a reference bacterial genome in a variety of configurations. The results show that BlastReduce scales linearly with the number of sequences processed and achieves high speedup as the number of processors increases. Furthermore, BlastReduce is fully compatible with cloud computing and can be easily executed on massively parallel remote resources to meet peak demand. BlastReduce is available as open source at http://www.cbcb.umd.edu/software/blastreduce/.
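
To make the general idea concrete, the following is a minimal single-process sketch of BLAST-style seed-and-extend read mapping phrased as map, shuffle, and reduce steps. It is not the BlastReduce source code and does not use Hadoop: the seed length `K`, the function names, the toy sequences, and the ungapped mismatch-counting extension are all assumptions chosen for illustration.

```python
from collections import defaultdict

K = 4  # seed (k-mer) length; an assumed illustrative value


def map_seeds(record):
    """Map step: emit (k-mer, (source, name, offset)) for every k-mer of a sequence."""
    source, name, seq = record
    for i in range(len(seq) - K + 1):
        yield seq[i:i + K], (source, name, i)


def shuffle(pairs):
    """Group values by key, standing in for the MapReduce shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seeds(values):
    """Reduce step: pair each read occurrence of a k-mer with each reference occurrence."""
    ref_hits = [(name, off) for source, name, off in values if source == "ref"]
    read_hits = [(name, off) for source, name, off in values if source == "read"]
    for read_name, read_off in read_hits:
        for _ref_name, ref_off in ref_hits:
            # Candidate start of the whole read on the reference.
            yield read_name, ref_off - read_off


def extend(read, ref, start, max_mismatches):
    """Ungapped extension: return the mismatch count, or None if out of range or too many."""
    if start < 0 or start + len(read) > len(ref):
        return None
    mismatches = sum(1 for a, b in zip(read, ref[start:start + len(read)]) if a != b)
    return mismatches if mismatches <= max_mismatches else None


def map_reads(reference, reads, max_mismatches=1):
    """Run the map, shuffle, reduce, and extend steps in one process."""
    records = [("ref", "ref", reference)]
    records += [("read", name, seq) for name, seq in reads.items()]
    pairs = [kv for record in records for kv in map_seeds(record)]

    candidates = set()
    for values in shuffle(pairs).values():
        candidates.update(reduce_seeds(values))

    hits = {}
    for read_name, start in sorted(candidates):
        mismatches = extend(reads[read_name], reference, start, max_mismatches)
        if mismatches is not None:
            hits.setdefault(read_name, []).append((start, mismatches))
    return hits


# Toy data, invented for this example.
reference = "ACGTTGCATGCCGTA"
reads = {"r1": "TTGCATGC", "r2": "GCCGAA"}
print(map_reads(reference, reads))
```

In this sketch each read is reported with its start position on the reference and its mismatch count. Because the map step treats every sequence independently and the reduce step treats every k-mer independently, both are the kind of work that a MapReduce framework can spread across compute nodes.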

Source Code coming soon