High Performance Computing @ CBCB



Cloud Computing



Recent advances in DNA sequencing technology from Illumina, 454 Life Sciences, ABI, and Helicos have enabled next-generation sequencing instruments to sequence the equivalent of the human genome (~3 billion bp) in a few days and at low cost. In contrast, the sequencing for the Human Genome Project of the late 1990s and early 2000s required years of work on hundreds of machines, with sequencing costs measured in hundreds of millions of dollars. This dramatic increase in efficiency has spurred tremendous growth in applications for DNA sequencing.

For example, whereas the Human Genome Project sought to sequence the genome of a small group of individuals, the 1000 Genomes Project aims to catalog the genomes of 1000 individuals from all regions of the globe in just three years. Recent related projects aim to catalog all of the biologically active transcribed regions of the genome over a wide variety of environmental and disease conditions. Similar studies are also underway for model organisms such as mouse, rat, chicken, rice, and yeast, and for other organisms of interest.

Cheap and fast sequencing technologies are also providing scientists with the tools to analyze the largely unknown microbial biosphere. The majority of microbes inhabiting our world and our bodies are unknown and cannot be easily manipulated in the laboratory. In recent years a new scientific field has emerged - metagenomics - that aims to characterize entire microbial communities by sequencing DNA extracted directly from an environment. Several studies have already targeted a range of natural environments (ocean, soil, mine drainage) as well as the commensal microbes inhabiting the bodies of humans, other animals, and insects. The latter are the target of a new NIH initiative - the Human Microbiome Project - an effort to characterize the diversity of human-associated microbial communities and to understand their contributions to human health. For more details see our description of metagenomic research at the CBCB.

The raw data generated by the new sequencing instruments often exceed 1 terabyte and are already straining the computational infrastructure typically available in an average research lab. Furthermore, biological datasets are only increasing in size as data for more individuals and more environments are collected, further complicating computational analyses. Even seemingly simple tasks, such as mapping a collection of sequencing reads to the human reference genome, can require days of computation, while de novo assembly of an entire human-sized genome using new generation data has yet to be attempted. The only long-term solution to the challenges posed by the massive datasets being generated is to combine computational biology research with advances from high performance computing (HPC).

At the CBCB, research in high-performance computational biology aims to leverage two recent technological advances: (1) massively parallel distributed computing clusters made available over the internet as a pay-per-use service - a paradigm called Cloud Computing; and (2) the availability of highly parallel graphics processing units (GPUs) in high-end graphics cards. These research directions, and recent results of our research, are described in more detail below.



Mihai Pop
Steven Salzberg
Amitabh Varshney
Jimmy Lin

Graduate Students

Ben Langmead
Chris Hill

Undergraduate Students

Carl Albach
Sebastian Gomez


Michael Schatz - now faculty at Cold Spring Harbor Laboratory
Cole Trapnell - now at the Broad Institute

Cloud Computing Research

Research on cloud computing at the CBCB is supported under the NSF Cluster Exploratory Program (CluE) - grant IIS-0844494.

Our research has recently received quite a bit of media attention.

Our research aims to develop software tools for genome sequence alignment and genome assembly that can take advantage of the large-scale parallelism offered by the Google/IBM computing cluster. One goal of our work is to provide biologists with the ability to simply rent computational resources through one of the available cloud computing services, thereby obviating the need to establish a large computing infrastructure at their institution. Effectively, we would transform computation from a capital investment into a line item in the research budget, similar to how laboratory reagents are budgeted.

At a more fundamental level, we will explore the limits of the MapReduce computation paradigm (as implemented in the Hadoop system) when applied to bioinformatics applications. In particular, genome assembly programs rely on graph-theoretic algorithms that are notoriously difficult to parallelize. We will also evaluate the cost imposed by the transfer of data to and from the compute cluster: for the large datasets being analyzed, communication will likely account for a significant fraction of the total analysis cost.
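To make the MapReduce paradigm concrete, the following is a minimal single-machine sketch in Python of the map, shuffle, and reduce steps applied to a simple bioinformatics task: counting k-mers across sequencing reads. It is not taken from any of the tools described on this page and does not use Hadoop; the function names (`map_kmers`, `shuffle`, `reduce_count`, `count_kmers`) and the toy reads are assumptions made for illustration only. In a real cluster, the framework would run the map and reduce functions on many machines and perform the shuffle over the network.

```python
from collections import defaultdict


def map_kmers(read_id, sequence, k):
    """Map step: emit (k-mer, (read_id, offset)) pairs for one read."""
    for offset in range(len(sequence) - k + 1):
        yield sequence[offset:offset + k], (read_id, offset)


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_count(kmer, values):
    """Reduce step: summarize all occurrences of one k-mer."""
    return kmer, len(values)


def count_kmers(reads, k):
    """Run map, shuffle, and reduce in sequence on a dict of reads."""
    pairs = (
        pair
        for read_id, sequence in reads.items()
        for pair in map_kmers(read_id, sequence, k)
    )
    grouped = shuffle(pairs)
    return dict(reduce_count(kmer, values) for kmer, values in grouped.items())


# Hypothetical toy input: two short reads.
reads = {"r1": "ACGTAC", "r2": "GTACGT"}
counts = count_kmers(reads, 3)
```

Each read is processed independently in the map step, which is why tasks such as read mapping fit this model well. Graph algorithms for assembly are harder to express this way because a single pass over independent records is rarely enough.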



Combines the efficiency of Bowtie with advances in Cloud Computing to enable deep-coverage human resequencing and genotyping in about an hour per individual.


Highly Sensitive Short Read Mapping with MapReduce


Cloud-computing genome assembler


A fast, distributed, clustering approach to sequence querying using MapReduce.

Also see a description of related research at the CBCB on software for analyzing data from new generation sequencing technologies.

Publications related to this project

  1. Hill, C.M., Astrovskaya, I., Huang, H., Koren, S., Treangen, T., Memon, A., and Pop, M. (2013) De novo likelihood-based measures for comparing metagenomic assemblies. in preparation.

  2. Bradnam, K.R., et al. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. arXiv preprint arXiv:1301.5406.

  3. Ghodsi, M., Hill, C.M., Astrovskaya, I., Lin, H., Sommer, D.D., Koren, S., and Pop, M. (2013) De novo likelihood-based measures for assembly validation. under review.

  4. Gurtowski, J., Schatz, M.C., and Langmead, B. (2012) Genotyping in the Cloud with Crossbow. Current Protocols in Bioinformatics, 15 Unit15-3.

  5. Lee, H., and Schatz, M.C. (2012) Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score. Bioinformatics, 28(16): p. 2097-2105.

  6. Titmus, M.A., Gurtowski, J., and Schatz, M.C. (2012) Answering the demands of digital genomics. Concurrency and Computation: Practice and Experience.

  7. Kelley, DR, Schatz, MC, Salzberg, SL (2010) Quake: quality-aware detection and correction of sequencing reads. Genome Biology. 11:R116

  8. Lin, J, Schatz, MC. (2010) Design patterns for efficient graph algorithms in MapReduce. Proceedings of the Eighth Workshop on Mining and Learning with Graphs (MLG-2010).

  9. Schatz, MC, Langmead, B, Salzberg, SL. (2010) Cloud Computing and the DNA Data Race. Nature Biotechnology. 28:691-693

  10. Schatz, MC, Delcher, AL, Salzberg, SL. (2010) Assembly of large genomes using second-generation sequencing. Genome Research. 20:1165-1173

  11. Kingsford, C., Schatz, M.C., and Pop, M. (2010) Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics, 11: p. 21.

  12. Zimin, A.V., et al. (2009) A whole-genome assembly of the domestic cow, Bos taurus. Genome Biology, 10(4): p. R42.

  13. Schatz, M.C. (2009) CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11): p. 1363-9.

  14. Pop, M. (2009) Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics, 10(4): p. 354-66.

  15. Nagarajan, N., and Pop, M. (2009) Parametric complexity of sequence assembly: theory and applications to next generation sequencing. Journal of Computational Biology, 16(7): p. 897-908.

  16. Langmead, B., et al. (2009) Searching for SNPs with cloud computing. Genome Biology, 10(11): p. R134.

  17. Cornman, R.S., et al. (2009) Genomic analyses of the microsporidian Nosema ceranae, an emergent pathogen of honey bees. PLoS Pathogens, 5(6): p. e1000466.

Graphics Processing Units

Research on the applications of GPUs to bioinformatics is supported by the NIH under grants R01-LM006845, R01-GM083873, and R01-LM007938, and by the NSF under grant CNS-0403313.

Many people do not realize the significant computational resources available on their computer's graphics card. Many high-end graphics cards contain highly parallel processors called Graphics Processing Units (GPUs). These processors were initially designed to speed up the rendering of complex graphics, e.g. for video game applications, but their power can also be harnessed for scientific applications. The use of graphics processors for general-purpose computation is particularly attractive because their performance is improving faster than that of typical CPUs (the Moore's law curve is steeper for GPUs). Furthermore, the cost of high-end graphics cards is rapidly decreasing, and these cards are increasingly available in pre-configured desktop computers. At the CBCB we are interested in using GPUs to accelerate bioinformatics applications such as genome alignment and genome assembly.
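The reason sequence alignment suits this hardware is that each query can be aligned independently of every other query, so one GPU thread can be assigned per query. The sketch below illustrates only that structure, in plain Python rather than CUDA, and is not the method used by the software described on this page. The names `hamming_kernel` and `launch` are hypothetical, the scoring is a simple ungapped mismatch count chosen for brevity, and the sequential loop in `launch` stands in for what a GPU would execute in parallel.

```python
def hamming_kernel(thread_id, reference, queries, results):
    """Work for one hypothetical GPU thread: align one query, ungapped.

    Scans every offset in the reference and records the first offset
    with the fewest mismatches. Reads shared inputs, writes only to
    its own slot in `results`, so threads never conflict.
    """
    query = queries[thread_id]
    best_offset, best_mismatches = -1, len(query) + 1
    for offset in range(len(reference) - len(query) + 1):
        window = reference[offset:offset + len(query)]
        mismatches = sum(a != b for a, b in zip(window, query))
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    results[thread_id] = (best_offset, best_mismatches)


def launch(reference, queries):
    """Stand-in for a kernel launch: one 'thread' per query.

    This loop is sequential; on a GPU the iterations would run
    concurrently because they share no mutable state.
    """
    results = [None] * len(queries)
    for thread_id in range(len(queries)):
        hamming_kernel(thread_id, reference, queries, results)
    return results


# Hypothetical toy data.
reference = "ACGTTGCAAC"
queries = ["GTTG", "GCAT", "AAC"]
results = launch(reference, queries)
```

Each entry of `results` is an `(offset, mismatches)` pair for the corresponding query. Because no thread depends on another's output, throughput in a real GPU implementation is limited mainly by memory access patterns rather than by coordination between threads.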



High-throughput Sequence Alignment on the GPU using the CUDA GPGPU API from nVidia



Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. (2007) High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics 8:474.


Trapnell, C., Schatz, M.C. (2009) Optimizing data intensive GPGPU computations for DNA sequence alignment. Parallel Computing 35: p. 429-440. doi:10.1016/j.parco.2009.05.002.