High Performance Computing @ CBCB
Recent advances in DNA sequencing technology from Illumina, 454
Life Sciences, ABI, and Helicos, have enabled next generation
sequencing instruments to sequence the equivalent of the human genome
(~3 billion bp) in a few days and at low cost. In contrast, the
sequencing for the Human Genome Project of the late 1990s and early
2000s required years of work on hundreds of machines, with sequencing
costs measured in hundreds of millions of dollars. This dramatic
increase in efficiency has spurred tremendous growth in applications
for DNA sequencing.
For example, whereas the Human Genome Project sought to sequence
the genome of a small group of individuals, the 1000
Genomes Project aims to catalog the genomes of 1000 individuals
from all regions of the globe in just three years. Recent related
projects aim to catalog all of the biologically active transcribed
regions of the genome over a wide variety of environmental and
disease conditions. Similar studies are also underway for model
organisms such as mouse, rat, chicken, rice, and yeast, and other
organisms of interest.
Cheap and fast sequencing technologies are also providing
scientists with the tools to analyze the largely unknown microbial
biosphere. The majority of microbes inhabiting our world and our
bodies are unknown and cannot be easily manipulated in the
laboratory. In recent years a new scientific field has emerged -
metagenomics - that aims to characterize entire microbial communities
by sequencing the DNA extracted directly from an
environment. Several studies have already targeted a range of natural
environments (ocean, soil, mine drainage) as well as the commensal
microbes inhabiting the bodies of humans and other animals and
insects. The latter are the target of a new NIH initiative - the
Human Microbiome Project - an effort to characterize the diversity of
human-associated microbial communities and to understand their
contributions to human
health. For more details see our description of metagenomic
research at the CBCB.
The raw data generated by the new sequencing instruments often
exceed 1 terabyte and are already straining the computational
infrastructure typically available in an average research lab.
Furthermore, biological datasets are only increasing in size, as data
for more individuals and more environments are collected, further
complicating computational analyses. Even seemingly simple tasks,
such as mapping a collection of sequencing reads to the human
reference genome, can require days of computation, while de novo
assembly of an entire human-sized genome using new generation data
has yet to be attempted. The only long-term solution to the
challenges posed by the massive data-sets being generated is to
combine computational biology research with advances from high
performance computing (HPC).
At the CBCB, research in high-performance computational biology
aims to leverage two recent technological advances: (1) massively
parallel distributed computing clusters made available over the
internet as a pay-per-use service - a paradigm called Cloud
Computing; and (2) the availability of highly parallel graphics
processing units (GPUs) in high-end graphics cards. These
research directions, and recent results of our research, are
described in more detail below.
Research on cloud computing at the CBCB is supported under the NSF
Cluster Exploratory Program (CluE) - grant IIS-0844494.
Our research has recently received quite a bit of media attention.
Our research aims to develop software tools for genome sequence
alignment and genome assembly that can take advantage of the
large-scale parallelism offered by the Google/IBM computing cluster.
One goal of our work is to provide biologists with the ability to
simply rent computational resources through one of the available
cloud computing services, thereby obviating the need to establish a
large computing infrastructure at their institution. Effectively, we
would transform computation from a capital investment into simply a
line item in the research budget, much like laboratory reagents.
At a more fundamental level, we will explore the limits of the
MapReduce computation paradigm (as implemented in the Hadoop
system) when applied to bioinformatics applications. In
particular, genome assembly programs rely on graph theoretic
algorithms that are notoriously difficult to parallelize. We will
also evaluate the cost imposed by the transfer of data to and from
the compute cluster - for the large data-sets being analyzed,
communication will likely account for a significant fraction of the
total analysis cost.
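To see why data transfer can dominate, a back-of-the-envelope estimate helps. The sketch below is illustrative only: the 100 Mbit/s link speed and the 4 hours of cluster time are assumed figures, not measurements from our work, and the function names are hypothetical.

```python
def transfer_hours(size_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Hours needed to move size_bytes over a link with the given sustained bandwidth."""
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    seconds = size_bytes * 8 / bandwidth_bits_per_s  # bytes -> bits, then divide by rate
    return seconds / 3600


def transfer_fraction(transfer_h: float, compute_h: float) -> float:
    """Fraction of total wall-clock time spent moving data rather than computing."""
    return transfer_h / (transfer_h + compute_h)


if __name__ == "__main__":
    DATASET_BYTES = 1e12       # 1 terabyte, the raw data size mentioned above
    LINK_BITS_PER_S = 100e6    # ASSUMPTION: sustained 100 Mbit/s uplink
    COMPUTE_HOURS = 4.0        # ASSUMPTION: time spent computing on the cluster

    hours = transfer_hours(DATASET_BYTES, LINK_BITS_PER_S)
    share = transfer_fraction(hours, COMPUTE_HOURS)
    print(f"transfer: {hours:.1f} h, share of total time: {share:.0%}")
```

Under these assumed figures, moving 1 terabyte takes roughly 22 hours, so transfer would account for about 85% of the total wall-clock time. A faster link or a longer computation shifts the balance accordingly.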
Combines the efficiency of Bowtie
with advances in Cloud Computing to enable deep-coverage human
resequencing and genotyping in about an hour per individual.
Highly Sensitive Short Read Mapping with MapReduce
Cloud-computing genome assembler
A fast, distributed, clustering approach to sequence querying using MapReduce.
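How read mapping fits the MapReduce model can be illustrated with a toy example. The sketch below runs entirely in memory in plain Python and does not use Hadoop. It is a simplified exact-seed voting scheme written for illustration, not the implementation of any of the tools listed above, and all function names and the seed length are assumptions.

```python
from collections import defaultdict

K = 4  # ASSUMPTION: seed (k-mer) length chosen for this toy example


def map_kmers(source, name, seq, k=K):
    """Map phase: emit (kmer, (source, name, offset)) for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], (source, name, i)


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(values):
    """Reduce phase: for one shared k-mer, pair every read occurrence with every
    reference occurrence and emit the reference start position each pair implies."""
    ref_hits = [off for src, _name, off in values if src == "ref"]
    read_hits = [(name, off) for src, name, off in values if src == "read"]
    for read_name, read_off in read_hits:
        for ref_off in ref_hits:
            yield read_name, ref_off - read_off


def map_reads(reference, reads, k=K):
    """Return {read_name: start} using the start supported by the most seeds.
    Reads that share no exact k-mer with the reference are omitted."""
    pairs = list(map_kmers("ref", "ref", reference, k))
    for name, seq in reads.items():
        pairs.extend(map_kmers("read", name, seq, k))

    votes = defaultdict(lambda: defaultdict(int))
    for values in shuffle(pairs).values():
        for read_name, start in reduce_seed(values):
            votes[read_name][start] += 1
    return {name: max(counts, key=counts.get) for name, counts in votes.items()}


if __name__ == "__main__":
    reference = "ACGTTGCATGCCAGT"
    reads = {"r1": "TGCATG", "r2": "GCCAGT"}
    print(map_reads(reference, reads))
```

The useful property is that each k-mer group can be reduced independently of every other group, which is what allows the work to be spread across many machines. A real mapper would also extend each seed into a full alignment and tolerate mismatches, which this sketch omits.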
Also see a description of related research at the CBCB on software
for analyzing data from new generation sequencing technologies.
Publications related to this project
Bradnam, K.R., et al. (2013). Assemblathon 2: evaluating de novo
methods of genome assembly in three vertebrate species. arXiv preprint arXiv:1301.5406.
Ghodsi, M., Hill, C.M., Astrovskaya, I., Lin, H., Sommer, D.D., Koren, S., and Pop, M. (2013) De novo likelihood-based measures for assembly
validation. Under review.
Gurtowski, J., Schatz, M.C., and Langmead, B. (2012) Genotyping in the Cloud with
Crossbow. Current Protocols in Bioinformatics, 15 Unit15-3.
Lee, H., and Schatz, M.C. (2012) Genomic dark
matter: the reliability of short read mapping illustrated by the genome mappability score.
Bioinformatics, 28(16): p. 2097-2105.
Titmus, M.A., Gurtowski, J., and Schatz, M.C. (2012) Answering
the demands of digital genomics. Concurrency and Computation: Practice and Experience.
Kelley, DR, Schatz, MC, Salzberg, SL (2010) Quake:
quality-aware detection and correction of sequencing reads.
Genome Biology. 11:R116
Lin, J, Schatz, MC. (2010) Design
patterns for efficient graph algorithms in MapReduce.
Proceedings of the Eighth Workshop on Mining and Learning with
Graphs (MLG-2010).
Schatz, MC, Langmead, B, Salzberg, SL. (2010) Cloud
Computing and the DNA Data Race. Nature Biotechnology.
Schatz, MC, Delcher, AL, Salzberg, SL. (2010) Assembly
of large genomes using second-generation sequencing. Genome
Kingsford, C., M.C. Schatz, and M. Pop, Assembly
complexity of prokaryotic genomes using short reads. BMC
Bioinformatics, 2010. 11: p. 21.
Zimin, A.V., et al., A
whole-genome assembly of the domestic cow, Bos taurus. Genome
Biol, 2009. 10(4): p. R42.
Schatz, M.C., CloudBurst:
highly sensitive read mapping with MapReduce.
Bioinformatics, 2009. 25(11): p. 1363-9.
Pop, M., Genome
assembly reborn: recent computational challenges. Brief
Bioinform, 2009. 10(4): p. 354-66.
Nagarajan, N. and M. Pop, Parametric
complexity of sequence assembly: theory and applications to next
generation sequencing. J Comput Biol, 2009. 16(7): p.
Langmead, B., et al., Searching
for SNPs with cloud computing. Genome Biol, 2009. 10(11):
Cornman, R.S., et al., Genomic
analyses of the microsporidian Nosema ceranae, an emergent pathogen
of honey bees. PLoS Pathog, 2009. 5(6): p. e1000466.
Graphics Processing Units
Research on the applications of GPUs to Bioinformatics is
supported by the NIH under grants R01-LM006845, R01-GM083873, and
R01-LM007938, and by the NSF under grant CNS-0403313.
Many people do not realize how much computational power is
available on their computer's graphics card. Many high-end graphics
cards contain highly parallel processors called Graphics
Processing Units (GPUs). These processors were initially designed to
speed up the rendering of complex graphics, e.g. for video game
applications, but their power can also be harnessed for
scientific applications. The use of graphics processors for general
purpose computation is particularly attractive because their
performance is improving faster than that of typical CPUs (the
Moore's law curve is steeper for GPUs). Furthermore, the cost of
high-end graphics cards is rapidly decreasing, and these cards are
increasingly available in pre-configured desktop computers. At the
CBCB we are interested in using GPUs to accelerate bioinformatics
applications such as genome alignment and genome assembly.
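Alignment workloads suit this kind of hardware because each read can be processed independently of all the others. The sketch below shows that data-parallel structure in plain Python rather than in GPU code. The function names are hypothetical, the scoring is a simple ungapped mismatch count chosen for brevity, and the thread pool only stands in for a one-thread-per-read layout (in CPython it does not by itself make the computation faster).

```python
from concurrent.futures import ThreadPoolExecutor


def best_hit(read, reference):
    """Return (position, mismatches) for the ungapped placement of read on
    reference with the fewest mismatches; the leftmost placement wins ties.
    Returns (-1, len(read) + 1) if the read is longer than the reference."""
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mm:
            best_pos, best_mm = pos, mismatches
    return best_pos, best_mm


def align_all(reads, reference, workers=4):
    """Score every read independently; no read depends on another read's result,
    which is the property that lets the work be spread over many parallel units."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda read: best_hit(read, reference), reads))


if __name__ == "__main__":
    reference = "ACGTTGCATGCCAGT"
    reads = ["TGCATG", "GCCTGT"]
    for read, (pos, mm) in zip(reads, align_all(reads, reference)):
        print(read, "-> position", pos, "with", mm, "mismatch(es)")
```

On a GPU the body of `best_hit` would correspond to the work of one thread, and the loop over reads would be replaced by launching one thread per read.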
Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. (2007)
sequence alignment using Graphics Processing Units. BMC
Trapnell, C., Schatz, M.C. (2009) Optimizing
data intensive GPGPU computations for DNA sequence alignment.
Parallel Computing 35: p. 429-440.