High Performance Computing @ CBCB
Recent advances in DNA sequencing technology from Illumina, 454 Life Sciences,
ABI, and Helicos have enabled next generation sequencing instruments to
sequence the equivalent of the human genome (~3 billion bp) in a few days and
at low cost. In contrast, the sequencing for the Human Genome Project of the
late '90s and early '00s required years of work on hundreds of machines, with
sequencing costs measured in hundreds of millions of dollars. This dramatic
increase in efficiency has spurred tremendous growth in applications for DNA
sequencing.
For example, whereas the Human Genome Project sought to sequence
the genome of a small group of individuals, the 1000 Genomes Project aims to
catalog the genomes of 1000 individuals from all regions of the globe
in just three years. Recent related projects aim to catalog all of
the biologically active transcribed regions of the genome over a wide
variety of environmental and disease conditions. Similar studies are
also underway for model organisms such as mouse, rat, chicken, rice,
and yeast, and other organisms of interest.
Cheap and fast sequencing technologies are also providing scientists
with the tools to analyze the largely unknown microbial biosphere.
The majority of microbes inhabiting our world and our bodies are
unknown and cannot be easily manipulated in the laboratory. In recent
years a new scientific field, metagenomics, has emerged that aims to
characterize entire microbial communities by sequencing the
DNA extracted directly from an environment. Several studies have
already targeted a range of natural environments (ocean, soil, mine
drainage) as well as the commensal microbes inhabiting the bodies of
humans and other animals and insects. The latter are the target of a
new NIH initiative - the Human
Microbiome Project - an effort to characterize the diversity of
human-associated microbial communities and to understand their
contributions to human health. For more details see our description of
metagenomic research at the CBCB.
The raw data generated by the new sequencing instruments often exceed
1 terabyte and are already straining the computational infrastructure
typically available in an average research lab. Furthermore,
biological datasets are only increasing in size, as data for more
individuals and more environments are collected, further complicating
computational analyses. Even seemingly simple tasks, such as mapping a
collection of sequencing reads to a human reference genome,
can require days of computation, while de novo assembly of an entire
human-sized genome using new generation data has yet to be attempted.
The only long-term solution to the challenges posed by the massive
data-sets being generated is to combine computational biology research
with advances from high performance computing (HPC).
At the CBCB, research in high-performance computational biology aims
to leverage two recent technological advances: (1) massively parallel
distributed computing clusters made available over the internet as a
pay-per-use service - a paradigm called Cloud Computing; and (2) the
availability of highly parallel graphics processing units (GPUs) in
high-end graphics cards. These research directions, and recent results
of our research, are described in more detail below.
Research on cloud computing at the CBCB is supported under the NSF
Cluster Exploratory Program (CluE) - grant IIS-0844494.
Our research has recently received quite a bit of media attention.
Our research aims to develop software tools for genome sequence
alignment and genome assembly that can take advantage of the
large-scale parallelism offered by the Google/IBM computing cluster.
One goal of our work is to provide biologists with the ability to
simply rent computational resources through one of the available cloud
computing services, thereby obviating the need to establish a large
computing infrastructure at their institution. Effectively, this would
turn computation from a capital investment into a line item in the
research budget, much as laboratory reagents are budgeted.
At a more fundamental level, we will explore the limits of the
MapReduce computation paradigm (as implemented in the Hadoop system) when applied to
bioinformatics applications. In particular, genome assembly programs
rely on graph theoretic algorithms that are notoriously difficult to
parallelize. We will also evaluate the cost imposed by the transfer
of data to and from the compute cluster: for the large data-sets
being analyzed, communication will likely account for a significant
fraction of the total analysis cost.
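To make the MapReduce decomposition concrete, the following is a minimal single-process Python sketch of seed-based read mapping written as map, shuffle, and reduce steps. It is an illustration under stated assumptions, not the implementation used by CloudBurst or Crossbow: the function names, the seed length `K`, and the toy sequences are all hypothetical, and the in-memory `shuffle` only stands in for the grouping that a framework such as Hadoop performs across machines.

```python
from collections import defaultdict

K = 4  # illustrative seed length (an assumption; chosen to suit the toy data)


def kmers(seq, k=K):
    """Yield (kmer, offset) for every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], i


def map_phase(reference, reads):
    """Map: emit (kmer, record) pairs from both the reference and the reads."""
    for kmer, pos in kmers(reference):
        yield kmer, ("ref", pos)
    for read_id, read in reads.items():
        for kmer, off in kmers(read):
            yield kmer, ("read", read_id, off)


def shuffle(pairs):
    """Shuffle: group all values that share a key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: for each shared k-mer, derive candidate read start positions."""
    candidates = set()
    for values in groups.values():
        ref_positions = [v[1] for v in values if v[0] == "ref"]
        read_hits = [(v[1], v[2]) for v in values if v[0] == "read"]
        for read_id, off in read_hits:
            for pos in ref_positions:
                start = pos - off  # where the read would begin in the reference
                if start >= 0:
                    candidates.add((read_id, start))
    return candidates


def verify(reference, reads, candidates, max_mismatches=1):
    """Keep candidates whose full-length Hamming distance is within the limit."""
    hits = []
    for read_id, start in sorted(candidates):
        read = reads[read_id]
        window = reference[start:start + len(read)]
        if len(window) == len(read):
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((read_id, start, mismatches))
    return hits


def map_reads(reference, reads, max_mismatches=1):
    """Run map, shuffle, reduce, then verify the candidate placements."""
    groups = shuffle(map_phase(reference, reads))
    return verify(reference, reads, reduce_phase(groups), max_mismatches)


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGC"
    reads = {"r1": "TTAGCCGA", "r2": "CGATTGGC"}
    for hit in map_reads(reference, reads):
        print(hit)
```

Each k-mer group is processed independently of every other group, which is why this kind of seed matching fits the model comfortably. In a real deployment the reference and reads would first have to be copied to the cluster, which is the transfer cost discussed above.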
Software
| Crossbow | Combines the efficiency of Bowtie with advances in Cloud Computing to enable deep-coverage human resequencing and genotyping in about an hour per individual. |
| CloudBurst | Highly Sensitive Short Read Mapping with MapReduce |
Also see a description of related research at the CBCB on software for
analyzing data from new generation sequencing technologies.
Publications
Graphics Processing Units
Research on the applications of GPUs to Bioinformatics is supported by
the NIH under grants R01-LM006845, R01-GM083873, and R01-LM007938, and
by the NSF under grant CNS-0403313.
Many people do not realize the significant computational resources
available on their computer's graphics card. Many high-end graphics
cards contain highly parallel processors called Graphics
Processing Units (GPUs). These processors were initially designed to
speed up the rendering of complex graphics, e.g. for video game
applications, but their power can also be harnessed for
scientific applications. The use of graphics processors for general
purpose computation is particularly attractive because their performance is
improving faster than that of typical CPUs (the Moore's law curve is
steeper for GPUs). Furthermore, the cost of high-end graphics cards is
rapidly decreasing, and these cards are increasingly available in
pre-configured desktop computers. At the CBCB we are interested in using
GPUs to accelerate bioinformatics applications such as genome
alignment and genome assembly.
Software
Publications
| 1. | Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. (2007) High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics 8:474. |
| 2. | Trapnell, C., Schatz, M.C. (2009) Optimizing data intensive GPGPU computations for DNA sequence alignment. Parallel Computing. doi:10.1016/j.parco.2009.05.002. |