High Performance Computing @ CBCB


Introduction

Recent advances in DNA sequencing technology from Illumina, 454 Life Sciences, ABI, and Helicos have enabled next-generation sequencing instruments to sequence the equivalent of the human genome (~3 billion bp) in a few days and at low cost. In contrast, the sequencing for the Human Genome Project of the late 1990s and early 2000s required years of work on hundreds of machines, with sequencing costs measured in hundreds of millions of dollars. This dramatic increase in efficiency has spurred tremendous growth in applications for DNA sequencing.

For example, whereas the Human Genome Project sought to sequence the genome of a small group of individuals, the 1000 Genomes Project aims to catalog the genomes of 1000 individuals from all regions of the globe in just three years. Recent related projects aim to catalog all of the biologically active transcribed regions of the genome over a wide variety of environmental and disease conditions. Similar studies are also underway for model organisms such as mouse, rat, chicken, rice, and yeast, as well as other organisms of interest.

Cheap and fast sequencing technologies are also providing scientists with the tools to analyze the largely unknown microbial biosphere. The majority of microbes inhabiting our world and our bodies are unknown and cannot be easily manipulated in the laboratory. In recent years a new scientific field has emerged - metagenomics - that aims to characterize entire microbial communities by sequencing DNA extracted directly from an environment. Several studies have already targeted a range of natural environments (ocean, soil, mine drainage) as well as the commensal microbes inhabiting the bodies of humans, other animals, and insects. The latter are the target of a new NIH initiative - the Human Microbiome Project - an effort to characterize the diversity of human-associated microbial communities and to understand their contributions to human health. For more details see our description of metagenomic research at the CBCB.

The raw data generated by the new sequencing instruments often exceed 1 terabyte and are already straining the computational infrastructure typically available in an average research lab. Furthermore, biological datasets are only increasing in size as data for more individuals and more environments are collected, further complicating computational analyses. Even seemingly simple tasks, such as mapping a collection of sequencing reads to the human reference genome, can require days of computation, while de novo assembly of an entire human-sized genome using next-generation data has yet to be attempted. The only long-term solution to the challenges posed by the massive datasets being generated is to combine computational biology research with advances from high performance computing (HPC).

At the CBCB, research in high-performance computational biology aims to leverage two recent technological advances: (1) massively parallel distributed computing clusters made available over the internet as a pay-per-use service - a paradigm called Cloud Computing; and (2) the availability of highly parallel graphics processing units (GPUs) in high-end graphics cards. These research directions, and recent results of our research, are described in more detail below.



People

Faculty

Mihai Pop
Steven Salzberg
Amitabh Varshney
Jimmy Lin
 

Students

Ben Langmead
Michael Schatz
Cole Trapnell


Cloud Computing Research
Research on cloud computing at the CBCB is supported under the NSF Cluster Exploratory Program (CluE) - grant IIS-0844494.

Our research has recently received quite a bit of media attention.

Our research aims to develop software tools for genome sequence alignment and genome assembly that can take advantage of the large-scale parallelism offered by the Google/IBM computing cluster. One goal of our work is to let biologists simply rent computational resources through one of the available cloud computing services, thereby obviating the need to establish a large computing infrastructure at their institution. Effectively, we would transform computation from a capital investment into a line item in the research budget, much as laboratory reagents are budgeted.

At a more fundamental level, we will explore the limits of the MapReduce computation paradigm (as implemented in the Hadoop system) when applied to bioinformatics applications. In particular, genome assembly programs rely on graph-theoretic algorithms that are notoriously difficult to parallelize. We will also evaluate the cost imposed by the transfer of data to and from the compute cluster - for the large datasets being analyzed, communication will likely account for a significant fraction of the total analysis cost.
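As a rough illustration of the MapReduce pattern mentioned above, the sketch below finds shared k-mer seeds between reads and a reference in plain Python, with no Hadoop involved. It is not the actual implementation of any CBCB tool: the function names, the seed length K, the "chr" reference label, and the in-memory shuffle are all assumptions made for illustration. The map step emits each k-mer as a key, the shuffle groups identical k-mers, and the reduce step pairs read occurrences with reference occurrences.

```python
from collections import defaultdict

K = 4  # seed length; an illustrative value chosen for this toy example


def map_kmers(source, name, sequence, k=K):
    """Map step: emit (k-mer, (source, name, offset)) for every k-mer in a sequence."""
    for offset in range(len(sequence) - k + 1):
        yield sequence[offset:offset + k], (source, name, offset)


def shuffle(pairs):
    """Shuffle step: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(kmer, values):
    """Reduce step: pair each read occurrence of a k-mer with each reference occurrence."""
    ref_hits = [(name, off) for src, name, off in values if src == "ref"]
    read_hits = [(name, off) for src, name, off in values if src == "read"]
    for read_name, read_off in read_hits:
        for ref_name, ref_off in ref_hits:
            # Reference position at which the read would start if this seed is extended.
            yield read_name, ref_name, ref_off - read_off


def find_seed_hits(reference, reads, k=K):
    """Run map, shuffle, and reduce in memory; return sorted candidate alignment positions."""
    pairs = list(map_kmers("ref", "chr", reference, k))
    for name, seq in reads.items():
        pairs.extend(map_kmers("read", name, seq, k))
    hits = set()
    for kmer, values in shuffle(pairs).items():
        hits.update(reduce_seed(kmer, values))
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCT"
    reads = {"r1": "TTGCAA", "r2": "AGGCT"}
    for hit in find_seed_hits(reference, reads):
        print(hit)
```

In this toy form each reduce call depends only on the values for a single k-mer, which is what makes the step easy to distribute. Graph-based assembly does not decompose this cleanly, which is one reason it is harder to fit into the paradigm.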

Software

Crossbow: Combines the efficiency of Bowtie with advances in Cloud Computing to enable deep-coverage human resequencing and genotyping in about an hour per individual.
CloudBurst: Highly Sensitive Short Read Mapping with MapReduce
Also see a description of related research at the CBCB on software for analyzing data from new generation sequencing technologies.

Publications

1. Schatz, M.C. (2009) CloudBurst: Highly Sensitive Short Read Mapping with MapReduce. Bioinformatics


Graphics Processing Units

Research on the applications of GPUs to bioinformatics is supported by the NIH under grants R01-LM006845, R01-GM083873, and R01-LM007938, and by the NSF under grant CNS-0403313.

Many people do not realize the significant computational resources available on their computer's graphics card. Many high-end graphics cards contain highly parallel processors called Graphics Processing Units (GPUs). These processors were initially designed to speed up the rendering of complex graphics, e.g. for video game applications; however, their power can also be harnessed for scientific applications. The use of graphics processors for general-purpose computation is particularly attractive because their performance is improving faster than that of typical CPUs (the Moore's law curve is steeper for GPUs). Furthermore, the cost of high-end graphics cards is rapidly decreasing, and these cards are increasingly available in pre-configured desktop computers. At the CBCB we are interested in using GPUs to accelerate bioinformatics applications such as genome alignment and genome assembly.
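To make the idea of a highly parallel processor more concrete, the following sketch emulates, in plain Python, the common GPU pattern of assigning one logical thread to each query sequence. It runs sequentially on the CPU and is only meant to show the shape of the computation. The names match_kernel and launch are hypothetical, and the naive substring search stands in for whatever index structure a real GPU aligner would use.

```python
def match_kernel(thread_id, reference, queries, results):
    """Work done by one logical thread: locate the first exact match of one query."""
    # Guard against threads with no work, as a real kernel does when the
    # launch size is larger than the number of queries.
    if thread_id >= len(queries):
        return
    # Each thread writes only to its own slot, so no synchronization is needed.
    results[thread_id] = reference.find(queries[thread_id])  # -1 if absent


def launch(reference, queries, num_threads=None):
    """Emulate a kernel launch; on a GPU these iterations would run concurrently."""
    if num_threads is None:
        num_threads = len(queries)
    results = [-1] * len(queries)
    for thread_id in range(num_threads):
        match_kernel(thread_id, reference, queries, results)
    return results


if __name__ == "__main__":
    print(launch("ACGTTGCAAGGCT", ["TTGC", "GGCT", "AAAA"]))
```

Because every query is processed independently and writes to a distinct output slot, the loop in launch could be replaced by thousands of concurrent hardware threads without changing the result.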

Software

MUMmerGPU: High-throughput Sequence Alignment on the GPU using the CUDA GPGPU API from nVidia


Publications

1. Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. (2007) High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics 8:474.
2. Trapnell, C., Schatz, M.C. (2009) Optimizing data intensive GPGPU computations for DNA sequence alignment. Parallel Computing. doi:10.1016/j.parco.2009.05.002.