mschatz
 

Michael C. Schatz

Center for Bioinformatics and Computational Biology
3104B Biomolecular Sciences Building #296
University of Maryland
College Park, MD 20742

mschatz [a t] umiacs.umd.edu
mschatz [a t] cs.umd.edu
Telephone: 703-966-1987
Fax: 301-314-1341

Ph.D. Computer Science - University of Maryland - in progress, Advisor: Steven Salzberg
M.S. Computer Science - University of Maryland - 2008
B.S. Computer Science - Carnegie Mellon University - 2000
Contents
Research
Publications
Reviews and Articles
Software
Seminars & Presentations
Teaching
Courses
Personal


Research

Recent advances in DNA sequencing technology from Illumina, 454 Life Sciences, ABI, and Helicos have enabled next-generation sequencing instruments to sequence the equivalent of the human genome (~3 billion bp) in a few days and at low cost. In contrast, the sequencing for the Human Genome Project of the late 1990s and early 2000s required years of work on hundreds of machines, with sequencing costs measured in hundreds of millions of dollars. This dramatic increase in efficiency has spurred tremendous growth in applications for DNA sequencing. For example, whereas the Human Genome Project sought to sequence the genome of a small group of individuals, the 1000 Genomes Project aims to catalog the full genomes of 1000 individuals from all regions of the globe. Recent related projects aim to catalog all of the biologically active transcribed regions of the genome over a wide variety of environmental and disease conditions. Similar studies are also underway for model organisms such as mouse, rat, chicken, rice, and yeast, and for other organisms of interest. The raw outputs of these studies often exceed 1 terabyte of data and are pushing the limits of feasibility for the computations involved. Biological datasets are only increasing in size as data for more individuals and more environments are collected, so if we have not yet reached the breaking point for traditional models of computation in computational biology, it is just over the horizon. It is clear that the only long-term solution is to combine research in computational biology with advances from high performance computing (HPC), especially parallelizing computations across multiple processors and utilizing high performance distributed file systems.


Research Interests


Publications



18. Genomic Analyses of the Microsporidian Nosema ceranae, an Emergent Pathogen of Honey Bees
Cornman, RS, Chen, YP, Schatz, MC, et al. (2009) PLoS Pathogens 5(6):e1000466.
17. A whole-genome assembly of the domestic cow, Bos taurus
Zimin, AV, Delcher, AL, Florea, L, Kelley, DR, Schatz, MC, et al. (2009) Genome Biology 10:R42
16. CloudBurst: Highly Sensitive Read Mapping with MapReduce
Schatz, MC (2009) Bioinformatics 25:1363-1369
15. Comparative genomics of mutualistic viruses of Glyptapanteles parasitic wasps.
Desjardins, CA, et al. (2009) Genome Biology 9:R183
14. Characterization of Insertion Sites in Rainbow Papaya, the First Commercialized Transgenic Fruit Crop.
Suzuki, JY, et al. (2008) Tropical Plant Biology 1:293-309
13. Revealing Biological Modules via Graph Summarization.
Navlakha, S, Schatz, M, Kingsford, C. (2008) Journal of Computational Biology 16(2): 253-264.
12. Genome sequence and rapid evolution of the rice pathogen Xanthomonas oryzae pv. oryzae PXO99A.
Salzberg, SL, Sommer DS, Schatz, MC, et al. (2008) BMC Genomics 9:204
11. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus).
Ming, R, et al. (2008) Nature 452, 991-996.
10. Genome Assembly forensics: finding the elusive mis-assembly.
Phillippy, AM, Schatz, MC, Pop, M. (2008) Genome Biology 9:R55.
9. High-throughput sequence alignment using Graphics Processing Units.
Schatz, MC, Trapnell, C, Delcher, AL, Varshney, A. (2007) BMC Bioinformatics 8:474.
8. Evolution of genes and genomes on the Drosophila phylogeny.
Drosophila 12 Genomes Consortium. (2007) Nature Nov 8;450(7167):203-18.
7. Draft Genome of the Filarial Nematode Parasite Brugia malayi.
Ghedin, E, et al. (2007) Science 317(5845):1756-1760.
6. Structure and evolution of a proviral locus of Glyptapanteles indiensis bracovirus.
Desjardins, CA, et al. (2007) BMC Microbiology 7:61
5. Genome Sequence of Aedes aegypti, a Major Arbovirus Vector.
Nene, V et al. (2007) Science 316(5832), 1718-1723.
4. Hawkeye: a visual analytics tool for genome assemblies.
Schatz, MC, Phillippy, AM, Shneiderman, B, Salzberg, SL. (2007) Genome Biology 8:R34.
3. Draft Genome Sequence of the Sexually Transmitted Pathogen Trichomonas vaginalis.
Carlton, JM, Hirt, RP, Silva, JC, Delcher, AL, Schatz, M, et al. (2007) Science 315 (5809), 207-212.
2. Major structural differences and novel potential virulence mechanisms from the genomes of multiple campylobacter species.
Fouts, DE et al. (2005) PLoS Biology 3 (1):e15.
1. Automated correction of genome sequence errors.
Gajer, P, Schatz, M, Salzberg, SL. (2004) Nucleic Acids Research 32 (2):562-569.

Full Citations


Reviews and Articles

Faculty of 1000 Reviews

Dec 17, 2008 Efficient de novo assembly of bacterial genomes using low coverage short read sequencing.
Reinhardt JA, et al. Genome Res 2008 Dec 1.


Articles

April 2009 GPUs: Here to Stay
Article by Matthew Dublin in Genome Technology
Dec 21, 2007 UMD Team Creates GPU-Enabled Version of MUMmer to Tackle Next-Gen Sequence Data
Article by Bernadette Toner for GenomeWeb



Software

AMOS - A fast and flexible API for genome assembly and manipulations
AMOSValidate - AMOS Assembly Forensics pipeline for discovering mis-assemblies
AutoEditor - Automatic correction of genome sequencing errors
BlastReduce - High Performance Short Read Mapping with MapReduce (superseded by CloudBurst)
Celera Assembler - The program used to assemble the human genome at Celera Genomics in 2001
CloudBurst - Highly Sensitive Short Read Mapping with MapReduce
Cmatch - Extremely fast end-to-end sequence matching on the GPU (superseded by MUMmerGPU)
Hawkeye - Genome Assembly Viewer and Analysis Tool
MUMmer - A modular system for the rapid whole genome alignment of finished or draft sequence
MUMmerGPU - High-throughput Sequence Alignment on the GPU using the CUDA GPGPU API from nVidia
PhyloTrac - Visualization and Analysis tool for the PhyloChip
Slice Tools - Low level tools for assembly manipulation


Seminars & Presentations

15. Assembly Bootcamp. Presentation at UMD Institute for Genome Sciences. June 26 2009.

The theory and practice of genome assembly using the Celera Assembler. Special emphasis is given to tuning the parameters and settings to get the best results for your data.

14. High Throughput Sequence Analysis with MapReduce. Presentation at J. Craig Venter Institute Informatics Seminar. June 18, 2009.

MapReduce is the parallel distributed computing framework developed by Google for large data computations, including analyzing their collection of more than 1 trillion web pages on clusters with tens of thousands of nodes. This system enables rapid development of highly scalable applications, because developers write just a few application-specific functions, and the system automatically and intelligently provides the scheduling, monitoring, and partitioning necessary to scale to this size. Furthermore, MapReduce is becoming a de facto standard for executing large computations within the cloud, where remote compute resources are used generically under a pay-as-you-go pricing model.

In this presentation, I will describe the leading open-source implementation of MapReduce called Hadoop, the cloud computing capabilities of Amazon, and outline MapReduce-based sequence analysis algorithms for read alignment, SNP discovery, and genome assembly. Scalable algorithms for these problems are essential given that current sequencing technologies routinely generate tens or hundreds of gigabytes of data for a single experiment, and can require hundreds or thousands of hours of computation. The results show MapReduce is an extremely effective system for analyzing these datasets, with near linear speedups as the size of the cluster grows. Furthermore, the Amazon compute cloud can be an efficient and cost-effective resource, especially for periodic or unusually large compute tasks.


13. Genetic Sequence Analysis in the Clouds. Presented by Jimmy Lin at the Hadoop Summit 2009. Santa Clara, CA. June 10 2009.

12. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Presentation for Amazon Web Services Start Up Event - Washington DC. May 27, 2009.

11. High Throughput Sequence Alignment using Graphics Processing Units, Presentation for UMD induction as an nVidia CUDA Center of Excellence

10. Towards a de novo short read assembler for large genomes using cloud computing, Poster at Biology of Genomes 09, Cold Spring Harbor, NY, May 2009.

The massive volume of data and short read lengths from next generation DNA sequencing machines has spurred development of a new class of short read genome assemblers. Several of the new assemblers, such as Velvet and Euler-USR, model the assembly problem as constructing, simplifying, and traversing the de Bruijn graph of the read sequences, where nodes in the graph represent k-mers in the reads, with edges between nodes for consecutive k-mers. This approach has many advantages for these data, such as efficient computation of overlapping reads and robust handling of sequencing errors, and has demonstrated success for assembling small to moderately sized genomes. However, this approach is computationally challenging to scale to mammalian-sized genomes because it requires constructing and manipulating a graph far larger than can fit into memory.

MapReduce was developed at Google for parallel computation on their extremely large data sets, including their database of more than 1 trillion web pages. Computation in MapReduce is structured into two main phases, which act together to construct a large distributed hash table of key-value pairs in the map phase and then evaluate a function on each bucket of the hash table in the reduce phase. The power of MapReduce is that dozens or hundreds of map and reduce instances can execute in parallel, enabling efficient computation even on terabyte- and petabyte-sized data sets.

Drawing on the success of CloudBurst, a MapReduce-based short read mapping algorithm capable of mapping millions of reads to the human genome with high sensitivity, we have developed a MapReduce-based short read assembler that shows tremendous potential for enabling de novo assembly of mammalian-sized genomes. The de Bruijn graph is constructed with MapReduce by emitting and then grouping key-value pairs (k_i, k_{i+1}) between successive k-mers in the read sequences. After construction, MapReduce is used again to remove spurious nodes and edges from the graph caused by sequencing errors in the reads, and to compress simple chains of nodes into long sequence nodes representing the unambiguous regions of the genome between repeat boundaries. The resulting graph is a small fraction of the size of the original de Bruijn graph, and is output in a format compatible with other short read assemblers for additional analysis.
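
The graph construction step described above can be sketched in a few lines of Python. This is only an in-memory illustration of the emit-and-group idea, not the assembler itself: the function names (map_read, shuffle, reduce_node, build_graph) are hypothetical, the shuffle is simulated with a dictionary rather than a Hadoop cluster, and reverse complements and error removal are ignored.

```python
from collections import defaultdict

def kmers(read, k):
    """Return the successive k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def map_read(read, k):
    """Map step: emit one (k_i, k_{i+1}) pair per pair of successive k-mers."""
    ks = kmers(read, k)
    return list(zip(ks, ks[1:]))

def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_node(kmer, successors):
    """Reduce step: collapse duplicate edges, keeping a count per successor."""
    counts = defaultdict(int)
    for s in successors:
        counts[s] += 1
    return kmer, dict(counts)

def build_graph(reads, k):
    """Run map, shuffle, and reduce over all reads; return node -> {successor: count}."""
    emitted = [pair for read in reads for pair in map_read(read, k)]
    return dict(reduce_node(key, vals) for key, vals in shuffle(emitted).items())

# Two overlapping toy reads (made-up data for illustration).
graph = build_graph(["ACGTAC", "CGTACG"], k=3)
for node, edges in sorted(graph.items()):
    print(node, edges)
```

The per-edge counts kept by the reduce step are one plausible input for a later error-removal pass, since edges seen only once are more likely to come from sequencing errors; that threshold would be a separate design choice.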


9. Improving the genome sequence of D. simulans via co-assembly of multiple strains, Poster at Biology of Genomes 09, Cold Spring Harbor, NY, May 2009.

8. A whole-genome assembly of the domestic cow, B. taurus, Poster at Biology of Genomes 09, Cold Spring Harbor, NY, May 2009.

7. Better Modules in Protein-Protein Interaction Networks, Poster at Pacific Symposium on Biocomputing, Hawaii, January 2009.
Revealing Biological Modules via Graph Summarization. Presentation at RECOMB-SB/RG/DREAM3 2008 satellite conference, Boston, MA, Oct 2008.

A technique called Graph Summarization (GS) can be used to partition protein-protein interaction networks to reveal modules that are more biologically relevant than the clusters produced by other graph partitioning techniques. We apply GS to predict Gene Ontology biological process annotations for proteins of unknown annotation. We also apply it to detecting membership in protein complexes, as annotated in the MIPS catalog. GS outperforms other approaches such as MCODE, MCL, and modularity.


6. Genome Assembly Forensics: Finding the Elusive Mis-assembly, Poster at Biology of Genomes, Cold Spring Harbor, NY, 5/9/2008.

Since the initial "draft" sequence of the human genome was released in 2001, it has become clear that it was not an entirely accurate reconstruction of the genome. Despite significant advances in sequencing and assembly since then, genome sequencing continues to be an inexact process. Genome finishing and validation have remained a largely manual and expensive process, and consequently, many genomes are presented as draft assemblies. Draft assemblies are of unknown quality and potentially contain significant mis-assemblies, such as collapsed repeats, sequence excision, or artificial rearrangements. Too often these assemblies are judged only by contig size, with larger contigs preferred without regard to quality, because it has been difficult to gauge large scale assembly quality.

Our new automated software pipeline, amosvalidate, addresses this deficiency and automatically detects mis-assemblies using a battery of known and novel assembly quality metrics. Instead of focusing on a single assembly characteristic as other validation approaches have tried, the power of our approach comes from leveraging multiple sources of evidence. amosvalidate statistically analyzes mate-pair orientations and separations, repeat content, depth-of-coverage, correlated polymorphisms in the read alignments, and read alignment breakpoints to identify structurally suspicious regions of the assembly. The suspicious regions identified by individual metrics are then clustered and combined to identify (with high confidence) regions that are mis-assembled. This approach is necessary for accurately detecting mis-assemblies because each of the individual characteristics has unavoidable natural variation, but when considered together they have greatly increased analysis power. Furthermore, our pipeline can easily be adjusted to analyze assemblies utilizing new sequencing technologies where some metrics are unreliable or not available, such as base pair quality or mate pairs.

Our validation pipeline provides a robust measure of assembly quality that goes beyond the simple measures commonly reported. Evaluation of the pipeline has shown it to be highly sensitive for mis-assembly detection, and has revealed mis-assemblies in both draft and finished genomes. This is particularly troubling as scientists move away from the "gene by gene" paradigm and attempt to understand the global organization of genomes. Without a correct genome sequence or even a clear understanding of the errors present, such studies may draw incorrect conclusions. Our goals are to help scientists locate mis-assembled regions of an assembly and help them correct those regions by focusing their efforts where they are needed most. amosvalidate is compatible with many common assembly formats and is released open-source at http://amos.sourceforge.net.
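
As a rough illustration of combining evidence from several metrics, the sketch below merges overlapping suspicious intervals and keeps only the regions flagged by more than one metric. It is not the amosvalidate implementation: the function name combine_evidence, the (start, end, metric) tuple layout, the metric labels, and the min_metrics threshold are all assumptions made for the example.

```python
def combine_evidence(features, min_metrics=2):
    """Merge overlapping suspicious intervals and keep well-supported regions.

    features: iterable of (start, end, metric_name) tuples, one per suspicious
    interval reported by an individual metric (hypothetical layout).
    Returns (start, end, metrics) for merged regions supported by at least
    min_metrics distinct metrics.
    """
    regions = []
    for start, end, metric in sorted(features):
        if regions and start <= regions[-1][1]:
            # Overlaps the current region: extend it and record the metric.
            regions[-1][1] = max(regions[-1][1], end)
            regions[-1][2].add(metric)
        else:
            # No overlap: start a new candidate region.
            regions.append([start, end, {metric}])
    return [(s, e, sorted(m)) for s, e, m in regions if len(m) >= min_metrics]

# Made-up intervals for illustration only.
features = [
    (100, 200, "mate_pair"),
    (150, 260, "coverage"),
    (500, 600, "snp"),
    (900, 950, "coverage"),
    (940, 1000, "breakpoint"),
]
flagged = combine_evidence(features)
print(flagged)
```

The isolated "snp" interval is dropped because a single metric on its own may reflect natural variation, which matches the motivation given above for requiring multiple sources of evidence.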


5. Hunting Down the Papaya Transgenes, Talk at PAG-XVI, San Diego CA, 1/16/2008

In the middle of the last century, the Papaya ringspot potyvirus (PRSV) devastated the papaya industry on the island of Oahu in Hawaii and in other fields throughout the world. With the imminent threat of the disease spreading to the fields in the Puna district of Hawaii island, researchers in the mid 1980s developed PRSV-resistant transgenic lines of papaya using the pathogen-derived resistance approach, in which genes from PRSV were inserted into the papaya genome using a gene gun. The commercialization of these transgenic lines in the late 1990s virtually saved the Hawaiian papaya industry, but without a full genome sequence, there was lingering concern as to the exact nature of the transgenic insertions.

In my presentation, I will report on the draft genome sequence of the virus-resistant ‘SunUp’ papaya, created in collaboration with the University of Hawaii, the University of Illinois at Urbana-Champaign, and other institutions. I will focus on the computational methods used for assembling the genome, validating its correctness, and the subsequent search for transgenic inserts. Our genome-wide analysis, combined with Southern blot analysis and directed PCR, confirms the efficiency of the gene gun technology, with only 3 conclusive transgenic insertions. In addition, even though the papaya genome is nearly twice the size of the Arabidopsis genome, it contains fewer genes, making it an excellent candidate for further study of biosynthetic pathways and networks.



4. High-throughput sequence alignment using Graphics Processing Units, Talk at CBCB Seminar, 9/20/2007
High-throughput sequence alignment using Graphics Processing Units, Poster in 15th Annual Microbial Genomes Conference, 2007
Co-presented with Cole Trapnell

The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. We present MUMmerGPU, a high-throughput parallel sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms MUMmer by more than 3-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies.


3. Interactive visual analytic tools for genome assembly, Talk at 9th Annual Computational Genomics Conference, 10/29/2006

Genome assembly remains an inexact science. Even when accomplished with the best software available, the assembly of a genome often contains numerous errors, both small and large. Hawkeye is a visual analytics tool for genome assembly analysis and validation, designed to aid in identifying and correcting assembly errors. Hawkeye blends the best practices from information and scientific visualization to facilitate inspection of large-scale assembly data while minimizing the time needed to detect mis-assemblies and make accurate judgments of assembly quality.

All levels of the assembly data hierarchy are made accessible to users, along with summary statistics and common assembly metrics. A ranking component guides investigation towards likely mis-assemblies or interesting features to support the task at hand. Wherever possible, high-level overviews, dynamic filtering, and automated clustering are leveraged to focus attention and highlight anomalies in the data. Hawkeye's effectiveness has been proven on several genome projects, where it has been used both to improve quality and to validate the correctness of complex genomes.


2. AMOS Assembly Validation and Visualization, Talk at The Institute for Genomic Research, 4/7/2006

During my talk, I will discuss the techniques and tools to discover and correct misassemblies in genome assemblies. The three primary sources of information used to detect misassemblies are the "happiness" of the mate-pairs, the base call agreement within the multiple alignment of reads, and the depth of coverage of those reads.

The open source AMOS assembly package provides tools for systematically analyzing these qualities to discover regions with potential misassemblies. The AMOS Assembly Investigator is a powerful genome assembly visualizer with semantic zooming capabilities. It allows one to navigate and visually inspect these potential misassemblies in a systematic fashion at all levels of detail. Once regions with misassemblies have been identified, users can correct the misassemblies with the AMOS contig patching tools.
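
To make the idea of mate-pair "happiness" concrete, here is a minimal sketch of one common way to define it: a pair counts as happy when its two reads face each other and their separation is close to the library's expected insert size. The MatePair fields, the inward-facing orientation rule, and the 3-standard-deviation cutoff are assumptions for illustration, not the AMOS definitions.

```python
from dataclasses import dataclass

@dataclass
class MatePair:
    left_start: int      # leftmost aligned position of the left read
    right_end: int       # rightmost aligned position of the right read
    left_forward: bool   # True if the left read aligns to the forward strand
    right_forward: bool  # True if the right read aligns to the forward strand

def is_happy(pair, mean, stdev, max_dev=3.0):
    """Return True if the reads face each other and the insert size is
    within max_dev standard deviations of the library mean."""
    oriented = pair.left_forward and not pair.right_forward
    insert = pair.right_end - pair.left_start
    return oriented and abs(insert - mean) <= max_dev * stdev

# Made-up pairs from a hypothetical library: mean 2000 bp, stdev 200 bp.
pairs = [
    MatePair(1000, 3050, True, False),   # correct orientation and size
    MatePair(1000, 5000, True, False),   # stretched: insert far too large
    MatePair(1000, 3000, True, True),    # misoriented: both on forward strand
]
print([is_happy(p, mean=2000, stdev=200) for p in pairs])
```

A cluster of unhappy pairs in one region is more informative than any single pair, so in practice the per-pair flags would be aggregated along the contig before a region is called suspicious.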


1. Improving Genome Assemblies without Sequencing, Talk at CBCB Seminar, 9/28/2005
Improving Genome Assembly without Sequencing. RECOMB 2005 Poster.

Genome assembly is the problem of reconstructing the genome sequence of an organism from a collection of short sequenced reads. An assembly takes the form of contiguous stretches of DNA sequence (contigs) linked together in scaffolds by mate-pair and other information. Genome assembly is scientifically one of the most important areas of bioinformatics research, as an accurate genome sequence is needed for addressing several fundamental biological questions. Unfortunately, it is also one of the most computationally complex, having been proved NP-hard under various formalisms, with typical problem sizes of thousands or millions of inputs.

During my talk, I will discuss some of the algorithmic challenges and trade-offs in genome assembly. I will also discuss some computational methods for improving an assembly, which can be applied generally but without requiring additional laboratory results. One method was implemented in AutoEditor, which acts as a second generation base-caller to find and correct base-calling errors in reads using the original chromatogram trace and the multiple alignment of reads. A second was implemented in AutoJoiner, which attempts to automatically close gaps between linked contigs, and generally enhance contig quality, by extending the usable portion of reads within an assembly.



Teaching

Teaching Assistant

Fall 2007 - CMSC828N - TA for Prof. Steven Salzberg - Computational Gene Finding and Genome Assembly


Guest Lectures

Apr 21, 2009 Towards a de novo short read assembler for large genomes with cloud computing
AMSC 664 Advanced Scientific Computing 2
Oct 7, 2008 Genome Assembly Visualization and Validation
CMSC 828N Computational Gene Finding and Genome Assembly
Feb 27, 2007 Genome Assembly Visualization and Validation
CMSC 828N Computational Gene Finding and Genome Assembly


Genome Assembly Class

An 8-part lecture series given at the University of Hawaii, August 13-18, 2006. The lecture series covers the entire assembly process, from sequencing reactions to assembly and finishing. The discussion begins with an overview of the assembly process and its theoretical foundations in Lander-Waterman statistics and Shortest-Common-Superstring. Next there is an in-depth discussion of the Celera Assembler, covering the details of overlapping, unitigging, and scaffolding. Next, an introduction to AMOS is given, describing the motivation, the framework, and some of the currently available tools. Lecture 4 discusses current methods to discover mis-assemblies and the interactive genome visual analytics tool Hawkeye, which acts as a visual portal for understanding and validating your assembly data. Next, I discuss two common problems in assembly, base calling and trimming, and describe AutoEditor and AutoJoiner, which are second-generation assembly tools that address these areas. Lecture 6 is provided by Adam Phillippy and covers all aspects of whole genome alignment, centered around the MUMmer suite. The following lecture, also by Adam Phillippy, describes the AMOScmp comparative assembler, which uses MUMmer to assemble genomes without the costly overlapping step, even at extremely low coverage. The final lecture acts as a summary for the class and a checklist of potential problem areas one might encounter during whole genome assembly.

1. Genome Assembly: Assembly Concepts and Methods : Assembly Overview, Lander-Waterman Statistics, Shortest-Common-Superstring, Contigging, Scaffolding
2. Celera Assembler: Theory and Practice : runCA, overlapper, unitigging, scaffolding
3. AMOS: A Modular Open Source Assembler : AMOS overview, runAMOS, AMOS banks, Converters
4. AMOS Assembly Validation and Visualization : Mate-pairs, SNPs, Coverage levels, Hawkeye, stitchContigs, Assembly Repair
5. Improving Assembly without Sequencing : Basecalling, AutoEditor, Trimming, AutoJoiner
6. Whole Genome Alignment : Alignment, Smith-Waterman, MUMmer, Suffix Trees
7. Comparative Genome Assembly : AMOScmp, MUMmer, reference assembly
8. Assembly Checklist : Sequencing, Libraries, Biases, Coverage, Unitigging, Scaffolding
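
For readers unfamiliar with the Lander-Waterman statistics mentioned in the first lecture, the following sketch computes the standard textbook estimates for a shotgun project: coverage c = NL/G, expected number of contigs N * exp(-c * (1 - T/L)), and expected fraction of the genome covered 1 - exp(-c). The function name and the example numbers are made up for illustration and are not taken from the lecture materials.

```python
import math

def lander_waterman(genome_len, read_len, num_reads, min_overlap=0):
    """Estimate coverage, expected contig count, and covered fraction.

    genome_len:  genome size G in bases
    read_len:    read length L in bases
    num_reads:   number of reads N
    min_overlap: minimum detectable overlap T in bases
    """
    coverage = num_reads * read_len / genome_len
    theta = min_overlap / read_len
    expected_contigs = num_reads * math.exp(-coverage * (1 - theta))
    fraction_covered = 1 - math.exp(-coverage)
    return coverage, expected_contigs, fraction_covered

# Hypothetical project: 1 Mbp genome, 500 bp reads, 16,000 reads.
c, contigs, covered = lander_waterman(1_000_000, 500, 16_000)
print(f"coverage={c:.1f}x contigs={contigs:.1f} covered={covered:.4%}")
```

These estimates assume reads are sampled uniformly at random and the genome has no repeats, so real projects typically end up with more contigs than the formula predicts.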


Courses

Spring 2008 - LDSC878A - Jimmy Lin - Web-Scale Information Processing Applications
Spring 2008 - BSCI410 - Boots Quimby - Molecular Genetics
Fall 2007 - CMSC714 - Alan Sussman - High Performance Computing
Fall 2007 - CMSC858L - Carl Kingsford - Graphs and Networks in Computational Biology
Spring 2007 - CMSC740 - Amitabh Varshney - Advanced Computer Graphics
Spring 2007 - CMSC754 - Dave Mount - Computational Geometry
Fall 2006 - CMSC725 - Hanan Samet - Geographical Information Systems and Spatial Databases
Fall 2006 - CMSC828U - Louiqa Raschid - Advanced Topics in Information Processing: Exploiting Biological Resources
Spring 2006 - CMSC828N - Steven Salzberg - Computational Gene Finding and Genome Assembly
Spring 2006 - CMSC838S - Ben Shneiderman - Information Visualization
Fall 2005 - CMSC858E - Nathan Edwards - Algorithms for Biosequence Analysis
Fall 2005 - CMSC818S - Neil Spring - Internet Reverse Engineering


Personal

My wife's site: Emery Hurst Mikel
Our Wedding Website
My Martial Arts School: Lung Chuan Fa


Last updated: Friday, 26-Jun-2009 16:52:22 EDT