mschatz
 

Michael C. Schatz

Center for Bioinformatics and Computational Biology
3120G Biomolecular Sciences Building #296
University of Maryland
College Park, MD 20742

mschatz [a t] umiacs.umd.edu
mschatz [a t] cs.umd.edu
Office: 301 405 7169
Cell: 703 966 1987
Fax: 301 314 1341

Ph.D. Computer Science - University of Maryland - in progress, Advisor: Steven Salzberg
B.S. Computer Science - Carnegie Mellon University - 2000
Contents
Research
Publications
Software
Seminars & Presentations
Courses
Teaching
Personal


Research

We are entering the era of genomics in science and medicine in which biological systems are understood in terms of their precise genetic components, and treatments are individualized for each patient. The first major milestone of this era, the sequencing of the human genome, has been achieved, but significant challenges remain in unlocking the full meaning of the genome. My research is aimed towards realizing the goals of genomics through improved computational methods for DNA sequencing and analysis. My work includes developing software for genome assembly, which can create a more accurate and more complete reconstruction of the genome than other assemblers. This is an absolutely fundamental requirement for a wide variety of important biological analyses including gene annotation, comparative genomics, and disease genotyping. I have also been researching new high performance computing hardware for use in bioinformatics, and coauthored an open source DNA sequence alignment tool that runs on highly parallel graphics processing units. We found the graphics hardware was 10x faster than a regular CPU for this application, and are eager to implement new applications on the hardware. This level of performance is necessary to manage the avalanche of sequencing data coming from recently available high throughput sequencing technologies, which can create gigabytes of sequence data in a few hours and will be used for studying health and disease at unprecedented scales. Finally, I'm developing the visualization and analysis software for the PhyloChip which can perform push-button environmental and metagenomics analysis. Only with the proper combination of biological knowledge and computer science will we realize the goals of genomics.


Research Interests

  • Genome Assembly & Validation
  • Comparative Genomics & Metagenomics
  • Sequence Alignment
  • Environmental Sampling
  • Scientific Visualization
  • High Performance and Multi-Core Computing

Publications



12. Salzberg, S.L., Sommer, D.S., Schatz, M.C. et al. (2008) Genome sequence and rapid evolution of the rice pathogen Xanthomonas oryzae pv. oryzae PXO99A. BMC Genomics 9:204.
11. Ming, R. et al. (2008) The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452, 991-996.
10. Phillippy, A.M., Schatz, M.C., Pop, M. (2008) Genome Assembly forensics: finding the elusive mis-assembly. Genome Biology 9:R55.
9. Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. (2007) High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics 8:474.
8. Drosophila 12 Genomes Consortium. (2007) Evolution of genes and genomes on the Drosophila phylogeny. Nature Nov 8;450(7167):203-18.
7. Ghedin, E., et al. (2007) Draft Genome of the Filarial Nematode Parasite Brugia malayi. Science 317(5845):1756-1760.
6. Desjardins, C.A., et al. (2007) Structure and evolution of a proviral locus of Glyptapanteles indiensis bracovirus. BMC Microbiology 7:61.
5. Nene, V., et al. (2007) Genome Sequence of Aedes aegypti, a Major Arbovirus Vector. Science 316(5832), 1718-1723.
4. Schatz, M.C., Phillippy, A.M., Shneiderman, B., Salzberg, S.L. (2007) Hawkeye: a visual analytics tool for genome assemblies. Genome Biology 8:R34.
3. Carlton, J.M., Hirt, R.P., Silva, J.C., Delcher, A.L., Schatz, M., et al. (2007) Draft Genome Sequence of the Sexually Transmitted Pathogen Trichomonas vaginalis. Science 315 (5809), 207-212.
2. Fouts, D.E., et al. (2005) Major structural differences and novel potential virulence mechanisms from the genomes of multiple campylobacter species. PLoS Biology 3 (1):e15.
1. Gajer, P., Schatz, M., Salzberg, S.L. (2004) Automated correction of genome sequence errors. Nucleic Acids Research 32 (2):562-569.

Full Citations


Software

AMOS              A fast and flexible API for genome assembly and manipulations
AMOSValidate      AMOS Assembly Forensics pipeline for discovering mis-assemblies
AutoEditor        Automatic correction of genome sequencing errors
BlastReduce       High Performance Short Read Mapping with MapReduce
Celera Assembler  The program used to assemble the human genome at Celera Genomics in 2001
Cmatch            Extremely fast end-to-end sequence matching on the GPU (superseded by MUMmerGPU)
Hawkeye           Genome Assembly Viewer and Analysis Tool
MUMmer            A modular system for the rapid whole genome alignment of finished or draft sequence
MUMmerGPU         High-throughput Sequence Alignment on the GPU using CUDA GPGPU API from nVidia
PhyloTrac         Visualization and Analysis tool for the PhyloChip
Slice Tools       Low level tools for assembly manipulation


Seminars & Presentations

6. Genome Assembly Forensics: Finding the Elusive Mis-assembly, Biology of Genomes, Cold Spring Harbor, NY, 5/9/2008.

Since the initial "draft" sequence of the human genome was released in 2001, it has become clear that it was not an entirely accurate reconstruction of the genome. Despite significant advances in sequencing and assembly since then, genome sequencing continues to be an inexact process. Genome finishing and validation have remained a largely manual and expensive process, and consequently, many genomes are presented as draft assemblies. Draft assemblies are of unknown quality and potentially contain significant mis-assemblies, such as collapsed repeats, sequence excision, or artificial rearrangements. Too often these assemblies are judged only by contig size, with larger contigs preferred without regard to quality, because it has been difficult to gauge large scale assembly quality.

Our new automated software pipeline, amosvalidate, addresses this deficiency and automatically detects mis-assemblies using a battery of known and novel assembly quality metrics. Instead of focusing on a single assembly characteristic as other validation approaches have done, the power of our approach comes from leveraging multiple sources of evidence. amosvalidate statistically analyzes mate-pair orientations and separations, repeat content, depth-of-coverage, correlated polymorphisms in the read alignments, and read alignment breakpoints to identify structurally suspicious regions of the assembly. The suspicious regions identified by individual metrics are then clustered and combined to identify (with high confidence) regions that are mis-assembled. This approach is necessary for accurately detecting mis-assemblies because each of the individual characteristics has unavoidable natural variation, but, considered together, they provide much greater analytical power. Furthermore, our pipeline can easily be adjusted to analyze assemblies utilizing new sequencing technologies where some metrics are unreliable or not available, such as base pair quality or mate pairs.

Our validation pipeline provides a robust measure of assembly quality that goes beyond the simple measures commonly reported. Evaluation of the pipeline has shown it to be highly sensitive for mis-assembly detection, and has revealed mis-assemblies in both draft and finished genomes. This is particularly troubling as scientists move away from the "gene by gene" paradigm and attempt to understand the global organization of genomes. Without a correct genome sequence or even a clear understanding of the errors present, such studies may draw incorrect conclusions. Our goals are to help scientists locate mis-assembled regions of an assembly and help them correct those regions by focusing their efforts where they are needed most. amosvalidate is compatible with many common assembly formats and is released open-source at http://amos.sourceforge.net.
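
The evidence-combination idea in this abstract can be illustrated with a small sketch. This is not the amosvalidate implementation: the function name, the metric names, and the "at least two distinct metrics must agree" rule are assumptions chosen only to show how regions flagged by separate metrics might be intersected.

```python
from collections import defaultdict

def combine_evidence(flags, min_metrics=2):
    """Return merged (start, end) regions flagged by >= min_metrics distinct metrics.

    flags: dict mapping a metric name (hypothetical labels) to a list of
    half-open (start, end) intervals that the metric considers suspicious.
    """
    events = []
    for metric, intervals in flags.items():
        for start, end in intervals:
            if start >= end:
                continue  # ignore empty or inverted intervals
            events.append((start, 1, metric))
            events.append((end, -1, metric))
    # At equal coordinates, process interval ends (-1) before starts (+1).
    events.sort(key=lambda e: (e[0], e[1]))

    active = defaultdict(int)  # metric -> number of its intervals currently open
    regions = []
    region_start = None
    for pos, delta, metric in events:
        active[metric] += delta
        if active[metric] == 0:
            del active[metric]
        supported = len(active) >= min_metrics
        if supported and region_start is None:
            region_start = pos
        elif not supported and region_start is not None:
            if regions and regions[-1][1] == region_start:
                # Extend the previous region when the two are adjacent.
                regions[-1] = (regions[-1][0], pos)
            else:
                regions.append((region_start, pos))
            region_start = None
    return regions

if __name__ == "__main__":
    flags = {
        "mate_pairs": [(100, 200)],
        "coverage": [(150, 250)],
        "snps": [(400, 450)],
    }
    print(combine_evidence(flags))
```

A region flagged by a single metric (the "snps" interval here) is dropped, which mirrors the point that any one characteristic varies naturally and is weak evidence on its own.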


5. Hunting Down the Papaya Transgenes, PAG-XVI, San Diego CA, 1/16/2008

In the middle of the last century, the Papaya ringspot potyvirus (PRSV) devastated the papaya industry on the island of Oahu in Hawaii and in other fields throughout the world. With the imminent threat of the disease spreading to the fields in the Puna district of Hawaii island, researchers in the mid 1980s developed PRSV-resistant transgenic lines of papaya using the pathogen-derived resistance approach, in which genes from PRSV were inserted into the papaya genome using a gene gun. The commercialization of these transgenic lines in the late 1990s virtually saved the Hawaiian papaya industry, but without a full genome sequence, there was lingering concern as to the exact nature of the transgenic insertions.

In my presentation, I will report on the draft genome sequence of the virus-resistant ‘SunUp’ papaya, created in collaboration with the University of Hawaii, the University of Illinois at Urbana-Champaign, and other institutions. I will focus on the computational methods used for assembling the genome, validating its correctness, and the subsequent search for transgenic inserts. Our genome wide analysis, combined with Southern blot analysis and directed PCR, confirms the efficiency of the gene gun technology, with only 3 conclusive transgenic insertions. In addition, even though the papaya genome is nearly twice the size of the Arabidopsis genome, it contains fewer genes, which makes it an excellent candidate for further study of biosynthetic pathways and networks.



4. High-throughput sequence alignment using Graphics Processing Units, CBCB Seminar, 9/20/2007
High-throughput sequence alignment using Graphics Processing Units, Poster in 15th Annual Microbial Genomes Conference, 2007
Co-presented with Cole Trapnell

The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. We present MUMmerGPU, a high-throughput parallel sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms MUMmer by more than 3-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies.
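
As a rough illustration of matching many queries against one indexed reference, here is a minimal serial sketch. It is not MUMmerGPU's algorithm: it swaps the suffix tree for a naively built suffix array, reports only exact occurrences of whole queries, and runs on the CPU with no parallelism. All function names are hypothetical.

```python
def build_suffix_array(ref):
    """Naive suffix array: start positions of all suffixes in sorted order.

    Fine for a short demonstration string; real tools use far more
    efficient constructions.
    """
    return sorted(range(len(ref)), key=lambda i: ref[i:])

def find_occurrences(ref, sa, query):
    """Return sorted start positions where `query` occurs exactly in `ref`."""
    m = len(query)
    # Lower bound: first suffix whose length-m prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose length-m prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

def align_queries(ref, queries):
    """Index the reference once, then look up every query against it."""
    sa = build_suffix_array(ref)
    return {q: find_occurrences(ref, sa, q) for q in queries}

if __name__ == "__main__":
    print(align_queries("GATTACAGATTACA", ["ATTA", "ACA", "TTT"]))
```

Each query lookup is independent of the others once the index is built, which is the property that makes this kind of workload a natural fit for highly parallel hardware.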


3. Interactive visual analytic tools for genome assembly, 9th Annual Computational Genomics Conference, 10/29/2006

Genome assembly remains an inexact science. Even when accomplished with the best software available, the assembly of a genome often contains numerous errors, both small and large. Hawkeye is a visual analytics tool for genome assembly analysis and validation, designed to aid in identifying and correcting assembly errors. Hawkeye blends the best practices from information and scientific visualization to facilitate inspection of large-scale assembly data while minimizing the time needed to detect mis-assemblies and make accurate judgments of assembly quality.

All levels of the assembly data hierarchy are made accessible to users, along with summary statistics and common assembly metrics. A ranking component guides investigation towards likely mis-assemblies or interesting features to support the task at hand. Wherever possible, high-level overviews, dynamic filtering, and automated clustering are leveraged to focus attention and highlight anomalies in the data. Hawkeye's effectiveness has been proven on several genome projects, where it has been used both to improve quality and to validate the correctness of complex genomes.


2. AMOS Assembly Validation and Visualization, The Institute for Genomic Research, 4/7/2006

During my talk, I will discuss the techniques and tools to discover and correct misassemblies in genome assemblies. The three primary sources of information used to detect misassemblies are the "happiness" of the mate-pairs, the base call agreement within the multiple alignment of reads, and the depth of coverage of those reads.

The open source AMOS assembly package provides tools for systematically analyzing these qualities to discover regions with potential misassemblies. The AMOS Assembly Investigator is a powerful genome assembly visualizer with semantic zooming capabilities. It allows one to navigate and visually inspect these potential misassemblies in a systematic fashion at all levels of detail. Once regions with misassemblies have been identified, users can correct the misassemblies with the AMOS contig patching tools.


1. Improving Genome Assemblies without Sequencing, CBCB Seminar, 9/28/2005
Improving Genome Assembly without Sequencing. RECOMB 2005 Poster.

Genome assembly is the problem of reconstructing the genome sequence of an organism from a collection of short sequenced reads. An assembly takes the form of contiguous stretches of DNA sequence (contigs) linked together in scaffolds by mate-pair and other information. Genome assembly is scientifically one of the most important areas of bioinformatics research, as an accurate genome sequence is needed for addressing several fundamental biological questions. Unfortunately, it is also one of the most computationally complex, having been proved NP-hard under various formalisms, with typical problem sizes of thousands or millions of inputs.

During my talk, I will discuss some of the algorithmic challenges and trade-offs in genome assembly. I will also discuss some computational methods for improving an assembly, which can be applied generally but without requiring additional laboratory results. One method was implemented in AutoEditor, which acts as a second generation base-caller to find and correct base-calling errors in reads using the original chromatogram trace and the multiple alignment of reads. A second was implemented in AutoJoiner, which attempts to automatically close gaps between linked contigs, and generally enhance contig quality, by extending the usable portion of reads within an assembly.
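
To make the reconstruction problem concrete, the sketch below uses the textbook greedy overlap-merge heuristic for the shortest common superstring formulation. It is only an illustration under simplifying assumptions (error-free reads, a single strand, no repeats) and is not how AutoEditor, AutoJoiner, or any assembler named on this page works; the function names and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Returns the list of sequences left when no pair overlaps by at
    least `min_len` characters (a list of "contigs").
    """
    # Drop duplicate reads and reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return reads
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]

if __name__ == "__main__":
    print(greedy_assemble(["ATGCGT", "GCGTAC", "TACGGA"]))
```

The greedy choice is fast but can be misled by repeated sequence, which is one reason the exact problem is hard and practical assemblers rely on additional information such as mate pairs.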



Courses

Spring 2008  LDSC878A  Jimmy Lin         Web-Scale Information Processing Applications
Spring 2008  BSCI410   Boots Quimby      Molecular Genetics
Fall 2007    CMSC714   Alan Sussman      High Performance Computing
Fall 2007    CMSC858L  Carl Kingsford    Graphs and Networks in Computational Biology
Spring 2007  CMSC740   Amitabh Varshney  Advanced Computer Graphics
Spring 2007  CMSC754   Dave Mount        Computational Geometry
Fall 2006    CMSC725   Hanan Samet       Geographical Information Systems and Spatial Databases
Fall 2006    CMSC828U  Louiqa Raschid    Advanced Topics in Information Processing: Exploiting Biological Resources
Spring 2006  CMSC828N  Steven Salzberg   Computational Gene Finding and Genome Assembly
Spring 2006  CMSC838S  Ben Shneiderman   Information Visualization
Fall 2005    CMSC858E  Nathan Edwards    Algorithms for Biosequence Analysis
Fall 2005    CMSC818S  Neil Spring       Internet Reverse Engineering


Teaching

Fall 2007    CMSC828N  TA for Prof. Steven Salzberg  Computational Gene Finding and Genome Assembly


Genome Assembly Class

This is an 8-part lecture series given at the University of Hawaii, August 13-18, 2006. The series covers the entire assembly process, from sequencing reactions to assembly and finishing. The discussion begins with an overview of the assembly process and its theoretical foundations in Lander-Waterman statistics and Shortest-Common-Superstring. Next there is an in-depth discussion of the Celera Assembler, covering the details of overlapping, unitigging, and scaffolding. Next, an introduction to AMOS describes the motivation, the framework, and some of the currently available tools. Lecture 4 discusses current methods to discover mis-assemblies and the interactive genome visual analytics tool Hawkeye, which acts as a visual portal to understanding and validating your assembly data. Next, I discuss two common problems in assembly, base calling and trimming, and describe AutoEditor and AutoJoiner, which are second generation assembly tools that address these areas. Lecture 6 is provided by Adam Phillippy and covers all aspects of Whole Genome Alignment, centered around the MUMmer suite. The following lecture, also by Adam Phillippy, describes the AMOScmp Comparative Assembler, which uses MUMmer to assemble genomes without the costly overlapping step, even at extremely low coverage. The final lecture acts as a summary for the class and a checklist of potential problem areas one might encounter during whole genome assembly.

1. Genome Assembly: Assembly Concepts and Methods : Assembly Overview, Lander-Waterman Statistics, Shortest-Common-Superstring, Contigging, Scaffolding
2. Celera Assembler: Theory and Practice : runCA, overlapper, unitigging, scaffolding
3. AMOS: A Modular Open Source Assembler : AMOS overview, runAMOS, AMOS banks, Converters
4. AMOS Assembly Validation and Visualization : Mate-pairs, SNPs, Coverage levels, Hawkeye, stitchContigs, Assembly Repair
5. Improving Assembly without Sequencing : Basecalling, AutoEditor, Trimming, AutoJoiner
6. Whole Genome Alignment : Alignment, Smith-Waterman, MUMmer, Suffix Trees
7. Comparative Genome Assembly : AMOScmp, MUMmer, reference assembly
8. Assembly Checklist : Sequencing, Libraries, Biases, Coverage, Unitigging, Scaffolding
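
For readers unfamiliar with the Lander-Waterman statistics mentioned in the first lecture, the sketch below evaluates the standard model: with N reads of length L from a genome of length G, coverage is c = NL/G; if two reads must overlap by at least T bases to be joined, then with sigma = 1 - T/L the expected number of contigs is about N * exp(-c * sigma), and the expected fraction of the genome covered is about 1 - exp(-c). The function and parameter names are illustrative choices, not taken from the lecture material.

```python
import math

def lander_waterman(genome_len, read_len, num_reads, min_overlap=0):
    """Evaluate the basic Lander-Waterman estimates.

    Returns (coverage, expected_contigs, fraction_covered), assuming
    reads are sampled uniformly at random from the genome.
    """
    if not 0 <= min_overlap < read_len:
        raise ValueError("min_overlap must satisfy 0 <= min_overlap < read_len")
    coverage = num_reads * read_len / genome_len              # c = N L / G
    sigma = 1.0 - min_overlap / read_len                      # sigma = 1 - T / L
    expected_contigs = num_reads * math.exp(-coverage * sigma)
    fraction_covered = 1.0 - math.exp(-coverage)
    return coverage, expected_contigs, fraction_covered

if __name__ == "__main__":
    c, contigs, covered = lander_waterman(
        genome_len=1_000_000, read_len=1000, num_reads=8000, min_overlap=40
    )
    print(f"coverage={c:.1f}x  expected contigs={contigs:.2f}  covered={covered:.5f}")
```

Requiring a longer minimum overlap lowers sigma and so raises the expected number of contigs at a fixed coverage, which is why the overlap threshold and coverage depth are planned together.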


Personal

My wife's site: Emery Hurst Mikel
Our Wedding Website
My Martial Arts School: Lung Chuan Fa