
Michael C. Schatz
Center for Bioinformatics and Computational Biology
3104B Biomolecular Sciences Building #296
University of Maryland
College Park, MD 20742
mschatz [a t] umiacs.umd.edu
mschatz [a t] cs.umd.edu
Telephone: 703-966-1987
Fax: 301-314-1341
Ph.D. Computer Science - University of Maryland - in progress, Advisor: Steven Salzberg
M.S. Computer Science - University of Maryland - 2008
B.S. Computer Science - Carnegie Mellon University - 2000
Recent advances in DNA sequencing technology from Illumina, 454 Life Sciences, ABI, and Helicos have enabled next generation sequencing instruments to sequence the equivalent of the human genome (~3 billion bp) in a few days and at low cost. In contrast, the sequencing for the human genome project of the late 90’s and early ’00s required years of work on hundreds of machines, with sequencing costs measured in hundreds of millions of dollars. This dramatic increase in efficiency has spurred tremendous growth in applications for DNA sequencing. For example, whereas the human genome project sought to sequence the genome of a small group of individuals, the 1000 Genomes Project aims to catalog the full genomes of 1000 individuals from all regions of the globe. Recent related projects aim to catalog all of the biologically active transcribed regions of the genome over a wide variety of environmental and disease conditions. Similar studies are also underway for model organisms such as mouse, rat, chicken, rice, and yeast, and for other organisms of interest. The raw outputs of these studies often exceed 1 terabyte of data and are pushing the limits of feasibility for the computations involved. Biological datasets are only increasing in size as data for more individuals and more environments are collected, so if we have not yet reached the breaking point for traditional models of computation in computational biology, it is just over the horizon. It is clear that the only long-term solution is to combine research in computational biology with advances from high performance computing (HPC), especially to parallelize computations across multiple processors and to utilize high performance distributed file systems.
Research Interests
18. Genomic Analyses of the Microsporidian Nosema ceranae, an Emergent Pathogen of Honey Bees
Cornman, RS, Chen, YP, Schatz, MC, et al, (2009) PLoS Pathogens 5(6):e1000466.
17. A whole-genome assembly of the domestic cow, Bos taurus
Zimin, AV, Delcher, AL, Florea, L, Kelley, DR, Schatz, MC, et al, (2009) Genome Biology 10:R42
16. CloudBurst: Highly Sensitive Read Mapping with MapReduce
Schatz, MC (2009) Bioinformatics 25:1363-1369
15. Comparative genomics of mutualistic viruses of Glyptapanteles parasitic wasps.
Desjardins, CA, et al. (2009) Genome Biology 9:R183
14. Characterization of Insertion Sites in Rainbow Papaya, the First Commercialized Transgenic Fruit Crop.
Suzuki, JY, et al. (2008) Tropical Plant Biology 1:293-309
13. Revealing Biological Modules via Graph Summarization.
Navlakha, S, Schatz, M, Kingsford, C. (2008) Journal of Computational Biology 16(2): 253-264.
12. Genome sequence and rapid evolution of the rice pathogen Xanthomonas oryzae pv. oryzae PXO99A.
Salzberg, SL, Sommer DS, Schatz, MC, et al. (2008) BMC Genomics 9:204
11. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus).
Ming, R, et al. (2008) Nature 452, 991-996.
10. Genome Assembly forensics: finding the elusive mis-assembly.
Phillippy, AM, Schatz, MC, Pop, M. (2008) Genome Biology 9:R55.
9. High-throughput sequence alignment using Graphics Processing Units.
Schatz, MC, Trapnell, C, Delcher, AL, Varshney, A. (2007) BMC Bioinformatics 8:474.
8. Evolution of genes and genomes on the Drosophila phylogeny.
Drosophila 12 Genomes Consortium. (2007) Nature Nov 8;450(7167):203-18.
7. Draft Genome of the Filarial Nematode Parasite Brugia malayi.
Ghedin, E, et al. (2007) Science 317(5845):1756-1760.
6. Structure and evolution of a proviral locus of Glyptapanteles indiensis bracovirus.
Desjardins, CA, et al. (2007) BMC Microbiology 7:61
5. Genome Sequence of Aedes aegypti, a Major Arbovirus Vector.
Nene, V et al. (2007) Science 316(5832), 1718-1723.
4. Hawkeye: a visual analytics tool for genome assemblies.
Schatz, MC, Phillippy, AM, Shneiderman, B, Salzberg, SL. (2007) Genome Biology 8:R34.
3. Draft Genome Sequence of the Sexually Transmitted Pathogen Trichomonas vaginalis.
Carlton, JM, Hirt, RP, Silva, JC, Delcher, AL, Schatz, M, et al. (2007) Science 315 (5809), 207-212.
2. Major structural differences and novel potential virulence mechanisms from the genomes of multiple campylobacter species.
Fouts, DE et al. (2005) PLoS Biology 3 (1):e15.
1. Automated correction of genome sequence errors.
Gajer, P, Schatz, M, Salzberg, SL. (2004) Nucleic Acids Research 32 (2):562-569.
Full Citations
Faculty of 1000 Reviews
Articles
| AMOS | A fast and flexible API for genome assembly and manipulations |
| AMOSValidate | AMOS Assembly Forensics pipeline for discovering mis-assemblies |
| AutoEditor | Automatic correction of genome sequencing errors |
| BlastReduce | High Performance Short Read Mapping with MapReduce (superseded by CloudBurst) |
| Celera Assembler | The program used to assemble the human genome at Celera Genomics in 2001 |
| CloudBurst | Highly Sensitive Short Read Mapping with MapReduce |
| Cmatch | Extremely fast end-to-end sequence matching on the GPU (superseded by MUMmerGPU) |
| Hawkeye | Genome Assembly Viewer and Analysis Tool. |
| MUMmer | A modular system for the rapid whole genome alignment of finished or draft sequence |
| MUMmerGPU | High-throughput Sequence Alignment on the GPU using CUDA GPGPU API from nVidia |
| PhyloTrac | Visualization and Analysis tool for the PhyloChip |
| Slice Tools | Low level tools for assembly manipulation |
15. Assembly Bootcamp. Presentation at
UMD Institute for Genome Sciences. June 26 2009.
The theory and practice of genome assembly using the Celera Assembler. Special emphasis is given to
tuning the parameters and settings to get the best results for your data.
14. High Throughput Sequence Analysis with MapReduce. Presentation at J. Craig Venter Institute Informatics Seminar. June 18, 2009.
MapReduce is the parallel distributed computing framework developed by Google
for large data computations, including analyzing their collection of more than
1 trillion web pages on clusters with tens of thousands of nodes. This system
enables rapid development of highly scalable applications, because
developers write just a few application specific functions, and the system
automatically and intelligently provides the scheduling, monitoring, and
partitioning necessary to scale to this size. Furthermore, MapReduce is
becoming a de facto standard for executing large computations within the
cloud, where remote compute resources are used generically under a
pay-as-you-go pricing model.
In this presentation, I will describe the leading open-source implementation
of MapReduce called Hadoop, the cloud computing capabilities of Amazon, and
outline MapReduce-based sequence analysis algorithms for read alignment, SNP
discovery, and genome assembly. Scalable algorithms for these problems are
essential given that current sequencing technologies routinely generate tens
or hundreds of gigabytes of data for a single experiment, and can require
hundreds or thousands of hours of computation. The results show MapReduce is
an extremely effective system for analyzing these datasets, with near linear
speedups as the size of the cluster grows. Furthermore, the Amazon compute
cloud can be an efficient and cost-effective resource, especially for
periodic or unusually large compute tasks.
13. Genetic Sequence Analysis in the Clouds. Presented by Jimmy Lin at the Hadoop Summit 2009. Santa Clara, CA. June 10 2009.
12. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Presentation for Amazon Web Services Start Up Event - Washington DC. May 27, 2009.
11. High Throughput Sequence Alignment using Graphics Processing Units. Presentation for UMD induction as an nVidia CUDA Center of Excellence.
10. Towards a de novo short read assembler for large genomes using cloud computing, Poster at Biology of Genomes 09, Cold Spring Harbor, NY, May 2009.
The massive volume of data and short read lengths from next generation DNA
sequencing machines have spurred development of a new class of short read
genome assemblers. Several of the new assemblers, such as Velvet and Euler-USR,
model the assembly problem as constructing, simplifying, and traversing the
de Bruijn graph of the read sequences, where nodes in the graph represent
k-mers in the reads, with edges between nodes for consecutive k-mers. This
approach has many advantages for these data, such as efficient computation of
overlapping reads and robust handling of sequencing errors, and has
demonstrated success for assembling small to moderately sized genomes. However,
this approach is computationally challenging to scale to mammalian-sized
genomes because it requires constructing and manipulating a graph far larger
than can fit into memory.
MapReduce was developed at Google for parallel computation on their extremely
large data sets, including their database of more than 1 trillion web pages.
Computation in MapReduce is structured into two main phases: the map phase
constructs a large distributed hash table of key-value pairs, and the reduce
phase then evaluates a function on each bucket of the hash table. The power of
MapReduce is that dozens or hundreds of map and reduce instances can execute in
parallel, enabling efficient computation even on terabyte and petabyte sized
data sets.
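As a rough illustration of the two phases described above, the following single-process Python sketch imitates the map step, the grouping of emitted pairs into buckets, and the reduce step to count k-mers in a few reads. The function names (`map_reads`, `shuffle`, `reduce_counts`, `count_kmers`) and the toy reads are assumptions made for illustration only; a real MapReduce job would distribute each step across many machines.

```python
from collections import defaultdict

def map_reads(read, k):
    """Map phase: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def shuffle(pairs):
    """Group emitted values by key, standing in for the distributed hash table."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return buckets

def reduce_counts(key, values):
    """Reduce phase: evaluate a function (here, a sum) on one bucket."""
    return key, sum(values)

def count_kmers(reads, k):
    """Run map, shuffle, and reduce in sequence on a list of reads."""
    pairs = (pair for read in reads for pair in map_reads(read, k))
    buckets = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in buckets.items())

if __name__ == "__main__":
    # Hypothetical toy reads, chosen only to show the data flow.
    print(count_kmers(["ACGT", "CGTA"], 3))
```

Because each call to `map_reads` touches one read and each call to `reduce_counts` touches one bucket, both steps could in principle run independently in parallel, which is the property the paragraph above relies on.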
Drawing on the success of CloudBurst, a MapReduce-based short read mapping
algorithm capable of mapping millions of reads to the human genome with high
sensitivity, we have developed a MapReduce-based short read assembler that shows
tremendous potential for enabling de novo assembly of mammalian-sized genomes.
The de Bruijn graph is constructed with MapReduce by emitting and then grouping
key-value pairs (k_i, k_{i+1}) between successive k-mers in the read sequences.
After construction, MapReduce is used again to remove spurious nodes and edges
from the graph caused by sequencing errors in the reads, and to compress simple
chains of nodes into long sequence nodes representing the unambiguous regions
of the genome between repeat boundaries. The resulting graph is a small fraction
of the size of the original de Bruijn graph, and is output in a format compatible
with other short read assemblers for additional analysis.
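The construction and compression steps described above can be sketched in ordinary Python, under the assumption that the whole graph fits in memory (which is exactly what the distributed version avoids). This is not the assembler's actual code: the function names are hypothetical, the error-removal step is omitted, and chains that form a closed cycle with no branch point are skipped for brevity.

```python
from collections import defaultdict

def emit_edges(read, k):
    """Map step: emit (k_i, k_{i+1}) for successive k-mers in one read."""
    for i in range(len(read) - k):
        yield read[i:i + k], read[i + 1:i + k + 1]

def build_graph(reads, k):
    """Group step: collect the distinct successors of every k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for src, dst in emit_edges(read, k):
            graph[src].add(dst)
            graph.setdefault(dst, set())  # make sure sink nodes are present
    return graph

def compress_chains(graph):
    """Merge simple (non-branching) chains of nodes into longer sequences."""
    preds = defaultdict(set)
    for src, successors in graph.items():
        for dst in successors:
            preds[dst].add(src)

    def merges_with_next(node):
        # True when node has one successor and is that successor's only predecessor.
        if len(graph[node]) != 1:
            return False
        (nxt,) = graph[node]
        return len(preds[nxt]) == 1 and nxt != node

    sequences = []
    for node in sorted(graph):
        # Skip nodes that will be absorbed into a chain begun by their predecessor.
        if len(preds[node]) == 1:
            (prev,) = preds[node]
            if merges_with_next(prev):
                continue
        sequence, current = node, node
        while merges_with_next(current):
            (current,) = graph[current]
            sequence += current[-1]  # successive k-mers overlap by k-1 bases
        sequences.append(sequence)
    return sorted(sequences)

if __name__ == "__main__":
    # Two hypothetical reads that share a prefix and then diverge.
    graph = build_graph(["ACGTA", "ACGTC"], 3)
    print(compress_chains(graph))
```

In this toy input the shared prefix collapses into a single node and the two divergent k-mers remain separate, mirroring how unambiguous regions end at branch points in the graph.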
9. Improving the genome sequence of D. simulans via co-assembly of multiple strains, Poster at Biology of Genomes 09, Cold Spring Harbor, NY, May 2009.
8. A whole-genome assembly of the domestic cow, B. taurus, Poster at Biology of Genomes 09, Cold Spring Harbor, NY, May 2009.
7. Better Modules in Protein-Protein Interaction Networks, Poster at Pacific Symposium on Biocomputing, Hawaii, January 2009.
Revealing Biological Modules via Graph Summarization. Presentation at RECOMB-SB/RG/DREAM3 2008 satellite conference, Boston MA, Oct 2008
A technique called Graph Summarization can be used to partition protein-protein
interaction networks to reveal modules that are more biologically relevant than the clusters
produced by other graph partitioning techniques. We apply GS to predict Gene
Ontology annotations of biological process for proteins of unknown annotations. We also apply
it to detecting membership in protein complexes, as annotated in the MIPS catalog.
GS outperforms other approaches such as MCODE, MCL and modularity.
6. Genome Assembly Forensics: Finding the Elusive Mis-assembly, Poster at Biology of Genomes, Cold Spring Harbor, NY, 5/9/2008.
Since the initial "draft" sequence of the human genome was released in 2001, it has become clear that it was not an entirely accurate reconstruction of the genome. Despite significant advances in sequencing and assembly since then, genome sequencing continues to be an inexact process. Genome finishing and validation have remained a largely manual and expensive process, and consequently, many genomes are presented as draft assemblies. Draft assemblies are of unknown quality and potentially contain significant mis-assemblies, such as collapsed repeats, sequence excision, or artificial rearrangements. Too often these assemblies are judged only by contig size, with larger contigs preferred without regard to quality, because it has been difficult to gauge large scale assembly quality.
Our new automated software pipeline, amosvalidate, addresses this deficiency and automatically detects mis-assemblies using a battery of known and novel assembly quality metrics. Instead of focusing on a single assembly characteristic as other validation approaches have tried, the power of our approach comes from leveraging multiple sources of evidence. amosvalidate statistically analyzes mate-pair orientations and separations, repeat content, depth-of-coverage, correlated polymorphisms in the read alignments, and read alignment breakpoints to identify structurally suspicious regions of the assembly. The suspicious regions identified by individual metrics are then clustered and combined to identify (with high confidence) regions that are mis-assembled. This approach is necessary for accurately detecting mis-assemblies because each individual characteristic has unavoidable natural variation, but considered together they provide greatly increased analysis power. Furthermore, our pipeline can easily be adjusted to analyze assemblies utilizing new sequencing technologies where some metrics are unreliable or not available, such as base pair quality or mate pairs.
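One simple way to picture combining evidence from several metrics is an interval sweep that reports regions flagged by at least a minimum number of distinct metrics. The sketch below is an assumption-laden illustration, not the clustering that amosvalidate actually performs: the metric names, the half-open `(start, end)` coordinates, and the `min_metrics` threshold are all hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals from a single metric."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def combine_evidence(flags, min_metrics=2):
    """Return regions flagged by at least min_metrics distinct metrics.

    flags maps a metric name to a list of half-open (start, end) intervals.
    """
    events = []
    for intervals in flags.values():
        # Merge within a metric first so one metric is never counted twice.
        for start, end in merge_intervals(intervals):
            events.append((start, 1))
            events.append((end, -1))
    events.sort()  # at equal positions, interval ends sort before starts

    regions, depth, region_start = [], 0, None
    for position, delta in events:
        previous = depth
        depth += delta
        if previous < min_metrics <= depth:
            region_start = position
        elif depth < min_metrics <= previous:
            regions.append((region_start, position))
    return regions

if __name__ == "__main__":
    # Hypothetical flagged regions on one contig.
    flags = {
        "mate_pairs": [(100, 200), (150, 250)],
        "coverage": [(180, 300)],
        "snps": [(500, 600)],
    }
    print(combine_evidence(flags))
```

Requiring agreement between metrics trades some sensitivity for confidence: a region flagged by only one noisy metric is dropped, while a region supported by several is reported.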
Our validation pipeline provides a robust measure of assembly quality that goes beyond the simple measures commonly reported. Evaluation of the pipeline has shown it to be highly sensitive for mis-assembly detection, and has revealed mis-assemblies in both draft and finished genomes. This is particularly troubling as scientists move away from the "gene by gene" paradigm and attempt to understand the global organization of genomes. Without a correct genome sequence or even a clear understanding of the errors present, such studies may draw incorrect conclusions. Our goals are to help scientists locate mis-assembled regions of an assembly and help them correct those regions by focusing their efforts where they are needed most. amosvalidate is compatible with many common assembly formats and is released open-source at http://amos.sourceforge.net.
5. Hunting Down the Papaya Transgenes, Talk at PAG-XVI, San Diego CA, 1/16/2008
In the middle of the last century, the Papaya ringspot potyvirus (PRSV)
devastated the papaya industry on the island of Oahu in Hawaii and in other
fields throughout the world. With the imminent threat of the disease spreading
to the fields in the Puna district of Hawaii island, researchers in the mid
1980s developed PRSV-resistant transgenic lines of papaya using the
pathogen-derived resistance approach, in which genes from PRSV were inserted
into the papaya genome using a gene gun. The commercialization of these
transgenic lines in the late 1990s virtually saved the Hawaiian papaya
industry, but without a full genome sequence, there was lingering concern as
to the exact nature of the transgenic insertions.
In my presentation, I will report on the draft genome sequence of the
virus-resistant ‘SunUp’ papaya, created in collaboration with the University
of Hawaii, the University of Illinois at Urbana-Champaign, and other
institutions. I will focus on the computational methods used for assembling
the genome, validating its correctness, and the subsequent search for
transgenic inserts. Our genome wide analysis, combined with Southern blot
analysis and directed PCR, confirms the efficiency of the gene gun technology,
with only 3 conclusive transgenic insertions. In addition, even though the
papaya genome is nearly twice the size of the Arabidopsis genome, it contains
fewer genes, and thus makes it an excellent candidate for further study of
biosynthetic pathways and networks.
4. High-throughput sequence alignment using Graphics Processing Units, Talk at CBCB Seminar, 9/20/2007
High-throughput sequence alignment using Graphics Processing Units, Poster in 15th Annual Microbial Genomes Conference, 2007
Co-presented with Cole Trapnell
The recent availability of new, less expensive high-throughput DNA sequencing
technologies has yielded a dramatic increase in the volume of sequence data
that must be analyzed. Sequence alignment programs such as MUMmer have proven
essential for analysis of these data, but researchers will need ever faster,
high-throughput alignment tools running on inexpensive hardware to keep up
with new sequence technologies. We present MUMmerGPU,
a high-throughput parallel sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms MUMmer by more than 3-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies.
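To give a feel for matching many queries against one indexed reference, here is a small CPU-only sketch. It deliberately swaps in a suffix array with binary search in place of the suffix tree and CUDA kernel that MUMmerGPU uses, and it reports only full-length exact matches. The function names and the toy sequences are assumptions for illustration.

```python
def build_suffix_array(reference):
    """Start positions of all suffixes in sorted order (simple, not linear time)."""
    return sorted(range(len(reference)), key=lambda i: reference[i:])

def find_occurrences(reference, suffix_array, query):
    """Return the sorted positions where query occurs exactly in reference."""
    m = len(query)

    def prefix(rank):
        start = suffix_array[rank]
        return reference[start:start + m]

    # Lower bound: first suffix whose length-m prefix is >= query.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > query.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

def align_batch(reference, queries):
    """Index the reference once, then look up every query independently."""
    suffix_array = build_suffix_array(reference)
    return {q: find_occurrences(reference, suffix_array, q) for q in queries}

if __name__ == "__main__":
    # Hypothetical reference and queries.
    print(align_batch("GATTACAGATTA", ["ATTA", "GATT", "CCC"]))
```

Each query lookup reads the shared index without modifying it, so the lookups are independent of one another; that independence is what makes this kind of workload a natural fit for parallel hardware.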
3. Interactive visual analytic tools for genome assembly, Talk at 9th Annual Computational Genomics Conference, 10/29/2006
Genome assembly remains an inexact science. Even when accomplished with the best software available, the assembly of a genome often contains numerous errors, both small and large. Hawkeye is a visual analytics tool for genome assembly analysis and validation, designed to aid in identifying and correcting assembly errors. Hawkeye blends the best practices from information and scientific visualization to facilitate inspection of large-scale assembly data while minimizing the time needed to detect mis-assemblies and make accurate judgments of assembly quality.
All levels of the assembly data hierarchy are made accessible to users, along with summary statistics and common assembly metrics. A ranking component guides investigation towards likely mis-assemblies or interesting features to support the task at hand. Wherever possible, high-level overviews, dynamic filtering, and automated clustering are leveraged to focus attention and highlight anomalies in the data. Hawkeye's effectiveness has been proven on several genome projects, where it has been used both to improve quality and to validate the correctness of complex genomes.
2. AMOS Assembly Validation and Visualization, Talk at The Institute for Genomic Research, 4/7/2006
During my talk, I will discuss the techniques and tools to discover and correct
misassemblies in genome assemblies. The three primary sources of information
used to detect misassemblies are the "happiness" of the mate-pairs, the
base call agreement within the multiple alignment of reads, and the depth of coverage of
those reads.
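Two of these signals are easy to sketch. The example below assumes a mate pair counts as "happy" when its reads face each other and their separation is within a few standard deviations of the library mean, and it computes per-base depth of coverage from read placements. The `MatePair` fields, the three-standard-deviation cutoff, and the half-open coordinates are assumptions for illustration, not the definitions used by the AMOS tools.

```python
from dataclasses import dataclass

@dataclass
class MatePair:
    fwd_start: int  # leftmost coordinate of the forward-strand read
    rev_end: int    # rightmost coordinate of the reverse-strand read
    facing: bool    # True when the two reads point toward each other

def is_happy(pair, mean, stdev, max_dev=3.0):
    """Check orientation and separation against the library's insert size."""
    if not pair.facing:
        return False
    separation = pair.rev_end - pair.fwd_start
    return abs(separation - mean) <= max_dev * stdev

def depth_of_coverage(read_intervals, contig_length):
    """Per-base read depth from half-open (start, end) read placements."""
    deltas = [0] * (contig_length + 1)
    for start, end in read_intervals:
        deltas[start] += 1
        deltas[end] -= 1
    depth, running = [], 0
    for delta in deltas[:contig_length]:
        running += delta
        depth.append(running)
    return depth

if __name__ == "__main__":
    # Hypothetical library with a 2000 bp mean insert and 200 bp deviation.
    print(is_happy(MatePair(100, 2100, True), mean=2000, stdev=200))
    print(is_happy(MatePair(100, 3100, True), mean=2000, stdev=200))
    print(depth_of_coverage([(0, 3), (2, 5)], 6))
```

Clusters of unhappy mate pairs or stretches of unusually high or low depth would then be candidate regions for closer inspection.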
The open source AMOS
assembly package provides tools for systematically analyzing these qualities to discover
regions with potential misassemblies. The AMOS Assembly Investigator is a
powerful genome assembly visualizer with semantic zooming capabilities. It
allows one to navigate and visually inspect these potential misassemblies
in a systematic fashion at all levels of detail. Once regions with misassemblies
have been identified, users can correct the misassemblies with the AMOS
contig patching tools.
1. Improving Genome Assemblies without Sequencing, Talk at CBCB Seminar, 9/28/2005
Improving Genome Assembly without Sequencing. RECOMB 2005 Poster.
Genome assembly is the problem of reconstructing the genome sequence
of an organism from a collection of short sequenced reads. An assembly
takes the form of contiguous stretches of DNA sequence (contigs)
linked together in scaffolds by mate-pair and other information.
Genome assembly is scientifically one of the most important areas of
bioinformatics research as an accurate genome sequence is needed for
addressing several fundamental biological questions. Unfortunately, it
is also one of the most computationally complex: it has been proved
NP-hard under various formalisms, and a typical problem involves
thousands or millions of inputs.
During my talk, I will discuss some of the algorithmic challenges and
trade-offs in genome assembly. I will also discuss some computational
methods for improving an assembly, which can be applied generally but
without requiring additional laboratory results. One method was
implemented in AutoEditor, which acts as a second generation
base-caller to find and correct base-calling errors in reads using the
original chromatogram trace and the multiple alignment of reads. A
second was implemented in AutoJoiner, which attempts to automatically
close gaps between linked contigs, and generally enhance contig
quality, by extending the usable portion of reads within an assembly.
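The multiple-alignment half of this idea can be illustrated with a column-wise vote. The sketch below is not AutoEditor's algorithm: it ignores the chromatogram trace entirely and simply replaces a base when a strong majority of the aligned reads disagree with it. The function name, the equal-length gapless input, and the `min_agreement` threshold are all assumptions.

```python
from collections import Counter

def correct_reads(aligned_reads, min_agreement=0.75):
    """Replace minority bases in columns where one base clearly dominates.

    aligned_reads is a list of equal-length strings covering the same region.
    """
    corrected = [list(read) for read in aligned_reads]
    for index, column in enumerate(zip(*aligned_reads)):
        base, count = Counter(column).most_common(1)[0]
        # Only edit when the consensus is strong; ambiguous columns are left alone.
        if count / len(column) >= min_agreement:
            for row in corrected:
                row[index] = base
    return ["".join(row) for row in corrected]

if __name__ == "__main__":
    # Hypothetical alignment in which the third read has one discrepant base.
    print(correct_reads(["ACGT", "ACGT", "ACCT", "ACGT"]))
```

A vote like this cannot tell a base-calling error from a true polymorphism, which is one reason consulting the original trace data matters in practice.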
Teaching Assistant
Guest Lectures
Genome Assembly Class
An 8 part lecture series given at the University of Hawaii between August 13 - 18 2006. The lecture series covers the entire assembly process, from sequencing reactions, to assembly, and finishing.
The discussion begins with an overview of the assembly process, and its theoretical foundations of Lander-Waterman statistics and Shortest-Common-Superstring.
Next there is an in-depth discussion of the Celera Assembler, covering the details of overlapping, unitigging, and scaffolding.
Next, an introduction to AMOS is given, describing the motivation, framework, and a brief discussion of some of the currently available tools.
Lecture 4 discusses current methods to discover mis-assemblies and the Interactive Genome Visual Analytics tool Hawkeye, which acts as a visual portal to understanding and validating your assembly data.
Next, I discuss two common problems in assembly, base calling and trimming, and describe AutoEditor and AutoJoiner, second generation assembly tools that address these areas.
Lecture 6 is provided by Adam Phillippy and covers all aspects of Whole Genome Alignment, centered around the MUMmer suite.
The following lecture, also by Adam Phillippy, describes the AMOScmp Comparative Assembler which uses MUMmer to assemble genomes without the costly overlapping step even at extremely low coverage.
The final lecture acts as a summary for the class, and a checklist for potential problem areas one might encounter during whole genome assembly.
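For readers unfamiliar with the Lander-Waterman statistics mentioned in the overview, the sketch below evaluates the standard idealized formulas: coverage c = NL/G, expected fraction of the genome covered 1 - e^(-c), and expected number of contigs N e^(-c(1 - T/L)), where T is the minimum detectable overlap. The function name and the example project sizes are assumptions, and the model presumes uniformly random reads of equal length.

```python
import math

def lander_waterman(num_reads, read_length, genome_length, min_overlap=0):
    """Return (coverage, expected contigs, expected fraction of genome covered)."""
    coverage = num_reads * read_length / genome_length
    # Fraction of a read that must NOT be spent on detecting an overlap.
    sigma = 1 - min_overlap / read_length
    expected_contigs = num_reads * math.exp(-coverage * sigma)
    fraction_covered = 1 - math.exp(-coverage)
    return coverage, expected_contigs, fraction_covered

if __name__ == "__main__":
    # Hypothetical project: 10,000 reads of 500 bp from a 1 Mbp genome.
    coverage, contigs, covered = lander_waterman(10_000, 500, 1_000_000)
    print(f"coverage {coverage:.1f}x, about {contigs:.0f} contigs, "
          f"{covered:.1%} of the genome covered")
```

Real projects usually fall short of these estimates because repeats and sampling biases violate the uniform-sampling assumption.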
1. Genome Assembly: Assembly Concepts and Methods : Assembly Overview, Lander-Waterman Statistics, Shortest-Common-Superstring, Contigging, Scaffolding
2. Celera Assembler: Theory and Practice : runCA, overlapper, unitigging, scaffolding
3. AMOS: A Modular Open Source Assembler : AMOS overview, runAMOS, AMOS banks, Converters
4. AMOS Assembly Validation and Visualization : Mate-pairs, SNPs, Coverage levels, Hawkeye, stitchContigs, Assembly Repair
5. Improving Assembly without Sequencing : Basecalling, AutoEditor, Trimming, AutoJoiner
6. Whole Genome Alignment : Alignment, Smith-Waterman, MUMmer, Suffix Trees
7. Comparative Genome Assembly : AMOScmp, MUMmer, reference assembly
8. Assembly Checklist : Sequencing, Libraries, Biases, Coverage, Unitigging, Scaffolding
My wife's site: Emery Hurst Mikel
Our Wedding Website
My Martial Arts School: Lung Chuan Fa
Last updated: Friday, 26-Jun-2009 16:52:22 EDT