
Michael C. Schatz
Center for Bioinformatics and Computational Biology
3120G Biomolecular Sciences Building #296
University of Maryland
College Park, MD 20742
mschatz [a t] umiacs.umd.edu
mschatz [a t] cs.umd.edu
Office: 301 405 7169
Cell: 703 966 1987
Fax: 301 314 1341
Ph.D. Computer Science - University of Maryland - in progress, Advisor: Steven Salzberg
B.S. Computer Science - Carnegie Mellon University - 2000
We are entering the era of genomics in science and medicine in which
biological systems are understood in terms of their precise genetic
components, and treatments are individualized for each patient. The first
major milestone of this era, the sequencing of the human genome, has been
achieved, but significant challenges remain in unlocking the full meaning
of the genome. My research is aimed towards realizing the goals of genomics
through improved computational methods for DNA sequencing and analysis. My
work includes developing software for genome assembly, which can create a more
accurate and more complete reconstruction of the genome than other assemblers.
An accurate assembly is a fundamental requirement for a wide variety of important
biological analyses including gene annotation, comparative genomics,
and disease genotyping. I have also been researching new high performance
computing hardware for use in bioinformatics, and coauthored an open source
DNA sequence alignment tool that runs on highly parallel graphics processing
units. We found the graphics hardware was 10x faster than a regular CPU for
this application, and are eager to implement new applications on the hardware.
This level of performance is necessary to manage the avalanche of sequencing
data coming from recently available high throughput sequencing technologies,
which can create gigabytes of sequence data in a few hours and will be
used for studying health and disease at unprecedented scales. Finally, I'm
developing the visualization and analysis software for the PhyloChip, which
can perform push-button environmental and metagenomics analysis. Only with
the proper combination of biological knowledge and computer science will we
realize the goals of genomics.
Research Interests
- Genome Assembly & Validation
- Comparative Genomics & Metagenomics
- Sequence Alignment
- Environmental Sampling
- Scientific Visualization
- High Performance and Multi-Core Computing
| 12. | Salzberg, S.L., Sommer, D.S., Schatz, M.C. et al. (2008) Genome sequence and rapid evolution of the rice pathogen Xanthomonas oryzae pv. oryzae PXO99A. BMC Genomics 9:204. |
| 11. | Ming, R et al. (2008) The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus) Nature 452, 991-996. |
| 10. | Phillippy, A.M., Schatz, M.C., Pop, M. (2008) Genome Assembly forensics: finding the elusive mis-assembly. Genome Biology 9:R55. |
| 9. | Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A. (2007) High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics 8:474. |
| 8. | Drosophila 12 Genomes Consortium. (2007) Evolution of genes and genomes on the Drosophila phylogeny. Nature Nov 8;450(7167):203-18. |
| 7. | Ghedin, E., et al. (2007) Draft Genome of the Filarial Nematode Parasite Brugia malayi. Science 317(5845):1756-1760. |
| 6. | Desjardins, C.A., et al. (2007) Structure and evolution of a proviral locus of Glyptapanteles indiensis bracovirus. BMC Microbiology 7:61. |
| 5. | Nene, V., et al. (2007) Genome Sequence of Aedes aegypti, a Major Arbovirus Vector. Science 316(5832), 1718-1723. |
| 4. | Schatz, M.C., Phillippy, A.M., Shneiderman, B., Salzberg, S.L. (2007) Hawkeye: a visual analytics tool for genome assemblies. Genome Biology 8:R34. |
| 3. | Carlton, J.M., Hirt, R.P., Silva, J.C., Delcher, A.L., Schatz, M., et al. (2007) Draft Genome Sequence of the Sexually Transmitted Pathogen Trichomonas vaginalis. Science 315 (5809), 207-212. |
| 2. | Fouts, D.E., et al. (2005) Major structural differences and novel potential virulence mechanisms from the genomes of multiple campylobacter species. PLoS Biology 3 (1):e15. |
| 1. | Gajer, P., Schatz, M., Salzberg, S.L. (2004) Automated correction of genome sequence errors. Nucleic Acids Research 32 (2):562-569. |
Full Citations
| AMOS | A fast and flexible API for genome assembly and manipulations |
| AMOSValidate | AMOS Assembly Forensics pipeline for discovering mis-assemblies |
| AutoEditor | Automatic correction of genome sequencing errors |
| BlastReduce | High Performance Short Read Mapping with MapReduce |
| Celera Assembler | The program used to assemble the human genome at Celera Genomics in 2001 |
| Cmatch | Extremely fast end-to-end sequence matching on the GPU (superseded by MUMmerGPU) |
| Hawkeye | Genome Assembly Viewer and Analysis Tool. |
| MUMmer | A modular system for the rapid whole genome alignment of finished or draft sequence |
| MUMmerGPU | High-throughput Sequence Alignment on the GPU using CUDA GPGPU API from nVidia |
| PhyloTrac | Visualization and Analysis tool for the PhyloChip |
| Slice Tools | Low level tools for assembly manipulation |
6. Genome Assembly Forensics: Finding the Elusive Mis-assembly, Biology of Genomes, Cold Spring Harbor, NY, 5/9/2008.
Since the initial "draft" sequence of the human genome was released in 2001, it has become clear that it was not an entirely accurate reconstruction of the genome. Despite significant advances in sequencing and assembly since then, genome sequencing continues to be an inexact process. Genome finishing and validation have remained a largely manual and expensive process, and consequently, many genomes are presented as draft assemblies. Draft assemblies are of unknown quality and potentially contain significant mis-assemblies, such as collapsed repeats, sequence excision, or artificial rearrangements. Too often these assemblies are judged only by contig size, with larger contigs preferred without regard to quality, because it has been difficult to gauge large scale assembly quality.
Our new automated software pipeline, amosvalidate, addresses this deficiency and automatically detects mis-assemblies using a battery of known and novel assembly quality metrics. Instead of focusing on a single assembly characteristic as other validation approaches have tried, the power of our approach comes from leveraging multiple sources of evidence. amosvalidate statistically analyzes mate-pair orientations and separations, repeat content, depth-of-coverage, correlated polymorphisms in the read alignments, and read alignment breakpoints to identify structurally suspicious regions of the assembly. The suspicious regions identified by individual metrics are then clustered and combined to identify (with high confidence) regions that are mis-assembled. This approach is necessary for accurately detecting mis-assemblies because each individual characteristic has unavoidable natural variation, but considered together they provide greatly increased analytical power. Furthermore, our pipeline can easily be adjusted to analyze assemblies utilizing new sequencing technologies where some metrics are unreliable or not available, such as base pair quality or mate pairs.
Our validation pipeline provides a robust measure of assembly quality that goes beyond the simple measures commonly reported. Evaluation of the pipeline has shown it to be highly sensitive for mis-assembly detection, and has revealed mis-assemblies in both draft and finished genomes. This is particularly troubling as scientists move away from the "gene by gene" paradigm and attempt to understand the global organization of genomes. Without a correct genome sequence or even a clear understanding of the errors present, such studies may draw incorrect conclusions. Our goals are to help scientists locate mis-assembled regions of an assembly and help them correct those regions by focusing their efforts where they are needed most. amosvalidate is compatible with many common assembly formats and is released open-source at http://amos.sourceforge.net.
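The idea of combining evidence from several metrics can be illustrated with a minimal Python sketch. It is not the amosvalidate implementation: the metric names, the half-open interval representation, and the `min_metrics` threshold are all assumptions chosen for illustration. It reports the regions of a contig that are flagged by at least a given number of distinct metrics.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in contig coordinates


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def regions_with_support(flags: Dict[str, List[Interval]],
                         min_metrics: int = 2) -> List[Interval]:
    """Return regions flagged by at least `min_metrics` distinct metrics.

    `flags` maps a (hypothetical) metric name to the intervals it marked
    as suspicious. Each metric is merged first so it counts at most once
    at any position.
    """
    events = []
    for intervals in flags.values():
        for start, end in merge(intervals):
            events.append((start, 1))
            events.append((end, -1))
    # At equal positions, closing events (-1) sort before opening events (+1),
    # which matches half-open interval semantics.
    events.sort()

    result: List[Interval] = []
    depth = 0
    region_start = 0
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < min_metrics <= depth:
            region_start = pos
        elif depth < min_metrics <= previous and pos > region_start:
            result.append((region_start, pos))
    return merge(result)


# Example with made-up coordinates: only 150-200 is supported by two metrics.
example = {
    "mate_pairs": [(100, 200)],
    "coverage": [(150, 250)],
    "polymorphisms": [(400, 500)],
}
suspicious = regions_with_support(example, min_metrics=2)
```

Requiring agreement between metrics trades some sensitivity for confidence, since a region flagged by a single noisy metric is ignored.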
5. Hunting Down the Papaya Transgenes, PAG-XVI, San Diego CA, 1/16/2008
In the middle of the last century, the Papaya ringspot potyvirus (PRSV)
devastated the papaya industry on the island of Oahu in Hawaii and in other
fields throughout the world. With the imminent threat of the disease spreading
to the fields in the Puna district of Hawaii island, researchers in the mid
1980s developed PRSV-resistant transgenic lines of papaya using the
pathogen-derived resistance approach, in which genes from PRSV were inserted
into the papaya genome using a gene gun. The commercialization of these
transgenic lines in the late 1990s virtually saved the Hawaiian papaya
industry, but without a full genome sequence, there was lingering concern as
to the exact nature of the transgenic insertions.
In my presentation, I will report on the draft genome sequence of the
virus-resistant ‘SunUp’ papaya, created in collaboration with the University
of Hawaii, the University of Illinois at Urbana-Champaign, and other
institutions. I will focus on the computational methods used for assembling
the genome, validating its correctness, and the subsequent search for
transgenic inserts. Our genome wide analysis, combined with Southern blot
analysis and directed PCR, confirms the efficiency of the gene gun technology,
with only 3 conclusive transgenic insertions. In addition, even though the
papaya genome is nearly twice the size of the Arabidopsis genome, it contains
fewer genes, and thus makes it an excellent candidate for further study of
biosynthetic pathways and networks.
4. High-throughput sequence alignment using Graphics Processing Units, CBCB Seminar, 9/20/2007
High-throughput sequence alignment using Graphics Processing Units, Poster in 15th Annual Microbial Genomes Conference, 2007
Co-presented with Cole Trapnell
The recent availability of new, less expensive high-throughput DNA sequencing
technologies has yielded a dramatic increase in the volume of sequence data
that must be analyzed. Sequence alignment programs such as MUMmer have proven
essential for analysis of these data, but researchers will need ever faster,
high-throughput alignment tools running on inexpensive hardware to keep up
with new sequence technologies. We present MUMmerGPU,
a high-throughput parallel sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms MUMmer by more than 3-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies.
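As a rough illustration of matching many queries against one indexed reference, here is a small CPU-only Python sketch. It is not MUMmerGPU: it swaps in a naively built suffix array for the suffix tree, finds only exact occurrences of whole queries rather than maximal matches, and all function names are hypothetical. The point it shows is that each query is looked up independently against a shared read-only index, which is the property that makes the problem suitable for parallel hardware.

```python
from typing import Dict, List


def build_suffix_array(ref: str) -> List[int]:
    """Naive suffix array: suffix start positions in lexicographic order.

    Fine for a short demonstration string, far too slow for a real genome.
    """
    return sorted(range(len(ref)), key=lambda i: ref[i:])


def find_occurrences(ref: str, sa: List[int], query: str) -> List[int]:
    """Return all start positions where `query` occurs exactly in `ref`."""
    m = len(query)

    # Lower bound: first suffix whose length-m prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


def align_all(ref: str, queries: List[str]) -> Dict[str, List[int]]:
    """Look up every query against the same index.

    The lookups do not depend on each other, so they could be distributed
    across many threads; here they simply run one after another.
    """
    sa = build_suffix_array(ref)
    return {q: find_occurrences(ref, sa, q) for q in queries}


hits = align_all("GATTACAGATTACA", ["ATTA", "ACA", "CCC"])
```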
3. Interactive visual analytic tools for genome assembly, 9th Annual Computational Genomics Conference, 10/29/2006
Genome assembly remains an inexact science. Even when accomplished with the best software available, the assembly of a genome often contains numerous errors, both small and large. Hawkeye is a visual analytics tool for genome assembly analysis and validation, designed to aid in identifying and correcting assembly errors. Hawkeye blends the best practices from information and scientific visualization to facilitate inspection of large-scale assembly data while minimizing the time needed to detect mis-assemblies and make accurate judgments of assembly quality.
All levels of the assembly data hierarchy are made accessible to users, along with summary statistics and common assembly metrics. A ranking component guides investigation towards likely mis-assemblies or interesting features to support the task at hand. Wherever possible, high-level overviews, dynamic filtering, and automated clustering are leveraged to focus attention and highlight anomalies in the data. Hawkeye's effectiveness has been proven on several genome projects, where it has been used both to improve quality and to validate the correctness of complex genomes.
2. AMOS Assembly Validation and Visualization, The Institute for Genomic Research, 4/7/2006
During my talk, I will discuss the techniques and tools to discover and correct
misassemblies in genome assemblies. The three primary sources of information
used to detect misassemblies are the "happiness" of the mate-pairs, the
base call agreement within the multiple alignment of reads, and the depth of coverage of
those reads.
The open source AMOS
assembly package provides tools for systematically analyzing these qualities to discover
regions with potential misassemblies. The AMOS Assembly Investigator is a
powerful genome assembly visualizer with semantic zooming capabilities. It
allows one to navigate and visually inspect these potential misassemblies
in a systematic fashion at all levels of detail. Once regions with misassemblies
have been identified, users can correct the misassemblies with the AMOS
contig patching tools.
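The mate-pair "happiness" test mentioned above can be sketched in a few lines of Python. This is an illustrative sketch, not the AMOS code: the `PlacedRead` type, the inward-facing orientation convention, and the cutoff of three standard deviations are assumptions. A pair is treated as happy when the two reads face each other and their separation is close to the library's expected insert size.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    """A read placed in a contig (hypothetical minimal representation)."""
    start: int      # leftmost contig coordinate of the read
    end: int        # rightmost contig coordinate of the read
    forward: bool   # True if the read lies on the forward strand


def is_happy(a: PlacedRead, b: PlacedRead,
             mean: float, sd: float, max_dev: float = 3.0) -> bool:
    """Return True if the mate pair looks consistent with its library.

    Assumes reads from one insert should point towards each other, and
    that the outer distance should be within `max_dev` standard
    deviations of the library mean. Both thresholds are assumptions.
    """
    left, right = sorted((a, b), key=lambda r: r.start)
    # Orientation check: left read forward, right read reverse.
    if not (left.forward and not right.forward):
        return False
    insert_size = right.end - left.start
    return abs(insert_size - mean) <= max_dev * sd


# Made-up placements for a library with mean 2000 bp and sd 200 bp.
good = is_happy(PlacedRead(1000, 1800, True), PlacedRead(2500, 3300, False), 2000, 200)
stretched = is_happy(PlacedRead(1000, 1800, True), PlacedRead(6000, 6800, False), 2000, 200)
misoriented = is_happy(PlacedRead(1000, 1800, True), PlacedRead(2500, 3300, True), 2000, 200)
```

Clusters of unhappy pairs in one region, rather than isolated ones, are what would suggest a potential misassembly worth inspecting.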
1. Improving Genome Assemblies without Sequencing, CBCB Seminar, 9/28/2005
Improving Genome Assembly without Sequencing. RECOMB 2005 Poster.
Genome assembly is the problem of reconstructing the genome sequence
of an organism from a collection of short sequenced reads. An assembly
takes the form of contiguous stretches of DNA sequence (contigs)
linked together in scaffolds by mate-pair and other information.
Genome assembly is scientifically one of the most important areas of
bioinformatics research as an accurate genome sequence is needed for
addressing several fundamental biological questions. Unfortunately, it
is also one of the most computationally complex, having been proved
NP-hard under various formalisms, with typical problem sizes of
thousands or millions of inputs.
During my talk, I will discuss some of the algorithmic challenges and
trade-offs in genome assembly. I will also discuss some computational
methods for improving an assembly, which can be applied generally but
without requiring additional laboratory results. One method was
implemented in AutoEditor, which acts as a second generation
base-caller to find and correct base-calling errors in reads using the
original chromatogram trace and the multiple alignment of reads. A
second was implemented in AutoJoiner, which attempts to automatically
close gaps between linked contigs, and generally enhance contig
quality, by extending the usable portion of reads within an assembly.
Genome Assembly Class
An 8-part lecture series given at the University of Hawaii, August 13-18, 2006. The lecture series covers the entire assembly process, from sequencing reactions to assembly and finishing.
The discussion begins with an overview of the assembly process and its theoretical foundations in Lander-Waterman statistics and the Shortest Common Superstring problem.
Next there is an in-depth discussion of the Celera Assembler, covering the details of overlapping, unitigging, and scaffolding.
Next, an introduction to AMOS describes the motivation, the framework, and some of the currently available tools.
Lecture 4 discusses current methods to discover mis-assemblies and the interactive genome visual analytics tool Hawkeye, which acts as a visual portal to understanding and validating your assembly data.
Next, I discuss two common problems in assembly, base calling and trimming, and describe AutoEditor and AutoJoiner, second-generation assembly tools that address them.
Lecture 6 is provided by Adam Phillippy and covers all aspects of Whole Genome Alignment, centered around the MUMmer suite.
The following lecture, also by Adam Phillippy, describes the AMOScmp Comparative Assembler, which uses MUMmer to assemble genomes without the costly overlapping step, even at extremely low coverage.
The final lecture acts as a summary for the class and a checklist of potential problem areas one might encounter during whole genome assembly.
1. Genome Assembly: Assembly Concepts and Methods : Assembly Overview, Lander-Waterman Statistics, Shortest-Common-Superstring, Contigging, Scaffolding
2. Celera Assembler: Theory and Practice : runCA, overlapper, unitigging, scaffolding
3. AMOS: A Modular Open Source Assembler : AMOS overview, runAMOS, AMOS banks, Converters
4. AMOS Assembly Validation and Visualization : Mate-pairs, SNPs, Coverage levels, Hawkeye, stitchContigs, Assembly Repair
5. Improving Assembly without Sequencing : Basecalling, AutoEditor, Trimming, AutoJoiner
6. Whole Genome Alignment : Alignment, Smith-Waterman, MUMmer, Suffix Trees
7. Comparative Genome Assembly : AMOScmp, MUMmer, reference assembly
8. Assembly Checklist : Sequencing, Libraries, Biases, Coverage, Unitigging, Scaffolding
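The Lander-Waterman statistics covered in the first lecture can be illustrated with a short Python calculation. This is a sketch of the standard textbook formulas, not material taken from the lecture slides, and the project numbers in the example are made up. With N reads of length L from a genome of length G, coverage is c = NL/G, the expected fraction of the genome left unsequenced is e^(-c), and the expected number of contigs is N e^(-c*sigma), where sigma = 1 - T/L and T is the minimum detectable overlap.

```python
import math
from typing import Dict


def lander_waterman(genome_len: int, read_len: int,
                    n_reads: int, min_overlap: int) -> Dict[str, float]:
    """Idealized Lander-Waterman estimates for a shotgun sequencing project.

    Assumes reads are sampled uniformly at random and all have the same
    length, which real projects only approximate.
    """
    coverage = n_reads * read_len / genome_len
    sigma = 1.0 - min_overlap / read_len
    return {
        "coverage": coverage,
        "fraction_uncovered": math.exp(-coverage),
        "expected_contigs": n_reads * math.exp(-coverage * sigma),
    }


# Hypothetical project: 1 Mbp genome, 16,000 reads of 500 bp, 50 bp minimum overlap.
stats = lander_waterman(genome_len=1_000_000, read_len=500,
                        n_reads=16_000, min_overlap=50)
```

Because the model ignores repeats and sampling bias, these values are best read as optimistic bounds on what an assembly of real data would achieve.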
My wife's site: Emery Hurst Mikel
Our Wedding Website
My Martial Arts School: Lung Chuan Fa