CMSC828Q: Lectures in Bioinformatics

Essential details

Time: Wednesday, 1:30-3:00pm
Location: Biomolecular Sciences Building (BSB) Room 3118
Instructor: Mihai Pop (mpop at umiacs) x5-7245
Office hours: by appointment
Office address: 3120F Biomolecular Sciences Building (bldg #296).
Building is usually locked. Call me from the intercom and I'll buzz you in.
3223 AVW (by appointment)

Description

In this class we will watch video lectures given over the past year by well-respected scientists in the field, followed by in-class discussions.

Course topics

The course will cover the following main areas.

Field retrospectives/histories
Genome assembly
Metagenomics
Evolutionary game theory (coming soon)
Transcriptomics (coming soon)

Schedule

This schedule is tentative and will likely change early in the course. Please add any potentially interesting lectures to our Google spreadsheet. Lecture abstracts are a work in progress...

Retrospectives

9/10 - What's behind BLAST?
- Speaker: Gene Myers
- Conference: 25th Annual Symposium on Combinatorial Pattern Matching (CPM2014)
- Abstract: "The BLAST search engine was published and released in 1990. It is a heuristic that uses the idea of a neighborhood to find seed matches that are then extended. This approach came from work that this author was doing to lever these ideas to arrive at a deterministic algorithm with a characterized and superior time complexity. This [talk] reviews the history and the unfolding of the basic concepts, and it attempts to intuitively describe the deeper result whose time complexity, to this author's knowledge, has yet to be improved upon."
9/17 - Stories from the Supplement
- Speaker: Lior Pachter
- Conference: Genome Informatics 2013
- Abstract: "The idea for talking about what goes on in the supplement of papers... My talk contained three examples selected to make a number of points: Methods in the supplement frequently contain ideas that transcend the specifics of the paper. These ideas can be valuable in the long run, but when they are in the supplement it is harder to identify what they are and to appreciate their significance. Supplements frequently contain errors (my own included). These errors make it difficult for others to understand the methods and implement them independently. In RNA-Seq specifically, there are a number of methodological issues buried in the supplements of various papers that have caused confusion in the field. The constant push of methods to supplements is part of a general trend to overemphasize the importance of data while minimizing the relevance of methods."
9/24 - A History of Bioinformatics (in the Year 2039)
- Speaker: C. Titus Brown
- Conference: BOSC2014
10/1 - Keynote Lecture
- Speaker: Eric Lander
- Conference: Biology of Genomes 2013
- Abstract: A 25-year historical perspective on genome meetings.

Assembly

10/8 - DNA Assembly: Past, Present, and Future
- Speaker: Gene Myers
- Conference: ISMB2014
- Abstract: "Thirty years ago I was first introduced to the DNA assembly problem and I have been captivated by it ever since. So on the occasion of the Senior Scientist award, I thought I would speak on this problem that has been a consistent thread throughout my career.
  
  I will give a brief history, from Sanger to today, of the technology and algorithmic approaches to the problem, weaving throughout it the ideas of string graphs and de-Bruijn graphs, and the surprising transition from skepticism of whole-genome shotgun sequencing to an irrational acceptance of NGS whole-genome shotgun over short reads.
  
  Fortunately, the future portends better with long read sequencers beginning to come into play. The unusually high error rates associated with these new technologies imply that some aspects of the assembly problem are harder than ever, but because the error is truly random (unlike any previous technology), the ideal of near perfect de novo assembly is again possible. We will conclude with a description of our recent algorithmic work on an assembler we call the Dazzler (the Dresden AZZembLER) that can assemble 1-10Gb genomes directly from a shotgun, long read data set produced by PacBio RS II sequencers."
10/15 - Complete Genome Assembly with Long Reads
- Speaker: Adam Phillippy
- Conference: ISMB2014
10/22 - ExSPAnder: a Universal Repeat Resolver for DNA Fragment Assembly
- Speaker: Andrey D. Prjibelski
- Conference: ISMB2014
- Abstract: "Next-generation sequencing (NGS) technologies have raised a challenging de novo genome assembly problem that is further amplified in recently emerged single-cell sequencing projects. While various NGS assemblers can use information from several libraries of read-pairs, most of them were originally developed for a single library and do not fully benefit from multiple libraries. Moreover, most assemblers assume uniform read coverage, condition that does not hold for single-cell projects where utilization of read-pairs is even more challenging. We have developed an exSPAnder algorithm that accurately resolves repeats in the case of both single and multiple libraries of read-pairs in both standard and single-cell assembly projects."

Metagenomics

10/29 - Linking taxa to function through contig clustering of microbial metagenomes
- Speaker: Chris Quince
- Conference: Newton Inst.
- Abstract: "Taxonomic profiling of microbial communities can answer the question of "Who is there?" This can be achieved either through marker gene sequencing or true shotgun metagenomics. The latter because the functional genes of all community members are sequenced allows us to answer the additional question: "What are they doing?" However, there is a third question that is key to understanding microbial communities: "Who is doing what?" This question has received much less attention because to answer it requires the extraction of complete genomes from metagenomes. Assembly of metagenomes can generate millions of contigs, assembled genome fragments, with no information on which contig derives from which genome. Here I will present CONCOCT, a novel algorithm that combines sequence composition, coverage across multiple samples, and read-pair linkage to automatically cluster contigs into genomes. CONCOCT uses a dimensionality reduction coupled to a Gaussian mixture model, fit using a variational Bayesian algorithm which automatically identifies the optimal number of clusters. We demonstrate high recall and precision rates on artificial as well as real human gut metagenome datasets. Linking contigs into genome clusters, allows the frequencies of those clusters to be related to metadata, revealing function. We apply this approach to fecal metagenomes obtained from the E. coli O104:H4 epidemic (Germany, 2011) and are able to directly extract the outbreak genome. We also use it to identify organisms associated with inflammation in samples from children with Crohn's disease."
11/5 - Waste Not, Want Not: Why Rarefying Microbiome Data is not an optimal normalization procedure
- Speaker: Susan Holmes
- Conference: Newton Inst.
- Abstract: "Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq."

Coursework and grading

Attendance.

Acknowledgments

Lecture suggestions were taken from Adam Phillippy, Pall Melsted, Stephen Turner, Steve Mount, and Roye Rozov.