SYLLABUS

CMSC828N: Computational Gene Finding and Genome Assembly


Tuesdays and Thursdays, 12:30-1:45pm, Room 3118 Biomolecular Sciences Building

Professor: Steven Salzberg, 3125 Biomolecular Sciences Building, salzberg (at) umiacs.umd.edu
Office hours: By appointment.
Textbook: Computational Gene Prediction (CGP) by William H. Majoros


Supplemental texts, free online at the NCBI Bookshelf (click title to view):
Molecular Biology of the Cell, by Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter Walter. Garland Publishing, 2002.
Genomes, by T.A. Brown, BIOS Scientific Publishers, 2002.

Note: additional links to lecture notes and assignments will appear on the syllabus as the semester progresses.

Week 1: Sept 2-4
Introduction to the course.  Molecular biology background. 
Biotechnology background on sequencing, assembly. 

Reading: Chapter 1, "The Human Genome," in Genomes, by T.A. Brown, free at the NCBI Bookshelf.

Lecture 1 slides here.
Lecture 2 slides here.

Week 2: Sept 9-11
Whole-genome shotgun sequencing.  Pairwise sequence alignment. Basic assembly: shortest common superstring, greedy assembly algorithms.  Repeat-induced mis-assemblies.

Reading: (a) Chapter 6, "Sequencing Genomes," in Genomes, by T.A. Brown, free at the NCBI Bookshelf.  (b) Gene Myers' 1999 intro paper on whole-genome sequencing.

Lecture 3 slides here.

Lab 1 instructions here.
Lab 1 sample data here.
Lab 1 input data here.

Week 3: Sept 16-18
The Celera Assembler algorithm.  Genome sequencing technology.  Error correction with AutoEditor.

Reading: (1) Myers, The Fragment Assembly String Graph, Bioinformatics 21 (2005); (2) The phrap assembler documentation, http://www.phrap.org/phredphrap/phrap.html.

Celera assembler slides
Sequencing technology slides, part 1 and part 2
AutoEditor slides

Week 4: Sept 23-25
The Arachne assembler algorithm.  Comparative assembly with AMOScmp.

Lab 1 due Sept 25. 
Lab 2 assignment is here, and the data file is here.


Readings: Myers et al., A Whole-Genome Assembly of Drosophila, Science 287 (2000).
S. Batzoglou et al., ARACHNE: A whole-genome shotgun assembler,  Genome Research 12:1 (2002), 177-189.

Arachne lecture notes
AMOScmp slides
AFG file format slides

Week 5: Sept 30-Oct 2

Trimming with Figaro.  Multiplex PCR for closing gaps.  Oct 2: guest lecture by Adam Phillippy: using MUMmer for assembly alignment and comparison.

Readings:
    A.L. Delcher et al., Alignment of Whole Genomes, Nucleic Acids Research 27:11 (1999), 2369-2376.  Note that Figure 6 is supposed to be in color and was mistakenly printed in black and white.

Readings for class presentations: choose from this list or use Wentian Li's bibliography page for more choices.

Genome closure slides
James White's Figaro slides.
Adam's whole-genome alignment slides  (here they are in older PowerPoint format)

Week 6: Oct 7-9
Oct 7:  guest lecture by Mike Schatz: Assembly debugging with Hawkeye.  Short read sequencing using 454 and Solexa technology.
Reading: Tettelin et al., Optimized Multiplex PCR: Efficiently Closing a Whole-Genome Shotgun Sequencing Project.  Genomics 62 (1999), 500-507.

Mike Schatz's assembly validation slides
Next-gen sequencing technology slides

Week 7: Oct 14-16
Class presentations on selected readings.
Lab 2 due Friday, Oct 17.


Week 8: Oct 21-23
Genome assembly with short reads (conclusion). Introduction to computational gene finding topics.
Lab 3 available here.

Reading: Chapters 1-2 of CGP, "Introduction" and "Mathematical preliminaries".  See the textbook website for slides from Oct 21.

Week 9: Oct 28-30
Bacterial gene finding.  Markov chains.  Case study: the Glimmer gene finder.  Oct 30: Guest lecture by Mihaela Pertea on splice site identification in eukaryotic genes.

Reading:
CGP, "Overview of Computational Gene Prediction," Chapter 3.  Also: S. Salzberg, A. Delcher, S. Kasif, and O. White. Microbial gene identification using interpolated Markov models, Nucleic Acids Research 26:2 (1998), 544-548.

Bacterial gene finding slides, including Glimmer
Mihaela Pertea's slides on signal prediction

Week 10: Nov 4-6
Overlapping genes in bacteria.  Eukaryotic gene finding: introduction to HMMs and the Forward algorithm.

Reading: CGP, "Signal and Content Sensors," chapter 7.
Slides on HMMs, part 1


Week 11: Nov 11-13
Nov 6: Class presentations on selected readings.

Lab 3 due Friday, Nov 14.
Reading: CGP, "Toy Exon Finder," chapter 5.

Week 12: Nov 18-20
HMM algorithms: forward, Viterbi, forward-backward.  Design of HMMs and the ToyScan algorithm.
Slides on HMMs, part 2
Slides on the HOMER HMM gene finder (from B. Majoros)
Details on Lab 4 (ToyScan)

Get Lab 4 here (due Dec 11).
Reading: CGP, "Hidden Markov Models," chapter 6.

Week 13: Nov 25 (Nov 27 is Thanksgiving)
Class presentations (2).  Sequencing ancient DNA: the mammoth genome.

Lecture notes on exon splicing enhancers
Lecture notes on the Combiner algorithm

Reading: CGP, "Generalized HMMs," chapter 8.

Week 14: Dec 2-4
Case study: GlimmerHMM.  Generalized HMM algorithms.   Gene finding in humans: the EGASP and NGASP competitions. Combining multiple gene finders with JIGSAW. 

Lecture on GHMMs

Reading: (1) the JIGSAW paper.  (2) CGP, "Signal and Content Sensors", chapter 7, section 7.3 to end of chapter.

Week 15: Dec 9-11 (last week)
Pair HMMs.  The status of the human genome: assembly and annotation.

Lab 4 due Dec 11.  Take-home exams distributed Dec 11, due Dec 18.

GRADING: The first three labs count for 15% of the grade each, the fourth lab counts for 25%, the class presentation counts for 5%, and the final exam counts for 25%.