Syllabus for CMSC858W: Algorithms for Biosequence Analysis (Spring 2010)



Course information



Lectures

Date

Topics

Slides/Lecture notes

Add'l resources

1/26

Introduction

Lecture 1


1/28

Exact matching
preprocessing with Z algorithm
KMP algorithm

Gusfield scan (TOC, chaps 1,2, 3.3)

Also in CLR book, section 32.4

Lecture notes (from Chris)


2/2

KMP continued

Boyer Moore algorithm

Lecture notes from Mark

Exact matching code from Geet

2/4

Multiple patterns (Aho Corasick)
Dealing with don't care symbols

Suffix trees intro

Lecture notes from Geet


2/9

SNOW DAY



2/11

SNOW DAY



2/16

Suffix trees
Ukkonen's algorithm

Homework 1

Lecture notes from Koyel

E. Ukkonen. (1995). On-line construction of suffix trees. Algorithmica 14(3):249-260.

Gusfield chap 5

Gusfield chap. 6


2/18

Suffix arrays
Burrows-Wheeler Transform
Compressed suffix arrays

Lecture notes from Hyongtae

Manber, Myers (1990). Suffix arrays: a new method for on-line string searches. SODA

http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform

A block-sorting lossless data compression algorithm. Burrows and Wheeler. DEC TR 124, 1994

Ferragina, Manzini. Opportunistic Data Structures with Applications. FOCS 2000.

Karkkainen. Fast BWT in small space by blockwise suffix sorting. Theor. Comp. Sci. 387(3): 249-257 (2007).

Karkkainen, Manzini, Puglisi. Permuted longest-common prefix array. CPM 2009.

Khan, Bloom, Kruglyak, Singh. A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays. Bioinformatics 25(13):1609-1616 (2009)

2/23

Suffix arrays, BWT, continued.

Notes from Raul

Project suggestions

2/25

BWT, continued

Notes from Derrick


3/2

Inexact alignment
Dynamic programming - Smith Waterman, Needleman-Wunsch

Homework 2

Notes from Ted

Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5.

Needleman, Saul B.; and Wunsch, Christian D. (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins". Journal of Molecular Biology 48 (3): 443–53.

3/4

Dynamic programming...cont

Notes from Zheng


3/9

Dynamic programming...cont

Recap

Notes from Emre


3/11

Midterm



3/16

Spring break



3/18

Spring break



3/23

Alignment heuristics
Exclusion methods
Fasta, Blast
Spaced seeds
"Inexact seeds"

Dealing with long strings:chaining algorithms
MUMmer

Note from Wikum

Kun-Mao Chao, William R. Pearson, Webb Miller. Aligning two sequences within a specified diagonal band. Bioinformatics 8(5):481-487

G.M. Landau, U. Vishkin. Introducing efficient parallelism into approximate string matching and a new serial algorithm. STOC 1986.

3/25

Dealing with short strings
reversed indices - Maq, SOAP, others
backtracking on suffix array – Bowtie
inexact alignment on suffix arrays

Notes from Ferhan

Eugene W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica.12(4-5):345-374 (1994)

Bin Ma, John Tromp, Ming Li. PatternHunter: faster and more sensitive homology search. Bioinformatics 18(3):440:445 (2002)

Uri Keich, Ming Li, Bin Ma, John Tromp. On spaced seeds for similarity search. Discrete Appl. Math. 138(3):253-263 (2004)

Brona Brejova, Daniel G. Brown, Tomas Vinar. Vector seeds: an extension to spaced seeds. Journal of Computer and System Sciences. 70(3):364-380 (2005)

Art L. Delcher, Adam M. Phillippy, Jane Carleton, Steven L. Salzberg. Fast Algorithms for Large-scale Genome Alignment and Comparison. Nucleic Acids Research. 30(11):2478-2483.(2002)

Mohammad Ghodsi, Mihai Pop. Inexact local alignment search over suffix arrays Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 83-87 (2009)

Ben Langmead, Cole Trapnell, Mihai Pop, Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25. (2009)

Heng Li, J. Ruan, Richard Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research. 18(11):1851-1858 (2008)

Ruiqiang Li, Yingrui Li, Karsten Kristiansen, Jun Wang. SOAP: short oligonucleotide alignment program Bioinformatics,24 (5):713–714 (2008) doi:10.1093/bioinformatics/btn025

3/30

Inexact alignment...cont'd

Notes from Sam


4/1

Multiple sequence alignment
exact algorithms
approximations - star alignment
tree-guided alignment


Thompson, Higgins, Gibson. ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties, and weight-matrix choice. NAR 22(22):4673-4680. 1994

R.C. Edgar. Muscle: multiple sequence alignment with high accuracy and high throughput. NAR 32(5):1792-1797. 2004.

Notredame, Higgins, Herringa. T-coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302:205-217. 2000.

Passover 3/30-4/6

Good Friday 4/2

4/6

Motif finding

Whole-genome multiple alignment Mauve

RNA structure
Structural alignments
hardness results
Profile HMMs


Darling, Mau, Blattner, Perna. Mauve: Multiple alignment of conserved genomic sequence with rearrangements. Genome Res.14:1394-1403. 2004.

Lawrence, Altshul, Boguski, Liu, Neuwald, Wootton. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignments. Science 262(5131): 208-214. 1993.

Bailey, Noble. Searching for statistically significant regulatory modules. Bioinformatics. 19(s2):ii16 2003

Jiang, Lin, Ma, Zheng. A general edit distance between RNA structures. J. Comp. Biol. 9(2):371-388. 2002.

Eddy. Profile Hidden Markov Models. Bioinformatics. 14(9):755-763. 1998.

Nawrocki, Kolbe, Eddy. Infernal 1.0: inference of RNA alignments Bioinformatics 25(10):1335-1337

4/8

Profile alignments
HMM Alignments


RNA Structure


Multiple alignment notes



Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. NAR 25(17):3389-3402. 1997.

Eddy. Profile Hidden Markov Models. Bioinformatics. 14(9):755-763. 1998.

Nawrocki, Kolbe, Eddy. Infernal 1.0: inference of RNA alignments Bioinformatics 25(10):1335-1337


4/13

Structure-sequence alignment

Sequence clustering – basics
Distance-based methods
Neighbor Joining
Picking the right cut - VI-cut

Homework 3

RNA Structure and alignment notes

Jiang, Lin, Ma, Zheng. A general edit distance between RNA structures. J. Comp. Biol. 9(2):371-388. 2002.

Navlakha, White, Nagarajan, Pop, Kingsford. Finding Biologically Accurate Clusterings in Hierarchical Tree Decompositions using the Variation of Information. RECOMB 2009.

4/15

phylogenetics: parsimony/maximum likelihood
The need for speed: Cd-hit, DNAClust, Uclust

Graph theory in sequence analysis
introduction to sequence assembly
overlap-layout consensus paradigm
Eulerian paradigm

Sequence clustering notes

Saitou, Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees.Mol. Biol. Evol. 4(4):406-425. 1987.

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik Bioinformatics, (2006) 22:1658-9. Open source PDF Pubmed


4/20

Phymm/PhymmBL

Dr. Pop out of town


Brady, Salzberg. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nature Methods. 6:673-676 (2009)

4/22

Parallel algorithms
BlastReduce
Crossbow
others

Dr. Pop out of town


Schatz, M.C., CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 2009. 25(11): p. 1363-9.

Langmead, B., et al., Searching for SNPs with cloud computing. Genome Biol, 2009. 10(11): p. R134.

4/27

Sequence assembly
counting Eulerian tours
NP-hardness results

Genome assembly notes

Counting spanning trees (Thanks Sam!!)

Myers. Towards simplifying and accurately formulating fragment assembly. J.Comp. Biol. 2(2):275. 1995.

Pevzner, Tang, Waterman. An Eulerian path approach to DNA fragment assembly. PNAS 98(17):9748. 2001

Myers. The genome assembly string graph. Bioinformatics 21. 2005.

Medvedev, Georgiou, Myers, Brudno. Computability of models for sequence assembly. WABI 2007.

Nagarajan, Pop. Parametric complexity of sequence assembly. J.Comp.Biol. 16(7):897. 2009

4/29

Graph simplifications
How useful are mate-pairs?


Kingsford, Schatz, Pop. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics. 11:21. 2010.

5/4

Genome assembly validation
Detecting breakpoints and rearrangements

Assembly validation notes

Phillippy, Schatz, Pop. Assembly forensics. Genome Biology 9:R55. 2008.

Choi, Kim, Tang, Andrews, Gilbert, Colbourne. A Machine Learning Approach to Combined Evidence Validation of Genome Assemblies. Bioinformatics 24(6):744-750. 2008

Hormozidari, Alkan, Eichler, Sahinalp. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Research 19:1270-1278. 2009.

5/6

Optical mapping

Optical maps notes

Nagarajan, Read, Pop. Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics 24(10):1229-1235. 2008.

Anantharaman, T., B. Mishra, et al. Genomics via optical mapping III: contiging genomic DNA. ISMB'99.

5/11

TBA

Final project due

PROJECT DELIVERABLES


5/19

Final exam (10:30-12:30)

GOOD LUCK