Transition Spaced Seeds for cDNA-to-Genome Alignment

Spaced seeds are a recent innovation in sequence comparison that can increase, sometimes dramatically, the sensitivity of alignment programs. Unlike contiguous seeds (e.g., the BLAST seed 11111111111), spaced seeds allow wildcard positions in the pattern, marked with 0, at which the two sequences need not match. For instance, the seed s=10010101010011 has wildcards at positions 2, 3, 5, 7, 9, 11 and 12. The length of the seed pattern is called its span (here S(s)=14), and the number of 1 positions is called its weight (here W(s)=7).
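As an illustration of these definitions, the following Python sketch computes the span, weight and wildcard positions of a seed, and tests whether a seed hits at a given offset of an ungapped alignment. The function names are hypothetical and chosen only for this example.

```python
def span(seed):
    """Length of the seed pattern."""
    return len(seed)


def weight(seed):
    """Number of positions marked 1, i.e. positions that must match."""
    return seed.count("1")


def wildcard_positions(seed):
    """1-based positions marked 0, where a mismatch is tolerated."""
    return [i for i, symbol in enumerate(seed, start=1) if symbol == "0"]


def seed_hits(seed, a, b, offset=0):
    """Return True if sequences a and b agree at every 1 position of the
    seed when the seed is placed at `offset` of their ungapped alignment."""
    if offset + len(seed) > min(len(a), len(b)):
        return False  # the seed does not fit at this offset
    return all(
        symbol != "1" or a[offset + i] == b[offset + i]
        for i, symbol in enumerate(seed)
    )


s = "10010101010011"
print(span(s), weight(s), wildcard_positions(s))
```

For the example seed, this reproduces the span of 14, the weight of 7 and the seven wildcard positions listed above.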

Our goal is to design good seeds for accurately aligning cDNA sequences of one species to the genome of a close relative. Such seeds must balance sensitivity and specificity.

To evaluate sensitivity, we use an extended model that takes into account the compositional structure of coding sequences as well as the transition-transversion bias between biological sequences. We use an additional seed symbol, x, to specify transitions, and several 3-periodic inhomogeneous Markov models of alignments to distinguish transitions from transversions and to differentiate among the three codon positions. We further propose that the specificity of a seed is inversely proportional to the expected number of matches of the seed in the genome sequence, and we calculate this number under several Bernoulli and Markov models of the mRNA and genomic sequences.
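To make the specificity measure concrete, here is a minimal Python sketch of the simplest case only: a Bernoulli model with independent, identically distributed bases and a single strand. It assumes that an x position accepts either a match or a transition (A<->G, C<->T); that reading, the function names and the default uniform base frequencies are assumptions made for this example, and the Markov models mentioned above are not covered.

```python
from math import prod

# Assumption: transitions are purine<->purine and pyrimidine<->pyrimidine.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def position_ok(symbol, a, b):
    """Check one aligned base pair against one seed symbol (0, 1 or x)."""
    if symbol == "1":
        return a == b
    if symbol == "x":  # assumed: match or transition is accepted
        return a == b or (a, b) in TRANSITIONS
    return True  # symbol 0: wildcard


def transition_seed_hits(seed, a, b, offset=0):
    """True if the (0,1,x) seed hits at `offset` of the ungapped alignment."""
    if offset + len(seed) > min(len(a), len(b)):
        return False
    return all(
        position_ok(symbol, a[offset + i], b[offset + i])
        for i, symbol in enumerate(seed)
    )


def hit_probability(seed, base_freqs=None):
    """Probability of a chance hit at one fixed pair of positions in two
    independent random sequences with i.i.d. bases (Bernoulli model)."""
    freqs = base_freqs or {base: 0.25 for base in "ACGT"}
    p_match = sum(f * f for f in freqs.values())
    p_transition = 2 * (freqs["A"] * freqs["G"] + freqs["C"] * freqs["T"])
    per_symbol = {"1": p_match, "x": p_match + p_transition, "0": 1.0}
    return prod(per_symbol[symbol] for symbol in seed)


def expected_random_hits(seed, cdna_len, genome_len, base_freqs=None):
    """Expected number of chance hits over all placements of the seed."""
    s = len(seed)
    placements = max(cdna_len - s + 1, 0) * max(genome_len - s + 1, 0)
    return placements * hit_probability(seed, base_freqs)


# Hypothetical lengths: a 2 kb cDNA against a 1 Mb genomic region.
print(expected_random_hits("11x1011x11", 2_000, 1_000_000))
```

Under uniform base frequencies each 1 position contributes a factor of 1/4 and each x position a factor of 1/2, so in this simplified model two x positions reduce the chance-hit rate as much as a single 1 position. A lower expected count corresponds to a more specific seed.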

We evaluate seed sensitivity and specificity for different species and for a large number of weights and (0,1,x) weight combinations, and propose strategies for selecting good seeds for species at various evolutionary distances. In addition, as the number and type of comparisons grow quadratically with the sequencing of new species, we seek to determine and characterize seeds that perform well for a wide range of comparisons, which we call universal seeds. For a description of the methods, observations and a collection of good seeds, see our publications and the supplemental material on this page.

Publications and Presentations:

[1] L. Zhou, I. Mihai, L. Florea (2010) "Spaced seeds for cross-species cDNA-to-genome sequence alignment", Communications in Information and Systems, 10(2):115-136.

[2] L. Zhou, M. Pertea, L. Florea (2009) "Detecting coding regions in sequence alignments with spaced seeds", Workshop on Algorithms in Bioinformatics - WABI 2009, Philadelphia, PA (poster).

[3] L. Zhou, L. Florea (2008) "Sensitive and specific cross-species cDNA-to-genome alignment with spaced seeds", 6th Annual Rocky Mountain Bioinformatics Conference, Aspen, CO.

[4] L. Zhou, I. Mihai, L. Florea (2008) "Effective cluster-based seed design for cross-species sequence comparison", Bioinformatics 24(24), 2926-7. [Medline] [Supplemental Material and Code]

[5] L. Zhou, A. Delcher, M. Pertea, L. Florea (2008) "Universal spaced seeds: Improving the accuracy of cross-species cDNA-to-genome alignment", 16th Annual International Conference on Intelligent Systems for Molecular Biology - ISMB 2008, Toronto, Canada.

[6] L. Zhou, J. Stanton, L. Florea (2008) "Universal seeds for cDNA-to-genome comparison", BMC Bioinformatics, 9:36. [Supplemental Material] [Medline]

[7] L. Zhou, A. Delcher, M. Pertea, L. Florea (2007) "Universal spaced seeds: Improving the accuracy of cross-species cDNA-to-genome alignment", Cold Spring Harbor Meeting - Genome Informatics, Abstracts, 122.

[8] L. Zhou, L. Florea (2007) "Good spaced seeds for cross-species mRNA-to-genome alignment", The 7th IEEE International Conference on BioInformatics and BioEngineering - BIBE 2007, Harvard University Medical School, Boston, MA.

[9] L. Zhou, L. Florea (2007) "Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment", J. Comput. Biol. 14(2), 113-130. [Supplemental Material] [Full text] [More]

[10] L. Zhou, L. Florea (2007) "Good spaced seeds for cDNA-to-genome alignments", GWU School of Engineering and Applied Sciences R & D Showcase, Washington DC.

[11] L. Zhou, L. Florea (2006) "Designing spaced seeds for accurate cross-species cDNA-to-genome alignment with varying evolutionary distances", SMBE 2006 - Genomes, Evolution and Bioinformatics, Arizona State University, Tempe AZ.

Last updated Sep 17th 2009