Reducing storage requirements for biological sequence comparison.

Title	Reducing storage requirements for biological sequence comparison.
Publication Type	Journal Articles
Year of Publication	2004
Authors	Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA
Journal	Bioinformatics
Volume	20
Issue	18
Pagination	3363-9
Date Published	2004 Dec 12
ISSN	1367-4803
Keywords	Algorithms; Databases, Genetic; Information Storage and Retrieval; Numerical Analysis, Computer-Assisted; Sequence Alignment; Sequence Analysis
Abstract	MOTIVATION: Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the 'seed-and-extend' approach, in which occurrences of short subsequences called 'seeds' are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process. RESULTS: We present a simple and elegant method in which only a small fraction of seeds, called 'minimizers', needs to be stored. Using minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds.
DOI	10.1093/bioinformatics/bth408
Alternate Journal	Bioinformatics
PubMed ID	15256412
Grant List	1R01HG0294501 / HG / NHGRI NIH HHS / United States
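
Illustrative sketch (not part of the published record): the abstract says that only a small fraction of seeds, the 'minimizers', needs to be stored, but it does not say how they are chosen. The Python below assumes the commonly used window-based definition: slide a window of w consecutive k-mers along the sequence and keep the lexicographically smallest k-mer in each window. The function name `minimizers`, the parameters `k` and `w`, and the leftmost tie-breaking rule are all assumptions made for illustration, not details taken from the abstract.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Assumed definition: for every window of w consecutive k-mers, keep the
    lexicographically smallest k-mer; ties go to the leftmost occurrence.
    """
    # All k-mers of the sequence, indexed by their start position.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Index of the smallest k-mer inside this window (leftmost on ties).
        offset = min(range(w), key=lambda j: (window[j], j))
        chosen.add((start + offset, window[offset]))
    return sorted(chosen)


if __name__ == "__main__":
    for position, kmer in minimizers("ACGTACGT", k=3, w=2):
        print(position, kmer)
```

Under this assumed definition, "ACGTACGT" with k=3 and w=2 has six k-mers, of which four are kept. Two sequences that share an exact substring of length at least w + k - 1 contain an identical window and therefore select the same k-mer from it, which is why a reduced seed list of this kind can still find most matches.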