CMSC828W Notes, 2/2/2010

The Knuth-Morris-Pratt Exact Matching Algorithm
-----------------------------------------------

For pattern P and text T, let |P| = n and |T| = m.

Consider the case where P partially matches at some point in T, but we
eventually find a mismatched character (denoted by x).

      +--------------------------------------+
    T |                                      |
      +--------------------------------------+
        |||||||||||x
      +--------------+
    P |              |
      +--------------+

What we'd like to know is whether there's another part of P that might
match the region of T that we just inspected, allowing us to shift P
forward and avoid redundant comparisons. Graphically, this means:

      +--------------------------------------+
    T |                                      |
      +--------------------------------------+
        |||||||||||x
      +--------------+
    P |////    ////  |   (regions shaded with '/' are identical)
      +--------------+
          ---->
            ||||
          +--------------+
        P |////    ////  |
          +--------------+

We define sp(i) as the length of this matching region for some index i
in P. More precisely, sp(i) is ``the length of the longest proper
suffix of P[1..i] that matches a prefix of P'' (Gusfield p. 24).

We can use sp(i) to determine how far we can shift P after a mismatch
at i + 1, saving us from having to compare P[1..sp(i)] to T again. The
shift distance is i - sp(i), which moves P(sp(i)) to the position with
respect to T where P(i) was previously; we then restart the comparison
at P(sp(i) + 1).

The proof that this technique does not miss any matches of P in T
(i.e. that the shift cannot skip over an occurrence of P beginning in
the region of T we just examined) is a straightforward contradiction
with the definition of sp(i); a variant of this proof can be found in
the text (p. 25).

--------------------

We can also clearly see that the KMP algorithm runs in O(m) time: in
the case where P and T match at P(i), we advance i by one and compare
the next character; when there is a mismatch, we shift P forward by at
least one, so no character in T is ever compared more than twice.
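To make the mechanics concrete, here is a sketch of the KMP scan in Python (my own 0-indexed transcription, not code from the text). To keep the example self-contained, the sp values are computed naively here; linear-time preprocessing is the subject of the rest of these notes.

```python
def naive_sp(P):
    # sp[i] = length of the longest proper suffix of P[:i+1] that is
    # also a prefix of P.  Computed by brute force here, just for
    # illustration; the efficient methods are discussed below.
    n = len(P)
    sp = [0] * n
    for i in range(n):
        for k in range(i, 0, -1):  # k = candidate suffix length, longest first
            if P[i - k + 1 : i + 1] == P[:k]:
                sp[i] = k
                break
    return sp

def kmp_search(P, T):
    # Return all 0-indexed start positions of P in T.
    sp = naive_sp(P)
    matches = []
    i = 0  # number of characters of P currently matched
    for j, c in enumerate(T):
        while i > 0 and P[i] != c:
            i = sp[i - 1]  # shift P forward by i - sp(i); stay at same T position
        if P[i] == c:
            i += 1
        if i == len(P):
            matches.append(j - len(P) + 1)
            i = sp[i - 1]  # keep scanning for further occurrences
    return matches
```

Note that this version uses the weaker sp values, so after a shift it may recompare the mismatched character of T against the very same character of P; correctness is unaffected, and the sp' values discussed below eliminate those wasted comparisons.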
The runtime of this algorithm is therefore bounded by 2m.

This analysis assumes, however, that the sp values can be computed in
O(n) time. How can this be accomplished? Considering that sp(i)
represents the longest suffix of P[1..i] that is also a prefix of P,
our intuition tells us that calculating sp(i) should have something to
do with Z values (which we learned in the previous lecture can be
computed in O(n) time).

Let's consider how to compute sp(i) given Z(j) for j in 2..n (remember
that Z(1) is undefined). Could it be as simple as
sp(j + Z(j) - 1) := Z(j) (that is, the sp value for the right bound i
of the Z box starting at j is assigned Z(j))? Unfortunately, no: this
is not correct. If we compute sp(j + Z(j) - 1) in this way for j from
2 to n, it is possible that some sp(i) will be given a value that is
*less* than the maximum length of the suffix of P[1..i] that is also a
prefix of P. For example:

    P:   a b a c a b a d
    Z:   X 0 1 0 3 0 1 0
    sp:  0 0 1 0 0 0 1 0

Why is it that sp(7) was assigned 1 instead of 3? It was in fact
assigned 3 at one point, but sp values are assigned from left to right
(i.e. j is being incremented), so sp(i) is left holding the Z value of
the *rightmost* j whose Z box ends at i -- which is the smallest such
Z value.

Perhaps we can instead simply decrement j (from n..2) so that sp(i) is
assigned the largest possible value. Unfortunately not---we've missed
sp values for all i not equal to j + Z(j) - 1 for 2 <= j <= n!

    P:            a b a c a b a d
    Z:            X 0 1 0 3 0 1 0
    sp computed:  0 0 1 0 0 0 3 0
    sp correct:   0 0 1 0 1 2 3 0

(Note that sp(i - 1) >= sp(i) - 1. We can use this fact to compute
sp(i) from Z(j) if we choose.)

However, while this last approach may not provide us with sp(i), it
*does* give us something useful: sp'(i). sp'(i) is defined as the
length of the longest proper suffix of P[1..i] that matches a prefix
of P where P(i + 1) is not equal to P(sp'(i) + 1) (that is, the
character after i is different from the corresponding character in the
prefix of P).
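The two passes above can be sketched in Python (my own 0-indexed illustration; a quadratic z_values is used for brevity, though the Z algorithm from the previous lecture achieves O(n)): decrementing j yields the sp' values, and the fact that sp(i - 1) >= sp(i) - 1 then fills in the remaining sp values.

```python
def z_values(S):
    # Z[j] = length of the longest substring starting at j that matches
    # a prefix of S (Z[0] unused).  Brute-force version for clarity.
    n = len(S)
    Z = [0] * n
    for j in range(1, n):
        while j + Z[j] < n and S[Z[j]] == S[j + Z[j]]:
            Z[j] += 1
    return Z

def sp_from_z(P):
    n = len(P)
    Z = z_values(P)
    spp = [0] * n                    # sp' values
    for j in range(n - 1, 0, -1):    # decreasing j: smaller j (larger Z)
        i = j + Z[j] - 1             # overwrites, so spp[i] keeps the max
        if Z[j] > 0:
            spp[i] = Z[j]            # right end i of the Z box at j
    sp = spp[:]                      # fill the gaps via sp(i-1) >= sp(i) - 1
    for i in range(n - 2, -1, -1):
        sp[i] = max(sp[i], sp[i + 1] - 1)
    return spp, sp
```

Running this on the example pattern reproduces both rows of the table above: spp is the "sp computed" row and sp is the "sp correct" row.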
      +--------------------+
    P |////x        ////y  |
      +--------------------+
          |             |
        sp'(i)          i

Why is this useful? Because P(sp'(i) + 1) is not equal to P(i + 1), we
can use sp'(i) to avoid shifts where the character in T that
previously caused a mismatch is matched against the *same* character
following the prefix of P. The weaker definition of sp(i), on the
other hand, does not provide this guarantee, and so we prefer sp'(i)
over sp(i).

--------------------

This Z-value-based preprocessing, however simple, isn't the approach
originally used for KMP. While more complex, the original KMP
preprocessing method can be applied in cases when we wish to match a
*group* of patterns, and so may be useful in other contexts.

To understand how this method works, let's assume we've already
computed sp(j) for 1 <= j <= i and now want to find sp(i + 1). Given a
character x = P(i + 1), we need to consider two cases: when
x = P(sp(i) + 1), and when x != P(sp(i) + 1).

* If x = P(sp(i) + 1), then sp(i + 1) = sp(i) + 1, as we've simply
  extended the longest proper suffix of P[1..i] that matches a prefix
  of P by one character (note that this is consistent with the fact
  that sp(i - 1) >= sp(i) - 1).

* If x != P(sp(i) + 1), we can use previously computed sp values to
  find sp(i + 1), namely sp(sp(i)). Since the suffix
  P[i - sp(i) + 1..i] matches the prefix P[1..sp(i)], any suffix of
  P[1..sp(i)] also occurs as a suffix ending at i; inspecting the sp
  value at that position (i.e. sp(sp(i))) tells us the longest prefix
  of P that matches a suffix of P[1..sp(i)], which may or may not be
  followed by an occurrence of x. Using this strategy, we can recur
  through applications of sp until we find a prefix of P that matches
  a suffix of P[1..i + 1], or we "bottom out," indicating that no such
  prefix exists. In the latter case, sp(i + 1) = 1 if P(1) = x;
  otherwise sp(i + 1) = 0.

Given that this approach relies on recursion, we might initially
suspect that its runtime would have a worse bound than the Z-based
method.
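The case analysis above translates almost directly into code. Here is a 0-indexed Python sketch of the original KMP preprocessing (my transcription, not Gusfield's pseudocode), with the recursion expressed as a fall-back loop:

```python
def sp_classic(P):
    # sp[i] (0-indexed) = length of the longest proper suffix of
    # P[:i+1] that is also a prefix of P, computed by the original KMP
    # recurrence rather than from Z values.
    n = len(P)
    sp = [0] * n
    for i in range(1, n):
        x = P[i]
        k = sp[i - 1]
        # Fall back through previously computed sp values until the
        # matched prefix can be extended by x, or we bottom out (k == 0).
        while k > 0 and P[k] != x:
            k = sp[k - 1]
        if P[k] == x:       # bottom-out case: sp[i] = 1 iff P[0] == x
            k += 1
        sp[i] = k
    return sp
```

Each iteration can only fall back as many steps as previous iterations moved forward, which is exactly the amortization argument for the O(n) bound discussed next.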
However, a rather subtle property of this method for computing sp(i)
allows us to prove that it is no worse than the Z method. A complete
proof of this bound can be found in the text (p. 50-51); the intuition
behind the proof is that while this method may use a recurrence to
find sp(i), the cost of the recursive check is amortized by the number
of successful prefix/suffix matches. In other words, for any
computation of sp for a given i, we can fall back to previously
computed sp values only as many times as we moved forward
(i.e. matched) in the computation of previous values. As such, the
runtime is bounded by O(n).

Given sp(i), sp'(i) can be easily computed in O(n) time. A description
of this technique and a proof of its correctness can be found in the
text (p. 51).

----------------------------------------
The Boyer-Moore Exact Matching Algorithm
----------------------------------------

The Boyer-Moore algorithm is another linear time exact matching
technique. Despite sharing its performance bound with KMP, Boyer-Moore
possesses some interesting properties that differentiate it from KMP
significantly.

First, Boyer-Moore matches P with T from right to left with respect to
P. The intuition behind this approach is that by matching from the
right, we are able to see what characters we'll want to match "in the
future." The algorithm also employs two rules that allow it to shift P
forward by potentially significant steps on mismatches. These are
known as the "bad character" rule and the "good suffix" rule.

Simply stated, the bad character rule stipulates: given a mismatch at
P(i) with character x in T, shift P forward such that the closest
occurrence of x in P to the left of i aligns with the x in T that
caused the mismatch, and start the match again from the right. This
allows us to skip incremental shifts of P along T that we can be
certain will not produce a complete match.
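The simplest form of this rule only records the rightmost occurrence of each character in P, as in the following 0-indexed Python sketch (my own simplification: it ignores occurrences of x to the right of i, falling back to a shift of 1 in that case):

```python
def bad_char_shift(P):
    # right[x] = 0-indexed position of the rightmost occurrence of x in P.
    right = {}
    for idx, ch in enumerate(P):
        right[ch] = idx

    def shift(i, x):
        # Mismatch at P[i] against character x of T: shift P so that
        # the rightmost occurrence of x lines up with the mismatched x
        # in T.  If x is absent from P (or occurs only to the right of
        # i), shift by at least 1.
        return max(1, i - right.get(x, -1))

    return shift
```

For P = "abcb", a mismatch at position 3 against 'a' in T yields a shift of 3 (aligning P's 'a' at position 0), while a mismatch against a character absent from P shifts P past the mismatch entirely.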
The good suffix rule is slightly more complicated, but it can also be
stated relatively simply: on a mismatch, find the rightmost
occurrence[1] of the suffix of P that has already successfully matched
with T, and shift P forward such that that occurrence aligns with the
already matched portion of T; if no such occurrence exists, shift P
forward so that it moves entirely past its current position. In other
words, we shift to the next best Z box (on suffixes) when possible;
otherwise we start over with a fresh region of T, since we know there
can't be any other possible matches in the region of T we're currently
comparing. Given that description, you've probably already guessed
that the (somewhat complicated) preprocessing required to support this
rule can employ Z value computation (see p. 16-23 of the text).

We take the larger of the shifts derived from these two rules on a
mismatch, moving P as far forward as possible at each opportunity.

[1] That is, the rightmost copy that is not a suffix of P -- if the
copy were a suffix of P, it would be the region of P we just compared!

--------------------

Like KMP, Boyer-Moore runs in O(m) time. The proof of this, however
elegant, is quite complicated. It defines three types of periodicity:

* A string S is periodic if S = BBBBB..B for some period B

* A string S is semiperiodic if S = BBBB..Ba for some prefix a of B

* A string S is prefix semiperiodic if S = aBBBB..B for some suffix a
  of B

and shows that semiperiodic and prefix semiperiodic are equivalent.
The proof also relies on a property of periodic strings:

* If P matches T at positions p1 and p2 where |p2 - p1| < floor(n/2),
  then P is semiperiodic with period length |p2 - p1| (as explained by
  Professor Pop: "if P matches T at two nearby locations, it's
  periodic.").
From these properties, we also find that:

* If P is semiperiodic with periods of length p1 and p2, it is
  semiperiodic with period length GCD(p1, p2) (as explained by
  Professor Pop: "if P is periodic with two different periods, it's
  periodic with their GCD.").

which may be of use to us in the contexts of other algorithms. For the
full proof, consult the text (p. 39-47).

--------------------

Given this added complexity, one might wonder why we bother with
Boyer-Moore (particularly when KMP seems like a reasonable
alternative). While their theoretical performance may be similar,
Boyer-Moore has been demonstrated to match patterns on English text
extremely quickly. Its ability to shift forward by large leaps may
even give it *sublinear* performance in practice (Gusfield p. 23),
which may not be matched by KMP.

------------------------------------------------------------------------
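Returning to the good suffix rule stated earlier, the simplified version of it can be sketched as follows (a 0-indexed O(n^2) illustration of mine; Gusfield's linear-time preprocessing uses Z values instead, and his stronger form of the rule also handles a prefix of P overlapping the matched suffix):

```python
def weak_good_suffix_shift(P):
    # shift[t] = how far to move P when a suffix of length t has
    # matched T but the next comparison (to its left) mismatched:
    # align the rightmost other occurrence of that suffix, or shift P
    # entirely past its current position if there is none.
    n = len(P)
    shift = [0] * n                  # t ranges over 1..n-1
    for t in range(1, n):
        suf = P[n - t:]
        # Rightmost occurrence of suf that is not the suffix itself;
        # the end argument n - 1 forces it to end left of position n - 1.
        k = P.rfind(suf, 0, n - 1)
        shift[t] = (n - t - k) if k != -1 else n
    return shift
```

For P = "abcab", a matched suffix "ab" shifts P by 3 (aligning the "ab" at the front), while a matched suffix "cab" has no other occurrence and shifts P by its full length, 5.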