CMSC828W Notes, 2/2/2010

The Knuth-Morris-Pratt Exact Matching Algorithm
-----------------------------------------------

For pattern P and text T, let |P| = n and |T| = m.

Consider the case where P partially matches at some point in T, but we
eventually find a mismatched character (denoted by x).

      +--------------------------------------+
    T |                                      |
      +--------------------------------------+
        |||||||||||x
      +--------------+
    P |              |
      +--------------+

What we'd like to know is whether there's another part of P that might
match the region of T that we just inspected, allowing us to shift P
forward and avoid redundant comparisons. Graphically, this means:

      +--------------------------------------+
    T |                                      |
      +--------------------------------------+
        |||||||||||x
      +--------------+
    P |////    ////  |   (regions shaded with '/' are identical)
      +--------------+
          ---->
            ||||
          +--------------+
        P |////    ////  |
          +--------------+

We define sp(i) as the length of this matching region for some index i
in P. More precisely, sp(i) is ``the length of the longest proper
suffix of P[1..i] that matches a prefix of P'' (Gusfield p. 24).

We can use sp(i) to determine how far we can shift P after a mismatch
at i + 1, saving us from having to compare P[1..sp(i)] to T again. The
shift distance is i - sp(i), which moves P(sp(i)) to the position with
respect to T where P(i) was previously; we then restart the comparison
at P(sp(i) + 1).

The proof that this technique does not miss any matches of P in T
(i.e. that the shift cannot skip over an occurrence of P beginning in
the region of T we just examined) is a straightforward contradiction
with the definition of sp(i); a variant of this proof can be found in
the text (p. 25).

--------------------

We can also clearly see that the KMP algorithm runs in O(m) time: in
the case where P and T match at P(i), we advance i by one and compare
the next character; when there is a mismatch, we shift P forward by at
least one, so no character in T is ever compared more than twice.
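To make the mechanics concrete, here is a sketch of the KMP scan in Python (my own 0-indexed transcription, not code from the text). To keep the example self-contained, the sp values are computed naively here; linear-time preprocessing is the subject of the rest of these notes.

```python
def naive_sp(P):
    # sp[i] = length of the longest proper suffix of P[:i+1] that is
    # also a prefix of P.  Computed by brute force here, just for
    # illustration; the efficient methods are discussed below.
    n = len(P)
    sp = [0] * n
    for i in range(n):
        for k in range(i, 0, -1):  # k = candidate suffix length, longest first
            if P[i - k + 1 : i + 1] == P[:k]:
                sp[i] = k
                break
    return sp

def kmp_search(P, T):
    # Return all 0-indexed start positions of P in T.
    sp = naive_sp(P)
    matches = []
    i = 0  # number of characters of P currently matched
    for j, c in enumerate(T):
        while i > 0 and P[i] != c:
            i = sp[i - 1]  # shift P forward by i - sp(i); stay at same T position
        if P[i] == c:
            i += 1
        if i == len(P):
            matches.append(j - len(P) + 1)
            i = sp[i - 1]  # keep scanning for further occurrences
    return matches
```

Note that this version uses the weaker sp values, so after a shift it may recompare the mismatched character of T against the very same character of P; correctness is unaffected, and the sp' values discussed below eliminate those wasted comparisons.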
The runtime of this algorithm is therefore bounded by 2m.

This analysis assumes, however, that the sp values can be computed in
O(n) time. How can this be accomplished? Considering that sp(i)
represents the longest suffix of P[1..i] that is also a prefix of P,
our intuition tells us that calculating sp(i) should have something to
do with Z values (which we learned in the previous lecture can be
computed in O(n) time).

Let's consider how to compute sp(i) given Z(j) for j in 2..n (remember
that Z(1) is undefined). Could it be as simple as
sp(j + Z(j) - 1) := Z(j) (that is, the sp value for the right bound i
of the Z box starting at j is assigned Z(j))? Unfortunately, no: this
is not correct. If we compute sp(j + Z(j) - 1) in this way for j from
2 to n, it is possible that some sp(i) will be given a value that is
*less* than the maximum length of the suffix of P[1..i] that is also a
prefix of P. For example:

    P:   a b a c a b a d
    Z:   X 0 1 0 3 0 1 0
    sp:  0 0 1 0 0 0 1 0

Why is it that sp(7) was assigned 1 instead of 3? It was in fact
assigned 3 at one point, but sp values are assigned from left to right
(i.e. j is being incremented), so sp(i) is left holding the Z value of
the *rightmost* j whose Z box ends at i -- which is the smallest such
Z value.

Perhaps we can instead simply decrement j (from n..2) so that sp(i) is
assigned the largest possible value. Unfortunately not---we've missed
sp values for all i not equal to j + Z(j) - 1 for 2 <= j <= n!

    P:            a b a c a b a d
    Z:            X 0 1 0 3 0 1 0
    sp computed:  0 0 1 0 0 0 3 0
    sp correct:   0 0 1 0 1 2 3 0

(Note that sp(i - 1) >= sp(i) - 1. We can use this fact to compute
sp(i) from Z(j) if we choose.)

However, while this last approach may not provide us with sp(i), it
*does* give us something useful: sp'(i). sp'(i) is defined as the
length of the longest proper suffix of P[1..i] that matches a prefix
of P where P(i + 1) is not equal to P(sp'(i) + 1) (that is, the
character after i is different from the corresponding character in the
prefix of P).
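The two passes above can be sketched in Python (my own 0-indexed illustration; a quadratic z_values is used for brevity, though the Z algorithm from the previous lecture achieves O(n)): decrementing j yields the sp' values, and the fact that sp(i - 1) >= sp(i) - 1 then fills in the remaining sp values.

```python
def z_values(S):
    # Z[j] = length of the longest substring starting at j that matches
    # a prefix of S (Z[0] unused).  Brute-force version for clarity.
    n = len(S)
    Z = [0] * n
    for j in range(1, n):
        while j + Z[j] < n and S[Z[j]] == S[j + Z[j]]:
            Z[j] += 1
    return Z

def sp_from_z(P):
    n = len(P)
    Z = z_values(P)
    spp = [0] * n                    # sp' values
    for j in range(n - 1, 0, -1):    # decreasing j: smaller j (larger Z)
        i = j + Z[j] - 1             # overwrites, so spp[i] keeps the max
        if Z[j] > 0:
            spp[i] = Z[j]            # right end i of the Z box at j
    sp = spp[:]                      # fill the gaps via sp(i-1) >= sp(i) - 1
    for i in range(n - 2, -1, -1):
        sp[i] = max(sp[i], sp[i + 1] - 1)
    return spp, sp
```

Running this on the example pattern reproduces both rows of the table above: spp is the "sp computed" row and sp is the "sp correct" row.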
      +--------------------+
    P |////x        ////y  |
      +--------------------+
          |             |
        sp'(i)          i

Why is this useful? Because P(sp'(i) + 1) is not equal to P(i + 1), we
can use sp'(i) to avoid shifts where the character in T that
previously caused a mismatch is matched against the *same* character
following the prefix of P. The weaker definition of sp(i), on the
other hand, does not provide this guarantee, and so we prefer sp'(i)
over sp(i).

--------------------

This Z-value-based preprocessing, however simple, isn't the approach
originally used for KMP. While more complex, the original KMP
preprocessing method can be applied in cases when we wish to match a
*group* of patterns, and so may be useful in other contexts.

To understand how this method works, let's assume we've already
computed sp(j) for 1 <= j <= i and now want to find sp(i + 1). Given a
character x = P(i + 1), we need to consider two cases: when
x = P(sp(i) + 1), and when x != P(sp(i) + 1).

* If x = P(sp(i) + 1), then sp(i + 1) = sp(i) + 1, as we've simply
  extended the longest proper suffix of P[1..i] that matches a prefix
  of P by one character (note that this is consistent with the fact
  that sp(i - 1) >= sp(i) - 1).

* If x != P(sp(i) + 1), we can use previously computed sp values to
  find sp(i + 1), namely sp(sp(i)). Since the suffix
  P[i - sp(i) + 1..i] matches the prefix P[1..sp(i)], any suffix of
  P[1..sp(i)] also occurs as a suffix ending at i; inspecting the sp
  value at that position (i.e. sp(sp(i))) tells us the longest prefix
  of P that matches a suffix of P[1..sp(i)], which may or may not be
  followed by an occurrence of x. Using this strategy, we can recur
  through applications of sp until we find a prefix of P that matches
  a suffix of P[1..i + 1], or we "bottom out," indicating that no such
  prefix exists. In the latter case, sp(i + 1) = 1 if P(1) = x;
  otherwise sp(i + 1) = 0.

Given that this approach relies on recursion, we might initially
suspect that its runtime would have a worse bound than the Z-based
method.
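The case analysis above translates almost directly into code. Here is a 0-indexed Python sketch of the original KMP preprocessing (my transcription, not Gusfield's pseudocode), with the recursion expressed as a fall-back loop:

```python
def sp_classic(P):
    # sp[i] (0-indexed) = length of the longest proper suffix of
    # P[:i+1] that is also a prefix of P, computed by the original KMP
    # recurrence rather than from Z values.
    n = len(P)
    sp = [0] * n
    for i in range(1, n):
        x = P[i]
        k = sp[i - 1]
        # Fall back through previously computed sp values until the
        # matched prefix can be extended by x, or we bottom out (k == 0).
        while k > 0 and P[k] != x:
            k = sp[k - 1]
        if P[k] == x:       # bottom-out case: sp[i] = 1 iff P[0] == x
            k += 1
        sp[i] = k
    return sp
```

Each iteration can only fall back as many steps as previous iterations moved forward, which is exactly the amortization argument for the O(n) bound discussed next.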
However, a rather subtle property of this method for computing sp(i)
allows us to prove that it is no worse than the Z method. A complete
proof of this bound can be found in the text (p. 50-51); the intuition
behind the proof is that while this method may use a recurrence to
find sp(i), the cost of the recursive check is amortized by the number
of successful prefix/suffix matches. In other words, for any
computation of sp for a given i, we can fall back to previously
computed sp values only as many times as we moved forward
(i.e. matched) in the computation of previous values. As such, the
runtime is bounded by O(n).

Given sp(i), sp'(i) can be easily computed in O(n) time. A description
of this technique and a proof of its correctness can be found in the
text (p. 51).

----------------------------------------
The Boyer-Moore Exact Matching Algorithm
----------------------------------------

The Boyer-Moore algorithm is another linear time exact matching
technique. Despite sharing its performance bound with KMP, Boyer-Moore
possesses some interesting properties that differentiate it from KMP
significantly.

First, Boyer-Moore matches P with T from right to left with respect to
P. The intuition behind this approach is that by matching from the
right, we are able to see what characters we'll want to match "in the
future." The algorithm also employs two rules that allow it to shift P
forward by potentially significant steps on mismatches. These are
known as the "bad character" rule and the "good suffix" rule.

Simply stated, the bad character rule stipulates: given a mismatch at
P(i) with character x in T, shift P forward such that the closest
occurrence of x in P to the left of i aligns with the x in T that
caused the mismatch, and start the match again from the right. This
allows us to skip incremental shifts of P along T that we can be
certain will not produce a complete match.
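The simplest form of this rule only records the rightmost occurrence of each character in P, as in the following 0-indexed Python sketch (my own simplification: it ignores occurrences of x to the right of i, falling back to a shift of 1 in that case):

```python
def bad_char_shift(P):
    # right[x] = 0-indexed position of the rightmost occurrence of x in P.
    right = {}
    for idx, ch in enumerate(P):
        right[ch] = idx

    def shift(i, x):
        # Mismatch at P[i] against character x of T: shift P so that
        # the rightmost occurrence of x lines up with the mismatched x
        # in T.  If x is absent from P (or occurs only to the right of
        # i), shift by at least 1.
        return max(1, i - right.get(x, -1))

    return shift
```

For P = "abcb", a mismatch at position 3 against 'a' in T yields a shift of 3 (aligning P's 'a' at position 0), while a mismatch against a character absent from P shifts P past the mismatch entirely.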
The good suffix rule is slightly more complicated, but it can also be
stated relatively simply: on a mismatch, find the rightmost
occurrence[1] of the suffix of P that has already successfully matched
with T, and shift P forward such that that occurrence aligns with the
already matched portion of T; if no such occurrence exists, shift P
forward so that it moves entirely past its current position. In other
words, we shift to the next best Z box (on suffixes) when possible;
otherwise we start over with a fresh region of T, since we know there
can't be any other possible matches in the region of T we're currently
comparing. Given that description, you've probably already guessed
that the (somewhat complicated) preprocessing required to support this
rule can employ Z value computation (see p. 16-23 of the text).

We take the larger of the shifts derived from these two rules on a
mismatch, moving P as far forward as possible at each opportunity.

[1] That is, the rightmost copy that is not a suffix of P -- if the
copy were a suffix of P, it would be the region of P we just compared!

--------------------

Like KMP, Boyer-Moore runs in O(m) time. The proof of this, however
elegant, is quite complicated. It defines three types of periodicity:

* A string S is periodic if S = BBBBB..B for some period B

* A string S is semiperiodic if S = BBBB..Ba for some prefix a of B

* A string S is prefix semiperiodic if S = aBBBB..B for some suffix a
  of B

and shows that semiperiodic and prefix semiperiodic are equivalent.
The proof also relies on a property of periodic strings:

* If P matches T at positions p1 and p2 where |p2 - p1| < floor(n/2),
  then P is semiperiodic with period length |p2 - p1| (as explained by
  Professor Pop: "if P matches T at two nearby locations, it's
  periodic.").
From these properties, we also find that:

* If P is semiperiodic with periods of length p1 and p2, it is
  semiperiodic with period length GCD(p1, p2) (as explained by
  Professor Pop: "if P is periodic with two different periods, it's
  periodic with their GCD.").

which may be of use to us in the contexts of other algorithms. For the
full proof, consult the text (p. 39-47).

--------------------

Given this added complexity, one might wonder why we bother with
Boyer-Moore (particularly when KMP seems like a reasonable
alternative). While their theoretical performance may be similar,
Boyer-Moore has been demonstrated to match patterns on English text
extremely quickly. Its ability to shift forward by large leaps may
even give it *sublinear* performance in practice (Gusfield p. 23),
which may not be matched by KMP.

------------------------------------------------------------------------
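Returning to the good suffix rule stated earlier, the simplified version of it can be sketched as follows (a 0-indexed O(n^2) illustration of mine; Gusfield's linear-time preprocessing uses Z values instead, and his stronger form of the rule also handles a prefix of P overlapping the matched suffix):

```python
def weak_good_suffix_shift(P):
    # shift[t] = how far to move P when a suffix of length t has
    # matched T but the next comparison (to its left) mismatched:
    # align the rightmost other occurrence of that suffix, or shift P
    # entirely past its current position if there is none.
    n = len(P)
    shift = [0] * n                  # t ranges over 1..n-1
    for t in range(1, n):
        suf = P[n - t:]
        # Rightmost occurrence of suf that is not the suffix itself;
        # the end argument n - 1 forces it to end left of position n - 1.
        k = P.rfind(suf, 0, n - 1)
        shift[t] = (n - t - k) if k != -1 else n
    return shift
```

For P = "abcab", a matched suffix "ab" shifts P by 3 (aligning the "ab" at the front), while a matched suffix "cab" has no other occurrence and shifts P by its full length, 5.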