February 25, 2009

Burrows-Wheeler transform, cont.

To obtain the Burrows-Wheeler transform (BWT) of a string, we do the
following:
  1. Append a '$' character (where '$' is a character not in the string
     and is treated as being lexicographically less than all other characters)
  2. Create an array containing all rotations of this string
  3. Sort the array
  4. Record the string contained in the last column; this is the BWT.

For example, using the string "BANANA":

           BANANA$    $BANANA
           ANANA$B    A$BANAN
           NANA$BA    ANA$BAN
BANANA$ => ANA$BAN => ANANA$B => ANNB$AA
           NA$BANA    BANANA$
           A$BANAN    NA$BANA
           $BANANA    NANA$BA

Notice that removing all the characters after '$' in each rotation from the
sorted array yields a suffix array:

$------
A$-----
ANA$---
ANANA$-
BANANA$
NA$----
NANA$--

To invert the BWT, we want to take advantage of the L-F mapping; notice that
the second A in the last column is the same A as the second A in the first
column:

    $BANANA
    A$BANAN
|=> ANA$BAN
|   ANANA$B
|   BANANA$
|   NA$BANA <=|
|   NANA$BA   |
|=============|

In other words, the A located at the end of row 6 is followed by "NA$", just
as the A at the beginning of row 3 is followed by "NA$".  We use this fact to
invert the BWT.

We know that the first character in the BWT will be the last character in
our original string, since the '$' will always sort to the first row.  We
can also reconstruct the first column of the sorted array by sorting the
characters in the BWT:

$ ... A     (the interior characters of each row aren't important)
A ... N
A ... N
A ... B
B ... $
N ... A
N ... A

From the '$' in the first column, we can construct the string moving
backwards, by recording a character from the first column, and then
advancing to the character at the end of the row we're currently at.
For this string, we record the '$', then advance to the first 'A' in
the left column (because the 'A' in at the end of the row starting with
'$' is the first 'A' in the final column).  We then record that 'A', and
advance to the first 'N' in the left column (row 6).  We then proceed to
the second 'A' (row 3), the second 'N' (row 7), the third 'A' (row 4),
the first 'B' (row 5), and then stop as we encounter the '$' at the end
of row 5.  Reversing the recorded string ("ANANAB"), we obtain the
original string.

This procedure requires us to know that the 'N' (or other character)
we encounter at a given step is the first/second one in the final column.
A simple way to do this is to store the number of times each character
occurs previous to (and including) the current row:

  A B N
A 1 0 0
N 1 0 1
N 1 0 2
B 1 1 2
$ 1 1 2
A 2 1 2
A 3 1 2

This allows for constant time accesses of this information, and so
will result in an O(n) running time.  However, this will require
O(n |E| log(n)) storage for these offsets.  If we instead only
store the counts for certain rows, we can save on storage:

  A B N
A 
N 1 0 1
N 
B 1 1 2
$ 
A 2 1 2
A 3 1 2

Each row with a count now represents the end of a "bucket" - and we can
decide how big each bucket should be (bucket size = b).  Using these
buckets means that within a bucket, we'll need spend O(b) time to find
which occurrence of a character we've found, and so this algorithm will
take in total O(bn) time, but storage will be O(n |E| log(n)/b).  Selecting
b = log(n) means that time becomes O(n log(n)), and memory usage is now
O(n |E|).

We can also use the BWT to perform searches against a text.  Let
T = "BANANA" and P = "DANA".  We begin searching at the end of the
string P, looking for an 'A' in the BWT of T:

       $ ... A
   ==> A ... N <==
       A ... N
   ==> A ... B <==
       B ... $
       N ... A
       N ... A

Here, the arrows denote possible matches; we begin looking at the
range of rows that start with 'A'.  Within the current range, there
are two rows that have the next character we examine ('N') in the
last column; these correspond to the two rows starting with 'N':

       $ ... A
       A ... N
       A ... N
       A ... B
       B ... $
   ==> N ... A <==
   ==> N ... A <==

Now, our range has two rows that end with our next character, 'A';
these are the second and third 'A's, corresponding to the second
and third 'A's in the first column:

       $ ... A
       A ... N
   ==> A ... N <==
   ==> A ... B <==
       B ... $
       N ... A
       N ... A

Now, within the range, there are no rows that end with the next
character we're searching for ('D'), so the search fails.  Had our
pattern been "ANA", we would stop here, and note that the pattern
occurs twice in the text, as the range has two rows.


The BWT allows us to only store O(n) items, while increasing the
running times for many algorithms.  We can lower the running times
by adding auxilary data structures, like the buckets detailed above,
but these of course add to the storage requirements.

BWT also aids in compressibility, as the resultant strings are more
likely to have runs of particular characters, which helps run length
encoding.  Some compression algorithms use this in combination with
a move-to-front scheme, where each character is encoded as the distance
between it and the previous occurrence of the character in the string;
with a BWT string, this tends to result in lots of 1s and 0s (meaning
alternating characters or repeated characters).  

The following example was given; the left column shows what number
is recorded, the second column shows the character being examined,
and the remaining four show the mapping of characters to distances
after the recording of the number; the most recent character is moved
to the front of this mapping, hence the name of the scheme.
Note that the distances don't correspond directly to distance in indices,
but rather how many different characters have been seen since the last
occurrence of the given character.  The mapping begins with a
mapping of {'$', 'A', 'B', 'N'} to {0, 1, 2, 3}, respectively.

     0 1 2 3
     $ A B N
1 A  A $ B N
3 N  N A $ B
0 N  N A $ B
3 B  B N A $
3 $  $ B N A
3 A  A $ B N
0 A  A $ B N

This results in an encoding of "1303330" for the BWT string example.