February 25, 2009 Burrows-Wheeler transform, cont. To obtain the Burrows-Wheeler transform (BWT) of a string, we do the following: 1. Append a '$' character (where '$' is a character not in the string and is treated as being lexicographically less than all other characters) 2. Create an array containing all rotations of this string 3. Sort the array 4. Record the string contained in the last column; this is the BWT. For example, using the string "BANANA": BANANA$ $BANANA ANANA$B A$BANAN NANA$BA ANA$BAN BANANA$ => ANA$BAN => ANANA$B => ANNB$AA NA$BANA BANANA$ A$BANAN NA$BANA $BANANA NANA$BA Notice that removing all the characters after '$' in each rotation from the sorted array yields a suffix array: $------ A$----- ANA$--- ANANA$- BANANA$ NA$---- NANA$-- To invert the BWT, we want to take advantage of the L-F mapping; notice that the second A in the last column is the same A as the second A in the first column: $BANANA A$BANAN |=> ANA$BAN | ANANA$B | BANANA$ | NA$BANA <=| | NANA$BA | |=============| In other words, the A located at the end of row 6 is followed by "NA$", just as the A at the beginning of row 3 is followed by "NA$". We use this fact to invert the BWT. We know that the first character in the BWT will be the last character in our original string, since the '$' will always sort to the first row. We can also reconstruct the first column of the sorted array by sorting the characters in the BWT: $ ... A (the interior characters of each row aren't important) A ... N A ... N A ... B B ... $ N ... A N ... A From the '$' in the first column, we can construct the string moving backwards, by recording a character from the first column, and then advancing to the character at the end of the row we're currently at. For this string, we record the '$', then advance to the first 'A' in the left column (because the 'A' in at the end of the row starting with '$' is the first 'A' in the final column). We then record that 'A', and advance to the first 'N' in the left column (row 6). We then proceed to the second 'A' (row 3), the second 'N' (row 7), the third 'A' (row 4), the first 'B' (row 5), and then stop as we encounter the '$' at the end of row 5. Reversing the recorded string ("ANANAB"), we obtain the original string. This procedure requires us to know that the 'N' (or other character) we encounter at a given step is the first/second one in the final column. A simple way to do this is to store the number of times each character occurs previous to (and including) the current row: A B N A 1 0 0 N 1 0 1 N 1 0 2 B 1 1 2 $ 1 1 2 A 2 1 2 A 3 1 2 This allows for constant time accesses of this information, and so will result in an O(n) running time. However, this will require O(n |E| log(n)) storage for these offsets. If we instead only store the counts for certain rows, we can save on storage: A B N A N 1 0 1 N B 1 1 2 $ A 2 1 2 A 3 1 2 Each row with a count now represents the end of a "bucket" - and we can decide how big each bucket should be (bucket size = b). Using these buckets means that within a bucket, we'll need spend O(b) time to find which occurrence of a character we've found, and so this algorithm will take in total O(bn) time, but storage will be O(n |E| log(n)/b). Selecting b = log(n) means that time becomes O(n log(n)), and memory usage is now O(n |E|). We can also use the BWT to perform searches against a text. Let T = "BANANA" and P = "DANA". We begin searching at the end of the string P, looking for an 'A' in the BWT of T: $ ... A ==> A ... N <== A ... N ==> A ... B <== B ... $ N ... A N ... A Here, the arrows denote possible matches; we begin looking at the range of rows that start with 'A'. Within the current range, there are two rows that have the next character we examine ('N') in the last column; these correspond to the two rows starting with 'N': $ ... A A ... N A ... N A ... B B ... $ ==> N ... A <== ==> N ... A <== Now, our range has two rows that end with our next character, 'A'; these are the second and third 'A's, corresponding to the second and third 'A's in the first column: $ ... A A ... N ==> A ... N <== ==> A ... B <== B ... $ N ... A N ... A Now, within the range, there are no rows that end with the next character we're searching for ('D'), so the search fails. Had our pattern been "ANA", we would stop here, and note that the pattern occurs twice in the text, as the range has two rows. The BWT allows us to only store O(n) items, while increasing the running times for many algorithms. We can lower the running times by adding auxilary data structures, like the buckets detailed above, but these of course add to the storage requirements. BWT also aids in compressibility, as the resultant strings are more likely to have runs of particular characters, which helps run length encoding. Some compression algorithms use this in combination with a move-to-front scheme, where each character is encoded as the distance between it and the previous occurrence of the character in the string; with a BWT string, this tends to result in lots of 1s and 0s (meaning alternating characters or repeated characters). The following example was given; the left column shows what number is recorded, the second column shows the character being examined, and the remaining four show the mapping of characters to distances after the recording of the number; the most recent character is moved to the front of this mapping, hence the name of the scheme. Note that the distances don't correspond directly to distance in indices, but rather how many different characters have been seen since the last occurrence of the given character. The mapping begins with a mapping of {'$', 'A', 'B', 'N'} to {0, 1, 2, 3}, respectively. 0 1 2 3 $ A B N 1 A A $ B N 3 N N A $ B 0 N N A $ B 3 B B N A $ 3 $ $ B N A 3 A A $ B N 0 A A $ B N This results in an encoding of "1303330" for the BWT string example.