Rabin-Karp Algorithm

The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find patterns in strings. A string is an abstract data type that consists of a sequence of characters. Letters, words, sentences, and more can be represented as strings.

String matching is a very important application of computer science. If you’ve ever searched through a document for a particular word, then you have benefitted from string-matching technology. String matching can also be used to detect plagiarism by comparing strings in document \(A\) with strings in document \(B\).

Here are examples of strings in Python:

string = 'abcdefghijklmnopqrstuvwxyz'
sport = 'basketball'
food = 'pizza'

Naive String Matching

Say there is a word of length \(m\) and a document of length \(n\). Both the word and the document are represented as strings. The naive way to search the document for the word would be to start at the beginning of the document and check every \(m\) sequential letters to see if it matches up with the letters in the word. If the \(m\) letter slice of the document matches with the \(m\) letters in the word, then a match has been found. This is a brute-force method of string checking, and after it has run, it will certainly output the correct answer (whether the word is in the document or not). However, this algorithm takes \(O(nm)\) time, in its worst case, to complete since for nearly every position in the document, the algorithm performs \(m\) comparisons to check if there is a match. If \(n\) is very large and \(m = O(n)\), then the algorithm takes \(O\big(n^2\big)\) time to complete. This is very slow.

Say you want to search a DNA sequence for a particular sequence of three bases, ATA. Below is an animation depicting this search using the naive string-matching algorithm.

Here is a Python implementation of the brute-force string-searching algorithm described above:

def brute_force_match(word, text):
    m = len(word)
    n = len(text)

    counter = 0

    for i in range(0,n-m+1):
        found = True
        for j in range(0, m):
            if word[j] != text[i+j]:
                found = False
                break
        if found: 
            counter += 1

    if counter == 0:
        print "No match"
    else:
        print "The word appears: ",counter, "times!"

#brute_force_match("ata","ata")
#brute_force_match("ata","atata")
#brute_force_match("ata","aaaaaaa")
#brute_force_match("ata","")
#brute_force_match("ata","aagcgagcgatatatat")

Imagine searching a string of a billion “ACC” segments followed by a single “ATA” segment. To search this string for “ATA” would result in the worst-case running time of \(O(nm)\).

One way to make this brute-force algorithm a bit smarter and more efficient is to reduce the number of needless comparisons it makes. For example, if the string we are looking for is “GAC” and the segment we are comparing it to is “TCG,” we know after the first comparison between the G and the T that this will not be a match. Therefore, we can move on to the next section of text and not waste time comparing the remaining two letters of the sequences since we know for certain they won’t match.

If we were to implement the optimization described above, what would be the best-case running time of the algorithm when checking for a word of length \(n\) in a text of length \(m\) if the algorithm finds that the word does not appear in the text?

First, let’s identify the best-case situation—the one with the fewest comparisons. We would make the fewest comparisons if the first letter of the sequence of text never matches with the first letter of the word. In this case, we would perform \(O(n)\) comparisons, so the best-case running time is \(O(n)\). \(_\square\)

The Rabin-Karp Algorithm

The Rabin-Karp algorithm makes use of hash functions and the rolling hash technique. A hash function is essentially a function that maps one thing to a value. In particular, hashing can map data of arbitrary size to a value of fixed size.^[1]

Hashing

Here is a crash course on hashing:

Hashing is a way to associate values using a hash function to map an input to an output. The purpose of a hash is to take a large piece of data and be able to be represented by a smaller form. Hashing is incredibly useful and but does have some downfalls, namely, collisions. The pigeonhole principle tells us that we can't put \(x\) balls into \(x - 1\) buckets without having at least one bucket with two balls in it. With hashing, we can think of the inputs to a function as the balls and the buckets as the outputs of the function—some inputs must inevitably share an output value, and this is called a collision.

A rolling hash allows an algorithm to calculate a hash value without having to rehash the entire string. For example, when searching for a word in a text, as the algorithm shifts one letter to the right (as in the animation for brute force), instead of having to calculate the hash of the section of text[1:3] as it shifts from text[0:2], the algorithm can use a rolling hash to do an operation on the hash to get the new hash from the old hash.

Say you have a hash function that maps each letter in the alphabet to a unique prime number, \([\text{prime}_a \ldots \text{prime}_z]\). You are searching a text \(T\) of length \(n\) for a word \(W\) of length \(m\). Let’s say that the text is the following string “abcdefghijklmnopqrstuvwxyz” and the word is “fgh”. How can we use a rolling hash to search for “fgh”?

At each step, we compare three letters of the text to the three letters of the word. At the first step, we can multiply \(\text{prime}_a \times \text{prime}_b \times \text{prime}_c\) to get the hash of the string “abc”. Similarly, the hash value of “fgh” is \(\text{prime}_f \times \text{prime}_g \times \text{prime}_h\). Compare the hash value of “abc” with “fgh” and see that the numbers do not match.

To move onto the next iteration, we need to calculate the hash value of “bcd”. How can we do this? We could compute \(\text{prime}_b \times \text{prime}_c \times \text{prime}_d\), but we would actually be repeating work we’ve already done!

We know \(\text{prime}_a \times \text{prime}_b \times \text{prime}_c\) and we know that \(\frac{\text{prime}_a \times \text{prime}_b \times \text{prime}_c}{\text{prime}_a} = \text{prime}_b \times \text{prime}_c \). If we divide out the \(\text{prime}_a\) and multiply in the \(\text{prime}_d,\) we can get the hash value of “bcd”. We compare this value with “fgh”, and so on. \(_\square\)

Rabin-Karp uses only simple multiplications and additions in its rolling hash implementation.

All the operations are done modulo a prime number \(p\) to avoid dealing with large numbers. For the sake of readability and simplicity, the modulo \(p\) has been excluded, but the calculations still hold when modulo \(p\) is present.

The Rabin-Karp Rolling Hash^[2]

\[H = c_1a^{k-1} + c_2a^{k-2} + c_3a^{k-3} + \cdots + c_ka^{0}\]

\(a\) is a constant, \(c_1 \ldots c_k\) are the input characters, and \(k\) is the number of characters there are in the string we are comparing (this is the length of the word).

Let’s use Rabin-Karp’s rolling hash to hash the alphabet. Here, \(a\) will be \(26\), \(k\) will be \(3\), and \(c\) will represent the place in the alphabet where the character appears—so for “a” \(c\) will be \(1\), and for “z” \(c\) will be \(26\).

Let’s find the hash of “abc” and use that to find the hash of “bcd”:

\[H(\text{“abc”}) = 1 \times 26^2 + 2 \times 26^1 + 3 \times 26^0.\]

To get the hash of “bcd”, we need to remove “a” and add “d”:

\[H(\text{“bcd”}) = H(\text{“abc”}) - H(\text{“a”}) + H(\text{“d”}).\]

We will need to multiply \(H(\text{“abc”}) - H(\text{“a”})\) by \(26\) since we will be subtracting out the \(26^2\) term that is currently associated with “a” and will need to now be associated with “b”. This will shift values to the left so that “d” will have the \(26^0\) coefficient:

\[\begin{align} H(\text{“bcd”}) &= \left(\big(1 \times 26^2 + 2 \times 26^1 + 3 \times 26^0\big) - 1 \times 26^2\right) \times 26 + 4 \times 26^0\\ &= 2 \times 26^2 + 3 \times 26^1 + 4 \times 26^0. \end{align}\]

At each step, there are a constant number of operations (one subtraction, one multiplication, and one addition) and these are done in constant time, \(O(1)\). When checking if a substring of the text is equal to the word, the algorithm just needs to see if the hash values are equal. This can also be done in constant time. Because the algorithm performs \(O(1)\) operations on each character of the text and there are \(n\) characters in the text, this step takes \(O(n)\) time.

Implementation of Rabin-Karp

Here is one way to implement the Rabin-Karp algorithm in Python^[3].

class RollingHash:
    def __init__(self, text, sizeWord):
        self.text = text
        self.hash = 0 
        self.sizeWord = sizeWord

        for i in range(0, sizeWord):
            #ord maps the character to a number
            #subtract out the ASCII value of "a" to start the indexing at zero
            self.hash += (ord(self.text[i]) - ord("a")+1)*(26**(sizeWord - i -1))

        #start index of current window
        self.window_start = 0
        #end of index window
        self.window_end = sizeWord

    def move_window(self):
        if self.window_end <= len(self.text) - 1:
            #remove left letter from hash value
            self.hash -= (ord(self.text[self.window_start]) - ord("a")+1)*26**(self.sizeWord-1)
            self.hash *= 26
            self.hash += ord(self.text[self.window_end])- ord("a")+1
            self.window_start += 1
            self.window_end += 1

    def window_text(self):
        return self.text[self.window_start:self.window_end]

def rabin_karp(word, text):
    if word == "" or text == "":
        return None
    if len(word) > len(text):
        return None

    rolling_hash = RollingHash(text, len(word))
    word_hash = RollingHash(word, len(word))
    #word_hash.move_window()

    for i in range(len(text) - len(word) + 1):
        if rolling_hash.hash == word_hash.hash:
            if rolling_hash.window_text() == word:
                return i
        rolling_hash.move_window()
    return None

#print rabin_karp("a", "abcdefgh")
#print rabin_karp("d", "abcdefgh")
#print rabin_karp("cupcakes", "balloonsandcupcakes")

Complexity of Rabin-Karp

Assume the text is length \(n\) and the length of the word is \(m\). The best- and average-case running time of Rabin-Karp is \(O(m + n)\) because the rolling hash step takes \(O(n)\) time and once the algorithm finds a potential match, it must verify each letter to make sure that the match is true and not a result of a hashing collision and therefore must check each of the \(m\) letters in the word.

The worst-case running time of Rabin-Karp is \(O(nm)\). This would occur with an extremely awful hash function that resulted in a false positive at each step. Since whenever the algorithm thinks it found a match, it must verify each of the \(m\) letters in the word, if there is a collision at each step, \(m\) letters will be checked \(n\) times resulting in a running time of \(O(nm)\). This can be avoided with a good choice of hash function.

When searching for a single pattern to match in a text, the Knuth–Morris–Pratt algorithm will be a faster choice. For matching multiple patterns, the Rabin-Karp algorithm paired with a Bloom filter can efficiently find multiple patterns in a text.^[4]

References

, . Hash Function. Retrieved May 28, 2016, from https://en.wikipedia.org/wiki/Hash_function
, . Rolling Hash. Retrieved May 28, 2016, from https://en.wikipedia.org/wiki/Rolling_hash
, m. Rabin-Karp. Retrieved May 28, 2016, from https://github.com/mccricardo/Rabin-Karp/blob/master/rabin_karp.py
, . Rabin–Karp algorithm. Retrieved May 29, 2016, from https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm

Computer Science Fundamentals

Contents