Knuth-Morris-Pratt Algorithm

The KMP algorithm is an efficient string matching algorithm due to Donald Knuth, Vaughan Pratt, and James H. Morris. It is a linear time algorithm that exploits the observation that every time a match (or a mismatch) happens, the pattern itself contains enough information to dictate where the new examination should begin from.

String Matching Problem

Problem: Given a string S and a string p shorter than or equal to S, find all occurrences of p in S.

The string matching problem can be relevant to many situations including but not limited to using the search feature in text editors, building up a database for a search engine or processing genomic sequences.

As one can see, the algorithm hits the worst case possibility (with no occurrences of the pattern) in practice very frequently and needs to be carefully designed.

The naive approach would be to slide a window of size \(|p|\) all over \(|S|\) and check if it matches.

def match(S,p):
    matches = []
    for i in range(0,len(S)-len(p)+1):
        for j in range(0,len(p)):
            if S[i+j] != p[j]:
                break
        else:                #executes only when the loop above does not break anywhere
            matches.append(i)
    return matches

That works but actually is inefficient. It takes \(O\big(|S|^2\big)\) or more tightly \(O\big(|S| \cdot |p|\big)\) time.

The Prefix Function

Every time the naive function fails (or succeeds), it starts matching over from the next character. This might not be necessary. We could use our knowledge of the point of last matching and the structure of the test string to know where the next matching should begin from.

The above-mentioned "structure" of the string is encapsulated in the prefix function (denoted \(\pi\) ) of the test string. This is a description of how well the string matches against shifts of itself.

Given a pattern \(p\) of length \(m\), the function \(\pi\) maps \(\left \{ 1, 2, \ldots, m \right \}\) to \(\left \{ 0, 1, \ldots, m-1 \right \}\) such that \(\pi(q)\) is the length of the longest prefix of \(p\) that is a proper suffix of \(p_q:\)

\[\pi[q] = \max \left \{ k : k < q \text{ and } p_k \sqsupset p_q \right \} .\]

Finding Matches

We are now ready to start finding occurrences of the pattern in the string.

Let's say that we match a few characters of the pattern in the target string and find a mismatch. Should we start all over again from the very next letter where we did our previous matching? Perhaps not. We could use the pre-computed knowledge in the prefix function that would tell us if it is possible that a new match could include the previously matched letters.

Let's try and match the pattern ATAAT against GATCCATATG.

We try matching A against G which is a negative. So, we proceed to the next character.

We match AT in the pattern against AT in the string. We then encounter C in the string and try to match it with Tin the pattern which does not succeed.

We ignore trying to match from the T immediately after the A because the prefix-function tells us that \(\pi[1]=0\). In other words, it is impossible for a match to start from that location since there is a T there which does not agree with the beginning of the pattern.

Nothing interesting happens until the ATA is matched. The next character A in the pattern disagrees with T in the string.

The next match should take care of the fact that we've already matched an AT near the end of the previous match which need not be matched again. We just need to check if the letters following AT in the pattern match.

The idea being described here takes some time to sink in. This video might help.

Complexity

Time Complexity

Because the prefix-function already tells us what needs to be done about the already matched parts, we do not need to check those parts again and hence, only one pass over the string is sufficient. So, the time complexity of the matching is \(O(n)\).

What about the pre-computation? It turns out that, if correctly implemented, it can be done in \(O(m)\) time.

Overall, it takes \(O(m+n)\) time. But since \(m \leq n\), the time is bounded by \(O(n)\). This is certainly better than the usual \(O(n^2).\)

\[\] Space Complexity

We only require as much space as the pattern. So, the space requirement is \(O(m)\).

Notice that reading the actual string character by character in one linear pass is just fine here.

Implementation

Prefix Function

def prefix(p):
   m=len(p)
   pi=[0]*m
   j=0 
   for i in range(1,m):
      while j>=0 and p[j]!=p[i]:
         if j-1>=0:
                j=pi[j-1]
         else:
                j=-1 
      j+=1
      pi[i]=j
   return pi

Matching

def find_occurrences(S,p):
   matches = []
   f=prefix(p)
   n,m=len(S),len(p)
   j=0
   for i in range(n):
      while j>=0 and t[i]!=p[j]:
         if j>0: j=f[j-1]
         else: j=-1
      j+=1  
      if j==m:
         j=f[m-1]
         matches.append(i-m+1)
   return matches

Contents

Prefix Function

Matching