Knuth-Morris-Pratt Algorithm
The KMP algorithm is an efficient string matching algorithm due to Donald Knuth, Vaughan Pratt, and James H. Morris. It is a linear time algorithm that exploits the observation that every time a match (or a mismatch) happens, the pattern itself contains enough information to dictate where the new examination should begin from.
String Matching Problem
Problem: Given a string
S
and a stringp
shorter than or equal toS
, find all occurrences ofp
inS
.
The string matching problem can be relevant to many situations including but not limited to using the search feature in text editors, building up a database for a search engine or processing genomic sequences.
As one can see, the algorithm hits the worst case possibility (with no occurrences of the pattern) in practice very frequently and needs to be carefully designed.
The naive approach would be to slide a window of size \(|p|\) all over \(|S|\) and check if it matches.
1 2 3 4 5 6 7 8 9 |
|
That works but actually is inefficient. It takes \(O\big(|S|^2\big)\) or more tightly \(O\big(|S| \cdot |p|\big)\) time.
The Prefix Function
Every time the naive function fails (or succeeds), it starts matching over from the next character. This might not be necessary. We could use our knowledge of the point of last matching and the structure of the test string to know where the next matching should begin from.
The above-mentioned "structure" of the string is encapsulated in the prefix function (denoted \(\pi\) ) of the test string. This is a description of how well the string matches against shifts of itself.
Given a pattern \(p\) of length \(m\), the function \(\pi\) maps \(\left \{ 1, 2, \ldots, m \right \}\) to \(\left \{ 0, 1, \ldots, m-1 \right \}\) such that \(\pi(q)\) is the length of the longest prefix of \(p\) that is a proper suffix of \(p_q:\)
\[\pi[q] = \max \left \{ k : k < q \text{ and } p_k \sqsupset p_q \right \} .\]
Finding Matches
We are now ready to start finding occurrences of the pattern in the string.
Let's say that we match a few characters of the pattern in the target string and find a mismatch. Should we start all over again from the very next letter where we did our previous matching? Perhaps not. We could use the pre-computed knowledge in the prefix function that would tell us if it is possible that a new match could include the previously matched letters.
Let's try and match the pattern
ATAAT
againstGATCCATATG
.
We try matching
A
againstG
which is a negative. So, we proceed to the next character.We match
AT
in the pattern againstAT
in the string. We then encounterC
in the string and try to match it withT
in the pattern which does not succeed.We ignore trying to match from the
T
immediately after theA
because the prefix-function tells us that \(\pi[1]=0\). In other words, it is impossible for a match to start from that location since there is aT
there which does not agree with the beginning of the pattern.Nothing interesting happens until the
ATA
is matched. The next characterA
in the pattern disagrees withT
in the string.The next match should take care of the fact that we've already matched an
AT
near the end of the previous match which need not be matched again. We just need to check if the letters followingAT
in the pattern match.
The idea being described here takes some time to sink in. This video might help.
Complexity
Time Complexity
Because the prefix-function already tells us what needs to be done about the already matched parts, we do not need to check those parts again and hence, only one pass over the string is sufficient. So, the time complexity of the matching is \(O(n)\).
What about the pre-computation? It turns out that, if correctly implemented, it can be done in \(O(m)\) time.
Overall, it takes \(O(m+n)\) time. But since \(m \leq n\), the time is bounded by \(O(n)\). This is certainly better than the usual \(O(n^2).\)
\[\] Space Complexity
We only require as much space as the pattern. So, the space requirement is \(O(m)\).
Notice that reading the actual string character by character in one linear pass is just fine here.
Implementation
Prefix Function
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
Matching
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|