The KMP algorithm is an efficient string matching algorithm due to Donald Knuth, Vaughan Pratt, and James H. Morris. It is a linear-time algorithm that exploits the observation that every time a match (or a mismatch) happens, the pattern itself contains enough information to dictate where the next examination should begin.
Problem: Given a string S and a string p no longer than S, find all occurrences of p in S.
The string matching problem is relevant to many situations, including but not limited to the search feature in text editors, building up a database for a search engine, or processing genomic sequences.
In practice, searches frequently hit the worst case (no occurrence of the pattern at all), so the algorithm needs to be designed carefully.
The naive approach would be to slide a window the size of p over S and check, at each position, whether the window matches p.
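A minimal Python sketch of this naive scan (the function name and structure are our own illustration, not the article's original listing):

```python
def naive_search(s, p):
    """Slide a window of size len(p) over s and report every index where it matches p."""
    n, m = len(s), len(p)
    occurrences = []
    for i in range(n - m + 1):       # try every possible starting position
        if s[i:i + m] == p:          # compare the whole window against the pattern
            occurrences.append(i)
    return occurrences

# naive_search("ATATA", "ATA") returns [0, 2]
```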
That works, but it is inefficient: it takes O(nm) time or, more tightly, O((n − m + 1) · m), where n is the length of S and m is the length of p.
Every time the naive function fails (or succeeds), it starts matching over from the next character. This might not be necessary: we could use our knowledge of the point of the last match and the structure of the pattern to decide where the next matching attempt should begin.

The above-mentioned "structure" of the pattern is encapsulated in its prefix function (denoted π). This is a description of how well the pattern matches against shifts of itself.

Given a pattern p of length m, the prefix function maps each position i in {1, …, m} to a value π(i) in {0, …, m − 1}, where π(i) is the length of the longest prefix of p that is also a proper suffix of p[1…i]. For example, for p = ATAT, π = (0, 0, 1, 2).
We are now ready to start finding occurrences of the pattern in the string.
Let's say that we match a few characters of the pattern in the target string and then find a mismatch. Should we start all over again from the letter right after where our previous attempt began? Perhaps not. We could use the pre-computed knowledge in the prefix function, which tells us whether a new match could include the previously matched letters.
Let's try to match the pattern against an example string. We first try matching the beginning of the pattern against G, which is a negative, so we proceed to the next character. We then match AT in the pattern against AT in the string. We next encounter C in the string and try to match it with T in the pattern, which does not succeed.
We ignore trying to match from the T immediately after the A, because the prefix function tells us that π(2) = 0. In other words, it is impossible for a match to start from that location, since there is a T there which does not agree with the beginning of the pattern.
Nothing interesting happens until ATA is matched. The next character, A in the pattern, disagrees with T in the string.

The next match should take into account that we have already matched an AT near the end of the previous attempt, which need not be matched again. We just need to check whether the letters following AT in the pattern continue to match.
The idea being described here takes some time to sink in.
Because the prefix function already tells us what to do about the already-matched parts, we do not need to check those parts again, and hence only one pass over the string is sufficient. So, the time complexity of the matching is O(n).
What about the pre-computation? It turns out that, if correctly implemented, it can be done in O(m) time.
Overall, it takes O(n + m) time. But since m ≤ n, this is bounded by O(n), which is certainly better than the O(nm) of the naive approach.

We only require as much space as the pattern, so the space requirement is O(m).
Notice that reading the actual string character by character in one linear pass is just fine here.
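The pre-computation of the prefix function can be sketched in Python as follows (a hedged illustration; the variable names are our own). The inner while loop falls back through ever-shorter candidate prefixes, which is what makes the total amortized cost O(m):

```python
def prefix_function(p):
    """pi[i] is the length of the longest prefix of p that is a proper suffix of p[:i+1]."""
    m = len(p)
    pi = [0] * m
    k = 0                      # length of the longest prefix matched so far
    for i in range(1, m):
        while k > 0 and p[i] != p[k]:
            k = pi[k - 1]      # fall back to the next shorter candidate prefix
        if p[i] == p[k]:
            k += 1
        pi[i] = k
    return pi

# prefix_function("ATAT") returns [0, 0, 1, 2]
```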
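The matching pass can be sketched like this (again a Python illustration under our own naming, made self-contained by including the prefix-function computation). Note how, on a mismatch, the counter k falls back via π instead of restarting from scratch:

```python
def prefix_function(p):
    """pi[i] is the length of the longest prefix of p that is a proper suffix of p[:i+1]."""
    pi = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = pi[k - 1]
        if p[i] == p[k]:
            k += 1
        pi[i] = k
    return pi

def kmp_search(s, p):
    """Return the starting indices of every occurrence of p in s, in one pass over s."""
    pi = prefix_function(p)
    occurrences = []
    k = 0                          # number of pattern characters currently matched
    for i, c in enumerate(s):
        while k > 0 and c != p[k]:
            k = pi[k - 1]          # reuse already-matched characters instead of restarting
        if c == p[k]:
            k += 1
        if k == len(p):            # full match ending at position i
            occurrences.append(i - len(p) + 1)
            k = pi[k - 1]          # keep going, allowing overlapping occurrences
    return occurrences

# kmp_search("ATATA", "ATA") returns [0, 2]
```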