An understanding of the KMP algorithm

Source: Internet
Author: User

1. Introduction

Locating a short substring inside a longer string is called string pattern matching, and it is one of the most important string operations. The KMP algorithm itself is not complicated, but many articles on the Internet explain it in a confusing way. Here we start with the brute-force matching algorithm, then walk through the steps of KMP, the simple way to derive the next array, the recursive principle behind it, and the code that computes it. After that we cover matching based on the next array, the optimization of the next array, and the time complexity of KMP.

2. Brute Force matching algorithm

2.1 Problem Description:

Given a text string S and a pattern string P, how do we find the position of P in S?

With the brute-force approach, assume the text string has been matched up to position i and the pattern string up to position j.

2.2 Algorithm Description:

The simplest algorithm for string pattern matching is the brute-force method. It works as follows:

(1) Initialize i to the starting position of the main string (assumed to be position 0), and let j point to position 0 of the pattern string.

(2) If the current characters match, i.e. s[i] == p[j], then do i++, j++ and continue with the next character.

(3) If they do not match, i.e. s[i] != p[j], then set i = i - j + 1 and j = 0. In other words, on every failed match i backtracks and j is reset to 0.

Tip: when a match fails, i is the pointer into the main string and j is the number of characters matched since this attempt started, so i - j is the starting position of the current attempt. The +1 moves that starting position one place to the right.

2.3 Brute force algorithm implementation:

#include <string.h>

//Returns the position of the pattern string p in the main string s, or -1 if it does not occur.
int ViolentMatch(char *s, char *p)
{
    int i = 0;
    int j = 0;
    int sLen = strlen(s);
    int pLen = strlen(p);
    while (i < sLen && j < pLen)
    {
        if (s[i] == p[j])
        {
            ++i;
            ++j;
        }
        else
        {
            //j is the number of characters matched in this attempt, so i - j is
            //its starting position in s; +1 moves one position to the right
            i = i - j + 1;
            j = 0;
        }
    }//while
    if (j == pLen)
    {
        return i - pLen;
    }
    else
    {
        return -1;
    }
}

In the worst case, the time complexity of this brute-force matching algorithm is O((n - m + 1) * m), where n is the length of the main string and m is the length of the pattern string. That cost is high, so in practice we want something better. Still, brute force is the basis from which the other algorithms are designed.
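To make the worst-case bound concrete, here is a minimal sketch that counts character comparisons during brute-force matching. The function name `brute_force_count` and its `comparisons` out-parameter are illustrative assumptions, not part of the original routine.

```c
#include <string.h>

/* Brute-force search that also counts character comparisons.
 * `comparisons` is a hypothetical out-parameter added for illustration. */
int brute_force_count(const char *s, const char *p, long *comparisons)
{
    int sLen = (int)strlen(s);
    int pLen = (int)strlen(p);
    int i = 0, j = 0;

    *comparisons = 0;
    while (i < sLen && j < pLen) {
        ++*comparisons;
        if (s[i] == p[j]) {
            ++i;
            ++j;
        } else {
            i = i - j + 1;   /* backtrack to the next starting position */
            j = 0;
        }
    }
    return (j == pLen) ? i - j : -1;
}
```

For s = "aaaaaaab" (n = 8) and p = "aaab" (m = 4), each of the n - m + 1 = 5 starting positions costs m = 4 comparisons, so the counter reaches 20 and the match is reported at position 4.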

3. KMP algorithm Pre-analysis

Before starting on the KMP algorithm, let's review the brute-force method of string pattern matching, whose procedure the figure illustrates in steps ① to ⑥.

Note: in the figures, positions in both strings are counted from 1, while in the code they are counted from 0.

You can see that the main-string index i keeps backtracking, yet the three authors of KMP (Knuth, Morris, and Pratt) found that this backtracking is not actually needed. That observation is the KMP algorithm, the main subject of this article: it uses the information from the part that has already matched, keeps i from backtracking, and instead changes j so that the pattern string moves as far as possible to a valid position.

Since i cannot go back, that is, cannot get smaller, the value that has to change is j. To make the KMP algorithm clear, we first analyze two examples starting from the brute-force algorithm.

3.1 Example 1:

The main string is S = "abcdefgab" and the pattern string is P = "abcdex", which is the problem shown in the figure above. It is easy to see that the first letter "a" of the pattern string differs from every character in the rest of the pattern, "bcdex". In step ① the first five characters of the pattern matched the main string, so the first character "a" of the pattern cannot be equal to the 2nd through 5th characters of the main string. That means the comparisons in steps ②③④⑤ are redundant, and only steps ① and ⑥ need to be kept.


This example shows that if the first character of the pattern string differs from all the characters after it, i does not need to backtrack when a match fails. Just keep i unchanged and reset j to the starting position. So only the else branch of the original brute-force code needs to change:

while (i < sLen && j < pLen)
{
    if (s[i] == p[j])
    {
        ++i;
        ++j;
    }
    else
    {
        if (j == 0)
        {
            ++i;    //nothing has matched yet, so move on in the main string
        }
        j = 0;      //i does not backtrack; only j is reset
    }
}//while

However, this relies on the premise that the first character of the pattern string differs from all the other characters in the pattern. What if that condition does not hold, that is, the first character of the pattern also appears later in the pattern?

3.2 Example 2:

The main string is S = "abcabcabc" and the pattern string is P = "abcabx". In the first round of comparisons the first 5 characters are equal and the 6th differs. This pattern does not satisfy the condition of Example 1: its first character is not different from all the other characters in the pattern.


However, the experience from Example 1 still applies in part: the first character of the pattern differs from its 2nd and 3rd characters, so steps ②③ are redundant.

There is more. In the pattern string the 1st and 4th characters are both "a", and the 2nd and 5th characters are both "b"; that is, the pattern contains the repeated substring "ab". Step ① has already shown that positions 4 and 5 of the pattern equal positions 4 and 5 of the main string, which are therefore also "ab". So steps ④⑤ are redundant as well, and again only steps ① and ⑥ need to be kept.


So it seems the code can be modified like this:

while (i < sLen && j < pLen)
{
    if (s[i] == p[j])
    {
        ++i;
        ++j;
    }
    else
    {
        //If the current characters do not match (s[i] != p[j]), i stays unchanged and j = next[j]
        //next[j] is the next value corresponding to j
        //(the special case j == -1 is handled in the full version in section 7)
        j = next[j];
    }
}//while

What the value of next[j] actually is will be worked out below.

4. KMP algorithm

Above we covered the brute-force method of string pattern matching and made some improvements to it: the main-string pointer i no longer backtracks, and the pattern-string pointer j is adjusted according to characteristics of the pattern string itself. How j should be adjusted depends only on the pattern string; the resulting values are recorded in the array next[j].

4.1 Finding the maximum length of the common prefix and suffix:

For P = p0 p1 ... pj-1 pj, look for the longest prefix and suffix of the pattern string P that are equal. If p0 p1 ... pk-1 pk = pj-k pj-k+1 ... pj-1 pj, then the pattern string up to and including pj has an equal prefix and suffix of maximum length k+1. For example, if the given pattern string is "abab", the maximum common prefix-suffix length of each of its substrings is:

Pattern string                  a   b   a   b
Maximum prefix-suffix length    0   0   1   2

For example, the string "aba" has the equal prefix and suffix "a" of length 1, and the string "abab" has the equal prefix and suffix "ab" of length 2 (the length of the equal prefix and suffix is k + 1, here k + 1 = 2).
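The table can be reproduced with a deliberately naive sketch that tries every candidate length for every prefix of the pattern. The name `max_len_table` is an illustrative assumption, and the caller is assumed to supply an array with one slot per pattern character.

```c
#include <string.h>

/* For every prefix p[0..j] of the pattern, store in maxLen[j] the length of
 * the longest proper prefix of p[0..j] that is also a suffix of it.
 * Naive on purpose: every candidate length is tried, longest first. */
void max_len_table(const char *p, int *maxLen)
{
    int pLen = (int)strlen(p);

    for (int j = 0; j < pLen; ++j) {
        maxLen[j] = 0;
        /* len stops at j so that prefix and suffix stay proper */
        for (int len = j; len > 0; --len) {
            if (memcmp(p, p + (j + 1 - len), (size_t)len) == 0) {
                maxLen[j] = len;
                break;
            }
        }
    }
}
```

For "abab" this fills the array with 0 0 1 2, matching the values given above.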

4.2 Deriving the next array:

The next array considers the longest common prefix and suffix of the string excluding the current character. Why exclude the current character? Recall the pattern strings "abcdex" and "abcabx" in the previous two examples: when the matching pointer reaches a character and the next array has to be consulted, that character is always the one where the match failed, as the code also shows. That character has to be compared again in any case. We therefore look only at the characters before the mismatched one, use the information from this already matched part, keep i from backtracking, and change j so that the pattern string moves as far as possible to a valid position.

Once step 1 has produced the maximum common prefix-suffix lengths, a simple transformation gives the next array: shift all the values from step 1 one place to the right and set the first value to -1. (Here -1 is not a prefix-suffix length; it only marks the first character of the pattern string. Why -1 is used will become clear later.) The result is:

Pattern string                  a   b   a   b
Maximum prefix-suffix length    0   0   1   2
next                           -1   0   0   1

For example, for "aba", the string "ab" before the 3rd character "a" has an equal prefix and suffix of length 0, so the next value of the 3rd character "a" is 0. For "abab", the string "aba" before the 4th character "b" has the equal prefix and suffix "a" of length 1, so the next value of the 4th character "b" is 1 (the length of the equal prefix and suffix is k, here k = 1).
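The shift-right rule is short enough to sketch directly. The helper name `next_from_max_len` is an assumption made for illustration; it takes a maximum-length table that has already been computed.

```c
/* Build next[] from a maximum-length table by shifting it right one slot
 * and putting -1 in the first slot. */
void next_from_max_len(const int *maxLen, int *next, int pLen)
{
    if (pLen <= 0)
        return;
    next[0] = -1;                 /* marker for the first pattern character */
    for (int j = 1; j < pLen; ++j)
        next[j] = maxLen[j - 1];  /* value of the previous character */
}
```

Applied to the table 0 0 1 2 of "abab", it produces -1 0 0 1.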

4.3 Matching according to the next array:

With the two steps above done, we get to the point: how to use the values of the next array to adjust j during matching while guaranteeing that i never backtracks.

On a mismatch, set j = next[j]; the pattern string then moves right relative to the main string by j - next[j] positions. In other words, suppose the pattern substring pj-k pj-k+1 ... pj-1 has matched the text substring si-k si-k+1 ... si-1, but pj fails to match si. Because next[j] = k, the pattern string without pj has an equal prefix and suffix of maximum length k, that is, p0 p1 ... pk-1 = pj-k pj-k+1 ... pj-1. Setting j = next[j] therefore shifts the pattern string right by j - next[j] positions, so that the pattern prefix p0 p1 ... pk-1 lines up with the text substring si-k si-k+1 ... si-1, and then pk continues the comparison with si.

To sum up, the next array of KMP tells us where in the pattern string to continue when a pattern character fails to match a text character. If the character at position j of the pattern mismatches the character at position i of the text, the next step compares the pattern character at position next[j] with the text character at position i, which is equivalent to moving the pattern string right by j - next[j] positions.

Next, an example makes the above concrete:

4.4 An example of finding the longest prefix and suffix:

If the given pattern string is "ABCDABD", traverse it from left to right and compare the prefixes and suffixes of each of its substrings:

Substring    Longest common prefix and suffix    Length
A            (none)                              0
AB           (none)                              0
ABC          (none)                              0
ABCD         (none)                              0
ABCDA        A                                   1
ABCDAB       AB                                  2
ABCDABD      (none)                              0

In other words, the table of maximum common prefix-suffix lengths for the substrings of the pattern (hereinafter the "maximum length table") is:

Character         A   B   C   D   A   B   D
Maximum length    0   0   0   0   1   2   0

4.5 A matching example based on the maximum length table:

Because the pattern string may contain repeated characters, the following conclusion can be drawn:

On a mismatch, the number of positions the pattern string moves right = number of characters already matched - maximum length value of the character before the mismatched character

Now let's match strings using the maximum length table and the conclusion above. Given the text string "BBC ABCDAB ABCDABCDABDE" and the pattern string "ABCDABD", we match the pattern against the text using the maximum length table directly, without the next array for now. Note also that the maximum length value of the character before the mismatched character is exactly the value of the pattern pointer j for the next comparison; the example describes the process in terms of the right shift relative to the main string, but either view works.

1. The character A of the pattern does not match the characters B, B, C, and space at the start of the text, so the conclusion above is not needed yet: the pattern simply moves right one position at a time until its A matches the A that is the 5th character of the text.

2. Matching continues until the last character of the pattern, D, mismatches the text. Clearly the pattern has to move right, but by how much? The number of characters matched so far is 6 (ABCDAB), and the maximum length table gives 2 for B, the character before the mismatched D. By the conclusion above, the pattern moves right 6 - 2 = 4 positions.

3. After the pattern moves 4 positions right, C mismatches. 2 characters (AB) are already matched, and the maximum length value of the previous character B is 0, so the pattern moves right 2 - 0 = 2 positions.

4. A mismatches the space, so the pattern moves right 1 position.

5. Comparison continues until D mismatches C. The shift is the number of matched characters, 6, minus the maximum length 2 of the previous character B, so the pattern moves right 6 - 2 = 4 positions.

6. After step 5 the whole pattern matches and the process is complete.

The matching process ends here, even if further occurrences could be matched later in the text. To find them, call the matching function again.
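The six steps above can be replayed with a small sketch that matches with the maximum length table directly and records every right shift of the pattern. The function name `search_with_max_len` and the `shifts` / `shiftCount` out-parameters are hypothetical, added only for this illustration; the caller is assumed to provide room for 2 * strlen(s) shift entries.

```c
#include <string.h>

/* Search using the maximum-length table directly.
 * shifts[] receives each right shift of the pattern, in order. */
int search_with_max_len(const char *s, const char *p, const int *maxLen,
                        int *shifts, int *shiftCount)
{
    int sLen = (int)strlen(s);
    int pLen = (int)strlen(p);
    int i = 0, j = 0;

    *shiftCount = 0;
    while (i < sLen && j < pLen) {
        if (s[i] == p[j]) {
            ++i;
            ++j;
        } else if (j == 0) {
            ++i;                              /* nothing matched: slide by one */
            shifts[(*shiftCount)++] = 1;
        } else {
            /* shift = matched characters - max length of the previous character */
            shifts[(*shiftCount)++] = j - maxLen[j - 1];
            j = maxLen[j - 1];                /* i stays where it is */
        }
    }
    return (j == pLen) ? i - j : -1;
}
```

With the text "BBC ABCDAB ABCDABCDABDE", the pattern "ABCDABD" and the table 0 0 0 0 1 2 0, the recorded shifts are 1, 1, 1, 1 (step 1), then 4, 2, 1, 4 (steps 2 to 5), and the match is found at position 15, counting from 0.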

The matching process shows that the crux of the problem is finding the longest equal prefix and suffix in the pattern string. Once the maximum common prefix-suffix length is known for the part before each character of the pattern, matching can proceed on that basis. This maximum length is exactly what the next array expresses.

Tip: there are two reasons for finding the longest common prefix and suffix of the characters already matched: (1) the matched substring can be used to keep i from backtracking; (2) the same matched substring tells us a valid position to which the pattern string can be moved.

5. Computing the next array

The basic idea of the KMP algorithm has now been introduced, but some questions remain: how to derive the next array from the maximum length table, and how to compute the next array in code.

5.1 Deriving the next array from the maximum length table:

As we have seen, the maximum common prefix-suffix lengths for the string "ABCDABD" are:

Character         A   B   C   D   A   B   D
Maximum length    0   0   0   0   1   2   0

From this table the following conclusion was drawn:

On a mismatch, the number of positions the pattern string moves right = number of characters already matched - maximum length value of the character before the mismatched character

Matching with this table and conclusion, we find that at a mismatch the mismatched character itself never needs to be considered; every time, we look up the maximum length value of the character before the mismatched one. This leads to the next array.

For the string "ABCDABD", the next array is:

Character    A   B   C   D   A   B   D
next        -1   0   0   0   0   1   2

Comparing the next array with the maximum length table, it is easy to see that the next array is the maximum length table shifted right one position as a whole, with the first value set to -1. So deriving the next array is simple: find the maximum common prefix-suffix lengths, shift them right one position, and set the first value to -1. (You can also compute the next value of a character directly: look at how long the longest equal prefix and suffix is in the string before that character, which, as mentioned earlier, does not include pj.)

In other words, for the pattern string ABCDABD, the maximum length table and the next array are:

Character         A   B   C   D   A   B   D
Maximum length    0   0   0   0   1   2   0
next             -1   0   0   0   0   1   2

Once the next array has been derived from the maximum length table, the shift can be written as:

On a mismatch, the number of positions the pattern string moves right = position of the mismatched character - next value of the mismatched character

You will find that the shift is the same whether it is computed from the maximum length table or from the next array. Why? Because:

    • According to the maximum length table, on a mismatch the pattern string moves right by: number of characters already matched - maximum length value of the character before the mismatched character
    • According to the next array, on a mismatch the pattern string moves right by: position of the mismatched character - next value of the mismatched character
    • Counting from 0, the position of the mismatched character equals the number of characters already matched (the mismatched character is not counted), and the next value of the mismatched character equals the maximum length value of the character before it, so the two results are necessarily identical.

So you can regard the maximum length table as the prototype of the next array, or even use it as the next array directly; the only difference is how it is used.
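The equivalence of the two shift rules can be checked mechanically. The sketch below assumes both tables have already been filled in; the name `shifts_agree` is hypothetical.

```c
/* Returns 1 when, for every mismatch position j >= 1, the shift computed
 * from the maximum-length table equals the shift computed from next[]. */
int shifts_agree(const int *maxLen, const int *next, int pLen)
{
    for (int j = 1; j < pLen; ++j) {
        int byMaxLen = j - maxLen[j - 1];  /* matched chars - previous max length */
        int byNext   = j - next[j];        /* mismatch position - its next value */
        if (byMaxLen != byNext)
            return 0;
    }
    return 1;
}
```

Position 0 is left out because no character precedes it; there the next-based rule gives 0 - (-1) = 1, the same one-position slide used when nothing has matched.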

5.2 Computing the next array recursively in code:

(1) Suppose that for a value k we already have p0 p1 ... pk-1 = pj-k pj-k+1 ... pj-1, which is equivalent to next[j] = k.

What does this mean? In essence, next[j] = k says that the part of the pattern string before p[j] has an equal prefix and suffix of maximum length k. With this next array, when the character at position j of the pattern mismatches during KMP matching, the character at position next[j] is compared with the text next, which is equivalent to moving the pattern string right by j - next[j] positions.

(2) The question now is: given next[0, ..., j], how do we find next[j + 1]?

For the first j + 1 characters of P:

    • If p[k] == p[j], then next[j + 1] = next[j] + 1 = k + 1;
    • If p[k] ≠ p[j], check whether p[next[k]] == p[j]. If so, next[j + 1] = next[k] + 1; otherwise keep falling back with k = next[k] and repeat. The reasoning is this: before the character p[j+1] there is no prefix "p0 p1 ... pk-1 pk" of length k+1 equal to the suffix "pj-k pj-k+1 ... pj-1 pj", so we ask whether there is a smaller value t+1 < k+1 for which the shorter prefix "p0 p1 ... pt-1 pt" equals the shorter suffix "pj-t pj-t+1 ... pj-1 pj". If there is, t+1 is the value of next[j+1]. This amounts to matching the prefix of P against the suffix of P, using the next values already computed (next[0, ..., k, ..., j]).

So what is the theoretical basis for computing the next array this way?

Look at the pattern string p = "ababaaaba". Following the algorithm, prefix and suffix are matched character by character until the prefix is "abab" and the suffix is "abaa", that is, k = 3 and j = 5, where the match fails. A brute-force approach at this point would scan the characters before p[3] = "b" for an "a", and then check whether the prefix ending at that "a" has a corresponding suffix.

In other words, to find a prefix that equals a suffix ending at p[5] = "a", the characters of that suffix other than the final "a" must already have been matched by a prefix, and that prefix only needs to be followed by an "a". Conveniently, the matching records for the earlier characters are exactly what the next array stores. This is why the next array is used for the prefix matching: it narrows down where a matching prefix can come from without skipping any candidate. Conversely, if the characters before the "a" have not been matched, matching the single "a" cannot help.

This method falls back through the indices of the next array, which is the core of the KMP algorithm.

So the code is:

void GetNext(char *p, int *next)
{
    int pLen = strlen(p);
    int j = 0;
    int k = -1;
    next[0] = -1;
    while (j < pLen - 1)
    {
        //p[k] is the prefix character, p[j] is the suffix character
        if (k == -1 || p[j] == p[k])
        {
            ++k;
            ++j;
            next[j] = k;    //length of the longest equal prefix and suffix before character j
        }
        else
        {
            k = next[k];    //fall back to the shorter prefix that matched earlier
        }
    }
}
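As a sanity check on the recurrence, the sketch below restates the recursive construction (so that the block is self-contained) and compares it with a naive construction taken straight from the definition of next. The names `build_next` and `build_next_naive` are assumptions for illustration only.

```c
#include <string.h>

/* Recursive-style construction, restated so this block stands alone. */
void build_next(const char *p, int *next)
{
    int pLen = (int)strlen(p);
    int j = 0, k = -1;

    if (pLen == 0)
        return;
    next[0] = -1;
    while (j < pLen - 1) {
        if (k == -1 || p[j] == p[k]) {
            ++k;
            ++j;
            next[j] = k;
        } else {
            k = next[k];
        }
    }
}

/* Naive definition: next[j] is the length of the longest proper prefix of
 * p[0..j-1] that is also a suffix of p[0..j-1]. */
void build_next_naive(const char *p, int *next)
{
    int pLen = (int)strlen(p);

    if (pLen == 0)
        return;
    next[0] = -1;
    for (int j = 1; j < pLen; ++j) {
        next[j] = 0;
        for (int len = j - 1; len > 0; --len) {
            if (memcmp(p, p + (j - len), (size_t)len) == 0) {
                next[j] = len;
                break;
            }
        }
    }
}
```

For the pattern "ababaaaba" discussed above, both functions give -1 0 0 1 2 3 1 1 2. The drop from next[5] = 3 to next[6] = 1 is the fallback through k = next[k] at k = 3, j = 5.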

At this point we have covered the idea of brute-force matching, the principle and flow of the KMP algorithm, the logical connections within that flow, the simple derivation of the next array (shift the maximum length table right one position and set the first value to -1), its computation in code, and matching based on the next array. It all seems clear, but one small problem has been overlooked: the KMP algorithm as described is still not perfect.

6. Improving the KMP pattern matching algorithm

If the method above is used for the pattern string "abab", its next array is -1 0 0 1 (0 0 1 2 shifted right one position, with the first value set to -1). When this pattern is matched against a text string and the pattern's last character b mismatches a text character c, the pattern moves right j - next[j] = 3 - 1 = 2 positions.

After moving right 2 positions, b mismatches c again. In fact, the previous step already established that p[3] = b mismatches s[3] = c; moving right two positions puts p[next[3]] = p[1] = b against s[3], which is bound to mismatch. Where is the problem?

The problem is that p[j] = p[next[j]] should not be allowed to occur. Why? When p[j] != s[i], the next comparison is between p[next[j]] and s[i]. If p[j] = p[next[j]], that comparison is bound to fail (p[j] has already mismatched s[i], so comparing the equal character p[next[j]] with s[i] must mismatch as well). What if p[j] = p[next[j]] does occur? Then recurse once more and set next[j] = next[next[j]].

void GetNextval(char *p, int *next)
{
    int pLen = strlen(p);
    int j = 0;
    int k = -1;
    next[0] = -1;
    while (j < pLen - 1)
    {
        //p[k] is the prefix character, p[j] is the suffix character
        if (k == -1 || p[j] == p[k])
        {
            ++k;
            ++j;
            if (p[j] != p[k])
            {
                next[j] = k;        //length of the longest equal prefix and suffix before character j
            }
            else
            {
                //p[j] == p[next[j]] must not occur, so recurse once more: next[j] = next[k]
                next[j] = next[k];
            }
        }
        else
        {
            k = next[k];            //fall back to the shorter prefix that matched earlier
        }
    }
}

An example:

The main string is S = "aaaabcde" and the pattern string is P = "aaaaax". With the next array computed by the earlier method, the process is as follows:

Steps ②③④⑤ are redundant comparisons, and the reason is p[j] = p[next[j]]: at the mismatch the next array has to be consulted several times in a row.

In summary, while computing the next array, if the character at position a equals the character at position b that its next value points to, then the nextval of position a takes the nextval of position b.
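The effect of the refinement on this example can be measured with a sketch that builds either table and counts character comparisons during the search. The names `build_table` and `kmp_count`, the `optimized` flag, and the `comparisons` out-parameter are all assumptions introduced for illustration.

```c
#include <string.h>

/* Fill table[] with next values; when `optimized` is non-zero, apply the
 * nextval refinement so that p[j] never equals p[table[j]]. */
void build_table(const char *p, int *table, int optimized)
{
    int pLen = (int)strlen(p);
    int j = 0, k = -1;

    if (pLen == 0)
        return;
    table[0] = -1;
    while (j < pLen - 1) {
        if (k == -1 || p[j] == p[k]) {
            ++k;
            ++j;
            table[j] = (optimized && p[j] == p[k]) ? table[k] : k;
        } else {
            k = table[k];
        }
    }
}

/* KMP search that counts comparisons of s[i] against p[j]. */
int kmp_count(const char *s, const char *p, const int *table, int *comparisons)
{
    int sLen = (int)strlen(s);
    int pLen = (int)strlen(p);
    int i = 0, j = 0;

    *comparisons = 0;
    while (i < sLen && j < pLen) {
        if (j == -1) {
            ++i;                    /* no prefix left to try: advance the text */
            ++j;
        } else {
            ++*comparisons;
            if (s[i] == p[j]) {
                ++i;
                ++j;
            } else {
                j = table[j];       /* i does not backtrack */
            }
        }
    }
    return (j == pLen) ? i - j : -1;
}
```

For S = "aaaabcde" and P = "aaaaax", the plain table is -1 0 1 2 3 4 and the search makes 12 comparisons; the refined table is -1 -1 -1 -1 -1 4 and the search makes 8. The difference of 4 corresponds to the redundant steps ②③④⑤.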

7. KMP algorithm implementation and time complexity analysis:

With the KMP algorithm explained, here is its implementation in code. The next array is passed in as a parameter so that the function does not depend on a global variable:

//Returns the position of the pattern string p in the text string s, or -1 if it does not occur.
//next must already hold the next (or nextval) array of p, e.g. filled by GetNext(p, next).
int KmpSearch(char *s, char *p, int *next)
{
    int i = 0;
    int j = 0;
    int sLen = strlen(s);
    int pLen = strlen(p);
    while (i < sLen && j < pLen)
    {
        //If j == -1, or the current characters match (s[i] == p[j]), do i++, j++
        if (j == -1 || s[i] == p[j])
        {
            i++;
            j++;
        }
        else
        {
            //If j != -1 and the current characters do not match (s[i] != p[j]),
            //i stays unchanged and j = next[j], the next value corresponding to j
            j = next[j];
        }
    }//while
    if (j == pLen)
    {
        return i - j;
    }
    else
    {
        return -1;
    }
}

Time complexity analysis of the KMP algorithm:

If the length of the text string is n and the length of the pattern string is m, the matching process takes O(n) time and computing the next array takes O(m) time, so the overall time complexity of KMP is O(m + n). The matching bound holds because i never decreases, and j only rises when i advances, so the total number of fallbacks through next is also O(n).

8. References

July's blog, "Thoroughly understanding the KMP algorithm from beginning to end": http://blog.csdn.net/v_july_v/article/details/7041827

Data structures book: Big Talk Data Structure.
