KMP algorithm and its Java implementation

Source: Internet
Author: User

The KMP algorithm, jokingly nicknamed the "See the Cat Slice" algorithm, is an improved string pattern matching algorithm that completes a match in O(n+m) time. Its core idea is this: when a character mismatch occurs during matching, the main string pointer does not need to backtrack. Instead, the "partial match" information already obtained is used to slide the pattern string as far to the right as possible, and the comparison then continues.

The KMP ("See the Cat Slice") algorithm

1. Naive string pattern matching algorithm

Finding the position of one string (the pattern string) within another string (the main string) is called string pattern matching.

In the naive string pattern matching algorithm, we keep a pointer i into the main string S and a pointer j into the pattern string T. Assume that string indices start at 0 and that i and j both initially point to position 0 of their strings. At the start of the nth pass, i points to position n-1 of S and j points to position 0 of T, and the characters are then compared one by one from left to right. If every character of T equals the corresponding character of S, the match succeeds. Otherwise, as soon as a pair of characters differs, i is reset to position n of S, j is reset to position 0 of T, and the (n+1)th pass begins.

For example, match the pattern string T = "ABAABCAC" against the main string S = "ABCABAABAABCACB". Figure 1.1 shows the 4th pass: S[3...7] and T[0...4] are equal, but at i=8, j=5, S[8] and T[5] differ, so the pass fails. We therefore set i=4, j=0, which is equivalent to moving the pattern string one position to the right, and start the next pass, as in Figure 1.2.


Figure 1.1 When i=8, j=5, the characters differ and the pass fails

Figure 1.2 The next pass starts after the pattern string is moved one position to the right
With this method, the worst-case time complexity of string matching is O(n*m), where n and m are the lengths of the main string and the pattern string respectively.
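
The naive procedure described above can be sketched in Java as follows. This is only an illustration: the class name NaiveMatch and the method name indexOf are chosen here and do not come from the original article.

```java
// A minimal sketch of naive pattern matching.
// NaiveMatch and indexOf are illustrative names, not part of the article's code.
final class NaiveMatch {

    /** Returns the index of the first occurrence of t in s, or -1 if there is none. */
    static int indexOf(String s, String t) {
        int i = 0; // pointer into the main string s
        int j = 0; // pointer into the pattern string t
        while (i < s.length() && j < t.length()) {
            if (s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
            } else {
                // Mismatch: i goes back to the start of this pass plus one,
                // and j goes back to the start of the pattern.
                i = i - j + 1;
                j = 0;
            }
        }
        return j == t.length() ? i - j : -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("ABCABAABAABCACB", "ABAABCAC")); // prints 6
    }
}
```

The line `i = i - j + 1` is the backtracking of the main string pointer that makes the worst case O(n*m).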

2. An improved string pattern matching algorithm: KMP

In the example above, when i=8, j=5, S[8] and T[5] differ, so we set i=4, j=0, which moves the pattern string one position to the right, and start the next pass. On closer inspection, however, the next two passes, those starting at i=4, j=0 and at i=5, j=0, are unnecessary. The previous pass has already matched the prefix "ABAAB" of T against S[3...7]. Moving T one position to the right lines up "ABAAB..." in T with "BAAB..." in S, which obviously cannot match. Moving T one more position lines up "ABAAB..." in T with "AAB..." in S, which still cannot match. Only when T has moved 3 positions to the right does "ABAAB..." in T line up with "AB..." in S, and only then is it worth continuing the comparison, as shown in Figure 2.1.


Figure 2.1 After the pass fails, the comparison is worth continuing only once T has moved 3 positions to the right
Thus, when i=8, j=5, the prefix "ABAAB" of T has been matched and the next character differs, there is no need to move the i pointer back. We set i=8, j=2 and continue comparing forward, which is equivalent to moving T 3 positions to the right and resuming the comparison at the 3rd character of T, namely T[2], as shown in Figure 2.2.

Figure 2.2 After the pass fails, set i=8, j=2 directly and continue comparing forward

This is the basic idea of the KMP algorithm. For each prefix of the pattern string T made up of its first j characters, the array entry next[j] stores the position that j should fall back to: when T[j] fails to match the current character of the main string, the i pointer is left unchanged, j is set to next[j], and the comparison continues. In the example above, "ABAAB" is the prefix made up of the first 5 characters of T, and next[5]=2. So when i=8, j=5 and S[8] differs from T[5], we set i=8, j=next[j]=next[5]=2 and continue comparing.

Therefore, the core of the KMP algorithm is computing the next array, that is, the value next[j] for every prefix of T of length j (0<j<T.length).

Computing the next array

Before computing the next array, we need to understand what it means. Return to the previous example: when the character following T's prefix "ABAAB" differs from the main string, the main string pointer i is unchanged and j falls back to 2, pointing at the 3rd character of T. The underlying reason is that the prefix and the suffix of "ABAAB" share the longest common string "AB", of length 2. We can therefore skip comparing that prefix "AB" with that suffix "AB" and go straight to the characters that follow them, namely T[2] and S[8].

As another example, take the pattern string T = "ABACAABADAD" and suppose T[0...7], that is, "ABACAABA", has already been matched when the comparison at T[8] fails. The prefix and suffix of T[0...7] share the longest common string "ABA", of length 3, so next[8]=3. We set j=next[j]=next[8]=3, leave i unchanged, and continue comparing from the 4th character of T, namely T[3], as shown in Figure 2.3.


Figure 2.3 Matching T[8] fails; i is unchanged and j falls back to 3

In summary, for a pattern string T, next[j] is the length of the longest string that is both a prefix and a suffix of the substring formed by the first j characters of T, not counting that substring itself.
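
To make this definition concrete, the following sketch computes next[j] directly from the definition by brute force, trying every candidate length from longest to shortest. It is far slower than the method described next and is only meant for checking small examples by hand. The names NextByDefinition, longestPrefixSuffix and nextArray are illustrative assumptions.

```java
import java.util.Arrays;

// Brute-force illustration of the definition of next[j].
// All names in this class are illustrative, not from the original article.
final class NextByDefinition {

    /** Length of the longest string that is both a prefix and a suffix of p, excluding p itself. */
    static int longestPrefixSuffix(String p) {
        for (int len = p.length() - 1; len > 0; len--) {
            String suffix = p.substring(p.length() - len);
            if (p.startsWith(suffix)) {
                return len;
            }
        }
        return 0;
    }

    /** Builds the next array of t, using the convention next[0] = -1. */
    static int[] nextArray(String t) {
        int[] next = new int[t.length()];
        for (int j = 0; j < t.length(); j++) {
            next[j] = (j == 0) ? -1 : longestPrefixSuffix(t.substring(0, j));
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(longestPrefixSuffix("ABAAB"));    // prints 2
        System.out.println(longestPrefixSuffix("ABACAABA")); // prints 3
        System.out.println(Arrays.toString(nextArray("ABAABCAC")));
        // prints [-1, 0, 0, 1, 1, 2, 0, 1]
    }
}
```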

The algorithm for computing the next array of a string T is as follows:

  1. next[0]=-1, next[1]=0.
  2. To compute next[j], let k=next[j-1].
  3. Compare T[j-1] with T[k]:

    A. If T[j-1] equals T[k], then next[j]=k+1.

    B. If T[j-1] does not equal T[k], let k=next[k]. If k equals -1, then next[j]=0; otherwise go back to step 3.
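
The same recurrence is commonly written as a single loop with two indices. The sketch below is that equivalent formulation rather than a line-by-line transcription of the steps above; the names NextArraySketch and buildNext are assumptions made for this example.

```java
import java.util.Arrays;

// Single-loop formulation of the next-array recurrence.
// NextArraySketch and buildNext are illustrative names.
final class NextArraySketch {

    static int[] buildNext(String t) {
        int[] next = new int[t.length()];
        if (t.isEmpty()) {
            return next; // nothing to compute for an empty pattern
        }
        next[0] = -1;
        int j = 0;  // next[0..j] has been computed so far
        int k = -1; // current candidate, starting from next[j]
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                // The candidate can be extended by one character (or there is none left).
                j++;
                k++;
                next[j] = k;
            } else {
                // Fall back to the next shorter candidate.
                k = next[k];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildNext("ABAABCAC")));
        // prints [-1, 0, 0, 1, 1, 2, 0, 1]
    }
}
```

Here `t.charAt(j)` plays the role of T[j-1] in the steps above, because the loop fills in next[j+1] while looking at position j.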

The following uses the pattern string T = "ABAABCAC" to show how the next array is computed:

  1. next[0]=-1, next[1]=0.
  2. When j=2, k=next[j-1]=next[1]=0. Since T[j-1]=T[1]='B' and T[k]=T[0]='A', T[j-1] does not equal T[k], so k=next[k]=next[0]=-1, and therefore next[2]=0.
  3. When j=3, k=next[j-1]=next[2]=0. Since T[j-1]=T[2]='A' and T[k]=T[0]='A', T[j-1] equals T[k], so next[3]=k+1=1.
  4. When j=4, k=next[j-1]=next[3]=1. Since T[j-1]=T[3]='A' and T[k]=T[1]='B', T[j-1] does not equal T[k], so k=next[k]=next[1]=0. Now T[k]=T[0]='A' and T[j-1] equals T[k], so next[4]=k+1=1.
  5. When j=5, k=next[j-1]=next[4]=1. Since T[j-1]=T[4]='B' and T[k]=T[1]='B', T[j-1] equals T[k], so next[5]=k+1=2.
  6. When j=6, k=next[j-1]=next[5]=2. Since T[j-1]=T[5]='C' and T[k]=T[2]='A', T[j-1] does not equal T[k], so k=next[k]=next[2]=0. Now T[k]=T[0]='A' and T[j-1] still does not equal T[k], so k=next[k]=next[0]=-1, and therefore next[6]=0.
  7. When j=7, k=next[j-1]=next[6]=0. Since T[j-1]=T[6]='A' and T[k]=T[0]='A', T[j-1] equals T[k], so next[7]=k+1=1.

Once the whole next array has been computed, a small change to the naive matching algorithm gives the KMP matching algorithm: when the comparison at character j of the pattern string T fails, the i pointer is left unchanged and the j pointer is set to next[j]. If j becomes -1, both i and j are incremented by 1. The character-by-character comparison then continues.

The following uses the pattern string T = "ABAABCAC" and the main string S = "ABCABAABAABCACB" to show the whole KMP matching process.
The next array of the pattern string T was computed above as [-1, 0, 0, 1, 1, 2, 0, 1].

    1. Initially i=0, j=0: the characters match.
    2. i=1, j=1: the characters match.
    3. i=2, j=2: the characters do not match.
    4. i=2, j=next[2]=0: the characters do not match.
    5. i=2, j=next[0]=-1: j is -1, so no comparison is made and both pointers are incremented.
    6. i=2+1=3, j=-1+1=0: the characters match.
    7. i=4, j=1: the characters match.
    8. i=5, j=2: the characters match.
    9. i=6, j=3: the characters match.
    10. i=7, j=4: the characters match.
    11. i=8, j=5: the characters do not match.
    12. i=8, j=next[5]=2: the characters match.
    13. The comparison continues forward. Every remaining comparison succeeds, so the steps are not listed here. After the comparison at i=13, j=7 succeeds, the whole pattern string has been matched, starting at position 6 of S.
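
The trace above can be reproduced with a small instrumented matcher. The sketch below takes the next array as a parameter (here the values listed above) and records one line per step. The class name KmpTrace, the method name trace and the wording of the log lines are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Records each step of KMP matching so that it can be compared with the list above.
// KmpTrace and trace are illustrative names.
final class KmpTrace {

    static List<String> trace(String s, String t, int[] next) {
        List<String> log = new ArrayList<>();
        int i = 0, j = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1) {
                // No comparison is possible: advance both pointers.
                log.add("i=" + i + ", j=-1: advance both");
                i++;
                j++;
            } else if (s.charAt(i) == t.charAt(j)) {
                log.add("i=" + i + ", j=" + j + ": match");
                i++;
                j++;
            } else {
                log.add("i=" + i + ", j=" + j + ": mismatch");
                j = next[j]; // i is never moved back
            }
        }
        log.add(j == t.length() ? "found at " + (i - j) : "not found");
        return log;
    }

    public static void main(String[] args) {
        int[] next = {-1, 0, 0, 1, 1, 2, 0, 1}; // next array of "ABAABCAC"
        for (String line : trace("ABCABAABAABCACB", "ABAABCAC", next)) {
            System.out.println(line);
        }
    }
}
```

For the example strings, the log contains the steps at i=8, j=5 (mismatch) and i=8, j=2 (match) and ends with "found at 6".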

That is the whole KMP matching process. To sum up, the essence of the KMP algorithm is trading space for time: before matching starts, some information about the pattern string (the next array) is precomputed, and during matching this information is used to skip unnecessary comparisons and so improve efficiency. In practice, the running time of the naive pattern matching algorithm is often close to that of KMP. KMP brings a clear improvement only when there are many "partial matches" between the main string and the pattern string.
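
As a rough illustration of the "many partial matches" case, the sketch below counts character comparisons made by the naive matcher and by the KMP matcher on an input chosen to be unfavorable for the naive method: a long run of 'A' followed by 'B'. The class ComparisonCount, its methods and the test input are assumptions for this example, and the count for KMP covers only the matching phase, not the construction of the next array.

```java
import java.util.Arrays;

// Counts character comparisons for naive matching and for KMP matching.
// All names and the sample input are illustrative.
final class ComparisonCount {

    static long comparisons; // set by the most recent call to naive() or kmp()

    static int naive(String s, String t) {
        comparisons = 0;
        int i = 0, j = 0;
        while (i < s.length() && j < t.length()) {
            comparisons++;
            if (s.charAt(i) == t.charAt(j)) { i++; j++; }
            else { i = i - j + 1; j = 0; }
        }
        return j == t.length() ? i - j : -1;
    }

    static int kmp(String s, String t) {
        comparisons = 0;
        int[] next = buildNext(t);
        int i = 0, j = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1) { i++; j++; continue; } // no comparison in this step
            comparisons++;
            if (s.charAt(i) == t.charAt(j)) { i++; j++; }
            else { j = next[j]; }
        }
        return j == t.length() ? i - j : -1;
    }

    static int[] buildNext(String t) {
        int[] next = new int[t.length()];
        if (t.isEmpty()) return next;
        next[0] = -1;
        int j = 0, k = -1;
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) { j++; k++; next[j] = k; }
            else { k = next[k]; }
        }
        return next;
    }

    /** Returns a string of n 'A' characters. */
    static String repeatA(int n) {
        char[] a = new char[n];
        Arrays.fill(a, 'A');
        return new String(a);
    }

    public static void main(String[] args) {
        String s = repeatA(1000) + "B";
        String t = repeatA(10) + "B";
        System.out.println("naive: index " + naive(s, t) + ", comparisons " + comparisons);
        System.out.println("kmp:   index " + kmp(s, t) + ", comparisons " + comparisons);
    }
}
```

On this input the naive matcher repeats almost the whole pattern comparison at every starting position, while the KMP matcher makes at most about two comparisons per character of the main string.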

3. Java implementation of the KMP algorithm

The Java code for the KMP algorithm is given below. It has two parts: computing the next array, and the KMP matching process itself.

    public class KMP {

        /**
         * Computes the next array of a pattern.
         *
         * @param t the pattern as a character array
         * @return the next array of t (empty if t is empty)
         */
        public static int[] getNextArray(char[] t) {
            int[] next = new int[t.length];
            if (t.length == 0) {
                return next; // guard: an empty pattern has no next[0]
            }
            next[0] = -1;
            if (t.length > 1) {
                next[1] = 0; // guard: a one-character pattern has no next[1]
            }
            for (int j = 2; j < t.length; j++) {
                int k = next[j - 1];
                while (k != -1) {
                    if (t[j - 1] == t[k]) {
                        next[j] = k + 1;
                        break;
                    }
                    k = next[k];
                }
                if (k == -1) {
                    next[j] = 0; // no candidate could be extended
                }
            }
            return next;
        }

        /**
         * KMP pattern matching of the pattern string t against the main string s.
         *
         * @param s the main string
         * @param t the pattern string
         * @return the position in s of the first character of the first match,
         *         or -1 if there is no match
         */
        public static int kmpMatch(String s, String t) {
            char[] sArr = s.toCharArray();
            char[] tArr = t.toCharArray();
            int[] next = getNextArray(tArr);
            int i = 0, j = 0;
            while (i < sArr.length && j < tArr.length) {
                if (j == -1 || sArr[i] == tArr[j]) {
                    i++;
                    j++;
                } else {
                    j = next[j]; // i stays where it is
                }
            }
            return j == tArr.length ? i - j : -1;
        }

        public static void main(String[] args) {
            System.out.println(kmpMatch("ABCABAABAABCACB", "ABAABCAC")); // prints 6
        }
    }

References and acknowledgements

While learning the KMP algorithm, I referred extensively to the "Data Structure Review Guide" from the "King's Examination" series and to the article "Thoroughly Understand KMP from Beginning to End" by the CSDN blogger v_JULY_v. My thanks to both.

Thanks also to the Weibo blogger @Memories of the special vest and to Red Robe Cocoxu from the lab for supplying a large number of cat pieces, which kept me motivated throughout the process of learning the KMP algorithm.

If there are errors in the article, please correct me!
