The KMP (Knuth-Morris-Pratt) algorithm, jokingly nicknamed the "watching porn" algorithm after a Chinese pun on its initials, is a very efficient string matching algorithm. Because it is hard to understand, though, I went a long time without really grasping it. There is plenty of material online, but few blog posts explain it clearly and directly. Here I have combined several good posts (see the references at the end) and done my best to explain the ideas and the implementation of the KMP algorithm clearly.
The task of the KMP algorithm is this: given two strings O and F, of lengths n and m respectively, determine whether F occurs in O and, if so, return the position where it occurs. The naive method is to try every position of O and match F starting from that position, but this takes O(nm) time. The KMP algorithm uses O(m) preprocessing to bring the total cost of matching down to O(n+m).
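For comparison, the naive O(nm) method described above can be sketched as follows. This is only an illustration: the class name NaiveSearch and the method name findAll are made up for this post, and the sketch returns every starting index rather than just the first.

```java
import java.util.ArrayList;
import java.util.List;

final class NaiveSearch {
    // Tries every start position in o and compares f character by character.
    // Worst case is O(n*m) because each start may re-compare up to m characters.
    static List<Integer> findAll(String o, String f) {
        List<Integer> positions = new ArrayList<>();
        int n = o.length();
        int m = f.length();
        for (int start = 0; start + m <= n; start++) {
            int k = 0;
            while (k < m && o.charAt(start + k) == f.charAt(k)) {
                k++;
            }
            if (k == m) {
                positions.add(start); // all m characters matched
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(findAll("abababc", "abab")); // prints [0, 2]
    }
}
```

Every time a comparison fails, this version throws away what it learned about the characters it already matched. That wasted work is what KMP avoids.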
KMP Algorithm Idea
Let us first describe the idea of the KMP algorithm with a picture. We are looking for F in the string O, and the two strings turn out to differ at position i of F, so F has to be shifted forward. The usual method shifts by one position at a time, but that ignores the fact that the first i-1 characters have already been compared, so it is inefficient. If we compute some information in advance, it becomes possible to shift by several positions at once. Suppose this information tells us that we can shift forward by k positions, and compare F before and after the shift. Call A the part of the shifted F that lies under the already-matched region, and B the part of the unshifted F that it overlaps. We reach the following conclusions:
- Segment A is a prefix of F.
- Segment B is a suffix of the matched part of F (its first i-1 characters).
- Segment A is equal to segment B.
So, after shifting forward by k positions, we can continue comparing at position i only if the first i-1 characters of F satisfy this condition: their prefix A and their suffix B, both of length i-k-1, are the same. Only then can we shift by k positions and continue the comparison from the new alignment.
So the core of the KMP algorithm is to compute, for each position of F, the maximum length of a common prefix and suffix of the string before that position (excluding the string itself, otherwise the maximum would always be the whole string). Once we have this maximum common length for every position in F, we can use it to compare against O quickly. When two characters differ, we shift F forward by (matched length - maximum common length) characters and then continue the comparison from there. In fact the shift is only conceptual: it is enough to resume the comparison in F at the index equal to the maximum common length, because the characters before that index are already known to match.
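To make the "maximum common length" concrete, here is a deliberately slow sketch that computes it straight from the definition and derives the shift from it. The names CommonLength, maxCommon and shift are made up for illustration, and shift assumes at least one character has already matched (matched >= 1).

```java
final class CommonLength {
    // Longest prefix of s that is also a suffix of s, excluding s itself.
    // Brute force: try every candidate length from longest to shortest.
    static int maxCommon(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            if (s.substring(0, len).equals(s.substring(s.length() - len))) {
                return len;
            }
        }
        return 0;
    }

    // How far F moves forward when its first `matched` characters matched
    // and the next one did not: matched length - maximum common length.
    static int shift(String f, int matched) {
        return matched - maxCommon(f.substring(0, matched));
    }

    public static void main(String[] args) {
        System.out.println(maxCommon("abab"));  // prints 2 ("ab" is both prefix and suffix)
        System.out.println(shift("ababc", 4));  // prints 2
    }
}
```

With F = "ababc" and four characters matched, the common part "ab" is kept and F moves forward by two characters instead of one.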
Next Array Calculation
Now that the basic principle of the KMP algorithm is clear, the next step is to compute the maximum common length for each position of the string F. This array of maximum common lengths is usually called the next array (Introduction to Algorithms calls it the prefix function). One thing to note here is that the next array stores lengths, so its subscript starts at 1, whereas the subscript used to traverse the original string starts at 0.

Suppose we have already obtained next[1], next[2], ..., next[i], the maximum common prefix/suffix lengths of the prefixes of F of length 1 to i, and we now want next[i+1]. If the characters at position i and position next[i] are the same (string subscripts start from zero), then next[i+1] equals next[i] plus 1. If the two characters differ, we split the string of length next[i] further: take its maximum common length next[next[i]] and compare the character at that position with the character at position i. This works because the prefix and the suffix of length next[i] are the same string, so they have the same internal structure. If the characters at position next[next[i]] and position i are the same, then next[i+1] equals next[next[i]] plus 1. If not, we keep splitting the string of length next[next[i]] in the same way until the length reaches 0. From this we can write the code for the next array (Java version):
```java
public int[] getNext(String b) {
    int len = b.length();
    int j = 0;

    // next[i] is the length of the longest common prefix and suffix of the
    // first i characters of b; lengths start at 1, hence len + 1 slots.
    int[] next = new int[len + 1];
    // next[0] and next[1] are always 0; Java zero-initializes the array,
    // so nothing needs to be assigned before the loop.

    for (int i = 1; i < len; i++) { // i is a subscript into the string, starting at 0
        // At the start of each iteration j equals next[i]; it is also the
        // next position of the prefix to compare against.
        while (j > 0 && b.charAt(i) != b.charAt(j)) j = next[j];
        if (b.charAt(i) == b.charAt(j)) j++;
        next[i + 1] = j;
    }

    return next;
}
```
One thing to note about the code above: the next array holds the maximum common prefix/suffix length for the prefixes of F of length 1 to m, so it needs one more slot than the string has characters. The traversal of F still uses subscripts starting from 0. The next values at positions 0 and 1 are always 0, so they are handled outside the loop, and the loop itself runs from subscript 1 to m-1. The structure of the code follows the explanation above: each new next value is found from the next values computed before it.
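To see the indexing on a concrete pattern, the sketch below restates the same routine so that it compiles on its own and prints the next array for "abababca". The class name NextArrayDemo and the method name buildNext are made up for illustration.

```java
import java.util.Arrays;

final class NextArrayDemo {
    // next[i] = longest common prefix/suffix length of the first i characters of f.
    static int[] buildNext(String f) {
        int m = f.length();
        int[] next = new int[m + 1]; // zero-initialized, so next[0] = next[1] = 0
        int j = 0;
        for (int i = 1; i < m; i++) {
            // Fall back through shorter and shorter common lengths on a mismatch.
            while (j > 0 && f.charAt(i) != f.charAt(j)) j = next[j];
            if (f.charAt(i) == f.charAt(j)) j++;
            next[i + 1] = j;
        }
        return next;
    }

    public static void main(String[] args) {
        // prints [0, 0, 0, 1, 2, 3, 4, 0, 1]
        System.out.println(Arrays.toString(buildNext("abababca")));
    }
}
```

For example next[6] = 4 because the first six characters "ababab" have "abab" as both a prefix and a suffix. At the character 'c' the loop falls back from 4 to next[4] = 2, then to next[2] = 0, which gives next[7] = 0.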
String Matching
Once the next array has been computed, we can use it to find where the string F occurs in the string O. The matching code is very similar to the code for the next array, because matching and building the next array are really the same process. Suppose the first i characters of F already match the string O starting from some position, and we now compare character i+1. If character i+1 matches, we go on to character i+2. If it does not, we have a mismatch, and once again we split the matched string of length i: we take its maximum common length next[i] and continue comparing the two strings from next[i]. This is the same process as for the next array, so the matching code can be written as follows (Java version):
```java
public void search(String original, String find, int[] next) {
    int j = 0; // number of characters of find matched so far
    for (int i = 0; i < original.length(); i++) {
        // On a mismatch, fall back to the longest prefix that can still match.
        while (j > 0 && original.charAt(i) != find.charAt(j))
            j = next[j];
        if (original.charAt(i) == find.charAt(j))
            j++;
        if (j == find.length()) {
            // The match ends at index i, so it starts at i - j + 1.
            System.out.println("Find at position " + (i - j + 1));
            System.out.println(original.subSequence(i - j + 1, i + 1));
            j = next[j];
        }
    }
}
```
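One thing to note about the code above is that we reassign j to next[j] every time we find a match. This lets the search continue to later (possibly overlapping) occurrences, and it keeps the next comparison from reading past the end of the pattern.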
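Putting the two parts together, a self-contained sketch might look like the following. It collects the match positions in a list instead of printing them. The names KmpSearch, buildNext and findAll are made up for illustration, and returning no matches for an empty pattern is an assumption of this sketch, not something the text above specifies.

```java
import java.util.ArrayList;
import java.util.List;

final class KmpSearch {
    // next[i] = longest common prefix/suffix length of the first i characters of f.
    static int[] buildNext(String f) {
        int[] next = new int[f.length() + 1];
        int j = 0;
        for (int i = 1; i < f.length(); i++) {
            while (j > 0 && f.charAt(i) != f.charAt(j)) j = next[j];
            if (f.charAt(i) == f.charAt(j)) j++;
            next[i + 1] = j;
        }
        return next;
    }

    // Returns the starting index of every occurrence of f in o.
    static List<Integer> findAll(String o, String f) {
        List<Integer> positions = new ArrayList<>();
        if (f.isEmpty()) return positions; // assumption: empty pattern has no matches
        int[] next = buildNext(f);
        int j = 0; // characters of f matched so far
        for (int i = 0; i < o.length(); i++) {
            while (j > 0 && o.charAt(i) != f.charAt(j)) j = next[j];
            if (o.charAt(i) == f.charAt(j)) j++;
            if (j == f.length()) {
                positions.add(i - j + 1); // match ends at i
                j = next[j];              // reset so overlapping matches are found
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(findAll("aaaa", "aa"));      // prints [0, 1, 2]
        System.out.println(findAll("abababc", "abab")); // prints [0, 2]
    }
}
```

In the "aaaa" example the three matches overlap. They are all found only because j is set back to next[j] after each hit. Without that reset, the next call to f.charAt(j) would index past the end of the pattern.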
Complexity
The complexity of the KMP algorithm is O(n+m). This can be shown with amortized analysis; for the details, see Introduction to Algorithms.
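The amortized argument can be checked with a small experiment. During matching, j grows by at most one per character of O, and every fallback j = next[j] lowers it by at least one, so the while loop cannot fall back more than n times in total. The sketch below counts those fallbacks. The names FallbackCount, countFallbacks and repeat are made up for illustration, and the pattern is assumed to be non-empty.

```java
final class FallbackCount {
    static int[] buildNext(String f) {
        int[] next = new int[f.length() + 1];
        int j = 0;
        for (int i = 1; i < f.length(); i++) {
            while (j > 0 && f.charAt(i) != f.charAt(j)) j = next[j];
            if (f.charAt(i) == f.charAt(j)) j++;
            next[i + 1] = j;
        }
        return next;
    }

    // Runs the matching loop and counts how often the while loop falls back.
    // Assumes f is non-empty.
    static int countFallbacks(String o, String f) {
        int[] next = buildNext(f);
        int j = 0;
        int fallbacks = 0;
        for (int i = 0; i < o.length(); i++) {
            while (j > 0 && o.charAt(i) != f.charAt(j)) {
                j = next[j];
                fallbacks++;
            }
            if (o.charAt(i) == f.charAt(j)) j++;
            if (j == f.length()) j = next[j];
        }
        return fallbacks;
    }

    // Small helper: s concatenated `times` times.
    static String repeat(String s, int times) {
        StringBuilder sb = new StringBuilder();
        for (int r = 0; r < times; r++) sb.append(s);
        return sb.toString();
    }

    public static void main(String[] args) {
        String o = repeat("aaab", 1000); // an unfriendly input: many near-matches
        int fallbacks = countFallbacks(o, "aaaa");
        System.out.println(fallbacks + " fallbacks for n = " + o.length());
    }
}
```

Even on an input made of repeated near-matches, the number of fallbacks stays at or below the length of O, which is why the matching phase is linear.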
References
1. KMP Algorithm Summary
2. KMP algorithm Detailed
3. KMP algorithm
4. Understanding and implementation of KMP algorithm
Open Source Implementation
If you want to see this algorithm in actual use, here is an example: Java notepad
PS:
Finally, here are a few more diagrams that I hope will help your understanding.
Koch curve
Repeated expansion of its own structure