Please cite the source when reprinting: http://blog.csdn.net/fightlei/article/details/52712461
First, we need to understand what pattern matching is.
Locating a substring within a string is known as pattern matching or string matching. In string matching, the main string is usually called the target string and the substring is called the pattern string. Throughout this post, S denotes the target string and T denotes the pattern string; the process of finding the pattern string T in the target string S is called pattern matching.
Although the main subject here is the KMP pattern matching algorithm, we start with the brute-force matching algorithm, identify its problems, and use them to motivate the KMP algorithm.
A simple pattern matching algorithm
"Basic Ideas"
Compare the first character of the target string S with the first character of the pattern string T. If they are equal, go on to compare the next characters of both strings; otherwise, go back and compare the second character of the target string with the first character of the pattern string T, and so on. If the pattern string T turns out to be equal to some substring of the target string S, the match succeeds and the position of T in S is returned; if S has no substring equal to T, the match fails and -1 is returned. This algorithm is also known as the BF (Brute-Force) algorithm.
Let's start with a simple example of how the BF algorithm works. Suppose the target string S is "ABABB" and the pattern string T is "ABB". The example is simple enough that the entire matching process can be traced one comparison (one "trip") at a time.
As you can see, the matching process follows the basic idea above exactly. It starts by comparing the first character of the target string S with the first character of the pattern string T (first trip); if they are equal, it compares the next characters of both (second trip); otherwise, it starts again from the second character of the target string and compares it with the first character of the pattern string T (third trip, fourth trip). Let's focus on the third trip: when we find s[i] != t[j], we start again from the second character of the target string S, so i goes back to i = i - j + 1. Since i - j is the starting position of the current attempt, i - j + 1 means the comparison continues from the position right after it. At the same time j goes back to 0, so the comparison resumes with the first character of the pattern string T.
"BF Algorithm Implementation"
/*
 * BF matching algorithm
 */
public static int violentMatching(String s, String t) {
    int i = 0;
    int j = 0;
    while (i < s.length() && j < t.length()) {
        if (s.charAt(i) == t.charAt(j)) {
            i++;
            j++;
        } else {
            // i goes back to the position right after this trip's starting position
            i = i - j + 1;
            j = 0;
        }
    }
    // j == t.length() means some substring of the target string s matches the pattern string t exactly
    if (j == t.length()) {
        // return this trip's starting position, i.e. the position of t in s
        return i - j;
    } else {
        return -1;
    }
}
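To see the trips one by one, the brute-force loop can be instrumented to print every comparison. The following is only a sketch: the class name BruteForceTrace, the method name search, and the comparisons counter are illustrative choices and do not come from the original listing.

```java
// Sketch only: class, method, and field names are illustrative assumptions.
final class BruteForceTrace {

    /** Number of character comparisons made by the most recent call to search(). */
    static int comparisons;

    /** Brute-force search: index of the first occurrence of t in s, or -1. */
    static int search(String s, String t) {
        comparisons = 0;
        int i = 0; // index into the target string s
        int j = 0; // index into the pattern string t
        while (i < s.length() && j < t.length()) {
            comparisons++;
            boolean equal = s.charAt(i) == t.charAt(j);
            System.out.printf("trip %d: s[%d]=%c %s t[%d]=%c%n",
                    comparisons, i, s.charAt(i), equal ? "==" : "!=", j, t.charAt(j));
            if (equal) {
                i++;
                j++;
            } else {
                i = i - j + 1; // restart one position after this attempt's start
                j = 0;
            }
        }
        return j == t.length() ? i - j : -1;
    }

    public static void main(String[] args) {
        int pos = search("ABABB", "ABB");
        System.out.println("match at " + pos + " after " + comparisons + " comparisons");
    }
}
```

Running it on the example strings prints one line per trip, which makes the wasted fourth trip (s[1] against t[0]) easy to spot.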
The BF algorithm is simple to implement, and its idea is straightforward and easy to understand. But it has a problem:
After the first trip we know s[0] = t[0]; after the second, s[1] = t[1]; after the third, s[2] != t[2]. Looking at the pattern string T itself, we also find t[0] != t[1]. Combined with s[1] = t[1], this immediately gives t[0] != s[1], so the fourth trip is unnecessary. Because this example is so simple, it may not show the advantage of the KMP algorithm clearly, so let's look at a slightly more complicated one:
Suppose the target string S is "ABABCABCACB" and the pattern string T is "ABCAC". The first mismatch occurs when comparing s[2] and t[2].
Under the BF algorithm, the next trip would compare s[1] with t[0]. But from the previous trips we already know s[0] = t[0], s[1] = t[1], and s[2] != t[2]. Looking at the pattern string T itself, we find t[0] != t[1], so we can immediately conclude s[1] != t[0]. That comparison can be skipped, and we can start directly by comparing s[2] with t[0].
The next mismatch occurs when comparing s[6] and t[4]. If we kept following the BF algorithm, we would clearly make several more unnecessary comparisons. So from which two positions of the target string and the pattern string should the comparison resume?
At this point we have the following information: s[2] = t[0], s[3] = t[1], s[4] = t[2], s[5] = t[3], s[6] != t[4]. Looking at the pattern string T, we get:
(1) t[0] != t[1], so t[0] != s[3], and this comparison can be skipped.
(2) t[0] != t[2], so t[0] != s[4], and this comparison can be skipped.
(3) t[0] = t[3], so t[0] = s[5]. When characters are equal we move on to the next characters of both strings, so the comparison resumes with s[6] and t[1].
With this method, only three rounds of matching are needed before the match succeeds, which speeds up the matching.
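The reasoning in (1) to (3) can be phrased as a test on the pattern alone: after j characters have matched, sliding the pattern by d positions is only worth trying if the first j - d matched characters agree with the pattern shifted by d. Here is a minimal sketch of that test; the class name ShiftCheck and the method name firstUsefulShift are illustrative, and the code deliberately ignores what is known about the mismatched character itself.

```java
// Sketch only: names are illustrative assumptions, not part of the original post.
final class ShiftCheck {

    /**
     * After t[0..j-1] matched and t[j] mismatched, returns the smallest slide d >= 1
     * for which t[0..j-d-1] equals t[d..j-1]. Smaller slides cannot lead to a match.
     */
    static int firstUsefulShift(String t, int j) {
        if (j == 0) {
            return 1; // mismatch on the very first character: move one position
        }
        for (int d = 1; d <= j; d++) {
            // Compare the pattern with itself shifted by d over the matched part.
            if (t.regionMatches(0, t, d, j - d)) {
                return d; // d == j compares zero characters, so the loop always returns
            }
        }
        return j; // not reached; kept so the method compiles
    }

    public static void main(String[] args) {
        String t = "ABCAC";
        int d = firstUsefulShift(t, 4);
        System.out.println("slide by " + d + ", resume at t[" + (4 - d) + "]");
    }
}
```

For the mismatch at t[4] in the example, slides of 1 and 2 are rejected and a slide of 3 is accepted, so the comparison resumes at t[1], matching points (1) to (3).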
The example above only conveys the idea behind the method. What exactly is the method, how can it be described precisely, and how is it implemented in code? The following sections answer these questions.
KMP Pattern Matching algorithm
This algorithm was discovered by D. E. Knuth, J. H. Morris, and V. R. Pratt, so it is called the Knuth-Morris-Pratt algorithm, or the KMP algorithm for short.
The KMP algorithm is a pattern matching algorithm that never backtracks on the target string S. Look back at the example above: throughout the process the target string S is never backtracked; only the pattern string T is. From the earlier analysis, the key to this algorithm is that, when a mismatch occurs, we must be able to determine which character of the pattern string T should be compared next with the mismatched character of the target string S. The three authors found that which pattern character to use depends only on the pattern string T itself and is independent of the target string S.
The key to the KMP algorithm is the next array. Its role is this: when a mismatch s[i] != t[j] occurs, next[j] gives the index of the character in T that should be compared with s[i] next (note that in the KMP algorithm, i never backtracks). Note also that when next[j] = -1, no character in T is compared with s[i]; the next comparison is between t[0] and s[i+1]. This means the KMP algorithm has to compute the next value for every position of the pattern string T before matching begins, that is, next[j] for j = 0, 1, 2, ..., n-1, where n is the length of T.
Solving the next array
By the definition of the next array, once s[i] != t[j] occurs during matching, the comparison continues with t[next[j]] and s[i], which is equivalent to sliding the pattern string T to the right by j - next[j] positions.
Understanding this slide is the key to understanding the next array. For brevity, let k denote next[j]. The pattern characters up to and including the mismatch, t[0]~t[j], number j + 1. After the mismatch, t[next[j]], that is t[k], is compared with s[i]. The part of the pattern that still lies at or before s[i] after the slide is t[0]~t[k], which is k + 1 characters, so the part that has slid past is j + 1 - (k + 1) = j - k characters. In other words, the pattern slides right by j - next[j] positions.
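The slide distance can be made concrete with the earlier example. The sketch below assumes the next values {-1, 0, 0, 0, 1} for the pattern "ABCAC", which are worked out by hand here rather than quoted from the original post; the class and helper names are likewise illustrative.

```java
// Sketch only: class and helper names are illustrative assumptions.
final class SlideDemo {

    /** Distance the pattern slides to the right after a mismatch at pattern index j. */
    static int slide(int[] next, int j) {
        return j - next[j];
    }

    /** Returns t indented by offset spaces, to show its alignment under s. */
    static String aligned(String t, int offset) {
        StringBuilder sb = new StringBuilder();
        for (int n = 0; n < offset; n++) {
            sb.append(' ');
        }
        return sb.append(t).toString();
    }

    public static void main(String[] args) {
        String s = "ABABCABCACB";
        String t = "ABCAC";
        int[] next = {-1, 0, 0, 0, 1}; // assumed next values for "ABCAC", derived by hand
        int start = 2;                 // the pattern is currently aligned at s[2]
        int j = 4;                     // mismatch: s[6] != t[4]
        System.out.println(s);
        System.out.println(aligned(t, start));                  // before the slide
        System.out.println(aligned(t, start + slide(next, j))); // after the slide
    }
}
```

When next[j] is -1 the same formula gives a slide of j + 1, which corresponds to moving on to s[i+1] and t[0].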
When the mismatch occurs, we know the following:
s[i-j] = t[0], s[i-j+1] = t[1], ..., s[i-k] = t[j-k], s[i-k+1] = t[j-k+1], ..., s[i-2] = t[j-2], s[i-1] = t[j-1]
After the pattern string T has slid to the right, the following must hold:
s[i-k] = t[0], s[i-k+1] = t[1], s[i-k+2] = t[2], ..., s[i-2] = t[k-2], s[i-1] = t[k-1]
Combining the two gives:
t[0] = t[j-k], t[1] = t[j-k+1], t[2] = t[j-k+2], ..., t[k-2] = t[j-2], t[k-1] = t[j-1]
This means that within the substring t[0]~t[j-1] of the pattern string T, the subsequence made up of its first k characters (t[0]~t[k-1], called the prefix subsequence) equals the subsequence made up of its last k characters (t[j-k]~t[j-1], called the suffix subsequence). Several values of k (with k < j) may satisfy this condition; take the largest.
The problem of computing the next array is thus transformed into the problem of finding the longest equal prefix and suffix subsequence.
Here are two examples of what the longest equal prefix and suffix subsequence is.
For the substring "aaabcdbaaa", the largest k satisfying the condition is 3, and this maximum is the value used for next[j]. Here the longest prefix subsequence is "aaa" and the longest suffix subsequence is "aaa".
For the substring "ABCABCA", the longest equal prefix and suffix subsequence is "ABCA".
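The definition can be checked directly by trying every k from the largest down. This is a deliberately naive sketch meant only to pin down the definition, not the method KMP uses; the class name BorderByDefinition and the method name longestBorder are illustrative.

```java
// Sketch only: a direct, quadratic-time reading of the definition.
final class BorderByDefinition {

    /** Largest k < s.length() such that the first k and last k characters of s are equal. */
    static int longestBorder(String s) {
        for (int k = s.length() - 1; k > 0; k--) {
            // Compare s[0..k-1] with s[length-k..length-1].
            if (s.regionMatches(0, s, s.length() - k, k)) {
                return k;
            }
        }
        return 0; // no non-empty equal prefix and suffix
    }

    public static void main(String[] args) {
        System.out.println(longestBorder("aaabcdbaaa"));
        System.out.println(longestBorder("ABCABCA"));
    }
}
```

Note that the prefix and suffix are allowed to overlap, as they do in "ABCABCA".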
"Algorithm implementation of the next array"
public static int[] getNext(String t) {
    int[] next = new int[t.length()];
    next[0] = -1;
    int suffix = 0;  // suffix index
    int prefix = -1; // prefix index
    while (suffix < t.length() - 1) {
        // if the prefix index is -1 or the characters are equal, advance both indices
        if (prefix == -1 || t.charAt(prefix) == t.charAt(suffix)) {
            ++prefix;
            ++suffix;
            next[suffix] = prefix; // 1
        } else {
            prefix = next[prefix]; // 2
        }
    }
    return next;
}
The code is not complex. The overall idea is to take each t[0]~t[suffix-1] as a substring and, one after another, find the length of its longest equal prefix and suffix subsequence, which is the value of next[suffix].
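One way to watch the routine at work is to instrument it so that every extension and every fallback is printed. The sketch below does this for the pattern "aaabc", which is an assumed example chosen because it starts with three equal characters followed by a different one; the class name NextTrace is illustrative.

```java
// Sketch only: the same loop as the next-array routine, with tracing added.
final class NextTrace {

    /** Computes the next array of t (assumes t is non-empty) and prints each step. */
    static int[] getNext(String t) {
        int[] next = new int[t.length()];
        next[0] = -1;
        int suffix = 0;
        int prefix = -1;
        while (suffix < t.length() - 1) {
            if (prefix == -1 || t.charAt(prefix) == t.charAt(suffix)) {
                ++prefix;
                ++suffix;
                next[suffix] = prefix;
                System.out.printf("extend:   next[%d] = %d%n", suffix, prefix);
            } else {
                System.out.printf("fallback: t[%d] != t[%d], prefix %d -> %d%n",
                        prefix, suffix, prefix, next[prefix]);
                prefix = next[prefix];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        int[] next = getNext("aaabc"); // assumed example pattern
        System.out.println(java.util.Arrays.toString(next));
    }
}
```

The fallback lines show prefix stepping down through earlier next values until it reaches -1, at which point the next extension records a length of 0.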
Two places are hard to understand; I have marked them 1 and 2, and we will look at them in turn. Initially, prefix is -1, suffix is 0, and next[0] = -1.
The if condition prefix == -1 holds, so we enter the if branch: prefix = prefix + 1, suffix = suffix + 1, and next[suffix] is assigned prefix, i.e. next[1] = 0. What exactly does prefix + 1 represent? And what does next[suffix] mean?
next[suffix] is the length of the longest equal prefix and suffix subsequence of the substring that does not include position suffix, namely t[0]~t[suffix-1]. In other words, the first next[suffix] characters of that substring equal its last next[suffix] characters.
In the code, suffix becomes 1 and prefix becomes 0 after the increments. next[1] is, for the substring "a", the length of its longest equal prefix and suffix subsequence, which is prefix, namely 0. Throughout, prefix means this: in the substring t[0]~t[suffix-1], the first prefix characters equal the last prefix characters, and that is exactly next[suffix].
Continuing, the if condition t[0] = t[1] holds, so suffix becomes 2 and prefix becomes 1, giving next[2] = 1: the substring "aa" has the longest equal prefix and suffix subsequence "a", of length 1.
Continuing, the if condition t[1] = t[2] holds, so suffix becomes 3 and prefix becomes 2, giving next[3] = 2: the substring "aaa" has the longest equal prefix and suffix subsequence "aa", of length 2.
Going further, we find t[2] != t[3], so the condition fails and we enter the else branch, where prefix backtracks: prefix = next[prefix]. This is the second difficult point. Why backtrack in this way?
A diagram widely circulated online helps answer this question.
The diagram appears all over the Internet, but detailed explanations of what it means are rare. In the diagram, j corresponds to suffix in the code, k corresponds to prefix, and P denotes the pattern string T.
The diagram shows exactly the problem we have run into: P[j] != P[k], after which k falls back to next[k]. Since prefix has reached k and suffix has reached j, we know at least that in the substring P[0]~P[j-1] the first k characters equal the last k characters; these are the blue areas at the front and back of the diagram. If P[k] = P[j] held, then in the substring P[0]~P[j] the first k+1 characters would equal the last k+1 characters. But it does not hold, so the substring P[0]~P[j] has no equal prefix and suffix subsequence of length k+1. There may still be a shorter one, of length k, k-1, k-2, k-3, ..., or none at all, in which case the length is 0. The natural approach would be to try k next, then k-1 if that fails, and so on. So why does the algorithm jump straight back to next[k]?
For ease of description, I label the different areas of the diagram with uppercase letters: X is the first k characters of P, Y is the last k characters of P[0]~P[j-1], and Z is the last k characters of P[0]~P[j]. The natural approach above says that when length k+1 does not work we should try length k, so let's see why we do not. For length k to work, the characters in region X must equal the characters in region Z. We know region X equals region Y, so region Y would have to equal region Z. But Y and Z are offset by only one position, which means the first character of Y is compared with its second, the second with its third, and so on. So the condition can hold only if all the characters in region Y are one and the same character. In that case all the characters in region X are equal too, and falling back to next[k] already covers this candidate, so it is better to backtrack directly to next[k].
So why is it that backtracking directly to next[k] gives a valid candidate?
We know region X equals region Y, so region B must equal region D, because B is the last next[k] characters of X and D is the last next[k] characters of Y.
By the meaning of next[k], in the substring P[0]~P[k-1] the first next[k] characters equal the last next[k] characters, that is, region A equals region B. It follows that region A equals region D. So if the next test P[next[k]] = P[j] succeeds, the substring P[0]~P[j] must have an equal prefix and suffix subsequence of length next[k] + 1.
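The fallback argument can also be checked empirically by comparing the fast routine against the definition on every short string. This is only a sanity-check sketch, not a proof: the class name NextCrossCheck, the helper nextByDefinition, and the choice of the alphabet {a, b} are all assumptions made for the test.

```java
import java.util.Arrays;

// Sketch only: cross-checks the fallback-based routine against the definition.
final class NextCrossCheck {

    /** Next array computed with the prefix = next[prefix] fallback (t non-empty). */
    static int[] getNext(String t) {
        int[] next = new int[t.length()];
        next[0] = -1;
        int suffix = 0;
        int prefix = -1;
        while (suffix < t.length() - 1) {
            if (prefix == -1 || t.charAt(prefix) == t.charAt(suffix)) {
                next[++suffix] = ++prefix;
            } else {
                prefix = next[prefix];
            }
        }
        return next;
    }

    /** Next array computed directly from the definition (t non-empty). */
    static int[] nextByDefinition(String t) {
        int[] next = new int[t.length()];
        next[0] = -1;
        for (int j = 1; j < t.length(); j++) {
            // Largest k < j with t[0..k-1] equal to t[j-k..j-1]; stays 0 if none.
            for (int k = j - 1; k > 0; k--) {
                if (t.regionMatches(0, t, j - k, k)) {
                    next[j] = k;
                    break;
                }
            }
        }
        return next;
    }

    /** True if both routines agree on every string over {a, b} of length 1..maxLen. */
    static boolean agreeOnAllStrings(int maxLen) {
        for (int len = 1; len <= maxLen; len++) {
            for (int bits = 0; bits < (1 << len); bits++) {
                StringBuilder sb = new StringBuilder();
                for (int p = 0; p < len; p++) {
                    sb.append(((bits >> p) & 1) == 0 ? 'a' : 'b');
                }
                String t = sb.toString();
                if (!Arrays.equals(getNext(t), nextByDefinition(t))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeOnAllStrings(10));
    }
}
```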
Now that the computation of the next array is clear, we can give the complete implementation of the KMP algorithm.
"KMP Pattern matching algorithm implementation"
public static int KMP(String s, String t) {
    int i = 0;
    int j = 0;
    // get the next array
    int[] next = getNext(t);
    while (i < s.length() && j < t.length()) {
        if (j == -1 || s.charAt(i) == t.charAt(j)) {
            i++;
            j++;
        } else {
            // j backtracks as directed by the next array; i never backtracks
            j = next[j];
        }
    }
    if (j == t.length()) {
        return i - j;
    } else {
        return -1;
    }
}
The code is similar to the BF algorithm. The difference is in the backtracking: j backtracks according to next[j], and i does not backtrack at all.
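The difference between the two algorithms shows up when character comparisons are counted. The sketch below runs both on the earlier example strings; the class name ComparisonCount and the convention of returning {position, comparisons} as an int pair are illustrative choices, not part of the original listings.

```java
// Sketch only: both searches return {match position, number of character comparisons}.
final class ComparisonCount {

    static int[] getNext(String t) {
        int[] next = new int[t.length()];
        next[0] = -1;
        int suffix = 0;
        int prefix = -1;
        while (suffix < t.length() - 1) {
            if (prefix == -1 || t.charAt(prefix) == t.charAt(suffix)) {
                next[++suffix] = ++prefix;
            } else {
                prefix = next[prefix];
            }
        }
        return next;
    }

    static int[] bruteForce(String s, String t) {
        int i = 0, j = 0, count = 0;
        while (i < s.length() && j < t.length()) {
            count++;
            if (s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
            } else {
                i = i - j + 1; // i backtracks
                j = 0;
            }
        }
        return new int[]{j == t.length() ? i - j : -1, count};
    }

    static int[] kmp(String s, String t) {
        int[] next = getNext(t);
        int i = 0, j = 0, count = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1) { // no character of t is compared with s[i]
                i++;
                j = 0;
                continue;
            }
            count++;
            if (s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
            } else {
                j = next[j]; // only j backtracks
            }
        }
        return new int[]{j == t.length() ? i - j : -1, count};
    }

    public static void main(String[] args) {
        String s = "ABABCABCACB", t = "ABCAC";
        System.out.println("BF:  " + java.util.Arrays.toString(bruteForce(s, t)));
        System.out.println("KMP: " + java.util.Arrays.toString(kmp(s, t)));
    }
}
```

Both searches find the same position; the KMP version simply reaches it with fewer comparisons on this input.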
KMP algorithm Optimization
The KMP algorithm given above still has a minor problem. Take the pattern string T = "AAAAAX" and the target string S = "AAAABCDE". The algorithm above readily gives the next array of the pattern string T, and the matching proceeds as follows.
A mismatch occurs when comparing s[4] and t[4]. Following next[4], the next step compares s[4] with t[3], which mismatches again; then s[4] with t[2], another mismatch; then s[4] with t[1], another mismatch; and finally s[4] with t[0], still a mismatch. At that point next[0] = -1, so no further character of T is compared with s[4], and the next comparison is between s[5] and t[0]. In this special case the efficiency is still low even though we use the next array. When s[4] != t[4], since t[4] = t[3] = t[2] = t[1] = t[0], none of them can equal s[4], so the algorithm should move straight on to comparing s[5] with t[0].
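The chain of wasted comparisons can be read off the next array itself. The sketch below computes the next array of "AAAAAX" and lists the pattern indices that would be retried against the same target character after a mismatch at index 4, assuming every retry mismatches as it does in this example; the class name FallbackChain and the method name chain are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: follows next[] from a mismatch position until it reaches -1.
final class FallbackChain {

    static int[] getNext(String t) {
        int[] next = new int[t.length()];
        next[0] = -1;
        int suffix = 0;
        int prefix = -1;
        while (suffix < t.length() - 1) {
            if (prefix == -1 || t.charAt(prefix) == t.charAt(suffix)) {
                next[++suffix] = ++prefix;
            } else {
                prefix = next[prefix];
            }
        }
        return next;
    }

    /** Pattern indices retried after a mismatch at j, if every retry mismatches too. */
    static List<Integer> chain(int[] next, int j) {
        List<Integer> visited = new ArrayList<>();
        for (int k = next[j]; k != -1; k = next[k]) {
            visited.add(k);
        }
        return visited;
    }

    public static void main(String[] args) {
        int[] next = getNext("AAAAAX");
        System.out.println(Arrays.toString(next));
        System.out.println(chain(next, 4));
    }
}
```

Every index in the chain holds the same character as t[4], which is why each of those retries is bound to fail.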
To handle this case, only the computation of the next array needs to be improved.
"Next Array Solver algorithm optimization implementation"
public static int[] getNext(String t) {
    int[] next = new int[t.length()];
    next[0] = -1;
    int suffix = 0;  // suffix index
    int prefix = -1; // prefix index
    while (suffix < t.length() - 1) {
        // if the characters are equal or the prefix index is -1, advance both indices
        if (prefix == -1 || t.charAt(prefix) == t.charAt(suffix)) {
            ++prefix;
            ++suffix;
            // the improved part
            if (t.charAt(prefix) == t.charAt(suffix)) {
                next[suffix] = next[prefix];
            } else {
                next[suffix] = prefix;
            }
        } else {
            prefix = next[prefix];
        }
    }
    return next;
}
The improvement is this: if t[suffix] != t[prefix], we proceed as before and set next[suffix] = prefix.
If t[suffix] = t[prefix], the special case from the example above can arise, so we set next[suffix] = next[prefix] instead. This is itself a backtracking step: the value would originally have been next[suffix] = prefix, and here we go back from prefix to next[prefix]. It can be understood as follows. When a mismatch occurs at position t[suffix], the next array would direct us to compare t[next[suffix]] = t[prefix] next. But we already know t[suffix] = t[prefix], so that comparison is bound to mismatch again, and we therefore go straight to t[next[prefix]] for the next comparison.
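The effect of the improvement can be measured by running the same search loop with each version of the next array. In the sketch below the names plainNext, optimizedNext, and comparisons are illustrative, and the search loop is a counting variant written for this comparison rather than the original listing.

```java
// Sketch only: counts character comparisons for a given next array.
final class OptimizedNextDemo {

    static int[] plainNext(String t) {
        int[] next = new int[t.length()];
        next[0] = -1;
        int suffix = 0, prefix = -1;
        while (suffix < t.length() - 1) {
            if (prefix == -1 || t.charAt(prefix) == t.charAt(suffix)) {
                next[++suffix] = ++prefix;
            } else {
                prefix = next[prefix];
            }
        }
        return next;
    }

    static int[] optimizedNext(String t) {
        int[] next = new int[t.length()];
        next[0] = -1;
        int suffix = 0, prefix = -1;
        while (suffix < t.length() - 1) {
            if (prefix == -1 || t.charAt(prefix) == t.charAt(suffix)) {
                ++prefix;
                ++suffix;
                // Equal characters would mismatch again, so skip ahead to next[prefix].
                next[suffix] = t.charAt(prefix) == t.charAt(suffix) ? next[prefix] : prefix;
            } else {
                prefix = next[prefix];
            }
        }
        return next;
    }

    /** Number of character comparisons a KMP search makes with the given next array. */
    static int comparisons(String s, String t, int[] next) {
        int i = 0, j = 0, count = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1) { // nothing in t is compared with s[i]; move on
                i++;
                j = 0;
                continue;
            }
            count++;
            if (s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
            } else {
                j = next[j];
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "AAAABCDE", t = "AAAAAX";
        System.out.println(java.util.Arrays.toString(optimizedNext(t)));
        System.out.println("plain:     " + comparisons(s, t, plainNext(t)));
        System.out.println("optimized: " + comparisons(s, t, optimizedNext(t)));
    }
}
```

With the optimized array, the mismatch at t[4] leads directly to next[4] = -1, so the search moves on to s[5] without retrying the other equal characters.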
If there are any mistakes, please bear with me.
A detailed interpretation of KMP pattern matching algorithm