I have been doing ACM contests for three years and learned a lot of algorithms along the way, until I became a "retired dog" after the Shanghai regional in December. Recently I suddenly wanted to pick some of those algorithms back up and summarize them properly, hence this algorithm summary series. This is the first post of the series, so I will start with a relatively simple algorithm and slowly work through the more complex ones. I hope I can keep it up.
KMP algorithm
The KMP algorithm is a string matching algorithm with linear time complexity. It is an improvement on the BF algorithm (brute force, the most basic string matching algorithm). Given a text string S and a pattern string T, the task is to find the index in S at which T occurs. The algorithm was discovered by D. E. Knuth, V. R. Pratt and J. H. Morris, so it is called the Knuth-Morris-Pratt algorithm, or KMP for short. Before explaining KMP it helps to understand its predecessor, the BF algorithm, so we introduce the naive BF algorithm first.
One: Introduction to the BF algorithm
Take the text S=abcabcabdabba and the pattern T=abcabd (subscripts start from 0). Starting from s[0], we compare s[i] with t[i] one position at a time until an inequality is found at t[5], which signals a mismatch. In the BF algorithm, when a mismatch occurs T must go back to its beginning, the starting subscript in S is increased by 1, and matching resumes from there.
This attempt mismatches as well, so we keep backtracking in the same way until the starting subscript in S has increased to 3, where the match succeeds.
It is easy to see that the time complexity of the BF algorithm is O(n*m), where n is the length of the text and m is the length of the pattern. The BF implementation is simple and intuitive, so it is not the focus here. The KMP algorithm introduced next improves on BF: its time complexity is linear, O(n+m), and its implementation is not much harder than BF's.
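For reference, the BF procedure described above could be sketched roughly as follows. This is only a minimal illustration, not code from the original post: the name bf_search is made up here, and treating an empty pattern as matching at index 0 is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Brute-force search (hypothetical helper for this sketch).
   Returns the index of the first occurrence of t in s, or -1 if none. */
int bf_search(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    size_t n = strlen(s), m = strlen(t);
    if (m > n)
        return -1;
    for (size_t start = 0; start + m <= n; start++) {
        size_t k = 0;
        /* compare t against s[start .. start+m-1] */
        while (k < m && s[start + k] == t[k])
            k++;
        if (k == m)
            return (int)start;
        /* mismatch: t restarts from 0 and start advances by 1 */
    }
    return -1;
}
```

With the example above, bf_search("abcabcabdabba", "abcabd") returns 3. In the worst case the inner loop runs up to m times for each of the roughly n starting positions, which is where O(n*m) comes from.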
Two: KMP algorithm
The naive matching algorithm above has the advantage of being straightforward and clear; its disadvantage, of course, is that it costs a lot of time. Knowing where BF falls short, we can treat the problem at its source and design a string matching algorithm that costs less time.
The KMP algorithm is one of the classic examples. Its main idea is:
When a mismatch occurs during matching, do not simply restart from the next character of the text. Instead, use the information gathered during the matching so far to skip unnecessary comparisons, which gives a much higher matching efficiency.
Take the previous example again: the text is S=abcabcabdabba and the pattern is T=abcabd. When the first attempt reaches t[5]!=s[5], KMP does not move the subscript of T back to 0. It moves it back to 2, keeps the subscript of S at 5, and continues matching from s[5] until the match is complete.
So how does KMP know that the subscript of T should go back to 2? As mentioned, KMP maintains some information that helps it skip unnecessary comparisons. This information is the core of the KMP algorithm: the next array (also known as the fail array or the prefix array).
1. The next array
(1) Definition of the next array:
Let the pattern be t[0,m-1] (length m). Then next[i] is the length of the longest string that is both a prefix and a suffix of t[0,i-1] (we may call it the longest common prefix-suffix). Note that the prefix and suffix here do not include the string t[0,i-1] itself.
In the example above, T=abcabd, so next[5] is the length of the longest string that is both a prefix and a suffix of abcab, which is obviously 2, namely the string ab. Recall that in the earlier example T went back to subscript 2 after the mismatch, which agrees with next[5]. This is no coincidence: KMP uses the next array to work out where the pattern should backtrack to when a mismatch occurs.
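To make the definition concrete, here is a small sketch that evaluates next[i] directly from the definition by trying every candidate length from longest to shortest. It is deliberately slow and only meant for checking small examples; the function name border_by_definition is invented for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Length of the longest string that is both a proper prefix and a proper
   suffix of t[0..i-1], computed straight from the definition. */
int border_by_definition(const char *t, int i)
{
    assert(t != NULL && i >= 0 && (size_t)i <= strlen(t));
    for (int len = i - 1; len > 0; len--) {
        /* compare prefix t[0..len-1] with suffix t[i-len..i-1] */
        if (memcmp(t, t + (i - len), (size_t)len) == 0)
            return len;
    }
    return 0;
}
```

For T=abcabd, border_by_definition("abcabd", 5) gives 2, matching next[5]=2 in the example.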
(2) Computing the next array:
Here is how the next array is computed.
Let the pattern be t[0,m-1], of length m. From the definition of the next array we know that next[0]=next[1]=0 (because the prefix and suffix considered here do not include the string itself).
Now suppose we compute the next array from left to right, and at some point we already have next[0]~next[i] and want to compute next[i+1]. Let j=next[i]. Since we know next[i], we know that t[0,j-1]=t[i-j,i-1]. Now compare t[j] and t[i]: if they are equal, the definition of the next array gives next[i+1]=j+1 directly.
If they are not equal, then next[i+1]<j+1, so j has to be reduced to a suitable position po such that:
1) t[0,po-1]=t[i-po,i-1].
2) t[po]=t[i].
3) po is the largest value that satisfies conditions (1) and (2).
4) 0<=po<j (this obviously holds).
How do we obtain po? It cannot be found directly; we can only approach it step by step by trying successive candidate positions. The current j satisfies condition (1), so it is one candidate. What is the next-largest position that satisfies condition (1)? By the definition of the next array it is k=next[j]. We then only need to check whether t[k] equals t[i] to decide whether condition (2) holds. If not, we keep reducing to next[k] and check again, until we reach a position p that satisfies both conditions (1) and (2). This p must be the largest value satisfying conditions (1) and (2): if some position x>p satisfied conditions (1), (2) and (4), then x would have been found before backtracking reached p, otherwise the definition of the next array would be contradicted. So po=p, and it follows that next[i+1]=po+1. Thus next[i+1] is computed, and by mathematical induction we can obtain every next[i] (0<=i<m).
Note: during backtracking it may happen that no po satisfying the four conditions above can be found. This means the longest common prefix-suffix of t[0,i] has length 0, so we simply assign next[i+1]=0.
#include <string.h>

// Computes the next array of the string str.
// next must have room for strlen(str)+1 entries (and at least 2).
int getnext(char *str, int *next)
{
    int len = strlen(str);
    next[0] = next[1] = 0; // initialize
    for (int i = 1; i < len; i++)
    {
        int j = next[i];
        // keep backtracking j until str[i]==str[j] or j is reduced to 0
        while (j && str[i] != str[j])
            j = next[j];
        next[i + 1] = str[i] == str[j] ? j + 1 : 0; // update next[i+1]
    }
    return len; // returns the length of str
}
The above is the code that computes the next array. Quite short, isn't it?
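To check the procedure against the worked example, here is a self-contained variant of the same routine. The name build_next and the const-qualified signature are choices made for this sketch, not the original post's; the caller is assumed to supply a buffer with room for strlen(t)+1 ints.

```c
#include <assert.h>
#include <string.h>

/* Fills next[0..len] for the string t and returns len = strlen(t).
   The caller must provide room for len + 1 ints. */
int build_next(const char *t, int *next)
{
    assert(t != NULL && next != NULL);
    int len = (int)strlen(t);
    next[0] = 0;
    if (len > 0)
        next[1] = 0;
    for (int i = 1; i < len; i++) {
        int j = next[i];
        /* follow the chain j -> next[j] until t[i] == t[j] or j == 0 */
        while (j && t[i] != t[j])
            j = next[j];
        next[i + 1] = (t[i] == t[j]) ? j + 1 : 0;
    }
    return len;
}
```

For t=abcabd this fills next with 0 0 0 0 1 2 0. A pattern such as aabaaab actually exercises the backtracking loop (at i=5, j drops from 2 to 1 before matching) and yields 0 0 1 0 1 2 2 3.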
2. KMP matching process
With the next array in hand, we can use it to skip unnecessary comparisons and speed up string matching. But why does the next array guarantee that no matching position is missed?
First, suppose the mismatch occurs when the subscript of T is at i. Then t[0,i-1] matches s[l,r] in the text. Let next[i]=j; according to KMP, T goes back to subscript j and matching continues. By the definition of next[i], t[0,j-1] matches s[r-j+1,r], and for any j<y<i, t[0,y-1] does not match s[r-y+1,r]. So none of the alignments that are skipped could have produced a match, and the matching process cannot miss a matching position.
As in the computation of the next array, the general case may still mismatch after going back to j=next[i]. We then simply keep backtracking to next[j], and so on. If we end up back at subscript 0 and there is still no match, the current character of the text differs from the first character of T, so we just move the current position in the text forward by 1 and continue matching.
The following is the code for the KMP matching process:
// Returns the start position of the first occurrence of the pattern t in s.
// next is assumed to be a global int array with room for strlen(t)+1 entries;
// t is assumed to be non-empty.
int kmp(char *s, char *t)
{
    int l1 = strlen(s), l2 = getnext(t, next); // l2 is the length of t; getnext is given above
    int i, j = 0;
    for (i = 0; i < l1; i++)
    {
        while (j && s[i] != t[j]) // on a mismatch, backtrack j
            j = next[j];
        if (s[i] == t[j])
            j++;
        if (j == l2) // successful match, exit
            break;
    }
    if (j == l2)
        return i - l2 + 1; // position of the first successful match
    else
        return -1; // returns -1 if the match is unsuccessful
}
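To try the matching process on the example from this post, here is a self-contained sketch that allocates its own next array instead of relying on a shared one. The names fill_next and kmp_find are made up for this illustration, and returning 0 for an empty pattern and -1 on allocation failure are assumptions of the sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fills next[0..m] for the pattern t of length m (same rule as in the text). */
static void fill_next(const char *t, int m, int *next)
{
    next[0] = 0;
    if (m > 0)
        next[1] = 0;
    for (int i = 1; i < m; i++) {
        int j = next[i];
        while (j && t[i] != t[j])
            j = next[j];
        next[i + 1] = (t[i] == t[j]) ? j + 1 : 0;
    }
}

/* Returns the start index of the first occurrence of t in s, or -1 if there
   is none (or if allocation fails). An empty pattern matches at index 0. */
int kmp_find(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    int n = (int)strlen(s), m = (int)strlen(t);
    if (m == 0)
        return 0;
    int *next = malloc(((size_t)m + 1) * sizeof *next);
    if (next == NULL)
        return -1;
    fill_next(t, m, next);

    int j = 0, pos = -1;
    for (int i = 0; i < n; i++) {
        while (j && s[i] != t[j])   /* mismatch: move the pattern back */
            j = next[j];
        if (s[i] == t[j])
            j++;
        if (j == m) {               /* whole pattern matched, ending at i */
            pos = i - m + 1;
            break;
        }
    }
    free(next);
    return pos;
}
```

With S=abcabcabdabba and T=abcabd, kmp_find returns 3, the same position the BF walk-through arrived at. The subscript of S never moves backwards; only j does.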
3. Time complexity analysis
As mentioned, the time complexity of KMP is linear, but this is not obvious from the code. Many readers may wonder: if every step may have to backtrack many times, won't the time complexity degenerate into something non-linear?
Let us look at the variables in the code, starting with the kmp function. Only two variables affect its running time, i and j. The variable i only increases, len times in total (len being the length of the text), which is O(len). Now consider j. From the definition of the next array we know next[j]<j, so j decreases by at least 1 each time it backtracks, and j always stays non-negative. The code also shows that j is increased at most len times, by exactly 1 each time. In short, j gains only 1 per increase and loses at least 1 per decrease while remaining non-negative, so the number of decreases cannot exceed the number of increases. The number of backtracking steps therefore never exceeds len, and the time complexity of the kmp function is O(len). The same argument shows that computing the next array also takes time linear in the length of its string, so it is not repeated here. For a text S of length n and a pattern T of length m, the time complexity of the KMP algorithm is O(n+m).
This completes the implementation of the KMP algorithm. It is not the most complete version, though: the full KMP algorithm further optimizes the next array. Still, the algorithm as given already reaches the lower bound on time complexity, and the current definition of the next array retains some very useful properties that help in solving certain problems.
Readers interested in the optimized KMP algorithm can look up the relevant material on their own.
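The counting argument can also be checked empirically. The sketch below runs the same matching loop (stopping at the first match) and counts how often j is moved back through the next array. The function name count_backtracks is invented for this illustration, and returning -1 on allocation failure is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Runs the matching loop over s, stopping at the first match of t, and
   returns how many times j was moved back via next[].
   Returns -1 if allocation fails. */
long count_backtracks(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    size_t n = strlen(s), m = strlen(t);
    int *next = calloc(m + 2, sizeof *next);   /* next[0..m], zero-filled */
    if (next == NULL)
        return -1;
    for (size_t i = 1; i < m; i++) {
        int j = next[i];
        while (j && t[i] != t[j])
            j = next[j];
        next[i + 1] = (t[i] == t[j]) ? j + 1 : 0;
    }

    long backtracks = 0;
    size_t j = 0;
    for (size_t i = 0; i < n && j < m; i++) {
        while (j && s[i] != t[j]) {
            j = (size_t)next[j];
            backtracks++;          /* each step lowers j by at least 1 */
        }
        if (s[i] == t[j])
            j++;                   /* j rises by exactly 1, at most n times */
    }
    free(next);
    return backtracks;
}
```

For the text aaabaaabaaab (length 12) and the pattern aaaa, which forces three backtracking steps at every b, the count is 9, still below the text length. For the running example abcabcabdabba and abcabd the count is 1.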
KMP Algorithm Summary