Most people who have used a computer know that nearly every program with text editing features (Word, for example) supports the Ctrl + F shortcut. It provides the "Search", "Replace", and "Replace all" functions. This is a typical application of pattern matching, that is, searching for a string within a text.
1. Pattern Matching
The pattern matching problem is defined as follows. Given two strings S and P: S is called the target string and contains n characters; P is called the pattern string and contains m characters, with m <= n. Starting from a given position in S (usually the first), search for P. If P is found, return its position in the target string (that is, the index in S of the first character of P). If P does not occur in S, return -1. That is the definition of pattern matching; now let's look at the algorithms that implement it.
2. Simple pattern matching
The simple pattern matching algorithm is easy to understand. The general idea is as follows. Starting from the first character S[0] of S, compare the characters of S with the characters of P one by one. If S[0] == P[0] && ... && S[m-1] == P[m-1], the match succeeds, no further comparison is needed, and index 0 is returned. If at some step S[i] != P[i], the remaining characters of P need not be compared, because a match at this position is impossible. The comparison then restarts with the second character of S against the first character of P: either S[1] == P[0], ..., S[m] == P[m-1] all hold, or some i is found with S[i+1] != P[i]. And so on, until the start position in S reaches n-m. If the attempt starting at position n-m fails as well, P does not occur in S. (Think about why the last start position is n-m.) The code is simple to write. For details, refer to the internal implementation of the strstr function; see also its Baidu Baike entry.
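The simple algorithm described above can be sketched in C roughly as follows. The function name naive_search and its parameters are illustrative assumptions, not taken from strstr or any other library.

```c
/* Returns the index of the first occurrence of pat (length m) in
 * text (length n), or -1 if pat does not occur. */
int naive_search(const char *text, int n, const char *pat, int m)
{
    /* Only start positions 0 .. n-m leave room for all m characters. */
    for (int start = 0; start + m <= n; start++) {
        int i = 0;
        /* Compare until a mismatch or until all of pat has matched. */
        while (i < m && text[start + i] == pat[i])
            i++;
        if (i == m)
            return start;   /* every character of pat matched */
    }
    return -1;
}
```

For instance, naive_search("hello world", 11, "world", 5) should return 6. In the worst case each of the n-m+1 start positions costs up to m comparisons.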
3. Fast pattern matching algorithm (KMP)
The main cause of inefficiency in simple pattern matching is repeated character comparison: each attempt is independent of the previous one, so nothing learned from it is reused. The result of the previous attempt can in fact be exploited, and this leads to fast pattern matching. In simple pattern matching the start position in the target string S advances one step at a time, which is wasteful. The shift does not have to be 1.
Suppose the current attempt looks like this: S[t] S[t+1] ... S[t+j] has been matched exactly against P[0] P[1] ... P[j], the characters now being compared are S[t+j+1] and P[j+1], and S[t+j+1] != P[j+1]. Where in S should the next attempt start? In simple pattern matching, the next attempt starts at S[t+1] and compares S[t+1] with P[0]. Fast pattern matching instead compares S[t+j+1] with P[k+1]. What is k? It is the largest value k < j such that P[0] P[1] ... P[k] equals P[j-k] P[j-k+1] ... P[j]; write k = next[j]. Since P[j-k] ... P[j] has just been matched against S[t+j-k] ... S[t+j], P[0] P[1] ... P[k] matches S[t+j-k] S[t+j-k+1] ... S[t+j] exactly, so the next two characters to compare are S[t+j+1] and P[k+1]. (If no such k exists, next[j] = -1 and S[t+j+1] is compared with P[0].) The index into S never moves backward and the comparison in P does not restart from index 0, which is why KMP is fast.
Now the key question: how is k obtained? If computing k were expensive, the method would not be worthwhile. In fact, k depends only on the pattern string P. There are m values of k, one per position j, with k = next[j]. They need to be computed only once and stored in the next array, and the time this takes is linear in m. Let's see how to compute the values of the next array, that is, k.
Calculate next[] by induction: set next[0] = -1. If next[j] = k is known, obtain next[j+1] as follows.
(1) If P[k+1] == P[j+1], then clearly next[j+1] = k+1. If P[k+1] != P[j+1], then next[j+1] <= k, so we look for the largest h < k such that P[0] P[1] ... P[h] = P[j-h] ... P[j] = P[k-h] ... P[k]. That is, h = next[k]. This is an iteration: values computed earlier are reused to compute later ones.
(2) If no such h exists, that is, the chain reaches -1 and P[0] != P[j+1] as well, then P[0] P[1] ... P[j+1] has no proper prefix equal to a suffix, so next[j+1] = -1.
(3) If such an h exists, test whether P[h+1] and P[j+1] are equal. If they are, next[j+1] = h+1; otherwise repeat the step with next[h] in place of h. The computation of next[j+1] ends when an equal pair is found or the value is determined to be -1.
Let's look at the implementation code:
    // The next array holds the m values of k, one per pattern position.
    // With a fixed size of 20, patterns are limited to 20 characters.
    int next[20] = {0};

    // Fill the next array for the pattern str of length len.
    // next[j] = k means str[0] str[1] ... str[k] == str[j-k] str[j-k+1] ... str[j].
    // If des[t+j+1] and pat[j+1] fail to match, the next comparison is
    // between des[t+j+1] and pat[next[j]+1].
    void Next(char str[], int len)
    {
        next[0] = -1;
        for (int j = 1; j < len; j++) {
            int i = next[j - 1];
            while (i >= 0 && str[j] != str[i + 1]) {   // iteration process
                i = next[i];
            }
            if (str[j] == str[i + 1]) {
                next[j] = i + 1;
            } else {
                next[j] = -1;
            }
        }
    }
With the k values saved in the next array, the KMP algorithm can be implemented:
    // des is the target string, pat is the pattern string,
    // and len1 and len2 are their lengths.
    int KMP(char des[], int len1, char pat[], int len2)
    {
        Next(pat, len2);   // fill the next array for pat
        int p = 0, s = 0;
        while (p < len2 && s < len1) {
            if (pat[p] == des[s]) {
                p++;
                s++;
            } else {
                if (p == 0) {
                    s++;   // the first character failed to match, so move on to the next character of des
                } else {
                    p = next[p - 1] + 1;   // use the failure function to decide how far pat falls back
                }
            }
        }
        if (p < len2) {   // the pattern was not matched completely
            return -1;
        }
        return s - len2;
    }
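Because the implementation above stores the table in a fixed global array of 20 entries, a variant that allocates the table per call may be easier to reuse. The following is only a sketch under that assumption; the names build_next, kmp_search, and fail are hypothetical, and the -2 return value for allocation failure is a choice made here, not part of the original code.

```c
#include <stdlib.h>

/* Fills fail[0..m-1] with the same convention as the text:
 * fail[j] = k means pat[0..k] == pat[j-k..j] with k < j as large
 * as possible, and -1 when no such prefix exists. */
static void build_next(const char *pat, int m, int *fail)
{
    fail[0] = -1;
    for (int j = 1; j < m; j++) {
        int i = fail[j - 1];
        while (i >= 0 && pat[j] != pat[i + 1])
            i = fail[i];            /* fall back along the chain */
        fail[j] = (pat[j] == pat[i + 1]) ? i + 1 : -1;
    }
}

/* Returns the index of the first match of pat in text, -1 if there
 * is none, and -2 if the table cannot be allocated. */
int kmp_search(const char *text, int n, const char *pat, int m)
{
    if (m == 0) return 0;           /* the empty pattern matches at 0 */
    if (m > n) return -1;
    int *fail = malloc((size_t)m * sizeof *fail);
    if (fail == NULL) return -2;
    build_next(pat, m, fail);

    int p = 0, s = 0;               /* s never moves backward */
    while (p < m && s < n) {
        if (pat[p] == text[s]) { p++; s++; }
        else if (p == 0) s++;
        else p = fail[p - 1] + 1;
    }
    free(fail);
    return (p == m) ? s - m : -1;
}
```

For the pattern "ababc" the table comes out as -1, -1, 0, 1, -1, and kmp_search("abababc", 7, "ababc", 5) should return 2.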
Time Complexity:
Computing the next array takes O(m) time, and the matching loop of the KMP algorithm takes O(n) time, so the time complexity of the entire algorithm is O(n + m).
Space complexity:
The next array introduces O(m) extra space.
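To make the difference in running time concrete, the sketch below counts character comparisons for both algorithms on the same input. The function names count_naive and count_kmp are hypothetical; the counting covers only the matching loops, not the construction of the table.

```c
#include <stdlib.h>

/* Character comparisons made by the simple algorithm. */
long count_naive(const char *text, int n, const char *pat, int m)
{
    long cmp = 0;
    for (int start = 0; start + m <= n; start++) {
        int i = 0;
        while (i < m) {
            cmp++;
            if (text[start + i] != pat[i])
                break;
            i++;
        }
        if (i == m)
            break;                  /* match found, stop searching */
    }
    return cmp;
}

/* Character comparisons made by the KMP matching loop.
 * Returns -1 if the table cannot be allocated. */
long count_kmp(const char *text, int n, const char *pat, int m)
{
    if (m <= 0) return 0;
    int *fail = malloc((size_t)m * sizeof *fail);
    if (fail == NULL) return -1;
    fail[0] = -1;
    for (int j = 1; j < m; j++) {
        int i = fail[j - 1];
        while (i >= 0 && pat[j] != pat[i + 1])
            i = fail[i];
        fail[j] = (pat[j] == pat[i + 1]) ? i + 1 : -1;
    }

    long cmp = 0;
    int p = 0, s = 0;
    while (p < m && s < n) {
        cmp++;                      /* one comparison per iteration */
        if (pat[p] == text[s]) { p++; s++; }
        else if (p == 0) s++;
        else p = fail[p - 1] + 1;
    }
    free(fail);
    return cmp;
}
```

On a target of sixteen 'a' characters and the pattern "aaab", the simple algorithm makes 13 * 4 = 52 comparisons, while the KMP loop makes 29, which stays within the 2n bound that gives the O(n) matching time.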
4. A KMP interview question
Given two strings S1 and S2, determine whether S2 is contained in some string obtained by cyclically shifting S1. For example, if S1 = aabcd and S2 = cdaa, true is returned, because S1 can be cyclically shifted to cdaab. If S1 = abcd and S2 = acbd, false is returned.
Analysis: It is not hard to see that every string obtained by cyclically shifting S1 is a substring of S1S1 (S1 concatenated with itself). So if S2 is contained in some cyclic shift of S1, S2 must be a substring of S1S1, and the KMP algorithm can be used to check this.
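The concatenation idea can be sketched in C as follows. The function name in_rotation is hypothetical, and for brevity the substring search uses the standard strstr rather than a hand-written KMP; any substring search, including the KMP routine from section 3, could be substituted. The length check is an added assumption: without it, a longer S2 such as "aba" would be found in "abab" even though no shift of "ab" contains it.

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 if s2 occurs in some cyclic shift of s1, 0 if not,
 * and -1 if memory cannot be allocated. */
int in_rotation(const char *s1, const char *s2)
{
    size_t n = strlen(s1);
    if (strlen(s2) > n)
        return 0;                   /* s2 cannot fit in a shift of s1 */

    char *dbl = malloc(2 * n + 1);  /* room for s1s1 and the '\0' */
    if (dbl == NULL)
        return -1;
    memcpy(dbl, s1, n);
    memcpy(dbl + n, s1, n + 1);     /* second copy plus the terminator */

    int found = strstr(dbl, s2) != NULL;
    free(dbl);
    return found;
}
```

With the examples from the question, in_rotation("aabcd", "cdaa") should return 1 and in_rotation("abcd", "acbd") should return 0.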
Think: is there a better idea than KMP?
Let's take a look at some of the lessons learned.