The naive string matching algorithm needs little explanation, so I will simply paste the code:
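- // Note: shares its name with the standard strstr, but takes (sub, str) and returns an index.
- int strstr(char *sub, char *str) {
-     int i = 0; // number of characters of sub matched so far
-     char *p = str, *q = sub;
-     while (*(p + i) != '\0' && *(q + i) != '\0') {
-         if (*(q + i) == *(p + i))
-             i++;
-         else {
-             p++; // mismatch: move the start in str one step forward
-             i = 0; // and restart from the beginning of sub
-         }
-     }
-     if (*(q + i) == '\0')
-         return p - str; // start index of the match
-     return -1;
- }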
Next, let's talk about the KMP algorithm as I understand it. Unlike the naive matching algorithm, when KMP fails to match at some character of the substring, it does not restart matching from the beginning of the substring. Instead, a next function is used to calculate the position in the substring from which matching should continue. An example makes this easier to describe.
Red marks the mismatch position, '^' marks the current pointer position, and '~' marks the position in the parent string where the current match attempt started.
With the naive matching algorithm, when a match fails, the pointer into string C goes back to its beginning and the pointer into the parent string goes back to the position just after '~'. That is, the next step starts comparing the second character of a (i.e. 'B') with the first character of C (i.e. 'A'). This cycle repeats until some position in string a matches string C.
However, looking at the matching process above, the blue parts of the two strings were already confirmed to be equal in the first step. The comparisons in the steps above therefore amount to comparing the front of the blue part with its back. Since the blue part has already been matched, these comparisons are wasted: they have nothing to do with the parent string a and depend only on the substring C itself. Only step 4 involves a comparison with a character ('C') of the parent string.
The KMP algorithm exploits this: when a match fails, it uses information about the substring itself to calculate where matching should continue. That is, when string C fails at the last 'B', KMP skips the unnecessary comparisons of the first three steps (1, 2, 3) and goes directly to the comparison between 'D' and 'C'. In this way the parent string a never needs to backtrack, and string C uses a next function to determine how far back it should go. The next function is therefore the key to this algorithm.
From the analysis above, we can see that the next function is determined solely by the substring C itself.
Assume that the substring is P, indexed from 0.
next(j) = k (k >= 0) means: when character j + 1 of P fails to match, the parent string pointer does not backtrack, and the next step compares P[k + 1] with the current character of the parent string. next(j) can be expressed as
next(j) = max { k | 0 <= k < j and P[0 ~ k] = P[j-k ~ j] }, or -1 if no such k exists    (1)
When next(j) = k (k >= 0), the substring pointer goes back to position k + 1 and the parent string pointer remains unchanged;
When next(j) = -1, the substring pointer goes back to the beginning, and the parent string pointer moves one step forward if the first character of the substring does not match either.
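To make the definition concrete, here is a minimal brute-force sketch that computes the table directly from the rule "largest k < j such that the prefix P[0 ~ k] equals the suffix P[j-k ~ j], else -1". The name `next_by_definition` and the use of `std::string` and `std::vector` are my own choices for illustration; this is far slower than the real KMP preprocessing and is only meant as a reference to check against.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Brute-force next table (hypothetical helper, not part of the original code).
// next[j] is the largest k with 0 <= k < j such that
// p[0..k] == p[j-k..j], or -1 when no such k exists.
std::vector<int> next_by_definition(const std::string& p) {
    std::vector<int> next(p.size(), -1);
    for (std::size_t j = 0; j < p.size(); ++j) {
        // Try the longest candidate first; k is the last index of the prefix.
        for (int k = static_cast<int>(j) - 1; k >= 0; --k) {
            // The prefix p[0..k] has length k + 1; the suffix starts at j - k.
            if (p.compare(0, k + 1, p, j - k, k + 1) == 0) {
                next[j] = k;
                break;
            }
        }
    }
    return next;
}

// Usage sketch with a made-up pattern.
void next_by_definition_demo() {
    assert((next_by_definition("abab") == std::vector<int>{-1, -1, 0, 1}));
}
```

Trying the longest candidate first means the first hit is already the maximum, so the inner loop can stop there.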
When writing a program to calculate the next values, we do not need to search for the maximum k from scratch at every step. Each value can be derived from the earlier ones by a recurrence.
For example
Assume that the substring is P: "abacabab", and we want the next value of the last 'b', i.e. next[7].
Suppose next[0 ~ 6] are already known: next[0] = -1, next[1] = -1, next[2] = 0, next[3] = -1, next[4] = 0, next[5] = 1, next[6] = 2
"aba c aba b"
next[6] = 2 means that P[0 ~ 2] (blue) and P[4 ~ 6] (red) are identical.
To get next[7], we take the longest prefix of the first seven characters ("abacaba") that is equal to a suffix of them, then check whether the character following that prefix is equal to P[7]. In this example, P[0 ~ next[6]] (i.e. P[0 ~ 2]) is that prefix. Next we compare 'c' and 'b', that is, P[next[6] + 1] ('c') and P[7] ('b').
- If they are equal, next[7] = next[6] + 1.
- If not, we look for a shorter prefix that is equal to a suffix. Because the two "aba" blocks are identical, finding a prefix/suffix pair shorter than "aba" within "aba c aba" is the same as finding next[2] within "aba", that is, next[next[6]]. Then compare P[next[next[6]] + 1] with P[7]. If they still differ, keep looking for an even shorter string in the same way.
In the preceding example, P[next[6] + 1] = P[3] ('c') is not equal to P[7] ('b'). However, P[next[next[6]] + 1] = P[next[2] + 1] = P[1] ('b') is equal to P[7] ('b'), so
next[7] = next[next[6]] + 1 = next[2] + 1 = 1;
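The fallback chain in this example can be traced with a small sketch. The function below computes one entry next[i] from the already known entries and optionally records every candidate k it tries. The name `next_from_previous` and the `steps` parameter are hypothetical, added only so the chain 2, 0 from the example can be observed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Computes next[i] from the known entries next[0..i-1] by following the
// chain k = next[i-1], next[next[i-1]], ... (hypothetical helper).
// If `steps` is given, every candidate k that was tried is appended to it.
int next_from_previous(const std::string& p, const std::vector<int>& next,
                       std::size_t i, std::vector<int>* steps = nullptr) {
    assert(i >= 1 && i < p.size() && next.size() >= i);
    int k = next[i - 1];
    while (true) {
        if (steps) steps->push_back(k);
        if (p[k + 1] == p[i]) return k + 1;  // the prefix extends by one character
        if (k < 0) return -1;                // nothing shorter is left to try
        k = next[k];                         // fall back to a shorter prefix
    }
}
```

For P = "abacabab" with next[0 ~ 6] as listed above, calling this with i = 7 tries k = 2 (comparing 'c' with 'b'), then k = next[2] = 0 (comparing 'b' with 'b'), and returns 1.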
Code for calculating the next value:
- #include <string.h> // strlen
- void calnext(char *p, int next[]) {
-     int n = strlen(p);
-     next[0] = -1; // the next of the first element is always -1, because according to (1) no k smaller than j = 0 exists
-     for (int i = 1; i < n; i++) {
-         int k = next[i - 1]; // next[i - 1] is already known; start from it to compute next[i]
-         while (k >= 0 && p[k + 1] != p[i]) { // fall back to shorter and shorter prefixes
-             k = next[k];
-         }
-         if (p[k + 1] == p[i]) // the prefix ending at k can be extended by one character
-             next[i] = k + 1; // one more matching character is added
-         else
-             next[i] = -1; // no prefix of p[0 ~ i] is also a suffix
-     }
- }
Matching code:
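- int find(char *t, char *pat) {
-     int n = strlen(pat);
-     if (n == 0)
-         return 0; // an empty pattern matches at the start (and needs no table)
-     int *next = new int[n];
-     calnext(pat, next);
-     char *p = t, *q = pat;
-     int i = 0; // number of pattern characters matched so far
-     while (*p != '\0' && *(q + i) != '\0') {
-         if (*p == *(q + i)) {
-             p++;
-             i++;
-         } else {
-             if (i == 0)
-                 p++; // nothing matched yet: advance the text pointer
-             else
-                 i = next[i - 1] + 1; // fall back in the pattern; p does not move
-         }
-     }
-     delete[] next; // release the table before returning
-     if (*(q + i) == '\0')
-         return p - t - n; // start index of the match
-     else
-         return -1;
- }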
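For comparison, here is a self-contained sketch of the same idea using `std::string` and `std::vector`, so that no manual memory management is needed. The names `build_next` and `kmp_find` are my own, and returning 0 for an empty pattern is an assumption; the next table follows the same convention as above (last index of the longest prefix that is also a suffix, or -1).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds the next table with the recurrence described in this post
// (hypothetical name; same convention as calnext).
std::vector<int> build_next(const std::string& p) {
    std::vector<int> next(p.size(), -1);
    for (std::size_t i = 1; i < p.size(); ++i) {
        int k = next[i - 1];
        while (k >= 0 && p[k + 1] != p[i]) k = next[k];  // fall back
        next[i] = (p[k + 1] == p[i]) ? k + 1 : -1;
    }
    return next;
}

// Returns the index of the first occurrence of pat in text, or -1.
int kmp_find(const std::string& text, const std::string& pat) {
    if (pat.empty()) return 0;  // assumption: an empty pattern matches at 0
    const std::vector<int> next = build_next(pat);
    std::size_t i = 0;  // number of pattern characters matched so far
    for (std::size_t t = 0; t < text.size();) {
        if (text[t] == pat[i]) {
            ++t;
            ++i;
            if (i == pat.size()) return static_cast<int>(t - i);
        } else if (i == 0) {
            ++t;  // nothing matched yet: advance in the text
        } else {
            i = static_cast<std::size_t>(next[i - 1] + 1);  // t never moves back
        }
    }
    return -1;
}

// Usage sketch with made-up inputs.
void kmp_find_demo() {
    assert(kmp_find("abacabacabab", "abacabab") == 4);
    assert(kmp_find("abc", "abd") == -1);
}
```

In the first call the mismatch happens at the eighth pattern character; the pattern index falls back to next[6] + 1 = 3 while the text index stays where it is, which is exactly the "no backtracking in the parent string" property.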
I am writing down my understanding of KMP here so that I do not forget it.