Original article. If you reproduce it, please credit the source: http://blog.csdn.net/fastsort/article/details/9903153
1. String Matching
String matching means finding the pattern string P in the main string S. If P occurs in S, the starting index of P in S is returned; otherwise a failure value is returned (-1 here). For example:
with S = "abcdabce" and P = "cda", 2 is returned. With P = "cdd", -1 is returned.
In C/C++, string indices start from 0. In the discussion below, string indices also start from 0 unless otherwise specified.
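As a reference point for the expected return values, here is a minimal sketch built on the standard library's strstr. The helper name find_index is hypothetical and not part of the original article; it only converts the pointer that strstr returns into the 0-based index convention used here.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper (not from the article): index of the first
// occurrence of p in s, or -1 if p does not occur in s.
int find_index(const char *s, const char *p)
{
    const char *hit = strstr(s, p);   // NULL when p is not found
    return hit ? (int)(hit - s) : -1;
}
```

With the example above, find_index("abcdabce", "cda") gives 2 and find_index("abcdabce", "cdd") gives -1.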
2. Conventional Algorithm
Before an efficient algorithm is found, the "brute force" algorithm can always be relied on, although it is inefficient.
The same is true for string matching. Direct matching is easy: align the first character of the main string S with the first character of the pattern string P, then compare the two strings character by character:
If the comparison reaches the end of P, the match succeeds and 0 is returned;
If a mismatch occurs partway through, align the second character of S with the first character of P and compare again; if this succeeds, 1 is returned, otherwise continue with the third character of S, and so on.
Repeat this process until the end of S is reached. Then -1 is returned, indicating that the match failed.
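The code is as follows:
#include <string.h>

int match_1(const char *s, const char *p)
{
    int slen = strlen(s), plen = strlen(p);
    int i = 0, j = 0;
    while (i < slen) // i is the current alignment of p within s
    {
        while (j < plen && (i + j) < slen && s[i + j] == p[j])
            j++;
        if (j == plen) return i; // all of p matched
        i++;
        j = 0;
    }
    return -1;
}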
This code works well, but to make the later comparison easier, it is rewritten in the following form:
int match_2(const char *s, const char *p)
{
    int slen = strlen(s), plen = strlen(p);
    int i = 0, j = 0;
    while (i < slen && j < plen)
    {
        if (s[i] == p[j]) // match
        {
            i++;
            j++;
        }
        else // mismatch
        {
            i = i - j + 1; // roll the s pointer back
            j = 0;         // restart p from the beginning
        }
    }
    if (j == plen) return i - j; // match successful
    else return -1;
}
Simply put, i and j initially point to the start of S and P respectively. On a match, i and j both move forward. On a mismatch, i is rolled back and j is reset to zero.
You should understand the code above thoroughly, ideally well enough to write it out immediately with pen and paper. That way you will be familiar with how it works; otherwise the following content will be hard to follow.
Complexity analysis: in the worst case, every alignment matches up to the second-to-last character of P and fails on the last one. Each alignment then costs O(m) (m = strlen(p)), and there are O(n) alignments, one per character of S (n = strlen(s)), so the total complexity is O(mn).
For example, let S = 0000000000001 and P = 00001, so m = strlen(p) = 5 and n = strlen(s) = 13. At every alignment before the final one, the comparison runs all the way to p[4] before s[i] != p[j] is discovered, and then everything starts over (i = i - j + 1, j = 0). This is called "backtracking".
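To make the cost of backtracking concrete, the following sketch repeats the match_2 loop and counts character comparisons. The function name match_2_counted and the comparisons out-parameter are additions for illustration and are not part of the original article.

```c
#include <string.h>

// Same loop as match_2, plus a counter for character comparisons.
// The out-parameter `comparisons` is an addition for illustration.
int match_2_counted(const char *s, const char *p, long *comparisons)
{
    int slen = (int)strlen(s), plen = (int)strlen(p);
    int i = 0, j = 0;
    *comparisons = 0;
    while (i < slen && j < plen)
    {
        (*comparisons)++;          // one s[i] vs p[j] comparison
        if (s[i] == p[j])
        {
            i++;
            j++;
        }
        else
        {
            i = i - j + 1;         // roll the s pointer back
            j = 0;                 // restart p from the beginning
        }
    }
    return (j == plen) ? i - j : -1;
}
```

For S = 0000000000001 and P = 00001 this returns 8 after 45 comparisons: nine alignments (0 through 8) of five comparisons each, eight of which fail only at p[4].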
Much of the backtracking in this process is unnecessary. For example:
i  0123456
S  0000000000001
P  00001
j  01234
After discovering s[4] != p[4], the next comparison starts from s[1] and p[0]:
i  0123456
S  0000000000001
P   00001
j   01234
and continues all the way to s[5] != p[4]. In fact, notice that when the previous round stopped at s[4] != p[4] (i = 4, j = 4), we already had
s[0..3] = p[0..3]    ①
Therefore
s[1..3] = p[1..3]    ②
And
p[1..3] = p[0..2]    ③
In ③, p[0..2] is a prefix of p[0..3] and p[1..3] is a suffix of it; this property depends only on P itself. From ② and ③ we obtain
s[1..3] = p[0..2]    ④
What does this suggest? Since s[1..3] = p[0..2], why should the next round start by comparing s[1] with p[0]? We can compare s[4] with p[3] directly (that is, s[i] with p[k], where k = 3). The KMP algorithm is based on this principle: after a mismatch, there is no need to start a new round of comparison from the beginning of P each time (that is, no rolling back of i and no resetting of j to 0).
Make sure you understand the analysis above, especially the equations. It deliberately uses concrete strings and numbers instead of i, j, and k to make it easier to follow. Once it is clear, we can move on to a formal explanation of the KMP algorithm.
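The equalities used in this argument can be checked mechanically. The sketch below hard-codes the example strings and tests each equality with memcmp; the function name overlap_holds is hypothetical and only serves this illustration.

```c
#include <string.h>

// Returns 1 if the four equalities from the example all hold, 0 otherwise.
// S and P are the concrete strings used in the text above.
int overlap_holds(void)
{
    const char *s = "0000000000001";
    const char *p = "00001";
    int eq1 = memcmp(s, p, 4) == 0;         // equation 1: s[0..3] == p[0..3]
    int eq2 = memcmp(s + 1, p + 1, 3) == 0; // equation 2: s[1..3] == p[1..3]
    int eq3 = memcmp(p + 1, p, 3) == 0;     // equation 3: p[1..3] == p[0..2]
    int eq4 = memcmp(s + 1, p, 3) == 0;     // equation 4: s[1..3] == p[0..2]
    return eq1 && eq2 && eq3 && eq4;
}
```

Since the fourth equality holds, the three comparisons s[1..3] against p[0..2] in the next round are redundant, and the comparison can resume at s[4] and p[3].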
3. KMP Algorithm
KMP improves on the conventional string matching algorithm (match_2) as follows: after a mismatch, the i pointer does not move, and the j pointer is set to an appropriate position from which the next round of matching starts.
Assume that these appropriate positions have already been saved in an array next[n], where n = strlen(p). Then the KMP code is:
// Assumes next[] has already been filled in for p.
int match_3(const char *s, const char *p)
{
    int slen = strlen(s), plen = strlen(p);
    int i = 0, j = 0;
    while (i < slen && j < plen)
    {
        if (s[i] == p[j]) // match
        {
            i++;
            j++;
        }
        else // mismatch
        {
            // i = i - j + 1; the i pointer stays where it is and is not rolled back
            j = next[j]; // the j pointer resumes from a specific position
        }
    }
    if (j == plen) return i - j; // match successful
    else return -1;
}
The change is in the else branch: the i pointer is not rolled back, and the j pointer is not reset to zero but set to a specific value. Does it look simpler than the naive matching algorithm? It seems so, but the question is what that specific value is. In other words, how do we compute the next array? Let's look at the next array now.
From the analysis at the end of section 2 (equations ①, ②, and ③), we know that when s[i] != p[j] (a mismatch), the part already matched is
s[i-j..i-1] = p[0..j-1]    (4)
That is, the j characters before s[i] equal the j characters before p[j].
If the next round is to resume from p[k] without backtracking i, the following must hold: the first k characters of P match the k characters before s[i] (here k = next[j]).
The first k characters of P are: p[0..k-1]
The k characters before s[i] are: s[i-k..i-1]
That is,
p[0..k-1] = s[i-k..i-1]    (5)
From (4):
s[i-k..i-1] = p[j-k..j-1]    (6)
That is, the last k characters of s[i-j..i-1] equal the last k characters of p[0..j-1].
From (5) and (6):
p[0..k-1] = p[j-k..j-1]    (7)
If you are seeing this for the first time, it may be a little dizzying. Let's review the problem. A mismatch occurs at p[j] (s[i] != p[j]); i does not move, and j is not reset to zero but obtained from k = next[j]. This k must satisfy (7).
What does (7) mean? The left side is the first k characters of P (a prefix), and the right side is the last k characters of p[0..j-1] (a suffix). Among all k that satisfy (7), the largest is the one to use. Why? The pattern slides right by j - k positions, so the largest k gives the smallest slide, which guarantees that no possible match is skipped while reusing as many already-matched characters as possible. At the same time, the prefix must be shorter than p[0..j-1] itself: k <= j-1, that is, k < j.
For ease of understanding, take P = "000011" as an example:
j = 0: ?
j = 1: before p[1] there is only "0", which has no shorter prefix, so next[1] = 0
j = 2: before p[2] is "00", p[0] = p[1], next[2] = 1
j = 3: before p[3] is "000", p[0..1] = p[1..2], next[3] = 2
j = 4: before p[4] is "0000", p[0..2] = p[1..3], next[4] = 3
j = 5: before p[5] is "00001", no prefix equals a suffix, next[5] = 0
What about j = 0? Consider the most basic situation: s[i] != p[0]. The very first character does not match, so there is nothing to jump to; just start the next round. A special value can therefore be stored here to mark a mismatch at the first character (j = 0). It is usually set to -1, which simplifies the program: when j == -1, the next round begins with (i++; j++), and j++ yields exactly 0. If you insisted on -9 instead, you would have to test for j == -9 and begin the next round with (i++; j = 0). You will also notice that for any P of length 2 or more, next[1] = 0 always holds.
Another example:
The next array next[9] of P = "abababacd":
j = 0: k = next[0] = -1
j = 1: k = next[1] = 0
j = 2: before p[2] is "ab"; its only prefix 'a' != its suffix 'b', so k = next[2] = 0
j = 3: before p[3] is "aba", p[0] = p[2], k = next[3] = 1
j = 4: before p[4] is "abab", p[0..1] = p[2..3], k = next[4] = 2
j = 5: before p[5] is "ababa", p[0..2] = p[2..4], k = next[5] = 3
j = 6: before p[6] is "ababab", p[0..3] = p[2..5], k = next[6] = 4
j = 7: before p[7] is "abababa", p[0..4] = p[2..6], k = next[7] = 5
j = 8: before p[8] is "abababac"; because the last letter is 'c', no prefix equals a suffix, so k = next[8] = 0
I believe that through these two examples, you should understand the next array.
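The two tables can be reproduced directly from the definition: next[0] = -1, and for j > 0, next[j] is the largest k < j for which the first k characters of P equal the k characters just before p[j]. The sketch below does exactly that by trying candidates from the longest down. The name build_next_by_definition is hypothetical, and this is a deliberately slow illustration of the definition, not the author's getnext().

```c
#include <string.h>

// Hypothetical helper: fills next[0..m-1] for pattern p straight from
// the definition. The caller must provide room for strlen(p) ints.
void build_next_by_definition(const char *p, int *next)
{
    int m = (int)strlen(p);
    for (int j = 0; j < m; j++)
    {
        if (j == 0)
        {
            next[j] = -1;          // special marker for a mismatch at p[0]
            continue;
        }
        int k = j - 1;             // longest candidate first, since k < j
        while (k > 0 && memcmp(p, p + j - k, (size_t)k) != 0)
            k--;                   // p[0..k-1] != p[j-k..j-1], try shorter
        next[j] = k;               // k == 0 means no prefix equals a suffix
    }
}
```

For "000011" this fills in -1, 0, 1, 2, 3, 0, and for "abababacd" it fills in -1, 0, 0, 1, 2, 3, 4, 5, 0, matching the two examples.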
Since the value -1 appears in the next array, the function needs a small modification:
// Requires <stdlib.h> and <string.h>; getnext() fills next[] for p.
int kmp(const char *s, const char *p)
{
    int slen = strlen(s), plen = strlen(p);
    int *next = (int *)malloc(sizeof(int) * plen);
    getnext(p, next);
    int i = 0, j = 0;
    while (i < slen && j < plen)
    {
        if (j == -1 || s[i] == p[j]) // when j == -1, start the next round
        {
            i++;
            j++;
        }
        else
        {
            j = next[j]; // i does not backtrack; j falls back to next[j]
        }
    }
    free(next);
    if (j == plen) return i - j;
    else return -1;
}
I also changed the function name to kmp. This is the final form of the KMP algorithm (unoptimized; I will discuss optimization later). The core logic takes fewer than 10 lines.
The getnext() function computes the next array. If you understand how the next array is calculated, you can try writing it yourself. It is only about 10 lines of code, and it is surprisingly similar to the kmp function above. The idea and code of getnext() are left to the next blog post.
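Until getnext() is available, the kmp function can still be tried out with a stand-in. The sketch below pairs the same matching loop with a slow, definition-based table builder. The names getnext_by_definition and kmp_sketch are hypothetical, the stand-in is not the author's getnext(), and the handling of an empty pattern and of allocation failure are assumptions of this sketch.

```c
#include <stdlib.h>
#include <string.h>

// Stand-in for getnext(): next[0] = -1, and next[j] is the largest
// k < j with p[0..k-1] == p[j-k..j-1]. Slow, but follows the definition.
static void getnext_by_definition(const char *p, int *next)
{
    int m = (int)strlen(p);
    for (int j = 0; j < m; j++)
    {
        int k = j - 1;             // for j == 0 this is already -1
        while (k > 0 && memcmp(p, p + j - k, (size_t)k) != 0)
            k--;
        next[j] = k;
    }
}

int kmp_sketch(const char *s, const char *p)
{
    int slen = (int)strlen(s), plen = (int)strlen(p);
    if (plen == 0) return 0;       // assumption: empty pattern matches at 0
    int *next = malloc(sizeof(int) * (size_t)plen);
    if (next == NULL) return -1;   // simplification: treat failure as no match
    getnext_by_definition(p, next);
    int i = 0, j = 0;
    while (i < slen && j < plen)
    {
        if (j == -1 || s[i] == p[j])
        {
            i++;                   // advance both; j == -1 becomes 0
            j++;
        }
        else
        {
            j = next[j];           // i never moves backward
        }
    }
    free(next);
    return (j == plen) ? i - j : -1;
}
```

For example, kmp_sketch("abcdabce", "cda") gives 2, kmp_sketch("abcdabce", "cdd") gives -1, and kmp_sketch("0000000000001", "00001") gives 8.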