Abstract: KMP is a classic string matching algorithm. Thanks to its O(m + n) time complexity, it is still widely used. The algorithm itself is very concise, but the reasoning behind it is subtle, so many people never fully understand it. This article aims to unravel the inner workings of the KMP algorithm and to help you learn and understand it.
1. KMP Algorithm
KMP is an improved string matching algorithm that was discovered at the same time by D. E. Knuth, V. R. Pratt, and J. H. Morris, and is named after them. It completes pattern matching on strings in O(n + m) time, where n is the length of the target string and m is the length of the pattern. The basic idea is that the pointer into the target string never has to backtrack when a mismatch occurs during matching; instead, the "partially matched" result is used to "slide" the pattern as far to the right as possible before the comparison continues.
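To make the term "backtracking" concrete, here is a minimal sketch of the naive matcher that KMP improves on. It is not part of KMP itself; the function name naive_index and the 0-based C string convention are assumptions made for illustration. On a mismatch the text pointer i is moved back, which is exactly the step KMP eliminates.

```c
#include <string.h>

/* Naive search (illustrative only): returns the 0-based index of the
 * first occurrence of pattern t in text s, or -1 if there is none. */
int naive_index(const char *s, const char *t)
{
    size_t n = strlen(s), m = strlen(t);
    size_t i = 0, j = 0;

    while (i < n && j < m) {
        if (s[i] == t[j]) {
            ++i;
            ++j;
        } else {
            i = i - j + 1;   /* backtrack: the text pointer moves back */
            j = 0;           /* and the pattern restarts from its first character */
        }
    }
    return (j == m) ? (int)(i - m) : -1;
}
```

Because i can move back by up to m - 1 positions after each mismatch, the worst case of this version is O(n × m) comparisons.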
2. Algorithm Based on Finite Automaton
The KMP algorithm looks simple, but it is hard to understand fully. It can be regarded as a finite automaton and divided into two parts: the first part is the construction of the automaton (which corresponds to the failure function, also called the transfer function or overlap function), and the second part is the search performed on the automaton. For example, let the target string be T = acabaabaabcacaabc and the pattern string be P = abaabcac. The automaton is constructed from the pattern string: a forward arrow indicates the direction of the search, and a backward arrow indicates the fallback taken on a mismatch, that is, the failure function (state transition function). The values F(j) listed here equal the length of the longest proper prefix of the first j characters of P that is also a suffix of those j characters. For example:
F(j = 1) = 0;
F(j = 2) = 0;
F(j = 3) = 1;
F(j = 4) = 1;
F(j = 5) = 2;
F(j = 6) = 0;
F(j = 7) = 1;
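The values listed above can be checked directly against the definition of a longest proper prefix that is also a suffix. The following is a brute-force sketch written only for verification; the helper name border_len is hypothetical, and it is deliberately not the linear-time method that KMP uses.

```c
#include <string.h>

/* Brute-force check (hypothetical helper): length of the longest proper
 * prefix of the first j characters of p that is also a suffix of them.
 * Assumes 0 <= j <= strlen(p). */
int border_len(const char *p, int j)
{
    for (int k = j - 1; k > 0; --k) {
        /* compare p[0..k-1] with the last k of the first j characters */
        if (memcmp(p, p + j - k, (size_t)k) == 0)
            return k;
    }
    return 0;
}
```

For P = abaabcac, calling border_len(P, j) for j = 1 to 7 gives 0, 0, 1, 1, 2, 0, 1, matching the list of F values.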
In essence, KMP constructs a DFA and simulates it. It is therefore clear that once the automaton D has been constructed from the pattern P, matching the target string T against D takes linear time. The most fascinating part of KMP is the construction of D by matching the pattern against itself, which makes full use of the acyclic structure of the failure links of D (each link points to an earlier state) and makes the construction linear as well. The KMP algorithm does not need to compute the full transition function. It uses only the auxiliary array next, that is, the feature vector of the pattern string itself. The feature vector is obtained by comparing the pattern with itself and can be computed in advance. It can be used to speed up both the string matching algorithm and the finite automaton matching algorithm.
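For contrast, the explicit transition function that KMP avoids can be sketched as follows. This is an illustration, not the method of this article: it assumes a 256-symbol byte alphabet, and the names build_dfa, dfa_index, and ALPHABET are hypothetical. The table needs (m + 1) × 256 entries, which is the cost that the next array saves.

```c
#include <stdlib.h>
#include <string.h>

#define ALPHABET 256   /* assumed alphabet: all byte values */

/* Builds the full transition table for a non-empty pattern p.
 * Entry [q * ALPHABET + c] is the state reached from state q on byte c.
 * Returns NULL on allocation failure; the caller frees the table. */
int *build_dfa(const char *p)
{
    int m = (int)strlen(p);
    int *dfa = calloc((size_t)(m + 1) * ALPHABET, sizeof *dfa);
    if (dfa == NULL)
        return NULL;

    dfa[(unsigned char)p[0]] = 1;   /* state 0: only p[0] advances */
    int x = 0;                      /* restart state: state after reading p[1..j-1] */
    for (int j = 1; j <= m; ++j) {
        /* mismatch transitions of state j are copied from the restart state */
        memcpy(dfa + (size_t)j * ALPHABET, dfa + (size_t)x * ALPHABET,
               ALPHABET * sizeof *dfa);
        if (j < m) {
            unsigned char c = (unsigned char)p[j];
            dfa[(size_t)j * ALPHABET + c] = j + 1;   /* match transition */
            x = dfa[(size_t)x * ALPHABET + c];       /* advance the restart state */
        }
    }
    return dfa;
}

/* Returns the 0-based index of the first occurrence of p in s, or -1
 * if there is none (also -1 if the table cannot be allocated). */
int dfa_index(const char *s, const char *p)
{
    int m = (int)strlen(p);
    if (m == 0)
        return 0;
    int *dfa = build_dfa(p);
    if (dfa == NULL)
        return -1;

    int state = 0, result = -1;
    for (int i = 0; s[i] != '\0'; ++i) {
        state = dfa[(size_t)state * ALPHABET + (unsigned char)s[i]];
        if (state == m) {           /* accepting state: whole pattern read */
            result = i - m + 1;
            break;
        }
    }
    free(dfa);
    return result;
}
```

The search loop reads each text character exactly once and never moves backward, which is why matching against D is linear.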
3. Construction of the next Feature Array
Any substring that starts at the first character of the pattern string P is called a prefix substring, such as p0 p1 ... pk-1. The k characters that end at position i of P, that is, pi-k+1 ... pi-2 pi-1 pi, form the left substring of position i. Find the longest prefix substring (maximum k, with k ≤ i so that it is not the whole of p0 ... pi) that equals the left substring; it is the longest prefix string at position i. Its length k is the feature number n[i] of the pattern string P at position i, and the vector formed by the feature numbers of all positions is called the feature vector of the pattern string.
It can be proved that for any pattern string P = p0p1...pm-1 there is an array next that is uniquely determined by the pattern string itself and is independent of the target string. It is calculated as follows:
(1) In p0...pi-1, find the maximum length k of a proper prefix that equals a suffix;
(2) next[i] = k;
In the special case i = 0, next[0] = -1. Clearly, next[i] < i for every i (0 ≤ i < m). Once next[i] has been calculated, how is next[i + 1] obtained? For i > 0, next[i] is simply the feature number of the previous position, next[i] = n[i-1], so it suffices to compute the feature numbers. The feature number ni (0 ≤ ni ≤ i) is defined recursively as follows:
(1) n[0] = 0; for n[i] with i > 0, assume that the feature number of the previous position is known, n[i-1] = k;
(2) If pi = pk, then n[i] = k + 1;
(3) While pi ≠ pk and k ≠ 0, set k = n[k-1]; repeat (3) until the condition no longer holds, then apply (2) or (4);
(4) If pi ≠ pk and k = 0, then n[i] = 0;
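The recursive definition in steps (1) to (4) can be sketched in C as follows. The sketch uses 0-based indexing and takes n[0] = 0, which is the value the recursion in steps (2) to (4) needs in order to start; the function names feature_vector and next_from_features are hypothetical.

```c
/* Computes the feature vector of p[0..m-1]:
 * n[i] = length of the longest proper prefix of p[0..i] that is also
 * a suffix of p[0..i]. */
void feature_vector(const char *p, int m, int n[])
{
    if (m <= 0)
        return;
    n[0] = 0;                          /* step (1) */
    for (int i = 1; i < m; ++i) {
        int k = n[i - 1];              /* feature number of the previous position */
        while (k > 0 && p[i] != p[k])  /* step (3): fall back to a shorter prefix */
            k = n[k - 1];
        /* step (2) if the characters match, step (4) otherwise */
        n[i] = (p[i] == p[k]) ? k + 1 : 0;
    }
}

/* Derives the 0-based next array: next[0] = -1, next[i] = n[i-1]. */
void next_from_features(const int n[], int m, int next[])
{
    if (m <= 0)
        return;
    next[0] = -1;
    for (int i = 1; i < m; ++i)
        next[i] = n[i - 1];
}
```

For P = abaabcac this gives n = 0, 0, 1, 1, 2, 0, 1, 0 and next = -1, 0, 0, 1, 1, 2, 0, 1. Each execution of the inner loop decreases k, and k grows by at most 1 per position, so the total work is linear in m.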
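Based on the above analysis, we can obtain the method for calculating the next feature array. The code uses a 1-based convention: T[0] stores the length of the pattern string, the characters are T[1] to T[T[0]], and next[1] = 0 plays the role that next[0] = -1 plays in the 0-based description. The algorithm code is as follows: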
void get_next(SString T, int next[])
{
    // Compute the next function values of the pattern string T and store them in the array next.
    // T[0] holds the length of T; next must have room for indices 1..T[0].
    int i = 1, j = 0;
    next[1] = 0;
    while (i < T[0])
    {
        if (j == 0 || T[i] == T[j])
        {
            ++i; ++j; next[i] = j;
        }
        else
        {
            j = next[j];   // fall back to the next shorter matching prefix
        }
    }
}
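Reference [5] points out that the above calculation method has a defect: it can lead to redundant comparisons. When T[i] equals T[next[i]], a mismatch at position i is followed by a comparison against T[next[i]] that is bound to fail as well. The method can be corrected to obtain the following algorithm: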
void get_next(SString T, int next[])
{
    // Compute the corrected next function values of the pattern string T and store them in the array next.
    // T[0] holds the length of T; next must have room for indices 1..T[0].
    int i = 1, j = 0;
    next[1] = 0;
    while (i < T[0])
    {
        if (j == 0 || T[i] == T[j])
        {
            ++i; ++j;
            if (T[i] != T[j])
                next[i] = j;
            else
                next[i] = next[j];   // T[i] == T[j]: skip the comparison that would fail again
        }
        else
        {
            j = next[j];
        }
    }
}
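To see what the correction changes, the two versions can be run side by side. The following is a self-contained sketch under stated assumptions: the SString layout (length in element 0), the constant MAXSTRLEN, the helper str_assign, and the name get_nextval for the corrected version are all introduced here for illustration and are not defined in this article.

```c
#include <string.h>

#define MAXSTRLEN 255
/* Assumed layout: t[0] is the length, t[1..t[0]] are the characters. */
typedef unsigned char SString[MAXSTRLEN + 1];

/* Hypothetical helper: copies a C string into the 1-based layout. */
void str_assign(SString t, const char *chars)
{
    size_t len = strlen(chars);
    if (len > MAXSTRLEN)
        len = MAXSTRLEN;               /* truncate overlong input */
    t[0] = (unsigned char)len;
    memcpy(t + 1, chars, len);
}

/* Uncorrected version: next[i] = j unconditionally. */
void get_next(const SString t, int next[])
{
    int i = 1, j = 0;
    next[1] = 0;
    while (i < t[0]) {
        if (j == 0 || t[i] == t[j]) {
            ++i; ++j;
            next[i] = j;
        } else {
            j = next[j];
        }
    }
}

/* Corrected version: when t[i] == t[j], reuse the fallback of j. */
void get_nextval(const SString t, int nextval[])
{
    int i = 1, j = 0;
    nextval[1] = 0;
    while (i < t[0]) {
        if (j == 0 || t[i] == t[j]) {
            ++i; ++j;
            nextval[i] = (t[i] != t[j]) ? j : nextval[j];
        } else {
            j = nextval[j];
        }
    }
}
```

For the pattern aaaab, the uncorrected version yields 0, 1, 2, 3, 4 for positions 1 to 5, while the corrected version yields 0, 0, 0, 0, 4. A mismatch on any of the first four characters therefore skips straight past the run of identical characters instead of retrying each of them. Both arrays must have room for indices 1 to t[0].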
4. Algorithm Implementation
The difficulty of the KMP algorithm lies in the construction of the finite automaton and the calculation of the feature vector. Once these two problems are solved, the exact matching algorithm is simple.
int Index_KMP(SString S, SString T, int pos)
{
    // Use the next function of the pattern string T to find the position of the first
    // occurrence of T in the main string S at or after the pos-th character; return 0 if there is none.
    // T is not empty, and 1 ≤ pos ≤ StrLength(S).
    // The array next is assumed to have been filled in beforehand by get_next(T, next).
    int i = pos, j = 1;
    while (i <= S[0] && j <= T[0])
    {
        if (j == 0 || S[i] == T[j]) { ++i; ++j; }   // compare subsequent characters
        else j = next[j];                            // move the pattern string to the right
    }
    if (j > T[0]) return i - T[0];   // match successful
    else return 0;
} // Index_KMP
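For readers who prefer ordinary C strings, the same search can be sketched with 0-based indexing, where next[0] = -1 takes the place of next[1] = 0. This is a minimal sketch rather than a reference implementation: the function name kmp_index is hypothetical, the next array is built inside the function for self-containment, and -1 is returned both for "not found" and for an allocation failure.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the 0-based index of the first occurrence of pattern t in
 * text s, or -1 if there is none (also -1 if malloc fails). */
int kmp_index(const char *s, const char *t)
{
    int n = (int)strlen(s), m = (int)strlen(t);
    if (m == 0)
        return 0;

    int *next = malloc((size_t)m * sizeof *next);
    if (next == NULL)
        return -1;

    /* Build next by matching the pattern against itself. */
    int i = 0, j = -1;
    next[0] = -1;
    while (i < m - 1) {
        if (j == -1 || t[i] == t[j]) {
            ++i; ++j;
            next[i] = j;
        } else {
            j = next[j];
        }
    }

    /* Search: i only moves forward; on a mismatch only j falls back. */
    i = 0;
    j = 0;
    while (i < n && j < m) {
        if (j == -1 || s[i] == t[j]) {
            ++i; ++j;
        } else {
            j = next[j];
        }
    }

    free(next);
    return (j == m) ? i - m : -1;
}
```

With the example strings from section 2, kmp_index("acabaabaabcacaabc", "abaabcac") returns 5, that is, the pattern starts at the sixth character of the target string.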
For the theoretical analysis and proof of the algorithm, as well as its complexity analysis, see [3], [4], and [5].
5. References
[1] http://wansishuang.javaeye.com/blog/402018
[2] http://richard dxx.yo2.cn/articles/kmpand extend-kmpalgorithm .html
[3] KMP algorithm handout PPT (Hu Junfeng, Peking University)
[4] Introduction to Algorithms (Chapter 1 string matching)
[5] Data Structure (Chapter 1)