Preface
We know the principle of the KMP algorithm: for each P[0] ... P[i], find the maximum length k of a prefix that is also a suffix. But how is this maximum length actually computed? I find that many posts on the Internet do not explain it clearly, and I always felt I was one thin layer away from really understanding it. Later I read Introduction to Algorithms. Chapter 32, on string matching, proves the correctness of the prefix computation, but the proof relies on a lot of reasoning that is hard to follow and is not tied to the program. Today I would like to share my own understanding here. Advice is welcome; if you have questions or find errors, please leave a message.
1. Principles of the KMP algorithm:
This section is reproduced from: http://www.ruanyifeng.com/blog/2013/05/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm.html
String matching is one of the basic tasks of a computer.
For example, given the string "BBC ABCDAB ABCDABCDABDE", I want to know whether it contains another string, "ABCDABD".
Many algorithms can do this. The Knuth-Morris-Pratt algorithm (KMP) is one of the most commonly used. It is named after its three inventors; the K stands for the famous computer scientist Donald Knuth.
This algorithm is not easy to understand. There are many explanations on the Internet, but they are hard to read. I did not really understand it until I read an article by Jake Boxer. Below I try to explain the KMP algorithm in my own words, in a way that is easier to follow.
1.
First, the first character of the string "BBC ABCDAB ABCDABCDABDE" is compared with the first character of the search term "ABCDABD". Because B does not match A, the search term is shifted one position to the right.
2.
Because B again does not match A, the search term is shifted once more.
3.
This continues until a character in the string is the same as the first character of the search term.
4.
Then the next character of the string is compared with the next character of the search term; they are also the same.
5.
This continues until a character in the string differs from the corresponding character of the search term.
6.
At this point, the most natural reaction is to shift the whole search term one position and compare again, character by character, from the beginning. This works, but it is inefficient, because the search position is moved back to characters that have already been compared and the comparisons are repeated.
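The "natural reaction" described in step 6 can be sketched as follows. This is only a minimal illustration, not code from the original article: the function name naive_search and its return convention (index of the first match, or -1) are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the index of the first occurrence of
 * pattern in text, or -1 if there is none. */
long naive_search(const char *text, const char *pattern)
{
    assert(text != NULL && pattern != NULL);
    size_t n = strlen(text);
    size_t m = strlen(pattern);

    if (m > n)
        return -1;

    /* Try every alignment of the pattern against the text. */
    for (size_t shift = 0; shift + m <= n; ++shift) {
        size_t j = 0;
        /* Each alignment starts again from the first pattern character,
         * so text characters that were already compared get re-read. */
        while (j < m && text[shift + j] == pattern[j])
            ++j;
        if (j == m)
            return (long)shift;
    }
    return -1;
}
```

After a mismatch the inner loop throws away everything it learned about the text, which is exactly the repeated work that KMP avoids.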
7.
A basic fact is that when the space does not match D, you already know that the first six characters are "ABCDAB". The idea of the KMP algorithm is to make use of this known information: instead of moving the search position back to characters that have already been compared, it keeps moving forward, which improves efficiency.
8.
How can this be done? Compute a Partial Match Table for the search term. How this table is generated is explained later; for now we only need to use it.
9.
When the space and D do not match, the first six characters "ABCDAB" have already matched. The table shows that the partial match value of the last matched character, B, is 2, so the number of positions to shift is calculated with the following formula:
Shift = number of matched characters - partial match value
Because 6 - 2 equals 4, the search term is shifted four positions to the right.
10.
Because the space does not match C, the search term must be shifted again. Now the number of matched characters is 2 ("AB") and the corresponding partial match value is 0. So the shift is 2 - 0 = 2, and the search term is shifted two positions to the right.
11.
Because the space does not match A, the search term is shifted one more position.
12.
Compare character by character until C and D are found not to match. The shift is then 6 - 2 = 4, and the search term is shifted four positions to the right.
13.
Compare character by character until the last character of the search term matches; the match is complete and the search is done. To continue searching (that is, to find all matches), the shift is 7 - 0 = 7, and the search term is shifted seven positions to the right.
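The shift computations in steps 9 to 13 all use the same formula, which can be written down directly. The sketch below is illustrative only: the name shift_amount and the hard-coded table for "ABCDABD" are assumptions made for this example, and the table values are the ones derived in the next part of the article.

```c
#include <assert.h>

/* Partial match table of "ABCDABD", one value per character. */
static const int ABCDABD_TABLE[7] = {0, 0, 0, 0, 1, 2, 0};

/* Hypothetical helper: how far to shift the search term after `matched`
 * characters (matched >= 1) have been matched successfully. */
int shift_amount(const int table[], int matched)
{
    assert(matched >= 1);
    /* table[matched - 1] belongs to the last character that matched. */
    return matched - table[matched - 1];
}
```

With this table, shift_amount(ABCDABD_TABLE, 6) gives 4, shift_amount(ABCDABD_TABLE, 2) gives 2, and shift_amount(ABCDABD_TABLE, 7) gives 7, matching steps 9, 10 and 13. When not even the first character matches (step 11), the formula does not apply and the search term simply moves one position.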
14.
The following describes how the Partial Match Table is generated.
First, you need to understand two concepts: "prefix" and "suffix". The "prefixes" of a string are all of its leading substrings that exclude the last character; the "suffixes" are all of its trailing substrings that exclude the first character.
15.
The partial match value is the length of the longest element shared by the prefixes and the suffixes. Take "ABCDABD" as an example:
- The prefixes and suffixes of "A" are both empty sets, so the length of the common element is 0;
- The prefixes of "AB" are [A] and the suffixes are [B]; the length of the common element is 0;
- The prefixes of "ABC" are [A, AB] and the suffixes are [BC, C]; the length of the common element is 0;
- The prefixes of "ABCD" are [A, AB, ABC] and the suffixes are [BCD, CD, D]; the length of the common element is 0;
- The prefixes of "ABCDA" are [A, AB, ABC, ABCD] and the suffixes are [BCDA, CDA, DA, A]; the common element is "A", with length 1;
- The prefixes of "ABCDAB" are [A, AB, ABC, ABCD, ABCDA] and the suffixes are [BCDAB, CDAB, DAB, AB, B]; the common element is "AB", with length 2;
- The prefixes of "ABCDABD" are [A, AB, ABC, ABCD, ABCDA, ABCDAB] and the suffixes are [BCDABD, CDABD, DABD, ABD, BD, D]; the length of the common element is 0.
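The list above can be reproduced mechanically by following the definition literally. The sketch below does that by brute force; it is not the efficient method discussed later, and the names partial_match_value and brute_force_table are assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

/* Length of the longest prefix of s[0..len-1] (excluding the whole
 * string) that is also a suffix of it. */
int partial_match_value(const char *s, int len)
{
    assert(s != NULL && len >= 1);
    /* Try the longest candidate first, then shorter ones. */
    for (int k = len - 1; k >= 1; --k) {
        /* Prefix s[0..k-1] versus suffix s[len-k..len-1]. */
        if (memcmp(s, s + (len - k), (size_t)k) == 0)
            return k;
    }
    return 0;
}

/* Fills table[i] with the partial match value of pattern[0..i]. */
void brute_force_table(const char *pattern, int table[])
{
    assert(pattern != NULL && table != NULL);
    int m = (int)strlen(pattern);
    for (int i = 0; i < m; ++i)
        table[i] = partial_match_value(pattern, i + 1);
}
```

For "ABCDABD" this fills the table with 0 0 0 0 1 2 0, the same values as in the list. The cost is roughly cubic in the pattern length in the worst case, which is acceptable for checking small examples by hand but not for real use.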
16.
The essence of "partial matching" is that the head and the tail of a string sometimes repeat. For example, "ABCDAB" contains "AB" twice, so its partial match value is 2 (the length of "AB"). When the search term is shifted, moving the first "AB" four positions to the right (string length - partial match value) brings it to the position of the second "AB".
2. Computing the next array
With the above, the principle of the KMP algorithm should be clear. The next step is to implement it, and the most important part is computing, for every position of the pattern string to be matched, the maximum length of a prefix that is also a suffix. Here is my code first:
#include <string.h>

void makeNext(const char P[], int next[])
{
    int q, k;                       // q: index into the pattern; k: longest equal prefix/suffix length
    int m = (int)strlen(P);         // length of the pattern
    next[0] = 0;                    // the longest equal prefix/suffix of the first character is 0
    for (q = 1, k = 0; q < m; ++q)  // starting from the second character, compute next for each position
    {
        while (k > 0 && P[q] != P[k])  // fall back until P[q] can extend an equal prefix/suffix of P[0] ... P[q-1]
            k = next[k-1];          // this while loop is the essence of the code and the hard part; see the analysis below
        if (P[q] == P[k])           // if they are equal, the longest equal prefix/suffix length grows by 1
        {
            k++;
        }
        next[q] = k;
    }
}
Now I will focus on what the while loop does:
- It is known that the longest equal prefix/suffix length from the previous step is k (k > 0), that is, the prefix P[0] ... P[k-1];
- Now compare P[k] with P[q], as shown in Figure 1.
- If P[k] equals P[q], it is simple: the while loop exits;
- The key point: what if they are not equal? Then we use the values next[0] ... next[k-1] that have already been computed to find the longest equal prefix/suffix inside the substring P[0] ... P[k-1]. Some people may ask: why look for the longest equal prefix/suffix of P[0] ... P[k-1]? Because P[k] failed to match P[q], while P[q-k] ... P[q-1] is identical to P[0] ... P[k-1]. The full-length substring P[0] ... P[k-1] is therefore of no use, so we look for a shorter substring that also starts at P[0] and also ends at P[k-1], namely P[0] ... P[j-1] with j = next[k-1], and check whether the next character P[j] matches P[q], as shown in Figure 2.
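To watch the fallback happen, the same loop can be instrumented with a counter. This is only a sketch for experimenting: the name make_next_counting, the returned counter, and the sample pattern "ABACABAB" are assumptions made for this example and do not come from the original code.

```c
#include <assert.h>
#include <string.h>

/* Same loop as makeNext, plus a counter for the fallback steps.
 * Returns how many times k = next[k-1] was executed. */
int make_next_counting(const char *p, int next[])
{
    assert(p != NULL && next != NULL);
    int m = (int)strlen(p);
    int fallbacks = 0;

    if (m == 0)
        return 0;

    next[0] = 0;
    for (int q = 1, k = 0; q < m; ++q) {
        while (k > 0 && p[q] != p[k]) {
            k = next[k - 1];   /* retry with the next shorter candidate */
            ++fallbacks;
        }
        if (p[q] == p[k])
            ++k;
        next[q] = k;
    }
    return fallbacks;
}
```

For "ABACABAB", the interesting step is q = 7. At that point k is 3 (the prefix "ABA"), but p[7] is 'B' and p[3] is 'C', so the length-3 candidate cannot be extended. The loop falls back to k = next[2] = 1, where p[1] is 'B' and matches, giving next[7] = 2.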
Code:
#include <stdio.h>
#include <string.h>

void makeNext(const char P[], int next[])
{
    int q, k;
    int m = (int)strlen(P);
    next[0] = 0;
    for (q = 1, k = 0; q < m; ++q)
    {
        while (k > 0 && P[q] != P[k])
            k = next[k-1];
        if (P[q] == P[k])
        {
            k++;
        }
        next[q] = k;
    }
}

void kmp(const char T[], const char P[], int next[])
{
    int n, m;
    int i, q;
    n = (int)strlen(T);
    m = (int)strlen(P);
    makeNext(P, next);
    for (i = 0, q = 0; i < n; ++i)
    {
        while (q > 0 && P[q] != T[i])
            q = next[q-1];
        if (P[q] == T[i])
        {
            q++;
        }
        if (q == m)
        {
            printf("Pattern occurs with shift:%d\n", (i-m+1));
        }
    }
}

int main()
{
    int i;
    int next[20] = {0};
    char T[] = "ababxbababcadfdsss";
    char P[] = "abcdabd";
    int m = (int)strlen(P);
    printf("%s\n", T);
    printf("%s\n", P);
    // makeNext(P, next);  // not needed here: kmp() builds the next array itself
    kmp(T, P, next);
    for (i = 0; i < m; ++i)
    {
        printf("%d ", next[i]);
    }
    printf("\n");

    return 0;
}
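The program above prints every match. If you would rather get the position back as a value, one possible variant is sketched below. It is an illustration under stated assumptions, not part of the original program: the name kmp_find_first, the return codes (-1 for no match, -2 for allocation failure) and the choice to treat an empty pattern as matching at index 0 are all made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant: returns the index of the first match, -1 if
 * there is none, and -2 if memory for the next array is unavailable. */
long kmp_find_first(const char *text, const char *pattern)
{
    assert(text != NULL && pattern != NULL);
    size_t n = strlen(text);
    size_t m = strlen(pattern);
    if (m == 0)
        return 0;              /* assumption: empty pattern matches at 0 */

    size_t *next = malloc(m * sizeof *next);
    if (next == NULL)
        return -2;

    /* Build the next array with the same loop as makeNext. */
    next[0] = 0;
    for (size_t q = 1, k = 0; q < m; ++q) {
        while (k > 0 && pattern[q] != pattern[k])
            k = next[k - 1];
        if (pattern[q] == pattern[k])
            ++k;
        next[q] = k;
    }

    /* Scan the text once; i never moves backwards. */
    long result = -1;
    for (size_t i = 0, q = 0; i < n; ++i) {
        while (q > 0 && pattern[q] != text[i])
            q = next[q - 1];
        if (pattern[q] == text[i])
            ++q;
        if (q == m) {
            result = (long)(i + 1 - m);
            break;
        }
    }
    free(next);
    return result;
}
```

On the example from section 1, kmp_find_first("BBC ABCDAB ABCDABCDABDE", "ABCDABD") returns 15. Allocating the next array inside the function removes the fixed next[20] limit of the test program, at the cost of one malloc per search.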
3. KMP Optimization
To be continued ....