Objective
I had understood the principle of the KMP algorithm for a while: for each position i, it needs the maximum length k of a prefix of P[0]...P[i] that is also a suffix of it. The question is how to compute that maximum prefix length. Many posts on the Internet do not explain this clearly and never quite get to the heart of the matter. I then turned to Introduction to Algorithms, Chapter 32 (String Matching). It proves that the prefix computation is correct, but the long chain of reasoning is hard to follow and is not explained in terms of the program. Here I give my own understanding. Comments and corrections are welcome; if anything is unclear or wrong, please leave me a message.
1. The principle of the KMP algorithm:
This section is reproduced from: http://www.ruanyifeng.com/blog/2013/05/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm.html
String matching is one of the basic tasks of a computer.
For example, given the string "BBC ABCDAB ABCDABCDABDE", I want to know whether it contains another string, "ABCDABD".
Many algorithms can accomplish this task, and the Knuth-Morris-Pratt algorithm (KMP for short) is one of the most common. It is named after its three inventors; the K stands for the famous computer scientist Donald Knuth.
This algorithm is not easy to understand. There are many explanations on the Internet, but they are laborious to read. I did not really understand the algorithm until I read Jake Boxer's article. Below, I try to write a relatively easy-to-understand explanation of the KMP algorithm in my own words.
1.
First, the first character of the string "BBC ABCDAB ABCDABCDABDE" is compared with the first character of the search term "ABCDABD". Because B does not match A, the search term is shifted one position to the right.
2.
Because B again does not match A, the search term is shifted one more position to the right.
3.
This continues until a character of the string is the same as the first character of the search term.
4.
The next character of the string is then compared with the next character of the search term, and they are the same as well.
5.
This continues until a character of the string differs from the corresponding character of the search term.
6.
At this point, the most natural response is to shift the whole search term one position to the right and compare again from its first character. This works, but it is inefficient, because the "search position" is moved back to positions that have already been compared and those characters are compared all over again.
7.
A basic fact is that when the space does not match D, you already know that the first six characters are "ABCDAB". The idea of the KMP algorithm is to use this known information: instead of moving the "search position" back to positions that have already been compared, it keeps moving forward, which improves efficiency.
8.
How is this done? A Partial Match Table can be computed for the search term. How the table is produced is explained later; for now it is enough to know how to use it.
9.
When the space is found not to match D, the first six characters "ABCDAB" are known to match. The table shows that the last matched character, B, has a "partial match value" of 2, so the number of positions to shift is given by the following formula:
Shift = number of matched characters - corresponding partial match value
Because 6 - 2 equals 4, the search term is shifted 4 positions to the right.
10.
Because the space does not match C, the search term must shift again. The number of matched characters is now 2 ("AB"), and the corresponding "partial match value" is 0. So shift = 2 - 0 = 2, and the search term is shifted 2 positions to the right.
11.
Because the space does not match A, the search term is shifted one more position to the right.
12.
The comparison continues character by character until C is found not to match D. So shift = 6 - 2 = 4, and the search term is shifted 4 more positions to the right.
13.
The comparison continues character by character until the last character of the search term matches. A complete match has been found and the search is done. To continue searching (that is, to find all matches), shift = 7 - 0 = 7: shift the search term 7 positions to the right and proceed in the same way, which is not repeated here.
14.
Here is how the Partial Match Table is produced.
First, two concepts are needed: "prefix" and "suffix". The "prefixes" of a string are all of its leading substrings except the string itself (that is, without the last character); the "suffixes" are all of its trailing substrings except the string itself (that is, without the first character).
15.
The "partial match value" is the length of the longest element common to the prefixes and the suffixes. Take "ABCDABD" as an example:
- The prefixes and suffixes of "A" are both empty sets; the length of the common element is 0;
- The prefixes of "AB" are [A], the suffixes are [B]; the length of the common element is 0;
- The prefixes of "ABC" are [A, AB], the suffixes are [BC, C]; the length of the common element is 0;
- The prefixes of "ABCD" are [A, AB, ABC], the suffixes are [BCD, CD, D]; the length of the common element is 0;
- The prefixes of "ABCDA" are [A, AB, ABC, ABCD], the suffixes are [BCDA, CDA, DA, A]; the common element is "A", with length 1;
- The prefixes of "ABCDAB" are [A, AB, ABC, ABCD, ABCDA], the suffixes are [BCDAB, CDAB, DAB, AB, B]; the common element is "AB", with length 2;
- The prefixes of "ABCDABD" are [A, AB, ABC, ABCD, ABCDA, ABCDAB], the suffixes are [BCDABD, CDABD, DABD, ABD, BD, D]; the length of the common element is 0.
16.
The essence of "partial match" is that the head and the tail of a string sometimes repeat. For example, "ABCDAB" contains "AB" twice, so its "partial match value" is 2 (the length of "AB"). When the search term shifts, moving the first "AB" 4 positions to the right (string length - partial match value) brings it to the position of the second "AB".
2. How to compute the next array
The above is enough to understand the principle of the KMP algorithm completely. The next step is the implementation, and the most important part is how to find, for each position of the pattern string, the maximum length of a prefix that is also a suffix. Here is my code first:
    void makenext(const char P[], int next[])
    {
        int q, k;           // q: index into the pattern string; k: maximum prefix/suffix length
        int m = strlen(P);  // length of the pattern string

        next[0] = 0;        // the maximum prefix/suffix length for the first character is 0
        for (q = 1, k = 0; q < m; ++q)  // starting from the second character, compute next for each character in turn
        {
            // Step back until we find the largest k such that the prefix
            // P[0]...P[k-1] is a suffix of P[0]...P[q-1] and P[k] == P[q].
            // This while loop is the essence of the whole code and is hard
            // to grasp at first; see the analysis below.
            while (k > 0 && P[q] != P[k])
                k = next[k-1];
            if (P[q] == P[k])   // if equal, the maximum prefix/suffix length grows by 1
            {
                k++;
            }
            next[q] = k;
        }
    }
Now I will focus on the work done by the while loop:
- Suppose the maximum prefix/suffix length computed in the previous step is k (k > 0), that is, P[0]...P[k-1] is identical to P[q-k]...P[q-1];
- Now compare the k-th item P[k] with P[q], as shown in Figure 1;
- If P[k] equals P[q], it is simple: the while loop ends immediately and the if statement increments k;
- Here is the key point: what if they are not equal? Then we use the values next[0]...next[k-1] already computed to find the longest prefix of the substring P[0]...P[k-1] that is also its suffix. Why look at P[0]...P[k-1]? Because P[k] has mismatched P[q], while P[q-k]...P[q-1] is identical to P[0]...P[k-1]. A prefix as long as P[0]...P[k-1] cannot be extended, so we look for a shorter substring that also begins at P[0] and also matches the characters ending at P[k-1] (and therefore those ending at P[q-1]), namely P[0]...P[j-1] with j == next[k-1], and check whether its next character P[j] matches P[q], as shown in Figure 2. If it still does not match, the same step is repeated from j, until a match is found or the length drops to 0.
Full code:
    #include <stdio.h>
    #include <string.h>

    void makenext(const char P[], int next[])
    {
        int q, k;
        int m = strlen(P);
        next[0] = 0;
        for (q = 1, k = 0; q < m; ++q)
        {
            while (k > 0 && P[q] != P[k])
                k = next[k-1];
            if (P[q] == P[k])
            {
                k++;
            }
            next[q] = k;
        }
    }

    int KMP(const char T[], const char P[], int next[])
    {
        int n, m;
        int i, q;
        int count = 0;  // number of matches found
        n = strlen(T);
        m = strlen(P);
        makenext(P, next);
        for (i = 0, q = 0; i < n; ++i)
        {
            while (q > 0 && P[q] != T[i])
                q = next[q-1];
            if (P[q] == T[i])
            {
                q++;
            }
            if (q == m)
            {
                printf("Pattern occurs with shift:%d\n", (i-m+1));
                count++;
                q = next[q-1];  // fall back so the search can go on to later matches
            }
        }
        return count;
    }

    int main()
    {
        int i;
        int next[20] = {0};
        char T[] = "ababxbababcadfdsss";
        char P[] = "ABCDABD";
        printf("%s\n", T);
        printf("%s\n", P);
        // makenext(P, next);
        KMP(T, P, next);
        for (i = 0; i < (int)strlen(P); ++i)
        {
            printf("%d ", next[i]);
        }
        printf("\n");
        return 0;
    }
3. Optimization of KMP
KMP, in-depth explanation of next array (reprint)