I. Introduction to the KMP Algorithm
KMP is named after its designers: Knuth, Morris, and Pratt.
The KMP algorithm is mainly used for pattern matching, that is, string matching. For example, given A = "abc" and B = "b", is B a substring of A? KMP is used for this because the naive algorithm is too inefficient.
KMP runs in linear time, O(m + n), where n is the length of the text A and m is the length of the pattern B.
II. Naive Pattern Matching Algorithm
Pattern matching means: given two strings A and B, check whether B is a substring of A. In Java, the indexOf() method of the String class implements this.
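As a quick illustration of the library call mentioned above, here is a minimal sketch using the standard `String.indexOf(String)` method. The class name `IndexOfDemo` and the wrapper method `find` are illustrative names only.

```java
class IndexOfDemo {
    // Returns the 0-based index of the first occurrence of b in a,
    // or -1 if b does not occur in a.
    static int find(String a, String b) {
        return a.indexOf(b);
    }

    public static void main(String[] args) {
        System.out.println(find("ababc", "abc")); // 2
        System.out.println(find("ababc", "abd")); // -1
    }
}
```

Note that Java reports 0-based positions, while the walkthroughs in this article index strings from 1.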
Idea of the naive pattern matching algorithm:
For example, take two strings A = "ababc" and B = "abc".
Step 1: Initialize i = 1, j = 1 (A[0] and B[0] can be left empty or used to store the string length).
Step 2: Loop over A and B. If A[i] = B[j], then i++, j++; otherwise move i back to i - j + 2 and reset j to 1.
Step 3: When the loop exits, determine whether B is a substring of A.
Steps:
1. i = 1, j = 1.
2. Because A[1] = B[1], i++, j++ (now i = 2, j = 2).
3. Because A[2] = B[2], i++, j++ (now i = 3, j = 3).
4. Because A[3] != B[3], i is moved back to i - j + 2 = 2 and j is reset to 1.
5. Because A[2] != B[1], i is moved to i - j + 2 = 3 and j stays at 1.
6. Because A[3] = B[1], i++, j++ (now i = 4, j = 2).
7. Because A[4] = B[2], i++, j++ (now i = 5, j = 3).
8. Because A[5] = B[3], i++, j++ (now i = 6, j = 4).
9. Both i and j have now run past the end of their strings. Since j has passed the end of B, B is a substring of A.
From the steps above, we can see that:
The naive pattern matching algorithm is inefficient, and many of its steps are redundant. For example, step 5 compares A[2] with B[1], but we have already established that A[2] = B[2], and we know that B[1] != B[2], so A[2] != B[1] follows without any comparison. This step is redundant.
The algorithm is as follows:
    private static int indexOf(String a, String b) {
        // 1. Define two pointers: i into a, j into b
        int i = 0;
        int j = 0;
        // 2. Traverse both strings
        while (i < a.length() && j < b.length()) {
            if (a.charAt(i) == b.charAt(j)) {
                i++;
                j++;
            } else {
                // Mismatch: restart one position after the previous start
                i = i - (j - 1);
                j = 0;
            }
        }
        // 3. B matched exactly when j reached the end of b. This covers both
        //    a match in the middle of a (i < a.length()) and a match at the
        //    very end of a (i == a.length()).
        if (j == b.length()) {
            return i - b.length();
        } else {
            return -1;
        }
    }
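To make the redundancy concrete, the following sketch instruments the same naive loop with a counter. The class name `NaiveCount` and the method `countComparisons` are illustrative names, not part of the original code.

```java
class NaiveCount {
    // Runs the naive matching loop and returns the number of
    // character comparisons it performed.
    static int countComparisons(String a, String b) {
        int i = 0;
        int j = 0;
        int comparisons = 0;
        while (i < a.length() && j < b.length()) {
            comparisons++;
            if (a.charAt(i) == b.charAt(j)) {
                i++;
                j++;
            } else {
                i = i - (j - 1); // back to one past the previous start
                j = 0;
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        System.out.println(countComparisons("ababc", "abc")); // 7
    }
}
```

For A = "ababc" and B = "abc" this reports 7 comparisons, one for each of steps 2 to 8 in the walkthrough. In the worst case the count grows on the order of m * n.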
III. KMP Algorithm
The main idea of the KMP algorithm is that i never needs to move backward; it only increases. On a mismatch, only j falls back.
Rules for falling back j:
(1) Take the part of the pattern matched so far and find the longest string that is both a proper prefix and a suffix of it. For example:
The matched string is "ababa". Both "a" and "aba" are prefixes and suffixes of it, but "aba" is longer, so j' = "aba".length() = 3.
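The rule above can be checked with a small brute-force sketch. It is deliberately simple rather than efficient, and the names `BorderDemo` and `longestBorder` are illustrative assumptions.

```java
class BorderDemo {
    // Length of the longest proper prefix of s that is also a suffix of s.
    // Tries candidate lengths from longest to shortest.
    static int longestBorder(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            String suffix = s.substring(s.length() - len);
            if (s.startsWith(suffix)) {
                return len;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(longestBorder("ababa")); // 3, i.e. "aba"
        System.out.println(longestBorder("abc"));   // 0
    }
}
```

KMP obtains the same values for every prefix of the pattern without this repeated scanning, which is what the P array is for.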
Expressed in mathematical language:
When A[i-j+1..i] = B[1..j] and A[i+1] != B[j+1], j must be reduced to the largest j' < j for which A[i-j'+1..i] = B[1..j'] still holds.
For this we precompute an array P[], where P[j] is the new value of j to fall back to when j characters have been matched but character j+1 does not match.
In the example above, P[5] = 3: the matched string "ababa" has length 5, and j = 3 after falling back.
For example (here i is the position in A currently being compared against B[j+1]):
A[1..5] = B[1..5], but A[6] != B[6], with i = 6, j = 5. So j has to fall back (using the precomputed P[] array, j = P[j]).
Principle of the fallback: the matched part is "ababa". "aba" is the longest string that is both a prefix and a suffix of it, so we set j = 3.
Now A[3..5] = B[1..3], but A[6] != B[4], with i = 6, j = 3. So we fall back again, setting j = P[3] = 1.
Now A[5] = B[1], but A[6] != B[2], with i = 6, j = 1. So we fall back once more, setting j = P[1] = 0.
Because j = 0, it cannot fall back any further. i has reached the end of string A, but j has not reached the end of string B. Therefore, B is not a substring of A.
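The chain of fallbacks in this example can be reproduced with a short sketch. The P values for "ababa" are hard-coded here from a hand calculation (P[5] = 3 is given in the text; the others are derived the same way), and the names `FallbackChain` and `chain` are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class FallbackChain {
    // Follows j -> P[j] until j reaches 0 and returns every value visited.
    static List<Integer> chain(int[] p, int j) {
        List<Integer> visited = new ArrayList<>();
        visited.add(j);
        while (j > 0) {
            j = p[j];
            visited.add(j);
        }
        return visited;
    }

    public static void main(String[] args) {
        // P for the matched part "ababa"; index 0 is unused so that
        // indices line up with the 1-based notation of the text.
        int[] p = {0, 0, 0, 1, 2, 3};
        System.out.println(chain(p, 5)); // [5, 3, 1, 0]
    }
}
```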
The KMP algorithm is as follows:
    /**
     * KMP search; O(m + n) by amortized analysis.
     *
     * @param a the text string
     * @param b the pattern string
     * @return the 0-based index of the first occurrence of b in a, or -1
     */
    public static int kmp_indexOf(String a, String b) {
        int n = a.length();
        int m = b.length();
        if (m == 0) {
            return 0; // the empty pattern matches at the start
        }
        // Convert to character arrays padded with one leading character so
        // that data starts at index 1. For example, if a = "aba", then
        // cha = {' ', 'a', 'b', 'a'}.
        char[] cha = (" " + a).toCharArray();
        char[] chb = (" " + b).toCharArray();
        int j = 0; // number of characters of b matched so far
        int[] p = computePArray(chb); // pre-calculate the P array from b
        for (int i = 1; i <= n; i++) {
            // j grows by at most 1 per iteration of the for loop, so over
            // all n iterations this while loop runs at most n times in total.
            while (j > 0 && chb[j + 1] != cha[i]) {
                j = p[j];
            }
            if (chb[j + 1] == cha[i]) {
                j++;
            }
            if (j == m) { // j has reached the end of b: full match
                return i - m;
            }
        }
        return -1;
    }

    private static int[] computePArray(char[] chb) {
        // chb is already padded, so chb.length is m + 1 and indices 1..m are used
        int[] p = new int[chb.length + 1];
        p[1] = 0;
        int j = 0;
        for (int i = 2; i < chb.length; i++) {
            while (j > 0 && chb[j + 1] != chb[i]) {
                j = p[j];
            }
            if (chb[j + 1] == chb[i]) {
                j++;
            }
            p[i] = j;
        }
        return p;
    }
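The same idea can also be written with 0-based indices, which avoids the padded character arrays. This is only a sketch of one common variant, not the original author's code; the names `KmpZeroBased`, `buildFailure`, and `kmpIndexOf` are assumptions, and `fail[i]` here plays the role of P[i + 1].

```java
class KmpZeroBased {
    // fail[i] = length of the longest proper prefix of b[0..i]
    // that is also a suffix of b[0..i].
    static int[] buildFailure(String b) {
        int[] fail = new int[b.length()];
        int j = 0;
        for (int i = 1; i < b.length(); i++) {
            while (j > 0 && b.charAt(j) != b.charAt(i)) {
                j = fail[j - 1];
            }
            if (b.charAt(j) == b.charAt(i)) {
                j++;
            }
            fail[i] = j;
        }
        return fail;
    }

    // Returns the 0-based index of the first occurrence of b in a, or -1.
    static int kmpIndexOf(String a, String b) {
        if (b.isEmpty()) {
            return 0;
        }
        int[] fail = buildFailure(b);
        int j = 0; // number of pattern characters matched so far
        for (int i = 0; i < a.length(); i++) {
            while (j > 0 && b.charAt(j) != a.charAt(i)) {
                j = fail[j - 1]; // fall back; i never moves backward
            }
            if (b.charAt(j) == a.charAt(i)) {
                j++;
            }
            if (j == b.length()) {
                return i - j + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(kmpIndexOf("ababc", "abc"));       // 2
        System.out.println(kmpIndexOf("ababababac", "ababac")); // 4
    }
}
```

A convenient sanity check is to compare the result against `String.indexOf` on the same inputs.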
Rules for pre-calculating the P array:
For example, B = "ababac".
1. Initialize P[] and set P[1] = 0, i = 2, j = 0.
2. Because B[2] != B[1], P[2] = 0; then i = 3, j = 0.
3. Because B[3] = B[1], j++, that is, j = 1, so P[3] = 1; then i = 4.
4. Because B[4] = B[2], j++, that is, j = 2, so P[4] = 2; then i = 5.
5. Because B[5] = B[3], j++, that is, j = 3, so P[5] = 3; then i = 6.
6. Because j > 0 and B[6] != B[4], j = P[3] = 1.
7. Because j > 0 and B[6] != B[2], j = P[1] = 0.
8. Because j = 0 and B[6] != B[1], P[6] = 0.
The result is P[1..6] = 0, 0, 1, 2, 3, 0.
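The walkthrough above can be verified with a short self-contained sketch. The class name `PArrayDemo` and method `computeP` are illustrative; the method takes the unpadded pattern and translates the 1-based indices of the text into `charAt` calls.

```java
import java.util.Arrays;

class PArrayDemo {
    // Returns P[0..m] for pattern b. P[0] is unused so that P[k]
    // corresponds directly to the 1-based P[k] in the text.
    static int[] computeP(String b) {
        int m = b.length();
        int[] p = new int[m + 1];
        int j = 0;
        for (int i = 2; i <= m; i++) {
            // b.charAt(j) is B[j+1] and b.charAt(i - 1) is B[i]
            while (j > 0 && b.charAt(j) != b.charAt(i - 1)) {
                j = p[j];
            }
            if (b.charAt(j) == b.charAt(i - 1)) {
                j++;
            }
            p[i] = j;
        }
        return p;
    }

    public static void main(String[] args) {
        int[] p = computeP("ababac");
        // Print P[1..6], skipping the unused slot at index 0
        System.out.println(Arrays.toString(Arrays.copyOfRange(p, 1, p.length)));
        // [0, 0, 1, 2, 3, 0]
    }
}
```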
References:
http://www.matrix67.com/blog/archives/115/ (this article is very well written).
The implementation in that article is also good. The caller passes in an array p of length T.length() + 1, which the method fills in:
    public static void computePArray(String T, int p[]) {
        T = " " + T; // pad so that the pattern is indexed from 1
        int j = 0;
        p[1] = 0;
        for (int i = 2; i < p.length; i++) {
            while (j > 0 && T.charAt(j + 1) != T.charAt(i)) {
                j = p[j];
            }
            if (T.charAt(j + 1) == T.charAt(i)) {
                j++;
            }
            p[i] = j;
        }
    }