I. Theoretical preparation
Why is the KMP algorithm faster than the traditional string matching algorithm? KMP analyzes the pattern string in advance and works out, for each position where a mismatch can occur, how many characters the match can skip. These values are collected in a next array that is consulted during the comparison, so the main string never backtracks: partial results already established in the pattern string are reused, fewer loop iterations are needed, and matching becomes more efficient. In plain terms, KMP exploits the fact that some characters inside the pattern string repeat the characters at its beginning, so those positions need not be compared again. Take the main string abcabcabcabed and the pattern string abcabed. When the comparison fails at the pattern's 'e', there is no need to restart from the beginning of the pattern: the "ab" just matched is also the start of the pattern, so the comparison resumes directly at the pattern's 'c', and the main string does not have to backtrack.
The traditional matching algorithm does not use this information (the pattern string is known, so the part of the main string that has already matched is known too). After every mismatch it starts comparing again from the beginning of the pattern, which is slow.
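For comparison, the traditional approach can be sketched as follows. This is only an illustration: the class name NaiveMatch and the method findAll are hypothetical names chosen here, and returning no matches for an empty pattern is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveMatch {
    // Returns every index of text at which pattern starts (hypothetical helper).
    static List<Integer> findAll(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        if (pattern.isEmpty()) {
            return hits; // assumption: an empty pattern reports no matches
        }
        for (int start = 0; start + pattern.length() <= text.length(); start++) {
            int k = 0;
            // Every attempt restarts at pattern index 0 and discards
            // whatever the previous attempt had already matched.
            while (k < pattern.length() && text.charAt(start + k) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                hits.add(start);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The example strings from the text above.
        System.out.println(findAll("abcabcabcabed", "abcabed"));
    }
}
```

In the worst case each of the possible start positions compares almost the whole pattern, which is the repeated work that the next array is meant to avoid.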
Let's first look at how the prefix array (my own name for it; I am not sure it is the standard term) is built. Two concepts come first: "prefix" and "suffix". The prefixes of a string are all of its leading substrings that leave out at least the last character; the suffixes are all of its trailing substrings that leave out at least the first character.
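To make the two definitions concrete, here is a small sketch that lists the prefixes and suffixes of a string and picks the longest string that appears in both lists. The class name Affixes and its method names are hypothetical, and the brute-force comparison is only for illustration, not how KMP computes the value.

```java
import java.util.ArrayList;
import java.util.List;

class Affixes {
    // All leading substrings that leave out at least the last character.
    static List<String> prefixes(String s) {
        List<String> out = new ArrayList<>();
        for (int len = 1; len < s.length(); len++) {
            out.add(s.substring(0, len));
        }
        return out;
    }

    // All trailing substrings that leave out at least the first character.
    static List<String> suffixes(String s) {
        List<String> out = new ArrayList<>();
        for (int len = 1; len < s.length(); len++) {
            out.add(s.substring(s.length() - len));
        }
        return out;
    }

    // Longest string that is both a prefix and a suffix ("" if there is none).
    static String longestShared(String s) {
        String best = "";
        List<String> suf = suffixes(s);
        for (String p : prefixes(s)) {
            if (suf.contains(p) && p.length() > best.length()) {
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(prefixes("cacca"));      // [c, ca, cac, cacc]
        System.out.println(suffixes("cacca"));      // [a, ca, cca, acca]
        System.out.println(longestShared("cacca")); // ca
    }
}
```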
First some notation: ch_i denotes the prefix made of the first i characters of the pattern string. next[i] = j means that the first j characters and the last j characters of ch_i are the same (note that the subscript counts characters), and that j is the largest such value for the prefix ch_i. An equivalent definition of next[i] = j: the longest string that is both a proper prefix and a proper suffix of ch_i contains j characters.
Rule: next[1] = next[0] = 0. Unlike 0! = 1, this is not an arbitrary convention: it follows from the definition of a prefix, because a string of zero or one characters has no proper prefix or suffix. Note: the next array is not about palindromes but about a prefix that equals a suffix, which matters when the next array is computed recursively. next[i] is the prefix array, and here is one example of how to construct it.
Example: cacca has 5 prefixes; let us work out the corresponding next array. Prefix 2 is ca, whose beginning and end clearly share no characters, so next[2] = 0. Prefix 3 is cac, whose beginning and end share the character c, so next[3] = 1. Prefix 4 is cacc, which again shares only c, so next[4] = 1. Prefix 5 is cacca, whose beginning and end share ca, so next[5] = 2. Looking closely, the computation of next[i] can reuse the result of next[i-1]. Take a pattern that begins with abcdabc, for which next[7] = 3 is already known. For next[8], directly compare the 4th character with the 8th character: if they are equal, then next[8] = next[7] + 1 = 4, because next[7] = 3 guarantees that the first 3 characters of ch_7 are the same as its last 3 characters. But what if these two characters are not equal? Then keep iterating with k = next[k], starting from k = 3, until either k = 0 (then next[8] is 1 if the 1st and 8th characters are equal, and 0 otherwise) or the character that follows the first k characters equals the 8th character (then next[8] = k + 1).
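The recurrence described above can be sketched in a few lines. This is a minimal sketch under the same convention as the text (next[i] refers to the first i characters, and next[0] = next[1] = 0); the class name NextTable and the method build are hypothetical names used only here.

```java
import java.util.Arrays;

class NextTable {
    // Builds next[0..n] for the pattern, where next[i] is the length of the
    // longest proper prefix of the first i characters that is also their suffix.
    static int[] build(String pattern) {
        int n = pattern.length();
        int[] next = new int[n + 1]; // next[0] and next[1] stay 0
        int k = 0;                   // holds next[i-1] at the top of each pass
        for (int i = 2; i <= n; i++) {
            // Fall back to shorter and shorter candidates until one can be extended.
            while (k != 0 && pattern.charAt(i - 1) != pattern.charAt(k)) {
                k = next[k];
            }
            if (pattern.charAt(i - 1) == pattern.charAt(k)) {
                k++;
            }
            next[i] = k;
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(build("cacca")));   // [0, 0, 0, 1, 1, 2]
        System.out.println(Arrays.toString(build("abcdabc"))); // [0, 0, 0, 0, 0, 1, 2, 3]
    }
}
```

The two printed arrays agree with the hand-worked values: next[5] = 2 for cacca and next[7] = 3 for abcdabc.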
II. Algorithm implementation
import java.util.ArrayList;

public class KMP {
    // Main string
    static String str = "1kk23789456789hahha";
    // Pattern string
    static String ch = "789";
    // next[i] belongs to the prefix made of the first i pattern characters
    static int[] next = new int[ch.length() + 1];

    public static void main(String[] args) {
        setNext();
        ArrayList<Integer> arr = getKMP();
        if (arr.size() != 0) {
            for (int i = 0; i < arr.size(); i++) {
                System.out.println("Match takes place in:" + arr.get(i));
            }
        } else {
            System.out.println("match unsuccessful");
        }
    }

    private static void setNext() {
        int lench = ch.length();
        next[0] = 0;
        next[1] = 0;
        // k holds the value of next[i-1]
        int k = 0;
        for (int i = 2; i <= lench; i++) {
            k = next[i - 1];
            /*
             * The while loop is easiest to understand by tracing an example.
             * It tries candidates from longest to shortest and stops at the
             * first one that works, so the value found is the maximum.
             */
            while (k != 0 && ch.charAt(i - 1) != ch.charAt(k)) {
                k = next[k];
            }
            if (ch.charAt(i - 1) == ch.charAt(k)) {
                k++;
            } // otherwise k is 0
            // Store into next[i], not next[k]: i is the number of characters in the prefix
            next[i] = k;
        }
    }

    private static ArrayList<Integer> getKMP() {
        ArrayList<Integer> arr = new ArrayList<Integer>();
        int lenstr = str.length();
        int lench = ch.length();
        // Current position in the main string
        int pos = 0;
        // Current match position in the pattern string
        int k = 0;
        // The loop condition must not be k < lench, which could loop forever when no match occurs
        while (pos < lenstr) {
            /*
             * On the first pass this assignment does nothing (next[0] = 0).
             * To save that step it could be written on the last line of the loop.
             */
            k = next[k];
            // pos < lenstr keeps a partial match at the end of the main string in bounds
            while (k < lench && pos < lenstr && str.charAt(pos) == ch.charAt(k)) {
                pos++;
                k++;
            }
            if (lench == k) {
                arr.add(pos - k);
            } else if (0 == k) {
                /*
                 * Without this statement the loop never ends, because next[0] = 0.
                 * Take ABCD and ABCE: after the mismatch at D, k = next[k] (k = 3)
                 * sets k to 0; D and A do not match and k is still 0, so the
                 * same step would repeat forever.
                 */
                pos++;
            } // otherwise the next pass applies k = next[k], which is why it could sit on the last line
        }
        return arr;
    }
}
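The class above hard-codes the main string and the pattern string in static fields. As a sketch of how the same idea might be packaged for reuse, the variant below takes both strings as parameters. It is not the listing above: the class name KmpSearch and the method findAll are hypothetical, the scan is restructured as a single for loop over the main string, and returning no matches for an empty pattern is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class KmpSearch {
    // Returns every index of text at which pattern starts, overlaps included.
    static List<Integer> findAll(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        int m = pattern.length();
        if (m == 0) {
            return hits; // assumption: an empty pattern reports no matches
        }
        // next[i]: longest proper prefix of the first i characters that is also their suffix
        int[] next = new int[m + 1];
        int j = 0;
        for (int i = 2; i <= m; i++) {
            while (j != 0 && pattern.charAt(i - 1) != pattern.charAt(j)) {
                j = next[j];
            }
            if (pattern.charAt(i - 1) == pattern.charAt(j)) {
                j++;
            }
            next[i] = j;
        }
        int k = 0; // number of pattern characters currently matched
        for (int pos = 0; pos < text.length(); pos++) {
            // pos only moves forward: the main string never backtracks
            while (k != 0 && text.charAt(pos) != pattern.charAt(k)) {
                k = next[k];
            }
            if (text.charAt(pos) == pattern.charAt(k)) {
                k++;
            }
            if (k == m) {
                hits.add(pos - m + 1);
                k = next[k]; // continue so that overlapping matches are found too
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("1kk23789456789hahha", "789")); // [5, 11]
    }
}
```

Building the table inside the method keeps the sketch self-contained; a caller that searches many texts for one pattern would probably want to build it once and reuse it.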
III. Problem extension
The advantage of the KMP algorithm tends to show with long pattern strings (see how the next array is derived), but in practice pattern strings are often very short: think of the length of the strings you usually search for in an office suite. For that reason most practical implementations use the BM (Boyer-Moore) algorithm instead; interested readers can look up the relevant material themselves. It may also be worth looking at multi-pattern matching algorithms such as the AC automaton (Aho-Corasick) and dictmatch, which search for several pattern strings in the main string in a single pass.