Explanation of KMP Algorithm

Source: Internet
Author: User

Source Address: http://blog.csdn.net/Oneil_Sally/archive/2008/12/03/3440784.aspx

 

 

I personally think this is one of the easier-to-follow online articles about the KMP algorithm. It really is detailed, and reading it patiently pays off. Note that there are several conventions for the pattern function value next[j]. Some other (object-oriented) algorithm texts describe a failure function f(j) instead; the two express the same idea, with next[j] = f(j-1) + 1, but the next[j] notation is easier to understand:

Detailed description of KMP string matching

KMP string pattern matching is an efficient algorithm for locating one string inside another. The naive matching algorithm has time complexity O(m*n), whereas the KMP matching algorithm can be proved to run in O(m+n).
1. Simple Matching Algorithm
Let's look at a simple matching algorithm function:
int index_bf(const char S[], const char T[], int pos)
{
    /* Search S, starting at subscript pos (0 <= pos < strlen(S)), for a
       substring equal to T. On success, return the subscript in S of the
       first such substring; otherwise return -1. */
    int i = pos, j = 0;
    while (S[i + j] != '\0' && T[j] != '\0')
    {
        if (S[i + j] == T[j])
            j++;            // equal so far: go on to compare the next character
        else
        {
            i++; j = 0;     // mismatch: start a new round of matching
        }
    }
    if (T[j] == '\0')
        return i;           // match found: return its starting subscript
    else
        return -1;          // T does not occur in S at or after subscript pos
} // index_bf
The idea of this algorithm is straightforward: compare the substring of the main string S that starts at position i with the pattern string T, that is, compare S[i+j] with T[j] starting from j = 0. As long as the characters are equal, a match at position i is still possible, so the comparison continues (j increases by 1 each time) until the last character of T has been matched. If a mismatch occurs, T is shifted one position to the right: i increases by 1, j returns to 0, and a new round of matching begins at the next character of S.
For example, to find T = "abcabd" in S = "abcabcabdabba" (subscripts start at 0): first compare S[0] with T[0], then S[1] with T[1], and so on. The characters match until S[5] and T[5], which differ ('c' versus 'd').

When such a mismatch occurs, the T subscript goes back to the beginning, the S subscript goes back by the same number of positions and is then increased by 1, and the comparison starts again (S[1] against T[0]).

This time a mismatch occurs immediately, so the T subscript goes back to the beginning, the S subscript is increased by 1, and the comparison starts again (S[2] against T[0]).

Again a mismatch occurs immediately, so the T subscript goes back to the beginning, the S subscript is increased by 1, and the comparison starts again (S[3] against T[0]). This time every character of T matches the corresponding character of S, and the function returns 3, the starting subscript of T in S.
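The trace above can be reproduced with a small C++ sketch. The names NaiveTrace and naive_trace are hypothetical and introduced here only for illustration; the loop follows the same idea as index_bf (using std::string and length checks instead of '\0' tests) and additionally records every starting position that was tried and counts the character comparisons.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper type (not from the original article).
struct NaiveTrace {
    int position;             // subscript of the first match in s, or -1
    int comparisons;          // number of character comparisons performed
    std::vector<int> starts;  // every starting position i that was tried
};

// Naive matching, instrumented for illustration.
NaiveTrace naive_trace(const std::string& s, const std::string& t) {
    NaiveTrace r{-1, 0, {}};
    for (std::size_t i = 0; i + t.size() <= s.size(); ++i) {
        r.starts.push_back(static_cast<int>(i));
        std::size_t j = 0;
        while (j < t.size()) {
            ++r.comparisons;
            if (s[i + j] != t[j]) break;  // mismatch: give up on this start
            ++j;
        }
        if (j == t.size()) {              // every character of t matched
            r.position = static_cast<int>(i);
            return r;
        }
    }
    return r;
}
```

For S = "abcabcabdabba" and T = "abcabd", this sketch tries the starting positions 0, 1, 2, 3, performs 14 character comparisons, and reports position 3.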

2. KMP Matching Algorithm
Take the same example: find T = "abcabd" in S = "abcabcabdabba". With the KMP matching algorithm, when the first mismatch is found at S[5] and T[5], the S subscript does not go back to 1 and the T subscript does not go back to the start. Instead, using the pattern function value of T[5] = 'd' (next[5] = 2; the reason is explained later), S[5] is compared directly with T[2]. They are equal, so both subscripts advance; the following characters are equal as well, so both subscripts keep advancing until T is found in S.

 

The KMP matching algorithm is more efficient than the naive matching algorithm. An extreme example:
Search for T = "aaaaaaaaab" in S = "aaaaaa...aab" (100 a's). The naive algorithm compares all the way to the end of T each time before finding that the last character differs; the T subscript then goes back to the beginning, and the S subscript goes back by the same amount and then advances by 1 before the comparison continues. With the KMP matching algorithm, the S subscript never needs to go back.
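To make the difference concrete, here is a hedged C++ sketch that counts character comparisons against the text for both approaches. The function names naive_count and kmp_count are hypothetical. Note that kmp_count uses the standard prefix-function (failure-function) formulation of KMP rather than the next[] table developed later in this article, and the comparisons spent building that table are not counted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Naive search; adds the number of character comparisons to `comparisons`.
int naive_count(const std::string& s, const std::string& t, long& comparisons) {
    if (t.empty()) return -1;
    for (std::size_t i = 0; i + t.size() <= s.size(); ++i) {
        std::size_t j = 0;
        while (j < t.size()) {
            ++comparisons;
            if (s[i + j] != t[j]) break;
            ++j;
        }
        if (j == t.size()) return static_cast<int>(i);
    }
    return -1;
}

// KMP search with the prefix function pi: pi[j] is the length of the longest
// proper prefix of t[0..j] that is also a suffix of it.
// Only comparisons against the text s are added to `comparisons`.
int kmp_count(const std::string& s, const std::string& t, long& comparisons) {
    if (t.empty()) return -1;
    std::vector<std::size_t> pi(t.size(), 0);
    for (std::size_t j = 1, k = 0; j < t.size(); ++j) {
        while (k > 0 && t[j] != t[k]) k = pi[k - 1];
        if (t[j] == t[k]) ++k;
        pi[j] = k;
    }
    std::size_t j = 0;  // number of pattern characters currently matched
    for (std::size_t i = 0; i < s.size(); ++i) {
        for (;;) {
            ++comparisons;
            if (s[i] == t[j]) { ++j; break; }
            if (j == 0) break;   // nothing matched: move on in the text
            j = pi[j - 1];       // fall back in the pattern; i never moves back
        }
        if (j == t.size()) return static_cast<int>(i + 1 - j);
    }
    return -1;
}
```

With a text of 100 a's followed by b and a pattern of nine a's followed by b (assumed lengths for the example above), both functions find the match at position 91; the naive version needs 920 text comparisons and this KMP sketch needs 192.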
For string matching in ordinary documents, the running time of the naive algorithm usually stays close to O(m+n), which is why it is still used in most practical applications.
The core idea of the KMP algorithm is to reuse the information gained from the comparisons already made. Return to the previous example: why is the pattern function value of T[5] = 'd' equal to 2 (next[5] = 2)? The 2 means that the two characters before T[5] are the same as the first two characters of T, and that T[5] = 'd' differs from the character that follows those first two characters (T[2] = 'c').

In other words, if the third character of T (the one after the first two) were also 'd', then even though the two characters before T[5] equal the first two characters, the pattern function value of T[5] = 'd' would not be 2 but 0.
As said before: when searching for T = "abcabd" in S = "abcabcabdabba" with the KMP matching algorithm, after the first mismatch at S[5] and T[5] the S subscript does not go back to 1 and the T subscript does not go back to the start; instead, based on the pattern function value of T[5] = 'd', S[5] is compared with T[2]. Why?
As also said just now, next[5] = 2 means that the two characters before T[5] = 'd' are the same as the first two characters of T. We already know that S[4] = T[4] and S[3] = T[3]; by next[5] = 2, T[3] = T[0] and T[4] = T[1]; therefore S[3] = T[0] and S[4] = T[1] (these two pairs have been compared indirectly). So the next step is to compare S[5] with T[2].

Someone may ask: S[3] and T[0], and S[4] and T[1], are known to be equal indirectly through next[5] = 2, but how can the comparisons of S[1] with T[0] and of S[2] with T[0] be skipped? Because S[1] = T[1] and S[2] = T[2], while T[1] != T[0] and T[2] != T[0], it follows that S[1] != T[0] and S[2] != T[0]. These pairs, too, have in effect been compared indirectly.
Here are a few more questions to check your understanding.
Keep S unchanged and search for T = "abaabd". Answer: the mismatch occurs at S[2] and T[2], so look at next[2]. next[2] = -1 means that S[2] has already been compared indirectly with T[0] and they are not equal, so next compare S[3] with T[0].
Keep S unchanged and search for T = "abbabd". Answer: the mismatch again occurs at S[2] and T[2], so look at next[2]. next[2] = 0 means only that S[2] was compared with T[2] and they are not equal, so next compare S[2] with T[0].
Let S = "abaabcabdabba" and search for T = "abaabd". Answer: the mismatch occurs at S[5] and T[5], so look at next[5]. next[5] = 2 means that the comparisons already made show the two characters before S[5] to be equal to the first two characters of T, so next compare S[5] with T[2].
In short, once the next values of the pattern are known, everything else follows. So how are the pattern function values next[n] obtained? (In this article, "next value", "pattern function value", and "pattern value" all mean the same thing.)
3. How to calculate the pattern value next[n] of a string
Definition:
(1) next[0] = -1. Meaning: the pattern value of the first character of any string is -1.
(2) next[j] = -1. Meaning: the character T[j] of the pattern string T is the same as the first
    character T[0], and for every k (1 ≤ k < j) the k characters before T[j] either differ from
    the first k characters of T, or are equal to them but T[k] = T[j].
    For example, for T = "abcabcad", next[6] = -1, because T[3] = T[6].
(3) next[j] = k. Meaning: the k characters before T[j] are the same as the first k characters
    of T, and T[j] != T[k] (1 ≤ k < j).
    That is, T[0]T[1]T[2]...T[k-1] = T[j-k]T[j-k+1]T[j-k+2]...T[j-1]
    and T[j] != T[k] (1 ≤ k < j). If several values of k qualify, take the largest.
(4) next[j] = 0. Meaning: every case not covered by (1), (2), or (3).
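Rules (1) to (4) can be turned almost literally into a brute-force C++ sketch, which is handy for checking hand-computed tables. The function name next_by_definition is hypothetical, and the sketch assumes that rule (3) means the largest qualifying k. It is deliberately slow; the efficient function comes later in the article.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Brute-force pattern values, following rules (1)-(4) directly.
// Assumption: when several k satisfy rule (3), the largest one is taken.
std::vector<int> next_by_definition(const std::string& t) {
    std::vector<int> next(t.size(), 0);
    for (std::size_t j = 0; j < t.size(); ++j) {
        if (j == 0) {            // rule (1)
            next[0] = -1;
            continue;
        }
        // Default when rule (3) finds no k: rule (2) if T[j] == T[0], else rule (4).
        int value = (t[j] == t[0]) ? -1 : 0;
        // Rule (3): largest k with T[0..k-1] == T[j-k..j-1] and T[j] != T[k].
        for (std::size_t k = j - 1; k >= 1; --k) {
            if (t.compare(0, k, t, j - k, k) == 0 && t[j] != t[k]) {
                value = static_cast<int>(k);
                break;
            }
        }
        next[j] = value;
    }
    return next;
}
```

For example, next_by_definition("abcabcad") yields -1 0 0 -1 0 0 -1 4, with next[6] = -1 as stated in rule (2).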
 
Example:
01) Find the pattern function values of T = "abcac".
next[0] = -1   by (1)
next[1] = 0    by (4): (3) requires 1 ≤ k < j, which is impossible for j = 1, and T[1] != T[0] rules out (2)
next[2] = 0    by (4): no k with 1 ≤ k < j satisfies (3), since (T[0] = a) != (T[1] = b)
next[3] = -1   by (2)
next[4] = 1    by (3): T[0] = T[3] and T[1] != T[4]
That is:
Subscript   0   1   2   3   4
T           a   b   c   a   c
next       -1   0   0  -1   1

If T = "abcab", the result is:
Subscript   0   1   2   3   4
T           a   b   c   a   b
next       -1   0   0  -1   0

Why is next[4] = 0 even though T[0] = T[3]? Because T[1] = T[4], the condition "T[j] != T[k]" in (3) fails, so the case falls under (4).
02) A more complicated one: find the pattern function values of T = "ababcaabc".
next[0] = -1   by (1)
next[1] = 0    by (4)
next[2] = -1   by (2)
next[3] = 0    by (4): although T[0] = T[2], T[1] = T[3], so (3) does not apply
next[4] = 2    by (3): T[0]T[1] = T[2]T[3] and T[2] != T[4]
next[5] = -1   by (2)
next[6] = 1    by (3): T[0] = T[5] and T[1] != T[6]
next[7] = 0    by (4): although T[0] = T[6], T[1] = T[7], so (3) does not apply
next[8] = 2    by (3): T[0]T[1] = T[6]T[7] and T[2] != T[8]
That is:
Subscript   0   1   2   3   4   5   6   7   8
T           a   b   a   b   c   a   a   b   c
next       -1   0  -1   0   2  -1   1   0   2

Once you understand why next[3] = 0 rather than 1, why next[6] = 1 rather than -1, and why next[8] = 2 rather than 0, the rest should be easy.
03) Find the pattern function values of T = "abcabcad".
Subscript   0   1   2   3   4   5   6   7
T           a   b   c   a   b   c   a   d
next       -1   0   0  -1   0   0  -1   4

next[5] = 0    by (4): although T[0]T[1] = T[3]T[4], T[2] = T[5]
next[6] = -1   by (2): although "abc" = "abc" precedes it, T[3] = T[6]
next[7] = 4    by (3): "abca" = "abca" precedes it, and T[4] != T[7]
If T[4] = T[7], as in T = "adcadcad", then next[7] = 0 rather than 4:
Subscript   0   1   2   3   4   5   6   7
T           a   d   c   a   d   c   a   d
next       -1   0   0  -1   0   0  -1   0

 
If this is starting to feel familiar, try an exercise.
Exercise: find the pattern function values of T = "aaaaaaaaaab", and verify the result with the function for computing pattern values given below.
Meaning:
What do the next function values mean? This was discussed above; here is a summary.
Suppose that, while locating the pattern string T in the string S, S[m] != T[n]. Then take the pattern function value next[n] of T[n]:
1. next[n] = -1 means that S[m] and T[0] have been compared indirectly and are not equal; the next comparison is between S[m+1] and T[0].
2. next[n] = 0 means only that the comparison produced a mismatch; the next comparison is between S[m] and T[0].
3. next[n] = k with 0 < k < n means that the k characters before S[m] are indirectly known to equal the first k characters of T; the next comparison is between S[m] and T[k].
4. Other values are not possible.
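The three cases above amount to a single step function. Here is a minimal C++ sketch; the Cursor type and the name after_mismatch are hypothetical, and next is assumed to hold pattern values in the convention of this section (next[0] = -1).

```cpp
#include <vector>

// Hypothetical pair of subscripts: m indexes S, n indexes T.
struct Cursor {
    int m;
    int n;
};

// Given a mismatch S[m] != T[n], return the pair to compare next.
Cursor after_mismatch(Cursor c, const std::vector<int>& next) {
    int k = next[c.n];
    if (k == -1)
        return {c.m + 1, 0};  // case 1: S[m] cannot equal T[0]; move on in S
    return {c.m, k};          // case 2 (k == 0) and case 3 (k > 0): stay at S[m]
}
```

With the pattern values -1 0 0 -1 0 2 of T = "abcabd", a mismatch at (m, n) = (5, 5) leads to (5, 2), and a mismatch at (3, 3) leads to (4, 0).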
4. A function for computing the pattern values next[n] of a string T
After all this, does computing the pattern values next[n] of a string T seem very complicated? Writing a function for it might look harder than climbing to heaven. Fortunately, ready-made functions exist. I have the deepest admiration for the people who invented the KMP algorithm and first wrote this function; the rest of us only need to understand it. Here is the function:
void get_nextval(const char *T, int next[])
{
    // Compute the next function values of the pattern string T and store them in the array next.
    // Note: the loop also writes next[strlen(T)], so next must have strlen(T) + 1 elements.
    int j = 0, k = -1;
    next[0] = -1;
    while (T[j /* + 1 */] != '\0')
    {
        if (k == -1 || T[j] == T[k])
        {
            ++j; ++k;
            if (T[j] != T[k])
                next[j] = k;
            else
                next[j] = next[k];
        } // if
        else
            k = next[k];
    } // while
    // Optional display code:
    // for (int i = 0; i < j; i++)
    // {
    //     cout << next[i];
    // }
    // cout << endl;
} // get_nextval
Another way of writing it is similar:
void getnext(const char *pattern, int next[])
{
    next[0] = -1;
    int k = -1, j = 0;
    while (pattern[j] != '\0')
    {
        // Fall back until the characters match or no candidate prefix is left.
        // A single "if" is not enough here: several fallbacks may be needed.
        while (k != -1 && pattern[k] != pattern[j])
            k = next[k];
        ++j; ++k;
        if (pattern[k] == pattern[j])
            next[j] = next[k];
        else
            next[j] = k;
    }
    // Optional display code:
    // for (int i = 0; i < j; i++)
    // {
    //     cout << next[i];
    // }
    // cout << endl;
}
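To check the function against the worked examples of section 3, the following C++ sketch restates the same loop with std::string and std::vector. The names nextval_table and format_table are hypothetical; unlike get_nextval, this version fills only the subscripts 0 to length - 1.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Same loop as get_nextval, restated for std::string (illustrative sketch).
std::vector<int> nextval_table(const std::string& t) {
    std::vector<int> next(t.size(), 0);
    if (t.empty()) return next;
    next[0] = -1;
    std::size_t j = 0;
    int k = -1;
    while (j + 1 < t.size()) {
        if (k == -1 || t[j] == t[k]) {
            ++j; ++k;
            next[j] = (t[j] != t[k]) ? k : next[k];
        } else {
            k = next[k];  // fall back to a shorter matching prefix
        }
    }
    return next;
}

// Join the values with single spaces, as in the tables of this article.
std::string format_table(const std::vector<int>& next) {
    std::ostringstream out;
    for (std::size_t i = 0; i < next.size(); ++i) {
        if (i != 0) out << ' ';
        out << next[i];
    }
    return out.str();
}
```

For instance, format_table(nextval_table("ababcaabc")) gives "-1 0 -1 0 2 -1 1 0 2", which agrees with example 02).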
The following is the KMP pattern matching program, which you can use for verification. Remember to add the functions above.
#include <iostream>
#include <cstring>
using std::cout;
using std::endl;

int KMP(const char *Text, const char *Pattern) // const: the function does not modify these parameters
{
    if (!Text || !Pattern || Pattern[0] == '\0' || Text[0] == '\0')
        return -1; // null pointer or empty string: return -1
    int len = 0;
    const char *c = Pattern;
    while (*c++ != '\0') // walk the pattern with a pointer to count its length
    {
        ++len; // length of the pattern string
    }
    int *next = new int[len + 1];
    get_nextval(Pattern, next); // compute the next function values of Pattern

    int index = 0, i = 0, j = 0; // index always equals i - j, the start of the current alignment
    while (Text[i] != '\0' && Pattern[j] != '\0')
    {
        if (Text[i] == Pattern[j])
        {
            ++i; // continue with the following characters
            ++j;
        }
        else
        {
            index += j - next[j];
            if (next[j] != -1)
                j = next[j]; // the pattern string shifts to the right
            else
            {
                j = 0;
                ++i;
            }
        }
    } // while

    delete[] next;
    if (Pattern[j] == '\0')
        return index; // match found
    else
        return -1;
}

int main() // abcabcad
{
    const char *text = "bababcabcadcaabcaababcbaaaabaaacababcaabc";
    const char *pattern = "adcadcad";
    // getnext(pattern, n);
    // get_nextval(pattern, n);
    cout << KMP(text, pattern) << endl;
    return 0;
}
5. Other ways of representing the pattern values
The representation of pattern values used above is the best one, because a lot of information can be read from the pattern values of a string; call it the first representation.
The second representation also defines next[0] = -1, but -1 never appears again after next[0]. For the other values, next[j] = k (0 ≤ k < j) simply means that the k characters before subscript j are the same as the first k characters of T, with k as large as possible; T[j] != T[k] is not required. In fact next[0] can also be defined as 0 (the functions given later for computing these pattern values and for matching with them use next[0] = 0), and next[j] = k (0 ≤ k < j) keeps the same meaning.
The third representation is a variant of the first: adding 1 to every pattern value of the first representation gives the third. I saw it on a forum where it was not explained in detail; my guess is that it is meant for programming languages whose array subscripts start at 1 rather than 0.
The following examples show the different representations:
Table 1.
Subscript   0   1   2   3   4   5   6   7   8
T           a   b   a   b   c   a   a   b   c
(1) next   -1   0  -1   0   2  -1   1   0   2
(2) next   -1   0   0   1   2   0   1   1   2
(3) next    0   1   0   1   3   0   2   1   3

The third representation is, in my opinion, less clear, and is not discussed further.
Table 2.
Subscript   0   1   2   3   4
T           a   b   c   a   c
(1) next   -1   0   0  -1   1
(2) next   -1   0   0   0   1

Table 3.
Subscript   0   1   2   3   4   5   6   7
T           a   d   c   a   d   c   a   d
(1) next   -1   0   0  -1   0   0  -1   0
(2) next   -1   0   0   0   1   2   3   4
 
Compare the first and second representations of the pattern values, using Table 1:
First representation: next[2] = -1 says that T[2] = T[0] and that T[2-1] != T[0].
Second representation: next[2] = 0 says that T[2-1] != T[0], but says nothing about whether T[0] and T[2] are equal.
First representation: next[3] = 0 says that although T[2] = T[0], T[1] = T[3].
Second representation: next[3] = 1 says that T[2] = T[0], but says nothing about whether T[1] and T[3] are equal.
First representation: next[5] = -1 says that T[5] = T[0], and that T[4] != T[0], T[3]T[4] != T[0]T[1], and T[2]T[3]T[4] != T[0]T[1]T[2].
Second representation: next[5] = 0 says that T[4] != T[0], T[3]T[4] != T[0]T[1], and T[2]T[3]T[4] != T[0]T[1]T[2], but says nothing about whether T[0] and T[5] are equal. In other words, even if T[5] were 'x', 'y', or '9', next[5] would still be 0.
From this we can see that the first representation carries more information, while the second is simpler and harder to get wrong. Naturally, a pattern matching function written with the first representation is more efficient. For example, search for T = "adcadcad" in S = "adcadcbdadcadcad 9876543". With the first representation, when the mismatch S[6] != T[6] is reached, next[6] = -1 (Table 3) conveys all of the following: S[3]S[4]S[5] = T[3]T[4]T[5] = T[0]T[1]T[2], S[6] != T[6], and T[6] = T[3] = T[0], hence S[6] != T[0]; so the next comparison is between S[7] and T[0]. With the second representation, at the mismatch S[6] != T[6] we take next[6] = 3 (Table 3), which conveys only that S[3]S[4]S[5] = T[3]T[4]T[5] = T[0]T[1]T[2]; it cannot tell us whether T[6] and T[3] are equal, so S[6] is compared with T[3]. After that mismatch we take next[3] = 0, which again cannot tell us whether T[3] and T[0] are equal, that is, whether S[6] and T[0] are equal, so S[6] is compared with T[0] to confirm that they differ, and only then is S[7] compared with T[0]. That is a more roundabout route than the matching function written with the first representation.
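The comparison described above can be counted with a short C++ sketch. The function name match_with_table is hypothetical; it accepts a table in either representation, provided next[0] = -1, and adds the number of character comparisons it performs to a counter.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// KMP matching driven by a next table with next[0] == -1.
// Returns the subscript of the first match in s, or -1; counts comparisons.
int match_with_table(const std::string& s, const std::string& t,
                     const std::vector<int>& next, int& comparisons) {
    const int m = static_cast<int>(t.size());
    std::size_t i = 0;
    int j = 0;
    while (i < s.size() && j < m) {
        if (j == -1) {        // nothing in T can align with S[i]: move on
            ++i;
            j = 0;
        } else {
            ++comparisons;
            if (s[i] == t[j]) {
                ++i;
                ++j;
            } else {
                j = next[j];  // shift the pattern; i does not move back
            }
        }
    }
    return (j == m) ? static_cast<int>(i) - j : -1;
}
```

With S = "adcadcbdadcadcad 9876543" and T = "adcadcad", both rows of Table 3 lead to the match at subscript 8; the first representation needs 16 comparisons and the second needs 18, the two extra ones being S[6] against T[3] and S[6] against T[0].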
Why explain the second representation at all when the first has already been explained and is the better one? The reason is this: when I first studied the topic I watched Yan Weimin's lectures, and the pattern value representation she gave was the second one described here.
She said: "The meaning of the next function value is: when S[i] != T[j] occurs, the next comparison should be between S[i] and T[next[j]]." That is concise, but it left me not knowing why. Yet the pattern values produced by her algorithm are the next values of the first representation, that is, the get_nextval() function shown earlier, and the matching algorithm also had a flaw. So I wrote a post saying that she had made a mistake:
http://community.csdn.net/Expert/topic/4413/4413398.xml?temp=.2027246
Looking at it now, she was not exactly wrong, but she does seem to have mixed two things up. I doubt that anyone meeting this material for the first time could understand the algorithm without consulting other sources or other people's explanations (and I mean not just the general idea: in the definition and the examples next[j] = k with 0 ≤ k < j, while in the algorithm next[j] = k with -1 ≤ k < j). In all fairness, I have always admired her lectures: they are well delivered, precisely worded, and pitched just right. The small slip on KMP may have arisen during the editing of the textbook, with an example taken from one book and an algorithm from another, so that the two did not quite fit together. I have not been able to find the original book, and some readers online say the book no longer reads this way, so perhaps that is so. For that matter, the problems professors work on are many times deeper than this one; this small algorithm would hardly trouble them. In short, one flaw does not obscure the jade.
Below is the function I wrote to compute the pattern values of the second representation, so that T can be matched from any position in S according to the rule "when S[i] != T[j] occurs, the next comparison should be between S[i] and T[next[j]]". It defines next[0] = 0.
void myget_nextval(const char *T, int next[])
{
    // Compute the next function values (second representation) of the pattern string T and store them in the array next.
    if (T[0] == '\0')
        return; // empty pattern: nothing to compute
    int j = 1, k = 0; // k = number of characters before T[j] that match the start of T
    next[0] = 0;
    while (T[j] != '\0')
    {
        next[j] = k;
        // Extend the match to include T[j]; on a mismatch, fall back to shorter prefixes first.
        while (k > 0 && T[j] != T[k])
            k = next[k];
        if (T[j] == T[k])
            ++k;
        ++j;
    } // while
    for (int i = 0; i < j; i++)
    {
        cout << next[i];
    }
    cout << endl;
} // myget_nextval
 
Below is the matching function that uses the second representation of the pattern values (with next[0] = 0):
int my_kmp(const char *S, const char *T, int pos)
{
    int i = pos, j = 0; // pos is a subscript of S, 0 ≤ pos < strlen(S)
    while (S[i] != '\0' && T[j] != '\0')
    {
        if (S[i] == T[j])
        {
            ++i;
            ++j; // continue with the following characters
        }
        else        // T:       a  b  a  b  c  a  a  b  c
        {           // second:  0  0  0  1  2  0  1  1  2   (used here, next[0] = 0)
                    // first:  -1  0 -1  0  2 -1  1  0  2
            if (j == 0)
                ++i;          // T[0] itself mismatched: move on to the next character of S
            else
                j = next[j];  /* when S[i] != T[j] occurs, the next comparison is between
                                 S[i] and T[next[j]]; next[0] = 0. The global array next[]
                                 passes the values between these two simple demo functions. */
        }
    } // while
    if (T[j] == '\0')
        return (i - j); // match found
    else
        return -1;
} // my_kmp
6. Afterword: the history of KMP
[This paragraph is quoted from another source]
A theorem proved by Cook in 1970 states that any problem that can be solved by an abstract computer model called a two-way deterministic pushdown automaton can be solved on a real computer (more precisely, a random access machine) in time proportional to the size of the problem. In particular, this theorem implies that there is an algorithm that solves the pattern matching problem in time proportional to m + n, where m and n are the largest indexes of the arrays that store the text and the pattern string. Knuth and Pratt worked to reconstruct Cook's proof and arrived at this pattern matching algorithm. At about the same time, Morris arrived at almost the same algorithm while dealing with a practical problem in the design of a text editor. This shows that not all algorithms are discovered in a flash of inspiration, and that theoretical computer science does sometimes lead to practical applications.

This article comes from the CSDN blog; please credit the source when reposting: http://blog.csdn.net/Oneil_Sally/archive/2008/12/03/3440784.aspx
