KMP algorithm (prelude to the Aho-Corasick automaton)

Source: Internet
Author: User

 

The KMP we are talking about here is not the software for playing movies (although I like that software); it is an algorithm. The KMP algorithm is used for string matching. In other words, you are given two strings and must answer whether string B is a substring of string A (whether string A contains string B). For example, if A = "I'm matrix67" and B = "matrix", we say that B is a substring of A. You could tactfully ask the girl you like: "If you were to confess to the person you like, would my name be a substring of your confession?"
To solve this problem, we usually enumerate every position in A as a candidate starting point for B and then verify whether B matches there. If A has length n and B has length m, the complexity of this method is O(mn). Although the running time often falls well short of mn (a mismatch is usually found after the first one or two letters), there are many "worst cases", for example A = "aaaaaaaaaaaaaaaaaaaaaaaaaaaab", B = "aaaaaaaab". We will introduce an algorithm that is O(n) in the worst case (assuming m <= n): the legendary KMP algorithm.
KMP is so called because the algorithm was proposed by Knuth, Morris, and Pratt, and the name takes the first letter of each surname. At this point you may suddenly understand why the AVL tree is called AVL, or why the mark in the middle of Bellman-Ford is a hyphen rather than a dot. Sometimes seven or eight people have studied the same thing. How should it be named then? Usually such a thing is not named after people at all, to avoid disputes, as with the "3x + 1 problem". But I digress.
I personally think that KMP is the topic that least needs explaining, because plenty of material on it can be found online. However, online explanations almost always involve concepts such as "shift" and the "next function", which are very easy to misunderstand (at least, a year and a half ago I could not figure them out when I read these materials to learn KMP). Here I explain the KMP algorithm in a different way.
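The enumeration method described above can be sketched as follows. This is a minimal Python illustration, not code from the original article; the function name `naive_find` and the use of 0-based start positions are assumptions made for the example.

```python
def naive_find(a: str, b: str) -> list[int]:
    """Return every 0-based start position where b occurs in a (brute force)."""
    n, m = len(a), len(b)
    hits = []
    for start in range(n - m + 1):
        k = 0
        # Compare character by character and stop at the first mismatch.
        while k < m and a[start + k] == b[k]:
            k += 1
        if k == m:
            hits.append(start)
    return hits


print(naive_find("I'm matrix67", "matrix"))  # [4]
```

On inputs like the "aaaa...ab" worst case, the inner loop runs almost m steps for nearly every start position, which is where the O(mn) bound comes from.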

Suppose A = "abababaababacb" and B = "ababacb"; let's see how KMP works. We use two pointers, i and j, to indicate that A[i-j+1..i] is exactly equal to B[1..j]. That is to say, i keeps increasing, and as i increases j changes accordingly so that the string of length j ending at A[i] exactly matches the first j characters of B (the larger j is, the better). Now we need to check the relationship between A[i+1] and B[j+1]. When A[i+1] = B[j+1], i and j each increase by one. When j = m, we say that B is a substring of A (all of B has been matched), and the matching position can be computed from the value of i. When A[i+1] <> B[j+1], the KMP strategy is to adjust the position of j (reduce j) so that A[i-j+1..i] and B[1..j] still match and the new B[j+1] matches A[i+1] (so that i and j can continue to increase). Let's look at the situation when i = j = 5.

i = 1 2 3 4 5 6 7 8 9 ......
A = a b a b a b a a b a b ...
B = a b a b a c b
j = 1 2 3 4 5 6 7

In this case, A[6] <> B[6]. This shows that j can no longer be 5; we need to change j to a smaller value j'. What can j' be? Thinking carefully, we find that j' must make the first j' letters and the last j' letters of B[1..j] exactly equal (only then do i and j keep their property after j becomes j'). Of course, the larger j' is, the better. Here, B[1..5] = "ababa", whose first three letters and last three letters are both "aba". When the new j is 3, A[6] is exactly equal to B[4]. Therefore i becomes 6 and j becomes 4:

i = 1 2 3 4 5 6 7 8 9 ......
A = a b a b a b a a b a b ...
B =     a b a b a c b
j =     1 2 3 4 5 6 7

From the example above we can see that the new value of j has nothing to do with i and depends only on string B. We can therefore precompute an array P[j]: the largest new value of j to use when the first j letters of B have matched but letter j+1 cannot be matched. P[j] is the largest value smaller than j that satisfies B[1..P[j]] = B[j-P[j]+1..j].
Next, A[7] = B[5], so i and j each increase by 1 again. Now A[i+1] <> B[j+1] occurs once more:

i = 1 2 3 4 5 6 7 8 9 ......
A = a b a b a b a a b a b ...
B =     a b a b a c b
j =     1 2 3 4 5 6 7

Because P[5] = 3, the new j = 3:

i = 1 2 3 4 5 6 7 8 9 ......
A = a b a b a b a a b a b ...
B =         a b a b a c b
j =         1 2 3 4 5 6 7

The new j = 3 still does not satisfy A[i+1] = B[j+1], so we reduce j again and update it to P[3]:

i = 1 2 3 4 5 6 7 8 9 ......
A = a b a b a b a a b a b ...
B =             a b a b a c b
j =             1 2 3 4 5 6 7

Now i is still 7 and j has become 1. Yet A[8] is still not equal to B[j+1], so j must be reduced to P[1], that is, 0:

i = 1 2 3 4 5 6 7 8 9 ......
A = a b a b a b a a b a b ...
B =               a b a b a c b
j =             0 1 2 3 4 5 6 7

At last A[8] = B[1], so i becomes 8 and j becomes 1. In fact, it is possible that even j = 0 cannot satisfy A[i+1] = B[j+1] (for example, if A[8] were "d"). Therefore, exactly speaking, when j = 0 we keep increasing i and leave j alone until A[i] = B[1] appears.
The code for this process is very short (really short), so here it is:

 


j := 0;
for i := 1 to n do
begin
   while (j > 0) and (B[j+1] <> A[i]) do j := P[j];   { fall back while the next letter mismatches }
   if B[j+1] = A[i] then j := j + 1;                  { extend the current match by one letter }
   if j = m then
   begin
      writeln('Pattern occurs with shift ', i - m);
      j := P[j];
   end;
end;

The final j := P[j] lets the program keep going, because there may be more than one match.
This program may be simpler than you imagined, because the code uses a for loop for the ever-increasing i. The code can therefore be pictured like this: scan string A and keep updating the position in B up to which we have matched.
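For readers who prefer to experiment, the Pascal loop can be transliterated into Python roughly as follows. This is a sketch, not the author's code: the name `kmp_search` is made up, the strings are padded with one leading character so the indices stay 1-based as in the text, and the table P for B = "ababacb" is supplied by hand (index 0 is unused). The text string is chosen so that its beginning follows the diagrams of the walkthrough.

```python
def kmp_search(a: str, b: str, p: list[int]) -> list[int]:
    """Scan a once and report every shift i - m at which b occurs.

    p[j] is the fallback value for j (1-based, p[0] unused).
    """
    n, m = len(a), len(b)
    if m == 0:
        return []  # assumption: an empty pattern is not searched for
    a1, b1 = " " + a, " " + b  # padding so a1[i], b1[j] use 1-based indices
    shifts = []
    j = 0
    for i in range(1, n + 1):
        while j > 0 and b1[j + 1] != a1[i]:
            j = p[j]               # fall back while the next letter mismatches
        if b1[j + 1] == a1[i]:
            j += 1                 # extend the current match by one letter
        if j == m:
            shifts.append(i - m)
            j = p[j]               # continue, there may be further matches
    return shifts


P_ABABACB = [0, 0, 0, 1, 2, 3, 0, 0]  # P[1..7] for "ababacb"
print(kmp_search("abababaababacb", "ababacb", P_ABABACB))  # [7]
```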

Two important questions remain: first, why is this program linear; second, how do we preprocess the P array quickly?
Why is this program O(n)? The main doubt is that the while loop makes the number of executions uncertain. Here we use the main strategy of amortized analysis: roughly speaking, we bound the total of scattered, messy, irregular execution counts by observing how the value of some variable or function changes. The time complexity analysis of KMP is a typical amortized analysis. Let's start from the value of j in the program above. Every execution of the while loop decreases j (but never below 0), and the only place that increases j is the fifth line. Each time that line executes, j increases by exactly 1, so j is incremented at most n times over the whole run. Therefore j can be decreased at most n times (the number of decreases cannot exceed n, because j is always a non-negative integer). This tells us that the while loop executes at most n times in total. By amortized analysis, the cost charged to each iteration of the for loop is O(1), so the whole process is clearly O(n). The same analysis applies to the preprocessing of the P array, so the complexity of preprocessing is O(m).
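The bound can be checked empirically by counting how often the while loop body runs. The sketch below is an illustration under assumptions: the function name `count_while_steps` is invented, indices are 0-based apart from the table P (1-based, index 0 unused), and P for the pattern "aaaaaaaab" is written out by hand. The text is a string in the same spirit as the worst case mentioned earlier, not necessarily of the same length.

```python
def count_while_steps(a: str, b: str, p: list[int]) -> int:
    """Run the matching loop and return how many times the while body executes."""
    m = len(b)
    j = steps = 0
    for ch in a:
        while j > 0 and b[j] != ch:   # b[j] is B[j+1] in the text's 1-based notation
            j = p[j]
            steps += 1
        if b[j] == ch:
            j += 1
        if j == m:
            j = p[j]
    return steps


text = "a" * 28 + "b"
pattern = "a" * 8 + "b"
p_table = [0, 0, 1, 2, 3, 4, 5, 6, 7, 0]  # P[1..9] for "aaaaaaaab"
steps = count_while_steps(text, pattern, p_table)
print(steps, "fallback steps for a text of length", len(text))
```

Whatever the input, the returned count never exceeds the text length, because each fallback undoes at least one earlier increment of j.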
Preprocessing does not have to be written as an O(m^2) or even O(m^3) procedure that follows the definition of P directly. We can obtain P[j] from the values of P[1], P[2], ..., P[j-1]. For B = "ababacb", suppose we have already obtained P[1], P[2], P[3] and P[4]; let's see how to find P[5] and P[6]. P[4] = 2, so P[5] is clearly P[4] + 1: from P[4] we know that B[1..2] already equals B[3..4], and now B[3] = B[5], so P[5] is obtained by appending one character to the match behind P[4]. Is P[6] equal to P[5] + 1? Clearly not, because B[P[5]+1] <> B[6]. So we have to consider "taking a step back". We ask whether P[6] can be obtained from the shorter match contained in the case of P[5], that is, whether P[6] = P[P[5]] + 1. If this is hard to follow, look closely at the table:

    1 2 3 4 5 6 7
B = a b a b a c b
P = 0 0 1 2 3 ?

P[5] = 3 because B[1..3] and B[3..5] are both "aba", while P[3] = 1 tells us that B[1], B[3] and B[5] are all "a". Since P[6] cannot be obtained from P[5], it might be obtained from P[3] (if B[2] happened to equal B[6], P[6] would be P[3] + 1). Clearly P[6] cannot be obtained through P[3] either, because B[2] <> B[6]. In fact, falling back all the way to P[1] does not work either, so in the end we get P[6] = 0.
Why does this preprocessing look so much like the KMP main program above? In fact, the preprocessing of KMP is itself a process of string B "matching against itself". Its code is very similar to the code above:
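The values in the table can be cross-checked against the definition of P directly. The sketch below deliberately uses the slow, definition-based approach (three nested levels of work once the string comparison is counted), so it illustrates what P means rather than how to compute it efficiently; the function name `prefix_table_by_definition` is made up for this example, and index 0 of the result is unused so that positions match the text.

```python
def prefix_table_by_definition(b: str) -> list[int]:
    """P[j] = largest k < j with B[1..k] == B[j-k+1..j], found by trying every k."""
    m = len(b)
    p = [0] * (m + 1)                   # p[0] is unused padding
    for j in range(1, m + 1):
        for k in range(j - 1, 0, -1):   # try the largest candidate first
            if b[:k] == b[j - k:j]:     # first k letters vs. last k letters of B[1..j]
                p[j] = k
                break
    return p


print(prefix_table_by_definition("ababacb")[1:])  # [0, 0, 1, 2, 3, 0, 0]
```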


P[1] := 0;
j := 0;
for i := 2 to m do
begin
   while (j > 0) and (B[j+1] <> B[i]) do j := P[j];   { fall back within B itself }
   if B[j+1] = B[i] then j := j + 1;
   P[i] := j;
end;
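A Python transliteration of this preprocessing might look like the following. It is a sketch under assumptions: the name `prefix_table` is invented, the string itself is indexed from 0 (so `b[j]` plays the role of B[j+1] and `b[i - 1]` the role of B[i]), while the returned list keeps the text's 1-based positions with index 0 unused.

```python
def prefix_table(b: str) -> list[int]:
    """Compute P[1..m] in O(m) by matching b against itself."""
    m = len(b)
    p = [0] * (m + 1)   # p[0] is unused padding; p[1] stays 0
    j = 0
    for i in range(2, m + 1):
        while j > 0 and b[j] != b[i - 1]:
            j = p[j]    # take a step back to the next shorter candidate
        if b[j] == b[i - 1]:
            j += 1
        p[i] = j
    return p


print(prefix_table("ababacb")[1:])  # [0, 0, 1, 2, 3, 0, 0]
```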

Finally, because the KMP algorithm preprocesses only string B, it is well suited to the following kind of problem: given one string B and a group of different strings A, determine which of the strings A contain B as a substring.

String matching is a very valuable research topic. In fact, there are many other methods, such as suffix trees and automata, and these algorithms all use preprocessing cleverly to solve string matching in linear time. We will talk about them later.
