[Repost] A More Accessible Explanation of the KMP Algorithm

Source: Internet
Author: User

I ran into this algorithm recently while studying string matching. The algorithm books are full of the subscripts I hate, and reading them left me dizzy. I had to search online, where most of what I found was the same as the books, until I finally came across one article that made things start to click.
The simplest way to match strings is to compare them character by character, but that is very inefficient, whereas the KMP algorithm uses ... (I'll stop there, my explaining skills are poor ^_^. Let's see how an expert explains it.)


The KMP we're talking about here isn't the one for playing movies (though I do like that software), it's an algorithm. The KMP algorithm is used for string matching. In other words, you are given two strings and need to answer whether string b is a substring of string a (whether a contains b). For example, with a="I'm matrix67" and b="matrix", we say b is a substring of a. You can tactfully ask your girlfriend: "If you were to confess your feelings to the person you like, would my name be a substring of your confession?"
To solve this kind of problem, the usual method is to enumerate every position in a where a match with b could start, and then verify whether it matches. If a has length n and b has length m, the complexity of this method is O(mn). The cost often stays well below mn (a mismatch is found after only one or two letters), but there are plenty of "worst cases", for example a="aaaaaaaaaaaaaaaaaaaaaaaaaab", b="aaaaaaaab". We will introduce an algorithm that is O(n) in the worst case (assuming m<=n here): the legendary KMP algorithm.
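The enumerate-and-verify method just described can be sketched as follows. This is only an illustration in Python; the function name `naive_search` and the comparison counter are assumptions added here, not part of the original article.

```python
def naive_search(a, b):
    """Try every start position in a; return (1-based match positions, comparisons)."""
    n, m = len(a), len(b)
    matches, comparisons = [], 0
    for start in range(n - m + 1):
        k = 0
        while k < m:
            comparisons += 1            # one character comparison
            if a[start + k] != b[k]:
                break                   # mismatch: give up on this start
            k += 1
        if k == m:
            matches.append(start + 1)   # report positions 1-based, as in the text
    return matches, comparisons


print(naive_search("I'm matrix67", "matrix")[0])
# A worst-case style input: almost every start is compared to the very end of b.
print(naive_search("a" * 26 + "b", "a" * 8 + "b"))
```

On inputs like the second one, nearly every start position costs m comparisons, which is where the O(mn) bound comes from.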
It is called KMP because the algorithm was proposed by Knuth, Morris and Pratt, taking the first letter of each of the three names. At this point you may suddenly understand why the AVL tree is called AVL, or why Bellman-Ford has a hyphen in the middle rather than a dot. Sometimes seven or eight people have all studied the same thing, so how do you name it? Usually it simply isn't named after anyone, to avoid disputes, as with the "3x+1 problem". But I digress.
Personally, I feel KMP is the topic least in need of explaining, because plenty of material can be found online. But the online explanations almost always involve concepts such as "shift" and the "next function", which are very easy to misunderstand (at least a year and a half ago, when I read that material to learn KMP, I could not make sense of it). Here I'll explain the KMP algorithm in a different way.

Suppose a="abababaababacb" and b="ababacb", and let's see how KMP works. We use two pointers i and j, which indicate that a[i-j+1..i] and b[1..j] are exactly equal. That is, i keeps increasing, j changes along with it, and j always satisfies the condition that the string of length j ending at a[i] exactly matches the first j characters of b (the larger j is, the better). We now need to examine the relationship between a[i+1] and b[j+1]. When a[i+1]=b[j+1], i and j each increase by one. When j=m, we say b is a substring of a (all of b has been matched), and the position of the match can be computed from the current value of i. When a[i+1]<>b[j+1], KMP's strategy is to adjust j (decrease its value) so that a[i-j+1..i] still matches b[1..j] and the new b[j+1] exactly matches a[i+1] (so that i and j can go on increasing). Let's look at what happens when i=j=5.

i = 1 2 3 4 5 6 7 8 9 ......
a = a b a b a b a a b a b ...
b = a b a b a c b
j = 1 2 3 4 5 6 7

At this point, a[6]<>b[6]. This shows that j cannot stay at 5, and we have to change j to some smaller value j'. What can j' be? A little thought shows that j' must be such that the first j' letters and the last j' letters of b[1..j] are exactly equal (so that once j becomes j', the property relating i and j still holds). And of course, the larger this j' is, the better. Here b[1..5]="ababa", whose first 3 letters and last 3 letters are both "aba". And when the new j is 3, a[6] happens to equal b[4]. So i becomes 6 and j becomes 4:

i = 1 2 3 4 5 6 7 8 9 ......
a = a b a b a b a a b a b ...
b =     a b a b a c b
j =     1 2 3 4 5 6 7

From the example above we can see that the new value of j has nothing to do with a and depends only on the string b. We can therefore precompute an array p[j], which gives the largest possible new j when the first j letters of b have matched but letter j+1 does not. p[j] should be the largest value smaller than j that satisfies b[1..p[j]]=b[j-p[j]+1..j].
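To make the definition of p[j] concrete, here is a deliberately slow sketch that computes it straight from that definition. It is written in Python for illustration; the name `p_by_definition` is hypothetical, and the list carries an unused slot at index 0 so that p[j] lines up with the 1-based indices used in the text.

```python
def p_by_definition(b):
    """p[j] = largest k < j with b[1..k] == b[j-k+1..j] (1-based); p[0] is unused."""
    m = len(b)
    p = [0] * (m + 1)
    for j in range(1, m + 1):
        for k in range(j - 1, 0, -1):   # try the largest candidate first
            # b[:k] is the first k letters; b[j-k:j] is the last k letters of b[1..j]
            if b[:k] == b[j - k:j]:
                p[j] = k
                break
    return p


print(p_by_definition("ababacb")[1:])   # p[1..7] for the example pattern
```

This version does far more work than necessary; it is only meant to pin down what the array contains.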
Next, a[7]=b[5], so i and j each increase by 1. At this point a[i+1]<>b[j+1] occurs again:

i = 1 2 3 4 5 6 7 8 9 ......
a = a b a b a b a a b a b ...
b =     a b a b a c b
j =     1 2 3 4 5 6 7

Since p[5]=3, the new j is 3:

i = 1 2 3 4 5 6 7 8 9 ......
a = a b a b a b a a b a b ...
b =         a b a b a c b
j =         1 2 3 4 5 6 7

At this point, the new j=3 still does not satisfy a[i+1]=b[j+1], so we reduce j again, updating it to p[3]:

i = 1 2 3 4 5 6 7 8 9 ......
a = a b a b a b a a b a b ...
b =             a b a b a c b
j =             1 2 3 4 5 6 7

Now i is still 7 and j has become 1. Yet a[8] is still not equal to b[j+1]. So j must be reduced to p[1], which is 0:

i = 1 2 3 4 5 6 7 8 9 ......
a = a b a b a b a a b a b ...
b =               a b a b a c b
j =             0 1 2 3 4 5 6 7

Finally, a[8]=b[1], so i becomes 8 and j becomes 1. In fact, it is possible that even with j=0 the condition a[i+1]=b[j+1] still fails (for example, if a[8]="d"). So, to be precise, when j=0 we keep increasing i and leave j alone until a[i]=b[1] appears.
The code for this process is very short (really short), and here we give:
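A minimal sketch of the matching loop described above, written in Python rather than the author's notation. The function name `kmp_search` and the choice to return a list of positions are assumptions made for this illustration; the table p is passed in as a list with an unused slot at index 0, and both strings are padded by one character so that indices are 1-based as in the text.

```python
def kmp_search(a, b, p):
    """Return the 1-based start positions of every occurrence of b in a.

    p[j] is the largest k < j with b[1..k] == b[j-k+1..j]; p[0] is unused.
    """
    n, m = len(a), len(b)
    if m == 0:
        return []                 # assumption: an empty pattern reports nothing
    A, B = " " + a, " " + b       # pad so that A[1] and B[1] are the first letters
    matches = []
    j = 0
    for i in range(1, n + 1):
        # Fall back until the next letter of b agrees with a[i], or j reaches 0.
        while j > 0 and B[j + 1] != A[i]:
            j = p[j]
        if B[j + 1] == A[i]:
            j += 1
        if j == m:
            matches.append(i - m + 1)
            j = p[j]              # the j:=p[j] step: keep going to find later matches
    return matches


# p[1..7] for b="ababacb" is 0 0 1 2 3 0 0, following the definition in the text.
print(kmp_search("abababaababacb", "ababacb", [0, 0, 0, 1, 2, 3, 0, 0]))
```

For the example strings this reports a single match starting at position 8 of a.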

The final j:=p[j] is there to keep the program going, because we may well find more than one match.
This program may be simpler than you expected, because the code uses a for loop to handle the steady increase of i. So the code can be understood intuitively: scan the string a, and keep updating how far into b the match currently reaches.

Now we have two important questions left: first, why the program is linear, and second, how to preprocess the p array quickly.
Why is this program O(n)? The main source of doubt is the while loop, whose number of executions looks unpredictable. We will use the main strategy of amortized analysis: observe how a single variable or function value changes in order to bound the total of many scattered, irregular executions. The time complexity analysis of KMP is a textbook amortized analysis. Start from the value of j in the program above. Each execution of the while loop decreases j (but never below 0), and the only other place that changes j is the line that adds 1 to it when a[i]=b[j+1]. That line adds at most 1 per iteration of the for loop, so over the whole process j is increased at most n times. As a result, j can be decreased at most n times (the number of decreases cannot exceed n, since j is always a nonnegative integer). This tells us that the while loop executes at most n times in total. Amortized over the for loop, each iteration therefore costs O(1), and the whole process is clearly O(n). The same analysis applies to the preprocessing of the p array, which has complexity O(m).
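The bound in this argument can be checked empirically. The sketch below, in Python, instruments the matching loop with a counter for the while loop; the names `build_p` and `count_fallbacks` are hypothetical helpers introduced only for this experiment, not code from the original article.

```python
def build_p(b):
    """Table p for pattern b, 1-based with p[0] unused."""
    m = len(b)
    B = " " + b
    p = [0] * (m + 1)
    j = 0
    for i in range(2, m + 1):
        while j > 0 and B[j + 1] != B[i]:
            j = p[j]
        if B[j + 1] == B[i]:
            j += 1
        p[i] = j
    return p


def count_fallbacks(a, b):
    """Run the matching loop on non-empty b and count executions of the while body."""
    p = build_p(b)
    A, B = " " + a, " " + b
    m = len(b)
    j = fallbacks = 0
    for i in range(1, len(a) + 1):
        while j > 0 and B[j + 1] != A[i]:
            j = p[j]
            fallbacks += 1        # each fallback strictly decreases j
        if B[j + 1] == A[i]:
            j += 1                # j grows by at most 1 per value of i
        if j == m:
            j = p[j]
    return fallbacks


for a, b in [("abababaababacb", "ababacb"), ("a" * 26 + "b", "a" * 8 + "b")]:
    print(len(a), count_fallbacks(a, b))   # the count never exceeds len(a)
```

Because j only ever gains 1 per step of i, the total number of fallbacks stays at or below the length of a, however the individual steps are distributed.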
The preprocessing does not have to be written as O(m^2) or even O(m^3) code that follows the definition of p directly. We can obtain p[j] from the values of p[1], p[2], ..., p[j-1]. Take the same b="ababacb": suppose we have already found p[1], p[2], p[3] and p[4], and let's see how to find p[5] and p[6]. Since p[4]=2, p[5] is obviously p[4]+1: from p[4] we know that b[1..2] equals b[3..4], and now b[3]=b[5], so p[5] is obtained by appending one more character to the match behind p[4]. Is p[6] also equal to p[5]+1? Obviously not, because b[p[5]+1]<>b[6]. So we have to consider "stepping back". We ask whether p[6] could come from the shorter match contained inside the one for p[5], that is, whether p[6]=p[p[5]]+1. If this does not make sense yet, take a closer look:

    1 2 3 4 5 6 7
b = a b a b a c b
p = 0 0 1 2 3 ?

p[5]=3 because b[1..3] and b[3..5] are both "aba", and p[3]=1 tells us that b[1], b[3] and b[5] are all "a". Since p[6] cannot be obtained from p[5], it might be obtained from p[3] (if b[2] were equal to b[6], p[6] would be p[3]+1). Clearly p[6] cannot be obtained from p[3] either, because b[2]<>b[6]. In fact, stepping all the way back to p[1] does not work either, so in the end we get p[6]=0.
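The "stepping back" used here is just repeated lookup in p. As a small illustration in Python (the helper name `fallback_chain` is hypothetical), the candidates tried in turn when extending past position 5 can be listed like this:

```python
def fallback_chain(p, j):
    """List p[j], p[p[j]], ... in the order they are tried, ending at 0."""
    chain = []
    while j > 0:
        j = p[j]
        chain.append(j)
    return chain


p = [0, 0, 0, 1, 2, 3, 0, 0]     # 1-based table for b="ababacb"; p[0] is unused
print(fallback_chain(p, 5))      # candidate match lengths tried after j=5 fails
```

Starting from 5 the chain visits 3, then 1, then 0, which is exactly the sequence p[5], p[3], p[1] walked through above.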
Doesn't this preprocessing look a lot like the KMP main program from before? In fact, KMP's preprocessing is itself a process of the string b "matching against itself". Its code closely resembles the code above:
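A minimal sketch of this self-matching preprocessing, in Python rather than the author's notation. The name `build_p` is an assumption for this illustration; the returned list has an unused slot at index 0 so that p[j] matches the 1-based indices in the text.

```python
def build_p(b):
    """Compute p[1..m] for pattern b in O(m); p[0] is an unused placeholder."""
    m = len(b)
    B = " " + b                   # pad so that B[1] is the first letter
    p = [0] * (m + 1)             # p[1] = 0 by definition
    j = 0
    for i in range(2, m + 1):
        # Step back through p until b[j+1] can extend the match to b[i].
        while j > 0 and B[j + 1] != B[i]:
            j = p[j]
        if B[j + 1] == B[i]:
            j += 1
        p[i] = j
    return p


print(build_p("ababacb")[1:])     # p[1..7] for the example pattern
```

The loop has the same shape as the matching loop, with b playing the role of both strings and i starting at 2 so that b is never matched against itself at offset zero.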

One last addition: since the KMP algorithm only preprocesses the string b, it is well suited to this kind of problem: given one string b and a bunch of different strings a, find which of the a strings have b as a substring.
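As a sketch of that use case, the table for b can be built once and reused for every text. The Python below is only an illustration; the names `build_p`, `contains` and `texts_containing` are hypothetical.

```python
def build_p(b):
    """Table p for pattern b, 1-based with p[0] unused."""
    m = len(b)
    B = " " + b
    p = [0] * (m + 1)
    j = 0
    for i in range(2, m + 1):
        while j > 0 and B[j + 1] != B[i]:
            j = p[j]
        if B[j + 1] == B[i]:
            j += 1
        p[i] = j
    return p


def contains(a, b, p):
    """True if b occurs in a, using a table p already built for b."""
    m = len(b)
    if m == 0:
        return True               # assumption: the empty string is in every text
    B = " " + b
    j = 0
    for ch in a:
        while j > 0 and B[j + 1] != ch:
            j = p[j]
        if B[j + 1] == ch:
            j += 1
        if j == m:
            return True           # stop at the first match
    return False


def texts_containing(b, texts):
    """Preprocess b once, then scan each text in time linear in its length."""
    p = build_p(b)
    return [a for a in texts if contains(a, b, p)]


print(texts_containing("matrix", ["I'm matrix67", "no match here", "matrix matrix"]))
```

The O(m) preprocessing cost is paid a single time, so the total work is roughly m plus the combined length of all the texts.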

String matching is a problem well worth studying. In fact, there are further methods, such as suffix trees and automata, that cleverly use preprocessing to solve string matching in linear time. We'll talk about them another time.

