"Original" easy-to-understand explanation KMP algorithm and code implementation

I. About this article

The purpose of this article is to explain briefly the idea behind the KMP algorithm and how it is implemented.

Articles online are uneven: some are messy, some too shallow, some too deep. I hope this one is friendly to beginners.

The KMP algorithm also has some improved versions; these are optimizations made once the core idea of KMP is understood.

So this article focuses on the core of the KMP algorithm and touches on some improvements at the end.

II. Introduction to the KMP algorithm

The KMP algorithm is a string matching algorithm. It is named after its three inventors, Knuth, Morris, and Pratt; the K stands for the well-known computer scientist Donald Knuth.

III. Walking through the KMP algorithm

First we define two strings as examples: the string to be searched, S = "abcabdabcabeab", and the pattern string, T = "abcabeab".

Our goal is to determine whether S contains T.

The core of the KMP algorithm is to analyze the structure of the pattern string T and see what information T itself can give us. I have marked parts of T with brackets, as follows:

T = "[ab]c[ab]e[ab]"

Now it is obvious that the three marked spots are repeats of "ab", so T tells us that it contains a repeated substring "ab" that can be exploited. How is it exploited? Rather than answer directly, let us first walk through the KMP process, keeping an eye on this "ab".

1.

With T aligned at the start of S, "abcab" matches successfully, and then the "e" in T fails to match the "d" in S.

At this point we know something in advance: the part of S that T has just matched is "abcab", and within it "ab" is repeated. So T can jump directly so that it starts at the second "ab", because no alignment that starts earlier inside "abcab" can match T. This is quite intuitive.

So T moves forward three positions, and its third character "c" is compared with the "d" in S that just failed to match.

Notice that the scan over S never falls back. The matching process is a single scan of S from start to finish (it may exit midway because a match succeeded), so the lookup is O(N), where N is the length of S.

2.

The part matched at this point is "ab". Unlike "abcab", "ab" contains no repeated "ab", so we know that everything in S before the "d" at S[5] is useless. T therefore starts again from its first character; "a" does not match the "d" at S[5] either, so S[5] is skipped and T is matched again from S[6].

This time the whole of T matches, so the scan of S has found T.
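The walk above can be replayed in code. The following is a minimal C++ sketch rather than the article's implementation: the function name traceAlignments is made up for illustration, and the table f it takes is the pattern function that the fourth section explains, supplied by hand here. It records every position in S at which T is tried.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns each start position in s at which t is
// aligned during a KMP search, ending with the match position if there is one.
// f[k] is the pattern function value of t at subscript k (assumed given).
std::vector<std::size_t> traceAlignments(const std::string &s,
                                         const std::string &t,
                                         const std::vector<std::size_t> &f) {
    std::vector<std::size_t> starts;
    if (t.empty()) return starts;
    std::size_t i = 0;  // where t currently starts in s
    std::size_t j = 0;  // how many characters of t are already matched
    while (i + t.size() <= s.size()) {
        starts.push_back(i);
        while (j < t.size() && s[i + j] == t[j]) ++j;  // i + j never decreases
        if (j == t.size()) break;                      // full match at i
        if (j > 0) {
            i += j - f[j - 1];  // skip alignments that cannot match
            j = f[j - 1];       // the repeated prefix counts as already matched
        } else {
            ++i;                // nothing matched: move one step
        }
    }
    return starts;
}
```

Called with S = "abcabdabcabeab", T = "abcabeab" and f = {0, 0, 0, 1, 2, 0, 1, 2}, it returns 0, 3, 5, 6: the jump of three positions, the restart at "d", and the final match at S[6].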

So how is this implemented on a machine? The fourth section analyzes that.

IV. Core analysis of the KMP algorithm

Abstracting from the process above, the root of the problem is deciding where T repeats itself. Whatever the string S is, once the structure of T has been analyzed clearly, the jumps above can be carried out.

So an array is needed to record the pattern function of T.

Here is the pattern function f of this T:

i:    0 1 2 3 4 5 6 7
T[i]: a b c a b e a b
f(i): 0 0 0 1 2 0 1 2

1. The pattern function value f(i) of each character is the subscript of T at which the next comparison starts, once T has been matched up to and including position i.

2. When the mismatch occurs at T[i], the start of T moves along S by i - f(i-1).

These correspond to the two questions above.

1. In the example above, after matching "abcab", T fails at T[5], that is, at "e". We then look at the pattern function value of the previous character, because that value describes the part that has already been matched.

We find f(4) = 2, which means the next comparison starts at T[2], that is, at "c". Because "abcab" contains the repeated "ab", the first "ab" does not need to be compared again.

2. How far does the subscript into S move? This subscript marks the position in S where the start of T is placed. In the walk-through above, this is the move that places the start of T at S[3]. The increase of 3 comes from the formula above: 5 - f(4) = 5 - 2 = 3. In fact this describes the same matching position as in 1, only expressed as where the start of T is aligned.
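The two rules can be written down as a small helper. This is only a sketch: Jump and jumpAfterMismatch are names invented here, and f is assumed to be the pattern function of T as a zero-based array.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper illustrating the two rules.
struct Jump {
    std::size_t nextInT;  // rule 1: subscript of T where comparing resumes
    std::size_t shift;    // rule 2: how far the start of T moves along S
};

// i is the subscript of T at which the mismatch occurred;
// f is the pattern function of T.
Jump jumpAfterMismatch(const std::vector<std::size_t> &f, std::size_t i) {
    if (i == 0) return {0, 1};        // nothing matched yet: move one step
    return {f[i - 1], i - f[i - 1]};  // f(i-1) and i - f(i-1)
}
```

For T = "abcabeab", whose pattern function is {0, 0, 0, 1, 2, 0, 1, 2}, a mismatch at i = 5 gives nextInT = 2 and shift = 3, matching the worked example.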

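Why is the pattern function structured this way?

Because f(i) tells us, for position i, whether T[i] belongs to a repeat of the beginning of T and, if so, how far that repeat extends. In other words, f(i) is the length of the longest proper prefix of T[0..i] that is also a suffix of it, which is also the subscript at which to resume comparing. Once we have that subscript, the other related values can be derived.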
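One way to read f(i), consistent with f(4) = 2 above, is as the length of the longest proper prefix of T[0..i] that is also its suffix. Under that reading the values can be computed straight from the definition. The sketch below does so by brute force; it is meant for checking small examples by hand, not as the fast construction, and the function name is made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Brute-force pattern function: for each i, try the longest candidate
// length first and keep the first one where prefix and suffix agree.
std::vector<std::size_t> patternFunctionByDefinition(const std::string &t) {
    std::vector<std::size_t> f(t.size(), 0);
    for (std::size_t i = 0; i < t.size(); ++i) {
        for (std::size_t len = i; len > 0; --len) {  // proper prefix: len <= i
            // compare t[0..len-1] with the last len characters of t[0..i]
            if (t.compare(0, len, t, i + 1 - len, len) == 0) {
                f[i] = len;
                break;
            }
        }
    }
    return f;
}
```

For "abcabeab" this yields {0, 0, 0, 1, 2, 0, 1, 2}. Two further strings chosen here as practice examples: "aaaa" yields {0, 1, 2, 3} and "abab" yields {0, 0, 1, 2}.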

Following this idea, the pattern function values of two more T strings were given here to help you think. (Figures 1 and 2 are not preserved.)

How can this pattern function be constructed quickly?

This is left for you to think about. It should be fairly direct; the key is to find where the repeats are. If it is unclear, refer to the code in the next section.
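One point is worth checking when constructing the function. When T[i] fails to extend the current repeat, a shorter repeat may still be extendable, so resetting the value to 0 is not always correct. The sketch below puts the two constructions side by side; the function names are made up, and "aabaaa" is an example string chosen here, not one from the article.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Extend the previous repeat if possible, otherwise reset to 0.
// Simple, but it can underestimate f.
std::vector<std::size_t> buildByReset(const std::string &t) {
    std::vector<std::size_t> f(t.size(), 0);
    for (std::size_t i = 1; i < t.size(); ++i)
        f[i] = (t[i] == t[f[i - 1]]) ? f[i - 1] + 1 : 0;
    return f;
}

// On a failed extension, fall back to the next shorter repeat and retry.
std::vector<std::size_t> buildWithFallback(const std::string &t) {
    std::vector<std::size_t> f(t.size(), 0);
    std::size_t k = 0;  // length of the repeat ending at the previous position
    for (std::size_t i = 1; i < t.size(); ++i) {
        while (k > 0 && t[i] != t[k]) k = f[k - 1];
        if (t[i] == t[k]) ++k;
        f[i] = k;
    }
    return f;
}
```

For "abcabeab" both give {0, 0, 0, 1, 2, 0, 1, 2}. For "aabaaa" the reset version gives {0, 1, 0, 1, 2, 0}, while the fallback version gives {0, 1, 0, 1, 2, 2}, because the final "aa" still repeats the beginning of the string. A value that is too small makes T jump too far, which can skip a real match.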

V. KMP algorithm implementation

    #include <vector>

    /*
     * The return value is the position in haystack where the match begins;
     * -1 means there is no matching substring.
     */
    int KMP(const char *haystack, const char *needle) {
        // pre-process: an empty needle matches at position 0
        if (needle[0] == 0)
            return 0;

        int i, j, k;

        // construct f(t) in vector len: len[i] is the length of the longest
        // proper prefix of needle[0..i] that is also a suffix of it
        std::vector<int> len;
        len.push_back(0);
        for (i = 1; needle[i] != 0; i++) {
            k = len[i - 1];
            // fall back through shorter repeats until one can be extended
            while (k > 0 && needle[i] != needle[k])
                k = len[k - 1];
            if (needle[i] == needle[k])
                k++;
            len.push_back(k);
        }

        // KMP finder: i is where needle starts in haystack,
        // j is the number of characters already matched
        j = 0;
        for (i = 0; haystack[i] != 0; ) {
            // matching
            for (; needle[j] != 0; j++) {
                if (haystack[i + j] != needle[j])
                    break;
            }
            // found
            if (needle[j] == 0)
                return i;
            if (j) {
                // jump: move needle by j - len[j-1], keep len[j-1] characters matched
                i += j - len[j - 1];
                j = len[j - 1];
            } else {
                i++;
            }
        }
        // match failed
        return -1;
    }

VI. Improvements to the KMP algorithm

The KMP algorithm has some improved versions that speed up the search, generally by also using some information from the string S.

For example, take S = "aaaaaaaafaaaaaaaaaaaab" and T = "aaaaaaab".

During the search, the "f" in the middle of S acts as a barrier. But since we only use prior information about T, after the mismatch at "f" T moves forward one step at a time and tries a new match each time, until the beginning of T reaches "f".

If we also use the information that "f" gives us about S, namely "f" != "b" && "f" != "a" && f(i-1) = i-1 (so the shift i - f(i-1) is only 1), we can jump directly past "f" and find the next match sooner.
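One way to read this improvement is: if the character of S that caused the mismatch occurs nowhere in T, no alignment of T that covers it can match, so T can restart just after it. The sketch below adds that single check to a plain KMP search. It is an interpretation of the idea, not a named published variant, and the function name and the 256-entry lookup table (which assumes single-byte characters) are choices made here.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Returns the first position of t in s, or -1 if there is none.
int findSkippingAbsent(const std::string &s, const std::string &t) {
    if (t.empty()) return 0;

    // Pattern function of t.
    std::vector<std::size_t> f(t.size(), 0);
    for (std::size_t i = 1, k = 0; i < t.size(); ++i) {
        while (k > 0 && t[i] != t[k]) k = f[k - 1];
        if (t[i] == t[k]) ++k;
        f[i] = k;
    }

    // Extra information: which byte values occur anywhere in t.
    std::array<bool, 256> inT{};
    for (unsigned char c : t) inT[c] = true;

    std::size_t i = 0;  // where t starts in s
    std::size_t j = 0;  // characters of t matched so far
    while (i + t.size() <= s.size()) {
        while (j < t.size() && s[i + j] == t[j]) ++j;
        if (j == t.size()) return static_cast<int>(i);
        if (!inT[static_cast<unsigned char>(s[i + j])]) {
            i += j + 1;  // the mismatched character is not in t: jump past it
            j = 0;
        } else if (j > 0) {
            i += j - f[j - 1];  // ordinary KMP jump
            j = f[j - 1];
        } else {
            ++i;
        }
    }
    return -1;
}
```

On a text shaped like the example above (a run of "a", then "f", then a longer run of "a" ending in "b"), the search leaves the "f" behind in one jump instead of stepping T forward one position at a time.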

However, these improvements all build on the basic KMP algorithm, so grasping its core saves time and effort and also makes the algorithm easier to extend.

When reprinting, please credit the source: http://www.cnblogs.com/xiaoboCSer/p/4236668.html

"Original" easy-to-understand explanation KMP algorithm and code implementation
