Detailed description of the KMP algorithm for Pattern Matching

Last Update:2018-12-04 Source: Internet

Author: User

This improved pattern matching algorithm was discovered by D. E. Knuth, J. H. Morris and V. R. Pratt, and is called KMP for short. Anyone who has studied computer science probably knows it as a hard-to-understand algorithm. Today I finally worked it out completely.

Note that KMP is an improved algorithm, so we first need to bring out the original pattern matching algorithm. In fact, the key to understanding KMP lies right there. The ordinary matching algorithm:

int Index(string S, string T, int pos)  // see the program in the textbook "Data Structure"
{
    // Textbook-style pseudocode: the subscript of the first character of a string is 1
    int i = pos, j = 1;
    while (i <= S.length && j <= T.length)
    {
        if (S[i] == T[j]) { ++i; ++j; }
        else { i = i - j + 2; j = 1; }  // *************** (1) backtrack
    }
    if (j > T.length) return i - T.length;  // match successful
    else return 0;                          // no match
}
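
As a minimal runnable sketch of the same procedure, assuming std::string (which is 0-indexed) while keeping the 1-based positions used in the text. The name indexNaive and the default value of pos are illustrative choices, not part of the original program.

```cpp
#include <string>

// Naive search. Positions are 1-based to match the text, so character i of s
// is s[i - 1]. Returns the position of the first match at or after pos,
// or 0 if there is none. Assumes pos >= 1.
int indexNaive(const std::string& s, const std::string& t, int pos = 1) {
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(t.size());
    int i = pos, j = 1;
    while (i <= n && j <= m) {
        if (s[i - 1] == t[j - 1]) { ++i; ++j; }
        else { i = i - j + 2; j = 1; }  // statement (1): i goes back to one past the last start
    }
    return j > m ? i - m : 0;
}
```

With S = aaaaabababcaaa and T = ababc, indexNaive returns 7, the position where T begins in S.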

The matching process is very clear. The key is how the program handles a mismatch. It backtracks: that is statement (1), which moves i back. Why is backtracking needed? Look at the example below:

S: aaaaabababcaaa    T: ababc

aaaaabababcaaa
    ababc.            (the "." means the character just before it is mismatched)
After backtracking, the next attempt is
aaaaabababcaaa
     a.(babc)
If no backtracking is performed (i stays where it is and only j is reset), the next attempt is
aaaaabababcaaa
        aba.bc
In that case, a possible match is skipped:
aaaaabababcaaa
      ababc
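
The lost match can be reproduced with a small experiment. The sketch below is a deliberately wrong matcher, written only for illustration (the name indexNoBacktrack is hypothetical): on a mismatch it resets j but never moves i back. One extra rule is assumed so that it terminates: when the mismatch happens at j = 1, i simply advances.

```cpp
#include <string>

// Deliberately WRONG variant: i is never moved back after a mismatch.
// Positions are 1-based as in the text; returns 0 when it finds nothing.
int indexNoBacktrack(const std::string& s, const std::string& t) {
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(t.size());
    int i = 1, j = 1;
    while (i <= n && j <= m) {
        if (s[i - 1] == t[j - 1]) { ++i; ++j; }
        else if (j == 1) { ++i; }  // nothing matched yet: just move on
        else { j = 1; }            // restart T, but keep i (no backtracking)
    }
    return j > m ? i - m : 0;
}
```

On S = aaaaabababcaaa and T = ababc this variant returns 0, even though T occurs in S at position 7.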

Why does this happen? It is determined by the nature of the T string: T "partially matches" itself, that is, a beginning part of T appears again later in T. If T were abcdef, no backtracking would be needed.

This is where the improvement is made. Starting from the T string itself, we first find where the front and back parts of T match each other, and with that knowledge we can improve the algorithm.

If i does not backtrack, at which position of the T string should matching resume?

In the above example, T is ababc. If c is mismatched, T can slide right so that its first "ab" lines up with the last "ab" of the matched part abab, as shown below:
   ...abababd...
      ababc
   ->   ababc

In this way, i does not need to backtrack. j moves back two positions (from 5 to 3) and matching continues. This is the essence of the KMP algorithm. The position that j should jump to after T[j] is mismatched is the next value of j, written next[j]. It is inherent in the T string and has nothing to do with the S string.

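The textbook "Data Structure" defines the next value as follows:

\[
next[j] =
\begin{cases}
0 & \text{if } j = 1 \\
\max\{\, k \mid 1 < k < j \text{ and } p_1 \ldots p_{k-1} = p_{j-k+1} \ldots p_{j-1} \,\} & \text{if this set is not empty} \\
1 & \text{otherwise}
\end{cases}
\]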

I was dizzy when I first saw this definition. In fact, it describes exactly the situation above. next[1] = 0 is a convention that simplifies the program; another value would also work as long as it does not conflict with the later values. What does max mean? For example:

T: aaab

   ...aaaab...
      aaab
   ->  aaab
   ->   aaab
   ->    aaab

For a T like this, the part already matched overlaps itself in more than one way. Which one should j jump to? The nearest one, that is, the shortest slide to the right, which corresponds to the largest k.
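
To check small patterns by hand, the definition can be evaluated directly. The following is only a brute-force sketch under the conventions of the text (1-based next values, with index 0 of the returned vector unused); the name nextByDefinition is an illustrative choice, and this is not the efficient method.

```cpp
#include <string>
#include <vector>

// next[j] computed straight from the definition. The result is 1-based:
// element 0 is unused and next[1] = 0. Slow, for checking small inputs only.
std::vector<int> nextByDefinition(const std::string& p) {
    const int m = static_cast<int>(p.size());
    std::vector<int> next(m + 1, 0);
    for (int j = 2; j <= m; ++j) {
        next[j] = 1;                       // the "otherwise" case
        for (int k = j - 1; k > 1; --k) {  // try the largest k first (the max)
            // compare p1..p(k-1) with p(j-k+1)..p(j-1), both of length k-1
            if (p.compare(0, k - 1, p, j - k, k - 1) == 0) {
                next[j] = k;
                break;
            }
        }
    }
    return next;
}
```

For aaab this gives next[1..4] = 0, 1, 2, 3, and for ababc it gives next[1..5] = 0, 1, 1, 2, 3.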

OK. By now we have seen most of what KMP is about. The remaining key question is how to calculate the next values. Leave that aside for a moment and first look at how they are used in matching; that is, assume the next values already exist.

Rewrite the previous program:

int Index_KMP(string S, string T, int pos)
{
    // Textbook-style pseudocode: the subscript of the first character of a string is 1
    int i = pos, j = 1;
    while (i <= S.length && j <= T.length)
    {
        // Note j == 0: after ++j, j is 1 again, which is the benefit of setting next[1] = 0
        if (j == 0 || S[i] == T[j]) { ++i; ++j; }
        else j = next[j];  // i is unchanged (no backtracking); only j jumps
    }
    if (j > T.length) return i - T.length;  // match successful
    else return 0;                          // no match
}
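
A runnable sketch of this matching loop, assuming the next table is handed in ready-made, as the text does at this point. The name indexKmp, the std::vector representation and the unused element 0 are assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// KMP matching with a precomputed table. next is 1-based (next[0] is unused)
// and next[1] must be 0. Returns the 1-based match position, or 0 if none.
int indexKmp(const std::string& s, const std::string& t,
             const std::vector<int>& next, int pos = 1) {
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(t.size());
    int i = pos, j = 1;
    while (i <= n && j <= m) {
        // j == 0 means even T[1] failed against s[i]: advance both
        if (j == 0 || s[i - 1] == t[j - 1]) { ++i; ++j; }
        else { j = next[j]; }  // i never moves back
    }
    return j > m ? i - m : 0;
}
```

For T = ababc the table worked out by hand from the definition is 0, 1, 1, 2, 3, so a call would look like indexKmp("aaaaabababcaaa", "ababc", {0, 0, 1, 1, 2, 3}), where the leading 0 fills the unused element.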

OK, isn't it very simple? Computing the next values is also simple, and it is the key to the success of the entire algorithm. Computing them directly from the definition looks daunting, so how can it be done? As mentioned above, the next values express how parts of the T string match each other. So they can be found by matching the T string against itself. The matching here does not line the two copies up at the same starting point; it starts by comparing T[1] with T[2]. The algorithm is as follows:

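void get_next(string T, int next[])
{
    // Textbook-style pseudocode: T and next are 1-indexed
    int i = 1, j = 0;
    next[1] = 0;
    while (i < T.length)  // "<", so that next[i] is never written past T.length
    {
        if (j == 0 || T[i] == T[j]) { ++i; ++j; next[i] = j; /********** (2) */ }
        else j = next[j];
    }
}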

Isn't this function very similar to the KMP matching function? Yes, it is! Look at statement (2). It is reached when T[i] == T[j] and everything before i and j already matches (or when j == 0). Both indexes are incremented first, and then next[i] = j is recorded, so a next[i] is obtained every time i is incremented. j is always smaller than i, so whenever next[j] is needed it has already been computed, and next[1] = 0 is known from the start. The whole table is therefore built up step by step from earlier entries, which is a very clever method.
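
The same procedure as a runnable sketch, assuming a 0-indexed std::string and a returned std::vector whose element 0 is unused so that the 1-based next values of the text carry over unchanged. The name getNext is illustrative. The loop runs while i is less than the length of T, so the last entry written is next[T.length].

```cpp
#include <string>
#include <vector>

// Builds the 1-based next table by matching t against itself.
// Element 0 of the result is unused; next[1] = 0.
std::vector<int> getNext(const std::string& t) {
    const int m = static_cast<int>(t.size());
    std::vector<int> next(m + 1, 0);
    int i = 1, j = 0;
    while (i < m) {
        if (j == 0 || t[i - 1] == t[j - 1]) {
            ++i; ++j;
            next[i] = j;   // statement (2)
        } else {
            j = next[j];   // fall back using an entry computed earlier
        }
    }
    return next;
}
```

For ababc the result is next[1..5] = 0, 1, 1, 2, 3, and for aaab it is 0, 1, 2, 3, which agrees with the definition.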

This improvement is good, but the algorithm can be improved further. Note the following matching situation:

   ...aaac...
      aaaa.
The last 'a' in string T is mismatched with the 'c' in string S, but the next value of that 'a' points to another 'a'. Comparing that one with the same 'c' will also fail, so the comparison is redundant. If we know in advance that T[i] == T[j], we can set next[i] to next[j] while the next values are being computed, and the redundant comparison disappears. Therefore, a slight improvement can be achieved:

void get_nextval(string T, int next[])
{
    // Textbook-style pseudocode: T and next are 1-indexed
    int i = 1, j = 0;
    next[1] = 0;
    while (i < T.length)  // "<", so that T[i] and next[i] stay within T.length
    {
        if (j == 0 || T[i] == T[j])
        {
            ++i; ++j;
            if (T[i] != T[j]) next[i] = j;
            else next[i] = next[j];  // removes the unnecessary comparison: next jumps further ahead
        }
        else j = next[j];
    }
}

The matching algorithm remains unchanged.
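
A runnable sketch of the improved table, under the same assumptions as before: 0-indexed std::string, a returned std::vector with element 0 unused, and an illustrative function name (getNextval).

```cpp
#include <string>
#include <vector>

// Improved table: when t[i] equals t[j], retrying at j would fail for the
// same reason, so the entry skips straight to nextval[j].
// The result is 1-based; element 0 is unused.
std::vector<int> getNextval(const std::string& t) {
    const int m = static_cast<int>(t.size());
    std::vector<int> nextval(m + 1, 0);
    int i = 1, j = 0;
    while (i < m) {
        if (j == 0 || t[i - 1] == t[j - 1]) {
            ++i; ++j;
            if (t[i - 1] != t[j - 1]) nextval[i] = j;
            else nextval[i] = nextval[j];
        } else {
            j = nextval[j];
        }
    }
    return nextval;
}
```

For aaaa every entry becomes 0, so a mismatch on any 'a' moves on at once instead of retrying the earlier 'a' characters. For ababc the entries are 0, 1, 0, 1, 3.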

Now I have figured it out. I used to think that the KMP algorithm was mysterious, as if no ordinary person could have come up with it. In fact, it is not. It is just an improvement to the original algorithm. This shows that the basic, classic material is still very important, and that being able to question the classics is what creates progress.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
