An analysis of the classic string matching algorithm (KMP)

Source: Internet
Author: User

I. Problem statement

Given a string S1, find the part of S1 that exactly matches the string S2. For example:

S1 = "ABABAABABC"

S2 = "ABABC"

The match result is then 5 (the position of "ABABC" in S1, counting from 0). It is fine for S1 to contain more than one occurrence of S2: if we can find the first one, we can find the next.
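As a point of reference for the expected answer, the C standard library function strstr performs this kind of search. The sketch below wraps it in a helper called find_from; that name and its start parameter are assumptions made for illustration, not part of the original article.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 0-based index of the first occurrence of s2 in s1
 * at or after position `start`, or -1 if there is none. */
int find_from(const char *s1, const char *s2, size_t start) {
    assert(s1 != NULL && s2 != NULL);
    if (start > strlen(s1)) {
        return -1;                      /* start lies beyond the end of s1 */
    }
    const char *hit = strstr(s1 + start, s2);
    return hit ? (int)(hit - s1) : -1;  /* convert the pointer to an index */
}
```

Calling find_from("ABABAABABC", "ABABC", 0) returns 5. To look for a later occurrence, call it again with start set to one past the previous result.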

-------

The most obvious method is a double loop that aligns the strings and compares them character by character (the BF, or brute-force, algorithm). In the worst case, however, the time complexity of the BF algorithm reaches M * N, where M and N are the lengths of the two strings, which is unacceptable in practical applications. So three people came up with the KMP algorithm (KMP stands for the names of its three inventors: Knuth, Morris, and Pratt).
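For comparison with what follows, here is a minimal sketch of the BF algorithm as described above. The function name bf_search is an assumption for illustration.

```c
#include <assert.h>
#include <string.h>

/* Brute-force (BF) search: 0-based index of the first occurrence of s2
 * in s1, or -1. In the worst case it performs about M * N comparisons. */
int bf_search(const char *s1, const char *s2) {
    assert(s1 != NULL && s2 != NULL);
    int n = (int)strlen(s1);
    int m = (int)strlen(s2);
    for (int start = 0; start + m <= n; start++) {
        int j = 0;
        /* compare s2 against s1 beginning at `start` */
        while (j < m && s1[start + j] == s2[j]) {
            j++;
        }
        if (j == m) {
            return start;   /* all m characters matched */
        }
        /* mismatch: the position in s1 falls back to start + 1 */
    }
    return -1;
}
```

Calling bf_search("ABABAABABC", "ABABC") returns 5. Note how every mismatch throws away the characters already compared and restarts one position further along.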

II. How the KMP algorithm works

(Don't rush to read the official definition of the KMP algorithm. It is hard to understand, so we simply do not give it here.)

Take a careful look at S2 = "ABABC". What matters is that it starts with AB and that AB appears again in the middle (remember this detail for now).

Now consider this moment in the matching:

The partial match of ABAB has succeeded, the S1 pointer points to the A at position 4 (counting from 0), and that A is being compared with the C in S2. The comparison fails.

    • With the BF algorithm, the next step moves the S1 pointer back to the second character (B) and resets the S2 pointer to the initial A; then B is compared with A, and so on.
    • With the KMP algorithm, the S1 pointer stays where it is (pointing to A). We look up the jump table (next table) to get the jump value 2 and move the S2 pointer to the second A; then A is compared with A, and so on.

It is fine if the KMP steps do not make sense yet; after all, we have not explained what the jump table (next table) is. For now, focus only on the outcome:

    • In the BF algorithm the S1 pointer moves left (backtracking)
    • In the KMP algorithm the S1 pointer never moves backwards (no backtracking)

Because the KMP algorithm has no backtracking, it saves a lot of time (the S1 pointer only needs to travel once from head to tail).

-------

Now look at the core of the KMP algorithm: the jump.

Why can we jump? Look closely at S2 = "ABABC". Its distinguishing feature is AB at the beginning and AB again in the middle (the detail we asked you to remember). In detail:

If the C at the end of S2 fails to match position k of S1, we can infer two things:

    1. The match failed (the failure at C means this part of S1 does not match S2)
    2. The 4 characters before position k of S1 must be ABAB (the C of S2 can only be lined up with position k of S1 if the ABAB in S2 has already matched that part of S1)

If we ignore the second point, the next step is to backtrack the S1 pointer, which is what the BF algorithm does. If we hold on to the second point and combine it with the feature of S2:

    • AB appears at the beginning, and AB appears again in the middle

we get the KMP algorithm: the AB just before position k of S1 has in effect already matched the AB at the beginning of S2, so we can jump straight to comparing position k of S1 with the third character of S2, the A.
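The condition behind the jump can be checked directly. The sketch below is only an illustration: jump_is_safe is a hypothetical name, and the function merely tests whether the first jump characters of S2 equal the jump characters of S1 just before position k. It says nothing about whether a larger jump would also have been valid.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical check: may matching resume by comparing s1[k] with
 * s2[jump]? That requires the first `jump` characters of s2 to equal
 * the `jump` characters of s1 that end just before position k.
 * Assumes k <= strlen(s1) and jump <= strlen(s2). */
int jump_is_safe(const char *s1, int k, const char *s2, int jump) {
    assert(s1 != NULL && s2 != NULL);
    assert(jump >= 0 && jump <= k);
    return memcmp(s2, s1 + (k - jump), (size_t)jump) == 0;
}
```

With S1 = "ABABAABABC", S2 = "ABABC" and the mismatch at k = 4, jump_is_safe(S1, 4, S2, 2) returns 1, while a jump of 3 returns 0 because "BAB" is not "ABA".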

This is starting to make sense, but how do we obtain the jump value 2? Some examples:

    • S2 = "ABA": the last jump value is 0
    • S2 = "ABAA": the last jump value is 1
    • S2 = "ABCABC": the last jump value is 2

Notice a pattern? The jump value for position x in S2 (x counted from 0) is found like this:

    1. If x = 0, the jump value of s2[x] is -1 (the first element's jump value is -1)
    2. If x = 1, the jump value of s2[x] is 0 (the second element's jump value is 0)
    3. If s2[x-1] = s2[0], then a jump value of 1 is possible for s2[x]
    4. If s2[x-2] = s2[0] and s2[x-1] = s2[1], then a jump value of 2 is possible
    5. If s2[x-3] = s2[0] and s2[x-2] = s2[1] and s2[x-1] = s2[2], then a jump value of 3 is possible
    6. ... and so on, for every length smaller than x
    7. The jump value of s2[x] is the largest value that is possible; if none is, it is 0
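The "by eye" procedure can be written down directly: the jump value of position x is the largest k smaller than x for which the first k characters of S2 equal the k characters ending just before position x. The sketch below does exactly that, without any cleverness; jump_value is a hypothetical name, and this is not the efficient construction discussed later.

```c
#include <assert.h>
#include <string.h>

/* Jump value of s2[x] computed straight from the definition: the largest
 * k < x such that s2[0..k-1] equals s2[x-k..x-1]; -1 when x == 0. */
int jump_value(const char *s2, int x) {
    assert(s2 != NULL && x >= 0 && x < (int)strlen(s2));
    if (x == 0) {
        return -1;
    }
    /* try the longest candidate first so the first hit is the largest */
    for (int k = x - 1; k > 0; k--) {
        if (memcmp(s2, s2 + (x - k), (size_t)k) == 0) {
            return k;
        }
    }
    return 0;   /* no candidate matched */
}
```

This reproduces the examples above: jump_value("ABA", 2) is 0, jump_value("ABAA", 3) is 1, and jump_value("ABCABC", 5) is 2.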

Reading off the jump value "by eye" with the method above is the fastest way for a person, but the same process is not so easy to implement on a computer. The computer has its own preferred way, which a short blog post (linked in the reference example below) explains.

Put simply, it is a "recurrence": starting from the known first item (-1) and second item (0), every later item is derived from the earlier ones. The detailed process is not repeated here; the linked blog post explains it very clearly.

-------

From this we can write down the concrete steps of the KMP algorithm:

    1. Construct the jump table (next table) from the pattern string S2
    2. Scan S1 from the head; on a mismatch, look up the jump value in the next table and continue comparing, with the S1 pointer only ever moving right, until the end of S1
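The two steps above can be sketched as follows. This is one possible rendering under this article's convention (the first jump value is -1), not a definitive implementation; the names build_next and kmp_search and the start parameter are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Step 1: fill next[0..m-1] for the pattern s2, with next[0] = -1. */
static void build_next(const char *s2, int m, int next[]) {
    int i = 0, j = -1;
    next[0] = -1;
    while (i < m - 1) {          /* stop before writing past next[m-1] */
        if (j == -1 || s2[i] == s2[j]) {
            i++;
            j++;
            next[i] = j;
        } else {
            j = next[j];
        }
    }
}

/* Step 2: 0-based index of the first occurrence of s2 in s1 at or after
 * `start`, or -1. The s1 index i never moves backwards. */
int kmp_search(const char *s1, const char *s2, int start) {
    assert(s1 != NULL && s2 != NULL && start >= 0);
    int n = (int)strlen(s1);
    int m = (int)strlen(s2);
    if (m == 0) {
        return start <= n ? start : -1;
    }
    int *next = malloc((size_t)m * sizeof *next);
    if (next == NULL) {
        return -1;               /* allocation failed; report no match */
    }
    build_next(s2, m, next);

    int i = start, j = 0;
    while (i < n && j < m) {
        if (j == -1 || s1[i] == s2[j]) {
            i++;                 /* characters agree, or s2 restarts */
            j++;
        } else {
            j = next[j];         /* mismatch: only the s2 index jumps */
        }
    }
    free(next);
    return j == m ? i - m : -1;
}
```

Calling kmp_search("ABABAABABC", "ABABC", 0) returns 5. Passing a later start value is one way to continue on to the next occurrence.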

Put plainly, this is trading space for time (the next table takes up space). In this algorithm, though, the next table is just a linear table whose length equals that of the pattern string S2, so it does not need much space.

III. Implementing the next function

The next function constructs the next table (jump table), and constructing the next table is the key to the KMP algorithm (if you are not concerned with the proof of KMP).

We can implement the next function the way the linked blog post does:

Reference example: http://blog.sina.com.cn/s/blog_96ea9c6f01016l6r.html

    #include <stdio.h>

    void getNext(char a[], int n, int next[]) {
        int i, j;

        next[0] = -1;       // the first jump value is -1
        if (n > 1) {
            next[1] = 0;    // the second element's jump value is 0
        }
        for (i = 2; i < n; i++) {
            next[i] = 0;    // default when no earlier match can be extended
            j = i - 1;
            // recurrence: derive next[i] from the values already in the table
            while (j != -1) {
                // next[0] is -1, so check the index before reading a[next[j]]
                if (next[j] != -1 && a[i - 1] == a[next[j]]) {
                    next[i] = next[j] + 1;
                    break;
                } else {
                    j = next[j];
                }
            }
        }
    }

    int main(void) {
        char a[] = "ABAABCAC";  // pattern string S2
        int next[8] = {0};      // jump table, initialized to all 0
        int i;

        // construct the next table
        getNext(a, 8, next);
        // output the next table
        for (i = 0; i < 8; i++) {
            printf("%d ", next[i]);
        }
        printf("\n");
        return 0;
    }

Of course, the getNext function above still does not look efficient enough (a double loop), but it has the advantage of being easy to understand. Now look at the next function given in the book:

    #include <stdio.h>

    void getNext(char a[], int n, int next[]) {
        int i, j;

        i = 0;
        next[0] = -1;   // the initial jump value is -1
        j = -1;
        // recurrence: derive the remaining values of the next table
        while (i < n) {
            if (j == -1 || a[i] == a[j]) {
                ++i;
                ++j;
                if (i < n) {    // the last pass reaches i == n; stay inside the table
                    next[i] = j;
                }
            } else {
                j = next[j];
            }
        }
    }

    int main(void) {
        char a[] = "ABAABCAC";  // pattern string S2
        int next[8] = {0};      // jump table, initialized to all 0
        int i;

        // construct the next table
        getNext(a, 8, next);
        // output the next table
        for (i = 0; i < 8; i++) {
            printf("%d ", next[i]);
        }
        printf("\n");
        return 0;
    }

The two produce the same result. The book's algorithm eliminates the double loop, but the price is more complex code, and there is no substantial optimization.

What? Eliminating the double loop does not improve efficiency? Is that possible? We may as well verify it with a counter:

Our version (lines marked /// are the additions):

    void getNext(char a[], int n, int next[]) {
        int i, j;
        int counter = 0;    ///

        next[0] = -1;       // the first jump value is -1
        if (n > 1) {
            next[1] = 0;    // the second element's jump value is 0
        }
        for (i = 2; i < n; i++) {
            next[i] = 0;
            j = i - 1;
            // recurrence: derive the remaining values of the next table
            while (j != -1) {
                if (next[j] != -1 && a[i - 1] == a[next[j]]) {
                    next[i] = next[j] + 1;
                    counter++;  ///
                    break;
                } else {
                    j = next[j];
                    counter++;  ///
                }
            }
        }
        printf("\n%d\n", counter);  ///
    }

The book's version:

    void getNext(char a[], int n, int next[]) {
        int i, j;
        int counter = 0;    ///

        i = 0;
        next[0] = -1;       // the first jump value is -1
        j = -1;
        // recurrence: derive the remaining values of the next table
        while (i < n) {
            if (j == -1 || a[i] == a[j]) {
                ++i;
                ++j;
                if (i < n) {    // stay inside the table on the last pass
                    next[i] = j;
                }
                counter++;  ///
            } else {
                j = next[j];
                counter++;  ///
            }
        }
        printf("\n%d\n", counter);  ///
    }

P.S. We insert counter++ into the concrete operation parts (the innermost if block and else block), so that the results are comparable (how many operations each algorithm performs).

The results of the run are as follows: for the pattern "ABAABCAC", our own next function reports 10 operations and the book's reports 14.

The 4 steps we save come from the difference in the number of outer iterations (the book's algorithm advances n times, ours n-2). If our outer loop also ran n times, it would need the same 14 operations (we merely applied a simple optimization).

The two next functions differ only in form; the sequence of internal operations is exactly the same. Once the next function is clear, the KMP algorithm holds no further difficulty (if you do not care why the S1 pointer never needs to backtrack, this really is the only hard part).

IV. Proof of correctness of the KMP algorithm

(This article does not go into it. Once I understand it, I may come back and fill in the missing content, explaining in detail why the S1 pointer does not backtrack, why the part skipped over cannot contain a match, ...)

However, if we focus only on "implementation", understanding the next function is enough to implement the KMP algorithm easily (as for why this works and how to prove it is correct, that is the mathematicians' business).

V. Summary

The core of the KMP algorithm is the construction of the next table, and the key idea in constructing the next table is the "recurrence". Once you understand this, you can write out the KMP algorithm in a matter of minutes.

A little off-topic:

The KMP algorithm also has variants. This article discusses the most basic KMP algorithm; common variants are:

    • The next table begins with 0 (not -1), so every element is the corresponding element of our next table plus 1. This is only a difference of convention (counting from 0 versus counting from 1), with no substantial difference
    • The next table contains more than one -1 (our result has only one -1, the first element of the next table). This is a substantial optimization that can effectively improve efficiency. In effect it does one more step than we do when constructing the next table, so the construction becomes slightly more complex, but the matching algorithm performs fewer comparisons
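The second variant is commonly known as the nextval table. The sketch below shows one usual way to build it, assuming the same convention as this article (first element -1); build_nextval is a name chosen for illustration. The extra step is the comparison s2[i] == s2[j]: if the character at the jump target equals the character that just failed, comparing it would fail as well, so the target's own jump value is used instead.

```c
#include <assert.h>
#include <stddef.h>

/* Optimized jump table (nextval) for the pattern s2 of length m.
 * Same construction as the plain next table, plus one extra comparison
 * that skips jump targets holding the same character as s2[i]. */
void build_nextval(const char *s2, int m, int nextval[]) {
    assert(s2 != NULL && nextval != NULL && m > 0);
    int i = 0, j = -1;
    nextval[0] = -1;
    while (i < m - 1) {
        if (j == -1 || s2[i] == s2[j]) {
            i++;
            j++;
            /* the extra step compared with the plain next table */
            nextval[i] = (s2[i] == s2[j]) ? nextval[j] : j;
        } else {
            j = nextval[j];
        }
    }
}
```

For the pattern "ABAABCAC" this yields -1 0 -1 1 0 2 -1 1, compared with -1 0 0 1 1 2 0 1 for the plain next table, so -1 now appears three times.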
