KMP algorithm learning

Last Update:2018-12-03 Source: Internet

Author: User


Learning KMP Algorithms

Today I ran into a string-matching problem. The test data is likely to be brutal, so the well-known KMP algorithm looks like the better choice: the general string-matching algorithm has to keep rolling its pointers back, which makes it inefficient.

The problem solved by the KMP algorithm can be described as follows:
There is a text string S, such as acabaabaabcacaabc
and a search string P, such as abaabcac
Find P in S.

The general solution is as follows:

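i = 0;
j = 0;
len_s = strlen(S);
len_p = strlen(P);
while (i < len_s && j < len_p) {
    if ( S[i] == P[j] ) {
        ++i;
        ++j;
    } else {
        i = i - j + 1;   /* restart one position after the previous start */
        j = 0;
    }
}
/* if j == len_p here, P was found in S at position i - len_p */

This algorithm is easy to understand. Suppose S(i-j) S(i-j+1) ... S(i-1) = P(0) P(1) ... P(j-1), but S(i) != P(j), so the match fails. Then j is rolled back to 0 and i is rolled back to i-j+1, the position right after the start of the failed attempt, and the comparison of S(i) with P(j) continues. (With indices that start at 0 the new position is i-j+1. The i-j+2 seen in some textbook versions belongs to indices that start at 1, where j is reset to 1 rather than 0.)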

The reason for the low efficiency is that j rolls back too far: it drops straight to 0. Under normal circumstances P is relatively short, so rolling j back to 0 does not cost much by itself. However, i also rolls back a long way, and every comparison result obtained during the failed attempt is thrown away.
The KMP algorithm never rolls i back, and rolls j back only part of the way (in the worst case, all the way).
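To make the cost of the rollback visible, here is a minimal sketch of the general algorithm as a function that also counts character comparisons. The name naive_find and the comparisons out-parameter are assumptions made for this illustration; they do not come from the original article.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive search: returns the 0-based index of the first occurrence of p in s,
 * or -1 if there is none. *comparisons (hypothetical out-parameter, may be
 * NULL) receives the number of character comparisons performed. */
long naive_find(const char *s, const char *p, long *comparisons)
{
    assert(s != NULL && p != NULL);
    long len_s = (long)strlen(s);
    long len_p = (long)strlen(p);
    long i = 0, j = 0, count = 0;

    while (i < len_s && j < len_p) {
        ++count;
        if (s[i] == p[j]) {
            ++i;
            ++j;
        } else {
            i = i - j + 1;  /* i rolls back to just after the old start */
            j = 0;          /* j rolls all the way back */
        }
    }
    if (comparisons != NULL)
        *comparisons = count;
    return (j == len_p) ? i - len_p : -1;
}
```

Calling naive_find("acabaabaabcacaabc", "abaabcac", &n) with a long n returns the position of the match in the example strings above and leaves the comparison count in n. On inputs with long repeated runs the count grows roughly with strlen(s) * strlen(p).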

How can we achieve this? For example:
S = gggabcabc
P = abcac
Counting from 0, S(3)S(4)S(5)S(6) = P(0)P(1)P(2)P(3) = abca, but S(7) != P(4). The general algorithm rolls the pointer i of S back to 4 and the pointer j of P back to 0. The KMP algorithm notices instead that the matched part abca begins and ends with the same substring, "a". So i does not need to roll back at all: j rolls back to 1, and the comparison resumes with S(7) against P(1). This is safe because everything before the mismatch did match, so S(i-1) = P(j-1), that is, S(6) = P(3) = a. Since P(0) = P(3), we already know that P(0) = S(6). One comparison is saved, and i never moves backwards.

A simple description of the KMP algorithm is: when a match fails, i does not roll back, and j rolls back to a certain position, which we call next[j].
Everything therefore depends on constructing this next array. Once it has been constructed, the algorithm can be written as follows:

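    i = 0;
    j = 0;
    len = strlen(P);
    len_content = strlen(S);
    find_pos = 0;
    loop = 0;
    while ( i < len_content && j < len ) {
        loop++;
        if ( S[i] == P[j] ){
            i++;
            j++;
        } else {
            if ( next[j] == -1 ) {
                /* P cannot match at this position of S: advance S, restart P */
                i++;
                j = 0;
            } else {
                j = next[j];
            }
        }
    }
    if ( j == len ) {
        /* i is one past the end of the match, so the match starts at i - len */
        find_pos = i - len;
    }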

Note the case next[j] = -1. It means that P cannot match at the current position i of S at all, so S has to move forward by one position and P goes back to its first character.

The above is the KMP algorithm. As we can see, it rests entirely on the next array, so the key is to construct the next array of P in advance.

Assume that p = abaabcac
We know that next(0) = -1. It means that when even the first character fails to match, i moves forward by 1 and j goes back to 0.
For i > 0, next(i) is found by looking inside the string P(0)P(1)...P(i-1) for the largest k < i such that P(0)P(1)...P(k-1) = P(i-k)...P(i-1). In other words, next(i) = k when the longest string that appears at both the head and the tail of P(0)P(1)...P(i-1) has length k (k = 0 if there is no such string).
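The definition of next can be checked directly by brute force before looking at the efficient construction. The sketch below assumes a hypothetical helper name, next_by_definition, and simply tries every k from the largest downwards. It is meant for illustration only and is slower than the construction the article derives.

```c
#include <assert.h>
#include <string.h>

/* Fills next[0..len-1] straight from the definition: next[0] = -1 and, for
 * i >= 1, next[i] is the largest k < i with P[0..k-1] == P[i-k..i-1]
 * (0 if no such k exists). Brute force, for illustration only. */
void next_by_definition(const char *p, int *next)
{
    assert(p != NULL && next != NULL);
    int len = (int)strlen(p);

    for (int i = 0; i < len; ++i) {
        if (i == 0) {
            next[0] = -1;
            continue;
        }
        int best = 0;
        for (int k = i - 1; k > 0; --k) {
            /* compare the k-character head with the k-character tail */
            if (memcmp(p, p + i - k, (size_t)k) == 0) {
                best = k;
                break;
            }
        }
        next[i] = best;
    }
}
```

For P = abaabcac this fills next with -1 0 0 1 1 2 0 1. For P = abcac it gives next[4] = 1, which is the "roll j back to 1" step of the gggabcabc example.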
Using mathematical induction, assume that next(0) = -1 and next(i) = k are already known. Then
next(i+1) = ?
next(i) = k tells us that the head and the tail of P(0)P(1)...P(i-1) share the same string, P(0)P(1)...P(k-1) = P(i-k)...P(i-1). If in addition P(k) = P(i), then P(0)P(1)...P(k-1)P(k) = P(i-k)...P(i-1)P(i), so
next(i+1) = k + 1 = next(i) + 1.
But if, unfortunately, P(k) != P(i), the problem becomes similar to the original one: searching for T in S, where
T = P(0)P(1)...P(k-1)P(k),
S = P(0)P(1)...P(i-1)P(i).
While searching for T at the end of S we have matched P(0)P(1)...P(k-1) = P(i-k)...P(i-1), but at the last step P(k) != P(i) and the match fails, so we apply the KMP algorithm itself. (The nice part is that the KMP algorithm depends on the next array, while building the next array in turn uses the KMP algorithm on the part already built. This is a recursive idea.)
When the match fails, let k = next(k) and compare P(k) with P(i) again. If they still differ, keep setting k = next(k) until P(k) = P(i) or k = -1. In either case next(i+1) = k + 1, which gives next(i+1) = 0 when k = -1.

The entire KMP algorithm is as follows:

#include <stdio.h>
#include <string.h>

/* Build the next array of str_search.
   next must have room for strlen(str_search) + 1 entries. */
int get_next(const char* str_search, int* next) {
    int len;
    int i;
    int k;

    next[0] = -1;
    i = 0;
    len = (int)strlen(str_search);
    k = -1;

    while ( i < len ) {
        /* test k == -1 first so that str_search[-1] is never read */
        if (k == -1 || str_search[i] == str_search[k]) {
            i++;
            k++;
            next[i] = k;
        } else {
            k = next[k];
        }
    }
    return len;
}

int main(void) {
    char str_content[1024];
    char str_search[255];
    int next[256];
    int i;
    int j;
    int len;
    int len_content;
    int find_pos;
    int loop;

    /* the field widths keep scanf from overflowing the buffers */
    if (scanf("%1023s", str_content) != 1 || scanf("%254s", str_search) != 1) {
        return 1;
    }

    //get the next
    len = get_next(str_search, next);

    //kmp
    i = 0;
    j = 0;
    len_content = (int)strlen(str_content);
    find_pos = 0;
    loop = 0;
    while ( i < len_content && j < len ) {
        loop++;
        if ( str_content[i] == str_search[j] ){
            i++;
            j++;
        } else {
            if ( next[j] == -1 ) {
                i++;
                j = 0;
            } else {
                j = next[j];
            }
        }
    }

    if (j == len) {
        /* i is one past the end of the match, so the match starts at i - len */
        find_pos = i - len;
        printf("found at %d, cost: %d loop\n", find_pos, loop);
    }

    return 0;
}
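For reuse outside a console program, the same steps can be wrapped in one function that takes the two strings and returns the match position. This is only a sketch under assumptions: the name kmp_find, the heap-allocated next array, and the choice to report an empty pattern as a match at position 0 are not part of the original program.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the 0-based index of the first occurrence of p in s, or -1 if p
 * does not occur (or memory cannot be allocated). kmp_find is a hypothetical
 * name, not part of the original program. */
long kmp_find(const char *s, const char *p)
{
    assert(s != NULL && p != NULL);
    long len_s = (long)strlen(s);
    long len_p = (long)strlen(p);
    if (len_p == 0)
        return 0;       /* convention chosen here: empty pattern matches at 0 */

    long *next = malloc((size_t)(len_p + 1) * sizeof *next);
    if (next == NULL)
        return -1;

    /* Build next[] with the same recurrence as get_next. */
    long i = 0, k = -1;
    next[0] = -1;
    while (i < len_p) {
        if (k == -1 || p[i] == p[k]) {
            ++i;
            ++k;
            next[i] = k;
        } else {
            k = next[k];
        }
    }

    /* Scan s; i never moves backwards. */
    long j = 0;
    i = 0;
    while (i < len_s && j < len_p) {
        if (s[i] == p[j]) {
            ++i;
            ++j;
        } else if (next[j] == -1) {
            ++i;        /* nothing matches at this position of s */
            j = 0;
        } else {
            j = next[j];
        }
    }

    free(next);
    return (j == len_p) ? i - len_p : -1;
}
```

With the strings from the start of the article, kmp_find("acabaabaabcacaabc", "abaabcac") returns the position where P begins in S, and kmp_find("gggabcabc", "abcac") returns -1 because that pattern never occurs in full.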

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
