Learning the KMP algorithm

Source: Internet
Author: User

Today I ran into a string-matching problem. The test data is likely to be brutal, so the well-known KMP algorithm looks like the better choice: the general string-matching algorithm keeps rolling its pointers back, which makes it inefficient.

The problem that the KMP algorithm solves can be stated as follows:
There is a text string S, such as acabaabaabcacaabc
and a search string P, such as abaabcac
Find P in S.

The general solution is as follows:

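i = 0;
j = 0;
len_s = strlen(S);
len_p = strlen(P);
while (i < len_s && j < len_p) {
    if ( S[i] == P[j] ) {
        /* characters match: advance both pointers */
        ++i;
        ++j;
    } else {
        /* mismatch: restart one position after the previous start */
        i = i - j + 1;
        j = 0;
    }
}
/* if j == len_p here, P was found in S starting at index i - len_p */

This algorithm is easy to understand. With zero-based indices, when the mismatch happens at S[i] and P[j], the previous j characters have matched: S[i-j] ... S[i-1] = P[0] ... P[j-1]. The attempt that began at S[i-j] has failed, so j is rolled back to 0 and i is rolled back to i-j+1, one position after the previous start (not i-j, which would retry the same start and loop forever). The comparison of S[i] with P[j] then continues.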
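To make the rollback concrete, here is a minimal sketch of the general algorithm as a standalone C function. The names naive_search and comparisons are illustrative and not from the original; the function assumes zero-based indices and NUL-terminated strings, and it counts character comparisons so the cost of rolling i back can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Naive search: returns the zero-based index of the first occurrence
 * of p in s, or -1 if there is none. If comparisons is not NULL, it
 * receives the number of character comparisons performed.
 */
long naive_search(const char *s, const char *p, long *comparisons) {
    assert(s != NULL && p != NULL);
    size_t len_s = strlen(s);
    size_t len_p = strlen(p);
    size_t i = 0;
    size_t j = 0;
    long count = 0;

    while (i < len_s && j < len_p) {
        count++;
        if (s[i] == p[j]) {
            i++;
            j++;
        } else {
            i = i - j + 1;   /* roll i back to one past the previous start */
            j = 0;           /* roll j back to the start of the pattern */
        }
    }
    if (comparisons != NULL) {
        *comparisons = count;
    }
    return (j == len_p) ? (long)(i - len_p) : -1;
}
```

For the strings in the problem statement, naive_search("acabaabaabcacaabc", "abaabcac", NULL) returns 5, the zero-based position where abaabcac begins.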

 

The reason for the low efficiency is that j is rolled back too far, all the way to 0. Under normal circumstances P is fairly short, so resetting j alone does not cost much. However, i is rolled back a long way as well, and everything learned from the comparisons in that round is thrown away.
The KMP algorithm never rolls i back, and rolls j back only part of the way (all the way to 0 in the worst case).

How can we achieve this? For example:
S = gggabcabc
P = abcac
With zero-based indices, S[3] S[4] S[5] S[6] = P[0] P[1] P[2] P[3] = abca, but S[7] != P[4]. The general algorithm rolls the pointer i of S back to 4 and the pointer j of P back to 0. The KMP algorithm notices instead that the matched part abca begins and ends with the same substring, "a". So i does not roll back; j rolls back to 1, and the comparison resumes with S[7] against P[1]. This is safe because every character before the mismatch matched: the character just before S[7] is S[i-1] = S[6] = a, and P[j-1] = P[3] equals it, so P[0] = P[3] = P[j-1] = S[i-1] = S[6]. The comparison of P[0] with S[6] is therefore already known to succeed and can be skipped, and i never needs to move back.
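The "same substring at the beginning and end" in this example can be checked mechanically. The sketch below is a brute-force helper, not part of the KMP algorithm itself; longest_border is an illustrative name, and "proper" is taken to mean shorter than the whole string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Length of the longest proper prefix of s[0..n-1] that is also a
 * suffix of it, found by trying every candidate length from n-1 down.
 */
size_t longest_border(const char *s, size_t n) {
    assert(s != NULL);
    if (n == 0) {
        return 0;
    }
    for (size_t k = n - 1; k > 0; k--) {
        /* compare the first k characters with the last k characters */
        if (memcmp(s, s + (n - k), k) == 0) {
            return k;
        }
    }
    return 0;
}
```

For the matched part abca, longest_border("abca", 4) is 1, which is exactly the position that j falls back to in the example.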

A simple description of the KMP algorithm: when a match fails, i does not roll back, and j rolls back to a certain position. Call that position next[j].
Constructing such a next array is the heart of the method. Once the next array is available, the matching loop can be written as follows:

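    i = 0;
    j = 0;
    len = strlen(P);
    len_content = strlen(S);
    find_pos = -1;
    loop = 0;
    while ( i < len_content && j < len ) {
        loop++;
        if ( S[i] == P[j] ) {
            i++;
            j++;
        } else {
            if ( next[j] == -1 ) {
                /* P[0] itself mismatched: advance in S, restart P */
                i++;
                j = 0;
            } else {
                /* keep i where it is; fall back in P */
                j = next[j];
            }
        }
    }
    if ( j == len ) {
        /* the match ends just before i, so it starts at i - len */
        find_pos = i - len;
    }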

 

Note the special value next[j] == -1. It means that no part of P can match at the current position i of S, so S moves forward one position and P goes back to its first character.

The above is the KMP algorithm. As we can see, it relies entirely on the next array, so the key is to construct the next array of P in advance.
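As a sketch of how the matching loop behaves when the next array is simply given, the function below takes next as a parameter. The name kmp_search_with_next is illustrative, and the array used in the usage note was worked out by hand for the pattern abcac from the earlier example; both are assumptions rather than part of the original text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * KMP matching loop with a caller-supplied next array (one int per
 * character of p, with next[0] == -1). Returns the zero-based index
 * of the first occurrence of p in s, or -1 if there is none.
 */
long kmp_search_with_next(const char *s, const char *p, const int *next) {
    assert(s != NULL && p != NULL && next != NULL);
    long len_s = (long)strlen(s);
    long len_p = (long)strlen(p);
    long i = 0;   /* position in s; it only ever increases */
    long j = 0;   /* position in p; it falls back through next */

    while (i < len_s && j < len_p) {
        if (s[i] == p[j]) {
            i++;
            j++;
        } else if (next[j] == -1) {
            i++;          /* p[0] does not match s[i]: move on in s */
            j = 0;
        } else {
            j = next[j];  /* keep i, reuse the already matched prefix */
        }
    }
    return (j == len_p) ? i - len_p : -1;
}
```

With the hand-computed array { -1, 0, 0, 0, 1 } for abcac, searching gggabcabcac returns 6: after the mismatch at S[7], j falls back to 1 and the match completes without i ever moving backwards.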

Assume that P = abaabcac.
We know that next[0] = -1, which means that when the first character fails to match, i moves forward by one and j returns to 0.
For i > 0, next[i] is the largest k < i such that P[0] P[1] ... P[k-1] = P[i-k] ... P[i-1]. In other words, next[i] = k is the length of the longest string that appears at both the head and the tail of P[0] P[1] ... P[i-1]. For abaabcac this gives next = -1 0 0 1 1 2 0 1.
We proceed by induction. Given next[0] = -1, assume that next[i] = k is known. Then
next[i+1] = ?
next[i] = k tells us that the head and the tail of P[0] P[1] ... P[i-1] share a string of length k: P[0] P[1] ... P[k-1] = P[i-k] ... P[i-1]. If in addition P[k] = P[i], then P[0] P[1] ... P[k-1] P[k] = P[i-k] ... P[i-1] P[i], and so
next[i+1] = k + 1 = next[i] + 1.
If unfortunately P[k] != P[i], the problem looks just like the original one, searching for a string T in a string S:
T = P[0] P[1] ... P[k-1] P[k],
S = P[0] P[1] ... P[i-1] P[i].
While searching for T at the end of S we have matched P[0] P[1] ... P[k-1] = P[i-k] ... P[i-1], but at the last step P[k] != P[i] and the match fails. So we apply the KMP algorithm itself. (This is the elegant part: the KMP algorithm depends on the next array, and building the next array in turn uses the KMP idea recursively.)
When the match fails, let k = next[k] and compare P[k] with P[i] again. If they still differ, keep setting k = next[k]
until P[k] = P[i] or k = -1. In either case next[i+1] = k + 1.
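The recurrence can be checked against the definition directly. The sketch below builds next in two ways for comparison: build_next follows the induction step described above, and next_by_definition searches for the longest shared head and tail by brute force. Both names are illustrative, and both fill only next[0] through next[len-1].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* next[] from the recurrence: extend the border or fall back via next. */
void build_next(const char *p, int *next) {
    assert(p != NULL && next != NULL);
    int len = (int)strlen(p);
    if (len == 0) {
        return;
    }
    int i = 0;
    int k = -1;
    next[0] = -1;
    while (i < len - 1) {
        if (k == -1 || p[i] == p[k]) {
            /* either nothing is left to try, or the border grows by one */
            i++;
            k++;
            next[i] = k;
        } else {
            k = next[k];   /* retry with the next shorter border */
        }
    }
}

/* next[] straight from the definition, by brute force. */
void next_by_definition(const char *p, int *next) {
    assert(p != NULL && next != NULL);
    int len = (int)strlen(p);
    for (int i = 0; i < len; i++) {
        next[i] = (i == 0) ? -1 : 0;
        /* try the longest candidate first: k = i-1, i-2, ..., 1 */
        for (int k = i - 1; k > 0; k--) {
            if (memcmp(p, p + (i - k), (size_t)k) == 0) {
                next[i] = k;
                break;
            }
        }
    }
}
```

For P = abaabcac both functions produce -1 0 0 1 1 2 0 1. The brute-force version is only a cross-check; the recurrence is what makes the preprocessing run in time proportional to the length of P.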

The entire KMP algorithm is as follows:

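#include <stdio.h>
#include <string.h>

/*
 * Build the next array for str_search.
 * next[0] is -1; for i > 0, next[i] is the length of the longest proper
 * prefix of str_search[0..i-1] that is also a suffix of it.
 * The loop also fills next[len], so next needs len + 1 entries.
 */
int get_next(const char* str_search, int* next) {
    int len;
    int i;
    int k;

    next[0] = -1;
    i = 0;
    len = (int)strlen(str_search);
    k = -1;

    while ( i < len ) {
        /* test k == -1 first so that str_search[-1] is never read */
        if (k == -1 || str_search[i] == str_search[k]) {
            i++;
            k++;
            next[i] = k;
        } else {
            k = next[k];
        }
    }
    return len;
}

int main(void) {
    char str_content[1024];
    char str_search[255];
    /* int rather than char: plain char may be unsigned, which breaks the -1 test */
    int next[256];
    int i;
    int j;
    int len;
    int len_content;
    int find_pos;
    int loop;

    /* field widths keep scanf from overflowing the buffers */
    if (scanf("%1023s", str_content) != 1 || scanf("%254s", str_search) != 1) {
        return 1;
    }

    //get the next
    len = get_next(str_search, next);

    //kmp
    i = 0;
    j = 0;
    len_content = (int)strlen(str_content);
    loop = 0;
    while ( i < len_content && j < len ) {
        loop++;
        if ( str_content[i] == str_search[j] ) {
            i++;
            j++;
        } else {
            if ( next[j] == -1 ) {
                i++;
                j = 0;
            } else {
                j = next[j];
            }
        }
    }

    if (j == len) {
        /* the match ends just before i, so it starts at i - len */
        find_pos = i - len;
        printf("found at %d, cost: %d loop\n", find_pos, loop);
    }

    return 0;
}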

 
