KMP Algorithm Learning & Summary

Source: Internet
Author: User

0. Origin

The legendary KMP algorithm, which I have long admired, solves string matching in linear time even in the worst case.

Only when I started reading about it did I learn that the K in KMP is none other than Donald E. Knuth, author of The Art of Computer Programming.

OK, on with the admiration ...

1. Traditional string matching algorithm

    #include <string>

    /*
     * Match p against s, starting from position sindex of s.
     * If the match succeeds, returns the start index of the pattern string p in s;
     * if the match fails, returns -1.
     */
    int index(const std::string &s, const std::string &p, const int sindex = 0)
    {
        if (s.length() < 1 || p.length() < 1 || sindex < 0)
        {
            return -1;
        }
        const int slen = static_cast<int>(s.length());
        const int plen = static_cast<int>(p.length());
        int i = sindex, j = 0;
        while (i < slen && j < plen)
        {
            if (s[i] == p[j])
            {
                ++i;
                ++j;
            }
            else
            {
                // mismatch: back up i to one past the previous start, restart p
                i = i - j + 1;
                j = 0;
            }
        }
        return j == plen ? i - j : -1;
    }

2. Performance problems of traditional string matching algorithm

Using the pattern string p to match the string s, a mismatch occurs at i=6, j=4:

                   i=6
    S: a b a b c a b c a c b a b
    P:     a b c a c
                   j=4

At this point, according to the traditional algorithm, the 1st character a (j=0) of p should be slid to align with the 4th character b (i=3) in s, and matching starts over from there:

             i=3
    S: a b a b c a b c a c b a b
    P:       a b c a c
             j=0

In this process, the access position in the string s has moved backwards (from i=6 back to i=3).

We do not want this to happen. Instead, we want to slide the pattern string p to the right so that some character of p is aligned with i=5 in s, and then compare the character at i=6 in s with the following character of p.

In this example, we slide p 3 characters to the right directly, so that the character at i=5 in s is aligned with the character at j=0 in p, and then compare the character at i=6 in s with the character at j=1 in p.

                   i=6
    S: a b a b c a b c a c b a b
    P:           a b c a c
                   j=1
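To make the cost of backing up concrete, here is a minimal sketch that instruments the traditional algorithm from section 1. The struct NaiveStats and the function naiveIndexCounted are hypothetical names introduced only for this illustration; they are not part of the original code.

```cpp
#include <string>

// Hypothetical helper type for illustration: result plus instrumentation.
struct NaiveStats {
    int position;     // start index of p in s, or -1 if not found
    int comparisons;  // number of character comparisons performed
    int backtracks;   // number of times i was moved backwards in s
};

// The traditional algorithm from section 1, with counters added.
NaiveStats naiveIndexCounted(const std::string &s, const std::string &p)
{
    NaiveStats stats{-1, 0, 0};
    if (s.empty() || p.empty()) return stats;
    const int n = static_cast<int>(s.length());
    const int m = static_cast<int>(p.length());
    int i = 0, j = 0;
    while (i < n && j < m) {
        ++stats.comparisons;
        if (s[i] == p[j]) {
            ++i;
            ++j;
        } else {
            // i - j + 1 is smaller than i only when j > 1
            if (j > 1) ++stats.backtracks;
            i = i - j + 1;
            j = 0;
        }
    }
    if (j == m) stats.position = i - j;
    return stats;
}
```

For s = "ababcabcacbab" and p = "abcac", this sketch finds the match at index 5 and moves i backwards twice: once at the mismatch with j=2 and once at the mismatch with j=4 discussed above.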

3. General discussion of KMP algorithm

The following discusses how to implement string matching without "backing up" in S, relying only on "sliding" P, that is, the KMP algorithm in the general case.

After the slide (the character at index k=1 of P is compared with the character at i=6 of S):

                   i=6
    S: a b a b c a b c a c b a b
    P:           a b c a c
                   k=1

At the moment of mismatch (i=6, j=4):

                   i=6
    S: a b a b c a b c a c b a b
    P:     a b c a c
                   j=4

For arbitrary S and P, when the character at index i in S mismatches the character at index j in P, suppose that P should be slid so that its character at index k is "aligned" with the character at index i in S, and the comparison continues from there.

So, what is this k?

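We know that "alignment" means that S and P satisfy the following condition (the k characters of P before index k coincide with the k characters of S before index i):

$$ p_0 p_1 \cdots p_{k-1} = s_{i-k} s_{i-k+1} \cdots s_{i-1} \tag{1} $$

On the other hand, at the moment of mismatch we already have a partial match result (the characters of P before index j coincide with the characters of S before index i; in particular, the last k of them):

$$ p_{j-k} p_{j-k+1} \cdots p_{j-1} = s_{i-k} s_{i-k+1} \cdots s_{i-1} \tag{2} $$

From (1) and (2) we obtain:

$$ p_0 p_1 \cdots p_{k-1} = p_{j-k} p_{j-k+1} \cdots p_{j-1} \tag{3} $$

In other words, the first k characters of P[0..j-1] are equal to its last k characters, a condition that depends only on P and not on S.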

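Define next[j] = k, where k indicates that when the character at index j in the pattern string P mismatches the character at index i in the main string S, the character at index k in P is the one to compare next with the character at index i in S. To avoid skipping a possible match, k is taken as large as possible:

$$ \mathrm{next}[j] = \begin{cases} -1, & j = 0 \\ \max\{\, k \mid 0 < k < j,\ p_0 \cdots p_{k-1} = p_{j-k} \cdots p_{j-1} \,\}, & \text{if this set is non-empty} \\ 0, & \text{otherwise} \end{cases} \tag{4} $$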

Here is an example of the next array as defined above:

    j        0  1  2  3  4  5  6  7
    P        a  b  a  a  b  c  a  c
    next[j] -1  0  0  1  1  2  0  1
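The table can be checked mechanically. The following is a brute-force sketch that evaluates the definition of next[j] directly (longest proper prefix of P[0..j-1] that is also its suffix). The name nextByDefinition is hypothetical, and the function is deliberately slow; it is meant only for checking small examples, not as the way KMP computes the array.

```cpp
#include <string>
#include <vector>

// Computes next[] directly from the definition by brute force.
// Hypothetical helper for checking small examples; not the efficient method.
std::vector<int> nextByDefinition(const std::string &p)
{
    const int m = static_cast<int>(p.size());
    std::vector<int> next(m, 0);  // 0 is the "otherwise" case
    if (m == 0) return next;
    next[0] = -1;
    for (int j = 1; j < m; ++j) {
        // Try the largest k first: 0 < k < j with p[0..k-1] == p[j-k..j-1].
        for (int k = j - 1; k > 0; --k) {
            if (p.compare(0, k, p, j - k, k) == 0) {
                next[j] = k;
                break;
            }
        }
    }
    return next;
}
```

Called with "abaabcac", it reproduces the row -1 0 0 1 1 2 0 1 from the table.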

Given the next array, string matching proceeds as follows:

Let i and j be the indices of the characters currently being compared in the main string S and the pattern string P, respectively. The initial value of i is sindex and the initial value of j is 0.

In each iteration of the matching loop, if j = -1 or S[i] = P[j], then i and j are each increased by 1;

otherwise, j falls back to next[j], and the next iteration compares S[i] with the character of P at the new index j.
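The steps above can be traced with a small sketch. It assumes the next array has already been computed and is passed in by the caller; the function name matchWithNext and the fallbacks parameter are hypothetical and exist only to show what happens at each mismatch.

```cpp
#include <string>
#include <utility>
#include <vector>

// Sketch of the matching steps, given a precomputed next array.
// `fallbacks` (hypothetical, for illustration) records the pair
// (i, new j) after every mismatch, showing that i never moves backwards.
int matchWithNext(const std::string &s, const std::string &p,
                  const std::vector<int> &next,
                  std::vector<std::pair<int, int>> &fallbacks)
{
    const int n = static_cast<int>(s.length());
    const int m = static_cast<int>(p.length());
    int i = 0, j = 0;
    while (i < n && j < m) {
        if (j == -1 || s[i] == p[j]) {
            ++i;
            ++j;
        } else {
            j = next[j];  // slide p; i stays where it is
            fallbacks.emplace_back(i, j);
        }
    }
    return j == m ? i - j : -1;
}
```

For s = "ababcabcacbab", p = "abcac" and next = {-1, 0, 0, 0, 1}, the sketch records two fallbacks, (2, 0) and (6, 1), and returns 5. The second fallback is exactly the slide to k=1 at i=6 shown in the diagrams.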

4. Implementation of the KMP algorithm

Assuming the next function is known, the steps above give the following implementation of the KMP algorithm:

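    #include <string>
    #include <vector>

    void GetNext(const std::string &p, std::vector<int> &next);  // defined below

    int KMP(const std::string &s, const std::string &p, const int sindex = 0)
    {
        // same guard as index(); also keeps GetNext away from an empty pattern
        if (s.length() < 1 || p.length() < 1 || sindex < 0)
        {
            return -1;
        }
        std::vector<int> next(p.size());
        GetNext(p, next);  // get the next array, saved into the vector
        const int slen = static_cast<int>(s.length());
        const int plen = static_cast<int>(p.length());
        int i = sindex, j = 0;
        while (i < slen && j < plen)
        {
            if (j == -1 || s[i] == p[j])
            {
                ++i;
                ++j;
            }
            else
            {
                j = next[j];  // slide p to the right; i never moves back
            }
        }
        return j == plen ? i - j : -1;
    }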

OK, the remaining question is how to compute the next array of the pattern string P.

The initial condition of the next array is next[0] = -1. Suppose next[j] = k; then:

$$ p_0 p_1 \cdots p_{k-1} = p_{j-k} p_{j-k+1} \cdots p_{j-1} $$

For next[j+1] there are two cases:

① If $p_k = p_j$, then:

$$ p_0 p_1 \cdots p_k = p_{j-k} p_{j-k+1} \cdots p_j $$

In this case next[j+1] = next[j] + 1 = k + 1.

② If $p_k \neq p_j$:

In this case, P needs to be slid to the right, and the character at index j in P is then compared with the character at index next[k] in P.

It is worth noting that this "slide to the right" is itself the KMP slide on a mismatch, with the whole process viewed as P being matched against itself. Writing $k' = \mathrm{next}[k]$, we have:

$$ p_0 p_1 \cdots p_{k'-1} = p_{j-k'} p_{j-k'+1} \cdots p_{j-1} $$

If $p_{k'} = p_j$, then next[j+1] = next[k] + 1;

Otherwise, continue to slide P to the right in the same way until the comparison succeeds, or until no such match exists, in which case next[j+1] = 0.

The GetNext function is implemented as follows:

    #include <string>
    #include <vector>

    void GetNext(const std::string &p, std::vector<int> &next)
    {
        next.resize(p.size());
        if (p.empty())
        {
            return;  // nothing to compute; avoids writing next[0]
        }
        next[0] = -1;
        const int last = static_cast<int>(p.size()) - 1;
        int i = 0, j = -1;
        while (i != last)
        {
            // note that i == 0 here actually computes next[1], and so on
            if (j == -1 || p[i] == p[j])
            {
                ++i;
                ++j;
                next[i] = j;
            }
            else
            {
                j = next[j];
            }
        }
    }

Thus, a complete KMP has been implemented.
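One way to gain some confidence in the two functions is to compare their result with std::string::find on a few inputs. The sketch below is self-contained, so it repeats the logic under the hypothetical names buildNext and kmpSearch (stand-ins for GetNext and KMP); agreesWithFind is also a hypothetical helper. It assumes a non-empty pattern, because the code above returns -1 for an empty pattern while std::string::find returns 0.

```cpp
#include <string>
#include <vector>

// Stand-in for GetNext (hypothetical name).
std::vector<int> buildNext(const std::string &p)
{
    std::vector<int> next(p.size());
    if (p.empty()) return next;
    next[0] = -1;
    const int m = static_cast<int>(p.size());
    int i = 0, j = -1;
    while (i < m - 1) {
        if (j == -1 || p[i] == p[j]) {
            ++i;
            ++j;
            next[i] = j;
        } else {
            j = next[j];
        }
    }
    return next;
}

// Stand-in for KMP with sindex fixed at 0 (hypothetical name).
int kmpSearch(const std::string &s, const std::string &p)
{
    if (s.empty() || p.empty()) return -1;
    const std::vector<int> next = buildNext(p);
    const int n = static_cast<int>(s.length());
    const int m = static_cast<int>(p.length());
    int i = 0, j = 0;
    while (i < n && j < m) {
        if (j == -1 || s[i] == p[j]) {
            ++i;
            ++j;
        } else {
            j = next[j];
        }
    }
    return j == m ? i - j : -1;
}

// True if kmpSearch and std::string::find report the same first match.
// Assumes p is non-empty (the two disagree by design on an empty pattern).
bool agreesWithFind(const std::string &s, const std::string &p)
{
    const std::string::size_type pos = s.find(p);
    const int expected =
        (pos == std::string::npos) ? -1 : static_cast<int>(pos);
    return kmpSearch(s, p) == expected;
}
```

For example, agreesWithFind("ababcabcacbab", "abcac") holds, with both sides reporting index 5.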

5. Further optimization of GetNext function

Note that the GetNext function above still has room for optimization. For example:

             i=3
    S: a a a b a a a a b
    P: a a a a b
             j=3

Here the mismatch occurs at i=3, j=3. Since next[3]=2, three more comparisons are then made:

i=3, j=2;

i=3, j=1;

i=3, j=0.

In fact, the mismatch at i=3, j=3 already tells us that a != b, and each of the three later comparisons again compares a with b, so all three are superfluous.

In this situation, P should be slid right by 4 characters directly, so that the comparison continues at i=4, j=0.

In general, in the GetNext function, next[i] = j means that when p[i] fails to match a character in S, p[j] is used to continue the comparison with that same character in S.

If p[i] == p[j], that comparison is bound to fail as well (as in the example above), so in this case next[i] should be set to next[j] directly.

The complete implementation code is as follows:

    #include <string>
    #include <vector>

    void GetNextUpdate(const std::string &p, std::vector<int> &next)
    {
        next.resize(p.size());
        if (p.empty())
        {
            return;  // nothing to compute; avoids writing next[0]
        }
        next[0] = -1;
        const int last = static_cast<int>(p.size()) - 1;
        int i = 0, j = -1;
        while (i != last)
        {
            // note that i == 0 here actually computes next[1], and so on
            if (j == -1 || p[i] == p[j])
            {
                ++i;
                ++j;
                // update
                // next[i] = j;
                // note that p[i] and p[j] here are taken after ++i and ++j
                next[i] = p[i] != p[j] ? j : next[j];
            }
            else
            {
                j = next[j];
            }
        }
    }

Correspondingly, the only change needed in the KMP function is to replace GetNext(p, next); with GetNextUpdate(p, next);.
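The saving can be measured on the example from this section. The sketch below builds the next array with or without the refinement and counts how many times a character of S is compared with a character of P. The names buildNext and countComparisons, and the optimized flag, are hypothetical and introduced only for this illustration.

```cpp
#include <string>
#include <vector>

// Builds next[]; with optimized == true it applies the refinement
// next[i] = next[j] when p[i] == p[j]. Hypothetical helper.
std::vector<int> buildNext(const std::string &p, bool optimized)
{
    std::vector<int> next(p.size());
    if (p.empty()) return next;
    next[0] = -1;
    const int m = static_cast<int>(p.size());
    int i = 0, j = -1;
    while (i < m - 1) {
        if (j == -1 || p[i] == p[j]) {
            ++i;
            ++j;
            next[i] = (optimized && p[i] == p[j]) ? next[j] : j;
        } else {
            j = next[j];
        }
    }
    return next;
}

// Runs the KMP matching loop and returns how many times s[i] and p[j]
// are actually compared. Hypothetical helper.
int countComparisons(const std::string &s, const std::string &p,
                     const std::vector<int> &next)
{
    const int n = static_cast<int>(s.length());
    const int m = static_cast<int>(p.length());
    int i = 0, j = 0, comparisons = 0;
    while (i < n && j < m) {
        if (j == -1) {  // p was slid past s[i]; no comparison happens
            ++i;
            ++j;
            continue;
        }
        ++comparisons;
        if (s[i] == p[j]) {
            ++i;
            ++j;
        } else {
            j = next[j];
        }
    }
    return comparisons;
}
```

For S = "aaabaaaab" and P = "aaaab", the plain array is -1 0 1 2 3 and the refined array is -1 -1 -1 -1 3. The matching loop performs 12 comparisons with the former and 9 with the latter, which accounts for the three superfluous comparisons listed above.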

6. Analysis of time complexity

The time complexity of the KMP algorithm is analyzed with the GetNext function as an example.

    void GetNext(const std::string &p, std::vector<int> &next)
    {
        next.resize(p.size());
        next[0] = -1;

        int i = 0, j = -1;

        while (i != p.size() - 1)
        {
            if (j == -1 || p[i] == p[j])
            {
                ++i;
                ++j;
                next[i] = j;
            }
            else
            {
                j = next[j];
            }
        }
    }

Let p.size() be m. The difficulty in analyzing the time complexity is that ++i is not executed in every iteration of the while loop, so the loop does not necessarily run exactly m times.

Look at it from a different angle: in every iteration, whether the if branch or the else branch is taken, the value of j is modified exactly once. Therefore the number of times j is modified over the whole while loop gives the time complexity of the GetNext function.

On each successful match, ++i; ++j; are executed together. Since ++i is executed at most m-1 times, ++j is also executed at most m-1 times, that is, j is increased at most m-1 times.

Correspondingly, only j = next[j]; makes j smaller, and it always does. Since j starts at -1, never drops below -1, and is increased at most m-1 times, it can be decreased at most m-1 times.

In summary, the while loop runs at most 2(m-1) times, so the time complexity of the GetNext function is O(m). By the same argument, if the length of the string s being matched is n, the matching loop is O(n), so the time complexity of the KMP function is O(m+n).
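The counting argument can be checked empirically with a small sketch that counts the iterations of the while loop. The name countGetNextIterations is hypothetical; the loop body is the same as in GetNext. By the argument above, the returned value should never exceed 2(m-1) for a pattern of length m.

```cpp
#include <string>
#include <vector>

// Returns the number of iterations of the while loop in GetNext for p.
// Hypothetical helper for illustration.
int countGetNextIterations(const std::string &p)
{
    if (p.empty()) return 0;
    const int m = static_cast<int>(p.size());
    std::vector<int> next(m);
    next[0] = -1;
    int i = 0, j = -1, iterations = 0;
    while (i < m - 1) {
        ++iterations;  // exactly one modification of j per iteration
        if (j == -1 || p[i] == p[j]) {
            ++i;
            ++j;
            next[i] = j;
        } else {
            j = next[j];
        }
    }
    return iterations;
}
```

For "aaaab" (m = 5) the loop runs 4 times and for "abcac" (m = 5) it runs 6 times, both within the bound of 8.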

7. Advantages of KMP in application

① Fast: the worst-case time complexity is linear, O(m+n);

② No need to revisit earlier parts of the string s being matched, so it is well suited to processing large files read from peripheral devices, matching while reading.
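Point ② can be illustrated with a sketch that matches while reading from a std::istream, consuming each character exactly once and never seeking backwards. The names findInStream and findInText are hypothetical; the next array is built in the same way as in GetNext, and the inner while loop is the same slide as in KMP, written per input character.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Sketch: find the first occurrence of p in a stream, reading each
// character exactly once. Returns the 0-based offset of the match,
// or -1 if there is none. Hypothetical helper for illustration.
long long findInStream(std::istream &in, const std::string &p)
{
    if (p.empty()) return -1;
    const int m = static_cast<int>(p.size());
    // Build next[] as in GetNext.
    std::vector<int> next(m);
    next[0] = -1;
    for (int i = 0, j = -1; i < m - 1;) {
        if (j == -1 || p[i] == p[j]) {
            ++i;
            ++j;
            next[i] = j;
        } else {
            j = next[j];
        }
    }
    long long offset = 0;  // number of characters consumed so far
    int j = 0;
    char c;
    while (in.get(c)) {
        ++offset;
        // Slide p until c matches p[j], or p has been slid past c entirely.
        while (j != -1 && c != p[j]) j = next[j];
        ++j;
        if (j == m) return offset - m;
    }
    return -1;
}

// Convenience wrapper for trying the sketch on an in-memory string.
long long findInText(const std::string &text, const std::string &p)
{
    std::istringstream in(text);
    return findInStream(in, p);
}
```

The same function can be handed a std::ifstream opened on a large file; only the pattern and its next array are kept in memory, not the text. findInText("ababcabcacbab", "abcac") returns 5, matching the result of the in-memory version.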
