Finding the position of a substring in a main string, and optimizing the search (KMP algorithm)

Source: Internet
Author: User

Problem description: Given a starting position, find the first occurrence of the substring in the main string at or after that position.

Algorithm implementation:

int index(string str, string substr, int pos) {
    int slen = str.length();
    int sslen = substr.length();
    int i = pos, j = 0;
    // <= so that a match ending at the last character is not missed
    while (i + sslen <= slen) {
        // compare the substring against the window that starts at i
        while (j < sslen) {
            if (str[i + j] == substr[j]) j++;
            else break;
        }
        if (j == sslen) return i + 1;  // 1-based position of the match
        else { j = 0; i++; }           // restart one character later
    }
    return 0;  // 0 means "not found"
}

Here pos is a 0-based starting offset and the return value is a 1-based position. The loop first checks that the current position plus the length of the substring does not exceed the length of the main string. If it does not, the characters of the main string starting at the current position are compared with the characters of the substring one by one. If every character up to the end of the substring matches, the position of the substring in the main string is returned. Otherwise the comparison loop is abandoned at the first mismatch, the substring index is reset to zero, and the main-string position moves forward by one to continue the comparison.

This method is easy to understand, but it performs many unnecessary comparisons.

For example, let the main string be "abcdefabcdexghij" and the substring be "abcdex", and suppose the comparison starts at the beginning of the main string.

The first mismatch is found at the sixth character ('f' against 'x'). The method then moves the main string forward by one and compares with the substring again, finds that the first character already differs, moves the main string forward by one more, and so on until the comparison starts at the seventh character. In fact, the comparisons that start at the second through fifth characters of the main string are completely redundant: those characters have already matched the second through fifth characters of the substring, and none of them equals the substring's first character 'a'. Only the sixth character still has to be examined, because it is the one that mismatched and its relation to the start of the substring is not yet known.
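To make the wasted work concrete, here is a minimal sketch that counts the character comparisons made by the brute-force search. The helper name naive_comparisons and its found output parameter are illustrative assumptions, not part of the original code.

```cpp
#include <string>

// Hypothetical helper (not from the article): counts how many character
// comparisons the brute-force search makes up to and including the first
// match. The 1-based match position is written to *found (0 if no match).
int naive_comparisons(const std::string& str, const std::string& substr, int* found) {
    const int slen = static_cast<int>(str.length());
    const int sslen = static_cast<int>(substr.length());
    int comparisons = 0;
    *found = 0;
    for (int i = 0; i + sslen <= slen; ++i) {
        int j = 0;
        while (j < sslen) {
            ++comparisons;                      // one character comparison
            if (str[i + j] != substr[j]) break; // mismatch: give up on this start
            ++j;
        }
        if (j == sslen) {                       // whole substring matched
            *found = i + 1;
            break;
        }
    }
    return comparisons;
}
```

For the example above, naive_comparisons("abcdefabcdexghij", "abcdex", &found) returns 17 and sets found to 7. Four of those 17 comparisons are the single failed comparisons that start at the second through fifth characters.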

Is there an algorithm that avoids this situation? There is: the KMP algorithm avoids these redundant comparisons.

The first step is to compute an auxiliary array for the substring, called the next array.

How is the next array computed? Let j be the current position in the substring, counted from 1.

When j is 1, that is, at the first character of the substring, next[j]=0;

When j is further along the substring, look at the characters before position j and find a string that appears both at the start and at the end of them. There may be more than one such repeat; take the longest. For example, in "abcabx", when j=5 the preceding characters are "abca" and the fourth character repeats the first, so next[5]=1+1=2. When j=6 the preceding characters are "abcab", the leading "ab" repeats as the trailing "ab", and taking the longest repeat gives next[6]=2+1=3.

If there is no repetition, next[j] is 1.

The above description is expressed in a formula as follows:
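Written out from the three cases above, with $p_1 p_2 \cdots p_m$ as assumed notation for the characters of the substring (positions counted from 1), the definition reads:

```latex
next[j] =
\begin{cases}
0, & j = 1 \\[4pt]
\max\{\, k \mid 1 < k < j,\ p_1 p_2 \cdots p_{k-1} = p_{j-k+1} \cdots p_{j-1} \,\}, & \text{if this set is non-empty} \\[4pt]
1, & \text{otherwise}
\end{cases}
```

For "abcabx" and $j = 6$, the value $k = 3$ satisfies $p_1 p_2 = p_4 p_5 = \text{ab}$, which gives $next[6] = 3$, in agreement with the example.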


The next array for a substring such as "abcabx" can be represented as follows:

j:     1  2  3  4  5  6
S:     a  b  c  a  b  x
next:  0  1  1  1  2  3
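The table can be checked mechanically. The following is a minimal sketch that builds the 1-based next table directly from the definition by trying the longest candidate repeat first. The function name next_by_definition is an assumption made for illustration, and this is a slow cross-check, not the linear-time construction.

```cpp
#include <string>
#include <vector>

// Sketch (hypothetical name): builds the 1-based next table straight from
// the definition. Slot 0 of the result is unused and left at 0.
std::vector<int> next_by_definition(const std::string& s) {
    const int m = static_cast<int>(s.length());
    std::vector<int> next(m + 1, 0);   // next[1] stays 0
    for (int j = 2; j <= m; ++j) {
        next[j] = 1;                   // default: nothing repeats
        // Try the longest repeat first: k - 1 characters, with 1 < k < j.
        for (int k = j - 1; k > 1; --k) {
            // 1-based p_1..p_{k-1} is s[0 .. k-2];
            // 1-based p_{j-k+1}..p_{j-1} is s[j-k .. j-2].
            if (s.compare(0, k - 1, s, j - k, k - 1) == 0) {
                next[j] = k;
                break;
            }
        }
    }
    return next;
}
```

For "abcabx" the result is {0, 0, 1, 1, 1, 2, 3}, where the leading 0 is the unused slot and the remaining six entries match the table.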

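The source code for computing the next array is as follows:

void get_next(int next[], string str) {
    int len = str.length();
    int i = 0, j = -1;
    next[0] = -1;
    // next needs room for len + 1 entries: the loop also writes next[len]
    while (i < len) {
        if (j == -1 || str[i] == str[j]) {
            ++i; ++j;
            next[i] = j;
        } else {
            j = next[j];  // fall back to a shorter matching prefix
        }
    }
}

One thing to note here is that i is the index that marks the tail of the string (the end of the suffix being matched), and j is the index that marks the head (the end of the prefix). The code counts positions from 0 and uses next[0]=-1, so every value it produces is one less than the corresponding entry in the 1-based table above.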

We start by initializing i=0, j=-1, next[0]=-1. The tail index is at the first character, so the head index sits one position before it.

At the start j=-1, so no characters can be compared, but the condition j==-1 holds and the body of the if is entered: i moves forward one position, j moves forward one position, and next[1]=0 is recorded. On the next pass j==-1 no longer holds, so the test is whether str[1]==str[0]. If it does not hold, j is reset to next[0]=-1; the condition j==-1 is then satisfied, i and j both move forward, next[2]=0 is recorded, and now i=2, j=0. If it does hold, i and j both move forward one position, next[2]=1 is recorded, and now i=2, j=1. Continuing in this way builds the whole next array.
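The relation between the 0-based values this loop produces and the 1-based table given earlier can be sketched as follows. Returning a std::vector and the names build_next and to_one_based are assumptions made here for convenience; the article's function fills a caller-supplied int array instead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Same loop as get_next, but returning the table as a vector
// (an illustrative variant, not the article's signature).
std::vector<int> build_next(const std::string& s) {
    const int m = static_cast<int>(s.length());
    std::vector<int> next(m + 1, 0);   // the loop also writes next[m]
    int i = 0, j = -1;
    next[0] = -1;
    while (i < m) {
        if (j == -1 || s[i] == s[j]) {
            ++i; ++j;
            next[i] = j;               // longest repeat before position i
        } else {
            j = next[j];               // fall back to a shorter repeat
        }
    }
    return next;
}

// Adds 1 to the first m entries to recover the 1-based table.
std::vector<int> to_one_based(const std::vector<int>& next0) {
    std::vector<int> next1;
    for (std::size_t k = 0; k + 1 < next0.size(); ++k) {
        next1.push_back(next0[k] + 1);
    }
    return next1;
}
```

For "abcabx", build_next gives {-1, 0, 0, 0, 1, 2, 0}, and to_one_based turns the first six entries into {0, 1, 1, 1, 2, 3}.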

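The KMP algorithm implemented with the next array is as follows:

int KMP(string str, string substr, int pos) {
    // Lengths are stored as int: comparing j == -1 against an unsigned
    // length() would make the loop condition false and end the search early.
    int slen = str.length();
    int sslen = substr.length();
    vector<int> next(sslen + 1, 0);  // requires <vector>; get_next also writes next[sslen]
    get_next(next.data(), substr);
    int i = pos, j = 0;
    while (i < slen && j < sslen) {
        if (j == -1 || str[i] == substr[j]) { ++i; ++j; }
        else j = next[j];  // keep i where it is, fall back in the substring
    }
    if (j == sslen) return i - j + 1;  // 1-based position, same convention as index
    else return 0;                     // 0 means "not found"
}

With n the length of the main string and m the length of the substring, the KMP algorithm above reduces the worst-case complexity of the substring search from O((n-m+1)*m) to O(m+n).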
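One way to gain confidence in a KMP implementation is to compare it with the standard library. The sketch below assumes a return convention of "1-based position, or 0 if absent" and assumes 0 <= pos; the names kmp_find and agrees_with_find are illustrative.

```cpp
#include <string>
#include <vector>

// Sketch of KMP with the table built inline. Returns the 1-based position of
// the first match at or after 0-based offset pos, or 0 if there is none
// (an assumed convention, chosen to mirror the brute-force index function).
int kmp_find(const std::string& str, const std::string& substr, int pos) {
    const int n = static_cast<int>(str.length());
    const int m = static_cast<int>(substr.length());
    std::vector<int> next(m + 1, 0);
    next[0] = -1;
    for (int i = 0, j = -1; i < m; ) {
        if (j == -1 || substr[i] == substr[j]) { ++i; ++j; next[i] = j; }
        else j = next[j];
    }
    int i = pos, j = 0;
    while (i < n && j < m) {
        if (j == -1 || str[i] == substr[j]) { ++i; ++j; }
        else j = next[j];              // i never moves backwards
    }
    return j == m ? i - j + 1 : 0;
}

// Cross-check against std::string::find, converting its 0-based result.
bool agrees_with_find(const std::string& str, const std::string& substr, int pos) {
    const std::string::size_type p = str.find(substr, pos);
    const int expected = (p == std::string::npos) ? 0 : static_cast<int>(p) + 1;
    return kmp_find(str, substr, pos) == expected;
}
```

For example, kmp_find("abcdefabcdexghij", "abcdex", 0) returns 7, the same answer std::string::find gives once its 0-based result is shifted by one.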

But KMP as written still has a weakness. Consider an extreme case: the main string is "aaaabaaaaab" and the substring is "aaaaab".

The next array derived from the substring is 0,1,2,3,4,5. During matching, a mismatch is found at the fifth character. According to the next array, j falls back to 4, the comparison fails again, and j keeps falling back one position at a time until it reaches the first character of the substring. Because all of these characters are the same 'a', every one of those fallback comparisons is bound to fail, so the work is wasted. So whenever the character at position j equals the character at position next[j], replacing next[j] with the next value of that earlier position skips the useless comparisons and can greatly improve the efficiency of the algorithm. Improving the get_next function in this way gives the following function:

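void get_next(int next_new[], string str) {
    int len = str.length();
    int i = 0, j = -1;
    next_new[0] = -1;
    // next_new needs room for len + 1 entries
    while (i < len) {
        if (j == -1 || str[i] == str[j]) {
            ++i; ++j;
            if (str[i] != str[j]) next_new[i] = j;
            else next_new[i] = next_new[j];  // same character: falling back to j would fail too
        } else {
            j = next_new[j];
        }
    }
}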
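The effect on the extreme case can be measured with a small sketch that counts character comparisons during the search, once with the plain table and once with the improved one. The helper name kmp_comparisons and its optimized flag are assumptions made for this illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: runs KMP from offset 0 and returns how many character
// comparisons str[i] == substr[j] were made during the search. With
// optimized == true the table is built the next_new way.
int kmp_comparisons(const std::string& str, const std::string& substr, bool optimized) {
    const int n = static_cast<int>(str.length());
    const int m = static_cast<int>(substr.length());
    std::vector<int> next(m + 1, 0);
    next[0] = -1;
    int i = 0, j = -1;
    while (i < m) {
        if (j == -1 || substr[i] == substr[j]) {
            ++i; ++j;
            // Inherit the fallback when both positions hold the same character.
            const bool same = optimized && i < m && substr[i] == substr[j];
            next[i] = same ? next[j] : j;
        } else {
            j = next[j];
        }
    }
    int comparisons = 0;
    i = 0; j = 0;
    while (i < n && j < m) {
        if (j != -1) ++comparisons;    // a real character comparison follows
        if (j == -1 || str[i] == substr[j]) { ++i; ++j; }
        else j = next[j];
    }
    return comparisons;
}
```

On "aaaabaaaaab" and "aaaaab" this gives 15 comparisons with the plain table and 11 with the improved one: the four extra fallback comparisons against 'b' disappear.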

For a substring such as "abcabx", the next and next_new arrays can be represented as follows:

j:         1  2  3  4  5  6
S:         a  b  c  a  b  x
next:      0  1  1  1  2  3
next_new:  0  1  1  0  1  3

Let's look at another substring for further understanding. We first compute the next array, and then make a small modification to it to obtain the final next_new array:

j:         1  2  3  4  5  6  7  8  9
S:         a  b  a  b  a  a  a  b  a
next:      0  1  1  2  3  4  2  2  3
next_new:  0  1  0  1  0  4  2  1  0

My understanding of how the next array for KMP is computed above is as follows:

First compute the intermediate result next, and then compute next_new from next.

I take "ababaaaba" as an example.

When j=1, next[1]=0;

When j=2, next[2]=1;

When j=3, the preceding string is "ab", which contains no repeat, so next[3]=1;

When j=4, the preceding string is "aba", where a repeat appears: the longest string that is both its beginning and its end is "a", of length k-1=1, so next[4]=k=2;

When j=5, the preceding string is "abab", the longest repeat is "ab", so next[5]=k=2+1=3;

When j=6, the preceding string is "ababa", the longest repeat is "aba", so next[6]=k=3+1=4;

When j=7, the preceding string is "ababaa", the longest repeat is "a", so next[7]=k=1+1=2;

When j=8, the preceding string is "ababaaa", the longest repeat is "a", so next[8]=k=1+1=2;

When j=9, the preceding string is "ababaaab", the longest repeat is "ab", so next[9]=k=2+1=3.

This gives the intermediate result next. Then, for each position j, compare the character of S at position j with the character of S at position next[j]. If the two are the same, next_new[j] takes the next_new value of that earlier position; if they differ, the value stays unchanged. For example, when j=6, next[6]=4; the character at position 4 is "b" and the character at position 6 is "a", so the two differ and the value remains 4.
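The second step can be sketched as a single pass over the 1-based next table. The function name refine_next is an assumption, and the sketch assumes that next1 has s.length() + 1 entries with slot 0 unused.

```cpp
#include <string>
#include <vector>

// Sketch of the second step (hypothetical name). Both tables are 1-based
// with slot 0 unused; next1 is the plain next table for s.
std::vector<int> refine_next(const std::string& s, const std::vector<int>& next1) {
    const int m = static_cast<int>(s.length());
    std::vector<int> next_new(m + 1, 0);   // next_new[1] stays 0
    for (int j = 2; j <= m; ++j) {
        const int k = next1[j];            // position that j falls back to
        // 1-based positions j and k are s[j - 1] and s[k - 1].
        if (s[j - 1] == s[k - 1]) {
            next_new[j] = next_new[k];     // same character: inherit
        } else {
            next_new[j] = k;               // different character: keep
        }
    }
    return next_new;
}
```

Applied to "ababaaaba" with next = {0, 0, 1, 1, 2, 3, 4, 2, 2, 3} (leading slot unused), this yields {0, 0, 1, 0, 1, 0, 4, 2, 1, 0}, which matches the next_new row of the table.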

The above is my understanding of string matching and the KMP algorithm. It is not yet thorough and needs to be strengthened. Here is the complete code I used for testing:

#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Improved next table (0-based, next[0] = -1); next needs str.length() + 1 entries.
void get_next(int next[], string str) {
    int len = str.length();
    int i = 0, j = -1;
    next[0] = -1;
    while (i < len) {
        if (j == -1 || str[i] == str[j]) {
            ++i; ++j;
            if (str[i] != str[j]) next[i] = j;
            else next[i] = next[j];
        } else {
            j = next[j];
        }
    }
}

// Returns the 1-based position of the first match at or after pos, or 0 if none.
int KMP(string str, string substr, int pos) {
    int slen = str.length(), sslen = substr.length();  // int, so j == -1 compares correctly
    vector<int> next(sslen + 1, 0);
    get_next(next.data(), substr);
    int i = pos, j = 0;
    while (i < slen && j < sslen) {
        if (j == -1 || str[i] == substr[j]) { ++i; ++j; }
        else j = next[j];
    }
    if (j == sslen) return i - j + 1;
    else return 0;
}

// Brute-force search with the same return convention as KMP.
int index(string str, string substr, int pos) {
    int slen = str.length(), sslen = substr.length();
    int i = pos, j = 0;
    while (i + sslen <= slen) {
        while (j < sslen) {
            if (str[i + j] == substr[j]) ++j;
            else break;
        }
        if (j == sslen) return i + 1;
        else { j = 0; ++i; }
    }
    return 0;
}

int main() {
    int position = 0, pos = 0;
    string s1, s2;
    cout << "position:" << endl;
    cin >> pos;
    cin >> s1;
    cin >> s2;
    vector<int> next(s2.length() + 1, 0);
    cout << "Next array is:" << endl;
    get_next(next.data(), s2);
    for (size_t i = 0; i < s2.length(); i++) cout << next[i] << " ";
    cout << endl;
    // position = index(s1, s2, pos);
    position = KMP(s1, s2, pos);
    if (position != 0) cout << "from position " << pos << ": " << position << endl;
    else cout << "not found!" << endl;
}

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
