Data Structures 20: The KMP Algorithm (Fast Pattern Matching Algorithm) in Detail

Source: Internet
Author: User

The previous section introduced the ordinary pattern matching algorithm (the "BF" algorithm). Its general idea is as follows: the pattern string is matched against the main string starting from the main string's first character. Each time a match fails, the pointer i that records the matching progress in the main string must fall back to i-j+1 (this process is called "pointer backtracking"), and the pattern string moves one character position to the right. This repeats until the match succeeds or the main string is exhausted.
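The ordinary procedure described above can be sketched in C as follows. This is only an illustration, not code from the original article: the function name bf_match and the use of 0-based indices are assumptions made for this example. With 0-based indices, the fallback i = i - j + 1 places i one character after the start of the failed attempt.

```c
#include <assert.h>
#include <string.h>

/* Brute-force (BF) matching sketch; bf_match is a hypothetical name.
 * Returns the 0-based index of the first occurrence of T in S, or -1. */
int bf_match(const char *S, const char *T)
{
    assert(S != NULL && T != NULL);
    int n = (int)strlen(S);
    int m = (int)strlen(T);
    int i = 0; /* position in the main string */
    int j = 0; /* position in the pattern string */

    while (i < n && j < m) {
        if (S[i] == T[j]) {
            i++;
            j++;
        } else {
            i = i - j + 1; /* pointer backtracking: one past the old start */
            j = 0;         /* the pattern shifts right by one character */
        }
    }
    return (j == m) ? i - m : -1;
}
```

For instance, bf_match("ababcabcacbab", "abcac") returns 5, the 0-based start of the match.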


Compared with the "BF" algorithm, the "KMP" algorithm has the following advantages:
    • The pointer i never backtracks; when a match fails, the pattern string is moved to the right as far as possible;
    • Pattern matching can be completed in O(n+m) time, where n and m are the lengths of the main string and the pattern string.

For this reason, the "KMP" algorithm is called the "fast pattern matching algorithm".

Calculating the distance the pattern string moves to the right

When the pattern string is matched against the main string, each has a pointer to the character currently being compared (pointer i in the main string, pointer j in the pattern string). Since pointer i must not backtrack, the only way to continue matching is to let pointer j backtrack.
The distance pointer j backtracks corresponds to the distance the pattern string moves to the right: the further j backtracks, the further the pattern string moves. Computing how far the pattern string moves to the right can therefore be reduced to finding the position to which pointer j backtracks when a character fails to match.

For a given pattern string, a mismatch can occur at any of its characters, and pointer j then has to backtrack. The position it backtracks to is determined entirely by the pattern string itself and is independent of the main string.

The backtracking position of pointer j for each character of the pattern string can be computed by an algorithm, and the results are stored in an array (named next by convention).

The calculation works as follows: for a given character of the pattern string, take the string in front of it and, starting from both ends of that string, count how many consecutive characters at the head are identical to those at the tail (the head and the tail must each be shorter than the extracted string); the count plus 1 is the value for that character. The first character of every pattern string has the value 0, and the second character has the value 1.

For example, to find next for the pattern string "abcabac": the values 0 and 1 for the first two characters are fixed.

For the third character 'c', take the string "ab"; 'a' and 'b' are not equal, so the count is 0 and 0 + 1 = 1, so the next value of 'c' is 1;

For the fourth character 'a', take "abc"; the first character 'a' and the last character 'c' are not equal, so the count is 0 and 0 + 1 = 1, so the next value of 'a' is 1;

For the fifth character 'b', take "abca"; the first 'a' and the last 'a' are the same, so the count is 1 and 1 + 1 = 2, so the next value of 'b' is 2;

For the sixth character 'a', take "abcab"; the first two characters "ab" and the last two characters "ab" are the same, so the count is 2 and 2 + 1 = 3, so the next value of 'a' is 3;

For the last character 'c', take "abcaba"; the first character 'a' and the last character 'a' are the same, so the count is 1 and 1 + 1 = 2, so the next value of 'c' is 2;

Therefore, the next array for the string "abcabac" is (0,1,1,1,2,3,2).
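The definition-based calculation above can be written down directly as a small C sketch. It is deliberately naive and is not the implementation given later in the article; the function name next_by_definition is an assumption for this example, and the array follows the article's convention of indices starting from 1 with next[0] unused.

```c
#include <assert.h>
#include <string.h>

/* Definition-based sketch; next_by_definition is a hypothetical name.
 * Fills next[1..m] (next[0] is unused), so next must hold m + 1 ints. */
void next_by_definition(const char *T, int *next)
{
    assert(T != NULL && next != NULL);
    int m = (int)strlen(T);

    for (int k = 1; k <= m; k++) {
        if (k == 1) {
            next[k] = 0; /* fixed value for the first character */
            continue;
        }
        int before = k - 1; /* length of the string in front of character k */
        int best = 0;       /* longest identical head/tail found so far */
        for (int len = 1; len < before; len++) {
            /* compare the first len characters with the last len characters */
            if (strncmp(T, T + before - len, (size_t)len) == 0) {
                best = len;
            }
        }
        next[k] = best + 1;
    }
}
```

Called on "abcabac" with an array of at least 8 ints, it fills next[1..7] with 0, 1, 1, 1, 2, 3, 2, the same values as the hand calculation.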

In the evaluation above, the number of identical characters at the head and tail of the string has to be determined from scratch each time. In an implementation, the next value of a character can instead be derived from the result already obtained for the previous character.

The procedure is as follows:

Pattern string T (indices starting from 1): "abcabac"
next array (indices starting from 1): 0 1

The third character 'c': the next value of the previous character 'b' is 1, so compare T[1] = 'a' with 'b'; they are not equal, so continue. Since next[1] = 0, stop. The next value of 'c' is 1 (whenever the loop reaches next[1] = 0, the character's next value is 1).

Pattern string T: "abcabac"
next array (indices starting from 1): 0 1 1

The fourth character 'a': the next value of the previous character 'c' is 1, so compare T[1] = 'a' with 'c'; they are not equal, so continue. Since next[1] = 0, stop. The next value of 'a' is 1.

Pattern string T: "abcabac"
next array (indices starting from 1): 0 1 1 1

The fifth character 'b': the next value of the previous character 'a' is 1, so compare T[1] = 'a' with 'a'; they are equal, so stop. The next value of 'b' is 1 (the next value of the previous character 'a') + 1 = 2.

Pattern string T: "abcabac"
next array (indices starting from 1): 0 1 1 1 2

The sixth character 'a': the next value of the previous character 'b' is 2, so compare T[2] = 'b' with 'b'; they are equal, so stop. The next value of 'a' is 2 (the next value of the previous character 'b') + 1 = 3.

Pattern string T: "abcabac"
next array (indices starting from 1): 0 1 1 1 2 3

The seventh character 'c': the next value of the previous character 'a' is 3, so compare T[3] = 'c' with 'a'; they are not equal, so continue. next[3] = 1, so compare T[1] = 'a' with 'a'; they are equal, so stop. The next value of 'c' is 1 (the value of next[3]) + 1 = 2.

Pattern string T: "abcabac"
next array (indices starting from 1): 0 1 1 1 2 3 2

Algorithm implementation:
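#include <stdio.h>
#include <string.h>

// Fill next[1..strlen(T)] for the pattern string T; next[0] is not used.
void Next(char *T, int *next)
{
    int i = 1;
    int j = 0;
    next[1] = 0;
    while (i < (int)strlen(T)) {
        if (j == 0 || T[i - 1] == T[j - 1]) {
            i++;
            j++;
            next[i] = j;
        } else {
            j = next[j];
        }
    }
}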

Note: in this program the next array is indexed from 1, and next[0] is not used (it could instead hold the length of the next array). The string itself is stored starting at array index 0, which is why the program uses T[i-1] and T[j-1].

Implementing the KMP algorithm based on next

First, look at how the KMP algorithm runs (assume the main string is "ababcabcacbab" and the pattern string is "abcac").

First match:

The match fails at i = 3, j = 3 (main-string character 'a' against pattern character 'c'). Pointer i does not move, and j = 1 (the next value of the character 'c');

Second match:

The characters are equal, so matching continues until:

The match fails at i = 7, j = 5 (main-string character 'b' against pattern character 'c'). Pointer i does not move, and j = 2 (the next value of the character 'c' that j points to);

Third match:

The characters are equal, i and j keep moving forward, and the match finally succeeds.

With the ordinary algorithm, 6 matching rounds are needed, whereas the KMP algorithm needs only 3.

Implementation code:
int KMP(char *S, char *T)
{
    int next[10]; // next[0] is unused, so patterns of up to 9 characters fit
    Next(T, next); // initialize the next array from the pattern string T
    int i = 1;
    int j = 1;
    while (i <= (int)strlen(S) && j <= (int)strlen(T)) {
        // j == 0: the first character of the pattern string differs from the
        // character pointer i points to; S[i-1] == T[j-1]: the characters at
        // the current positions are equal. In both cases i and j move forward.
        if (j == 0 || S[i - 1] == T[j - 1]) {
            i++;
            j++;
        } else {
            j = next[j]; // mismatch: i does not move, j becomes the next value of the current character
        }
    }
    if (j > (int)strlen(T)) {
        // if the condition is true, the match succeeded
        return i - (int)strlen(T); // position of the match, counted from 1
    }
    return -1;
}

Complete code for the KMP algorithm:
#include <stdio.h>
#include <string.h>

// Fill next[1..strlen(T)] for the pattern string T; next[0] is not used.
void Next(char *T, int *next)
{
    int i = 1;
    int j = 0;
    next[1] = 0;
    while (i < (int)strlen(T)) {
        if (j == 0 || T[i - 1] == T[j - 1]) {
            i++;
            j++;
            next[i] = j;
        } else {
            j = next[j];
        }
    }
}

int KMP(char *S, char *T)
{
    int next[10]; // next[0] is unused, so patterns of up to 9 characters fit
    Next(T, next); // initialize the next array from the pattern string T
    int i = 1;
    int j = 1;
    while (i <= (int)strlen(S) && j <= (int)strlen(T)) {
        // j == 0: the first character of the pattern string differs from the
        // character currently being tested; S[i-1] == T[j-1]: the characters
        // at the current positions are equal. In both cases i and j move forward.
        if (j == 0 || S[i - 1] == T[j - 1]) {
            i++;
            j++;
        } else {
            j = next[j]; // mismatch: i does not move, j becomes the next value of the current character
        }
    }
    if (j > (int)strlen(T)) {
        // if the condition is true, the match succeeded
        return i - (int)strlen(T); // position of the match, counted from 1
    }
    return -1;
}

int main()
{
    int i = KMP("ababcabcacbab", "abcac");
    printf("%d\n", i);
    return 0;
}

Output: 6
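The earlier statement that the ordinary algorithm needs 6 matching rounds on this example while KMP needs 3 can be checked with a small counting sketch. It is an illustration under assumptions, not part of the original program: the names build_next, bf_rounds, kmp_rounds and MAX_PATTERN are hypothetical, a "round" is taken to mean one alignment of the pattern against the main string, and the BF counter uses 0-based indices so that its fallback reads i - j + 1.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PATTERN 64 /* assumed upper bound on the pattern length */

/* Same next computation as in the article: 1-based, next[0] unused. */
static void build_next(const char *T, int *next)
{
    int m = (int)strlen(T);
    int i = 1, j = 0;
    next[1] = 0;
    while (i < m) {
        if (j == 0 || T[i - 1] == T[j - 1]) {
            i++;
            j++;
            next[i] = j;
        } else {
            j = next[j];
        }
    }
}

/* Rounds used by BF: the first alignment plus one per mismatch (0-based). */
int bf_rounds(const char *S, const char *T)
{
    int n = (int)strlen(S), m = (int)strlen(T);
    int i = 0, j = 0, rounds = 1;
    while (i < n && j < m) {
        if (S[i] == T[j]) {
            i++;
            j++;
        } else {
            i = i - j + 1; /* pointer backtracking */
            j = 0;
            rounds++;
        }
    }
    return rounds;
}

/* Rounds used by KMP: the first alignment plus one per mismatch (1-based). */
int kmp_rounds(const char *S, const char *T)
{
    int next[MAX_PATTERN + 1];
    int n = (int)strlen(S), m = (int)strlen(T);
    assert(m >= 1 && m <= MAX_PATTERN);
    build_next(T, next);

    int i = 1, j = 1, rounds = 1;
    while (i <= n && j <= m) {
        if (j == 0 || S[i - 1] == T[j - 1]) {
            i++;
            j++;
        } else {
            printf("mismatch at i=%d, j=%d -> j=%d\n", i, j, next[j]);
            j = next[j]; /* i stays where it is */
            rounds++;
        }
    }
    return rounds;
}
```

With the main string "ababcabcacbab" and the pattern string "abcac", bf_rounds returns 6 and kmp_rounds returns 3; the two mismatches reported by kmp_rounds are the ones at i = 3 and i = 7 from the run shown earlier.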

Improved next

Note: the key to the KMP algorithm is determining the next array. In fact, the next array used in the KMP algorithm above is not the most compact one and can be simplified further.

Example:
Pattern string T: a b c a c
next:             0 1 1 1 2

The pattern string "abcac" contains two characters 'a'; call the first one a1 and the second one a2. If a match fails while pointer j points to a2, pointer i in the main string stays where it is and pointer j moves to a1. But since a1 == a2 and a2 != S[i], a1 certainly cannot equal S[i] either.

To avoid this unnecessary comparison, the next array can be streamlined. For the pattern string "abcac", since T[4] == T[next[4]], the next array can be changed to:
Pattern string T: a b c a c
next:             0 1 1 0 2

With this simplification, if a match fails at a2, there is no longer any need to check whether a1 matches, because it certainly cannot; a1 is bypassed and matching proceeds directly to the next step.
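The streamlining rule above amounts to a simple property: no position j should jump to a position holding the same character. The sketch below checks that property for a given next array. The helper name has_redundant_jump is hypothetical, and the array is assumed to follow the article's convention (indices starting from 1, next[0] unused).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if some position j (counted from 1) has
 * next[j] > 0 and T[j] == T[next[j]], i.e. a jump that is sure to fail again;
 * returns 0 otherwise. */
int has_redundant_jump(const char *T, const int *next)
{
    assert(T != NULL && next != NULL);
    int m = (int)strlen(T);
    for (int j = 1; j <= m; j++) {
        int k = next[j];
        if (k > 0 && T[j - 1] == T[k - 1]) {
            return 1;
        }
    }
    return 0;
}
```

For "abcac", the original array 0 1 1 1 2 yields 1 (position 4 jumps to position 1, and both hold 'a'), while the streamlined array 0 1 1 0 2 yields 0.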

Implementation code:
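void Next(char *T, int *next)
{
    int i = 1;
    int j = 0;
    next[1] = 0;
    while (i < (int)strlen(T)) {
        if (j == 0 || T[i - 1] == T[j - 1]) {
            i++;
            j++;
            if (T[i - 1] != T[j - 1]) {
                next[i] = j;
            } else {
                next[i] = next[j]; // same character: skip the comparison that is sure to fail
            }
        } else {
            j = next[j];
        }
    }
}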

Using the streamlined next array reduces the number of unnecessary comparisons and improves the efficiency of the KMP algorithm for pattern strings such as "aaaaaaab".
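As a check on this claim, the following sketch places the two ways of building the next array side by side so that they can be run on the same pattern string. The function names build_next_plain and build_next_streamlined are assumptions for this example; both follow the article's convention of indices starting from 1 with next[0] unused, so the caller must supply an array of at least strlen(T) + 1 ints.

```c
#include <assert.h>
#include <string.h>

/* Plain next array (hypothetical name); mirrors the first version above. */
void build_next_plain(const char *T, int *next)
{
    assert(T != NULL && next != NULL);
    int m = (int)strlen(T);
    int i = 1, j = 0;
    next[1] = 0;
    while (i < m) {
        if (j == 0 || T[i - 1] == T[j - 1]) {
            i++;
            j++;
            next[i] = j;
        } else {
            j = next[j];
        }
    }
}

/* Streamlined next array (hypothetical name); mirrors the improved version. */
void build_next_streamlined(const char *T, int *next)
{
    assert(T != NULL && next != NULL);
    int m = (int)strlen(T);
    int i = 1, j = 0;
    next[1] = 0;
    while (i < m) {
        if (j == 0 || T[i - 1] == T[j - 1]) {
            i++;
            j++;
            if (T[i - 1] != T[j - 1]) {
                next[i] = j;
            } else {
                next[i] = next[j]; /* same character: reuse the earlier jump */
            }
        } else {
            j = next[j];
        }
    }
}
```

For a pattern made of a long run of one character followed by a different one, the streamlined array sends every position inside the run straight back to 0, so a mismatch there costs a single step instead of a chain of comparisons against the same character.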

For example (next1 is the array before streamlining, next2 the array after):
Pattern string: a a a a a a a b
next1:          0 1 2 3 4 5 6 7
next2:          0 0 0 0 0 0 0 7

Summary

The fundamental reason the KMP algorithm is faster than the BF algorithm is this: like the BF algorithm, the KMP algorithm starts matching from the beginning of the main string, but during matching it records some necessary information. Based on that information, meaningless matching steps are skipped later in the matching process.
