A deep understanding of the fast pattern matching algorithm (KMP) in C

Source: Internet
Author: User
Anyone who has used a computer probably knows that most software with a text editing function (Word, for example) has the shortcut key Ctrl+F. It provides the "Find", "Replace" and "Replace All" functions. This is a typical application of pattern matching: finding a string in a text file.

1. Pattern matching
The pattern-matching model is roughly this: given two strings S and P, where S is called the target string and contains n characters, and P is called the pattern string and contains m characters, with m <= n. The search for the pattern P starts at a given position of S, usually the first position. If P is found, return its position in the target string (that is, the index in S of the first character of P). Return -1 if the pattern string P is not found in the target string S. That is the definition of pattern matching, so let's look at how to implement a pattern matching algorithm.

2. Simple pattern matching
The naive pattern matching algorithm is simple and easy to understand. The idea is roughly this: starting from the first character of S, S[0], compare the characters of P and S one by one. If S[0] = P[0] && ... && S[m-1] = P[m-1], the match has succeeded; nothing more needs to be compared, and index 0 is returned. If at some step S[i] != P[i], the remaining characters of P are not compared, because this alignment can no longer succeed. The comparison then restarts with the second character of S against the first character of P, and continues in the same way until S[m] = P[m-1] or until some i with S[i] != P[i-1] is found. And so on: if no match has succeeded by the time the start character is S[n-m], then the pattern P does not occur in S. (Think about why the bound is n-m.) The code for this should be very simple; for reference, look at how the strstr function is implemented internally, for example on Baidu Encyclopedia at http://baike.baidu.com/view/745156.htm. It is not written out here, because we need to move on to the main topic, KMP.
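The article leaves the naive code out. As a minimal sketch of the idea described above (not the actual strstr implementation; the function name naive_match is made up for illustration), it might look like this:

```c
#include <assert.h>
#include <string.h>

/* Naive pattern matching: returns the index in s where p first occurs,
 * or -1 if p does not occur. The name naive_match is hypothetical. */
int naive_match(const char *s, const char *p)
{
    assert(s != NULL && p != NULL);
    int n = (int)strlen(s);
    int m = (int)strlen(p);

    /* Only start positions 0..n-m leave room for all m characters of p. */
    for (int t = 0; t + m <= n; t++) {
        int j = 0;
        while (j < m && s[t + j] == p[j]) {
            j++;                /* characters agree, keep going */
        }
        if (j == m) {
            return t;           /* all m characters matched at offset t */
        }
        /* mismatch: slide the pattern one position to the right */
    }
    return -1;
}
```

The loop bound t + m <= n is the answer to the n-m question: a start position later than n-m does not leave m characters of S to compare against.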

3. Fast pattern matching algorithm (KMP)
The inefficiency of naive pattern matching comes mainly from repeated character comparisons. Each new alignment makes no use of the previous one, and that is its weakness: the result of the previous comparison is in fact reusable, and reusing it is what gives fast pattern matching. In naive matching, the start position in the target string S advances one step at a time, which is not necessary; the step does not have to be 1.
Suppose the match currently stands like this: S[t] S[t+1] ... S[t+j] has matched P[0] P[1] ... P[j] exactly, and the characters now being compared, S[t+j+1] and P[j+1], differ. Where should the next comparison start? Naive matching would go back to S[t+1] and compare it with P[0]. Fast pattern matching instead compares S[t+j+1] with P[k+1]. What is k? It is the largest value less than j for which P[0] P[1] ... P[k] equals P[j-k] P[j-k+1] ... P[j]; write k = next[j]. Because P[0] ... P[j] matched S[t] ... S[t+j], it follows that P[0] P[1] ... P[k] also matches S[t+j-k] S[t+j-k+1] ... S[t+j], so the next two characters to compare are S[t+j+1] and P[k+1]. If no such k exists, next[j] = -1 and the comparison resumes at P[0]. S never moves backward, and P does not go back to index 0 every time, which is why KMP is fast.
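To make the definition of k concrete, here is a brute-force sketch that reads k straight off the definition by trying every candidate from the largest down. It is only an illustration and is much slower than the method KMP actually uses; the function name next_by_definition is made up for this example.

```c
#include <assert.h>
#include <string.h>

/* Returns the largest k < j such that p[0..k] equals p[j-k..j],
 * or -1 if no such k exists. Brute force, for illustration only. */
int next_by_definition(const char *p, int j)
{
    assert(p != NULL && j >= 0 && j < (int)strlen(p));
    for (int k = j - 1; k >= 0; k--) {
        /* prefix p[0..k] and suffix p[j-k..j] are both k+1 characters long */
        if (memcmp(p, p + (j - k), (size_t)(k + 1)) == 0) {
            return k;
        }
    }
    return -1;
}
```

For the pattern abcabd, calling this for j = 0..5 gives -1, -1, -1, 0, 1, -1: after matching abcab (j = 4), the prefix ab equals the suffix ab, so k = 1 and the comparison resumes at P[2].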
Now the key question is how to obtain k. If computing k were expensive, this idea would be of little use. In fact k depends only on the pattern string P, and there are m such values, k = next[j], so they can be computed once and stored in the next array, in time linear in m. Let's look at exactly how to compute the values of the next array, that is, k.
Compute next[] by induction: set next[0] = -1; given next[j] = k, we want next[j+1].
(1) If P[k+1] = P[j+1], then obviously next[j+1] = k+1. If P[k+1] != P[j+1], then next[j+1] < k+1, so look for h < k such that P[0] ... P[h] = P[j-h] P[j-h+1] ... P[j] = P[k-h] P[k-h+1] ... P[k]. That means h = next[k], so this is an iterative process (earlier results are reused to compute later values).
(2) If no such h exists (the chain has reached -1), compare P[0] with P[j+1]: if they are equal, next[j+1] = 0; otherwise P[0] P[1] ... P[j+1] has no prefix equal to a suffix, and next[j+1] = -1.
(3) If such an h exists, go on to test whether P[h+1] and P[j+1] are equal. The process ends when an equal pair is found or when next[j+1] is determined to be -1.
Look at the implementation code:
int next[20] = {0};  // room for a pattern of at most 20 characters
// Note: the result is stored in the array next, which holds the m values of k, where next[j] = k
// means str[0]str[1]...str[k] = str[j-k]str[j-k+1]...str[j].
// So when des[t+j+1] and pat[j+1] fail to match, the next comparison is des[t+j+1] against pat[next[j]+1].
void Next(char str[], int len)
{
    next[0] = -1;
    for (int j = 1; j < len; j++)
    {
        int i = next[j-1];
        while (str[j] != str[i+1] && i >= 0) // iterative process: fall back to shorter prefixes
        {
            i = next[i];
        }
        if (str[j] == str[i+1])
        {
            next[j] = i+1;
        }
        else
        {
            next[j] = -1;
        }
    }
}

Now that the k values are saved in the next array, we can implement the KMP algorithm:
// des is the target string, pat is the pattern string; len1 and len2 are their lengths.
int KMP(char des[], int len1, char pat[], int len2)
{
    Next(pat, len2);  // fill the next array for the pattern
    int p = 0, s = 0;
    while (p < len2 && s < len1)
    {
        if (pat[p] == des[s])
        {
            p++; s++;
        }
        else
        {
            if (p == 0)
            {
                s++; // if the first character fails to match, start with the next character of des
            }
            else
            {
                p = next[p-1] + 1; // use the failure function to find the character pat should fall back to
            }
        }
    }
    if (p < len2) // the whole match failed
    {
        return -1;
    }
    return s - len2;
}

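Time complexity:
Computing the next array takes about O(m), and the matching loop of the KMP algorithm takes O(n), so the time complexity of the whole algorithm is O(n+m). In the matching loop, s never moves backward, and p can only fall back as often as it has previously advanced, so the loop runs at most about 2n times.
Space complexity:
The next array adds O(m) extra space.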
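As a rough way to see the difference between the two running times, the following sketch counts character comparisons for naive matching and for the KMP matching loop. The function names count_naive and count_kmp and the MAX_PAT limit are assumptions made for this illustration; the next table follows the same convention as the code above but is kept local.

```c
#include <assert.h>
#include <string.h>

#define MAX_PAT 64  /* assumed upper bound on the pattern length in this sketch */

/* Counts the character comparisons made by naive matching. */
long count_naive(const char *s, const char *p)
{
    int n = (int)strlen(s), m = (int)strlen(p);
    long cmp = 0;
    for (int t = 0; t + m <= n; t++) {
        int j = 0;
        while (j < m) {
            cmp++;
            if (s[t + j] != p[j]) break;
            j++;
        }
        if (j == m) break;          /* pattern found, stop searching */
    }
    return cmp;
}

/* Counts the text-versus-pattern comparisons made by the KMP matching loop
 * (building the next table is not counted). */
long count_kmp(const char *s, const char *p)
{
    int n = (int)strlen(s), m = (int)strlen(p);
    int next[MAX_PAT];
    assert(m >= 1 && m <= MAX_PAT);

    next[0] = -1;
    for (int j = 1; j < m; j++) {
        int i = next[j - 1];
        while (i >= 0 && p[j] != p[i + 1]) i = next[i];
        next[j] = (p[j] == p[i + 1]) ? i + 1 : -1;
    }

    long cmp = 0;
    int pi = 0, si = 0;
    while (pi < m && si < n) {
        cmp++;
        if (p[pi] == s[si]) { pi++; si++; }
        else if (pi == 0) si++;
        else pi = next[pi - 1] + 1;  /* fall back in the pattern only */
    }
    return cmp;
}
```

For a text of fifteen a's followed by b (n = 16) and the pattern aaab, the naive count is 13 alignments times 4 comparisons = 52, while the KMP count stays within 2n = 32. The gap widens as the text and pattern get longer.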

4. An interview question that applies KMP
Given two strings s1 and s2, determine whether s2 is contained in some string obtained by cyclically shifting s1. For example, for s1 = aabcd and s2 = cdaa the answer is true, because s1 can be cyclically shifted to cdaab. For s1 = abcd and s2 = acbd the answer is false.
Analysis: it is not hard to see that every cyclic shift of s1 is a substring of the string s1s1. So if s2 is contained in some cyclic shift of s1, then s2 must be a substring of s1s1, and this is where the KMP algorithm comes in handy.
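The analysis can be sketched in C as follows. This is one possible way to write it, not the article's own solution: the names kmp_find and contained_in_rotation are made up for this example, the next table is allocated dynamically instead of using the fixed global array, and an allocation failure is simply reported as "not found".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* KMP search with the convention next[j] = k (or -1 if there is no such k).
 * Returns the first index of pat in des, or -1 (also on allocation failure). */
static int kmp_find(const char *des, int n, const char *pat, int m)
{
    if (m == 0) return 0;
    int *next = malloc((size_t)m * sizeof *next);
    if (next == NULL) return -1;

    next[0] = -1;
    for (int j = 1; j < m; j++) {
        int i = next[j - 1];
        while (i >= 0 && pat[j] != pat[i + 1]) i = next[i];
        next[j] = (pat[j] == pat[i + 1]) ? i + 1 : -1;
    }

    int p = 0, s = 0;
    while (p < m && s < n) {
        if (pat[p] == des[s]) { p++; s++; }
        else if (p == 0) s++;
        else p = next[p - 1] + 1;
    }
    free(next);
    return (p == m) ? s - m : -1;
}

/* Returns 1 if s2 is a substring of some cyclic shift of s1, otherwise 0. */
int contained_in_rotation(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    size_t n = strlen(s1), m = strlen(s2);
    if (m > n) return 0;   /* a shift of s1 has only n characters */

    char *doubled = malloc(2 * n + 1);
    if (doubled == NULL) return 0;
    memcpy(doubled, s1, n);
    memcpy(doubled + n, s1, n + 1);   /* second copy, including the '\0' */

    int found = kmp_find(doubled, (int)(2 * n), s2, (int)m) >= 0;
    free(doubled);
    return found;
}
```

With the examples from the problem, contained_in_rotation("aabcd", "cdaa") returns 1 and contained_in_rotation("abcd", "acbd") returns 0. The length check matters: s1s1 can contain strings longer than s1 (aba is in abab), and those cannot fit inside any single shift of s1.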
