String pattern matching KMP algorithm


Transferred from: http://blog.chinaunix.net/uid-26548237-id-3367953.html

KMP string pattern matching, put simply, is an efficient algorithm for locating one string inside another. The simple (brute-force) matching algorithm has a time complexity of O(m*n), while the KMP algorithm can be proven to run in O(m+n).


First, the simple matching algorithm

Let's start with the simple matching function.

int Index_BF(char const *S, char const *T, int pos)
{
    /*-------------- checks added by me --------------*/
    if (S == NULL || T == NULL)
    {
        return -1;
    }
    /* Compare lengths first: strlen returns an unsigned value, so
       strlen(S) - strlen(T) would wrap around if T were longer than S. */
    if (pos < 0 || strlen(T) > strlen(S) || (size_t)pos > strlen(S) - strlen(T))
    {
        return -1;
    }
    /*------------------------------------------------*/

    /* The match succeeds if, starting from subscript pos (0 <= pos <= strlen(S)),
       S contains a substring equal to T. Returns the subscript of the first such
       substring in S; otherwise returns -1. */
    int i = pos;
    int j = 0;
    while (S[i + j] != '\0' && T[j] != '\0')
    {
        if (S[i + j] == T[j])
        {
            j++; /* continue with the next character */
        }
        else
        {
            /* start a new round of matching */
            i++;
            j = 0;
        }
    }
    if (T[j] == '\0')
    {
        return i; /* match succeeded, return the subscript */
    }
    else
    {
        return -1; /* from pos onward, S contains no substring equal to T */
    }
}

The idea of this algorithm is straightforward: compare the substring of the main string S that starts at some position i with the pattern string T. That is, starting from j=0, compare S[i+j] with T[j]. If they are equal, a match starting at i is still possible, so keep comparing (j increases by 1 each time) until the last character of T has matched. Otherwise, restart the next round of matching from the next character of S: T slides one position to the right, that is, i increases by 1, j returns to 0, and a new round begins.

For example, find T="abcabd" in S="abcabcabdabba" (assume we start from subscript 0): first compare S[0] and T[0], then S[1] and T[1], and so on. The comparison continues until S[5] and T[5], which are not equal. As shown in the figure.

When such a mismatch occurs, the T subscript must go back to the beginning; the S subscript backtracks by the same distance and then increases by 1, and the comparison starts again. As shown in the following figure.

This time there is a mismatch again, so the T subscript goes back to the beginning, the S subscript increases by 1, and the comparison starts again. As shown in the figure.

This time there is another mismatch, so the T subscript goes back to the beginning, the S subscript increases by 1, and the comparison starts again. As shown in the figure.

Another mismatch occurs, so the T subscript goes back to the beginning, the S subscript increases by 1, and the comparison starts again. This time every character of T matches the corresponding character of S, and the function returns 3, the starting subscript of T in S. As shown in the figure.
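
The rounds described above can be replayed with a small sketch. The function name naive_search_trace and its rounds parameter are made up for illustration and are not part of the original article; the function prints where each round stops and reports how many start positions were tried.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the article): brute-force search that
 * prints where each round stops and counts the start positions tried.
 * Returns the subscript of the first match of t in s, or -1. */
int naive_search_trace(const char *s, const char *t, int *rounds)
{
    assert(s != NULL && t != NULL && rounds != NULL);

    size_t n = strlen(s);
    size_t m = strlen(t);
    *rounds = 0;

    for (size_t i = 0; i + m <= n; i++) {
        size_t j = 0;
        (*rounds)++;
        /* Compare S[i+j] with T[j] until a mismatch or the end of T. */
        while (j < m && s[i + j] == t[j]) {
            j++;
        }
        if (j == m) {
            printf("i=%zu: all of T matched\n", i);
            return (int)i;
        }
        printf("i=%zu: mismatch at S[%zu] and T[%zu]\n", i, i + j, j);
    }
    return -1;
}
```

For S="abcabcabdabba" and T="abcabd" the rounds start at i=0 (mismatch at S[5] and T[5]), i=1, i=2 and i=3, where the match succeeds.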


Second, the KMP algorithm

Take the same example: find T="abcabd" in S="abcabcabdabba". With the KMP matching algorithm, when the first round reaches the mismatch at S[5] and T[5], the S subscript does not go back to 1 and the T subscript does not go back to the beginning. Instead, according to the pattern function value of T[5]='d' (next[5]=2; the reason is explained later), S[5] is compared directly with T[2]. They are equal, so the subscripts of S and T both advance; the next pair is also equal, so both advance again, and so on. Finally, T is found in S. As shown in the figure.

To compare the efficiency of the KMP matching algorithm with the simple matching algorithm, consider an extreme example:

Find T="aaaaaaaaab" in S="aaaaaa...aab" (100 a's). With the simple matching algorithm, every round compares up to the end of T, finds that the characters differ, moves the T subscript back to the beginning, backtracks the S subscript by the same length and advances it by 1, and then continues comparing. With the KMP matching algorithm, the S subscript never has to backtrack.
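
To put rough numbers on this extreme example, the sketch below counts character comparisons for both approaches. The names count_naive and count_kmp are hypothetical, and the KMP part is a minimal version built on the optimized next values discussed later in this article, not the author's own listing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Brute-force search; *cmp receives the number of character comparisons. */
int count_naive(const char *s, const char *t, long *cmp)
{
    size_t n = strlen(s), m = strlen(t);
    *cmp = 0;
    for (size_t i = 0; i + m <= n; i++) {
        size_t j = 0;
        while (j < m) {
            (*cmp)++;
            if (s[i + j] != t[j]) break;
            j++;
        }
        if (j == m) return (int)i;
    }
    return -1;
}

/* KMP search; *cmp receives the number of character comparisons made
 * while scanning s (building the next values is not counted).
 * Returns -1 if t is not found or if memory allocation fails. */
int count_kmp(const char *s, const char *t, long *cmp)
{
    int n = (int)strlen(s), m = (int)strlen(t);
    *cmp = 0;
    if (m == 0) return 0;

    int *next = malloc((size_t)m * sizeof *next);
    if (next == NULL) return -1;

    /* Build the optimized next values for t. */
    int i = 0, j = -1;
    next[0] = -1;
    while (i < m - 1) {
        if (j == -1 || t[i] == t[j]) {
            ++i;
            ++j;
            next[i] = (t[i] != t[j]) ? j : next[j];
        } else {
            j = next[j];
        }
    }

    /* Scan s; the subscript i never moves backwards. */
    i = 0;
    j = 0;
    while (i < n && j < m) {
        if (j == -1) {          /* s[i] cannot start a match: skip it */
            ++i;
            j = 0;
            continue;
        }
        (*cmp)++;
        if (s[i] == t[j]) {
            ++i;
            ++j;
        } else {
            j = next[j];
        }
    }
    free(next);
    return (j == m) ? i - m : -1;
}
```

With 100 a's followed by b as S and nine a's followed by b as T, this sketch counts 920 comparisons for the simple algorithm and 192 for KMP. The exact figures depend on how comparisons are counted.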

For string matching in ordinary documents, the running time of the simple matching algorithm is in practice close to O(m+n), so it is still used in most practical applications.

The core idea of the KMP algorithm is to use the partial matching information already obtained to carry out the rest of the matching process. Look at the previous example: why does T[5]=='d' have a pattern function value of 2 (next[5]=2)? The 2 means that the two characters before T[5]=='d' are the same as the first two characters of T, and that T[5]=='d' is not equal to the third character, the one right after those first two (T[2]='c'). As shown in the figure.

That is, if the third character were also 'd', then even though the two characters before T[5]=='d' are the same as the first two characters, the pattern value of T[5]=='d' would be 0, not 2.

Earlier I said: when searching for T="abcabd" in S="abcabcabdabba" with the KMP matching algorithm, once the first round reaches S[5] and T[5], the S subscript does not go back to 1 and the T subscript does not go back to the beginning; instead, according to the pattern function value of T[5]=='d', S[5] is compared directly with T[2]. Why is this possible?

I also said that next[5]=2 means the two characters before T[5]=='d' are the same as the first two characters of T. See the figure: S[4]==T[4] and S[3]==T[3], and by next[5]=2 we have T[3]==T[0] and T[4]==T[1], so S[3]==T[0] and S[4]==T[1] (these two pairs have been compared indirectly). Therefore the next step is to compare S[5] and T[2].

One might ask: S[3] and T[0], and S[4] and T[1], are found equal indirectly through next[5]=2, but how can the comparisons of S[1] with T[0] and of S[2] with T[0] simply be skipped? Because S[0]=T[0], S[1]=T[1], S[2]=T[2], and T[0]!=T[1], T[0]!=T[2], it follows that S[1]!=T[0] and S[2]!=T[0]. In theory this is still an indirect comparison.

Some will object again: isn't this analysis only a special case?

Suppose S is unchanged and we search for T="abaabd" in S.

A: In this case the comparison of S[2] and T[2] finds them unequal, so look at the value of next[2]. next[2]=-1 means that S[2] has already been compared indirectly with T[0] and is not equal to it, so next compare S[3] and T[0].

Suppose S is unchanged and we search for T="abbabd" in S.

A: In this case the comparison of S[2] and T[2] finds them unequal, so look at the value of next[2]. next[2]=0 means that S[2] has been compared with T[2] and found unequal, so next compare S[2] and T[0].

Suppose S="abaabcabdabba" and we search for T="abaabd" in S.

A: In this case the comparison of S[5] and T[5] finds them unequal, so look at the value of next[5]. next[5]=2 means that, from the comparisons already made, the two characters before S[5] are equal to the first two characters of T, so next compare S[5] and T[2].
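
The three answers above follow one rule, sketched below. The helper name resume_after_mismatch and its parameters are hypothetical and only serve the illustration: given next[n] for the position where S[m]!=T[n], it decides which pair of subscripts is compared next.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the article): after S[*s_idx] != T[n],
 * use next_n = next[n] to pick the next pair of subscripts to compare. */
void resume_after_mismatch(int next_n, int *s_idx, int *t_idx)
{
    assert(s_idx != NULL && t_idx != NULL);

    if (next_n == -1) {
        /* S[*s_idx] is already known to differ from T[0]: skip it. */
        *s_idx += 1;
        *t_idx = 0;
    } else {
        /* Stay on the same character of S; the first next_n characters
         * of T are already known to match the characters before it. */
        *t_idx = next_n;
    }
}
```

With next[2]=-1 the pair (2, 2) becomes (3, 0); with next[2]=0 it becomes (2, 0); with next[5]=2 the pair (5, 5) becomes (5, 2), which matches the three cases above.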

In short, once the next values of a string are known, everything else follows. So how do we find the pattern function values next[n] of a string? (In this article, "next value", "pattern function value" and "pattern value" all mean the same thing.)

Third, how to find the pattern values next[] of a string

Definition:

(1) next[0] = -1. Meaning: the pattern value of the first character of any string is defined as -1.
(2) next[j] = -1. Meaning: the character at subscript j in the pattern string T is the same as the first character, and the k characters before j are not equal to the first k characters (or they are equal but T[k]==T[j]), for 1<=k<j.
(3) next[j] = k. Meaning: the k characters before subscript j in the pattern string T are equal to the first k characters, and T[j]!=T[k], for 1<=k<j.
(4) next[j] = 0. Meaning: every case not covered by (1), (2) and (3).

Example:

1) Find the pattern function values of T="abcac".

next[0] = -1  according to (1)
next[1] = 0   according to (4); (3) requires 1<=k<j, so for j=1 one cannot argue from T[j-1]==T[0]
next[2] = 0   according to (4); (3) requires 1<=k<j, and (T[0]=a) != (T[1]=b)
next[3] = -1  according to (2)
next[4] = 1   according to (3); T[0]=T[3] and T[1]!=T[4]

Why can next[4] be 0 even though T[0]==T[3]? If T[1]==T[4] (for example in T="abcab"), the condition T[j]!=T[k] of (3) fails, so the case falls under (4).

2) A slightly more complex one: find the pattern function values of T="ababcaabc".

next[0] = -1  according to (1)
next[1] = 0   according to (4)
next[2] = -1  according to (2)
next[3] = 0   although T[0]=T[2], T[1]=T[3], so (3) fails and the case falls under (4)
next[4] = 2   according to (3); T[0]T[1]=T[2]T[3] and T[2]!=T[4]
next[5] = -1  according to (2)
next[6] = 1   according to (3); T[0]=T[5] and T[1]!=T[6]
next[7] = 0   although T[0]=T[6], T[1]=T[7], so (3) fails and the case falls under (4)
next[8] = 2   according to (3); T[0]T[1]=T[6]T[7] and T[2]!=T[8]

Once you understand why next[3]=0 rather than 1, next[6]=1 rather than -1, and next[8]=2 rather than 0, the rest is easy.

3) A special one: the pattern function values of T="abcabcad".

next[5] = 0   although T[0]T[1]=T[3]T[4], T[2]==T[5], so (3) fails
next[6] = -1  according to (2); although abc=abc in front, T[3]==T[6]
next[7] = 4   according to (3); abca=abca in front, and T[4]!=T[7]

If T[4]==T[7], for example T="adcadcad", then next[7]=0 rather than 4, because T[4]==T[7].

If you think you have understood this, try a small exercise.

Practice: find the pattern function values of T="aaaaaaaaaab" and verify them with the function given below.

Meaning: what do the next function values mean? Some of this was said earlier; here is a summary. Suppose we are searching for the pattern string T in the string S and S[m]!=T[n]. Then take the pattern function value next[n] of T[n]:

(1) next[n] = -1 means S[m] has been compared indirectly with T[0] and is not equal to it; next, compare S[m+1] and T[0].
(2) next[n] = 0 means the comparison produced a mismatch; next, compare S[m] and T[0].
(3) next[n] = k, with k>0 && k<n, means the k characters before S[m] have been compared indirectly with the first k characters of T and are equal; next, compare S[m] and T[k].
(4) Other values are impossible.
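
As a cross-check of the worked examples, rules (1) to (4) can be applied literally. The sketch below assumes that rule (3) is meant to pick the largest qualifying k; the function name next_by_definition is hypothetical, and the code is a slow, direct reading of the definition rather than the efficient method.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not from the article): pattern value of t[j],
 * computed straight from rules (1)-(4). Assumption: when several k
 * satisfy rule (3), the largest one is the intended value. */
int next_by_definition(const char *t, int j)
{
    assert(t != NULL && j >= 0 && (size_t)j < strlen(t));

    if (j == 0) {
        return -1;                           /* rule (1) */
    }
    for (int k = j - 1; k >= 1; k--) {
        /* rule (3): the k characters before t[j] equal the first k
         * characters of t, and t[j] != t[k] */
        if (memcmp(t, t + j - k, (size_t)k) == 0 && t[j] != t[k]) {
            return k;
        }
    }
    /* No k satisfies rule (3): rule (2) if t[j] equals the first
     * character, rule (4) otherwise. */
    return (t[j] == t[0]) ? -1 : 0;
}
```

For T="abcac" this yields -1, 0, 0, -1, 1, and for T="ababcaabc" it yields -1, 0, -1, 0, 2, -1, 1, 0, 2, in line with the values derived above.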

Implementation of the KMP algorithm

/* If you find any problems, please point them out; I am still learning the language. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Function: compute the pattern values of a pattern string
 * Parameters: ptn: pattern string
 *             nextval: array that receives the pattern values
 *                      (one entry per character of ptn)
 */
void Get_nextval(char const *ptn, int *nextval)
{
    /* Check the pointers before they are used. */
    if (ptn == NULL || nextval == NULL)
    {
        return;
    }

    int i = 0;
    int j = -1;
    int plen = (int)strlen(ptn);

    nextval[0] = -1;
    /* Stop at plen - 1 so that nextval[plen] is never written. */
    while (i < plen - 1)
    {
        if (j == -1 || ptn[i] == ptn[j])
        {
            ++i;
            ++j;
            if (ptn[i] != ptn[j])
            {
                nextval[i] = j;
            }
            else
            {
                nextval[i] = nextval[j];
            }
        }
        else
        {
            j = nextval[j];
        }
    }
}

/* Function: implement the KMP algorithm
 * Parameters: src: source string
 *             patn: pattern string
 *             nextval: pattern string values
 *             pos: start position in the source string
 * Return value: the subscript if the match succeeds; -1 on error or if the match fails
 */
int Kmp_search(char const *src, char const *patn, int const *nextval, int pos)
{
    int i = pos;
    int ...
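
The listing breaks off at this point in the source. As a rough idea of how a search with this parameter list might continue, here is a minimal sketch. The name kmp_search_sketch and the whole body are assumptions, not the author's code, and nextval is assumed to have been filled in beforehand with one pattern value per character of patn.

```c
#include <assert.h>
#include <string.h>

/* Sketch only: same parameter list as the truncated Kmp_search, but the
 * body is an assumption. Returns the subscript of the first match of
 * patn in src at or after pos, or -1 on error or if there is no match. */
int kmp_search_sketch(char const *src, char const *patn,
                      int const *nextval, int pos)
{
    if (src == NULL || patn == NULL || nextval == NULL || pos < 0) {
        return -1;
    }

    int slen = (int)strlen(src);
    int plen = (int)strlen(patn);
    if (pos > slen) {
        return -1;
    }

    int i = pos;   /* subscript in src, never moves backwards */
    int j = 0;     /* subscript in patn */
    while (i < slen && j < plen) {
        if (j == -1 || src[i] == patn[j]) {
            /* Either src[i] cannot start a match (j == -1) or the
             * characters are equal: advance both subscripts. */
            ++i;
            ++j;
        } else {
            /* Mismatch: keep i and fall back in the pattern. */
            j = nextval[j];
        }
    }
    return (j >= plen) ? i - plen : -1;
}
```

With the pattern values -1, 0, 0, -1, 0, 2 for "abcabd", searching "abcabcabdabba" from position 0 returns 3, the subscript found in the earlier walkthrough.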
