KMP string pattern matching in detail

Source: Internet
Author: User

Put simply, KMP string pattern matching is an efficient algorithm for locating one string inside another. The simple matching algorithm has a worst-case time complexity of O(m*n), where m and n are the lengths of the two strings; for the KMP matching algorithm it can be proved that the time complexity is O(m+n).

One. Simple matching algorithm

First, let's look at a function that implements the simple matching algorithm:

int INDEX_BF(char S[], char T[], int pos)
{
    /* If, starting from the character at subscript pos (0 <= pos < StrLength(S)),
       string S contains a substring identical to string T, the match succeeds
       and the subscript of the first such substring in S is returned;
       otherwise -1 is returned. */
    int i = pos, j = 0;
    while (S[i + j] != '\0' && T[j] != '\0')
        if (S[i + j] == T[j])
            j++;            // continue comparing the next character
        else
        {
            i++; j = 0;     // restart a new round of matching
        }
    if (T[j] == '\0')
        return i;           // match succeeded: return the subscript
    else
        return -1;          // S (from position pos on) contains no substring equal to T
} // INDEX_BF

The idea of this algorithm is straightforward: compare the substring of the main string S that starts at some position i with the pattern string T. That is, starting from j=0, compare S[i+j] with T[j]. If they are equal, a successful match starting at position i in S is still possible, so the comparison continues (j increases by 1 each step) until the last character of T has been matched. Otherwise a new round of "matching" starts from the next character of S: the string T slides one position to the right, that is, i increases by 1 and j goes back to 0.

For example, find T = "abcabd" in S = "abcabcabdabba" (we can assume the search starts at subscript 0). First S[0] and T[0] are compared, then S[1] and T[1], and so on; the first pair that differs is S[5] and T[5].

When such a mismatch occurs, the T subscript must go back to the beginning, the S subscript backs up by the same length, and then the S subscript increases by 1 and the comparison starts again. This time the mismatch is immediate (S[1] against T[0]), so the T subscript goes back to the beginning, the S subscript increases by 1, and the comparison starts again. Once more the mismatch is immediate (S[2] against T[0]); the T subscript goes back to the beginning, the S subscript increases by 1, and the comparison starts again.

This time every character of T matches the corresponding character of S, and the function returns 3, the starting subscript of T in S.
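The walkthrough above can be checked with a small sketch. It uses the same simple matching loop, with a counter added for the number of character comparisons. The name naive_find and the comparisons parameter are illustrative and are not part of the original function.

```c
/* Sketch of the simple (brute-force) matching loop with a comparison counter.
   naive_find and *comparisons are hypothetical names used for illustration.
   Returns the subscript of the first occurrence of t in s at or after pos,
   or -1 if there is none. Assumes 0 <= pos <= length of s. */
int naive_find(const char *s, const char *t, int pos, int *comparisons)
{
    int i = pos, j = 0;
    *comparisons = 0;
    while (s[i + j] != '\0' && t[j] != '\0') {
        ++*comparisons;            /* one character comparison */
        if (s[i + j] == t[j]) {
            j++;                   /* keep comparing the next character */
        } else {
            i++;                   /* slide t one position to the right */
            j = 0;                 /* and restart from its first character */
        }
    }
    return t[j] == '\0' ? i : -1;
}
```

For S = "abcabcabdabba" and T = "abcabd" with pos = 0, this returns 3 after 14 comparisons: 6 at alignment 0, 1 each at alignments 1 and 2, and 6 at alignment 3.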

Two. KMP matching algorithm

Take the same example again: searching for T = "abcabd" in S = "abcabcabdabba". With the KMP matching algorithm, when the first round reaches S[5] and T[5] and finds them unequal, the S subscript does not go back to 1 and the T subscript does not go back to the beginning. Instead, according to the pattern function value of T[5]=='d' in T (next[5]=2; why it is 2 is explained later), S[5] is compared directly with T[2]. They are equal, so the S and T subscripts both increase by 1; the next pair is equal too, so both subscripts increase again ... and in the end T is found in S.
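The matching step just described can be sketched as a loop that takes the table of next values as given. The function name kmp_find_with_table and the array next_abcabd are illustrative. Only next[5]=2 is stated in the text; the other entries are an assumption, filled in under the convention that -1 means "advance in S and restart T from subscript 0".

```c
/* Assumed next values for T = "abcabd"; only next[5] = 2 comes from the text. */
const int next_abcabd[6] = { -1, 0, 0, -1, 0, 2 };

/* Sketch of the KMP matching loop, given a precomputed table of next values.
   kmp_find_with_table is a hypothetical name. next[j] is the subscript in t
   to try when t[j] mismatches; -1 means "move on in s and restart t at 0".
   The subscript i into s never moves backwards.
   Assumes 0 <= pos <= length of s. */
int kmp_find_with_table(const char *s, const char *t, const int *next, int pos)
{
    int i = pos, j = 0;
    while (s[i] != '\0' && (j == -1 || t[j] != '\0')) {
        if (j == -1 || s[i] == t[j]) {
            i++;                   /* characters agree (or t restarts): */
            j++;                   /* advance both subscripts together  */
        } else {
            j = next[j];           /* fall back in t only; i stays put  */
        }
    }
    return (j >= 0 && t[j] == '\0') ? i - j : -1;
}
```

With S = "abcabcabdabba", T = "abcabd" and the table above, the call returns 3; the only fallback taken is j = next[5] = 2 after the mismatch at S[5].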


To compare the efficiency of the KMP matching algorithm and the simple matching algorithm, consider an extreme example: searching for T = "aaaaaaaaab" in S = "aaaaaa...aab" (100 a's). In every round the simple matching algorithm compares all the way to the end of T before it finds a differing character; the T subscript then goes back to the beginning, and the S subscript backs up by the same length and increases by 1 before the comparison continues. With the KMP matching algorithm, the S subscript never has to backtrack.

For matching strings in ordinary documents, the running time of the simple matching algorithm is in practice close to O(m+n), which is why it is still used in most practical situations.

The core idea of the KMP algorithm is to use the partial-match information already obtained to carry on with the matching process. Look at the previous example again. Why is the pattern function value of T[5]=='d' equal to 2 (next[5]=2)? The 2 means that the 2 characters immediately before T[5]=='d' are the same as the first 2 characters of T, and that T[5]=='d' differs from the character that follows those first two characters (T[2]=='c'). In other words, if the character after the first two characters were also 'd', then even though the 2 characters before T[5]=='d' equal the first two characters of T, the pattern function value of T[5]=='d' would not be 2 but 0.

Earlier I said: when searching for T = "abcabd" in S = "abcabcabdabba" with the KMP matching algorithm, once S[5] and T[5] are found unequal, the S subscript does not go back to 1 and the T subscript does not go back to the beginning; instead, according to the pattern function value of T[5]=='d', S[5] is compared directly with T[2]. Why is that allowed? As I just said, next[5]=2 means that the 2 characters before T[5]=='d' are the same as the first two characters of T.

Because S[4]==T[4] and S[3]==T[3], and next[5]=2 gives T[3]==T[0] and T[4]==T[1], it follows that S[3]==T[0] and S[4]==T[1] (the two pairs have been compared indirectly). So the next step is to compare S[5] with T[2] directly.

Someone may ask: S[3] and T[0], and S[4] and T[1], are known to be equal indirectly from next[5]=2, but how can the comparisons of S[1] with T[0] and of S[2] with T[0] be skipped? Because S[1]==T[1] and T[0]!=T[1], we have S[1]!=T[0]; likewise S[2]==T[2] and T[0]!=T[2] give S[2]!=T[0]. In effect these pairs have also been compared indirectly.

Another question: isn't the case you analysed a special one?

Assume S is unchanged and we search for T = "abaabd" in S. Answer: in this case the comparison of S[2] and T[2] finds them unequal, so we look at next[2]. next[2]=-1 means that S[2] has already been compared indirectly with T[0] and is not equal to it, so the next step is to compare S[3] with T[0].

Assume S is unchanged and we search for T = "abbabd" in S. Answer: in this case the comparison of S[2] and T[2] finds them unequal, so we look at next[2]. next[2]=0 means that S[2] has been compared with T[2] and is not equal to it, so the next step is to compare S[2] with T[0].

Suppose S = "abaabcabdabba" and we search for T = "abaabd" in S. Answer: in this case the comparison of S[5] and T[5] finds them unequal, so we look at next[5]. next[5]=2 means that the earlier comparisons have already established that the two characters before S[5] equal the first two characters of T, so the next step is to compare S[5] with T[2].

In short, once the next values of the pattern string are available, everything falls into place. So how do we compute the pattern function of a string?
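The next values quoted in the examples above can be reproduced with a short sketch. The text has not shown how the pattern function is computed at this point, so the routine below is an assumption: a commonly used construction that sets next[0] = -1 and, whenever T[j] equals the character it would fall back to, reuses that character's next value instead. The name get_next is illustrative.

```c
#include <string.h>

/* Sketch: fill next[0 .. strlen(t)-1] under the convention the examples
   follow. get_next is a hypothetical name; the caller must supply an array
   with at least strlen(t) elements.
   k tracks the length of the prefix of t that also ends just before t[j]. */
void get_next(const char *t, int next[])
{
    int m = (int)strlen(t);
    int j = 0, k = -1;
    if (m == 0)
        return;                    /* nothing to fill for an empty pattern */
    next[0] = -1;
    while (j < m - 1) {
        if (k == -1 || t[j] == t[k]) {
            j++;
            k++;
            /* If t[j] equals t[k], falling back to k would repeat the same
               failed comparison, so take the value already stored for k. */
            next[j] = (t[j] != t[k]) ? k : next[k];
        } else {
            k = next[k];           /* try a shorter matching prefix */
        }
    }
}
```

Under these assumptions the routine yields next[5] = 2 for "abcabd", next[2] = -1 and next[5] = 2 for "abaabd", next[2] = 0 for "abbabd", and next[5] = 0 for a pattern such as "abdabd", where the character after the first two is also 'd'.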
