The KMP string matching algorithm (data structures)

Source: Internet
Author: User

There are many versions of the pattern function value next[i]. Some other object-oriented algorithm descriptions use a failure function f(j) instead; the two express the same idea, related by next[j] = f(j-1) + 1. Still, the next[j] notation is the easier one to understand.

Put simply, KMP string pattern matching is an efficient algorithm for locating one string inside another. The time complexity of the simple matching algorithm is O(m*n); the time complexity of the KMP matching algorithm is O(m+n), where m and n are the lengths of the two strings.

First, the simple matching algorithm

Let's look at a function that implements the simple matching algorithm (C code):

    int index_bf(char s[], char t[], int pos)
    {
        int i = pos, j = 0;
        while (s[i + j] != '\0' && t[j] != '\0')
            if (s[i + j] == t[j])
                j++;        /* continue comparing the next character */
            else {
                i++;
                j = 0;      /* start a new round of matching */
            }
        if (t[j] == '\0')
            return i;       /* match succeeded: return the starting subscript */
        else
            return -1;      /* s ended before t was fully matched */
    }

The basic idea of the algorithm is to compare the substring of the main string S that begins at position i with the pattern string T. That is, starting from j=0, compare s[i+j] with t[j]. If they are equal, a match starting at position i of S is still possible, so the comparison continues (j increases by 1 each time) until the last character of T has been compared equal. Otherwise, a new round of matching starts from the next character of S: T slides back one position, that is, i increases by 1 and j returns to 0.
For example, find T="abcabd" in the string S="abcabcabdabba" (assume we start from subscript 0): first compare s[0] with t[0], then s[1] with t[1], and so on. The characters are equal until s[5] and t[5], which differ.

When such a mismatch occurs, the T subscript must go back to the beginning, and the S subscript backtracks over the characters already matched and then increases by 1, so the comparison restarts with s[1] and t[0]. This time there is an immediate mismatch, so the T subscript goes back to the beginning, the S subscript increases by 1, and the comparison starts again at s[2].

Another mismatch occurs, so the T subscript goes back to the beginning, the S subscript increases by 1, and the comparison starts again at s[3]. This time every character of T matches the corresponding character of S. The function returns 3, the starting subscript of T in S.
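
The rounds described above can be traced with a small Java sketch. It is only an illustration, not the author's code: the class name NaiveTrace, the method indexBF, and the list tried are names made up here, and pos is assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveTrace {
    // Simple matching: returns the subscript of the first occurrence of t in s
    // at or after pos, or -1. Every start position that is tried is recorded
    // in `tried` (a hypothetical helper list, only for illustration).
    static int indexBF(String s, String t, int pos, List<Integer> tried) {
        for (int i = pos; i + t.length() <= s.length(); i++) {
            tried.add(i);
            int j = 0;
            // compare s[i+j] with t[j] until a mismatch or the end of t
            while (j < t.length() && s.charAt(i + j) == t.charAt(j)) {
                j++;
            }
            if (j == t.length()) {
                return i; // all characters of t matched
            }
            // mismatch: the loop moves on to i+1 and j restarts from 0
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Integer> tried = new ArrayList<>();
        int found = indexBF("abcabcabdabba", "abcabd", 0, tried);
        System.out.println("start positions tried: " + tried);
        System.out.println("match at subscript " + found);
    }
}
```

For the example strings, the recorded start positions are 0, 1, 2 and 3, and the returned subscript is 3, in line with the walk-through.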

Second, the KMP matching algorithm

Take the same example, finding T="abcabd" in S="abcabcabdabba". With the KMP matching algorithm, when the first search reaches s[5] and t[5] and finds them different, the S subscript does not go back to 1 and the T subscript does not go back to the beginning. Instead, according to the pattern function value of t[5]=='d' in T (next[5]=2; the reason is explained below), s[5] is compared directly with t[2]. They are equal, so the subscripts of S and T increase together, and they keep increasing together while the characters stay equal. Finally, T is found in S.

To compare the efficiency of the KMP and simple matching algorithms, consider an extreme example: find T="aaaaaaaaab" in S="aaaaaa...aab" (100 a's). On every round the simple matching algorithm compares all the way to the end of T, finds that the last characters differ, moves the T subscript back to the beginning, backtracks the S subscript by the same length and adds 1, and then continues comparing. With the KMP matching algorithm, the S subscript never has to backtrack.
When matching strings in ordinary documents, however, the running time of the simple matching algorithm usually comes close to O(m+n), so it is still used in most practical applications.
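
To make the extreme example concrete, the following Java sketch counts character comparisons for both approaches. It is a rough illustration under assumptions: the class CompareCount, the static counter comparisons, and the helper repeat are invented here, and the KMP loop is one common way to use the next values described in this article.

```java
class CompareCount {
    static long comparisons; // character comparisons made by the last search

    // Simple matching, counting every character comparison.
    static int naive(String s, String t) {
        comparisons = 0;
        for (int i = 0; i + t.length() <= s.length(); i++) {
            int j = 0;
            while (j < t.length()) {
                comparisons++;
                if (s.charAt(i + j) != t.charAt(j)) break;
                j++;
            }
            if (j == t.length()) return i;
        }
        return -1;
    }

    // next[] table with next[0] = -1, in the style used by this article.
    static int[] buildNext(String p) {
        int[] next = new int[p.length()];
        if (p.isEmpty()) return next;
        int j = 0;
        int t = next[0] = -1;
        while (j < p.length() - 1) {
            if (t < 0 || p.charAt(j) == p.charAt(t)) {
                j++;
                t++;
                next[j] = (p.charAt(j) != p.charAt(t)) ? t : next[t];
            } else {
                t = next[t];
            }
        }
        return next;
    }

    // KMP matching, counting every character comparison.
    static int kmp(String s, String t) {
        comparisons = 0;
        int[] next = buildNext(t);
        int i = 0, j = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1) {           // next value -1: move on to s[i+1] and t[0]
                i++;
                j = 0;
            } else {
                comparisons++;
                if (s.charAt(i) == t.charAt(j)) { i++; j++; }
                else j = next[j];    // the S subscript i does not move back
            }
        }
        return j == t.length() ? i - j : -1;
    }

    // Hypothetical helper: a string of n copies of c.
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < n; k++) sb.append(c);
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = repeat('a', 100) + "b";
        String t = repeat('a', 9) + "b";
        System.out.println("naive: index " + naive(s, t) + ", comparisons " + comparisons);
        System.out.println("kmp:   index " + kmp(s, t) + ", comparisons " + comparisons);
    }
}
```

Both searches report the same subscript; the difference is only in how many comparisons they spend getting there.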

The core idea of the KMP algorithm is to use the partial matching information already obtained to carry on the rest of the matching process. Look at the previous example. Why is the pattern function value of t[5]=='d' equal to 2 (next[5]=2)? This 2 means that the 2 characters before t[5]=='d' are the same as the first two characters of T, and that t[5]=='d' is not equal to the character that follows those first two characters (t[2]=='c').

That is, if the character following the first two characters were also 'd', then even though the 2 characters before t[5]=='d' are the same as the first two characters of T, the pattern function value of t[5]=='d' would be 0 rather than 2.
Earlier we said that when finding T="abcabd" in S="abcabcabdabba" with the KMP matching algorithm, once the first search finds s[5] and t[5] different, the S subscript does not go back to 1 and the T subscript does not go back to the beginning; instead, based on the pattern function value of t[5]=='d' in T, s[5] is compared directly with t[2]. Why is this allowed?
As just said, next[5]=2 means that the 2 characters before t[5]=='d' are the same as the first two characters of T. We already know that s[4]==t[4] and s[3]==t[3]; by next[5]=2 we have t[3]==t[0] and t[4]==t[1]; therefore s[3]==t[0] and s[4]==t[1] (the two pairs have been compared indirectly). So the next step is to compare s[5] with t[2].

One might ask: s[3] and t[0], and s[4] and t[1], are found equal indirectly through next[5]=2, but how can the alignments that start T at s[1] and at s[2] be skipped without any comparison? Because s[1]==t[1] and s[2]==t[2], while t[1]!=t[0] and t[2]!=t[0]; hence s[1]!=t[0] and s[2]!=t[0]. In theory these are indirect comparisons too.
Another question: are you only analyzing special cases? Suppose S is unchanged and we search for T="abaabd" in S. Answer: when the comparison reaches s[2] and t[2] and finds them unequal, look at next[2]. Here next[2]=-1, which means s[2] has already been compared indirectly with t[0] and is not equal to it, so next compare s[3] with t[0].
Suppose S is unchanged and we search for T="abbabd" in S. Answer: when the comparison reaches s[2] and t[2] and finds them unequal, look at next[2]. Here next[2]=0, which means s[2] has only been compared with t[2] and found unequal, so next compare s[2] with t[0].
Suppose S="abaabcabdabba" and we search for T="abaabd" in S. Answer: when the comparison reaches s[5] and t[5] and finds them unequal, look at next[5]. Here next[5]=2, which means the earlier comparisons already show that the two characters before s[5] equal the first two characters of T, so next compare s[5] with t[2].
In short, once you have the next values of the pattern string, everything else follows. (In this article, next value, pattern function value, and pattern value all mean the same thing.)
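
The four cases above can be checked mechanically. The Java sketch below is only an illustration: the class ResumeAfterMismatch and the method firstResume are hypothetical names, and the next values are assumed to follow the convention of this article (next[0] = -1). It reports which pair of subscripts is compared right after the first mismatch.

```java
import java.util.Arrays;

class ResumeAfterMismatch {
    // next[] table with next[0] = -1, in the style used by this article.
    static int[] buildNext(String p) {
        int[] next = new int[p.length()];
        if (p.isEmpty()) return next;
        int j = 0;
        int t = next[0] = -1;
        while (j < p.length() - 1) {
            if (t < 0 || p.charAt(j) == p.charAt(t)) {
                j++;
                t++;
                next[j] = (p.charAt(j) != p.charAt(t)) ? t : next[t];
            } else {
                t = next[t];
            }
        }
        return next;
    }

    // Returns {m, n}: the subscripts of S and T that are compared right after
    // the first mismatch, or null if S or T ends before any mismatch.
    static int[] firstResume(String s, String t) {
        int[] next = buildNext(t);
        int m = 0, n = 0;
        while (m < s.length() && n < t.length() && s.charAt(m) == t.charAt(n)) {
            m++;
            n++;
        }
        if (m == s.length() || n == t.length()) return null;
        // mismatch at s[m] and t[n]
        if (next[n] == -1) return new int[] {m + 1, 0}; // s[m] cannot equal t[0]
        return new int[] {m, next[n]};                  // keep m, move T to next[n]
    }

    public static void main(String[] args) {
        String s = "abcabcabdabba";
        System.out.println(Arrays.toString(firstResume(s, "abcabd")));
        System.out.println(Arrays.toString(firstResume(s, "abaabd")));
        System.out.println(Arrays.toString(firstResume(s, "abbabd")));
        System.out.println(Arrays.toString(firstResume("abaabcabdabba", "abaabd")));
    }
}
```

Each printed pair {m, n} reads as "next compare s[m] with t[n]", which is how the answers above are phrased.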

Third, finding the pattern values next[n] of a string

Definition:

(1) next[0] = -1. Meaning: the pattern value of the first character of any string is defined to be -1.

(2) next[j] = -1. Meaning: the character of the pattern string T at subscript j is the same as the first character, and for every k with 1≤k<j the k characters before j are not equal to the first k characters (or they are equal but t[k]==t[j]). For example, in T="abcabcad", next[6]=-1 because t[3]==t[6].

(3) next[j] = k. Meaning: for the character of the pattern string T at subscript j, the k characters before j are equal to the first k characters, and t[j]!=t[k] (1≤k<j). That is, t[0]t[1]t[2]...t[k-1] == t[j-k]t[j-k+1]t[j-k+2]...t[j-1] and t[j]!=t[k]. If several values of k qualify, the largest is taken.

(4) next[j] = 0. Meaning: every case not covered by (1), (2), or (3).
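
Rules (1) to (4) can be turned into code almost word for word. The Java sketch below is a slow, definition-driven version meant only for checking hand calculations; the class NextByDefinition and its method are names invented here, and rule (3) is read as "take the largest qualifying k".

```java
import java.util.Arrays;

class NextByDefinition {
    // Computes next[] directly from rules (1)-(4); clear rather than fast.
    static int[] nextByDefinition(String t) {
        int[] next = new int[t.length()];
        for (int j = 0; j < t.length(); j++) {
            if (j == 0) {
                next[j] = -1;                       // rule (1)
                continue;
            }
            // rule (2) if t[j] equals the first character, otherwise rule (4);
            // either is overridden below when rule (3) applies
            int value = (t.charAt(j) == t.charAt(0)) ? -1 : 0;
            for (int k = j - 1; k >= 1; k--) {      // rule (3), largest k first
                // do the first k characters equal the k characters before j?
                boolean sameBlock = t.regionMatches(0, t, j - k, k);
                if (sameBlock && t.charAt(j) != t.charAt(k)) {
                    value = k;
                    break;
                }
            }
            next[j] = value;
        }
        return next;
    }

    public static void main(String[] args) {
        for (String t : new String[] {"abcac", "abcab", "ababcaabc", "abcabcad", "adcadcad"}) {
            System.out.println(t + " -> " + Arrays.toString(nextByDefinition(t)));
        }
    }
}
```

The strings in main are the ones worked through by hand in this section, so the printed tables can be compared with the hand-derived values.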

Example:

01) Find the pattern function values of T="abcac".

next[0]=-1 according to (1)

next[1]=0 according to (4), because (3) requires 1≤k<j and no such k exists for j=1 (t[j-1]==t[0] does not count, since it is the same character)

next[2]=0 according to (4), because (3) requires 1≤k<j and (t[0]=a) != (t[1]=b)

next[3]=-1 according to (2)

next[4]=1 according to (3): t[0]==t[3] and t[1]!=t[4]

Subscript   0   1   2   3   4
T           a   b   c   a   c
next       -1   0   0  -1   1

If T="abcab", the table becomes:

Subscript   0   1   2   3   4
T           a   b   c   a   b
next       -1   0   0  -1   0

Why is next[4]=0 even though t[0]==t[3]? Because t[1]==t[4], the condition "and t[j]!=t[k]" in (3) fails, so the case falls under (4).

02) A more complex one: find the pattern function values of T="ababcaabc".

next[0]=-1 according to (1)

next[1]=0 according to (4)

next[2]=-1 according to (2)

next[3]=0: (3) does not apply because, although t[0]==t[2], t[1]==t[3], so it falls under (4)

next[4]=2 according to (3): t[0]t[1]==t[2]t[3] and t[2]!=t[4]

next[5]=-1 according to (2)

next[6]=1 according to (3): t[0]==t[5] and t[1]!=t[6]

next[7]=0: (3) does not apply because, although t[0]==t[6], t[1]==t[7], so it falls under (4)

next[8]=2 according to (3): t[0]t[1]==t[6]t[7] and t[2]!=t[8]

That is:

Subscript   0   1   2   3   4   5   6   7   8
T           a   b   a   b   c   a   a   b   c
next       -1   0  -1   0   2  -1   1   0   2

As long as you understand why next[3]=0 rather than 1, next[6]=1 rather than -1, and next[8]=2 rather than 0, the rest is easy to follow.

03) A special case: find the pattern function values of T="abcabcad".

Subscript   0   1   2   3   4   5   6   7
T           a   b   c   a   b   c   a   d
next       -1   0   0  -1   0   0  -1   4

next[5]=0: (3) does not apply because, although t[0]t[1]==t[3]t[4], t[2]==t[5]

next[6]=-1 according to (2): although the preceding abc equals the leading abc, t[3]==t[6]

next[7]=4 according to (3): the preceding abca equals the leading abca, and t[4]!=t[7]

If t[4]==t[7], as in T="adcadcad", then next[7]=0 rather than 4, because t[4]==t[7]:

Subscript   0   1   2   3   4   5   6   7
T           a   d   c   a   d   c   a   d
next       -1   0   0  -1   0   0  -1   0

If you feel you are starting to understand, practice: find the pattern function values of T="aaaaaaaaaab", and check them against the function given later in this article.

Meaning:

What does the value of the next function mean? Suppose we are searching string S for the pattern string T and s[m]!=t[n]. Then take the pattern function value next[n] of t[n]:

1. next[n] = -1 means that s[m] has been compared indirectly with t[0] and they are unequal; next, compare s[m+1] with t[0].
2. next[n] = 0 means that an inequality was found during the comparison; next, compare s[m] with t[0].
3. next[n] = k with 0<k<n means that the k characters before s[m] are indirectly known to equal the first k characters of T; next, compare s[m] with t[k].
4. Other values are impossible.
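
The three cases translate into a short search loop. This article does not show the matching function itself, so the Java sketch below is one plausible way to write it, not the original code: the class KmpSearch and the method indexKMP are hypothetical names, and pos is assumed to be non-negative.

```java
class KmpSearch {
    // next[] table with next[0] = -1, in the style used by this article.
    static int[] buildNext(String p) {
        int[] next = new int[p.length()];
        if (p.isEmpty()) return next;
        int j = 0;
        int t = next[0] = -1;
        while (j < p.length() - 1) {
            if (t < 0 || p.charAt(j) == p.charAt(t)) {
                j++;
                t++;
                next[j] = (p.charAt(j) != p.charAt(t)) ? t : next[t];
            } else {
                t = next[t];
            }
        }
        return next;
    }

    // Returns the subscript of the first occurrence of t in s at or after pos,
    // or -1 if there is none.
    static int indexKMP(String s, String t, int pos) {
        int[] next = buildNext(t);
        int m = pos; // subscript in S, never moves backwards
        int n = 0;   // subscript in T
        while (m < s.length() && n < t.length()) {
            if (n == -1 || s.charAt(m) == t.charAt(n)) {
                // equal characters, or case 1 (next value -1):
                // go on with s[m+1] and t[n+1] (t[0] when n was -1)
                m++;
                n++;
            } else {
                // cases 2 and 3: keep m, compare s[m] with t[next[n]]
                n = next[n];
            }
        }
        return n == t.length() ? m - n : -1;
    }

    public static void main(String[] args) {
        System.out.println(indexKMP("abcabcabdabba", "abcabd", 0));
        System.out.println(indexKMP("abcabcabdabba", "abd", 0));
        System.out.println(indexKMP("abcabcabdabba", "abx", 0));
    }
}
```

Folding case 1 into the same branch as a successful comparison works because n = -1 followed by n++ gives 0, which is exactly "compare s[m+1] with t[0]".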

Fourth, a function that computes the pattern values next[n] of string T

After all this, doesn't computing the pattern values next[n] of string T seem very complicated? If I were asked to write such a function myself right now, it would be harder than climbing to the sky. Fortunately a ready-made function exists. It comes from the pioneers who invented the KMP algorithm, and I admire them utterly. The rest of us have to ponder it over and over to understand it. Here is the function:

    /**
     * String pattern matching: KMP algorithm.
     * Builds and prints the improved next[] table of several pattern strings.
     */
    public class Kmp_next {
        public static void main(String[] args) {
            showNextTable("abcac", buildNextImproved("abcac"));
            showNextTable("abcab", buildNextImproved("abcab"));
            showNextTable("ababcaabc", buildNextImproved("ababcaabc"));
            showNextTable("abcabcad", buildNextImproved("abcabcad"));
            showNextTable("adcadcad", buildNextImproved("adcadcad"));
            showNextTable("aaaaaaaaaab", buildNextImproved("aaaaaaaaaab"));
        }

        // Build the next[] table of the (non-empty) pattern string P
        protected static int[] buildNextImproved(String P) {
            int[] next = new int[P.length()];
            int j = 0;
            int t = next[0] = -1;   // rule (1): next[0] is -1
            while (j < P.length() - 1) {
                if (0 > t || P.charAt(j) == P.charAt(t)) {  // match
                    j++;
                    t++;
                    // keep t only if the characters differ; otherwise reuse next[t]
                    next[j] = (P.charAt(j) != P.charAt(t)) ? t : next[t];
                } else {                                    // mismatch
                    t = next[t];
                }
            }
            return next;
        }

        // Print the characters of P and, below them, the values in N
        protected static void showNextTable(String P, int[] N) {
            for (int i = 0; i < P.length(); i++)
                System.out.printf("%4c", P.charAt(i));
            System.out.println();
            for (int i = 0; i < N.length; i++)
                System.out.printf("%4d", N[i]);
            System.out.println("\n---------------------------------------------------");
        }
    }


Run:

    C:\work>java Kmp_next
       a   b   c   a   c
      -1   0   0  -1   1
    ---------------------------------------------------
       a   b   c   a   b
      -1   0   0  -1   0
    ---------------------------------------------------
       a   b   a   b   c   a   a   b   c
      -1   0  -1   0   2  -1   1   0   2
    ---------------------------------------------------
       a   b   c   a   b   c   a   d
      -1   0   0  -1   0   0  -1   4
    ---------------------------------------------------
       a   d   c   a   d   c   a   d
      -1   0   0  -1   0   0  -1   0
    ---------------------------------------------------
       a   a   a   a   a   a   a   a   a   a   b
      -1  -1  -1  -1  -1  -1  -1  -1  -1  -1   9
    ---------------------------------------------------
