String pattern matching algorithms (BF algorithm and KMP algorithm)


Pattern matching algorithm for strings

Locating a substring T inside a main string S is usually called string pattern matching, and T is called the pattern string.

The basic substring-locating function (brute force, BF)

Here is the Java code I wrote for it.

    int index(String s, String t, int pos) {
        char[] sArr = s.toCharArray();
        char[] tArr = t.toCharArray();
        int i, j; // i is the pointer into the main string s, j is the pointer into the pattern string t
        // pos is a 1-based start position; reject arguments for which no match is possible
        if (pos < 1 || pos > s.length() || s.length() < t.length() || pos - 1 + t.length() > s.length())
            return -2;
        /* The outer loop tries each start position in the main string (array subscripts
           start at 0); i means the current attempt starts at element i of the main string.
           The bound is <= so that the last possible start position is also tried. */
        for (i = pos - 1; i <= s.length() - t.length(); i++) {
            /* The inner loop walks through the pattern string; j is the current matching position */
            for (j = 0; j < t.length(); j++) {
                /* i + j is the position in the main string. On a mismatch, give up on this
                   start position: the main string moves to the next start position and the
                   pattern pointer starts again from the beginning. */
                if (sArr[i + j] != tArr[j]) {
                    break;
                }
            }
            /* If j ran past the end of the pattern string, the whole pattern matched at i */
            if (j >= t.length())
                return i; // 0-based index of the match
        }
        return -1;
    }

The version in the textbook looks like this (pseudocode):

    int index(String S, String T, int pos) {
        /* Returns the position of pattern string T in main string S, searching from
           position pos onward; returns 0 if T does not occur.
           T is non-empty, 1 <= pos <= S.length; S[0] and T[0] store the string lengths. */
        i = pos; j = 1;
        while (i <= S[0] && j <= T[0]) {
            if (S[i] == T[j]) {
                i++; j++;            /* both pointers move forward */
            } else {
                i = i - j + 2; j = 1;
                /* Subscripts start at 1, so the current attempt began at position i-j+1
                   of the main string (aligned with position 1 of the pattern string).
                   The next attempt begins one position later, at i-j+2. */
            }
        }
        if (j > T[0]) return i - T[0];
        return 0;
    }

When the main string has length m and the pattern string has length n (m > n), the worst-case time complexity of the BF algorithm is O(m*n).
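To make the O(m*n) bound concrete, here is a minimal Java sketch that counts the character comparisons made by brute-force matching. The class and method names (BfComparisonCount, countComparisons) are made up for this illustration and do not come from the code above; the sketch uses 0-based indices and always searches from the start of the main string.

```java
/** Illustration only: counts character comparisons made by brute-force matching. */
class BfComparisonCount {

    /** Number of comparisons BF makes while looking for t in s, stopping at the first match. */
    static long countComparisons(String s, String t) {
        long comparisons = 0;
        for (int i = 0; i + t.length() <= s.length(); i++) {
            int j = 0;
            while (j < t.length()) {
                comparisons++;
                if (s.charAt(i + j) != t.charAt(j)) {
                    break;          // mismatch: give up on start position i
                }
                j++;
            }
            if (j == t.length()) {
                return comparisons; // full match found at i
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        String s = "aaa" + "aaa" + "aaa" + "b"; // nine a's followed by b, m = 10
        String t = "aaab";                      // n = 4
        System.out.println(countComparisons(s, t)); // prints 28
    }
}
```

For this input each of the m-n+1 = 7 start positions costs n = 4 comparisons, which is where the product of the two lengths comes from.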

Improved pattern matching algorithm

In the BF algorithm, every time a mismatched character is encountered, the pattern string pointer goes all the way back to the beginning and matching restarts at the next start position in the main string. The improvement is to use the "partial match" already obtained: slide the pattern string to the right as far as possible and continue comparing from there. In effect, the main string pointer stays where it is, and the character it points to is next compared with some character inside the partially matched part of the pattern string.
Now the problem has become:
* Which character of the partially matched part of the pattern string should the current main-string character be compared with next, that is, how far should the pattern string slide to the right?

Main string s subscript 1,2,...,i-j+1,...,i-1,i,...,m

Pattern string T subscript 1,...,j-1,j,...,n

The pointers i and j point to the characters currently being compared (whether they are equal has not been decided yet).

At this moment the following equation holds: s(i-j+1) ... s(i-1) = t(1) ... t(j-1), which is the part that has already matched.

Now suppose s(i) and t(j) do not match, so the pattern string has to slide right, and suppose that after the slide s(i) is compared with t(k) (k is an assumed position for now).

Then this equation must hold: s(i-k+1) ... s(i-1) = t(1) ... t(k-1)

    S:    a b a b c d e      i points to c
    T1:   a b a b e          j points to e
    T2:       a b a b e      after the right shift, j points to a (subscript 3), i.e. the position of the k-th character

As you can see, inside the pattern string the substring of length k-1 that starts at the first character (ab) is equal to the substring of length k-1 that ends at the character just before the one currently being compared (also ab). For the pattern string this gives the formula:

t(1) ... t(k-1) = t(j-k+1) ... t(j-1)

So it is this substring of length k-1 that decides that the pattern string pointer is repositioned to the k-th character when a mismatch occurs.
Now the problem has become:
* In the part of the pattern string before the character with subscript j, look for a substring of length k-1
* such that t(1) ... t(k-1) = t(j-k+1) ... t(j-1) holds

Note that the two matching substrings must satisfy these conditions:
* Both lie before the character with subscript j.
* The first substring must start at the first character of the pattern string, and the second substring must end at the character with subscript j-1.

This is like "qia tou qu wei" (cutting off the head and the tail): one substring sits at the head and the other at the tail of t(1) ... t(j-1).

We record the result in an array called next. next[j] = k means that when the j-th character of the pattern string mismatches the current character of the main string, the pattern string pointer should move to the character with subscript k, which is then compared with that same character of the main string.

An example:

    j          1  2  3  4  5  6  7  8
    pattern    a  b  a  a  b  c  a  c
    next[j]    0  1  1  2  2  3  1  2

How do you calculate next[j]? I asked a classmate in the lab last night, and he told me one way:

By convention, next[1] = 0.

To calculate next[j], cover the character with subscript j with your finger and check whether the characters before it contain two equal substrings of the kind described above (one starting at the first character, the other ending at subscript j-1).

If so, the longest such pair are the substrings of length k-1 we are looking for. So k = substring length + 1, and that is the value of next[j].

If not, k-1 = 0, so k is 1.
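The "cover it with your finger" method can be written down directly. The following is a rough Java sketch of it, checked against the example pattern abaabcac; the class and method names (NextByDefinition, nextByDefinition) are my own for this illustration. It keeps the 1-based subscripts of the text by leaving index 0 of the array unused, and it is deliberately the slow, literal version of the rule, not the efficient algorithm.

```java
import java.util.Arrays;

/** Illustration only: computes next[] literally from its definition. */
class NextByDefinition {

    /** Returns next[1..n] for pattern t; index 0 of the result is unused. */
    static int[] nextByDefinition(String t) {
        int n = t.length();
        int[] next = new int[n + 1];
        if (n >= 1) {
            next[1] = 0; // by convention
        }
        for (int j = 2; j <= n; j++) {
            int k = 1; // if no equal pair exists, k - 1 = 0 and k = 1
            // Try the longest candidate first; len plays the role of k - 1.
            for (int len = j - 2; len >= 1; len--) {
                // t(1) ... t(len) versus t(j-len) ... t(j-1), written with 0-based substring bounds
                String head = t.substring(0, len);
                String tail = t.substring(j - 1 - len, j - 1);
                if (head.equals(tail)) {
                    k = len + 1;
                    break;
                }
            }
            next[j] = k;
        }
        return next;
    }

    public static void main(String[] args) {
        int[] next = nextByDefinition("abaabcac");
        // Skip the unused index 0 when printing.
        System.out.println(Arrays.toString(Arrays.copyOfRange(next, 1, next.length)));
        // prints [0, 1, 1, 2, 2, 3, 1, 2]
    }
}
```

The candidate length stops at j-2 because the two substrings may not be the whole of t(1) ... t(j-1) itself; otherwise the pattern string would not move at all.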

With the next array, the improved algorithm becomes:

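    int index(String S, String T, int pos) {
        /* Returns the position of pattern string T in main string S, searching from
           position pos onward; returns 0 if T does not occur.
           T is non-empty, 1 <= pos <= S.length; S[0] and T[0] store the string lengths. */
        i = pos; j = 1;
        while (i <= S[0] && j <= T[0]) {
            /* j == 0 happens after falling back through next[1] = 0: there is nothing
               left to compare, so both pointers simply move forward. */
            if (j == 0 || S[i] == T[j]) {
                i++; j++;            /* both pointers move forward */
            } else {
                j = next[j];         /* i stays put; only the pattern pointer falls back */
            }
        }
        if (j > T[0]) return i - T[0];
        return 0;
    }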
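Here is one way the matching loop might look as runnable Java, under a few assumptions of mine: the next table is passed in ready-made (the values for abaabcac from the example table), subscripts in the logic stay 1-based with charAt(i - 1) doing the translation, and the main string "acabaabaabcacaabc" is just a test input I chose. The sketch includes a j == 0 check, which is needed because next[1] = 0 would otherwise leave the loop comparing against a non-existent character 0. The names KmpSearchSketch and indexKmp are hypothetical.

```java
/** Illustration only: KMP matching with a precomputed, 1-based next table. */
class KmpSearchSketch {

    /**
     * Returns the 1-based position of the first occurrence of t in s at or after
     * position pos, or 0 if there is none. next[1..t.length()] must be the next
     * table of t; next[0] is unused.
     */
    static int indexKmp(String s, String t, int[] next, int pos) {
        int i = pos; // pointer into the main string, never moves backward
        int j = 1;   // pointer into the pattern string
        while (i <= s.length() && j <= t.length()) {
            if (j == 0 || s.charAt(i - 1) == t.charAt(j - 1)) {
                i++;
                j++;
            } else {
                j = next[j]; // slide the pattern right; i is unchanged
            }
        }
        return j > t.length() ? i - t.length() : 0;
    }

    public static void main(String[] args) {
        String pattern = "abaabcac";
        int[] next = {0, 0, 1, 1, 2, 2, 3, 1, 2}; // index 0 unused
        System.out.println(indexKmp("acabaabaabcacaabc", pattern, next, 1)); // prints 6
    }
}
```

Because i never decreases, the main string is scanned only once, which is the whole point of the improvement over BF.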
How exactly is the next array computed?

As the discussion above shows, the next array has nothing to do with the main string; it depends only on the pattern string itself. Following the textbook, we start from the definition and obtain the next values recursively.

This can be seen as the pattern string being matched against itself. Finding the next value for a subscript always means looking inside the part of the pattern string before that subscript and checking whether it contains two equal substrings of length k-1 that satisfy the conditions.

By definition, next[1] = 0.

Suppose next[j] = k. From the search for the fallback position k during matching against the main string, we have this relationship:

t(1) ... t(k-1) = t(j-k+1) ... t(j-1)

When this holds and the j-th character of the pattern string mismatches the i-th character of the main string, the pattern string pointer falls back to the k-th character. Here k lies between 1 and j, and the two substrings of length k-1 must be the longest such pair before subscript j, i.e. there is no k1 with k < k1 < j that also satisfies the relationship.

So, when computing next[j+1], there are two cases.

Case one: t(k) = t(j)

t(1) ... t(k-1) t(k) = t(j-k+1) ... t(j-1) t(j)

Naturally, the two substrings both grow from length k-1 to length k. So if position j+1 mismatches the main string, the pointer should move to the character at position k+1. That is:

next[j+1] = next[j] + 1

Case two: t(k) != t(j)

What should we do now? Note that this still holds:

t(1) ... t(k-1) = t(j-k+1) ... t(j-1)

Take a copy of t(1) ... t(k) and match it against the original pattern string, treating the original pattern string as the main string and the copy as the pattern string:

    t(1) ... t(k-1) ... t(j-k+1) ... t(j-1) t(j) ...
                        t(1)     ... t(k-1) t(k)

Now the lower string has to slide to the right. The analysis is the same as the earlier analysis for k: suppose the pointer should slide to the character with subscript next[k] = k2, which is then compared with t(j).
Note that t(1) ... t(k-1) = t(j-k+1) ... t(j-1) still holds.

    t(1) ... t(k-1) ... t(j-k+1) ... t(j-k2+1) ... t(j-1)  t(j) ...
                                     t(1)      ... t(k2-1) t(k2) ... t(k)

If t(k2) = t(j) at this point:

This means that just before position j+1 of the "main string" there is a longest substring of length next[k] that is equal to the first next[k] characters of the pattern string (here the main string is the pattern string itself). So we get:

next[j+1] = next[k] + 1

And since next[j] = k, the formula above can also be written as:

next[j+1] = next[next[j]] + 1

If t(k2) != t(j) at this point:

Then we slide right again: we need next[k2], and repeat the comparison with t(j). This continues until a match is found or the fallback value reaches 0, in which case next[j+1] = 1.


Therefore, from the reasoning in the two cases above, the next value at a position is determined by the next values at earlier positions.
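The two cases can be condensed into a single step that derives next[j+1] from the values already known. Below is a small Java sketch of just that step, as I understand it; the names NextRecurrenceStep and nextOfSuccessor are invented for this illustration, and the array is 1-based with index 0 unused.

```java
/** Illustration only: one step of the recurrence, next[j+1] from next[1..j]. */
class NextRecurrenceStep {

    /** Assumes next[1..j] is already correct for pattern t and returns next[j+1]. */
    static int nextOfSuccessor(String t, int[] next, int j) {
        int k = next[j];
        // Case two: t(k) != t(j), so fall back to k2 = next[k], then next[k2], and so on.
        while (k != 0 && t.charAt(k - 1) != t.charAt(j - 1)) {
            k = next[k];
        }
        // Case one: t(k) = t(j) gives k + 1. If the chain reached 0, this gives 0 + 1 = 1.
        return k + 1;
    }

    public static void main(String[] args) {
        String t = "abaabcac";
        int[] next = new int[t.length() + 1];
        next[1] = 0;
        for (int j = 1; j < t.length(); j++) {
            next[j + 1] = nextOfSuccessor(t, next, j);
        }
        System.out.println(next[6] + " " + next[7]); // prints 3 1
    }
}
```

For j = 6 in abaabcac the chain is k = 3, then next[3] = 1, then next[1] = 0, so next[7] = 1, which agrees with the table in the example.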

The algorithm for the next function:

    void getNext(String T, int next[]) {
        i = 1; next[1] = 0; j = 0;
        while (i < T[0]) {
            /* If j is 0, either this is the very beginning (next[1] = 0) or the fallback
               chain has run out: move both pointers forward and continue comparing.
               If j is not 0 but both pointers point to equal characters, likewise move
               both pointers forward and continue comparing. */
            if (j == 0 || T[i] == T[j]) {
                i++;
                j++;
                /* Note that i and j differ from the very start; the first execution sets next[2] = 1 */
                next[i] = j;
            } else
                j = next[j];
        }
    }
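As a check, the pseudocode can be turned into runnable Java roughly as follows. This is only a sketch under my own conventions: a Java String has no length stored at index 0, so t.length() replaces T[0] and charAt(i - 1) reads the character with 1-based subscript i; the class name GetNextSketch is made up.

```java
import java.util.Arrays;

/** Illustration only: the next-function algorithm with 1-based subscripts. */
class GetNextSketch {

    /** Returns next[1..n] for pattern t; index 0 of the result is unused. */
    static int[] getNext(String t) {
        int n = t.length();
        int[] next = new int[n + 1];
        if (n == 0) {
            return next;
        }
        int i = 1; // position whose successor's next value is being computed
        int j = 0; // current fallback candidate, i.e. next[i]
        next[1] = 0;
        while (i < n) {
            if (j == 0 || t.charAt(i - 1) == t.charAt(j - 1)) {
                i++;
                j++;
                next[i] = j;
            } else {
                j = next[j]; // the pattern slides right against itself
            }
        }
        return next;
    }

    public static void main(String[] args) {
        int[] next = getNext("abaabcac");
        System.out.println(Arrays.toString(Arrays.copyOfRange(next, 1, next.length)));
        // prints [0, 1, 1, 2, 2, 3, 1, 2]
    }
}
```

Read this way, j always holds next[i]: the if branch is case one (or the exhausted chain), and the else branch is the repeated fallback of case two.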

This took me about two days: the first day to understand it, the second to write these notes. While writing I found there were still parts I had not understood, for example how the principle above turns into the algorithm code, which I still do not fully understand ...
