Data structures: string matching

Source: Internet
Author: User

Pattern matching is a basic string operation in data structures: given a substring, find every occurrence of that substring within another string.

Assume P is the given substring and T is the string to be searched; the task is to find every substring of T that is equal to P. P is called the pattern and T is called the target. If T contains one or more substrings equal to P, the position of each such substring in T is reported and the match is said to succeed; otherwise the match fails.
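To make the problem statement concrete, here is a minimal C++ sketch that reports every position at which the pattern occurs in the target. The name find_all and the use of std::string are illustrative assumptions, not something the text prescribes; the function simply tests each starting position in turn.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the starting index of every occurrence of pattern in target.
// An empty pattern is treated as having no occurrences (an assumption).
std::vector<std::size_t> find_all(const std::string& target,
                                  const std::string& pattern) {
    std::vector<std::size_t> positions;
    if (pattern.empty() || pattern.size() > target.size()) {
        return positions;
    }
    for (std::size_t i = 0; i + pattern.size() <= target.size(); ++i) {
        // Compare the substring of target starting at i with the pattern.
        if (target.compare(i, pattern.size(), pattern) == 0) {
            positions.push_back(i);
        }
    }
    return positions;
}
```

For example, with T = "abababa" and P = "aba" the occurrences overlap, and the positions found are 0, 2 and 4.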

Brute-force algorithm (BF algorithm): the idea

Compare the first character of the target string T with the first character of the pattern string P.

If they are equal, continue by comparing the characters that follow; otherwise, start again from the second character of the target string and the first character of the pattern string.

Continue in this way until every character of the pattern string equals a contiguous run of characters in the target string, in which case the match succeeds; if no such run exists, the match fails.

Algorithm performance

Suppose the pattern string has length m and the target string has length n: n corresponds to the outer loop and m to the inner loop.

The BF algorithm backtracks, which seriously hurts efficiency. In the worst case it performs about n*m comparisons, so the complexity of the algorithm is O(mn). The brute-force algorithm cannot use information it already has, namely the contents of the pattern string, to reduce the number of comparisons. For example, in the fourth step, T[5] and P[4] do not match and the algorithm backtracks (the picture is slightly off). But T[3] and P[0] are certain to differ: from the previous attempt we already know that T[3] = P[1], and P[0] differs from P[1].
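The worst-case bound can be checked on a small input. The sketch below counts the character comparisons made by a brute-force search for the first occurrence; bf_comparisons is a hypothetical helper written only for this illustration, and it stops trying start positions once the remaining target is shorter than the pattern.

```cpp
#include <cstddef>
#include <string>

// Counts character comparisons performed by a brute-force search that
// stops at the first occurrence of pattern in text.
std::size_t bf_comparisons(const std::string& text, const std::string& pattern) {
    std::size_t count = 0;
    if (pattern.empty() || text.size() < pattern.size()) {
        return count;
    }
    for (std::size_t start = 0; start + pattern.size() <= text.size(); ++start) {
        std::size_t j = 0;
        while (j < pattern.size()) {
            ++count;  // one comparison of text[start + j] with pattern[j]
            if (text[start + j] != pattern[j]) {
                break;  // mismatch: backtrack to the next start position
            }
            ++j;
        }
        if (j == pattern.size()) {
            break;  // full match found
        }
    }
    return count;
}
```

For T = "aaaaaaab" (n = 8) and P = "aaab" (m = 4), each of the 5 start positions costs 4 comparisons, 20 in total, which is (n - m + 1) * m.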

Code
#include <cstring>

// Returns the index of the first occurrence of find in text, or -1 if there is none.
int BF(const char* text, const char* find)
{
    // Reject empty inputs.
    if (*text == '\0' || *find == '\0')
    {
        return -1;
    }

    size_t find_len = strlen(find);
    size_t text_len = strlen(text);
    if (text_len < find_len)
    {
        return -1;
    }

    const char* s = text;  // start of the current attempt in the target
    const char* p = s;     // current position in the target
    const char* q = find;  // current position in the pattern

    // Execute the BF algorithm.
    while (*p != '\0')
    {
        // Characters match: move both pointers forward.
        if (*p == *q)
        {
            p++;
            q++;
        }
        // Otherwise backtrack, using the recorded start position.
        else
        {
            // Advance s; p points to the backtracked location.
            s++;
            p = s;
            // q points to the start of the pattern again.
            q = find;
        }
        // The whole pattern has matched: return its position.
        if (*q == '\0')
        {
            return static_cast<int>((p - text) - (q - find));
        }
    }
    return -1;
}
Brute Force algorithm

KMP algorithm

The Knuth-Morris-Pratt algorithm (KMP for short) is an improved algorithm proposed by D. E. Knuth, J. H. Morris and V. R. Pratt. It eliminates the backtracking of the BF algorithm when matching string patterns. Put simply, KMP string pattern matching is an efficient algorithm for locating one string inside another.

Algorithm idea

Consider searching for the pattern T = "abcac" in S = "ababcabcacbaa". With the KMP matching algorithm, when the first mismatch occurs at S[2] and T[2], the index into S does not go back to 1. When the second mismatch occurs (at S[6] and T[4]), the index into S again does not back up and the index into T does not restart from the beginning; instead, the index into T is set according to the value of the pattern function (the next value) for T[4] == 'c'.

Key idea: when a mismatch occurs during matching:

If next[j] >= 0, the target string's pointer i is left unchanged and the pattern string's pointer j moves to next[j] to continue matching;

If next[j] = -1, move i one position to the right and set j to 0, then continue comparing.

Program outline:

If j = -1, or the current characters match (i.e. s[i] == p[j]), do i++, j++ and continue with the next character;

If j != -1 and the current characters do not match (i.e. s[i] != p[j]), leave i unchanged and set j = next[j]. This means that on a mismatch the pattern string P is shifted to the right by j - next[j] positions relative to the target string, and matching then resumes.
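The two rules above can be traced on the earlier example. The following C++ sketch records every (i, j) pair at which a mismatch occurs; kmp_search_traced is a hypothetical name, and the next table is assumed to be supplied by the caller rather than computed here.

```cpp
#include <string>
#include <utility>
#include <vector>

// Returns the index of the first match of p in s, or -1.
// next must hold one value per pattern character (next[0] == -1).
// Every mismatch position (i, j) is appended to mismatches.
int kmp_search_traced(const std::string& s, const std::string& p,
                      const std::vector<int>& next,
                      std::vector<std::pair<int, int>>& mismatches) {
    int i = 0;
    int j = 0;
    const int s_len = static_cast<int>(s.size());
    const int p_len = static_cast<int>(p.size());
    while (i < s_len && j < p_len) {
        if (j == -1 || s[i] == p[j]) {
            // Rule 1: advance both indices.
            ++i;
            ++j;
        } else {
            // Rule 2: keep i, fall back to next[j] in the pattern.
            mismatches.emplace_back(i, j);
            j = next[j];
        }
    }
    return (j == p_len) ? i - j : -1;
}
```

For the pattern "abcac" the next values are -1, 0, 0, 0, 1. Searching "ababcabcacbaa" with that table records mismatches at (2, 2) and (6, 4), never moves i backwards, and returns 5.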

Algorithm code
#include <cstring>

// next[] is assumed to be the table produced by GetNext for the pattern p;
// it is passed in explicitly here so that the function is self-contained.
int KmpSearch(const char* s, const char* p, const int next[])
{
    int i = 0;
    int j = 0;
    int sLen = static_cast<int>(strlen(s));
    int pLen = static_cast<int>(strlen(p));
    while (i < sLen && j < pLen)
    {
        // ① If j == -1, or the current characters match (s[i] == p[j]), do i++, j++
        if (j == -1 || s[i] == p[j])
        {
            i++;
            j++;
        }
        else
        {
            // ② If j != -1 and the current characters do not match (s[i] != p[j]),
            // i is unchanged and j = next[j]; next[j] is the next value for j
            j = next[j];
        }
    }
    if (j == pLen)
        return i - j;
    else
        return -1;
}
KMP Algorithm

Partial match table: the next array

The idea of the KMP algorithm was explained above; one of its keys is the next array.

Meaning of each value in the next array: it is the length of the longest identical prefix and suffix of the string that precedes the current character. For example, if next[j] = k, then the string before position j has an identical prefix and suffix of maximum length k.

Role of the next array in the KMP algorithm: when a character fails to match, the next array is consulted, and its value tells the pattern string which position to jump to. If next[j] is 0 or -1, matching restarts from the first character of the pattern (no prefix of the pattern equals a suffix). If next[j] = k with k > 0, the next comparison jumps to index k, a character before j, instead of the beginning, skipping k characters (the k-character suffix before j is the same as the k-character prefix of the pattern). For example, if next[j] = 2, the last 2 characters before position j are the same as the first 2 characters of the pattern string.

Maximum length table

To understand the partial match table, first master prefixes and suffixes.

Two concepts are needed: prefix and suffix. A "prefix" is any leading part of a string that excludes its last character; a "suffix" is any trailing part of a string that excludes its first character.

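If the given pattern string is "ABCDABD", traversing the whole pattern from left to right, the prefixes and suffixes of each of its substrings are listed below.

  A: prefixes (none); suffixes (none); maximum common length 0
  AB: prefixes A; suffixes B; maximum common length 0
  ABC: prefixes A, AB; suffixes C, BC; maximum common length 0
  ABCD: prefixes A, AB, ABC; suffixes D, CD, BCD; maximum common length 0
  ABCDA: prefixes A, AB, ABC, ABCD; suffixes A, DA, CDA, BCDA; maximum common length 1 (A)
  ABCDAB: prefixes A, AB, ABC, ABCD, ABCDA; suffixes B, AB, DAB, CDAB, BCDAB; maximum common length 2 (AB)
  ABCDABD: prefixes A, AB, ABC, ABCD, ABCDA, ABCDAB; suffixes D, BD, ABD, DABD, CDABD, BCDABD; maximum common length 0

The maximum length of the common elements of the prefixes and suffixes, for each substring of the original pattern (hereinafter the "maximum length table"), is therefore:

  Pattern character:  A  B  C  D  A  B  D
  Maximum length:     0  0  0  0  1  2  0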
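The maximum length table can be computed directly from the definition of prefix and suffix. The sketch below does this by brute force; max_length_table is an illustrative name, and this direct comparison is not the method KMP itself uses, only a way to check the values.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// For each substring pattern[0..i], finds the length of the longest string
// that is both a prefix and a suffix of it (excluding the whole substring).
std::vector<int> max_length_table(const std::string& pattern) {
    std::vector<int> table(pattern.size(), 0);
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const std::size_t len = i + 1;  // length of pattern[0..i]
        // Try candidate lengths from longest to shortest.
        for (std::size_t k = len - 1; k > 0; --k) {
            // Compare the first k characters with the last k characters.
            if (pattern.compare(0, k, pattern, len - k, k) == 0) {
                table[i] = static_cast<int>(k);
                break;
            }
        }
    }
    return table;
}
```

For "ABCDABD" this yields 0, 0, 0, 0, 1, 2, 0.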

Next array: the partial match table

The value of the next array at a character is the length of the longest identical prefix and suffix of the string before that character, whereas the maximum length table gives the longest identical prefix and suffix of the string up to and including the current character. So once the maximum lengths of the common prefix and suffix elements have been obtained in step ①, a small transformation yields the next array: shift the values obtained in step ① one position to the right and set the first value to -1, as shown in the following table:

  Pattern character:  A  B  C  D  A  B  D
  next:              -1  0  0  0  0  1  2
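The shift described above is a one-line transformation. In this sketch, next_from_max_length is a hypothetical helper name, and the maximum length table is assumed to be given as a vector of integers.

```cpp
#include <cstddef>
#include <vector>

// Shifts the maximum length table one position to the right and puts -1
// in the first slot; the last value of the input is dropped.
std::vector<int> next_from_max_length(const std::vector<int>& max_len) {
    std::vector<int> next(max_len.size());
    for (std::size_t j = 0; j < max_len.size(); ++j) {
        next[j] = (j == 0) ? -1 : max_len[j - 1];
    }
    return next;
}
```

Applied to the maximum length table of "ABCDABD" (0, 0, 0, 0, 1, 2, 0), it gives -1, 0, 0, 0, 0, 1, 2.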

Algorithm idea

Suppose next[j] = k is already known. Then:

If p[k] == p[j], then next[j + 1] = next[j] + 1 = k + 1;

If p[k] ≠ p[j]: if p[next[k]] == p[j], then next[j + 1] = next[k] + 1; otherwise keep recursing on the prefix index, k = next[k], and repeat the process.

Assume the given pattern string is ABCDABCE and that next[j] = k (that is, "p0 ... pk-1" = "pj-k ... pj-1" = AB, so k is 2). What is next[j + 1]? Because pk = pj = C, next[j + 1] = next[j] + 1 = k + 1 (so next[j + 1] = 3). This means that the part of the pattern before the character E has an identical prefix and suffix of length k + 1.

But what if pk != pj? Then "p0 ... pk-1 pk" ≠ "pj-k ... pj-1 pj". In other words, when pk != pj, how long is the longest identical prefix and suffix before the character E? Suppose pj is D rather than C (that is, the pattern is ABCDABDE). Because C is different from D, ABC is not the same as ABD, so the part of the pattern before the character E has no identical prefix and suffix of length k + 1, and we can no longer simply set next[j + 1] = next[j] + 1.

If, by repeatedly recursing on the prefix index k = next[k] within the prefix "p0 ... pk-1 pk", we find a character pk' that is also D, that is pk' = pj, with "p0 ... pk'-1 pk'" = "pj-k' ... pj-1 pj", then the longest identical prefix and suffix has length k' + 1, and thus next[j + 1] = k' + 1 (which is next[k] + 1 when one recursion step is enough). Otherwise the prefix contains no D, which means there is no identical prefix and suffix, and next[j + 1] = 0.
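Both cases of the recurrence can be checked with a short program. This is a sketch using std::string and std::vector; build_next is an illustrative name, and the second pattern, ABCDABDE, is the assumed variant of the example in which pj is D.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds the next table for pattern p using the recurrence:
// if p[k] == p[j] then next[j + 1] = k + 1, otherwise k = next[k] and retry.
std::vector<int> build_next(const std::string& p) {
    std::vector<int> next(p.size(), 0);
    if (p.empty()) {
        return next;
    }
    next[0] = -1;
    int k = -1;          // k holds next[j] for the current j
    std::size_t j = 0;
    while (j + 1 < p.size()) {
        if (k == -1 || p[k] == p[j]) {
            // Either no shorter prefix is left (k == -1, giving 0)
            // or the prefix can be extended by one character.
            ++k;
            ++j;
            next[j] = k;
        } else {
            // Fall back to the next shorter identical prefix and suffix.
            k = next[k];
        }
    }
    return next;
}
```

For ABCDABCE the value computed for the character E is 3, since ABC is both a prefix and a suffix of ABCDABC. For ABCDABDE it is 0, since the fallback never finds a D in the prefix.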

Code:
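#include <cstring>

// Fills next[0 .. strlen(p) - 1]; next must have room for at least one element.
void GetNext(const char* p, int next[])
{
    int pLen = static_cast<int>(strlen(p));
    next[0] = -1;
    int k = -1;
    int j = 0;
    while (j < pLen - 1)
    {
        // p[k] is the end of the prefix, p[j] is the end of the suffix
        if (k == -1 || p[j] == p[k])
        {
            ++k;
            ++j;
            next[j] = k;
        }
        else
        {
            // Fall back to a shorter prefix. Often k is 0 here, so next[k] is -1
            // and the next iteration takes the upper branch, where k becomes 0
            // again and a new prefix starts.
            k = next[k];
        }
    }
}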
Partial Matching Group
