String matching (analysis of the BF, BM, Sunday and KMP algorithms)

Source: Internet
Author: User

String matching has long been one of the most widely studied problems in computer science, and many algorithms have been proposed for it. String matching algorithms have strong practical value and are applied in information search, spell checking, bioinformatics and other fields.
Today we introduce some of the better-known algorithms:
1. BF
2. BM
3. Sunday
4. KMP

One, BF algorithm
The BF (Brute Force) algorithm, also known as the brute-force matching algorithm, is a common pattern matching algorithm.

Its idea is very simple: starting from character pos of the main string s, compare it with the first character of the pattern string T. If they are equal, move on to the next character of both the main string and the pattern string. If they are not equal, restart the comparison from character pos+1 of the main string s against the first character of T.
This continues until every character of the pattern string T equals a consecutive run of characters in the main string s. The match then succeeds, and the position in s of the first character of T is returned. If the main string has been traversed without success, the match fails.

The algorithm steps are as follows:

Subscript i       0  1  2  3  4  5  6  7  8  9
main string s     a  b  a  b  b  a  b  c  a  c
pattern string T  a  b  c  a

First comparison, from left to right: s[0] = T[0], so i++ and j++; s[1] = T[1], so i++ and j++ again; then s[2] != T[2], so the main string backtracks and the comparison starts again from s[1].

Subscript i       0  1  2  3  4  5  6  7  8  9
main string s     a  b  a  b  b  a  b  c  a  c
pattern string T     a  b  c  a

That is, the starting position i in the main string begins at 0 and is increased by one each time the comparison fails, after which the comparison is repeated, until

Subscript i       0  1  2  3  4  5  6  7  8  9
main string s     a  b  a  b  b  a  b  c  a  c
pattern string T                 a  b  c  a

At i = 5 the match succeeds, and i = 5 is returned.

From this it follows that in the worst case the BF algorithm needs about M * N comparisons, where M is the length of the main string and N is the length of the pattern string, so its time complexity is O(M * N).
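To make the M * N bound concrete, here is a minimal sketch that counts character comparisons during a brute-force search. The helper name bf_count and its count parameter are hypothetical and are not part of the article's code; this variant also stops as soon as the pattern no longer fits, so its counts may differ slightly from the BF function in this article.

```c
#include <string.h>

/* Hypothetical helper: brute-force search that also counts character
   comparisons. Returns the match index or -1; *count receives the total. */
int bf_count(const char *s, const char *t, long *count)
{
    size_t m = strlen(s), n = strlen(t);
    *count = 0;
    if (n > m) return -1;
    for (size_t pos = 0; pos + n <= m; pos++) {
        size_t j = 0;
        while (j < n) {
            (*count)++;                     /* one comparison of s[pos+j] with t[j] */
            if (s[pos + j] != t[j]) break;
            j++;
        }
        if (j == n) return (int)pos;        /* every pattern character matched */
    }
    return -1;
}
```

For the example above (main string "ababbabcac", pattern "abca") this sketch finds the match at 5 after 13 comparisons. A main string made of repeated a followed by one b, searched for "aab", forces a full pattern scan at every position, which gives (M - N + 1) * N comparisons.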

The code implementation is as follows:

#include <string.h>

int BF(const char *str1, const char *str2)
{
    int str1_len, str2_len;
    int i = 0;
    int j = 0;

    /* Check for NULL before calling strlen */
    if (str1 == NULL || str2 == NULL) {
        return -1;
    }
    str1_len = (int)strlen(str1);
    str2_len = (int)strlen(str2);

    while (i < str1_len && j < str2_len) {
        if (str1[i] == str2[j]) {
            /* Equal: keep comparing */
            i++;
            j++;
        } else {
            /* Not equal: the main string backtracks and the comparison restarts */
            i = i - j + 1;
            j = 0;
        }
    }
    if (j == str2_len) {
        return i - j;
    } else {
        return -1;
    }
}

The BF algorithm is easy to understand, but it is very inefficient: after every failure it backtracks and throws away the result of the previous comparisons, which leads to many useless comparisons and a large loss of time.

Two, BM algorithm
The BM (Boyer-Moore) algorithm was proposed in 1977 by Robert S. Boyer and J Strother Moore. Its running time is commonly described as O(n), and in practice it needs far fewer comparisons than BF.

The BM algorithm achieves higher efficiency in pattern matching mainly because it builds two jump tables, the bad character table and the good suffix table. The two tables correspond to the two rules of the BM algorithm:

bad character rule : When a character in the text string does not match the corresponding character in the pattern string, the mismatched character in the text string is called the bad character. The pattern string then needs to move to the right, and the number of positions moved = the position in the pattern string aligned with the bad character - the rightmost position at which the bad character occurs in the pattern string. If the bad character does not occur in the pattern string, this rightmost position is taken to be -1.
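The bad character rule can be sketched as a small function. The name bad_char_shift and its parameters are hypothetical, chosen only for illustration; the function scans the pattern directly instead of using a precomputed table.

```c
#include <string.h>

/* Hypothetical helper: shift given by the bad character rule.
   pos is the pattern index where the mismatch happened, bad is the
   text character found at that position. */
int bad_char_shift(const char *pattern, int pos, char bad)
{
    int m = (int)strlen(pattern);
    int rightmost = -1;                 /* stays -1 when bad is not in the pattern */
    for (int k = 0; k < m; k++) {
        if (pattern[k] == bad) rightmost = k;
    }
    return pos - rightmost;
}
```

With the pattern "EXAMPLE", a mismatch at position 6 against 'S' gives 6 - (-1) = 7, and against 'P' gives 6 - 4 = 2. Note that the result can be zero or negative when the rightmost occurrence lies to the right of the mismatch position, which is one reason the rule is combined with the good suffix rule.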

good suffix rule : When a character mismatches, the number of positions moved = the position of the good suffix in the pattern string - the position of the previous occurrence of the good suffix in the pattern string. If the good suffix does not occur again in the pattern string, the previous position is taken to be -1.
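The good suffix rule, as stated here, can be sketched as follows. The name good_suffix_shift is hypothetical, and two details are assumptions: positions are measured at the last character of each occurrence, and a shorter good suffix only counts when it sits at the head of the pattern. This is the simple form of the rule, without the extra check on the preceding character that table-based implementations often add.

```c
#include <string.h>

/* Hypothetical helper: shift given by the good suffix rule for a good
   suffix of length len (1 <= len < strlen(pattern)). */
int good_suffix_shift(const char *pattern, int len)
{
    int m = (int)strlen(pattern);
    const char *suffix = pattern + m - len;

    /* Look for an earlier occurrence of the whole good suffix, rightmost first. */
    for (int start = m - len - 1; start >= 0; start--) {
        if (memcmp(pattern + start, suffix, (size_t)len) == 0) {
            return (m - 1) - (start + len - 1);
        }
    }
    /* Otherwise a shorter good suffix counts only if it is a prefix of the pattern. */
    for (int l = len - 1; l >= 1; l--) {
        if (memcmp(pattern, pattern + m - l, (size_t)l) == 0) {
            return (m - 1) - (l - 1);
        }
    }
    return (m - 1) - (-1);              /* no reoccurrence: previous position is -1 */
}
```

For "EXAMPLE" with the good suffix "MPLE" (len = 4), only "E" reappears at the head, so the shift is 6 - 0 = 6. For "ABCDAB" with the good suffix "AB", the earlier "AB" ends at position 1, so the shift is 5 - 1 = 4.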

The algorithm steps are as follows, with "EXAMPLE" as the pattern string:
1. First, align the heads of the text string and the pattern string, and start comparing from the tail. "S" does not match "E". "S" is then called the bad character, that is, the mismatched character, and it is aligned with position 6 of the pattern string (numbered from 0). "S" does not occur in the pattern string "EXAMPLE" (so its rightmost position is -1), which means the pattern string can move 6 - (-1) = 7 positions, directly to the position after "S".

2. Comparing from the tail again, "P" does not match "E", so "P" is the bad character. However, "P" does occur in the pattern string "EXAMPLE". The bad character "P" is aligned with position 6 of the pattern string (numbered from 0), and its rightmost occurrence in the pattern string is at position 4, so the pattern string moves 6 - 4 = 2 positions and the two "P" characters are aligned.

3. Comparing backward from the tail, "E", "L", "P" and "M" match in turn. "MPLE" is called the good suffix, that is, the string of characters matched at the tail. Note that "MPLE", "PLE", "LE" and "E" are all good suffixes.

4. Next, "I" does not match "A", so "I" is the bad character. By the bad character rule the pattern string should move 2 - (-1) = 3 positions. The question is whether there is a better move.

5. A better move comes from the good suffix rule: when a character mismatches, the number of positions moved = the position of the good suffix in the pattern string - the position of its previous occurrence in the pattern string, and if the good suffix does not occur again, the previous position is -1.
Of all the good suffixes (MPLE, PLE, LE, E), only "E" also appears at the head of "EXAMPLE", so the pattern string moves 6 - 0 = 6 positions.
As can be seen, the bad character rule allows a move of only 3 positions, while the good suffix rule allows 6. Each time, the pattern string moves by the larger of the two values. The values given by the two rules depend only on the pattern string and not on the text string.

6. Comparing from the tail again, "P" does not match "E", so "P" is the bad character and, by the bad character rule, the pattern string moves 6 - 4 = 2 positions. Because the mismatch is at the last character, there is no good suffix yet.

After this move the whole pattern string matches and the search is complete. By applying the bad character and good suffix rules, the BM algorithm lets the pattern string move by more positions on each mismatch, which makes it more efficient than the step-by-step matching of BF.
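The walkthrough above can be replayed with a short sketch that applies both rules directly and records each shift. The function names are hypothetical, and the text string "HERE IS A SIMPLE EXAMPLE" is an assumption: the steps above do not show the text string, but this one is consistent with every mismatch they mention. The rules are evaluated by scanning the pattern on each mismatch rather than from precomputed tables, so this is an illustration and not a fast implementation.

```c
#include <string.h>

/* Bad character rule: mismatch position minus rightmost position of bad (-1 if absent). */
int bc_shift(const char *p, int m, int pos, char bad)
{
    int rightmost = -1;
    for (int k = 0; k < m; k++) {
        if (p[k] == bad) rightmost = k;
    }
    return pos - rightmost;
}

/* Good suffix rule for a good suffix of length len; returns 0 when len is 0. */
int gs_shift(const char *p, int m, int len)
{
    if (len == 0) return 0;
    for (int start = m - len - 1; start >= 0; start--) {
        if (memcmp(p + start, p + m - len, (size_t)len) == 0) return m - len - start;
    }
    for (int l = len - 1; l >= 1; l--) {
        if (memcmp(p, p + m - l, (size_t)l) == 0) return m - l;
    }
    return m;
}

/* Searches text for p and stores the first max_shifts shifts in shifts[].
   Returns the match index or -1; *n_shifts receives the number of shifts taken. */
int bm_trace(const char *text, const char *p, int *shifts, int max_shifts, int *n_shifts)
{
    int n = (int)strlen(text), m = (int)strlen(p);
    int align = 0;                          /* text index aligned with p[0] */
    *n_shifts = 0;
    while (align + m <= n) {
        int i = m - 1;
        while (i >= 0 && p[i] == text[align + i]) i--;   /* compare from the tail */
        if (i < 0) return align;            /* whole pattern matched */
        int bc = bc_shift(p, m, i, text[align + i]);
        int gs = gs_shift(p, m, m - 1 - i);
        int shift = bc > gs ? bc : gs;      /* move by the larger of the two rules */
        if (*n_shifts < max_shifts) shifts[*n_shifts] = shift;
        (*n_shifts)++;
        align += shift;
    }
    return -1;
}
```

Under that assumption the recorded shifts are 7, 2, 6 and 2, matching steps 1, 2, 5 and 6, and the pattern is found at index 17 of the text.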

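The time complexity of the BM algorithm is usually quoted as O(n), where n is the length of the text string. More precisely, with a pattern string of length m, the best case needs only about n / m character comparisons because whole stretches of the text are skipped, while the basic algorithm can degrade to O(n * m) in the worst case when the pattern occurs many times in the text.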

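The main work in the code is building the two tables.
Bad character table:

void Buildbadc(const char* pattern, size_t pattern_length, unsigned int* badc, size_t alphabet_size)
{
    unsigned int i;

    /* Characters that do not occur in the pattern get the full pattern length */
    for (i = 0; i < alphabet_size; ++i)
    {
        badc[i] = pattern_length;
    }
    /* Later (further right) occurrences overwrite earlier ones. The indexing
       assumes every character lies in 'A' .. 'A' + alphabet_size - 1 */
    for (i = 0; i < pattern_length; ++i)
    {
        badc[pattern[i] - 'A'] = pattern_length - 1 - i;
    }
}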

Good suffix table:

#include <string.h>

/* Assumes pattern_length >= 1. goods[i] is the amount added to the text
   index after a mismatch at pattern position i. */
void Buildgoods(const char* pattern, size_t pattern_length, unsigned int* goods)
{
    unsigned int i, j, k, c;

    for (i = 0; i < pattern_length - 1; ++i) {
        goods[i] = pattern_length;
    }
    /* Initialize the good suffix value of the last element of pattern */
    goods[pattern_length - 1] = 1;

    /* Find the pre value of each element; goods is first used as the pre array.
       pre is where the good suffix reoccurs with a different preceding character. */
    for (i = pattern_length - 1; i != 0; --i) {
        for (j = 1; j < i; ++j) {
            if (memcmp(pattern + i, pattern + j, (pattern_length - i) * sizeof(char)) == 0
                && pattern[i - 1] != pattern[j - 1]) {
                goods[i - 1] = j - 1;
            }
        }
    }
    /* Compute the goods values from the pre values. c is the longest prefix of
       pattern that is also a suffix and is no longer than the good suffix. */
    for (k = 0, c = 0; k + 1 < pattern_length; ++k) {
        i = pattern_length - 2 - k;     /* good suffix of length k + 1 */
        if (memcmp(pattern, pattern + i + 1, (k + 1) * sizeof(char)) == 0) {
            c = k + 1;
        }
        if (goods[i] != pattern_length) {
            goods[i] = pattern_length - 1 - goods[i];
        } else {
            goods[i] = pattern_length - 1 - i + pattern_length - c;
        }
    }
}

Once the tables are built, the BM algorithm itself is simple.

/* Assumption: text and pattern contain only the characters 'A' .. 'Z' */
#define ALPHABET_SIZE 26

/* Stores the start index of every match in matches (which must be large
   enough) and returns the number of matches. Assumes pattern_length >= 1. */
unsigned int BM(const char* text, size_t text_length, const char* pattern, size_t pattern_length, unsigned int* matches)
{
    unsigned int i, j, m;
    unsigned int badc[ALPHABET_SIZE];
    unsigned int goods[pattern_length];

    i = j = pattern_length - 1;
    m = 0;

    /* Build the bad character and good suffix tables */
    Buildbadc(pattern, pattern_length, badc, ALPHABET_SIZE);
    Buildgoods(pattern, pattern_length, goods);

    while (j < text_length) {
        /* Compare text and pattern backward from the last position of pattern */
        while ((i != 0) && (pattern[i] == text[j])) {
            --i;
            --j;
        }
        if (i == 0 && pattern[i] == text[j]) {
            /* Found a match */
            matches[m++] = j;
            j += goods[0];
        } else {
            /* Move by the larger of the good suffix and bad character values */
            j += goods[i] > badc[text[j] - 'A'] ? goods[i] : badc[text[j] - 'A'];
        }
        i = pattern_length - 1;
    }
    return m;
}

The ideas behind the BM algorithm are very important in string matching: the good suffix rule is close to the idea of the KMP algorithm, and the bad character jump is similar to the Sunday algorithm. In general, BM is a very efficient matching algorithm.

Three, Sunday algorithm
The Sunday algorithm is a string pattern matching algorithm proposed by Daniel M. Sunday in 1990. Articles on the Internet often claim that the Sunday algorithm is more efficient than BM and KMP (I have not verified this). In my opinion, the Sunday algorithm is an optimization of the BM algorithm: its logic is easy to understand and its code is easier to implement (there is no need to build two tables).

The idea of the Sunday algorithm is:
1. Start from the first character of the text string s and compare it with the first character of the pattern string T. If they are equal, move one character forward in both the text string and the pattern string and continue comparing;
2. If they are not equal, take the character of the text string just after the last character currently taking part in the match, and search for it in the pattern string from right to left.
3. If the pattern string does not contain that character, skip past it entirely, that is, number of positions moved = pattern string length + 1.
4. If the pattern string contains that character, move the pattern string so that this occurrence is aligned with the character in the text string. Number of positions moved = distance from that occurrence to the right end of the pattern string + 1.
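Steps 2 to 4 amount to a single lookup, sketched below. The name sunday_shift is hypothetical; next_char stands for the text character just after the current window.

```c
#include <string.h>

/* Hypothetical helper: how far the Sunday algorithm moves the pattern when
   next_char is the text character just past the current window. */
int sunday_shift(const char *pattern, char next_char)
{
    int m = (int)strlen(pattern);
    for (int k = m - 1; k >= 0; k--) {      /* search from right to left */
        if (pattern[k] == next_char) {
            return m - k;                   /* distance to the right end, plus 1 */
        }
    }
    return m + 1;                           /* absent: skip past next_char */
}
```

For the pattern "cbba" this gives 5 for 'd' (not in the pattern), 2 for 'b' (the rightmost b is one position from the right end) and 1 for 'a'.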

The algorithm steps are as follows:

Subscript i       0  1  2  3  4  5  6  7  8  9  10 11 12
main string s     c  b  a  b  d  c  b  a  c  b  b  a  d
pattern string T  c  b  b  a

Comparing from front to back, the mismatch occurs at i = 2. We then check whether the character at i = 4 (d) is contained in the pattern string T. Searching the pattern string shows that it is not, so the pattern T moves to i = 5 and the comparison continues. (After a mismatch, the character (d) just after the part of the main string compared in this round would have to take part in the next round. Since T contains no d, no alignment that covers d can succeed, so the pattern moves directly to the position after d.)

Subscript i       0  1  2  3  4  5  6  7  8  9  10 11 12
main string s     c  b  a  b  d  c  b  a  c  b  b  a  d
pattern string T                 c  b  b  a

The mismatch occurs at i = 7. Look at i = 9 (b) and search the pattern string for b: the pattern characters now at i = 6 and i = 7 are both b. Which one should be aligned? The later one (i = 7), which is why the pattern string is searched from back to front and the search breaks at the first hit. That b is then aligned with i = 9. (Why the later b? Aligning the earlier one would move the pattern further and could skip over a possible match.)

Subscript i       0  1  2  3  4  5  6  7  8  9  10 11 12
main string s     c  b  a  b  d  c  b  a  c  b  b  a  d
pattern string T                       c  b  b  a

The mismatch occurs at i = 7. Look at i = 11 (a): the last character of the pattern string is a, so the search breaks immediately and that a is aligned with i = 11. Continue comparing.

Subscript i       0  1  2  3  4  5  6  7  8  9  10 11 12
main string s     c  b  a  b  d  c  b  a  c  b  b  a  d
pattern string T                          c  b  b  a

Match completed.

This shows that the core idea of the Sunday algorithm is to move the pattern string as far as possible after a mismatch, so that the number of comparisons is reduced and efficiency improves.
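The sequence of alignments in the example can be reproduced with a table-based sketch. This is a variant for illustration, not the implementation given in this article: the name sunday_trace, the 256-entry shift table and the use of memcmp for each window are all assumptions made to keep the example short.

```c
#include <string.h>

/* Table-based Sunday search. Stores the first max_starts window start
   positions in starts[]; *n_starts receives how many windows were tried.
   Returns the match index or -1. */
int sunday_trace(const char *text, const char *p, int *starts, int max_starts, int *n_starts)
{
    int n = (int)strlen(text), m = (int)strlen(p);
    int shift[256];

    for (int c = 0; c < 256; c++) shift[c] = m + 1;                  /* absent characters */
    for (int k = 0; k < m; k++) shift[(unsigned char)p[k]] = m - k;  /* rightmost wins */

    *n_starts = 0;
    for (int pos = 0; pos + m <= n; ) {
        if (*n_starts < max_starts) starts[*n_starts] = pos;
        (*n_starts)++;
        if (memcmp(text + pos, p, (size_t)m) == 0) return pos;
        if (pos + m >= n) break;            /* no character after the window */
        pos += shift[(unsigned char)text[pos + m]];
    }
    return -1;
}
```

For the main string "cbabdcbacbbad" and the pattern "cbba", the windows tried start at 0, 5, 7 and 8, and the match is found at 8, as in the tables above.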

Differences between Sunday and BM:
1. BM matches from back to front, Sunday from front to back.
2. On a mismatch, BM looks at characters up to "the last one" of the current window, while Sunday looks at "the one after the last".
3. BM distinguishes bad characters and good suffixes, Sunday does not. (The idea is similar, though: the case where the bad character or good suffix occurs in the pattern string corresponds to Sunday finding the character in the pattern string, and the case where they do not occur corresponds to Sunday not finding the character in the pattern string.)

Code implementation:

#include <string.h>

int Sunday(const char *str1, const char *str2)
{
    int str1_len, str2_len;
    int i = 0;
    int j = 0;
    enum { FALSE, TRUE };
    int Y = FALSE;

    if (str1 == NULL || str2 == NULL) {
        return -1;
    }
    str1_len = (int)strlen(str1);
    str2_len = (int)strlen(str2);

    while (i < str1_len && j < str2_len) {
        if (str1[i] == str2[j]) {
            i++;
            j++;
        } else {
            /* Look at the character just after the current window */
            int num = i - j + str2_len;
            if (num >= str1_len) {
                /* No such character: the pattern cannot fit any more */
                return -1;
            }
            for (j = str2_len - 1; j >= 0; j--) {
                /* Search the pattern from right to left; if found, align with it */
                if (str1[num] == str2[j]) {
                    i = num - j;
                    Y = TRUE;
                    break;
                }
            }
            if (Y == FALSE) {
                /* Not in the pattern: jump to the position after that character */
                i = num + 1;
            }
            Y = FALSE;
            j = 0;
        }
    }
    if (j == str2_len) {
        return i - str2_len;
    }
    return -1;
}

The Sunday algorithm is very practical: on typical text it is often reported to be faster than the KMP and BM algorithms, although its worst case is still O(n * m). Its code is also very simple, so I hope everyone can master it.

Four, KMP algorithm
The Knuth-Morris-Pratt string search algorithm, "KMP algorithm" for short, was published jointly by Donald Knuth, Vaughan Pratt and James H. Morris in 1977 and is named after the three of them.
Explanations of the KMP algorithm on the Internet tend to be very complicated. I think it comes down to two points: a matching rule similar to the good suffix rule of the BM algorithm, and the derivation of the next[] array.

The idea of the algorithm is: suppose the text string s has been matched up to position i and the pattern string P up to position j. If j = -1 (a marker), or the current characters match (that is, s[i] == P[j]), then i++ and j++, and matching continues with the next character;
If j != -1 and the current characters do not match (that is, s[i] != P[j]), then i stays unchanged and j = next[j]. This means that on a mismatch the pattern string P moves j - next[j] positions to the right relative to the text string s. In other words, when the match fails, the number of positions the pattern string moves right is: the position of the mismatched character - the next value of the mismatched character. This value j - next[j] is always greater than or equal to 1.

The value in the next array is the length of the longest identical prefix and suffix of the string before the mismatched character. For example, next[j] = k means that the string before position j has an identical prefix and suffix of maximum length k.
This also means that when a character mismatches, its next value tells you where to continue: the pattern string moves right by j - next[j] and matching resumes at pattern position next[j]. So the key is to compute next[].

For example:
ABCDAB ABCDABC ABCDABCDABDABD   text string
ABCDABD                         pattern string

Maximum length of identical prefix and suffix:

  Substring  Prefixes                         Suffixes                         Max length
  A          (none)                           (none)                           0
  AB         A                                B                                0
  ABC        A, AB                            C, BC                            0
  ABCD       A, AB, ABC                       D, CD, BCD                       0
  ABCDA      A, AB, ABC, ABCD                 A, DA, CDA, BCDA                 1
  ABCDAB     A, AB, ABC, ABCD, ABCDA          B, AB, DAB, CDAB, BCDAB          2
  ABCDABD    A, AB, ABC, ABCD, ABCDA, ABCDAB  D, BD, ABD, DABD, CDABD, BCDABD  0

  Table of maximum common prefix-suffix lengths
  A   B   C   D   A   B   D
  0   0   0   0   1   2   0
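The table can be checked with a brute-force sketch that, for every prefix of the pattern, tries each proper prefix length from longest to shortest. The name prefix_suffix_table is hypothetical, and the routine mirrors the table rather than aiming for speed.

```c
#include <string.h>

/* Hypothetical helper: for each prefix p[0..i] of the pattern, store in
   maxlen[i] the length of its longest proper prefix that is also a suffix. */
void prefix_suffix_table(const char *p, int *maxlen)
{
    int m = (int)strlen(p);
    for (int i = 0; i < m; i++) {
        maxlen[i] = 0;
        for (int len = i; len >= 1; len--) {        /* proper: len < i + 1 */
            if (memcmp(p, p + i + 1 - len, (size_t)len) == 0) {
                maxlen[i] = len;
                break;                              /* longest one found first */
            }
        }
    }
}
```

For "ABCDABD" it fills maxlen with 0 0 0 0 1 2 0, the same values as in the table.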

The next array considers the longest identical prefix and suffix excluding the current character. So once the maximum common prefix-suffix lengths have been obtained by the steps above, only a small transformation is needed: shift the values one position to the right and set the first value to -1, as shown in the following table:

A    B    C    D    A    B    D
-1   0    0    0    0    1    2

On a mismatch, subtracting next[j] from the mismatch position j gives the number of positions the pattern string moves.
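The two operations just described, shifting the lengths right by one and computing j - next[j], can be sketched as follows. Both function names are hypothetical, and maxlen is assumed to hold the maximum common prefix-suffix lengths of the pattern.

```c
/* Hypothetical helper: build next[] by shifting the prefix-suffix lengths
   one place to the right and putting -1 in front. m is the pattern length. */
void next_from_maxlen(const int *maxlen, int m, int *next)
{
    if (m == 0) return;
    next[0] = -1;
    for (int j = 1; j < m; j++) {
        next[j] = maxlen[j - 1];
    }
}

/* Number of positions the pattern moves right after a mismatch at j. */
int kmp_shift(const int *next, int j)
{
    return j - next[j];
}
```

With the lengths 0 0 0 0 1 2 0 of "ABCDABD" this yields next = -1 0 0 0 0 1 2. A mismatch at the final D (j = 6) moves the pattern 6 - 2 = 4 positions, and a mismatch at the first character (j = 0) moves it 0 - (-1) = 1 position.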

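The code for computing next[] is as follows:

#include <stdlib.h>
#include <string.h>

/* The caller must free() the returned array */
int *get_next(const char *str2)
{
    int str2_len = (int)strlen(str2);
    /* + 1 so that next[0] is valid even for an empty pattern */
    int *next = (int *)malloc(sizeof(int) * (str2_len + 1));
    int left = -1;
    int right = 0;

    if (next == NULL) {
        return NULL;
    }
    next[0] = -1;

    while (right < str2_len - 1) {
        if (left == -1 || str2[left] == str2[right]) {
            left++;
            right++;
            next[right] = left;
        } else {
            left = next[left];
        }
    }
    return next;
}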

This code exists only to compute the values of next[]. It has no other meaning, and any code that produces the correct next[] would do (most implementations on the Internet seem to compute next[] this way).

index  0    1    2    3    4    5    6
next   -1   0    0    0    0    1    2

Let's verify that the code is correct.
str2_len - 1 = 6
left = -1
right = 0
--------------------------------------------
left == -1, so:
left = 0
right = 1
next[1] = 0

index  0    1    2    3    4    5    6
next   -1   0
--------------------------------------------
ABCDABD
right = 1, str2[0] != str2[1], so:
left = next[0] = -1
--------------------------------------------
left == -1, so:
left = 0
right = 2
next[2] = 0

index  0    1    2    3    4    5    6
next   -1   0    0

As can be seen, the values are correct (the remaining steps are not worked through here). The code is cleverly designed and computes exactly the values of next[]. (You do not have to understand this code in depth. It is enough to remember it as an algorithm designed specifically for computing next[].)
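For readers who want to finish the deduction mechanically, here is a minimal sketch of the same left/right scheme. It differs from get_next in two assumed details: it writes into a caller-supplied array instead of allocating one, and it returns the number of loop iterations so the trace length can be checked. The name build_next is hypothetical.

```c
#include <string.h>

/* Same left/right scheme as get_next, writing into next[] (which must hold
   at least strlen(p) ints). Returns the number of loop iterations. */
int build_next(const char *p, int *next)
{
    int m = (int)strlen(p);
    int left = -1, right = 0, steps = 0;
    if (m == 0) return 0;
    next[0] = -1;
    while (right < m - 1) {
        steps++;
        if (left == -1 || p[left] == p[right]) {
            left++;
            right++;
            next[right] = left;     /* longest prefix-suffix length before right */
        } else {
            left = next[left];      /* fall back to a shorter candidate prefix */
        }
    }
    return steps;
}
```

For "ABCDABD" the loop runs 9 times and produces -1 0 0 0 0 1 2, which agrees with the table derived by hand.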

Once the values of next[] are available, the code for the KMP algorithm is very simple.

#include <stdlib.h>
#include <string.h>

int KMP(const char *str1, const char *str2)
{
    int str1_len, str2_len;
    int *next;
    int i = 0;
    int j = 0;

    if (str1 == NULL || str2 == NULL) {
        return -1;
    }
    str1_len = (int)strlen(str1);
    str2_len = (int)strlen(str2);
    next = get_next(str2);
    if (next == NULL) {
        return -1;
    }
    while (i < str1_len && j < str2_len) {
        if (j == -1 || str1[i] == str2[j]) {
            i++;
            j++;
        } else {
            /* Key step: on a mismatch, jump according to next[] */
            j = next[j];
        }
    }
    free(next);
    if (j == str2_len) {
        return i - str2_len;
    }
    return -1;
}

Time complexity of KMP:
If a character matches, the position of the first character of the pattern string stays fixed and only i++ and j++ happen. If the match fails, i stays unchanged (that is, i does not backtrack) and the pattern string skips over the next[j] characters already known to match. In the worst case the match only succeeds when the first character of the pattern string is at position i - j near the end of the text, and the algorithm ends.
Therefore, if the length of the text string is n and the length of the pattern string is m, the matching process takes O(n) time and computing next takes O(m) time, so the overall time complexity of KMP is O(m + n).
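The O(n) bound on the matching phase can be observed by counting loop iterations, as in the sketch below. The name kmp_count and its steps parameter are hypothetical. The function builds next[] inline with the same left/right scheme so that the example is self-contained.

```c
#include <stdlib.h>
#include <string.h>

/* KMP search that also counts iterations of the matching loop.
   Returns the match index or -1; *steps receives the iteration count. */
int kmp_count(const char *s, const char *p, long *steps)
{
    int n = (int)strlen(s), m = (int)strlen(p);
    int i = 0, j = 0, left = -1, right = 0;
    int *next;

    *steps = 0;
    if (m == 0) return 0;
    next = malloc(sizeof(int) * (size_t)m);
    if (next == NULL) return -1;

    next[0] = -1;                           /* build next[] in O(m) */
    while (right < m - 1) {
        if (left == -1 || p[left] == p[right]) {
            left++;
            right++;
            next[right] = left;
        } else {
            left = next[left];
        }
    }
    while (i < n && j < m) {                /* matching phase */
        (*steps)++;
        if (j == -1 || s[i] == p[j]) {
            i++;
            j++;
        } else {
            j = next[j];                    /* i never moves backward */
        }
    }
    free(next);
    return j == m ? i - m : -1;
}
```

Each iteration either advances i or lowers j, and j can only be lowered about as often as it was raised, so the loop runs at most roughly 2n times. A text of repeated a followed by b, searched for "aaab", triggers a fallback at almost every text position and still stays within that bound.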

These are the four classical string matching algorithms. Working through them by hand on paper a few times helps in understanding the ideas behind them.
