Algorithms: The BM Algorithm for String Matching

Source: Internet
Author: User
Tags: array, definition

Preface

Boyer-Moore (BM) is a string matching algorithm based on suffix matching: at each alignment the pattern is compared with the text from right to left, while the pattern itself still moves along the text from left to right. In practice, BM is usually more efficient than the KMP algorithm described earlier. The algorithm has two phases, preprocessing and searching. Preprocessing takes O(m + sigma) time and space, where m is the pattern length and sigma is the size of the character set (typically 256). Searching takes O(m * n) time in the worst case and O(n / m) in the best case, where n is the text length.
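
The right-to-left comparison at a single alignment can be sketched as follows. This is only an illustration: the helper name firstMismatchFromRight is hypothetical and does not appear in the article.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the article): compare pat with txt at
// alignment s, scanning the pattern from right to left.
// Returns the pattern index of the first mismatch found, or -1 if the
// whole pattern matches at this alignment.
int firstMismatchFromRight(const std::string &pat, const std::string &txt, int s)
{
    // Precondition: the pattern must fit inside the text at alignment s.
    assert(s >= 0 && s + pat.size() <= txt.size());

    int j = static_cast<int>(pat.size()) - 1;
    while (j >= 0 && pat[j] == txt[s + j])
        j--;
    return j;
}
```

With txt = "ababaacbabaa" and pat = "babaa", alignment 0 stops at pattern index 3, while alignment 1 returns -1, meaning a full match.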

Pre-processing of the BM algorithm

The BM algorithm uses two rules, the bad character rule (Bad Character Heuristic) and the good suffix rule (Good Suffix Heuristic). Both aim to move the pattern as far to the right as possible after a mismatch. At each mismatch, the pattern is shifted by the larger of the two distances given by the good suffix rule and the bad character rule. The basic concepts are as follows:

Bad character: when a character of the text does not match the pattern character it is currently compared with, that text character is called the bad character.

Good suffix: the substring that has already been matched between the text and the pattern (scanning right to left) before the bad character is encountered.

The following figure illustrates a bad character and a good suffix:


Bad character rule: when a text character does not match the pattern character it is aligned with, the pattern is shifted to the right for the next attempt. Shift = (position in the pattern at which the mismatch occurred) - (rightmost position of the bad character in the pattern). If the bad character does not occur in the pattern at all, its rightmost position is taken to be -1. The bad character rule therefore has two cases, discussed below.

Good suffix rule: when a mismatch occurs, shift = (position of the good suffix in the pattern) - (position of the previous occurrence of that suffix in the pattern). If the good suffix does not occur anywhere else in the pattern, that position is taken to be -1. There are three cases, depending on whether the whole good suffix, or only part of it, occurs elsewhere in the pattern; they are discussed one by one below.

Bad character rule

The bad character rule has two cases, as shown in the figure:
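
The two cases can also be written out directly, without any precomputed table. The sketch below is illustrative only; the helper name badCharShift is hypothetical, and it scans the pattern on every call rather than using a lookup array.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the article): the shift proposed by the bad
// character rule when pattern index j mismatches against text character c.
int badCharShift(const std::string &pat, int j, char c)
{
    assert(j >= 0 && j < static_cast<int>(pat.size()));

    int rightmost = -1;                  // case 2: c does not occur in pat
    for (int i = 0; i < static_cast<int>(pat.size()); i++)
        if (pat[i] == c)
            rightmost = i;               // case 1: keep the rightmost occurrence

    return j - rightmost;
}
```

For pat = "babaa", a mismatch at index 3 against 'b' gives 3 - 2 = 1, and a mismatch at index 4 against 'c' gives 4 - (-1) = 5. The result can be zero or negative when the rightmost occurrence lies to the right of j (for example index 1 against 'a' gives 1 - 4 = -3), which is one reason the search takes the larger of this value and the good suffix shift.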


Good suffix rule

Suppose the last |u| characters of the pattern pat match the text txt (u is the good suffix), but the next character to the left is a bad character. How far the pattern moves depends on whether u occurs elsewhere in pat. If u occurs again in pat, the pattern is shifted so that this other occurrence lines up with the matched text. If only part of u occurs again, as a prefix of pat that equals a suffix of u, the pattern is shifted so that the longest such prefix lines up with the end of the matched text. If neither is the case, the whole of pat is shifted past the matched text. These are the three cases of the good suffix rule, as shown in the figure:
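
To make the three cases concrete, the shift can be found by brute force: try every shift k and keep the first one that is consistent with the matched suffix. This is a sketch only. The helper name goodSuffixShiftBrute is hypothetical, and it assumes the common refinement that the character moved under the bad character's position must differ from the one that just failed, which is what the suff[]-based preprocessing later in the article appears to implement.

```cpp
#include <cassert>
#include <string>

// Hypothetical brute-force helper (not from the article): the good suffix
// shift for a mismatch at pattern index j, where pat[j+1 .. m-1] has matched.
int goodSuffixShiftBrute(const std::string &pat, int j)
{
    const int m = static_cast<int>(pat.size());
    assert(j >= 0 && j < m);

    for (int k = 1; k < m; k++) {
        bool ok = true;
        // Every matched character still covered by the shifted pattern must
        // agree with it (whole reoccurrence, or a prefix matching part of u).
        for (int p = j + 1; p < m && ok; p++)
            if (p - k >= 0 && pat[p - k] != pat[p])
                ok = false;
        // Assumed refinement: the character now aligned with the bad
        // character must differ from pat[j], or the mismatch would repeat.
        if (ok && j - k >= 0 && pat[j - k] == pat[j])
            ok = false;
        if (ok)
            return k;
    }
    return m;   // no reoccurrence at all: shift the whole pattern
}
```

For pat = "babaa" this gives shifts 5, 5, 5, 1, 2 for mismatches at indexes 0 to 4. For pat = "abab" a mismatch at index 0 gives 2, because the prefix "ab" matches the tail of the good suffix "bab".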

The shift distances given by the good suffix rule and the bad character rule are read from arrays precomputed from the pattern: bmBc[] for the bad character rule and bmGs[] for the good suffix rule.

Computing the bad character array bmBc[]

Case 1: the character occurs in the pattern. If it occurs more than once, the rightmost occurrence is used: bmBc['B'] is the rightmost position of character B in the pattern.

For example, if character B occurs in the pattern at positions j, k and i (from left to right), the rightmost position i is stored as bmBc['B'].

Case 2: the character does not occur in the pattern. If the pattern does not contain character B, then bmBc['B'] = -1.

The bad character array bmBc[] can be computed as follows:

void PreBmBc(const string &pat, int m, int bmBc[])
{
    int i = 0;

    // Initialize every entry to -1; this covers case 2
    for (i = 0; i < MAX_CHAR; i++)
        bmBc[i] = -1;

    // Case 1: record the last (rightmost) occurrence of each character.
    // The cast keeps the index non-negative where char is signed.
    for (i = 0; i < m; i++)
        bmBc[(unsigned char)pat[i]] = i;
}
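
One detail worth watching when indexing a table by character: plain char is signed on many platforms, so a byte of 0x80 or above would produce a negative index unless it is converted to unsigned char first. The following is a minimal sketch of the same table built with that conversion. The names kAlphabet and buildBadCharTable are hypothetical; kAlphabet plays the role of MAX_CHAR.

```cpp
#include <array>
#include <string>

constexpr int kAlphabet = 256;   // hypothetical name, same role as MAX_CHAR

// Hypothetical helper (not from the article): rightmost position of every
// byte value in pat, or -1 for bytes that do not occur.
std::array<int, kAlphabet> buildBadCharTable(const std::string &pat)
{
    std::array<int, kAlphabet> table;
    table.fill(-1);                                      // case 2 by default
    for (int i = 0; i < static_cast<int>(pat.size()); i++)
        table[static_cast<unsigned char>(pat[i])] = i;   // case 1: last write wins
    return table;
}
```

For pat = "babaa" the table holds 4 for 'a', 2 for 'b' and -1 for every other byte.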

Computing the good suffix array bmGs[]

Before bmGs[] is computed, an auxiliary array suff[] of the same length is computed first. suff[i] is the maximum length of a substring that ends at position i and matches a suffix of the pattern.

In other words, suff[i] is the length of the longest common suffix of pat[0..i] (the part of the pattern ending at position i, inclusive) and the whole pattern pat. The following example illustrates this:

i:   0 1 2 3 4 5 6 7
pat: b c a b a b a b

When i = m - 1 = 7, suff[7] = 8.
When i = 6, the string ending at pat[6] is bcababa and the whole pattern is bcababab; they have no common suffix, so suff[6] = 0.
When i = 5, the string ending at pat[5] is bcabab and the whole pattern is bcababab; the longest common suffix is abab, so suff[5] = 4.
When i = 4, the string ending at pat[4] is bcaba and the whole pattern is bcababab; there is no common suffix, so suff[4] = 0.
...
When i = 0, the string ending at pat[0] is b and the whole pattern is bcababab; the longest common suffix is b, so suff[0] = 1.
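
The values in this example can be checked with a direct, definition-based sketch. The helper name suffByDefinition is hypothetical; it simply counts, for each i, how many characters agree when walking left from position i and from the end of the pattern at the same time.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the article): suff[i] computed straight from
// its definition, the longest common suffix of pat[0..i] and pat.
std::vector<int> suffByDefinition(const std::string &pat)
{
    const int m = static_cast<int>(pat.size());
    std::vector<int> suff(m, 0);
    for (int i = 0; i < m; i++) {
        int k = 0;
        // Compare the k-th character left of i with the k-th character
        // left of the pattern's last position.
        while (k <= i && pat[i - k] == pat[m - 1 - k])
            k++;
        suff[i] = k;
    }
    return suff;
}
```

For pat = "bcababab" this yields suff[7] = 8, suff[6] = 0, suff[5] = 4, suff[4] = 0 and suff[0] = 1, in line with the example.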


Definition of the suff array (from the Boyer-Moore algorithm reference):

For 0 <= i < m: suff[i] = max{ k : pat[i-k+1 .. i] = pat[m-k .. m-1] }

where m is the length of the pattern. A straightforward implementation is:

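void suffix(const string &pat, int m, int suff[])
{
    int i, j;

    suff[m - 1] = m;
    for (i = m - 2; i >= 0; i--)
    {
        // Extend the match leftwards from position i against the end of the pattern
        j = i;
        while (j >= 0 && pat[j] == pat[m - 1 - i + j])
            j--;
        suff[i] = i - j;
    }
}

With suff[] available, bmGs[] can be computed. The three cases of the good suffix rule correspond to three situations in the computation. Case 1: the good suffix occurs again in the pattern, and bmGs[j] records the position i at which that occurrence ends. Case 2: only a prefix pat[0..i] of the pattern matches part of the good suffix (suff[i] == i + 1), and bmGs[j] records i. Case 3: neither holds, and bmGs[j] = -1.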


The good suffix array bmGs[] can then be computed as follows:

void PreBmGs(const string &pat, int m, int bmGs[])
{
    int i, j;
    int suff[SIZE];   // assumes m <= SIZE

    // Compute suff[]
    suffix(pat, m, suff);

    // Initialize all entries to -1; this covers case 3
    for (j = 0; j < m; j++)
    {
        bmGs[j] = -1;
    }

    // Case 2: a prefix pat[0..i] is also a suffix of the pattern
    j = 0;
    for (i = m - 1; i >= 0; i--)
    {
        if (suff[i] == i + 1)
        {
            for (; j < m - 1 - i; j++)
            {
                if (bmGs[j] == -1)
                    bmGs[j] = i;
            }
        }
    }

    // Case 1: the good suffix occurs again, ending at position i
    for (i = 0; i <= m - 2; i++)
    {
        j = m - 1 - suff[i];
        bmGs[j] = i;
    }
}
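
Note that bmGs[] as computed here stores positions, not distances. Reading the good suffix rule as "position of the good suffix minus position of its previous occurrence", and taking the good suffix to end at m - 1, the shift for a mismatch at index j would be m - 1 - bmGs[j]. That reading is an assumption, since the article does not state the conversion explicitly. The sketch below repeats the same preprocessing with std::vector and returns the shifts; the names computeSuff and goodSuffixShifts are hypothetical.

```cpp
#include <string>
#include <vector>

// Same computation as suffix() in the article, returning a vector.
std::vector<int> computeSuff(const std::string &pat)
{
    const int m = static_cast<int>(pat.size());
    std::vector<int> suff(m, 0);
    if (m == 0)
        return suff;
    suff[m - 1] = m;
    for (int i = m - 2; i >= 0; i--) {
        int j = i;
        while (j >= 0 && pat[j] == pat[m - 1 - i + j])
            j--;
        suff[i] = i - j;
    }
    return suff;
}

// Hypothetical wrapper: positions as in PreBmGs, then converted to shift
// distances under the assumption shift = m - 1 - bmGs[j].
std::vector<int> goodSuffixShifts(const std::string &pat)
{
    const int m = static_cast<int>(pat.size());
    const std::vector<int> suff = computeSuff(pat);
    std::vector<int> bmGs(m, -1);                 // case 3 by default

    int j = 0;
    for (int i = m - 1; i >= 0; i--)              // case 2
        if (suff[i] == i + 1)
            for (; j < m - 1 - i; j++)
                if (bmGs[j] == -1)
                    bmGs[j] = i;

    for (int i = 0; i <= m - 2; i++)              // case 1
        bmGs[m - 1 - suff[i]] = i;

    std::vector<int> shift(m);
    for (int k = 0; k < m; k++)
        shift[k] = m - 1 - bmGs[k];               // -1 turns into a full shift of m
    return shift;
}
```

For pat = "babaa" the positions are -1, -1, -1, 3, 2 and the resulting shifts are 5, 5, 5, 1, 2. Every shift is at least 1, because no stored position exceeds m - 2.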

BM algorithm matching process

The preprocessing for the BM algorithm has now been covered. The complete program is given below:

#include <iostream>
#include <string>
using namespace std;

const int MAX_CHAR = 256;
const int SIZE = 256;   // maximum supported pattern length

static inline int MAX(int x, int y) { return x < y ? y : x; }

void BoyerMoore(const string &pat, const string &txt);

int main()
{
    string txt = "ababaacbabaa";
    string pat = "babaa";
    BoyerMoore(pat, txt);
    return 0;
}

void PreBmBc(const string &pat, int m, int bmBc[])
{
    int i = 0;
    // Initialize every entry to -1; this covers case 2
    for (i = 0; i < MAX_CHAR; i++)
        bmBc[i] = -1;
    // Case 1: record the last occurrence of each character
    for (i = 0; i < m; i++)
        bmBc[(unsigned char)pat[i]] = i;
}

void suffix(const string &pat, int m, int suff[])
{
    int i, j;
    suff[m - 1] = m;
    for (i = m - 2; i >= 0; i--)
    {
        j = i;
        while (j >= 0 && pat[j] == pat[m - 1 - i + j])
            j--;
        suff[i] = i - j;
    }
}

void PreBmGs(const string &pat, int m, int bmGs[])
{
    int i, j;
    int suff[SIZE];
    // Compute suff[]
    suffix(pat, m, suff);
    // Initialize all entries to -1; this covers case 3
    for (j = 0; j < m; j++)
        bmGs[j] = -1;
    // Case 2
    j = 0;
    for (i = m - 1; i >= 0; i--)
    {
        if (suff[i] == i + 1)
        {
            for (; j < m - 1 - i; j++)
            {
                if (bmGs[j] == -1)
                    bmGs[j] = i;
            }
        }
    }
    // Case 1
    for (i = 0; i <= m - 2; i++)
    {
        j = m - 1 - suff[i];
        bmGs[j] = i;
    }
}

void BoyerMoore(const string &pat, const string &txt)
{
    int j, bmBc[MAX_CHAR], bmGs[SIZE];
    int m = pat.length();
    int n = txt.length();

    // The fixed-size arrays only support patterns of 1..SIZE characters
    if (m == 0 || m > SIZE)
        return;

    // Preprocessing
    PreBmBc(pat, m, bmBc);
    PreBmGs(pat, m, bmGs);

    // Searching
    int s = 0;   // s is the shift of the pattern with respect to the text
    while (s <= n - m)
    {
        j = m - 1;
        /* Keep reducing index j of the pattern while characters of the
           pattern and the text match at this shift s */
        while (j >= 0 && pat[j] == txt[j + s])
            j--;
        /* If the pattern is present at the current shift, index j
           becomes -1 after the loop above */
        if (j < 0)
        {
            cout << "pattern occurs at shift: " << s << endl;
            // NOTE: the source listing is cut off at this point. The rest of
            // this function is a reconstruction from the rules described
            // above (an assumption, not the author's original code).
            s += m - 1 - bmGs[0];
        }
        else
        {
            // Shift by the larger of the good suffix and bad character distances
            s += MAX(m - 1 - bmGs[j], j - bmBc[(unsigned char)txt[s + j]]);
        }
    }
}
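
As a way to cross-check the output of a hand-written search, the C++17 standard library provides std::boyer_moore_searcher in <functional>, which can be passed to std::search. The sketch below is not part of the article's program; the function name findAllWithStdSearcher is hypothetical, and it assumes a C++17 compiler.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper (not from the article): collect every shift at which
// pat occurs in txt, using the standard library's Boyer-Moore searcher.
std::vector<int> findAllWithStdSearcher(const std::string &pat, const std::string &txt)
{
    std::vector<int> shifts;
    if (pat.empty())
        return shifts;                  // nothing meaningful to report

    const std::boyer_moore_searcher<std::string::const_iterator>
        searcher(pat.begin(), pat.end());

    auto it = txt.begin();
    while (true) {
        it = std::search(it, txt.end(), searcher);
        if (it == txt.end())
            break;
        shifts.push_back(static_cast<int>(it - txt.begin()));
        ++it;                           // continue just past this occurrence's start
    }
    return shifts;
}
```

For txt = "ababaacbabaa" and pat = "babaa" this reports shifts 1 and 7.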
   
    
References:

http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
http://blog.csdn.net/v_july_v/article/details/7041827
http://blog.jobbole.com/52830/
http://www.searchtb.com/
http://www.geeksforgeeks.org/pattern-searching-set-7-boyer-moore-algorithm-bad-character-heuristic/
http://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html
http://dsqiu.iteye.com/blog/1700312
