What follows is my own understanding of this algorithm, based on several existing introductions to it; every figure was drawn with care. I hope to explain the problem clearly. If you spot an error, have a question, or find something unclear, please raise it so that we can learn and improve together. The main text begins below.
1. Brief introduction
Among substring search algorithms, BM (Boyer-Moore) is widely regarded as the most efficient string search algorithm in practice. It was designed by Bob Boyer and J Strother Moore in 1977 and is generally said to be 3 to 5 times faster than the KMP algorithm. It is often used for searching and matching in text editors. The well-known GNU grep command, for example, uses it, which is an important reason why GNU grep is faster than BSD grep. For more information, see the article "Why is GNU grep so fast?" by Mike Haertel, the author of GNU grep.
2. Main features
Let n be the length of the text and m the length of the pattern. The main features of the BM algorithm are:
- It compares characters from right to left (KMP, by contrast, compares from left to right);
- It has two phases: preprocessing and searching;
- The preprocessing phase takes O(m + σ) time and space, where σ is the size of the character set, typically 256;
- The search phase takes O(mn) time in the worst case;
- When the pattern is non-periodic, the algorithm performs at most 3n character comparisons in the worst case;
- The best case is O(n/m): for example, searching the text b^n for the pattern a^(m-1)b needs only n/m comparisons.
These features give a first impression of the algorithm. Come back to them once you have understood how it works.
3. Basic idea of the algorithm
The naive matching algorithm slides the pattern from left to right and also compares characters from left to right. The basic framework is as follows:
j = 0;
while (j <= strlen(text) - strlen(pattern)) {
    /* compare left to right at alignment j */
    for (i = 0; i < strlen(pattern) && pattern[i] == text[i + j]; ++i)
        ;
    if (i == strlen(pattern)) {
        /* match found at offset j */
        break;
    } else {
        ++j;
    }
}
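As a runnable counterpart to the framework above, here is a minimal sketch in C. The function name naive_search and its return convention (offset of the first match, or -1 if there is none) are assumptions made for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: offset of the first match of pattern in text, or -1. */
long naive_search(const char *text, const char *pattern)
{
    size_t n = strlen(text);
    size_t m = strlen(pattern);
    size_t i, j;

    if (m > n)
        return -1;              /* also avoids unsigned underflow in n - m */

    for (j = 0; j <= n - m; ++j) {
        /* compare left to right at alignment j */
        for (i = 0; i < m && pattern[i] == text[i + j]; ++i)
            ;
        if (i == m)
            return (long)j;     /* every pattern character matched */
    }
    return -1;
}
```

Calling naive_search("here is a simple example", "example") returns 17. The lengths are computed once up front, because calling strlen inside the loop condition repeats the same work on every iteration.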
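The BM algorithm also slides the pattern from left to right, but it compares characters from right to left. The basic framework is as follows (i and j are signed integers):
j = 0;
while (j <= strlen(text) - strlen(pattern)) {
    /* compare right to left, starting at the last character of the pattern */
    for (i = strlen(pattern) - 1; i >= 0 && pattern[i] == text[i + j]; --i)
        ;
    if (i < 0) {
        /* match found at offset j */
        break;
    } else {
        j += BM();
    }
}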
The essence of the BM algorithm lies in BM(text, pattern): on a mismatch it can skip more than one character at a time. In other words, it does not have to examine every character of the text; some are skipped entirely. In general, the longer the pattern, the faster the algorithm runs. The efficiency comes from using the information gained in each failed attempt to rule out as many alignments as possible; that is, it exploits properties of the pattern to speed up the search.
The BM algorithm actually combines two independent heuristics: the bad character rule (bad-character shift) and the good suffix rule (good-suffix shift). Both aim to shift the pattern as far to the right as possible (this is the BM() call above).
Rather than stating the two rules formally right away, we first illustrate them with examples, which is the easiest way to absorb them.
4. String search brainstorm
Let's brainstorm: how can string search be sped up? Take the naive approach first. Comparing from right to left, the last character c of the pattern does not match d in the text, so the pattern is shifted one position to the right. But look at d: it does not occur in the pattern at all, so shifting by 1, 2, 3 or 4 positions can never produce a match. Why bother? It is better to shift by 5 positions (strlen(pattern)) at once and compare again.
So we do that. After shifting by 5, the b in the text is compared with the c in the pattern, and they still differ. What now? b does occur in the pattern, so shifting by 5 is not allowed. Shift by 1, then? No: a b in the pattern can be aligned directly with the b in the text. But the pattern contains two b's; which one should be used? The safe choice is the rightmost b. Why? Using the leftmost b is too aggressive and can skip over a real match. In the figure, aligning the rightmost b happens to give a complete match; had we used the leftmost one, that match would have been missed. This kind of heuristic is what the BM algorithm does.
Now suppose the following happens: the c in the pattern does not match the b in the text, so by the rule above the pattern is shifted until its rightmost b is aligned with that b. Comparing again, c matches c, and we continue to the left until an a in the pattern fails to match a b in the text. The rule says to align the rightmost b of the pattern with this b, but what do we find? The pattern would have to move backwards. Of course we are not that foolish: in this case simply shift the pattern one position to the right and carry on.
This is the so-called "bad character rule". It is simple and easy to understand. The b marked in bold above is the "bad character", that is, the mismatched character. The bad character is defined with respect to the text.
Is BM really that simple, just one heuristic rule? Of course not. Let's brainstorm again: is there another way to speed up the search? For example:
At first the bad character rule moves the pattern four positions, which is good. Next we run into the moving-backwards situation, so the bad character rule can only move one position conservatively. But can we really move only one? No. The suffix ab that has just been matched also occurs earlier in the pattern, so it is better to align that earlier ab with the ab just matched in the text and continue from there. That shifts the pattern two positions at once. This is another useful heuristic. One might ask: what if the matched suffix does not occur anywhere else in the pattern? Is the rule useless then? Not entirely; it depends on the situation, as the next example shows.
The suffix cbab has been matched and then b fails, and the string cbab occurs nowhere earlier in the pattern. Do we fall back to shifting one position? No. There is an ab at the front of the pattern, which is part of the suffix cbab, so it can still be used: align the ab at the front of the pattern with the ab already matched in the text and continue. That moves four positions at once. Of course, if neither the matched suffix nor a part of it occurs elsewhere, as with babac at the beginning, the rule cannot be used.
This is the so-called "good suffix rule". It is simple and easy to understand. The ab (first example) and cbab (second example) marked in red above are "good suffixes". A good suffix is defined with respect to the pattern.
The following example shows what the bad character and the good suffix are.
Text:    mahtavaatalomaisema omalomailuun
Pattern: maisemaomaloma
Bad character: the "t" in the text is the bad character.
Good suffix: "aloma" in the pattern is the good suffix.
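The example above can be reproduced with a few lines of C. This is only a sketch: the helper name first_mismatch is an assumption, and the caller is assumed to guarantee that the alignment j leaves room for the whole pattern inside the text.

```c
/* Hypothetical helper: compare pattern (length m) against text at
   alignment j, scanning right to left. Returns the index in the pattern
   where the first mismatch occurs, or -1 if every character matches.
   Assumes j + m does not exceed the length of text. */
int first_mismatch(const char *text, const char *pattern, int m, int j)
{
    int i;

    for (i = m - 1; i >= 0 && pattern[i] == text[i + j]; --i)
        ;
    return i;
}
```

With the text and pattern of the example, m = 14 and j = 0, the function returns 8. The text character at that position, text[8], is the bad character "t", and the part of the pattern to the right of the mismatch, pattern[9..13], is the good suffix "aloma".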
Is BM really that simple? Yes, it is easy to understand, but not everyone would think of combining two heuristic rules into an algorithm as good as BM. That raises another question: how are the two rules used together, and when should the bad character rule or the good suffix rule be applied? The answer is to use whichever shifts the pattern further to the right. In the example above, at one step the good suffix rule allows a shift of only one position while the bad character rule allows three, so the bad character rule is chosen. At the next step the bad character rule allows only one position while the good suffix rule allows four, so the good suffix rule is chosen. The two rules work "in parallel", and the larger shift always wins.
Examples alone are not enough: they are small, may not cover every situation, and are not precise. A theoretical discussion follows.
5. Theory of the BM algorithm
(1) Bad character rule
When a bad character is encountered, the BM algorithm shifts the pattern to the right so that the rightmost occurrence of that character in the pattern is aligned with the bad character in the text, and then resumes matching. There are two cases.
Case 1: the bad character occurs in the pattern. Align its rightmost occurrence in the pattern with the bad character. (PS: BM never moves backwards. If that occurrence lies to the right of the mismatch position, the computed shift is negative, which is certainly not the largest available shift.)
Case 2: the bad character does not occur in the pattern. This is the good case: the pattern can be shifted completely past the bad character, which is the full pattern length when the mismatch is at the last character.
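The two cases can be tried out in isolation with a search that uses only the bad character rule. The sketch below is a simplification under stated assumptions: the names bad_char_search, last and ALPHABET are invented for this illustration, characters are assumed to be 8 bits wide, and a shift that would move the pattern backwards is replaced by a shift of one position.

```c
#include <string.h>

#define ALPHABET 256   /* assumption: 8-bit characters */

/* Hypothetical helper: search with the bad character rule only.
   Returns the offset of the first match, or -1. */
long bad_char_search(const char *text, const char *pattern)
{
    long n = (long)strlen(text);
    long m = (long)strlen(pattern);
    long last[ALPHABET];   /* rightmost index of each character, -1 if absent */
    long i, j, shift;

    for (i = 0; i < ALPHABET; ++i)
        last[i] = -1;
    for (i = 0; i < m; ++i)
        last[(unsigned char)pattern[i]] = i;

    j = 0;
    while (j <= n - m) {
        /* compare right to left at alignment j */
        for (i = m - 1; i >= 0 && pattern[i] == text[i + j]; --i)
            ;
        if (i < 0)
            return j;
        /* Case 1: align the rightmost occurrence with the bad character.
           Case 2: last[] is -1, so the pattern moves past the bad character. */
        shift = i - last[(unsigned char)text[i + j]];
        j += shift > 0 ? shift : 1;   /* never move backwards */
    }
    return -1;
}
```

For the text "here is a simple example" and the pattern "example" this returns 17 after only five alignments (offsets 0, 7, 9, 12 and 17), whereas a character-by-character scan would try eighteen.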
(2) Good suffix rule
Suppose the last u characters of the pattern match the text but the character before them does not, so the pattern has to be shifted. If those u characters, or a part of them, also occur elsewhere in the pattern, shift the pattern to the right so that the earlier occurrence is aligned with the matched suffix. If they occur nowhere else in the pattern, shift the entire pattern to the right. This gives three cases:
Case 1: the pattern contains another substring equal to the whole good suffix. Shift that substring to the position of the good suffix and continue matching.
Case 2: no substring equals the whole good suffix. Find the longest suffix of the good suffix that is also a prefix of the pattern, that is, the largest s with P[m-s..m-1] = P[0..s-1], and align that prefix with it.
Case 3: no part of the good suffix matches a prefix of the pattern. Shift the entire pattern to the right by m.
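To experiment with the three cases, the shift can be computed by brute force: try every shift s from 1 upwards and accept the first one that is consistent with the matched suffix. The sketch below is for illustration only; the function name is an assumption, and it is far slower than a precomputed table. It also applies one refinement that the three cases do not spell out: an earlier occurrence is only accepted if the character in front of it differs from the pattern character that just mismatched, since otherwise the same mismatch would simply repeat.

```c
/* Hypothetical helper: good-suffix shift for a mismatch at pattern[i],
   where pattern[i+1..m-1] has already matched the text. Brute force. */
int good_suffix_shift_slow(const char *pattern, int m, int i)
{
    int s, k, ok;

    for (s = 1; s < m; ++s) {
        ok = 1;
        /* every matched character must land on an equal pattern character
           (Case 1) or fall off the left end of the pattern (Case 2) */
        for (k = i + 1; k < m && ok; ++k)
            if (k - s >= 0 && pattern[k - s] != pattern[k])
                ok = 0;
        /* refinement (assumption of this sketch): the character moved under
           the mismatch position must differ from the one that failed */
        if (ok && i - s >= 0 && pattern[i - s] == pattern[i])
            ok = 0;
        if (ok)
            return s;
    }
    return m;   /* Case 3: shift by the whole pattern length */
}
```

For the pattern bcababab (m = 8), a mismatch at i = 3 gives a shift of 2 (Case 1: abab occurs again two positions to the left), a mismatch at i = 1 gives 7 (Case 2: only the prefix b matches the end of the good suffix), and for the pattern abcd a mismatch at i = 1 gives 4 (Case 3).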
(3) Shift rule
The shift rule of the BM algorithm is as follows:
Replace j += BM() in the basic framework of section 3 with j += MAX(shift(good suffix), shift(bad character)). That is, the distance by which the BM algorithm shifts the pattern to the right is the larger of the shifts given by the good suffix rule and the bad character rule.
shift(good suffix) and shift(bad character) are obtained by simple calculations from arrays precomputed from the pattern. The array for the bad character rule is bmBc[], and the array for the good suffix rule is bmGs[].
6. Implementing the BM algorithm
When a mismatch occurs, the BM algorithm computes one candidate shift from the bmBc array (bad character rule) and another from the bmGs array (good suffix rule). The following describes how to compute the two preprocessing arrays bmBc[] and bmGs[].
(1) Computing the bad character array bmBc[]
This looks very easy: it seems that bmBc[i] = m-1-i is all that is needed. But that is wrong, because the character at position i may occur at several places in the pattern and what we need is its rightmost occurrence. Checking for that at every step of the search would be tedious and slow. The trick is to index the array by character instead of by position, so that the pattern only has to be traversed once. This trades space for time, but with 8-bit characters the table needs only 256 entries, and a long pattern may itself exceed 256 characters, so the trade is worthwhile (this is also one reason why BM becomes more efficient as the data grows).
As mentioned above, computing bmBc[] involves two cases, matching the two cases of the bad character rule.
Case 1: the character occurs in the pattern. bmBc['v'] is the distance from the last occurrence of the character v in the pattern to the end of the pattern.
Case 2: the character does not occur in the pattern. If the pattern does not contain the character v, then bmBc['v'] = strlen(pattern).
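The code is simple:
void PreBmBc(char *pattern, int m, int bmBc[])
{
    int i;

    /* Case 2: default for characters that do not occur in the pattern */
    for (i = 0; i < 256; i++) {
        bmBc[i] = m;
    }
    /* Case 1: distance from the last occurrence to the end of the pattern.
       The final character is deliberately skipped (i < m - 1). The cast
       keeps the index non-negative where char is signed. */
    for (i = 0; i < m - 1; i++) {
        bmBc[(unsigned char)pattern[i]] = m - 1 - i;
    }
}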
The bmBc array is used to compute how far the pattern should be shifted, but is the bmBc value itself the shift distance? No. As discussed, the bad character rule may ask the pattern to move backwards, that is, give a negative shift, whereas a bmBc value is never negative, so the two cannot be the same thing. How is the actual shift computed, then? It depends on where the bad character is met, since the bad character is defined with respect to the text. Let v be the bad character in the text (at position i + j) and let i be the mismatch position in the pattern. The actual shift of the pattern is then bmBc['v'] - m + 1 + i.
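A small worked example may make the formula concrete. The sketch below rebuilds the table in the same way as PreBmBc and evaluates bmBc['v'] - m + 1 + i. The names build_bmBc, bad_char_shift and ALPHABET are assumptions of this illustration.

```c
#define ALPHABET 256   /* assumption: 8-bit characters */

/* Same construction as PreBmBc: the last character is not recorded. */
void build_bmBc(const char *pattern, int m, int bmBc[])
{
    int i;

    for (i = 0; i < ALPHABET; i++)
        bmBc[i] = m;
    for (i = 0; i < m - 1; i++)
        bmBc[(unsigned char)pattern[i]] = m - 1 - i;
}

/* Shift proposed by the bad character rule when pattern[i] mismatches
   the text character v. The result may be zero or negative; the caller
   is expected to take the maximum with the good-suffix shift. */
int bad_char_shift(const int bmBc[], int m, int i, char v)
{
    return bmBc[(unsigned char)v] - m + 1 + i;
}
```

For the pattern bcababab (m = 8) the table holds bmBc['a'] = 1, bmBc['b'] = 2, bmBc['c'] = 6 and 8 for every other character. A mismatch at i = 7 against 'z' gives a shift of 8, a mismatch at i = 7 against 'a' gives 1, and a mismatch at i = 3 against 'a' gives -3, a case in which the good suffix rule has to supply the shift.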
(2) Computing the good suffix array bmGs[]
The index of bmGs[] is a position in the pattern (a number, not a character): bmGs[i] is the shift to use when the mismatch occurs at pattern[i].
As mentioned above, bmGs[] is computed in three cases, matching the three cases of the good suffix rule. The lengths of the matching suffixes are held in an array suff[].
Case 1: corresponds to case 1 of the good suffix rule. Here j is the position just before the good suffix.
Case 2: corresponds to case 2 of the good suffix rule.
Case 3: corresponds to case 3 of the good suffix rule: bmGs[i] = strlen(pattern) = m.
With the cases laid out like this, the code is straightforward to write:
void PreBmGs(char *pattern, int m, int bmGs[])
{
    int i, j;
    int suff[SIZE];

    // compute the suffix array
    suffix(pattern, m, suff);

    // set every entry to m first; this covers Case 3
    for (i = 0; i < m; i++) {
        bmGs[i] = m;
    }

    // Case 2
    j = 0;
    for (i = m - 1; i >= 0; i--) {
        if (suff[i] == i + 1) {
            for (; j < m - 1 - i; j++) {
                if (bmGs[j] == m)
                    bmGs[j] = m - 1 - i;
            }
        }
    }

    // Case 1
    for (i = 0; i <= m - 2; i++) {
        bmGs[m - 1 - suff[i]] = m - 1 - i;
    }
}
Is that all? Not quite: what is suff[] here?
To compute the bmGs array efficiently, an auxiliary array suff[] holding suffix lengths is computed first.
Definition of suff[], where m is the length of the pattern:
a. suff[m-1] = m;
b. suff[i] = k, where k is the largest value such that pattern[i-k+1..i] = pattern[m-k..m-1].
This looks obscure. In fact, suff[i] is the length of the longest common suffix of pattern[0..i] and the whole pattern. An example makes it clearer:
i:       0 1 2 3 4 5 6 7
pattern: b c a b a b a b
For i = 7, suff[7] = strlen(pattern) = 8 by definition.
For i = 6, the string ending at pattern[6] is bcababa and the string ending at the last character is bcababab. They have no common suffix, so suff[6] = 0.
For i = 5, the string ending at pattern[5] is bcabab and the string ending at the last character is bcababab. Their longest common suffix is abab, so suff[5] = 4.
And so on.
For i = 0, the string ending at pattern[0] is b and the string ending at the last character is bcababab. Their longest common suffix is b, so suff[0] = 1.
With this definition, the code is also easy to write:
void suffix(char *pattern, int m, int suff[])
{
    int i, j;

    suff[m - 1] = m;
    for (i = m - 2; i >= 0; i--) {
        /* walk left from i while the characters agree with the pattern's tail */
        j = i;
        while (j >= 0 && pattern[j] == pattern[m - 1 - i + j])
            j--;
        suff[i] = i - j;
    }
}
That works, but some people find this method too brute-force, so an improvement was devised. The scan still runs from right to left; the improvement is to reuse suff[] values that have already been computed when computing the current one. The variables are:
i is the position whose suff[] value is currently being computed.
f is the position at which the most recent successful match started (not every position matches; in fact, few do).
g is the position at which that most recent match failed.
If i lies between g and f, then P[i] = P[m-1-f+i] must hold. If in addition suff[m-1-f+i] < i-g, then suff[i] = suff[m-1-f+i], which reuses a previously computed value.
PS: One might think the condition should be suff[m-1-f+i] <= i-g, on the grounds that the earlier value can still be reused when suff[m-1-f+i] = i-g. That is wrong. Here is an extreme example:
i:       0 1 2 3 4 5 6 7 8 9
pattern: a a a a a b a a a a
suff[4] = 4, with f = 4 and g = 0. For i = 3, suff[m-1-f+i] = suff[8] = 3 but suff[3] = 4. The two differ because the position g where the previous match failed may match this time.
With that explained, the code is relatively simple:
void suffix(char *pattern, int m, int suff[])
{
    int f, g, i;

    suff[m - 1] = m;
    g = m - 1;
    f = m - 1;   /* not read before being set below; initialised for clarity */
    for (i = m - 2; i >= 0; --i) {
        if (i > g && suff[i + m - 1 - f] < i - g) {
            /* reuse a value computed earlier */
            suff[i] = suff[i + m - 1 - f];
        } else {
            if (i < g)
                g = i;
            f = i;
            while (g >= 0 && pattern[g] == pattern[g + m - 1 - f])
                --g;
            suff[i] = f - g;
        }
    }
}
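One way to gain confidence in the improved version is to compare it against the straightforward one on a few patterns, including the extreme example from the PS. The sketch below does that. The names suff_naive, suff_fast, suff_methods_agree and the limit MAX_PATTERN are assumptions of this illustration.

```c
#include <string.h>

#define MAX_PATTERN 256   /* assumption: upper bound on pattern length */

/* Straightforward computation, directly from the definition. */
void suff_naive(const char *p, int m, int suff[])
{
    int i, j;

    suff[m - 1] = m;
    for (i = m - 2; i >= 0; i--) {
        j = i;
        while (j >= 0 && p[j] == p[m - 1 - i + j])
            j--;
        suff[i] = i - j;
    }
}

/* Improved computation that reuses earlier values via f and g. */
void suff_fast(const char *p, int m, int suff[])
{
    int f = m - 1, g = m - 1, i;

    suff[m - 1] = m;
    for (i = m - 2; i >= 0; --i) {
        if (i > g && suff[i + m - 1 - f] < i - g) {
            suff[i] = suff[i + m - 1 - f];
        } else {
            if (i < g)
                g = i;
            f = i;
            while (g >= 0 && p[g] == p[g + m - 1 - f])
                --g;
            suff[i] = f - g;
        }
    }
}

/* Returns 1 if both methods produce the same array for this pattern. */
int suff_methods_agree(const char *pattern)
{
    int a[MAX_PATTERN], b[MAX_PATTERN];
    int m = (int)strlen(pattern);

    if (m == 0 || m > MAX_PATTERN)
        return 0;
    suff_naive(pattern, m, a);
    suff_fast(pattern, m, b);
    return memcmp(a, b, (size_t)m * sizeof(int)) == 0;
}
```

Both methods agree on bcababab and on aaaaabaaaa. For the latter, suff_fast yields suff[8] = 3, suff[4] = 4 and suff[3] = 4, the values quoted in the PS. Changing the comparison to <= would set suff[3] to 3 instead.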
Is that all? Yes: all the important pieces are now in place, and I hope they are clear. To check your understanding, take the simple example below and compute bmBc[], suff[] and bmGs[] for it.
Example:
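For readers who want to check their hand calculation, the three arrays can be printed with a short program fragment. The pattern bcababab from the suff[] discussion is used here as an assumption; the helper names and the two size limits are likewise invented for this sketch.

```c
#include <stdio.h>
#include <string.h>

#define ALPHABET 256      /* assumption: 8-bit characters */
#define MAX_PATTERN 256   /* assumption: upper bound on pattern length */

void build_bmBc(const char *p, int m, int bmBc[])
{
    int i;

    for (i = 0; i < ALPHABET; i++)
        bmBc[i] = m;
    for (i = 0; i < m - 1; i++)
        bmBc[(unsigned char)p[i]] = m - 1 - i;
}

void build_suff(const char *p, int m, int suff[])
{
    int i, j;

    suff[m - 1] = m;
    for (i = m - 2; i >= 0; i--) {
        j = i;
        while (j >= 0 && p[j] == p[m - 1 - i + j])
            j--;
        suff[i] = i - j;
    }
}

void build_bmGs(const int suff[], int m, int bmGs[])
{
    int i, j = 0;

    for (i = 0; i < m; i++)             /* Case 3: default shift m */
        bmGs[i] = m;
    for (i = m - 1; i >= 0; i--)        /* Case 2: a prefix matches */
        if (suff[i] == i + 1)
            for (; j < m - 1 - i; j++)
                if (bmGs[j] == m)
                    bmGs[j] = m - 1 - i;
    for (i = 0; i <= m - 2; i++)        /* Case 1: suffix occurs again */
        bmGs[m - 1 - suff[i]] = m - 1 - i;
}

/* Print one line per pattern position with the three table values. */
void print_tables(const char *p)
{
    int bmBc[ALPHABET], suff[MAX_PATTERN], bmGs[MAX_PATTERN];
    int m = (int)strlen(p), i;

    if (m == 0 || m > MAX_PATTERN)
        return;
    build_bmBc(p, m, bmBc);
    build_suff(p, m, suff);
    build_bmGs(suff, m, bmGs);
    for (i = 0; i < m; i++)
        printf("i=%d  %c  bmBc=%d  suff=%d  bmGs=%d\n",
               i, p[i], bmBc[(unsigned char)p[i]], suff[i], bmGs[i]);
}
```

For bcababab this gives bmBc['a'] = 1, bmBc['b'] = 2, bmBc['c'] = 6, suff[] = 1 0 0 2 0 4 0 8 and bmGs[] = 7 7 7 2 7 4 7 1.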
PS: One may ask why bmBc['b'] equals 2. Isn't b the last character of the pattern, so that the value should be 0 by definition? Take a closer look at the bmBc code:
for (i = 0; i < m - 1; i++) {
    bmBc[(unsigned char)pattern[i]] = m - 1 - i;
}
The loop condition is i < m-1, not i < m. So if the last character does not occur earlier in the pattern, its bmBc value stays m. Why is the last position excluded? If the bmBc value of a character were 0, the shift computed as above would be bmBc['v'] - m + 1 + i = -m + 1 + i <= 0, meaning the pattern would stay put or move backwards. As already explained, that must not happen, which is why the loop stops at m-1.
Okay, all of them are finished. let's integrate these algorithms.
#include <stdio.h>
#include <string.h>

#define MAX_CHAR 256
#define SIZE 256
#define MAX(x, y) ((x) > (y) ? (x) : (y))

void BoyerMoore(char *pattern, int m, char *text, int n);

int main()
{
    char text[256], pattern[256];

    /* read a text and a pattern repeatedly until the input ends */
    while (scanf("%255s %255s", text, pattern) == 2) {
        BoyerMoore(pattern, strlen(pattern), text, strlen(text));
        printf("\n");
    }
    return 0;
}

void print(int *array, int n, char *arrayName)
{
    int i;

    printf("%s: ", arrayName);
    for (i = 0; i < n; i++) {
        printf("%d ", array[i]);
    }
    printf("\n");
}

void PreBmBc(char *pattern, int m, int bmBc[])
{
    int i;

    for (i = 0; i < MAX_CHAR; i++) {
        bmBc[i] = m;
    }
    for (i = 0; i < m - 1; i++) {
        bmBc[(unsigned char)pattern[i]] = m - 1 - i;
    }
    /* debug output:
    printf("bmBc[]: ");
    for (i = 0; i < m; i++) {
        printf("%d ", bmBc[(unsigned char)pattern[i]]);
    }
    printf("\n"); */
}

/* straightforward version, kept for comparison */
void suffix_old(char *pattern, int m, int suff[])
{
    int i, j;

    suff[m - 1] = m;
    for (i = m - 2; i >= 0; i--) {
        j = i;
        while (j >= 0 && pattern[j] == pattern[m - 1 - i + j])
            j--;
        suff[i] = i - j;
    }
}

/* improved version that reuses earlier suff[] values */
void suffix(char *pattern, int m, int suff[])
{
    int f, g, i;

    suff[m - 1] = m;
    g = m - 1;
    f = m - 1;
    for (i = m - 2; i >= 0; --i) {
        if (i > g && suff[i + m - 1 - f] < i - g) {
            suff[i] = suff[i + m - 1 - f];
        } else {
            if (i < g)
                g = i;
            f = i;
            while (g >= 0 && pattern[g] == pattern[g + m - 1 - f])
                --g;
            suff[i] = f - g;
        }
    }
    // print(suff, m, "suff[]");
}

void PreBmGs(char *pattern, int m, int bmGs[])
{
    int i, j;
    int suff[SIZE];

    // compute the suffix array
    suffix(pattern, m, suff);

    // set every entry to m first; this covers Case 3
    for (i = 0; i < m; i++) {
        bmGs[i] = m;
    }

    // Case 2
    j = 0;
    for (i = m - 1; i >= 0; i--) {
        if (suff[i] == i + 1) {
            for (; j < m - 1 - i; j++) {
                if (bmGs[j] == m)
                    bmGs[j] = m - 1 - i;
            }
        }
    }

    // Case 1
    for (i = 0; i <= m - 2; i++) {
        bmGs[m - 1 - suff[i]] = m - 1 - i;
    }
    // print(bmGs, m, "bmGs[]");
}

void BoyerMoore(char *pattern, int m, char *text, int n)
{
    int i, j, bmBc[MAX_CHAR], bmGs[SIZE];

    // Preprocessing
    PreBmBc(pattern, m, bmBc);
    PreBmGs(pattern, m, bmGs);

    // Searching
    j = 0;
    while (j <= n - m) {
        for (i = m - 1; i >= 0 && pattern[i] == text[i + j]; i--)
            ;
        if (i < 0) {
            printf("Find it, the position is %d\n", j);
            /* stop at the first match; to report every match,
               replace the return with j += bmGs[0] */
            return;
        } else {
            j += MAX(bmBc[(unsigned char)text[i + j]] - m + 1 + i, bmGs[i]);
        }
    }
    printf("No find.\n");
}
Running the program gives the following output:
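Independently of the printed output, the search can also be checked programmatically. The following sketch is a variant that records every match instead of stopping at the first one, advancing by bmGs[0] after each hit. The names bm_find_all and build_tables, the output-array convention and the two size limits are assumptions of this illustration.

```c
#include <string.h>

#define ALPHABET 256      /* assumption: 8-bit characters */
#define MAX_PATTERN 256   /* assumption: upper bound on pattern length */

/* Build bmBc[] and bmGs[] for pattern p of length m (1 <= m <= MAX_PATTERN). */
static void build_tables(const char *p, int m, int bmBc[], int bmGs[])
{
    int suff[MAX_PATTERN];
    int i, j;

    for (i = 0; i < ALPHABET; i++)
        bmBc[i] = m;
    for (i = 0; i < m - 1; i++)
        bmBc[(unsigned char)p[i]] = m - 1 - i;

    suff[m - 1] = m;
    for (i = m - 2; i >= 0; i--) {
        j = i;
        while (j >= 0 && p[j] == p[m - 1 - i + j])
            j--;
        suff[i] = i - j;
    }

    for (i = 0; i < m; i++)
        bmGs[i] = m;
    j = 0;
    for (i = m - 1; i >= 0; i--)
        if (suff[i] == i + 1)
            for (; j < m - 1 - i; j++)
                if (bmGs[j] == m)
                    bmGs[j] = m - 1 - i;
    for (i = 0; i <= m - 2; i++)
        bmGs[m - 1 - suff[i]] = m - 1 - i;
}

/* Stores the offsets of the first max_out matches in out[] and returns
   the total number of matches (overlapping matches included). */
int bm_find_all(const char *text, const char *pattern, int out[], int max_out)
{
    int bmBc[ALPHABET], bmGs[MAX_PATTERN];
    int n = (int)strlen(text), m = (int)strlen(pattern);
    int i, j = 0, count = 0, bc;

    if (m == 0 || m > MAX_PATTERN)
        return 0;
    build_tables(pattern, m, bmBc, bmGs);

    while (j <= n - m) {
        for (i = m - 1; i >= 0 && pattern[i] == text[i + j]; --i)
            ;
        if (i < 0) {
            if (count < max_out)
                out[count] = j;
            ++count;
            j += bmGs[0];                 /* shift after a full match */
        } else {
            bc = bmBc[(unsigned char)text[i + j]] - m + 1 + i;
            j += bc > bmGs[i] ? bc : bmGs[i];
        }
    }
    return count;
}
```

Searching "here is a simple example" for "example" reports one match at offset 17, and searching "aaaa" for "aa" reports the three overlapping matches at offsets 0, 1 and 2.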