Http://hi.baidu.com/l6834279/item/d6ef651684dda4fcddeecae3
First, a brief description of some basic concepts of the BM (Boyer-Moore) algorithm.
The BM algorithm is an exact string matching algorithm (as opposed to fuzzy matching).
The BM algorithm compares characters from right to left and applies two heuristic rules, the bad character rule and the good suffix rule, to determine how far to shift the pattern to the right.
The basic process of the BM algorithm: let the text string be T and the pattern string be P. First, align P with the left end of T, then compare them from right to left, as shown in the figure:
If a comparison fails, the BM algorithm uses the two heuristic rules, the bad character rule and the good suffix rule, to calculate how far to shift the pattern string to the right, and repeats this until the matching process ends.
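The right-to-left comparison at a single alignment can be sketched as follows. This is only an illustration: the function name first_mismatch and the use of a 0-based offset for the left edge of P in T are assumptions made here, not part of the original description.

```python
def first_mismatch(text, pattern, offset):
    """Compare pattern with text[offset:offset + m] from right to left.

    Returns the 1-based position in the pattern of the first mismatch,
    or 0 if every character matched. Assumes offset + m <= len(text).
    """
    j = len(pattern)
    # Walk leftwards while the characters agree.
    while j >= 1 and pattern[j - 1] == text[offset + j - 1]:
        j -= 1
    return j


print(first_mismatch("abcdef", "abd", 0))  # 3: 'd' vs 'c' fails immediately
print(first_mismatch("abcdef", "xbc", 0))  # 1: 'c' and 'b' match, 'x' fails
print(first_mismatch("abcdef", "cde", 2))  # 0: full match at offset 2
```

A return value of 0 means P occurs at that offset; any other value is the position where the two rules described next are applied.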
Next, the bad character rule and the good suffix rule are introduced in detail.
First, the concepts of bad character and good suffix.
See the figure:
In the figure, the first mismatched character (red) is the bad character, and the characters already matched (green) form the good suffix.
1) Bad character rule:
When the BM algorithm scans from right to left and finds a text character x that does not match, two cases arise:
I. If character x does not appear in pattern P, then the m characters of text starting at x obviously cannot match P, so this region is skipped directly.
II. If x appears in pattern P, P is shifted so that its rightmost occurrence of x is aligned with x in the text.
Expressed as a formula, where Skip(x) is the right-shift distance of P, m is the length of the pattern string P, and max(x) is the rightmost position of character x in P:
Skip(x) = m, if x does not appear in P
Skip(x) = m - max(x), if x appears in P
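A minimal sketch of the bad character table under the definitions above, with positions counted from 1. The names build_skip_table and skip, and the pattern "abcab", are hypothetical and chosen only so that c sits at position 3 of a 5-character pattern. Note that m - max(x) is the shift when the mismatch happens at the last character of P; for a mismatch at an earlier position j, the corresponding shift is j - max(x).

```python
def build_skip_table(pattern):
    """Map each character x of the pattern to m - max(x).

    max(x) is the 1-based rightmost position of x in the pattern.
    """
    m = len(pattern)
    rightmost = {}
    for pos, ch in enumerate(pattern, start=1):
        rightmost[ch] = pos  # later occurrences overwrite earlier ones
    return {ch: m - pos for ch, pos in rightmost.items()}


def skip(pattern, x):
    """Skip(x): m if x is absent from the pattern, else m - max(x)."""
    return build_skip_table(pattern).get(x, len(pattern))


print(skip("abcab", "c"))  # 2, since m = 5 and max('c') = 3
print(skip("abcab", "z"))  # 5, since 'z' does not appear in the pattern
```

In a real search the table would be built once per pattern rather than on every call; it is rebuilt here only to keep the sketch short.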
Example 1:
The red part is the mismatch.
The shift distance is Skip(c) = 5 - 3 = 2, so P moves two places to the right.
After the move:
2) Good suffix rule:
If a mismatch occurs after some characters have already been matched, two cases arise:
I. If the matched part t occurs again elsewhere in P as t', and the character preceding t' differs from the character preceding t, shift P to the right so that t' is aligned with the position where t was.
II. If t does not occur anywhere else in P, find the longest prefix x of P that equals a suffix P'' of t, and shift P to the right so that x is aligned with the position where the suffix P'' was.
Expressed as a formula, shift(j) is the right-shift distance of P, m is the length of the pattern string P, j is the position in P currently being compared, and s is the distance between t' and t (case I above) or the distance between x and P'' (case II above).
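The two cases can be folded into one brute-force search for the smallest valid s. The sketch below assumes j is the 1-based position of the mismatch, so that P[j+1..m] is the matched part; the function name good_suffix_shift and the sample patterns are hypothetical. It favours clarity over speed, whereas practical implementations precompute all shifts in O(m) time.

```python
def good_suffix_shift(pattern, j):
    """Smallest shift s (1 <= s <= m) allowed by the good suffix rule
    after a mismatch at 1-based position j of the pattern."""
    m = len(pattern)
    p = " " + pattern  # pad so that p[1..m] uses 1-based indices
    for s in range(1, m + 1):
        # Each matched character that still overlaps P after the shift
        # must agree with the character of P that lands on it.
        suffix_ok = all(k - s < 1 or p[k - s] == p[k]
                        for k in range(j + 1, m + 1))
        # Case I condition: the character before the re-occurrence must
        # differ from the character that just failed.
        prev_differs = j - s < 1 or p[j - s] != p[j]
        if suffix_ok and prev_differs:
            return s
    return m  # not reached: s = m always satisfies both conditions


print(good_suffix_shift("cabab", 3))   # 2: case I, "ab" occurs again
print(good_suffix_shift("abdcab", 3))  # 4: case II, prefix "ab" = suffix of "cab"
```

In the second call the matched part "cab" does not occur again, so only the prefix "ab" can be reused, which gives the larger shift of 4.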
The description above is somewhat abstract, so an example follows.
Example 2:
The matched part cab (green) does not appear again in P.
However, the suffix t' (blue) of the matched part equals the prefix P' (red) of P, so P is shifted so that P' is aligned with the position of t'.
After the move:
This completes the explanation of the two rules.
During BM matching, the larger of Skip(x) and shift(j) is taken as the shift distance.
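Putting the two rules together, a search loop might look like the sketch below. It is an illustration under stated assumptions, not a reference implementation: the name bm_search is hypothetical, the bad character shift is taken as j - max(x) for a mismatch at position j (which equals m - max(x) when the mismatch is at the last character), and the good suffix shifts are computed by brute force.

```python
def bm_search(text, pattern):
    """Return the 0-based offsets of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    p = " " + pattern  # 1-based access to the pattern
    rightmost = {ch: pos for pos, ch in enumerate(pattern, start=1)}

    def good_suffix_shift(j):
        # Smallest s for which the matched suffix P[j+1..m] stays
        # consistent and the character landing on position j changes.
        for s in range(1, m + 1):
            if (all(k - s < 1 or p[k - s] == p[k] for k in range(j + 1, m + 1))
                    and (j - s < 1 or p[j - s] != p[j])):
                return s
        return m

    # shifts[0] is used after a full match, shifts[j] after a mismatch at j.
    shifts = [good_suffix_shift(j) for j in range(m + 1)]

    matches = []
    i = 0  # offset of the left edge of the pattern in the text
    while i <= n - m:
        j = m
        while j >= 1 and p[j] == text[i + j - 1]:
            j -= 1  # compare from right to left
        if j == 0:
            matches.append(i)
            i += shifts[0]
        else:
            bad = j - rightmost.get(text[i + j - 1], 0)
            i += max(bad, shifts[j])  # take the larger of the two shifts
    return matches


print(bm_search("abcabcabd", "abcabd"))  # [3]
print(bm_search("aaaa", "aa"))           # [0, 1, 2]
```

Because every good suffix shift is at least 1, taking the maximum guarantees progress even when the bad character shift is zero or negative.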
The preprocessing time complexity of the BM algorithm is O(m + s) and the space complexity is O(s), where m is the length of P and s is the size of the finite character set of P and T; the time complexity of the search phase is O(m · n), where n is the length of T.
In the best case the time complexity is O(n/m), and in the worst case it is O(m · n).
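The gap between the two bounds can be observed by counting character comparisons. The sketch below is a deliberately simplified variant that uses only the bad character shift (at least 1) and moves one place after a full match; the function name count_comparisons and the test strings are assumptions made for illustration.

```python
def count_comparisons(text, pattern):
    """Count character comparisons made by a bad-character-only search."""
    m, n = len(pattern), len(text)
    rightmost = {ch: pos for pos, ch in enumerate(pattern, start=1)}
    comparisons = 0
    i = 0  # offset of the left edge of the pattern in the text
    while i <= n - m:
        j = m
        while j >= 1:
            comparisons += 1
            if pattern[j - 1] != text[i + j - 1]:
                break
            j -= 1
        if j == 0:
            i += 1  # full match: move on by one place
        else:
            i += max(1, j - rightmost.get(text[i + j - 1], 0))
    return comparisons


n, m = 20, 5
print(count_comparisons("b" * n, "a" * m))  # 4, i.e. n / m
print(count_comparisons("a" * n, "a" * m))  # 80, i.e. m * (n - m + 1)
```

When no text character occurs in the pattern, each alignment costs one comparison and shifts by m; when text and pattern consist of one repeated character, each alignment costs m comparisons and shifts by 1.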