This article is reposted from
Http://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html
Author: Ruan Yifeng
Date: May 3, 2013
In a previous article, I introduced the KMP algorithm.
However, KMP is not the most efficient algorithm, and it is not widely used in practice. The "Find" function (Ctrl+F) of text editors mostly uses the Boyer-Moore algorithm.
The Boyer-Moore algorithm is not only efficient but also ingenious and easy to understand. It was invented in 1977 by Professor Robert S. Boyer and Professor J Strother Moore of the University of Texas.
Below, I explain the algorithm using Professor Moore's own example.
1.
Assume the string is "HERE IS A SIMPLE EXAMPLE" and the search term is "EXAMPLE".
2.
First, the head of the search term is aligned with the head of the string, and the comparison starts from the tail of the search term.
This is a very clever idea: if the tail characters do not match, a single comparison is enough to know that the first 7 characters of the string, taken as a whole, cannot be the result we are looking for.
We see that "S" and "E" do not match. The "S" is called the "bad character", that is, the mismatched character in the string. We also see that "S" does not occur in the search term "EXAMPLE", which means the search term can be moved directly to the position just after the "S".
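The tail-first comparison in this step can be sketched in a few lines of Python. This is a minimal illustration, not code from the original article: the function name first_mismatch_from_tail and its parameters are assumptions made here, and the caller is assumed to pass an alignment at which the search term fits inside the string.

```python
def first_mismatch_from_tail(text: str, pattern: str, start: int) -> int:
    """Compare pattern against text[start:start + len(pattern)] from right to left.

    Returns the index (within pattern) of the first mismatch found,
    or -1 if every character matches.
    """
    j = len(pattern) - 1
    while j >= 0 and text[start + j] == pattern[j]:
        j -= 1
    return j


text = "HERE IS A SIMPLE EXAMPLE"
pattern = "EXAMPLE"

j = first_mismatch_from_tail(text, pattern, 0)
print(j, text[0 + j])  # 6 S  (the bad character is "S", opposite position 6)
```

A single comparison at the tail is enough here to reject the whole alignment, which is the point made above.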
3.
Still comparing from the tail, we find that "P" does not match "E", so "P" is the "bad character". However, "P" does occur in the search term "EXAMPLE". So the search term is moved back 2 positions, so that the two "P"s line up.
4.
From this we can state the "bad character rule":
Shift = position of the bad character - last occurrence of that character in the search term
Here the "position of the bad character" is the index in the search term at which the mismatch occurred. If the "bad character" does not occur in the search term, its last occurrence is -1.
Take "P" as an example: it appears as a "bad character" opposite position 6 of the search term (numbering from 0), and its last occurrence in the search term is at position 4, so the search term moves back 6 - 4 = 2 positions. Take the "S" from step 2 as another example: it appears opposite position 6 and its last occurrence is -1 (that is, it does not occur), so the whole search term moves back 6 - (-1) = 7 positions.
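The rule translates almost directly into code. The sketch below is an illustration under stated assumptions: the name bad_character_shift and its parameters are invented here, and Python's str.rfind is used because it returns -1 for a missing character, matching the convention in the rule.

```python
def bad_character_shift(pattern: str, mismatch_index: int, bad_char: str) -> int:
    """Shift suggested by the bad character rule.

    mismatch_index is the position in the pattern (counting from 0)
    opposite which the bad character was found.
    """
    last_occurrence = pattern.rfind(bad_char)  # -1 if bad_char is not in pattern
    return mismatch_index - last_occurrence


print(bad_character_shift("EXAMPLE", 6, "P"))  # 2
print(bad_character_shift("EXAMPLE", 6, "S"))  # 7
```

Note that the result can be zero or negative when the last occurrence lies to the right of the mismatch; a complete search would then have to fall back on another shift of at least 1.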
5.
Still comparing from the tail, "E" matches "E".
6.
Compare one more character to the left: "LE" matches "LE".
7.
Compare one more character to the left: "PLE" matches "PLE".
8.
Compare one more character to the left: "MPLE" matches "MPLE". We call this situation a "good suffix", meaning every suffix of the matched tail. Note that "MPLE", "PLE", "LE" and "E" are all good suffixes.
9.
Compare one more character to the left and find that "I" and "A" do not match. So "I" is the "bad character".
10.
According to the "bad character rule", the search term should now move back 2 - (-1) = 3 positions. The question is whether a better shift is available at this point.
11.
We know that a "good suffix" exists at this point, so the "good suffix rule" can be applied:
Shift = position of the good suffix - last occurrence of the good suffix in the search term
For example, suppose the trailing "AB" of the search term "ABCDAB" is a "good suffix". Its position is 5 (counting from 0 and taking the index of its last character, "B"), and its "last occurrence in the search term" is 1 (the position of the first "B"), so the search term moves back 5 - 1 = 4 positions and the earlier "AB" lands where the later "AB" was.
As another example, suppose the "EF" of "ABCDEF" is a good suffix. The position of "EF" is 5 and its last occurrence is -1 (that is, it does not occur again), so the search term moves back 5 - (-1) = 6 positions, which places the whole search term just after the "F".
Three points about this rule deserve attention:
(1) The position of a "good suffix" is the position of its last character. If the "EF" of "ABCDEF" is a good suffix, its position is that of "F", which is 5 (counting from 0).
(2) If a "good suffix" appears only once in the search term, its last occurrence is -1. For example, "EF" appears only once in "ABCDEF", so its last occurrence is -1 (that is, it does not occur again).
(3) If there are several "good suffixes", then every one except the longest must have its last occurrence at the head of the search term. For example, suppose the good suffixes of "BABCDAB" are "DAB", "AB" and "B". What is the "last occurrence" of the good suffix? The answer is that the good suffix used here is "B", and its last occurrence is at the head, position 0. The rule can also be put this way: if the longest good suffix appears only once, imagine the search term rewritten as "(DA)BABCDAB", with a virtual "DA" prepended.
Back to the example above. Of all the "good suffixes" (MPLE, PLE, LE, E), only "E" also appears at the head of "EXAMPLE", so the search term moves back 6 - 0 = 6 positions.
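The good suffix rule and its three notes can be sketched as follows. This is one reading of the rule as described in this article, not a reference implementation: the function name good_suffix_shift and the suffix_len parameter (the number of characters already matched at the tail, assumed to be at least 1) are assumptions introduced here.

```python
def good_suffix_shift(pattern: str, suffix_len: int) -> int:
    """Shift suggested by the good suffix rule as described above."""
    m = len(pattern)
    suffix = pattern[m - suffix_len:]
    position = m - 1  # note (1): position of the suffix's last character

    # Look for another occurrence of the longest good suffix, to the left
    # of the tail. Limiting the search to pattern[0:m-1] excludes the tail itself.
    start = pattern.rfind(suffix, 0, m - 1)
    if start != -1:
        last_occurrence = start + suffix_len - 1
    else:
        # Notes (2) and (3): a shorter good suffix only counts
        # if it appears at the head of the pattern.
        last_occurrence = -1
        for k in range(suffix_len - 1, 0, -1):
            if pattern.startswith(pattern[m - k:]):
                last_occurrence = k - 1
                break
    return position - last_occurrence


print(good_suffix_shift("ABCDAB", 2))   # 4  ("AB" occurs again, ending at position 1)
print(good_suffix_shift("ABCDEF", 2))   # 6  ("EF" occurs only once)
print(good_suffix_shift("BABCDAB", 3))  # 6  (only "B" appears at the head)
print(good_suffix_shift("EXAMPLE", 4))  # 6  (only "E" appears at the head)
```

The four calls reproduce the four cases worked through in the text.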
12.
As you can see, the "bad character rule" moves the search term only 3 positions, while the "good suffix rule" moves it 6. The basic idea of the Boyer-Moore algorithm is therefore to take the larger of the two shifts each time.
Better still, the shifts given by both rules can be derived from the search term alone, independently of the string being searched. A "bad character rule table" and a "good suffix rule table" can therefore be computed in advance, and at search time a shift is obtained with a simple table lookup.
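As an illustration of the precomputation, the sketch below builds both tables for a search term. The table layouts are assumptions made here for clarity (a dictionary from character to last position, and a dictionary from good-suffix length to shift); production implementations usually use arrays and a faster construction.

```python
def build_bad_character_table(pattern: str) -> dict:
    """Map each character to its last position in the pattern."""
    # Later positions overwrite earlier ones, so the last occurrence wins.
    return {ch: i for i, ch in enumerate(pattern)}


def build_good_suffix_table(pattern: str) -> dict:
    """Map the length of the matched tail to the good suffix shift."""
    m = len(pattern)
    table = {}
    for length in range(1, m):
        suffix = pattern[m - length:]
        start = pattern.rfind(suffix, 0, m - 1)  # occurrence left of the tail
        if start != -1:
            last = start + length - 1
        else:
            last = -1
            # Otherwise a shorter suffix must appear at the head to count.
            for k in range(length - 1, 0, -1):
                if pattern.startswith(pattern[m - k:]):
                    last = k - 1
                    break
        table[length] = (m - 1) - last
    return table


bad_table = build_bad_character_table("EXAMPLE")
good_table = build_good_suffix_table("EXAMPLE")

print(bad_table)   # {'E': 6, 'X': 1, 'A': 2, 'M': 3, 'P': 4, 'L': 5}
print(good_table)  # {1: 6, 2: 6, 3: 6, 4: 6, 5: 6, 6: 6}

# Lookup for the mismatch at position 2 against "I", with 4 characters matched:
bad_shift = 2 - bad_table.get("I", -1)  # 3
good_shift = good_table[4]              # 6
print(max(bad_shift, good_shift))       # 6
```

Characters missing from the bad character table are treated as having last occurrence -1, which is why the lookup uses a default value.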
13.
Continue comparing from the tail: "P" and "E" do not match, so "P" is the "bad character". According to the "bad character rule", the search term moves back 6 - 4 = 2 positions.
14.
Comparing from the tail one character at a time, every character matches, so the search is over. To continue searching (that is, to find all matches), apply the "good suffix rule" and move back 6 - 0 = 6 positions, so that the "E" at the head moves to where the "E" at the tail was.
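Putting the steps together, the whole walk-through can be sketched as one search loop. This is a simplified sketch of the procedure as described in this article, with the shifts computed on the fly instead of from precomputed tables; the names boyer_moore_search and last_occurrence_of_good_suffix are assumptions made here, and the extra floor of 1 on the shift is added so the loop always advances.

```python
def last_occurrence_of_good_suffix(pattern: str, length: int) -> int:
    """Last occurrence (by final character) of a good suffix of the given length."""
    m = len(pattern)
    start = pattern.rfind(pattern[m - length:], 0, m - 1)
    if start != -1:
        return start + length - 1
    # Shorter good suffixes only count when they appear at the head.
    for k in range(length - 1, 0, -1):
        if pattern.startswith(pattern[m - k:]):
            return k - 1
    return -1


def boyer_moore_search(text: str, pattern: str) -> list:
    """Return the start index of every match of pattern in text."""
    n, m = len(text), len(pattern)
    matches = []
    if m == 0 or m > n:
        return matches
    start = 0
    while start <= n - m:
        j = m - 1
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1  # compare from the tail
        matched = m - 1 - j
        if j < 0:
            # Full match: treat the whole pattern as the good suffix.
            matches.append(start)
            shift = (m - 1) - last_occurrence_of_good_suffix(pattern, m)
        else:
            bad = j - pattern.rfind(text[start + j])
            good = 1
            if matched > 0:
                good = (m - 1) - last_occurrence_of_good_suffix(pattern, matched)
            shift = max(bad, good, 1)  # take the larger of the two rules
        start += shift
    return matches


print(boyer_moore_search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))  # [17]
```

On the example string the alignments visited are 0, 7, 9, 15 and 17, matching the shifts of 7, 2, 6 and 2 in the steps above.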
(End)
Boyer-Moore algorithm for string matching