Ruan Yifeng: Boyer-Moore algorithm for String Matching
In the previous article, I introduced the KMP algorithm.
However, it is not the most efficient algorithm and is not widely used in practice. The "Find" function (Ctrl+F) of most text editors uses the Boyer-Moore algorithm instead.
The Boyer-Moore algorithm is not only efficient but also clever and easy to understand. It was invented in 1977 by Professor Robert S. Boyer and Professor J Strother Moore of the University of Texas.
Next, I will explain this algorithm based on Professor Moore's own example.
1.
Assume that the string is "HERE IS A SIMPLE EXAMPLE" and the search term is "EXAMPLE".
2.
First, align the start of the string with the start of the search term, and begin comparing from the end of the search term.
This is a very clever idea: if the last character does not match, a single comparison is enough to know that the first seven characters of the string, taken together, cannot be the result.
We can see that "S" and "E" do not match. The "S" is called a "bad character", that is, the mismatched character in the string. We also notice that "S" does not occur in the search term "EXAMPLE", which means the search term can be shifted directly to the position just after the "S".
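The right-to-left comparison in this step can be sketched in a few lines of Python. This is only an illustration; the function name `find_bad_character` and its `offset` parameter are my own and do not come from Professor Moore's example.

```python
def find_bad_character(text, pattern, offset):
    """Compare pattern with text[offset:] from right to left.

    Returns the index inside the pattern where the first mismatch
    occurs, or -1 if every character matches.
    """
    j = len(pattern) - 1
    while j >= 0 and text[offset + j] == pattern[j]:
        j -= 1
    return j


text = "HERE IS A SIMPLE EXAMPLE"
pattern = "EXAMPLE"

j = find_bad_character(text, pattern, 0)
print(j, text[j])  # 6 S  (the bad character is "S", opposite pattern index 6)
```

A single comparison is enough here: the very first character examined, the "S", already rules out this alignment.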
3.
Again comparing from the end, we find that "P" does not match "E", so "P" is the "bad character". However, "P" does occur in the search term "EXAMPLE". So the search term is shifted two positions to the right, so that the two "P"s line up.
4.
From this we can derive the "bad character rule":
Shift = position of the bad character - last occurrence of that character in the search term
If the "bad character" does not occur in the search term, its last occurrence is -1.
Take "P" as an example. As the "bad character", it sits opposite position 6 of the search term (counting from 0), and its last occurrence in the search term is at position 4, so the shift is 6 - 4 = 2 positions. Now take the "S" from step 2. It also sits opposite position 6, and its last occurrence is -1 (it does not occur at all), so the whole search term shifts 6 - (-1) = 7 positions.
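As a minimal sketch, the bad character rule is a single subtraction. The function name `bad_character_shift` is hypothetical; Python's built-in `str.rfind` conveniently returns -1 when the character is absent, which is exactly the convention used above.

```python
def bad_character_shift(pattern, mismatch_index, bad_char):
    """Shift = position of the bad character - its last occurrence in the pattern."""
    last_occurrence = pattern.rfind(bad_char)  # -1 if bad_char is not in the pattern
    return mismatch_index - last_occurrence


print(bad_character_shift("EXAMPLE", 6, "P"))  # 2
print(bad_character_shift("EXAMPLE", 6, "S"))  # 7
```

Note that the result can be zero or negative when the last occurrence lies to the right of the mismatch (for example, a bad character "E" opposite position 2 gives 2 - 6 = -4), so this rule alone is not always enough to move the search term forward.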
5.
Again comparing from the end, "E" matches "E".
6.
Comparing one character further to the left, "LE" matches "LE".
7.
Comparing one character further to the left, "PLE" matches "PLE".
8.
Comparing one character further to the left, "MPLE" matches "MPLE". We call this a "good suffix", that is, any string at the tail that has matched. Note that "MPLE", "PLE", "LE", and "E" are all good suffixes.
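To make the notion concrete, here is a small sketch that counts how many trailing characters match at a given alignment and lists the resulting good suffixes. The helper names are my own, and the offset 9 is simply the alignment reached after the two earlier shifts of 7 and 2.

```python
def matched_suffix_length(text, pattern, offset):
    """Count how many trailing pattern characters match text at this alignment."""
    m = len(pattern)
    k = 0
    while k < m and text[offset + m - 1 - k] == pattern[m - 1 - k]:
        k += 1
    return k


def good_suffixes(pattern, k):
    """List every suffix of the matched tail, longest first."""
    m = len(pattern)
    return [pattern[m - n:] for n in range(k, 0, -1)]


text = "HERE IS A SIMPLE EXAMPLE"
pattern = "EXAMPLE"

k = matched_suffix_length(text, pattern, 9)
print(k)                          # 4
print(good_suffixes(pattern, k))  # ['MPLE', 'PLE', 'LE', 'E']
```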
9.
Comparing one character further to the left, we find that "I" and "A" do not match. So "I" is the "bad character".
10.
According to the "bad character rule", the search term should now be shifted 2 - (-1) = 3 positions. The question is: is there a better shift available at this point?
11.
We know that there is a "good suffix" at this point. So we can apply the "good suffix rule":
Shift = position of the good suffix - previous occurrence of the good suffix in the search term
In this calculation, a position is always measured at the last character of the "good suffix". If the "good suffix" does not occur again in the search term, its previous occurrence is -1.
Among all the "good suffixes" (MPLE, PLE, LE, E), only "E" occurs a second time in "EXAMPLE" (at the head, position 0), so the shift is 6 - 0 = 6 positions.
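The rule can be sketched as follows. This is one possible reading of the description above, not a definitive implementation: it first looks for an earlier copy of the longest good suffix, and if there is none it falls back to the longest shorter good suffix that sits at the head of the search term. The function name `good_suffix_shift` and the parameter `suffix_len` (the number of characters that matched) are assumptions of mine.

```python
def good_suffix_shift(pattern, suffix_len):
    """Shift = position of the good suffix - its previous occurrence.

    Positions are measured at the last character of the suffix.
    suffix_len is the number of trailing characters that matched (>= 1).
    """
    m = len(pattern)
    suffix = pattern[m - suffix_len:]

    # Look for an earlier copy of the whole good suffix; limiting the
    # search to pattern[0:m-1] excludes the suffix itself.
    start = pattern.rfind(suffix, 0, m - 1)
    if start != -1:
        previous = start + suffix_len - 1
    else:
        # Otherwise a shorter good suffix counts only if it is at the head.
        previous = -1
        for k in range(suffix_len - 1, 0, -1):
            if pattern.startswith(pattern[m - k:]):
                previous = k - 1
                break

    return (m - 1) - previous


print(good_suffix_shift("EXAMPLE", 4))  # 6  ("MPLE" matched; only "E" recurs, at position 0)
print(good_suffix_shift("ABCAB", 2))    # 3  ("AB" recurs, ending at position 1)
```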
12.
As you can see, the "bad character rule" only allows a shift of 3 positions, while the "good suffix rule" allows a shift of 6. So the basic idea of the Boyer-Moore algorithm is to shift, each time, by the larger of the two values given by these rules.
What is even more clever is that the shift produced by each rule depends only on the search term, not on the original string. Therefore, a bad character table and a good suffix table can be computed in advance, and during the search you only need to look up the tables.
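Putting the two rules together, a minimal sketch of the whole search might look like this. All names here are hypothetical, and the good suffix table is built in a simple, unoptimized way that favors readability over preprocessing speed; real implementations typically build it more efficiently.

```python
def build_bad_character_table(pattern):
    """Map each character of the pattern to its last position."""
    return {ch: i for i, ch in enumerate(pattern)}


def build_good_suffix_table(pattern):
    """table[k] is the shift to use when the last k characters matched."""
    m = len(pattern)
    table = [1] * (m + 1)  # table[0]: no good suffix, move at least 1
    for k in range(1, m + 1):
        suffix = pattern[m - k:]
        start = pattern.rfind(suffix, 0, m - 1)  # earlier copy of the suffix
        if start != -1:
            previous = start + k - 1
        else:
            previous = -1
            for p in range(k - 1, 0, -1):  # shorter suffix at the head
                if pattern.startswith(pattern[m - p:]):
                    previous = p - 1
                    break
        table[k] = (m - 1) - previous
    return table


def boyer_moore_search(text, pattern):
    """Return the index of the first match, or -1 if there is none."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    bad = build_bad_character_table(pattern)
    good = build_good_suffix_table(pattern)

    offset = 0
    while offset <= n - m:
        j = m - 1
        while j >= 0 and text[offset + j] == pattern[j]:
            j -= 1
        if j < 0:
            return offset
        bad_shift = j - bad.get(text[offset + j], -1)
        good_shift = good[m - 1 - j]
        offset += max(bad_shift, good_shift)  # larger of the two rules
    return -1


print(boyer_moore_search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))  # 17
```

Both tables are built from the search term alone, before the string is ever examined. On the example, the shifts taken are 7, 2, 6 and 2, matching the walkthrough.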
13.
Again comparing from the end, "P" does not match "E", so "P" is the "bad character". According to the "bad character rule", the shift is 6 - 4 = 2 positions.
14.
Comparing from the end, character by character, we find that everything matches, so the search ends. If you want to continue searching (that is, to find all matches), the "good suffix rule" gives a shift of 6 - 0 = 6 positions, which moves the "E" at the head to the position of the "E" at the tail.
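The shift after a full match can be sketched on its own. To keep the example short, the code below deliberately moves only one position on a mismatch instead of using the two rules; only the step taken after a complete match follows the good suffix rule. The function names are my own.

```python
def shift_after_full_match(pattern):
    """Good suffix shift when the whole pattern matched.

    Looks for the longest proper tail of the pattern that also sits
    at its head; if there is none, the pattern can move its full length.
    """
    m = len(pattern)
    for k in range(m - 1, 0, -1):
        if pattern.startswith(pattern[m - k:]):
            return m - k
    return m


def find_all(text, pattern):
    """Return the start index of every match (mismatches advance by 1 for brevity)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    step = shift_after_full_match(pattern)
    matches = []
    offset = 0
    while offset <= n - m:
        j = m - 1
        while j >= 0 and text[offset + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(offset)
            offset += step
        else:
            offset += 1
    return matches


print(shift_after_full_match("EXAMPLE"))        # 6
print(find_all("EXAMPLE EXAMPLE", "EXAMPLE"))   # [0, 8]
```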
Attached is a C# implementation:
/// <summary>
/// Returns the start position of the specified string in the searched string. If it is not found, -1 is returned.
/// </summary>
/// <param name="haystack">string to be searched</param>
/// <param name="needle">string to search for</param>
/// <param name="startpos">start position</param>
/// <returns>the next position found</returns>
public static int indexof(string haystack, string needle, int startpos)
{
    if (needle.Length == 0)
    {
        return 0;
    }
    int[] chartable = makechartable(needle);
    int[] offsettable = makeoffsettable(needle);
    for (int i = needle.Length - 1 + startpos, j = 0; i