Ruan Yifeng: Boyer-Moore algorithm for String Matching

Last Update:2018-12-03 Source: Internet

Author: User


In the previous article, I introduced the KMP algorithm.

However, KMP is not the most efficient algorithm, and in practice it is not widely used. The "Find" function (Ctrl + F) of text editors mostly uses the Boyer-Moore algorithm.

The Boyer-Moore algorithm is not only efficient, but also cleverly designed and easy to understand. It was invented in 1977 by Professor Robert S. Boyer and Professor J Strother Moore of the University of Texas.

Next, I will explain this algorithm based on Professor Moore's own example.

Assume that the string is "HERE IS A SIMPLE EXAMPLE" and the search term is "EXAMPLE".

First, the search term is aligned with the start of the string, and the comparison begins from the end of the search term.

This is a very clever idea: if the last character does not match, a single comparison is enough to know that the first seven characters of the string cannot be the match.

We can see that "S" and "E" do not match. Here "S" is called the "bad character", that is, the mismatched character in the string. We also notice that "S" does not occur in the search term "EXAMPLE", which means the search term can be shifted directly to the position just after the "S".

Comparing from the end again, we find that "P" does not match "E", so "P" is the bad character. However, "P" does occur in the search term "EXAMPLE". Therefore, the search term is shifted 2 positions to the right, so that the two "P"s are aligned.

From this we can derive the "bad character rule":

Shift distance = position of the bad character - position of its last occurrence in the search term

If the bad character does not occur in the search term, the position of its last occurrence is -1.

Take "P" as an example. As the bad character, it is aligned with position 6 of the search term (counting from 0), and its last occurrence in the search term is at position 4, so the shift is 6 - 4 = 2 positions. Now take the earlier "S" as an example. It is also aligned with position 6, and its last occurrence is -1 (it does not occur in the search term), so the entire search term is shifted 6 - (-1) = 7 positions.
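The arithmetic of the bad character rule can be sketched in a few lines of Python. This is only an illustration of the formula above, not code from the original article; the function name `bad_char_shift` and its parameters are hypothetical.

```python
def bad_char_shift(pattern: str, mismatch_index: int, bad_char: str) -> int:
    """Shift suggested by the bad character rule.

    mismatch_index is the position in the search term (counting from 0)
    that is aligned with the bad character.
    """
    # str.rfind returns -1 when bad_char does not occur in the pattern,
    # which is exactly the -1 convention used by the rule.
    last_occurrence = pattern.rfind(bad_char)
    return mismatch_index - last_occurrence


print(bad_char_shift("EXAMPLE", 6, "S"))  # 7
print(bad_char_shift("EXAMPLE", 6, "P"))  # 2
```

Note that this formula alone can give zero or a negative number when the last occurrence lies to the right of the mismatch (for example, a bad character "E" aligned with position 2), so an actual search must never rely on it as the only source of the shift.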

Comparison again starts from the end: "E" matches "E".

Moving one character back, "LE" matches "LE".

Moving one more character back, "PLE" matches "PLE".

Moving one more character back, "MPLE" matches "MPLE". We call this situation a "good suffix", that is, any string at the tail of the search term that has matched. Note that "MPLE", "PLE", "LE", and "E" are all good suffixes.

Moving one more character back, we find that "I" and "A" do not match. Therefore, "I" is the bad character.

According to the bad character rule, the search term should be shifted 2 - (-1) = 3 positions. The question is whether a better shift is available at this point.

We know that a good suffix exists here, so we can apply the "good suffix rule":

Shift distance = position of the good suffix - position of its previous occurrence in the search term

In this calculation, the position of the good suffix is taken to be the position of its last character. If the good suffix does not occur anywhere else in the search term, the position of its previous occurrence is -1.

Among all the good suffixes ("MPLE", "PLE", "LE", "E"), only "E" occurs elsewhere in "EXAMPLE" (at the head, position 0), so the shift is 6 - 0 = 6 positions.
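The good suffix rule can be sketched as follows. The function name `good_suffix_shift` is hypothetical, and the code makes one assumption that the text above only hints at: when the longest good suffix does not occur again, a shorter good suffix counts only if it sits at the head of the search term (as "E" does in "EXAMPLE"). That is the usual formulation of the rule, but treat this as an illustrative sketch rather than a reference implementation.

```python
def good_suffix_shift(pattern: str, matched_len: int) -> int:
    """Shift suggested by the good suffix rule.

    matched_len is the number of characters at the tail of the pattern
    that matched before the mismatch (at least 1).
    """
    m = len(pattern)
    last = m - 1                          # position of the good suffix (its last character)
    suffix = pattern[m - matched_len:]    # the longest good suffix

    # Case 1: the longest good suffix occurs again earlier in the pattern.
    # Restricting the search to pattern[:m - 1] excludes the suffix itself.
    start = pattern.rfind(suffix, 0, m - 1)
    if start != -1:
        return last - (start + matched_len - 1)

    # Case 2 (assumption, see above): a shorter good suffix that is also
    # a prefix of the pattern; try the longest candidate first.
    for k in range(matched_len - 1, 0, -1):
        if pattern[:k] == pattern[m - k:]:
            return last - (k - 1)

    # Case 3: nothing reoccurs, so the previous occurrence is -1.
    return last - (-1)


print(good_suffix_shift("EXAMPLE", 4))  # 6
```

With "EXAMPLE" and four matched characters, "MPLE", "PLE", and "LE" all fail, and "E" is found at position 0, which reproduces the 6 - 0 = 6 calculation.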

As you can see, the bad character rule allows a shift of only 3 positions, while the good suffix rule allows a shift of 6. So the basic idea of the Boyer-Moore algorithm is to shift, each time, by the larger of the two values given by these rules.

What's more clever is that the shift distances given by these two rules depend only on the search term, not on the string being searched. Therefore, a bad character table and a good suffix table can be computed in advance, and during the search you only need to look the values up.
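As a rough illustration of this precomputation, the sketch below builds both tables from the search term alone. The function names and the table layout (a dictionary of last positions, and a list indexed by the number of matched characters) are assumptions made for this example. The good suffix table is built in a simple, non-optimized way; real implementations typically use a linear-time construction.

```python
def build_bad_char_table(pattern: str) -> dict:
    """Map each character of the pattern to the position of its last occurrence."""
    # Later occurrences overwrite earlier ones, so the last position wins.
    return {ch: i for i, ch in enumerate(pattern)}


def build_good_suffix_table(pattern: str) -> list:
    """table[k] is the shift to use when the last k characters have matched."""
    m = len(pattern)
    table = [1] * (m + 1)  # table[0]: no good suffix, so only a minimal shift of 1
    for k in range(1, m + 1):
        suffix = pattern[m - k:]
        # Earlier occurrence of the whole good suffix (excluding the suffix itself).
        start = pattern.rfind(suffix, 0, m - 1)
        if start != -1:
            table[k] = (m - 1) - (start + k - 1)
            continue
        # Otherwise: longest shorter suffix that is also a prefix (0 if none).
        border = next(
            (b for b in range(k - 1, 0, -1) if pattern[:b] == pattern[m - b:]), 0
        )
        table[k] = m - border  # border == 0 corresponds to the -1 convention
    return table


bad_char = build_bad_char_table("EXAMPLE")
good_suffix = build_good_suffix_table("EXAMPLE")
print(bad_char.get("P", -1))  # 4
print(bad_char.get("S", -1))  # -1
print(good_suffix[4])         # 6
```

Characters that are missing from the bad character table are looked up with a default of -1, matching the rule for a bad character that does not occur in the search term.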

Comparison again starts from the end. "P" does not match "E", so "P" is the bad character. According to the bad character rule, the shift is 6 - 4 = 2 positions.

Comparing character by character from the end, we find that everything matches, so the search is finished. To keep searching (that is, to find all matches), the good suffix rule gives a shift of 6 - 0 = 6 positions, which moves the "E" at the head of the search term to where the "E" at its tail was.
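Putting the whole walkthrough together, a complete search that reports every match might look like the sketch below. It is written for this explanation and is not the article's attached listing; the name `boyer_moore_find_all` is hypothetical, the tables are built in a simple quadratic way, and shorter good suffixes are assumed to count only when they sit at the head of the search term.

```python
def boyer_moore_find_all(text: str, pattern: str) -> list:
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Bad character table: last position of each character in the pattern.
    last = {ch: i for i, ch in enumerate(pattern)}

    # Good suffix table: good[k] is the shift after k trailing characters matched.
    good = [1] * (m + 1)  # good[0] = 1 guarantees progress when nothing matched
    for k in range(1, m + 1):
        start = pattern.rfind(pattern[m - k:], 0, m - 1)
        if start != -1:
            good[k] = (m - 1) - (start + k - 1)
        else:
            border = next(
                (b for b in range(k - 1, 0, -1) if pattern[:b] == pattern[m - b:]), 0
            )
            good[k] = m - border

    matches = []
    s = 0  # position in text aligned with pattern[0]
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:  # compare from the end
            j -= 1
        if j < 0:
            matches.append(s)   # full match
            s += good[m]        # continue with the good suffix rule
        else:
            bad = j - last.get(text[s + j], -1)
            s += max(bad, good[m - 1 - j])  # larger of the two rules
    return matches


print(boyer_moore_find_all("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))  # [17]
```

On the example string, the alignments visited are 0, 7, 9, 15, and 17, which correspond to the shifts of 7, 2, 6, and 2 positions described above.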

Attached C# implementation:

/// <summary>
/// Returns the start position of the specified string within the searched string, or -1 if it is not found.
/// </summary>
/// <param name="haystack">The string to search in</param>
/// <param name="needle">The string to search for</param>
/// <param name="startpos">Start position</param>
/// <returns>The position of the next match found</returns>
public static int indexof(string haystack, string needle, int startpos)
{
    // An empty search term matches immediately.
    if (needle.Length == 0)
    {
        return 0;
    }
    // Precomputed tables for the bad character rule and the good suffix rule.
    int[] chartable = makechartable(needle);
    int[] offsettable = makeoffsettable(needle);
    for (int i = needle.Length - 1 + startpos, j = 0; i

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
