Boyer-Moore algorithm for String Matching

Last Update: 2018-12-06 Source: Internet

Author: User


The "Find" function (Ctrl+F) of many text editors is implemented with the Boyer-Moore algorithm.

The Boyer-Moore algorithm is not only efficient, but also clever and easy to understand. It was invented in 1977 by Professor Robert S. Boyer of the University of Texas and Professor J Strother Moore.

Next, I will explain this algorithm based on Professor Moore's own example.

Assume that the string is "HERE IS A SIMPLE EXAMPLE" and the search term is "EXAMPLE".

First, the string and the search term are aligned at their heads, and the comparison starts from the end of the search term.

This is a very clever idea: if the last character does not match, a single comparison is enough to know that the first seven characters of the string cannot be a match.
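The right-to-left comparison at a single alignment can be sketched as follows. This is only an illustration; the function name `mismatch_from_end` and the `offset` parameter (the index in the string where the head of the search term is placed) are assumptions made for this example, not part of the original article.

```python
def mismatch_from_end(text, pattern, offset):
    """Compare pattern with text[offset:offset + len(pattern)] from right to left.

    Returns the index inside the pattern of the first mismatch found,
    or -1 if every character matches.
    """
    for j in range(len(pattern) - 1, -1, -1):
        if text[offset + j] != pattern[j]:
            return j
    return -1


text = "HERE IS A SIMPLE EXAMPLE"
pattern = "EXAMPLE"

# Heads aligned: the very first comparison ("S" vs. the final "E") fails.
print(mismatch_from_end(text, pattern, 0))   # 6
# Aligned with the last word of the string: everything matches.
print(mismatch_from_end(text, pattern, 17))  # -1
```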

We can see that "S" and "E" do not match. The "S" is then called the "bad character", that is, the character in the string that failed to match. We also see that "S" does not occur anywhere in the search term "EXAMPLE", which means the search term can be moved directly to the position just after the "S".

Comparison again starts from the end: "P" does not match "E", so "P" is the bad character. This time, however, "P" does occur in the search term "EXAMPLE", so the search term is shifted two positions to bring the two "P"s into alignment.

From this we can derive the "bad character rule":

Shift = position of the bad character in the search term - last occurrence of that character in the search term

If the bad character does not occur in the search term, its last occurrence is taken to be -1.

Take "P" as an example. As the bad character, it is lined up with position 6 of the search term (counting from 0), and its last occurrence in the search term is at position 4, so the shift is 6 - 4 = 2 positions. Take the earlier "S" as another example. It is also lined up with position 6, and its last occurrence is -1 (it does not occur in the search term), so the whole search term is shifted 6 - (-1) = 7 positions.
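The arithmetic of the bad character rule can be checked with a few lines of code. This is a minimal sketch under the rule as stated above; the helper names `last_occurrence` and `bad_char_shift` are hypothetical and chosen only for this example.

```python
def last_occurrence(pattern):
    """Map each character to the index of its last occurrence in the pattern."""
    # Later indices overwrite earlier ones, so the last occurrence wins.
    return {ch: i for i, ch in enumerate(pattern)}


def bad_char_shift(pattern, j, bad_char):
    """Shift given by the bad character rule.

    j is the position in the pattern that the bad character is lined up with;
    a character that is absent from the pattern counts as position -1.
    """
    return j - last_occurrence(pattern).get(bad_char, -1)


print(bad_char_shift("EXAMPLE", 6, "P"))  # 6 - 4 = 2
print(bad_char_shift("EXAMPLE", 6, "S"))  # 6 - (-1) = 7
```

Note that this value alone can be zero or negative when the last occurrence lies to the right of the mismatch; the sketch only reproduces the two cases worked through above.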

Comparison starts from the end again, and "E" matches "E".

Compare the previous character: "LE" matches "LE".

Compare the previous character: "PLE" matches "PLE".

Compare the previous character: "MPLE" matches "MPLE". We call this a "good suffix", that is, any string at the end of the search term that has been matched. Note that "MPLE", "PLE", "LE", and "E" are all good suffixes.

Comparing the previous character, we find that "I" and "A" do not match, so "I" is the bad character.
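To make the terms concrete, the following sketch collects the good suffixes and the bad character for one alignment. The function name `good_suffixes` and the alignment offset 9 (where the search term stands after the two earlier shifts of 7 and 2) are assumptions of this example.

```python
def good_suffixes(text, pattern, offset):
    """Scan right to left and report what matched before the first mismatch.

    Returns (list of good suffixes, mismatch index in the pattern, bad character).
    On a full match the index is -1 and the bad character is None.
    """
    j = len(pattern) - 1
    while j >= 0 and text[offset + j] == pattern[j]:
        j -= 1
    matched = pattern[j + 1:]  # longest good suffix
    suffixes = [matched[i:] for i in range(len(matched))]
    bad_char = text[offset + j] if j >= 0 else None
    return suffixes, j, bad_char


print(good_suffixes("HERE IS A SIMPLE EXAMPLE", "EXAMPLE", 9))
# (['MPLE', 'PLE', 'LE', 'E'], 2, 'I')
```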


According to the bad character rule, the search term should be shifted 2 - (-1) = 3 positions. The question is whether a better shift is available at this point.


We know that good suffixes exist at this point, so we can apply the "good suffix rule":

Shift = position of the good suffix - last occurrence of the good suffix elsewhere in the search term

Positions are measured at the last character of the good suffix. If the good suffix does not occur anywhere else in the search term, its last occurrence is taken to be -1. When there are several good suffixes, every one except the longest must appear at the head of the search term to count.

Of all the good suffixes ("MPLE", "PLE", "LE", and "E"), only "E" occurs a second time in "EXAMPLE", at the head (position 0), so the shift is 6 - 0 = 6 positions.
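The good suffix calculation can be sketched as below. The sketch assumes the usual reading of the rule: the longest good suffix may reoccur anywhere earlier in the search term, while a shorter good suffix counts only if it appears at the head of the search term. The function name `good_suffix_shift` and its parameter `k` (the number of characters matched at the end) are hypothetical.

```python
def good_suffix_shift(pattern, k):
    """Shift given by the good suffix rule when the last k characters matched (k >= 1)."""
    m = len(pattern)
    suffix = pattern[m - k:]  # longest good suffix

    # Rightmost earlier occurrence of the longest good suffix, if any.
    start = pattern.rfind(suffix, 0, m - 1)
    if start != -1:
        last_pos = start + k - 1  # measured at its last character
        return (m - 1) - last_pos

    # Otherwise try shorter good suffixes, which must sit at the head.
    for length in range(k - 1, 0, -1):
        if pattern.startswith(pattern[m - length:]):
            return (m - 1) - (length - 1)

    # No reoccurrence at all: last occurrence is -1.
    return (m - 1) - (-1)


print(good_suffix_shift("EXAMPLE", 4))  # good suffix "MPLE": 6 - 0 = 6
print(good_suffix_shift("ABCDAB", 2))   # good suffix "AB":   5 - 1 = 4
print(good_suffix_shift("ABCDEF", 2))   # good suffix "EF":   5 - (-1) = 6
```

The last two patterns are extra illustrations that do not come from the walkthrough above.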


As you can see, the bad character rule gives a shift of only 3 positions, while the good suffix rule gives a shift of 6. The basic idea of the Boyer-Moore algorithm is therefore to shift, each time, by the larger of the two values.

Better still, the shift values of both rules depend only on the search term, not on the string being searched. A bad character table and a good suffix table can therefore be computed in advance, and during the search the shifts only need to be looked up.
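One possible layout for the two precomputed tables is sketched here. The table shapes (a dictionary for bad characters, a list indexed by the number of matched characters for good suffixes) and the function names are assumptions of this example; real implementations often use more compact or faster constructions.

```python
def build_bad_char_table(pattern):
    """Last occurrence of every character in the pattern; absent characters mean -1."""
    return {ch: i for i, ch in enumerate(pattern)}


def build_good_suffix_table(pattern):
    """table[k] is the good suffix shift when the last k characters matched.

    table[0] is unused (no good suffix exists when nothing matched).
    """
    m = len(pattern)
    table = [0] * (m + 1)
    for k in range(1, m + 1):
        start = pattern.rfind(pattern[m - k:], 0, m - 1)
        if start != -1:
            last_pos = start + k - 1
        else:
            last_pos = -1
            # Shorter good suffixes count only at the head of the pattern.
            for length in range(k - 1, 0, -1):
                if pattern.startswith(pattern[m - length:]):
                    last_pos = length - 1
                    break
        table[k] = (m - 1) - last_pos
    return table


bad = build_bad_char_table("EXAMPLE")
good = build_good_suffix_table("EXAMPLE")

# Mismatch at pattern position 2 on the character "I", with 4 characters matched:
bad_shift = 2 - bad.get("I", -1)  # 3
good_shift = good[4]              # 6
print(max(bad_shift, good_shift)) # 6
```

Both tables are built from the search term alone; the string being searched is consulted only to pick which entry to read.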


Comparison resumes from the end: "P" does not match "E", so "P" is the bad character. By the bad character rule, the shift is 6 - 4 = 2 positions.


Comparing character by character from the end, every character matches, so the search is complete. To continue searching (that is, to find all matches), apply the good suffix rule and shift 6 - 0 = 6 positions, so that the "E" at the head moves to where the "E" at the tail was.
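Putting the walkthrough together, a complete search might look like the sketch below. It follows the simplified rules described in this article rather than any particular library's implementation; the function name `boyer_moore_search` and the extra list of visited alignments (returned only to make the trace visible) are assumptions of this example.

```python
def boyer_moore_search(text, pattern):
    """Return (match offsets, alignments tried) using both shift rules."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], []

    # Bad character table: last occurrence of each character.
    last = {ch: i for i, ch in enumerate(pattern)}

    # Good suffix table: good[k] is the shift when the last k characters matched.
    good = [0] * (m + 1)
    for k in range(1, m + 1):
        start = pattern.rfind(pattern[m - k:], 0, m - 1)
        last_pos = start + k - 1 if start != -1 else -1
        if start == -1:
            for length in range(k - 1, 0, -1):
                if pattern.startswith(pattern[m - length:]):
                    last_pos = length - 1
                    break
        good[k] = (m - 1) - last_pos

    matches, visited = [], []
    offset = 0
    while offset <= n - m:
        visited.append(offset)
        j = m - 1
        while j >= 0 and text[offset + j] == pattern[j]:
            j -= 1
        if j < 0:
            # Full match: record it, then continue with the good suffix rule.
            matches.append(offset)
            offset += good[m]
            continue
        matched = m - 1 - j  # length of the longest good suffix
        shift = j - last.get(text[offset + j], -1)
        if matched > 0:
            shift = max(shift, good[matched])
        offset += shift
    return matches, visited


print(boyer_moore_search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))
# ([17], [0, 7, 9, 15, 17])
```

The visited alignments 0, 7, 9, 15, 17 correspond to the shifts of 7, 2, 6, and 2 positions worked out step by step above.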

Link: http://blog.jobbole.com/39132/

The algorithm is simple and easy to implement.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
