Boyer-Moore algorithm for String Matching

Source: Internet
Author: User

 

The Find function (Ctrl+F) in many text editors is built on the Boyer-Moore algorithm.

The Boyer-Moore algorithm is not only efficient, but also clever and easy to understand. Robert S. Boyer and J Strother Moore, both professors at the University of Texas, published it in 1977.

Next, I will explain this algorithm based on Professor Moore's own example.

1.

Assume that the string is "HERE IS A SIMPLE EXAMPLE" and the search term is "EXAMPLE".

2.

First, align the start of the string with the start of the search term, and begin comparing from the end of the search term.

This is a very clever idea: if the last character does not match, a single comparison is enough to know that the first seven characters of the string cannot be the match.

We can see that "S" and "E" do not match. Here "S" is called the "bad character", that is, the character in the string that failed to match. We also see that "S" does not occur anywhere in the search term "EXAMPLE", which means the search term can be shifted directly to the position just after "S".

3.

Comparing from the end again, we find that "P" does not match "E", so "P" is the "bad character". This time, however, "P" does occur in the search term "EXAMPLE". The search term is therefore shifted two positions so that the two "P"s line up.

4.

From this we can derive the "bad character rule":

Shift distance = position in the search term aligned with the bad character - position of the bad character's last occurrence in the search term

If the "bad character" does not occur in the search term, its last occurrence position is -1.

Take "P" in step 3 as an example. As the bad character, it is aligned with position 6 of the search term (counting from 0), and the last occurrence of "P" in the search term is at position 4, so the shift is 6 - 4 = 2 positions. Now take "S" in step 2. It is also aligned with position 6, and its last occurrence position is -1 (it does not occur in the search term), so the whole search term is shifted 6 - (-1) = 7 positions.
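The rule above can be checked with a few lines of Python. This is only a minimal sketch of the simplified rule as stated in this article; the function name `bad_character_shift` and its parameters are illustrative and do not come from the original text.

```python
def bad_character_shift(pattern: str, mismatch_pos: int, bad_char: str) -> int:
    """Shift given by the simplified bad character rule.

    mismatch_pos is the index in the pattern (from 0) that was aligned
    with the bad character when the comparison failed.
    """
    # str.rfind returns the index of the last occurrence, or -1 if the
    # character is absent, which is exactly the convention of the rule.
    last_occurrence = pattern.rfind(bad_char)
    return mismatch_pos - last_occurrence


print(bad_character_shift("EXAMPLE", 6, "P"))  # 6 - 4 = 2
print(bad_character_shift("EXAMPLE", 6, "S"))  # 6 - (-1) = 7
```

Note that this formula can yield zero or a negative number when the last occurrence lies at or to the right of the mismatch position. The article does not cover that case; a practical implementation has to fall back on another rule or on a minimum shift of 1.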

5.

Comparing from the end again, "E" matches "E".

6.

Compare the preceding character: "LE" matches "LE".

7.

Compare the preceding character: "PLE" matches "PLE".

8.

Compare the preceding character: "MPLE" matches "MPLE". We call this a "good suffix", that is, a run of matched characters at the end of the search term. Note that "MPLE", "PLE", "LE", and "E" are all good suffixes.

9.

Comparing the preceding character, we find that "I" and "A" do not match. Therefore, "I" is the "bad character".

10.

According to the "bad character rule", the search term should be shifted 2 - (-1) = 3 positions. The question is whether a better shift is available at this point.

11.

We know that a "good suffix" exists here, so we can apply the "good suffix rule":

Shift distance = position of the good suffix - position of its previous occurrence in the search term

In this calculation, the position of a good suffix is the position of its last character. If the "good suffix" does not occur again in the search term, its previous occurrence position is -1.

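Of all the "good suffixes" ("MPLE", "PLE", "LE", and "E"), only "E" occurs a second time in "EXAMPLE", at position 0, so the shift is 6 - 0 = 6 positions. A good suffix shorter than the longest one counts only when its other occurrence is at the very start of the search term, which is the case for "E" here.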
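To make the good suffix calculation concrete, here is a brute-force sketch in Python. It assumes the simplified rule described in this article: first look for an earlier occurrence of the whole matched suffix, then fall back to shorter suffixes that also appear at the start of the search term. The name `good_suffix_shift` and the `matched_len` parameter are illustrative, not from the original text.

```python
def good_suffix_shift(pattern: str, matched_len: int) -> int:
    """Shift given by the simplified good suffix rule.

    matched_len is how many characters at the end of the pattern
    matched before the mismatch (the length of the longest good suffix).
    """
    m = len(pattern)
    if matched_len == 0:
        return 1  # no good suffix at all: the rule offers no help
    suffix = pattern[m - matched_len:]

    # Case 1: the whole good suffix occurs earlier in the pattern.
    # Scan right to left so the nearest occurrence (smallest shift) wins.
    for start in range(m - matched_len - 1, -1, -1):
        if pattern[start:start + matched_len] == suffix:
            return (m - matched_len) - start

    # Case 2: a shorter good suffix is also a prefix of the pattern.
    for k in range(matched_len - 1, 0, -1):
        if pattern[:k] == pattern[m - k:]:
            return m - k

    # Case 3: nothing reoccurs, so shift past the whole pattern.
    return m


print(good_suffix_shift("EXAMPLE", 4))  # good suffix "MPLE" -> 6
print(good_suffix_shift("ABCDAB", 2))   # good suffix "AB" -> 4
```

The repeated slicing makes this sketch slow for long search terms. It is meant to mirror the rule step by step, not to show how production implementations build the table.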

12.

As you can see, the "bad character rule" allows a shift of only 3 positions, while the "good suffix rule" allows 6. So the basic idea of the Boyer-Moore algorithm is to shift, at each step, by the larger of the two values.

What is even more clever is that the shift distances of both rules depend only on the search term, not on the string being searched. The bad character table and the good suffix table can therefore be computed in advance, and during the search you only need to look the values up.
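A possible way to precompute both tables is sketched below. The function name `build_tables` and the table layout (a dictionary for bad characters, a list indexed by the number of matched characters for good suffixes) are assumptions made for illustration; the article does not prescribe any particular layout.

```python
def build_tables(pattern: str):
    """Precompute lookup tables for both rules (illustrative layout)."""
    m = len(pattern)

    # Bad character table: character -> index of its last occurrence.
    # Characters missing from the table are treated as position -1.
    last = {ch: i for i, ch in enumerate(pattern)}

    # Good suffix table: good[k] is the shift to use when the last k
    # characters of the pattern matched before a mismatch.
    good = [1] * (m + 1)  # good[0] = 1: no good suffix, minimal shift
    for k in range(1, m + 1):
        suffix = pattern[m - k:]
        shift = m  # default: the suffix never reoccurs
        # Prefer the nearest earlier occurrence of the whole suffix.
        for s in range(m - k - 1, -1, -1):
            if pattern[s:s + k] == suffix:
                shift = (m - k) - s
                break
        else:
            # Otherwise use the longest shorter suffix that is also
            # a prefix of the pattern.
            for b in range(k - 1, 0, -1):
                if pattern[:b] == pattern[m - b:]:
                    shift = m - b
                    break
        good[k] = shift
    return last, good


last, good = build_tables("EXAMPLE")
print(last["P"], last.get("S", -1))  # 4 -1
print(good)  # [1, 6, 6, 6, 6, 6, 6, 6]
```

Both tables are built from the search term alone, so the same tables can be reused for any string that is searched afterwards.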

13.

Comparing from the end again, "P" does not match "E", so "P" is the "bad character". By the "bad character rule", the shift is 6 - 4 = 2 positions.

14.

Compare character by character from the end. Every character matches, so the search is finished. If you want to keep searching (that is, find all matches), apply the "good suffix rule" and shift 6 - 0 = 6 positions, so that the "E" at the start of the search term moves to where the "E" at its end was.
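The whole walkthrough can be reproduced with a short, self-contained Python sketch. It follows the simplified rules of this article and shifts by the larger of the two values at every step. The function name `boyer_moore_search` and the helper `good_shift` are illustrative, and the good suffix table is built by brute force to keep the code readable.

```python
def boyer_moore_search(text: str, pattern: str) -> list:
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Bad character table: last occurrence of each pattern character.
    last = {ch: i for i, ch in enumerate(pattern)}

    def good_shift(k: int) -> int:
        """Good suffix shift when the last k characters matched."""
        if k == 0:
            return 1
        suffix = pattern[m - k:]
        for s in range(m - k - 1, -1, -1):  # earlier full occurrence
            if pattern[s:s + k] == suffix:
                return (m - k) - s
        for b in range(k - 1, 0, -1):       # shorter suffix == prefix
            if pattern[:b] == pattern[m - b:]:
                return m - b
        return m

    good = [good_shift(k) for k in range(m + 1)]

    matches = []
    start = 0  # index in text aligned with pattern[0]
    while start <= n - m:
        j = m - 1
        # Compare from the end of the pattern toward its start.
        while j >= 0 and pattern[j] == text[start + j]:
            j -= 1
        if j < 0:
            matches.append(start)
            start += good[m]  # keep going to find further matches
        else:
            matched = m - 1 - j
            bad = j - last.get(text[start + j], -1)
            # good[...] is always at least 1, so the search advances
            # even when the bad character shift is zero or negative.
            start += max(bad, good[matched])
    return matches


print(boyer_moore_search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))  # [17]
```

On the example string, the alignments visited are 0, 7, 9, 15, and 17, which corresponds to the shifts of 7, 2, 6, and 2 positions described in the steps above.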

Link: http://blog.jobbole.com/39132/

 

The algorithm is simple and easy to implement ~
