Boyer-Moore algorithm for string matching

Source: Internet
Author: User

In a previous article, I introduced the KMP algorithm for string matching.

However, KMP is not the most efficient algorithm, and it is not widely adopted in practice. The "Find" function (Ctrl+F) of text editors mostly uses the Boyer-Moore algorithm instead.

Below, I explain this algorithm using Professor Moore's own example.

1.

Assume the string is "HERE IS A SIMPLE EXAMPLE" and the search term is "EXAMPLE".

2.

First, the head of the search term is aligned with the head of the string, and the comparison starts from the tail of the search term.

This is a very clever idea: if the trailing characters do not match, a single comparison is enough to know that the first 7 characters of the string are definitely not the result we are looking for.

We see that "S" and "E" do not match. At this point, "S" is called the "bad character", that is, the mismatched character. We also notice that "S" does not occur in the search term "EXAMPLE", which means the search term can be shifted directly to the position just after the "S".
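As a rough sketch of this step in Python, the comparison from the tail could look like the following. The function name `first_mismatch_from_tail` is my own and does not come from the original example; the string is assumed to be the all-caps "HERE IS A SIMPLE EXAMPLE".

```python
def first_mismatch_from_tail(text, pattern, start):
    """Compare pattern with text[start:start + len(pattern)], right to left.

    Returns the index inside pattern of the first mismatch found,
    or -1 if every character matches.
    """
    for j in range(len(pattern) - 1, -1, -1):
        if text[start + j] != pattern[j]:
            return j
    return -1


text = "HERE IS A SIMPLE EXAMPLE"
pattern = "EXAMPLE"

j = first_mismatch_from_tail(text, pattern, 0)
bad_char = text[0 + j]
# The very first comparison fails, and "S" does not occur in the pattern.
print(j, bad_char, bad_char in pattern)  # 6 S False
```

A single comparison at index 6 is enough to rule out the whole first alignment.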

3.

Comparing from the tail again, we find that "P" does not match "E", so "P" is the "bad character". However, "P" does occur in the search term "EXAMPLE". So the search term is shifted forward two positions, so that the two "P"s are aligned.

4.

From this we obtain the "bad character rule":

Shift = position of the bad character - last occurrence in the search term

Here the position of the bad character is the index in the search term at which the mismatch happened. If the "bad character" does not occur in the search term, its last occurrence is -1.

Take "P" as an example: as the "bad character" it is aligned with position 6 of the search term (numbering starts from 0), and its last occurrence in the search term is at position 4, so the shift is 6 - 4 = 2 positions. Take the "S" from step 2 as another example: it is aligned with position 6, and its last occurrence is -1 (that is, it does not occur), so the entire search term is shifted 6 - (-1) = 7 positions.
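The rule can be written as a small Python function. This is only a sketch; the name `bad_character_shift` and its parameters are my own choices for illustration.

```python
def bad_character_shift(pattern, mismatch_index, bad_char):
    """Shift suggested by the bad character rule as stated above.

    mismatch_index is the position in the pattern where the mismatch
    happened; bad_char is the character of the string at that position.
    """
    last_occurrence = pattern.rfind(bad_char)  # -1 if bad_char is absent
    return mismatch_index - last_occurrence


print(bad_character_shift("EXAMPLE", 6, "P"))  # 2
print(bad_character_shift("EXAMPLE", 6, "S"))  # 7
```

Note that, written this way, the result can be zero or negative when the last occurrence lies to the right of the mismatch. A practical search has to fall back on another rule or on a minimum shift of 1 in that case.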

5.

Comparing from the tail again, "E" matches "E".

6.

Compare the previous character: "LE" matches "LE".

7.

Compare the previous character: "PLE" matches "PLE".

8.

Compare the previous character: "MPLE" matches "MPLE". We call this the "good suffix", that is, the run of characters matched at the tail. Note that "MPLE", "PLE", "LE" and "E" are all good suffixes.

9.

Compare the previous character and find that "I" and "A" do not match. So "I" is the "bad character".

10.

According to the "bad character rule", the search term should be shifted 2 - (-1) = 3 positions. The question is, is there a better shift at this point?

11.

We know that there is a "good suffix" at this point. Therefore, the "good suffix rule" can be used:

Shift = position of the good suffix - last occurrence in the search term

The position of the "good suffix" is taken to be the position of its last character. If the "good suffix" does not occur anywhere else in the search term, its last occurrence is -1. When there are several good suffixes, the shorter ones only count if they occur at the head of the search term.

Of all the "good suffixes" (MPLE, PLE, LE, E), only "E" occurs a second time in "EXAMPLE" (at the head, position 0), so the shift is 6 - 0 = 6 positions.
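Here is a minimal Python sketch of the good suffix rule as described in this step. The function name `good_suffix_shift` is hypothetical, and this is the simple form of the rule used in this walkthrough, not a tuned implementation.

```python
def good_suffix_shift(pattern, suffix_len):
    """Shift suggested by the good suffix rule for a matched tail of
    suffix_len characters (suffix_len >= 1)."""
    m = len(pattern)
    suffix = pattern[m - suffix_len:]

    # 1) Look for the rightmost other occurrence of the longest good suffix.
    #    The end bound m - 1 excludes the suffix itself.
    start = pattern.rfind(suffix, 0, m - 1)
    if start != -1:
        last_pos = start + suffix_len - 1  # position of its last character
        return (m - 1) - last_pos

    # 2) Otherwise a shorter good suffix counts only if it sits at the head.
    for k in range(suffix_len - 1, 0, -1):
        if pattern.startswith(pattern[m - k:]):
            return (m - 1) - (k - 1)

    # 3) Nothing recurs: the last occurrence is -1.
    return (m - 1) - (-1)


print(good_suffix_shift("EXAMPLE", 4))  # good suffix "MPLE" -> 6
```

For "MPLE" the first lookup fails, "PLE" and "LE" are not at the head, and "E" is, which gives 6 - 0 = 6.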

12.

As you can see, the "bad character rule" only shifts 3 positions, while the "good suffix rule" shifts 6 positions. Therefore, the basic idea of the Boyer-Moore algorithm is to shift by the larger of the two values each time.

Better still, the shift tables for these two rules depend only on the search term, not on the string being searched. Therefore, the "bad character rule table" and the "good suffix rule table" can be computed in advance. During the search, each shift is then just a table lookup.
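To illustrate the precomputation, here is one possible way to build both tables in Python. The name `build_tables` and the table layout are assumptions of mine. For the good suffix table I use an equivalent brute-force view: the entry for k matched characters is the smallest shift that keeps those k characters consistent with the search term, which should agree with the rule in step 11.

```python
def build_tables(pattern):
    """Precompute both shift tables from the pattern alone.

    last[ch] is the last position of ch in the pattern (absent means -1).
    good[k]  is the good suffix shift after k tail characters matched;
             good[0] comes out as 1, the minimum possible shift.
    """
    m = len(pattern)
    # Later positions overwrite earlier ones, leaving the last occurrence.
    last = {ch: i for i, ch in enumerate(pattern)}

    good = []
    for k in range(m + 1):
        for s in range(1, m + 1):
            # After shifting by s, every matched position i must either
            # fall off the head of the pattern or see the same character.
            if all(i - s < 0 or pattern[i - s] == pattern[i]
                   for i in range(m - k, m)):
                good.append(s)
                break
    return last, good


last, good = build_tables("EXAMPLE")
print(last.get("P", -1), last.get("S", -1))  # 4 -1
print(good)  # [1, 6, 6, 6, 6, 6, 6, 6]
```

The brute-force construction is quadratic or worse in the length of the search term, which is fine for a demonstration but not how a production implementation would do it.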

13.


Comparing from the tail again, "P" and "E" do not match, so "P" is the "bad character". According to the "bad character rule", the shift is 6 - 4 = 2 positions.

14.

Comparing from the tail, every character matches, so the search is over. If you want to continue searching (that is, to find all matches), apply the "good suffix rule": the shift is 6 - 0 = 6 positions, that is, the "E" at the head moves to where the trailing "E" was.
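Putting the steps together, the whole walkthrough can be reproduced with a short Python sketch. The names `boyer_moore_trace` and `good_suffix_shift` are mine, and for brevity the shifts are computed on the fly rather than looked up in precomputed tables, so this shows the idea rather than an efficient implementation. The string is assumed to be the all-caps "HERE IS A SIMPLE EXAMPLE".

```python
def good_suffix_shift(pattern, k):
    """Smallest shift that keeps the last k (already matched) characters
    consistent with the pattern; gives 1 when k == 0."""
    m = len(pattern)
    for s in range(1, m + 1):
        if all(i - s < 0 or pattern[i - s] == pattern[i]
               for i in range(m - k, m)):
            return s
    return m  # only reached for an empty pattern


def boyer_moore_trace(text, pattern):
    """Return (match positions, list of alignments that were tried)."""
    n, m = len(text), len(pattern)
    matches, alignments = [], []
    start = 0
    while m > 0 and start <= n - m:
        alignments.append(start)
        # Compare from the tail of the pattern.
        j = m - 1
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1
        matched = m - 1 - j  # length of the good suffix
        if j < 0:
            matches.append(start)
            bad = 0  # no bad character on a full match
        else:
            bad = j - pattern.rfind(text[start + j])
        # Shift by the larger of the two rules.
        start += max(bad, good_suffix_shift(pattern, matched))
    return matches, alignments


print(boyer_moore_trace("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))
# ([17], [0, 7, 9, 15, 17])
```

The alignments 0, 7, 9, 15 and 17 correspond to the shifts of 7, 2, 6 and 2 positions seen in the steps above.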
