Data structures: on the Boyer-Moore algorithm

Source: Internet
Author: User

The KMP algorithm was described earlier, but it is not the algorithm most commonly used for string matching in practice: the search functions of most text editors use the Boyer-Moore algorithm. It was invented in 1977 by Professor Robert S. Boyer and Professor J Strother Moore of the University of Texas.

Algorithm explanation

Start: Assume the string is "HERE IS A SIMPLE EXAMPLE" and the search term is "EXAMPLE".

1. First, align the head of the search term with the head of the string and start comparing from the tail. If the tail characters do not match, a single comparison is enough to know that the first 7 characters of the string (taken as a whole) cannot be the match we are looking for.

"S" and "E" do not match. The mismatched character in the string, "S", is called the "bad character". We also see that "S" does not occur in the search term "EXAMPLE", which means the search term can be shifted directly to the position just after "S".

2. Again compare from the tail: "P" and "E" do not match, so "P" is the bad character. However, "P" does occur in the search term "EXAMPLE", so shift the search term 2 positions to the right so that the two "P"s line up.

3. This gives the "bad character rule":

Shift = position in the search term where the mismatch occurred - last occurrence position of the bad character in the search term

If the bad character does not occur in the search term, its last occurrence position is -1.

Take "P" as an example. It appears as the bad character opposite position 6 of the search term (counting from 0), where it fails to match "E", and its last occurrence in the search term is at position 4, so the shift is 6 - 4 = 2 positions. Take the "S" from step 1 as another example: the mismatch is at position 6 and the last occurrence position is -1 (that is, "S" does not occur), so the whole search term shifts 6 - (-1) = 7 positions.
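The bad character rule can be sketched in a few lines of Python. This is only an illustration of the formula above; the function names build_last_occurrence and bad_character_shift are assumptions introduced here, not names from the original article.

```python
def build_last_occurrence(pattern):
    """Map each character to the index of its last occurrence in pattern."""
    # Later indices overwrite earlier ones, so the rightmost index wins.
    return {ch: i for i, ch in enumerate(pattern)}


def bad_character_shift(pattern, mismatch_index, bad_char):
    """Shift = mismatch position - last occurrence of the bad character."""
    # A character that does not occur in the pattern counts as position -1.
    last = build_last_occurrence(pattern).get(bad_char, -1)
    return mismatch_index - last


print(bad_character_shift("EXAMPLE", 6, "P"))  # 2
print(bad_character_shift("EXAMPLE", 6, "S"))  # 7
print(bad_character_shift("EXAMPLE", 2, "I"))  # 3
```

Note that the result can be zero or negative when the last occurrence lies to the right of the mismatch position, so a complete search has to guard against that case.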

4. Compare from the tail again: "E" and "E" match.

5. Compare the characters before it one at a time: "LE" matches "LE", and then "PLE" matches "PLE".

6. Compare one character further: "MPLE" and "MPLE" match. We call this a "good suffix", that is, any suffix of the search term that has been matched at the tail. Note that "MPLE", "PLE", "LE", and "E" are all good suffixes.

7. Compare one character further: "I" and "A" do not match, so "I" is the bad character. According to the bad character rule, the search term should shift 2 - (-1) = 3 positions.

We know that good suffixes exist at this point, so the "good suffix rule" can be applied:

Shift = position of the good suffix - last occurrence position of the good suffix in the search term

For example, if the trailing "AB" of the string "ABCDAB" is the good suffix, its position is 5 (counting from 0 and taking the index of the final "B"), and its last occurrence position in the search term is 1 (the index of the first "B"), so the shift is 5 - 1 = 4 positions: the earlier "AB" moves to where the later "AB" was.

As another example, if "EF" in the string "ABCDEF" is the good suffix, the position of "EF" is 5 and its last occurrence position is -1 (that is, it does not occur again), so the shift is 5 - (-1) = 6 positions: the whole string moves to the position just after "F".

There are three points to note about this rule:

(1) The position of a good suffix is the position of its last character. If "EF" in "ABCDEF" is the good suffix, its position is that of "F", which is 5 (counting from 0).

(2) If a good suffix occurs only once in the search term, its last occurrence position is -1. For example, "EF" occurs only once in "ABCDEF", so its last occurrence position is -1 (that is, it does not occur again).

(3) If there are several good suffixes, then every good suffix except the longest must have its last occurrence at the head of the search term. For example, suppose the good suffixes of "BABCDAB" are "DAB", "AB", and "B". What is the last occurrence position? The answer is that the good suffix used is "B", whose last occurrence is at the head, position 0. The rule can also be put this way: if the longest good suffix occurs only once, the search term can be rewritten for the position calculation as "(DA)BABCDAB", that is, with a virtual "DA" prepended.

Return to the example above. Of all the good suffixes ("MPLE", "PLE", "LE", "E"), only "E" also occurs at the head of "EXAMPLE", so the shift is 6 - 0 = 6 positions.
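The good suffix rule, including the three notes, can be sketched in Python as follows. This illustrates only the simplified rule as stated in this article; the function name good_suffix_shift and its suffix_len parameter (the number of tail characters already matched) are assumptions introduced here.

```python
def good_suffix_shift(pattern, suffix_len):
    """Shift given by the good suffix rule as stated in this article.

    suffix_len is the length of the longest good suffix, i.e. the number
    of characters matched at the tail before the mismatch.
    """
    m = len(pattern)
    if not 1 <= suffix_len <= m:
        raise ValueError("suffix_len must be between 1 and len(pattern)")

    suffix = pattern[m - suffix_len:]
    # The longest good suffix may reoccur anywhere earlier in the pattern;
    # limiting the search to pattern[0:m-1] excludes the suffix itself.
    start = pattern.rfind(suffix, 0, m - 1)
    if start != -1:
        last = start + suffix_len - 1  # position of its last character
    else:
        # Shorter good suffixes only count if they occur at the head.
        last = -1
        for k in range(suffix_len - 1, 0, -1):
            if pattern.startswith(pattern[m - k:]):
                last = k - 1
                break
    return (m - 1) - last


print(good_suffix_shift("ABCDAB", 2))   # 4
print(good_suffix_shift("ABCDEF", 2))   # 6
print(good_suffix_shift("BABCDAB", 3))  # 6
print(good_suffix_shift("EXAMPLE", 4))  # 6
```

The four calls reproduce the four worked examples in the text: "AB" in "ABCDAB", "EF" in "ABCDEF", "DAB" in "BABCDAB", and "MPLE" in "EXAMPLE".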

8. The bad character rule allows a shift of only 3 positions, while the good suffix rule allows 6. The basic idea of the Boyer-Moore algorithm is therefore to shift each time by the larger of the two values.

Better still, the shifts given by these two rules depend only on the search term, not on the string being searched. A bad character table and a good suffix table can therefore be computed in advance, and during the search each shift is just a table lookup.

Continue comparing from the tail: "P" and "E" do not match, so "P" is the bad character. By the bad character rule, shift 6 - 4 = 2 positions.

Compare from the tail one character at a time: every character matches, so the search is over. To keep searching (that is, to find all matches), apply the good suffix rule and shift 6 - 0 = 6 positions, so that the "E" at the head moves to where the trailing "E" was.
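Putting the whole walk-through together, a minimal Python sketch of the search might look like this. It precomputes both tables from the search term and then shifts by the larger of the two rules. The function name boyer_moore_search and the table names are assumptions introduced here, and the good suffix table follows the simplified rule described in this article rather than any particular library's implementation.

```python
def boyer_moore_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Bad character table: last index of each character in the pattern.
    last_seen = {ch: i for i, ch in enumerate(pattern)}

    # Good suffix table: good_shift[k] is the shift when k tail characters
    # have matched (k == m means the whole pattern matched).
    good_shift = [0] * (m + 1)
    for length in range(1, m + 1):
        suffix = pattern[m - length:]
        start = pattern.rfind(suffix, 0, m - 1)
        last = start + length - 1 if start != -1 else -1
        if start == -1:
            # Fall back to the longest shorter suffix found at the head.
            for k in range(length - 1, 0, -1):
                if pattern.startswith(pattern[m - k:]):
                    last = k - 1
                    break
        good_shift[length] = (m - 1) - last

    matches = []
    s = 0  # current alignment of the pattern's head within the text
    while s <= n - m:
        j = m - 1
        while j >= 0 and text[s + j] == pattern[j]:
            j -= 1  # compare from the tail toward the head
        if j < 0:
            matches.append(s)
            s += good_shift[m]
        else:
            matched = m - 1 - j
            bad = j - last_seen.get(text[s + j], -1)
            good = good_shift[matched] if matched else 0
            # Take the larger shift; never move by less than 1.
            s += max(bad, good, 1)
    return matches


print(boyer_moore_search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))  # [17]
```

On the example string, the alignments visited are 0, 7, 9, 15, and 17, which correspond to the shifts of 7, 2, 6, and 2 positions worked out above.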

Document excerpted from: Nanyi blog
