The Boyer-Moore algorithm for string matching

Source: Internet
Author: User

The "Find" function (Ctrl+F) of many text editors is commonly implemented with the Boyer-Moore algorithm.

The Boyer-Moore algorithm is not only efficient, but also ingenious and easy to understand.
It was invented in 1977 by Professor Robert S. Boyer and Professor J Strother Moore of the University of Texas.
Below, I explain the algorithm using Professor Moore's own example.

1. Assume the string is "HERE IS A SIMPLE EXAMPLE" and the search term is "EXAMPLE".

2.

First, the head of the search term is aligned with the head of the string, and the comparison starts from the tail of the search term.

This is a very clever idea: if the tail characters do not match, a single comparison is enough to know that the first 7 characters of the string (taken as a whole) cannot be the result.

We see that "S" and "E" do not match. The mismatched character in the string, here "S", is called the "bad character". We also see that "S" does not occur anywhere in the search term "EXAMPLE", which means the search term can be moved directly to the position just after "S".
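The tail-first comparison in this step can be sketched as follows. This is only an illustration; the class name `TailCompare` and the method `mismatchIndex` are hypothetical names, not part of the original article.

```java
final class TailCompare {
    /**
     * Compares the pattern with the text at the given offset, scanning from the
     * tail of the pattern towards its head. Returns the pattern index of the
     * first mismatch, or -1 if every character matches.
     */
    static int mismatchIndex(String text, String pattern, int offset) {
        for (int j = pattern.length() - 1; j >= 0; j--) {
            if (text.charAt(offset + j) != pattern.charAt(j)) {
                return j; // text.charAt(offset + j) is the bad character
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "HERE IS A SIMPLE EXAMPLE";
        // At offset 0 the very first comparison ("S" against "E") already fails.
        System.out.println(mismatchIndex(text, "EXAMPLE", 0));  // 6
        // At offset 17 the whole pattern matches.
        System.out.println(mismatchIndex(text, "EXAMPLE", 17)); // -1
    }
}
```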

3.

Still comparing from the tail, we find that "P" does not match "E", so "P" is the "bad character". This time, however, "P" does occur in the search term "EXAMPLE". So the search term is moved back 2 positions, so that the two "P"s line up.

4.

From this we obtain the "bad character rule":

shift = position of the bad character - last occurrence position in the search term

Here the "position of the bad character" is the position in the search term that faces the bad character. If the bad character does not occur in the search term, its last occurrence position is -1.
Take "P" as an example: it shows up as the bad character at position 6 of the search term (numbering from 0), and its last occurrence in the search term is at position 4, so the shift is 6 - 4 = 2 positions. Take the "S" from step 2 as another example: it shows up at position 6 and its last occurrence position is -1 (it does not occur), so the whole search term is shifted 6 - (-1) = 7 positions.
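A minimal sketch of the bad character rule, computed directly with `String.lastIndexOf`, reproduces the numbers above. The class and method names are hypothetical and chosen for illustration only.

```java
final class BadCharacterRule {
    /**
     * shift = position of the bad character - last occurrence in the pattern.
     * lastIndexOf returns -1 when the character does not occur, which is
     * exactly the convention the rule uses.
     * Note: the result can be zero or negative when the last occurrence lies
     * to the right of the mismatch position, so this rule alone is not enough.
     */
    static int shift(String pattern, int badCharPosition, char badChar) {
        int lastOccurrence = pattern.lastIndexOf(badChar);
        return badCharPosition - lastOccurrence;
    }

    public static void main(String[] args) {
        System.out.println(shift("EXAMPLE", 6, 'P')); // 6 - 4 = 2
        System.out.println(shift("EXAMPLE", 6, 'S')); // 6 - (-1) = 7
    }
}
```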

5.

Still comparing from the tail, "E" matches "E".

6.

Comparing one position further back, "LE" matches "LE".

7.

Comparing one position further back, "PLE" matches "PLE".

8.

Comparing one position further back, "MPLE" matches "MPLE".
We call this a "good suffix": any suffix of the search term that matches at the tail.
Note that "MPLE", "PLE", "LE" and "E" are all good suffixes.
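To make the notion concrete, the sketch below lists every good suffix for a given alignment. The names `GoodSuffixes` and `goodSuffixes` are hypothetical, and offset 9 is the alignment reached after the shifts of 7 and 2 positions in the earlier steps.

```java
import java.util.ArrayList;
import java.util.List;

final class GoodSuffixes {
    /** Returns all good suffixes at this alignment, longest first. */
    static List<String> goodSuffixes(String text, String pattern, int offset) {
        int m = pattern.length();
        int j = m - 1;
        // Walk from the tail towards the head while the characters agree.
        while (j >= 0 && text.charAt(offset + j) == pattern.charAt(j)) {
            j--;
        }
        List<String> result = new ArrayList<>();
        // Every suffix of the matched tail is itself a good suffix.
        for (int start = j + 1; start < m; start++) {
            result.add(pattern.substring(start));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [MPLE, PLE, LE, E]
        System.out.println(goodSuffixes("HERE IS A SIMPLE EXAMPLE", "EXAMPLE", 9));
    }
}
```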
9.

Comparing one position further back, we find that "I" and "A" do not match. So "I" is the "bad character".

10.

According to the "bad character rule", the search term should be moved back 2 - (-1) = 3 positions. The question is: is there a better shift at this point?

11.

We know that there is a "good suffix" at this point. Therefore, the "good suffix rule" can be used:

shift = position of the good suffix - last occurrence position in the search term

For example, suppose the final "AB" of the string "ABCDAB" is the good suffix. Its position is 5 (counting from 0 and taking the position of its last character, "B"), and its last occurrence in the search term is at position 1 (the position of the first "B"), so the shift is 5 - 1 = 4 positions, and the earlier "AB" moves to where the later "AB" was.

As another example, suppose "EF" in the string "ABCDEF" is the good suffix. The position of "EF" is 5 and its last occurrence position is -1 (it does not occur anywhere else), so the shift is 5 - (-1) = 6 positions; that is, the whole string moves to the position just after "F".

This rule has three points to note:

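(1) The position of a "good suffix" is the position of its last character. If "EF" in "ABCDEF" is the good suffix, its position is that of "F", namely 5 (counting from 0).

(2) If the "good suffix" occurs only once in the search term, its last occurrence position is -1. For example, "EF" occurs only once in "ABCDEF", so its last occurrence position is -1 (it does not occur again).

(3) If there are several "good suffixes", then for every one except the longest, the last occurrence must be at the head of the search term. For example, suppose the good suffixes of "BABCDAB" are "DAB", "AB" and "B". What is the last occurrence position in this case? The answer is that the good suffix used is "B", whose last occurrence is at the head, at position 0. The rule can also be put this way: if the longest good suffix occurs only once, the search term can be rewritten as "(DA)BABCDAB" for the position calculation, that is, with a virtual "DA" added at the front.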

Let us return to the example above. Of all the "good suffixes" ("MPLE", "PLE", "LE", "E"), only "E" also appears at the head of "EXAMPLE", so the shift is 6 - 0 = 6 positions.
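The good suffix rule, including the three notes, can be sketched directly from its definition. This is a straightforward, unoptimized illustration rather than the table construction used in practice; `GoodSuffixRule`, `shift` and the parameter `suffixLength` (the length of the longest good suffix) are hypothetical names.

```java
final class GoodSuffixRule {
    /**
     * shift = position of the good suffix - last occurrence in the pattern.
     * suffixLength is the length of the longest good suffix (at least 1).
     */
    static int shift(String pattern, int suffixLength) {
        int m = pattern.length();
        int position = m - 1; // note (1): position of the last character
        String longest = pattern.substring(m - suffixLength);

        int lastOccurrence = -1; // note (2): -1 when there is no other occurrence
        // Rightmost earlier occurrence of the longest good suffix, if any.
        int start = pattern.lastIndexOf(longest, m - suffixLength - 1);
        if (start >= 0) {
            lastOccurrence = start + suffixLength - 1;
        } else {
            // Note (3): a shorter good suffix only counts if it sits at the head.
            for (int len = suffixLength - 1; len >= 1; len--) {
                if (pattern.startsWith(pattern.substring(m - len))) {
                    lastOccurrence = len - 1;
                    break;
                }
            }
        }
        return position - lastOccurrence;
    }

    public static void main(String[] args) {
        System.out.println(shift("ABCDAB", 2));  // 5 - 1 = 4
        System.out.println(shift("ABCDEF", 2));  // 5 - (-1) = 6
        System.out.println(shift("BABCDAB", 3)); // 6 - 0 = 6
        System.out.println(shift("EXAMPLE", 4)); // 6 - 0 = 6
    }
}
```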

12.

As you can see, the "bad character rule" moves the search term only 3 positions, while the "good suffix rule" moves it 6 positions. The basic idea of the Boyer-Moore algorithm is therefore to take the larger of the two shifts each time.
Better still, the tables behind both rules depend only on the search term, not on the string being searched. So a bad character table and a good suffix table can be computed in advance, and during the search each shift is just a table lookup.
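As a rough illustration of the precomputation, the sketch below builds both tables from the search term alone and then looks up the shift for the mismatch of step 9. It uses a brute-force "slide and check" construction for the good suffix table, which is an assumption made for brevity and not the linear-time construction; all names (`ShiftTables`, `badCharTable`, `goodSuffixTable`, `fits`, `shift`) are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

final class ShiftTables {
    /** Bad character table: rightmost index of each character of the pattern. */
    static Map<Character, Integer> badCharTable(String pattern) {
        Map<Character, Integer> table = new TreeMap<>();
        for (int i = 0; i < pattern.length(); i++) {
            table.put(pattern.charAt(i), i); // later occurrences overwrite earlier ones
        }
        return table;
    }

    /** Good suffix table: gs[j] is the shift when the mismatch is at pattern index j. */
    static int[] goodSuffixTable(String pattern) {
        int m = pattern.length();
        int[] gs = new int[m];
        for (int j = 0; j < m; j++) {
            int shift = 1;
            while (!fits(pattern, j, shift)) {
                shift++; // a shift of m always fits, so this loop terminates
            }
            gs[j] = shift;
        }
        return gs;
    }

    /** True if, after the shift, each good-suffix character faces an equal character or nothing. */
    static boolean fits(String pattern, int j, int shift) {
        for (int i = j + 1; i < pattern.length(); i++) {
            if (i - shift >= 0 && pattern.charAt(i - shift) != pattern.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** Larger of the two shifts for a mismatch at pattern index j. */
    static int shift(Map<Character, Integer> bc, int[] gs, int j, char badChar) {
        return Math.max(gs[j], j - bc.getOrDefault(badChar, -1));
    }

    public static void main(String[] args) {
        Map<Character, Integer> bc = badCharTable("EXAMPLE");
        int[] gs = goodSuffixTable("EXAMPLE");
        System.out.println(bc);                    // {A=2, E=6, L=5, M=3, P=4, X=1}
        System.out.println(Arrays.toString(gs));   // [6, 6, 6, 6, 6, 6, 1]
        System.out.println(shift(bc, gs, 2, 'I')); // max(6, 3) = 6
    }
}
```

The last entry of the good suffix table is 1 because a mismatch on the final character leaves no good suffix, so that rule can only guarantee a shift of one position.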

13.

Comparing from the tail again, "P" and "E" do not match, so "P" is the "bad character". According to the "bad character rule", the shift is 6 - 4 = 2 positions.

14.

Comparing from the tail one character at a time, we find that everything matches, so the search is over. If you want to keep searching (that is, to find all matches), apply the "good suffix rule" and shift 6 - 0 = 6 positions, so that the "E" at the head moves to where the "E" at the tail was.
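The shift after a full match can be sketched on its own: the whole search term is the longest good suffix, so by note (3) only a shorter suffix that also sits at the head can be reused. The names `ContinueAfterMatch` and `shiftAfterMatch` are hypothetical.

```java
final class ContinueAfterMatch {
    /**
     * Shift to apply after a full match: pattern length minus the length of
     * the longest proper suffix that is also a prefix of the pattern.
     */
    static int shiftAfterMatch(String pattern) {
        int m = pattern.length();
        for (int len = m - 1; len >= 1; len--) {
            if (pattern.startsWith(pattern.substring(m - len))) {
                return m - len; // same as (m - 1) - (len - 1)
            }
        }
        return m; // no reusable suffix: move past the whole match
    }

    public static void main(String[] args) {
        System.out.println(shiftAfterMatch("EXAMPLE")); // 6
        System.out.println(shiftAfterMatch("ABAB"));    // 2
        System.out.println(shiftAfterMatch("ABCDEF"));  // 6
    }
}
```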
over!
Here's the code:

package mathstudy;

public class BMTest {
    // Character set size. The original value was lost in extraction; 256 is assumed here,
    // so both strings may only contain characters with codes below 256.
    static final int CARD_CHAR_SET = 256;

    /**
     * @param mainStr main string (the text to search)
     * @param subStr  pattern string (must not be empty)
     * @return index of the first match; a value greater than
     *         mainStr.length() - subStr.length() means there is no match
     */
    public static int getMatchIndex(String mainStr, String subStr) {
        int[] bc = buildBC(subStr); // bad character table
        int[] gs = buildGS(subStr); // good suffix table
        int i = 0; // start of the pattern relative to the main string (initially left-aligned)
        while (mainStr.length() - subStr.length() >= i) { // keep shifting right until the right end
            int j = subStr.length() - 1; // start with the last character of the pattern
            while (subStr.charAt(j) == mainStr.charAt(i + j)) // compare from right to left
                if (0 > --j)
                    break;
            if (0 > j) // the matched suffix is the entire pattern: full match
                break;
            else // shift by the larger of the GS and BC amounts
                i += MAX(gs[j], j - bc[mainStr.charAt(i + j)]);
        }
        return i;
    }

    /* Build the bad character shift table bc[] */
    protected static int[] buildBC(String subStr) {
        int[] bc = new int[CARD_CHAR_SET];
        int j;
        for (j = 0; j < CARD_CHAR_SET; j++)
            bc[j] = -1; // first assume that the character does not occur in the pattern
        for (j = 0; j < subStr.length(); j++) // left to right: record the last position of each character
            bc[subStr.charAt(j)] = j;
        return bc;
    }

    /* Build the good suffix shift table gs[] */
    protected static int[] buildGS(String subStr) {
        int m = subStr.length();
        int[] ss = computeSuffixSize(subStr); // longest matching suffix length for each position
        int[] gs = new int[m];
        int j;
        for (j = 0; j < m; j++)
            gs[j] = m; // default: shift by the whole pattern length
        int i = 0;
        for (j = m - 1; j >= -1; j--)
            if (-1 == j || j + 1 == ss[j]) // with ss[-1] = 0 this is simply: j + 1 == ss[j]
                for (; i < m - j - 1; i++)
                    if (gs[i] == m)
                        gs[i] = m - j - 1;
        for (j = 0; j < m - 1; j++)
            gs[m - ss[j] - 1] = m - j - 1;
        return gs;
    }

    /* For each j, compute the length of the longest substring ending at j that matches a suffix of P */
    protected static int[] computeSuffixSize(String subStr) {
        int m = subStr.length();
        int[] ss = new int[m]; // suffix size table
        int s, t; // substring P[s+1, ..., t] matches the suffix P[m+s-t, ..., m-1]
        int j;    // position of the current character
        ss[m - 1] = m; // for the last character, the matching suffix is the whole of P
        s = m - 1;
        t = m - 2;
        // from the second-to-last character, scan P from right to left to fill in the rest of ss[]
        for (j = m - 2; j >= 0; j--) {
            if ((j > s) && (j - s > ss[(m - 1 - t) + j]))
                ss[j] = ss[(m - 1 - t) + j];
            else {
                t = j;         // the matching substring ends at the current character
                s = MIN(s, j); // candidate start of the matching substring
                while ((0 <= s) && (subStr.charAt(s) == subStr.charAt((m - 1 - t) + s)))
                    s--;
                ss[j] = t - s; // length of the longest substring matching a suffix
            }
        }
        return ss;
    }

    protected static int MAX(int a, int b) {
        return (a > b) ? a : b;
    }

    protected static int MIN(int a, int b) {
        return (a < b) ? a : b;
    }

    // Test
    public static void main(String[] args) {
        String mainStr = "HERE IS A SIMPLE EXAMPLE";
        String subStr = "EXAMPLE";
        System.out.println("The location of the string match is: " + getMatchIndex(mainStr, subStr));
    }
}

Reference: Ruan Yi Feng
