Boyer-Moore string search algorithm: a Java implementation

Source: Internet
Author: User
Tags: data structures

I changed careers after graduating, so I never studied data structures and algorithms systematically, not even in an introductory course. Frankly, I don't have that foundation, haha. This post is therefore written mainly as notes for myself. Since my time is limited, it is probably far less detailed than you might expect, so feel free to skip it.

Algorithm introduction: explanations of the Boyer-Moore algorithm (hereafter the BM algorithm) are easy to find online, so I won't go over the concept in detail here. For reference, I recommend the following article by Nanyi, which is careful and easy to follow:

Nanyi: The Boyer-Moore algorithm for string matching

The essence of the algorithm: this string search algorithm is efficient because, when a mismatch occurs, it can skip several characters at once; it does not need to compare every character of the text being searched. So how does it decide how far to skip? It uses what is already known about the pattern string (pattern) and the text during matching to rule out alignments that cannot match. The bad-character rule and the good-suffix rule, both explained in the article recommended above, determine how far to shift the pattern. I won't elaborate on them here.

Full algorithm: this post is mainly a memory aid rather than a step-by-step tutorial on implementing the algorithm, so I'll give the complete code (Java implementation) first and then analyze it.

import java.util.HashMap;
import java.util.Map;

public class BoyerMoore {
    public static void main(String[] args) {
        String text = "Here's a simple example";
        String pattern = "example";
        BoyerMoore bm = new BoyerMoore();
        bm.boyerMoore(pattern, text);
    }

    public void boyerMoore(String pattern, String text) {
        int m = pattern.length();
        int n = text.length();
        Map<String, Integer> bmBc = new HashMap<String, Integer>();
        int[] bmGs = new int[m];
        // Preprocessing
        preBmBc(pattern, m, bmBc);
        preBmGs(pattern, m, bmGs);
        // Searching
        int j = 0;
        int i = 0;
        int count = 0;
        while (j <= n - m) {
            for (i = m - 1; i >= 0 && pattern.charAt(i) == text.charAt(i + j); i--) {
                count++; // counts the character comparisons that matched
            }
            if (i < 0) {
                System.out.println("one position is:" + j);
                j += bmGs[0];
            } else {
                j += Math.max(bmGs[i],
                        getBmBc(String.valueOf(text.charAt(i + j)), bmBc, m) - m + 1 + i);
            }
        }
        System.out.println("count:" + count);
    }

    private void preBmBc(String pattern, int patLength, Map<String, Integer> bmBc) {
        System.out.println("bmBc start process...");
        for (int i = patLength - 2; i >= 0; i--) {
            if (!bmBc.containsKey(String.valueOf(pattern.charAt(i)))) {
                bmBc.put(String.valueOf(pattern.charAt(i)), patLength - i - 1);
            }
        }
    }

    private void preBmGs(String pattern, int patLength, int[] bmGs) {
        int i, j;
        int[] suffix = new int[patLength];
        suffix(pattern, patLength, suffix);
        // No substring of the pattern matches the good suffix, and no maximal prefix is found
        for (i = 0; i < patLength; i++) {
            bmGs[i] = patLength;
        }
        // No substring of the pattern matches the good suffix, but a maximal prefix is found
        j = 0;
        for (i = patLength - 1; i >= 0; i--) {
            if (suffix[i] == i + 1) {
                for (; j < patLength - 1 - i; j++) {
                    if (bmGs[j] == patLength) {
                        bmGs[j] = patLength - 1 - i;
                    }
                }
            }
        }
        // A substring of the pattern matches the good suffix
        for (i = 0; i < patLength - 1; i++) {
            bmGs[patLength - 1 - suffix[i]] = patLength - 1 - i;
        }
        System.out.print("bmGs:");
        for (i = 0; i < patLength; i++) {
            System.out.print(bmGs[i] + ",");
        }
        System.out.println();
    }

    private void suffix(String pattern, int patLength, int[] suffix) {
        suffix[patLength - 1] = patLength;
        int q = 0;
        for (int i = patLength - 2; i >= 0; i--) {
            q = i;
            while (q >= 0 && pattern.charAt(q) == pattern.charAt(patLength - 1 - i + q)) {
                q--;
            }
            suffix[i] = i - q;
        }
    }

    private int getBmBc(String c, Map<String, Integer> bmBc, int m) {
        // Return the stored value if the character is in the table; otherwise return the pattern length
        if (bmBc.containsKey(c)) {
            return bmBc.get(c);
        } else {
            return m;
        }
    }
}

Algorithm theory and code analysis:
A1: Theory of the bad-character rule
When a bad character (the text character at which a mismatch occurs) is found, the BM algorithm shifts the pattern to the right so that the matching character in the pattern lines up with the bad character, and then continues matching. The bad-character rule has two scenarios.

1. The bad character occurs in the pattern: align that occurrence in the pattern with the bad character. (The occurrence used may lie to the right of the mismatch position. Aligning it would then move the pattern to the left, that is, backtrack. In that case the computed shift is negative, so it can never be the larger of the two candidate shifts and is never taken.)

2. The bad character does not occur in the pattern: simply shift the whole pattern to the right, past the bad character.
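
The two scenarios can be sketched as a small helper. This is only an illustration, not code from the post: the class name BadCharShiftDemo and the method badCharShift are made up here, and the rightmost occurrence is looked up with String.lastIndexOf instead of a precomputed table. The final pattern position is excluded from the lookup, which is an assumption chosen to agree with the table built later in the post.

```java
class BadCharShiftDemo {
    /**
     * Shift suggested by the bad-character rule when pattern[i] mismatches
     * the text character badChar. The result can be negative (backtracking).
     */
    static int badCharShift(String pattern, int i, char badChar) {
        int m = pattern.length();
        // Rightmost occurrence of badChar in pattern[0..m-2], or -1 if absent.
        int last = pattern.lastIndexOf(badChar, m - 2);
        return i - last;
    }

    public static void main(String[] args) {
        String pattern = "example";
        // Scenario 2: 's' is not in the pattern, mismatch at index 6.
        System.out.println(badCharShift(pattern, 6, 's')); // 7
        // Scenario 1: 'p' is at index 4, mismatch at index 6.
        System.out.println(badCharShift(pattern, 6, 'p')); // 2
        // Scenario 1 with backtracking: 'l' is at index 5, mismatch at index 3.
        System.out.println(badCharShift(pattern, 3, 'l')); // -2
    }
}
```

The negative result in the last call is the backtracking case described in scenario 1; the full algorithm never takes it because the good-suffix shift is always at least 1.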

A2: Computing the bad-character table
When a comparison in the BM algorithm fails, the right shift given by the bad-character rule is computed from the bmBc array, and the right shift given by the good-suffix rule is computed from the bmGs array. This section shows how to compute bmBc.

The bmBc[] array is indexed by character: for example, bmBc['v'] is the distance from the last occurrence of the character v in the pattern to the end of the pattern.

Computing the bad-character array bmBc[]:
This looks very easy: it seems that bmBc[i] = m - 1 - i would be enough. That is not correct, however, because the character at position i may occur in several places in the pattern (the original post shows this in a figure), and what we need is its rightmost occurrence. Working that out on every iteration of the search loop would be cumbersome and slow. The trick is to index the table by character rather than by position, so that a single pass over the pattern fills it in. This trades space for time, but for pure 8-bit characters the table needs only 256 entries, and a long pattern may itself be longer than 256 characters, so the cost is worthwhile (this is one of the reasons the BM algorithm becomes more efficient as the data grows).

As mentioned above, the calculation of bmBc[] has two cases, corresponding one to one with the two scenarios of the rule.
Case 1: the character occurs in the pattern. bmBc['v'] is the distance from the last occurrence of the character v in the pattern to the end of the pattern (illustrated by a figure in the original post).
Case 2: the character does not occur in the pattern. If, say, v is not in the pattern, then bmBc['v'] = strlen(pattern).

Case 1 is simple to write as pseudocode (the first loop below also fills in the Case 2 default):

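void preBmBc(char *pattern, int m, int bmBc[])
{
    int i;

    /* Case 2 default: characters absent from the pattern shift by m */
    for (i = 0; i < 256; i++)
    {
        bmBc[i] = m;
    }

    /* Case 1: later (more rightward) occurrences overwrite earlier ones */
    for (i = 0; i < m - 1; i++)
    {
        bmBc[(unsigned char) pattern[i]] = m - 1 - i;
    }
}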

Of course, the complete code I posted uses a Map as the storage structure for bmBc, so the Java version of Case 1 is as follows:

private void preBmBc(String pattern, int patLength, Map<String, Integer> bmBc)
    {
        System.out.println("bmBc start process...");
        // Scan right to left, skipping the last character, so the first value
        // stored for a character belongs to its rightmost occurrence.
        for (int i = patLength - 2; i >= 0; i--)
        {
            if (!bmBc.containsKey(String.valueOf(pattern.charAt(i))))
            {
                bmBc.put(String.valueOf(pattern.charAt(i)), patLength - i - 1);
            }
        }
    }


So how is Case 2 expressed? Very simply, as shown below. It also shows that using a Map as the container for bmBc saves space when the characters of the text cannot all be covered by 256 entries:

private int getBmBc(String c, Map<String, Integer> bmBc, int m)
    {
        // Return the stored value if the character is in the table; otherwise
        // return the pattern length. The parameter m always equals the pattern length.
        if (bmBc.containsKey(c))
        {
            return bmBc.get(c);
        }
        else
        {
            return m;
        }
    }
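
To see the two cases side by side, here is a small worked example for the pattern "example". It is a sketch rather than the post's code: the class BadCharTableDemo and its methods buildTable and lookup are names invented for illustration, and it keys the map by Character instead of String, which is a variation on the approach above.

```java
import java.util.HashMap;
import java.util.Map;

class BadCharTableDemo {
    /** Case 1: distance from the rightmost occurrence (last position excluded) to the pattern end. */
    static Map<Character, Integer> buildTable(String pattern) {
        Map<Character, Integer> table = new HashMap<>();
        int m = pattern.length();
        // Left to right over pattern[0..m-2]: later occurrences overwrite earlier ones.
        for (int i = 0; i < m - 1; i++) {
            table.put(pattern.charAt(i), m - 1 - i);
        }
        return table;
    }

    /** Case 2: characters that are not in the table fall back to the pattern length m. */
    static int lookup(Map<Character, Integer> table, char c, int m) {
        return table.getOrDefault(c, m);
    }

    public static void main(String[] args) {
        String pattern = "example";
        Map<Character, Integer> table = buildTable(pattern);
        int m = pattern.length();
        System.out.println(lookup(table, 'e', m)); // 6 (the 'e' at index 0; the final 'e' is skipped)
        System.out.println(lookup(table, 'l', m)); // 1
        System.out.println(lookup(table, 'z', m)); // 7 (not in the pattern)
    }
}
```

Scanning left to right and overwriting gives the same table as scanning right to left with a containsKey check; either way the rightmost occurrence wins.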

B1: Theory of the good-suffix rule
If a good suffix (the part of the pattern that has already matched) has been found and the same suffix, or part of it, occurs elsewhere in the pattern, shift that other occurrence to the position of the current suffix. In other words, suppose the last u characters of the pattern have matched the text but the next character does not, so the pattern must be shifted. If those u characters, or part of them, occur elsewhere in the pattern, shift the pattern right so that the earlier occurrence lines up with the last u characters; if they occur nowhere else in the pattern, shift the entire pattern to the right. The good-suffix rule therefore has three cases:

1. A substring of the pattern matches the good suffix exactly: move the rightmost such substring to the position of the good suffix and continue matching.

2. No substring matches the whole good suffix: find the longest suffix of the good suffix that is also a prefix of the pattern, that is, the largest s such that P[m-s...m-1] = P[0...s-1], and align that prefix with it.

3. Neither of the above exists: shift the entire pattern to the right by its full length.
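
The three cases can be folded into one brute-force definition: for a mismatch at position i, take the smallest shift s for which the good suffix still agrees with whatever part of the pattern lands under it. The sketch below is not from the post. The class and method names are invented, and it adds one condition the three cases do not spell out: the character that lands under the mismatch position must differ from pattern[i]. That condition is an assumption taken from the usual formulation of the rule. The routine is slow, so it is only useful for checking a table computed some other way.

```java
import java.util.Arrays;

class GoodSuffixByDefinition {
    /** gs[i] = smallest valid shift when the mismatch is at pattern[i]. */
    static int[] goodSuffixShifts(String x) {
        int m = x.length();
        int[] gs = new int[m];
        for (int i = 0; i < m; i++) {
            for (int s = 1; s <= m; s++) {
                // The character landing under position i must differ from x[i]
                // (or fall off the left end of the pattern).
                boolean precedingDiffers = s > i || x.charAt(i - s) != x.charAt(i);
                if (suffixAligns(x, i, s) && precedingDiffers) {
                    gs[i] = s;
                    break; // s = m always qualifies, so the loop always ends here
                }
            }
        }
        return gs;
    }

    /** True if, after shifting by s, every character under the good suffix x[i+1..m-1] agrees. */
    static boolean suffixAligns(String x, int i, int s) {
        for (int k = i + 1; k < x.length(); k++) {
            if (k - s >= 0 && x.charAt(k - s) != x.charAt(k)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(goodSuffixShifts("example"))); // [6, 6, 6, 6, 6, 6, 1]
        System.out.println(Arrays.toString(goodSuffixShifts("abab")));    // [2, 2, 4, 1]
    }
}
```

For "example", every mismatch except the one at the last position shifts by 6, because the only reusable piece of any good suffix is the final "e", which matches the prefix "e" (case 2).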

Taken together, the complete shift rule of the BM algorithm is: at each step the pattern moves right by max(shift(good suffix), shift(bad character)), the larger of the distances given by the good-suffix rule and the bad-character rule. The preprocessing array for the bad-character rule is bmBc[], and the preprocessing array for the good-suffix rule is bmGs[].
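
A short sketch of how the two values are combined may help. The bad-character table stores distances measured from the end of the pattern, while the mismatch at position i is itself m - 1 - i characters from the end, so the actual bad-character shift is the table value minus m - 1 - i. The class ShiftRuleDemo and the method shift are invented for illustration; the numbers passed in were worked out by hand for the pattern "example" (m = 7).

```java
class ShiftRuleDemo {
    /**
     * Combined shift for a mismatch at pattern index i.
     * goodSuffixShift is the good-suffix table entry for i,
     * bmBcValue is the bad-character table entry for the mismatching text character.
     */
    static int shift(int goodSuffixShift, int bmBcValue, int m, int i) {
        // Convert "distance from the pattern end" into "distance from position i".
        int badCharShift = bmBcValue - (m - 1 - i);
        return Math.max(goodSuffixShift, badCharShift);
    }

    public static void main(String[] args) {
        // Mismatch at i = 2 against a character absent from the pattern:
        // bad character suggests 7 - 4 = 3, good suffix suggests 6.
        System.out.println(shift(6, 7, 7, 2)); // 6
        // Mismatch at i = 6 against an absent character:
        // bad character suggests 7, good suffix suggests 1.
        System.out.println(shift(1, 7, 7, 6)); // 7
        // Mismatch at i = 6 against 'l' (table value 1): both rules suggest 1.
        System.out.println(shift(1, 1, 7, 6)); // 1
    }
}
```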

B2: Computing the good-suffix table
The index of bmGs[] is a number rather than a character: it is the position of a character in the pattern. As mentioned above, the calculation of the bmGs array has three cases, corresponding one to one with the three cases of the rule. Assume the lengths of the good suffixes in the figures are stored in an array suff[].
Case 1: corresponds to Case 1 of the good-suffix rule. In the original figure, k is the position just before the good suffix.

Case 2: corresponds to Case 2 of the good-suffix rule (illustrated by a figure in the original post).

Case 3: corresponds to Case 3 of the good-suffix rule: bmGs[i] = strlen(pattern) = m.

Following these three cases, the code is as follows:

private void preBmGs(String pattern, int patLength, int[] bmGs)
    {
        int i, j;
        int[] suffix = new int[patLength];

        suffix(pattern, patLength, suffix);
        // First set every value to m, which covers Case 3
        for (i = 0; i < patLength; i++)
        {
            bmGs[i] = patLength;
        }
        // Case 2
        j = 0;
        for (i = patLength - 1; i >= 0; i--)
        {
            if (suffix[i] == i + 1)
            {
                for (; j < patLength - 1 - i; j++)
                {
                    if (bmGs[j] == patLength)
                    {
                        bmGs[j] = patLength - 1 - i;
                    }
                }
            }
        }
        // The good suffix occurs elsewhere in the pattern, i.e. Case 1
        for (i = 0; i < patLength - 1; i++)
        {
            bmGs[patLength - 1 - suffix[i]] = patLength - 1 - i;
        }
        System.out.print("bmGs:");
        for (i = 0; i < patLength; i++)
        {
            System.out.print(bmGs[i] + ",");
        }
        System.out.println();
    }

The code above relies on the suffix array, which has to be computed first. suffix[i] is the length of the longest common suffix of the substring of the pattern that ends at position i and the whole pattern. It is implemented as follows:

private void suffix(String pattern, int patLength, int[] suffix)
    {
        suffix[patLength - 1] = patLength;
        int q = 0;
        for (int i = patLength - 2; i >= 0; i--)
        {
            q = i;
            // Walk left while the substring ending at i keeps matching the pattern's suffix
            while (q >= 0 && pattern.charAt(q) == pattern.charAt(patLength - 1 - i + q))
            {
                q--;
            }
            suffix[i] = i - q;
        }
    }
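
As a worked example of what the suffix array holds, the sketch below runs the same loop on two short patterns. The class SuffixArrayDemo and the method suffixes are names chosen for illustration; unlike the routine above, this version returns a new array, and the early return for an empty pattern is an added assumption.

```java
import java.util.Arrays;

class SuffixArrayDemo {
    /** suffix[i] = length of the longest substring ending at i that is also a suffix of the pattern. */
    static int[] suffixes(String pattern) {
        int m = pattern.length();
        int[] suffix = new int[m];
        if (m == 0) {
            return suffix; // nothing to compute for an empty pattern
        }
        suffix[m - 1] = m; // the whole pattern is trivially its own suffix
        for (int i = m - 2; i >= 0; i--) {
            int q = i;
            // Compare backwards from i against the end of the pattern.
            while (q >= 0 && pattern.charAt(q) == pattern.charAt(m - 1 - i + q)) {
                q--;
            }
            suffix[i] = i - q;
        }
        return suffix;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(suffixes("example"))); // [1, 0, 0, 0, 0, 0, 7]
        System.out.println(Arrays.toString(suffixes("abab")));    // [0, 2, 0, 4]
    }
}
```

In "abab", suffix[1] is 2 because "ab", which ends at index 1, is also the last two characters of the pattern. In "example", only the leading "e" matches the final "e", so suffix[0] is 1.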

That covers the key code of the BM algorithm; the complete code was given at the beginning. There is still plenty of room to optimize and improve this code. Interested readers can refer to the following blog post (in C#):

grep string search algorithm Boyer-moore (3-5 times faster than KMP)
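
One easy improvement, sketched here under my own assumptions rather than taken from that post, is to return the match positions instead of printing them, which makes the routine easy to check against String.indexOf. The class name BoyerMooreSearchSketch, the Character-keyed map, and the convention of returning an empty list for an empty pattern are all choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BoyerMooreSearchSketch {
    /** Returns every index at which pattern occurs in text (overlaps included). */
    static List<Integer> search(String pattern, String text) {
        List<Integer> positions = new ArrayList<>();
        int m = pattern.length();
        int n = text.length();
        if (m == 0 || m > n) {
            return positions; // assumed convention: nothing to report
        }
        Map<Character, Integer> bmBc = new HashMap<>();
        for (int i = 0; i < m - 1; i++) {
            bmBc.put(pattern.charAt(i), m - 1 - i);
        }
        int[] bmGs = goodSuffixTable(pattern);
        int j = 0;
        while (j <= n - m) {
            int i = m - 1;
            while (i >= 0 && pattern.charAt(i) == text.charAt(i + j)) {
                i--;
            }
            if (i < 0) {
                positions.add(j);
                j += bmGs[0];
            } else {
                int badChar = bmBc.getOrDefault(text.charAt(i + j), m) - m + 1 + i;
                j += Math.max(bmGs[i], badChar);
            }
        }
        return positions;
    }

    /** Same three-case construction as in the post, without the debug output. */
    static int[] goodSuffixTable(String pattern) {
        int m = pattern.length();
        int[] suffix = new int[m];
        suffix[m - 1] = m;
        for (int i = m - 2; i >= 0; i--) {
            int q = i;
            while (q >= 0 && pattern.charAt(q) == pattern.charAt(m - 1 - i + q)) {
                q--;
            }
            suffix[i] = i - q;
        }
        int[] bmGs = new int[m];
        Arrays.fill(bmGs, m); // Case 3
        int j = 0;
        for (int i = m - 1; i >= 0; i--) { // Case 2
            if (suffix[i] == i + 1) {
                for (; j < m - 1 - i; j++) {
                    if (bmGs[j] == m) {
                        bmGs[j] = m - 1 - i;
                    }
                }
            }
        }
        for (int i = 0; i < m - 1; i++) { // Case 1
            bmGs[m - 1 - suffix[i]] = m - 1 - i;
        }
        return bmGs;
    }

    public static void main(String[] args) {
        System.out.println(search("example", "Here's a simple example")); // [16]
        System.out.println(search("aa", "aaaa"));                         // [0, 1, 2]
    }
}
```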
