String Algorithm topic

Source: Internet
Author: User

This topic deals with the string matching problem (the strstr problem):

Given a text string T of length n, i.e. T[0...n-1],

find a pattern P of length m, i.e. P[0...m-1], in T (n >= m).


Common algorithms include:

1) Brute Force Method

2) Rabin-Karp String Matching Algorithm

3) String Matching with Finite Automata

4) KMP Algorithm

5) Boyer-Moore Algorithm

6) Suffix Trees


1) Brute Force Method:



package String;

public class BruteForce {
    public static void main(String[] args) {
        String T = "mississippi";
        String P = "ssi";
        bruteForce(T, P);
    }

    // Time: O(nm), Space: O(1)
    public static void bruteForce(String T, String P) {
        int n = T.length(); // Text
        int m = P.length(); // Pattern
        // i is the offset into T; the last start position to check is n-m
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            // j is the number of characters of P matched so far
            while (j < m && T.charAt(i + j) == P.charAt(j)) {
                j++;
            }
            if (j == m) { // all of P matched: report the start of this match
                System.out.println("Pattern found at index " + i);
            }
        }
    }
}


http://www.geeksforgeeks.org/searching-for-patterns-set-1-naive-pattern-searching/


2) Rabin-Karp String Matching Algorithm

Rabin-Karp has a preprocessing time of O(m) and a worst-case matching time of O((n-m+1)m). That is the same worst case as the naive algorithm, plus the preprocessing time, so why learn it? Although Rabin-Karp is no better than the naive algorithm in the worst case, it is usually much faster in practice, and its expected matching time is O(n) [see Introduction to Algorithms]. On the other hand, Rabin-Karp relies on numerical computation and is generally not faster than KMP, so why learn it once we have KMP? In my view, what we learn is a way of thinking about problems: the more approaches we have seen, the easier it is to pick a suitable algorithm for a practical problem. For example, Rabin-Karp is a good choice for two-dimensional pattern matching.
Rabin-Karp is also interesting because it treats characters as numbers. The basic idea: let Tm be a substring of T of length |P|. If the numeric value of Tm modulo some number (usually a prime) equals the numeric value of the pattern P modulo that same number, then Tm may be a valid match.

The difficulty is that the values p and t may be very large and inconvenient to work with. A simple remedy is to compute p and t modulo a suitable number q. Each character is treated as a digit in radix d (a decimal digit in the example below), so p, t, and the recurrence can all be computed modulo q: p mod q can be computed in O(m) time, and all the values t_s mod q in O(n-m+1) time. See Introduction to Algorithms or http://net.pku.edu.cn/~course/cs101/2007/resource/Intro2Algorithm/book6/chap34.htm
The recurrence is as follows, where h = d^(m-1) mod q:
t_(s+1) = (d * (t_s - T[s+1] * h) + T[s+m+1]) mod q
For example, with d = 10 (decimal), m = 5, and t_s = 31415, we remove the high-order digit T[s+1] = 3 and bring in a new low-order digit (say T[s+5+1] = 2):
t_(s+1) = 10 * (31415 - 10000 * 3) + 2 = 14152
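
The rolling update can be checked with a few lines of Java. This is only a sketch under stated assumptions: the class RollingHashDemo and its methods are hypothetical names, the text is a string of decimal digits with 0-based indices, and q = 13 is an arbitrary small prime chosen for the demonstration.

```java
class RollingHashDemo {
    // Hash of the first m digits of text, computed with Horner's rule modulo q.
    static int initialHash(String text, int m, int d, int q) {
        int t = 0;
        for (int i = 0; i < m; i++) {
            t = (d * t + (text.charAt(i) - '0')) % q;
        }
        return t;
    }

    // Slide the window by one position: drop the high-order digit `out`
    // and append the low-order digit `in`. h must equal d^(m-1) mod q.
    static int roll(int t, int out, int in, int h, int d, int q) {
        int next = (d * (t - out * h) + in) % q;
        return next < 0 ? next + q : next; // Java's % can yield a negative value
    }

    public static void main(String[] args) {
        int d = 10, m = 5, q = 13; // q is an assumption, not from the article
        String text = "314152";
        int h = 1;
        for (int i = 0; i < m - 1; i++) {
            h = (h * d) % q; // h = 10^4 mod 13
        }
        int t0 = initialHash(text, m, d, q); // hash of "31415"
        int t1 = roll(t0, 3, 2, h, d, q);    // hash of "14152"
        System.out.println(t1 == 14152 % q);
    }
}
```

Because every intermediate value is reduced modulo q, the numbers stay small no matter how long the pattern is; the price is that equal hashes no longer guarantee equal strings.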


Average and best-case time complexity: O(n + m)

Worst-case time complexity: O(nm)

The worst case occurs when every window of the text has the same hash value as the pattern, which degrades the algorithm to O(nm).

package String;

public class RabinKarp {
    public static void main(String[] args) {
        String T = "mississippi";
        String P = "ssi";
        int q = 101;
        search(P, T, q);
    }

    public static int d = 256;

    public static void search(String P, String T, int q) {
        int M = P.length();
        int N = T.length();
        int i, j;
        int p = 0; // hash value for pattern
        int t = 0; // hash value for txt
        int h = 1;

        // The value of h would be "pow(d, M-1) % q"
        for (i = 0; i ...
        // ... (the rest of this listing is missing from the source)
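
The listing above breaks off in the source partway through computing h. As a stand-in, here is a minimal self-contained sketch of the same approach. It is not the original author's code: the class name RabinKarpSketch is hypothetical, matches are returned as a list rather than printed, and q is assumed small enough (such as 101) that the intermediate products fit in an int.

```java
import java.util.ArrayList;
import java.util.List;

class RabinKarpSketch {
    static final int D = 256; // radix: assumed number of possible character values

    // Returns the start index of every occurrence of pat in txt, using modulus q.
    static List<Integer> search(String pat, String txt, int q) {
        List<Integer> matches = new ArrayList<>();
        int m = pat.length();
        int n = txt.length();
        if (m == 0 || m > n) {
            return matches;
        }

        int h = 1; // h = D^(m-1) mod q
        for (int i = 0; i < m - 1; i++) {
            h = (h * D) % q;
        }

        int p = 0; // hash of the pattern
        int t = 0; // hash of the current text window
        for (int i = 0; i < m; i++) {
            p = (D * p + pat.charAt(i)) % q;
            t = (D * t + txt.charAt(i)) % q;
        }

        for (int s = 0; s <= n - m; s++) {
            // Equal hashes only signal a candidate; confirm character by character.
            if (p == t && txt.regionMatches(s, pat, 0, m)) {
                matches.add(s);
            }
            if (s < n - m) {
                // Rolling update: drop txt[s], append txt[s+m].
                t = (D * (t - txt.charAt(s) * h) + txt.charAt(s + m)) % q;
                if (t < 0) {
                    t += q;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(search("ssi", "mississippi", 101));
    }
}
```

The explicit character comparison after a hash hit is what keeps the result correct when two different windows collide on the same hash value.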
 
  

http://www.geeksforgeeks.org/searching-for-patterns-set-3-rabin-karp-algorithm/

http://www.youtube.com/watch?v=d3TZpfnpJZ0


3) String Matching with Finite Automata

Suppose we want to scan the text T to find every position where the pattern P occurs. With this method, the pattern P is preprocessed first. After that, each character of T is examined exactly once, in constant time per character, so the matching time after the automaton has been built is Θ(n).

Assume the text has length n and the pattern has length m. The automaton has the states 0, 1, ..., m, and the initial state is 0. Setting aside how the automaton is computed, consider only what it does: scanning the text from left to right, for each character a the next state is determined by the current state and the value of a. Scanning continues in this way, and whenever the state becomes m, a match has been found. Consider the following simple example:

Assume that the text and the pattern contain only the three characters a, b, and c. The text T is "abababaca" and the pattern P is "ababaca". An automaton is built from the pattern P, as in figure (b) (ignoring implementation details for now):

Figure (a) shows some of the state transitions.




Consider the finite automaton for the pattern P = "ababaca". First, note that if the current state is k, then the longest suffix of the text read so far that is also a prefix of the pattern has length k. After reading the next text character, even if it matches, the state can be at most k + 1. For example, if the current state is 5, the last five characters of the text read so far are "ababa", which equals the first five characters of the pattern.

If the next text character is "c", the state becomes 6. If the next character is "a", we must find the longest suffix of the text that is also a prefix of the pattern all over again. A simple way is to start from k = 6 (the largest possible state) and check whether the last k characters of the text equal the first k characters of the pattern; if not, try k = k-1, and so on. Because the last five characters of the text, "ababa", are exactly the first five characters of the pattern, the text itself is not needed to build the automaton; the pattern alone suffices. In this case the state becomes 1 (only "a" matches). Similarly, if the next character is "b", the state becomes 4 (the first four characters of the pattern, "abab", are a suffix of "ababab").

In the textbook's pseudocode, Σ denotes the alphabet, and δ(q, a) can be understood as the state reached from state q after reading the character a.


With the method above, if the alphabet contains k characters, building the automaton takes O(m^3 * k) time; an improved method brings this down to O(m * k). After preprocessing, matching takes Θ(n) time.

package String;

public class FiniteAutomata {
    public static int getNextState(String pat, int M, int state, int x) {
        // If the character x is the same as the next character in the pattern,
        // then simply increment state
        if (state < M && x == pat.charAt(state)) {
            return state + 1;
        }

        int ns, i; // ns stores the result, which is the next state

        // ns finally contains the longest prefix which is also a suffix
        // of "pat[0..state-1]x"

        // Start from the largest possible value and stop when you find
        // a prefix which is also a suffix
        for (ns = state; ns > 0; ns--) {
            if (pat.charAt(ns - 1) == x) {
                for (i = 0; i ...
                // ... (the rest of this listing is missing from the source)
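
The listing above is cut off in the source inside the inner loop. The following is a minimal complete sketch of the same naive construction, offered as an illustration rather than a reconstruction of the original: the class name FiniteAutomatonSketch is hypothetical, the alphabet is assumed to be the 256 extended-ASCII values, and search returns -1 when there is no match.

```java
class FiniteAutomatonSketch {
    static final int R = 256; // assumed alphabet size (extended ASCII)

    // Next state: length of the longest prefix of pat that is also
    // a suffix of pat[0..state-1] followed by x.
    static int nextState(String pat, int state, char x) {
        int m = pat.length();
        if (state < m && pat.charAt(state) == x) {
            return state + 1;
        }
        for (int ns = state; ns > 0; ns--) {
            if (pat.charAt(ns - 1) != x) {
                continue;
            }
            // Check pat[0..ns-2] against the last ns-1 characters of pat[0..state-1].
            int i = 0;
            while (i < ns - 1 && pat.charAt(i) == pat.charAt(state - ns + 1 + i)) {
                i++;
            }
            if (i == ns - 1) {
                return ns;
            }
        }
        return 0;
    }

    // Transition table tf[state][character], built from the pattern alone.
    static int[][] buildTable(String pat) {
        int m = pat.length();
        int[][] tf = new int[m + 1][R];
        for (int state = 0; state <= m; state++) {
            for (int x = 0; x < R; x++) {
                tf[state][x] = nextState(pat, state, (char) x);
            }
        }
        return tf;
    }

    // Returns the start index of the first match, or -1 if there is none.
    // Assumes every character of txt has a code below R.
    static int search(String pat, String txt) {
        int m = pat.length();
        if (m == 0) {
            return 0;
        }
        int[][] tf = buildTable(pat);
        int state = 0;
        for (int i = 0; i < txt.length(); i++) {
            state = tf[state][txt.charAt(i)];
            if (state == m) {
                return i - m + 1;
            }
        }
        return -1;
    }
}
```

For the pattern "ababaca", nextState(pat, 5, 'a') gives 1 and nextState(pat, 5, 'b') gives 4, matching the transitions worked out by hand earlier in this section.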
   
    
http://www.cnblogs.com/jolin123/p/3443543.html

http://www.geeksforgeeks.org/searching-for-patterns-set-5-finite-automata/


4) KMP Algorithm

The most understandable explanation of the KMP algorithm is the one given by Robert Sedgewick of Princeton, who explains it with the automaton model. The hard part is constructing the dfa table. The subtle point is to maintain a variable X: for each pattern position, the match and mismatch transitions are filled in using the state X, and then X itself is updated.

With the dfa table, the search is a linear scan. The key point is that the text pointer i only moves forward and never backs up, while j is the current state, i.e. the number of pattern characters matched so far.

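Time Complexity: O(n + m)

Space Complexity: O(m)

(These bounds are for the prefix-function form of KMP. The dfa-table version used here needs O(mR) time and space to build the table for an alphabet of R characters, followed by an O(n) search.)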

package String;

public class KMP {
    private static int[][] dfa;

    // return offset of first match; N if no match
    public static int search(String text, String pat) {
        createDFA(pat);

        // simulate operation of DFA on text
        int M = pat.length();
        int N = text.length();
        int i, j;
        for (i = 0, j = 0; i ...
        // ... (the rest of this listing is missing from the source)
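
The listing above is truncated in the source before the search loop and the table construction. Below is a minimal sketch of the dfa-table approach in the style of Sedgewick's presentation. The class name KmpDfaSketch and the method buildDfa are hypothetical, and an alphabet of R = 256 character values is assumed.

```java
class KmpDfaSketch {
    static final int R = 256; // assumed alphabet size (extended ASCII)

    // dfa[c][j] is the state to move to from state j on reading character c.
    // Assumes pat is non-empty and all of its characters have codes below R.
    static int[][] buildDfa(String pat) {
        int m = pat.length();
        int[][] dfa = new int[R][m];
        dfa[pat.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < R; c++) {
                dfa[c][j] = dfa[c][x];     // mismatch: behave as restart state x would
            }
            dfa[pat.charAt(j)][j] = j + 1; // match: advance to the next state
            x = dfa[pat.charAt(j)][x];     // update the restart state
        }
        return dfa;
    }

    // Returns the offset of the first match, or text.length() if there is none.
    static int search(String text, String pat) {
        int m = pat.length();
        int n = text.length();
        if (m == 0) {
            return 0;
        }
        int[][] dfa = buildDfa(pat);
        int i, j;
        // i never moves backward; j is the number of pattern characters matched.
        for (i = 0, j = 0; i < n && j < m; i++) {
            j = dfa[text.charAt(i)][j];
        }
        return j == m ? i - m : n;
    }

    public static void main(String[] args) {
        System.out.println(search("mississippi", "ssi"));
    }
}
```

The restart state x is the state the automaton would be in after reading pat[1..j-1], which is why copying its column gives the correct mismatch transitions without rescanning the text.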
     
      

http://algs4.cs.princeton.edu/53substring/KMP.java.html

https://www.cs.princeton.edu/courses/archive/fall10/cos226/demo/53KnuthMorrisPratt.pdf

http://www.cmi.ac.in/~kshitij/talks/kmp-talk/kmp.pdf




5) Boyer-Moore Algorithm

package String;

public class BoyerMoore {
    private final int R;     // the radix
    private int[] right;     // the bad-character skip array

    private char[] pattern;  // store the pattern as a character array
    private String pat;      // or as a string

    // pattern provided as a string
    public BoyerMoore(String pat) {
        this.R = 256;
        this.pat = pat;

        // position of rightmost occurrence of c in the pattern
        right = new int[R];
        for (int c = 0; c < R; c++)
            right[c] = -1;
        for (int j = 0; j < pat.length(); j++)
            right[pat.charAt(j)] = j;
    }

    // pattern provided as a character array
    public BoyerMoore(char[] pattern, int R) {
        this.R = R;
        this.pattern = new char[pattern.length];
        for (int j = 0; j < pattern.length; j++)
            this.pattern[j] = pattern[j];

        // position of rightmost occurrence of c in the pattern
        right = new int[R];
        for (int c = 0; c < R; c++)
            right[c] = -1;
        for (int j = 0; j < pattern.length; j++)
            right[pattern[j]] = j;
    }

    // return offset of first match; N if no match
    public int search(String txt) {
        int M = pat.length();
        int N = txt.length();
        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M-1; j >= 0; j--) {
                if (pat.charAt(j) != txt.charAt(i+j)) {
                    skip = Math.max(1, j - right[txt.charAt(i+j)]);
                    break;
                }
            }
            if (skip == 0) return i;    // found
        }
        return N;                       // not found
    }

    // return offset of first match; N if no match
    public int search(char[] text) {
        int M = pattern.length;
        int N = text.length;
        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M-1; j >= 0; j--) {
                if (pattern[j] != text[i+j]) {
                    skip = Math.max(1, j - right[text[i+j]]);
                    break;
                }
            }
            if (skip == 0) return i;    // found
        }
        return N;                       // not found
    }

    // test client
    public static void main(String[] args) {
        String pat = "ssi";
        String txt = "mississippi";
        char[] pattern = pat.toCharArray();
        char[] text    = txt.toCharArray();

        BoyerMoore boyermoore1 = new BoyerMoore(pat);
        BoyerMoore boyermoore2 = new BoyerMoore(pattern, 256);
        int offset1 = boyermoore1.search(txt);
        int offset2 = boyermoore2.search(text);

        System.out.println("Find in offset: " + offset1);
        System.out.println("Find in offset: " + offset2);
    }
}

https://www.youtube.com/watch?v=rDPuaNw9_Eo

http://algs4.cs.princeton.edu/53substring/BoyerMoore.java.html


6) Suffix Trees

Before discussing suffix trees, we first introduce two common data structures for storing strings: the Trie and the Ternary Search Tree.

For a summary of the Trie, see: http://blog.csdn.net/fightforyourdream/article/details/18332799

The advantage of the Trie is fast lookup; the disadvantage is high memory use, because every node has to store 26 pointers to its children (one per letter).
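
To make the memory cost concrete, here is a minimal trie sketch. It is an illustration only: the class name TrieSketch is hypothetical, and keys are assumed to consist of the lowercase letters a to z.

```java
class TrieSketch {
    private static final int ALPHABET = 26; // assumes lowercase a-z keys only

    private static final class Node {
        // One slot per letter; in most nodes nearly all of these stay null.
        final Node[] children = new Node[ALPHABET];
        boolean isWord;
    }

    private final Node root = new Node();

    // Walks down from the root, creating nodes as needed: O(length of word).
    void insert(String word) {
        Node cur = root;
        for (char ch : word.toCharArray()) {
            int idx = ch - 'a';
            if (cur.children[idx] == null) {
                cur.children[idx] = new Node();
            }
            cur = cur.children[idx];
        }
        cur.isWord = true;
    }

    // Returns true only if word was inserted as a whole key, not just as a prefix.
    boolean contains(String word) {
        Node cur = root;
        for (char ch : word.toCharArray()) {
            cur = cur.children[ch - 'a'];
            if (cur == null) {
                return false;
            }
        }
        return cur.isWord;
    }
}
```

Lookup time depends only on the length of the key, but each node carries a 26-entry array whether or not its children exist, which is the memory overhead described above.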

The Ternary Search Tree was introduced to address this. It combines the memory efficiency of a BST with the time efficiency of a Trie.

For more information about the Ternary Search Tree, see:

http://www.cnblogs.com/rush/archive/2012/12/30/2839996.html

http://www.geeksforgeeks.org/ternary-search-tree/

For example:

Insert the strings AB, ABCD, ABBA, and BCD into the search tree. First, AB is inserted. Next we insert ABCD: because ABCD and AB share the prefix AB, the node C is stored in the CenterChild of B, and D is stored in the CenterChild of C. When inserting ABBA, ABBA and AB share the prefix AB, and the character B is less than C, so B is stored in the LeftChild of C. When inserting BCD, the character B is greater than A, so B is stored in the RightChild of A.
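
The insertion walk-through can be reproduced with a small ternary search tree sketch. The class name TernarySearchTreeSketch is hypothetical, and the field names left, center, and right stand in for LeftChild, CenterChild, and RightChild; empty strings are assumed not to be stored.

```java
class TernarySearchTreeSketch {
    static final class Node {
        final char ch;
        Node left, center, right; // smaller char, next char of the key, larger char
        boolean isWord;

        Node(char ch) {
            this.ch = ch;
        }
    }

    Node root;

    void insert(String word) {
        if (!word.isEmpty()) {
            root = insert(root, word, 0);
        }
    }

    private Node insert(Node node, String word, int i) {
        char c = word.charAt(i);
        if (node == null) {
            node = new Node(c);
        }
        if (c < node.ch) {
            node.left = insert(node.left, word, i);
        } else if (c > node.ch) {
            node.right = insert(node.right, word, i);
        } else if (i < word.length() - 1) {
            node.center = insert(node.center, word, i + 1); // same char: move to next
        } else {
            node.isWord = true;
        }
        return node;
    }

    boolean contains(String word) {
        if (word.isEmpty()) {
            return false;
        }
        Node node = root;
        int i = 0;
        while (node != null) {
            char c = word.charAt(i);
            if (c < node.ch) {
                node = node.left;
            } else if (c > node.ch) {
                node = node.right;
            } else if (i < word.length() - 1) {
                node = node.center;
                i++;
            } else {
                return node.isWord;
            }
        }
        return false;
    }
}
```

After inserting AB, ABCD, ABBA, and BCD in that order, root is A, root.center.center is C, root.center.center.left is B, and root.right is B. Each node holds three references instead of 26, which is where the memory saving over the trie comes from.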


In fact, a Hashtable can also be used to store strings. It is memory-efficient, but it does not keep the strings in sorted order.


Suffix tree video:

https://www.youtube.com/watch?v=hLsrPsFHPcQ

v_JULY_v, "From the Trie (dictionary tree) to the suffix tree" (revised 10.28): http://blog.csdn.net/v_july_v/article/details/6897097


Suffix tree Group


