From the longest common substring to the suffix automaton (LCS -> SAM)

Source: Internet
Author: User
Tags: time limit, longest common substring, suffix automaton

This article starts from the longest common substring problem and works step by step toward the suffix automaton. I hope that sharing my own understanding will help others. The article is organized as follows:

The longest common substring problem
History of suffix automata
What is a suffix automaton
Theoretical basis of suffix automata
How to construct a suffix automaton
Applications of suffix automata
Summary

The longest common substring problem

First, let's look at a classic example.

Problem: Given n strings, each of length at most 100000, find their longest common contiguous substring.
Time limit: 2s
Note: let
  X = <a, b, c, f, b, c>
  Y = <a, b, f, c, a, b>
Longest common subsequence: <a, b, c, b>, of length 4
Longest common substring: <a, b>, of length 2
Unlike the subsequence problem, the substring problem requires not only that the index sequences be increasing, but also that they increase by exactly 1 at each step, i.e. the two index sequences are:
<i, i+1, i+2, ..., i+k-1> and <j, j+1, j+2, ..., j+k-1>

Solution: use classical dynamic programming, with time complexity O(n*m) for strings of lengths n and m.
Let c[i][j] be the length of the longest common substring that ends exactly at X_i and Y_j. For example, with
  X = <y, e, d, f>
  Y = <y, e, k, f>
we get c[1][1] = 1, c[2][2] = 2, c[3][3] = 0, c[4][4] = 1 (the subsequence DP would give c[4][4] == 3).
The recurrence is:
  if X_i == Y_j, then c[i][j] = c[i-1][j-1] + 1
  if X_i != Y_j, then c[i][j] = 0
The length of the longest common substring is then max{c[i][j], 1<=i<=n, 1<=j<=m}.
The core of the algorithm:

    for (i = 1; i < len1 + 1; i++) {
        for (j = 1; j < len2 + 1; j++) {
            if (str1[i-1] == str2[j-1])   // substring: extend the match ending at c[i-1][j-1]
                c[i][j] = c[i-1][j-1] + 1;
            else                          // unlike the subsequence DP, no comparison with c[i][j-1] is needed
                c[i][j] = 0;
            if (c[i][j] > max) {          // remember the best length and where it ends
                max = c[i][j];
                x = i;
                y = j;
            }
        }
    }
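As a self-contained illustration of the recurrence above, here is a minimal sketch that returns only the length of the longest common substring. The function name lcs_substring_dp and the two-row table are my own choices for the example, not part of the original code; the time complexity is still proportional to the product of the two lengths.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Length of the longest common substring of a and b.
// Time O(|a| * |b|); only two rows of the table c are kept, so memory is O(|b|).
int lcs_substring_dp(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1, 0), cur(b.size() + 1, 0);
    int best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            // c[i][j] = c[i-1][j-1] + 1 on a match, 0 otherwise
            cur[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1 : 0;
            best = std::max(best, cur[j]);
        }
        std::swap(prev, cur);  // the row just computed becomes row i-1
    }
    return best;
}
```

For the X and Y of the example above, lcs_substring_dp("abcfbc", "abfcab") gives 2.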

The dynamic programming solution takes time proportional to the product of the two string lengths. That is fine for short strings, but if the strings are long, for example when looking for the longest common substring of two texts of length 100000 each, it needs about 10^10 basic operations, far too many for a 2s time limit. This matters especially on an online judge (OJ).

So let us consider whether this kind of problem can be solved in linear time.

History of suffix automata

Common tools for string processing:
Suffix array
Suffix tree
Aho-Corasick automaton (AC automaton; built on the trie and the KMP pattern matching algorithm)
Hashing

Historically, Blumer et al. first showed in 1983 that the size of the suffix automaton is linear, and in 1985-1986 the first linear-time algorithms for constructing it were proposed (Crochemore, Blumer, et al.).

A landmark event that made the structure widely known and used: at the 2012 NOI (National Olympiad in Informatics) Winter Camp, Chen Lijie of Hangzhou Foreign Languages School gave an introduction to suffix automata.

What is a suffix automaton

For a given string s, the suffix automaton (SAM) of s is an automaton that recognizes all suffixes of s. The SAM is a directed acyclic graph in which the vertices are states and the edges are transitions between states.

Consider the string acadd.
[Figure: automaton for "acadd"]
For the construction shown in the figure, both the space and the time complexity are clearly O(n^2).

Here are some examples of suffix automata built for simple strings.
The initial state is denoted t_0, and terminal states are marked with an asterisk (*).

[Figures: suffix automata for several short strings]

Before describing the construction algorithm, we need to introduce a few new concepts and brief proofs, which are essential for understanding suffix automata.

Theoretical basis of suffix automata

End-position (endpos) equivalence:
1. Consider any non-empty substring t of the string s. We define endpos(t) as the set of all positions in s at which an occurrence of t ends.
2. We call two substrings t_1 and t_2 "endpos-equivalent" if their end-position sets coincide: endpos(t_1) = endpos(t_2). All non-empty substrings of s can therefore be partitioned into classes according to this equivalence.

For the SAM:
The number of states in a suffix automaton equals the number of endpos equivalence classes over all substrings, plus one for the initial state. Each state corresponds to one or more substrings that share the same endpos set.
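To make the definition concrete, the following brute-force sketch groups all distinct non-empty substrings of a short string by their endpos set. It is only meant for tiny inputs, and the name endpos_classes is an assumption of this example. End positions are 0-based.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Brute force: maps each endpos set to the substrings that share it.
// The number of map entries is the number of endpos equivalence classes.
std::map<std::set<int>, std::vector<std::string>> endpos_classes(const std::string& s) {
    std::map<std::string, std::set<int>> endpos;
    for (std::size_t i = 0; i < s.size(); ++i)
        for (std::size_t len = 1; i + len <= s.size(); ++len)
            endpos[s.substr(i, len)].insert((int)(i + len - 1));  // 0-based end position
    std::map<std::set<int>, std::vector<std::string>> classes;
    for (const auto& kv : endpos)
        classes[kv.second].push_back(kv.first);
    return classes;
}
```

For "abcbc" this yields 7 classes; for instance "bc" and "c" both have endpos {2, 4} and fall into the same class. Adding the initial state gives 8 states for the suffix automaton of "abcbc".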

We take this statement as an assumption, and then describe a linear-time algorithm for constructing a suffix automaton.
A few theorems are needed before the construction:

Theorem 1. Two non-empty substrings u and v (length(u) <= length(v)) are endpos-equivalent if and only if u occurs in s only as a suffix of v.

Theorem 2. Consider two non-empty substrings u and v (length(u) <= length(v)). Either their endpos sets are disjoint, or endpos(v) is a subset of endpos(u). Which case holds depends on whether u is a suffix of v.

Theorem 3. Consider one endpos equivalence class and sort its substrings in descending order of length. In the sorted sequence, each substring is one character shorter than the previous one and is a suffix of it. In other words, the substrings in one equivalence class are suffixes of one another, and their lengths take every value in an interval [x, y]. (Here x and y are the minimum and maximum substring lengths in the class.)

Suffix links. Consider a state v ≠ t_0. As we know, v corresponds to a class of substrings that share the same endpos set. If we let w be the longest of them, the remaining substrings are suffixes of w.
We also know that the first few suffixes of w (in descending order of length) lie in the same equivalence class, while the remaining suffixes (at least the empty suffix) lie in other classes. Let t be the first such suffix, i.e. the longest one outside the class of w; we build the suffix link from v to the state of t.
In other words, the suffix link link(v) of v leads to the state that contains the longest suffix of w lying in a different equivalence class.
Here we assume that the initial state t_0 forms its own equivalence class (containing only the empty string), with endpos(t_0) = {-1, ..., length(s)-1}.

Theorem 4. The suffix links make up a tree that is rooted in t_0.

Theorem 5. If we arrange all the valid endpos sets into a tree (so that each child is a subset of its parent), that tree has the same structure as the tree formed by the suffix links.
Here is an example showing the suffix links for the string "abcbc":
[Figure: suffix links of the automaton for "abcbc"]

How to construct a suffix automaton

Let's look at the code first (the core is only about 20 lines).

    #include <map>
    using namespace std;

    // State structure
    struct state {
        int len, link;        // link is the suffix link, len is the length of the node
        map<char,int> next;   // st[0].next[c] = 1 means node 0 reaches node 1 by a transition on c
    };

    // Variables
    const int maxlen = 100000;
    state st[maxlen*2];
    int sz, last;

    // Initialize the SAM
    void sa_init() {
        sz = last = 0;
        st[0].len = 0;
        st[0].link = -1;
        ++sz;
        /*
        // If the suffix automaton is built more than once, on different strings,
        // this code also needs to be executed:
        for (int i=0; i<maxlen*2; ++i)
            st[i].next.clear();
        */
    }

    // Core code: incremental construction, one character at a time; linear space
    void sa_extend(char c) {
        int cur = sz++;
        st[cur].len = st[last].len + 1;
        int p;
        for (p=last; p!=-1 && !st[p].next.count(c); p=st[p].link)
            st[p].next[c] = cur;
        if (p == -1)
            st[cur].link = 0;
        else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len)
                st[cur].link = q;
            else {
                int clone = sz++;
                st[clone].len = st[p].len + 1;
                st[clone].next = st[q].next;
                st[clone].link = st[q].link;
                for (; p!=-1 && st[p].next[c]==q; p=st[p].link)
                    st[p].next[c] = clone;
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }

Let us walk through an example: constructing the suffix automaton of aabbabd. In the figures, the red subscript on each node is that node's len.
1. Create the initial node s
[Figure: the automaton after step 1]
2. a
[Figure: the automaton for "a"]
3. aa
[Figure: the automaton for "aa"]
① Add a: create node 2 and add a transition from node 1 to node 2, st[1].next[a]=2
② Follow the suffix link of node 1 to s. Because s has the transition st[0].next[a]=1 and st[0].len+1=st[1].len,
the suffix link of node 2 points to node 1, i.e. st[2].link=1
4. aab
[Figure: the automaton for "aab"]
① Add b: create node 3 and add a transition from node 2 to node 3, st[2].next[b]=3
② Follow the suffix link of node 2 to node 1 and check whether the transition st[1].next[b] exists. It does not, so add st[1].next[b]=3
③ In the same way, check whether node s, the suffix link of node 1, has the transition st[0].next[b]. It does not, so add st[0].next[b]=3
④ Node s has no suffix link to follow further, so the suffix link of node 3 points to s, i.e. st[3].link=0
5. aabb
[Figure: the automaton for "aabb"]
① Add b: create node 4 and add a transition from node 3 to node 4, st[3].next[b]=4
② Follow the suffix link of node 3 to node s and check whether the transition st[0].next[b] exists. It does, with st[0].next[b]=3, but since st[0].len+1 != st[3].len, a clone is needed
③ Create the new node 5:

    st[5].len = st[0].len + 1;
    st[5].next = st[3].next;
    st[5].link = st[3].link;

④ Starting at node 0 and following the suffix links, redirect every transition on b that points to node 3 so that it points to node 5 (st[p].next[b]=5), for as long as p!=-1 && st[p].next[c]==q holds. Here only st[0].next[b] is redirected.
⑤ Finally, the suffix links of node 3 and node 4 both point to the new node 5
6. ...
[Figures: the automaton after the remaining steps]

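The process above constructs the automaton in O(n) time overall; this follows from an amortized argument over the suffix-link walks. With the map-based transitions used in the code, the cost is O(n log k) for an alphabet of size k.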

This leads to several properties of suffix automata:
1.  Number of states: the suffix automaton built for a string s of length n has at most 2n-1 states (for n>=3).
2.  Number of transitions: the suffix automaton built for a string s of length n has at most 3n-4 transitions (for n>=3).
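The walkthrough and the two bounds can be checked with a small program. The sketch below is a vector-based variant of the construction code; the names State, Sam and count_transitions are assumptions of this example rather than part of the original code. States are numbered in creation order, so the numbering matches the walkthrough.

```cpp
#include <map>
#include <string>
#include <vector>

struct State {
    int len, link;             // longest substring length in the class, suffix link
    std::map<char, int> next;  // transitions
};

// Vector-based variant of the incremental construction (illustrative names).
struct Sam {
    std::vector<State> st;
    int last = 0;
    explicit Sam(const std::string& s) {
        st.push_back({0, -1, {}});  // initial state
        for (char c : s) extend(c);
    }
    void extend(char c) {
        int cur = (int)st.size();
        st.push_back({st[last].len + 1, 0, {}});
        int p = last;
        for (; p != -1 && !st[p].next.count(c); p = st[p].link) st[p].next[c] = cur;
        if (p != -1) {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {  // split q: the clone keeps q's transitions and suffix link
                int clone = (int)st.size();
                st.push_back({st[p].len + 1, st[q].link, st[q].next});
                for (; p != -1 && st[p].next[c] == q; p = st[p].link) st[p].next[c] = clone;
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
};

// Total number of transitions in the automaton.
int count_transitions(const Sam& a) {
    int total = 0;
    for (const State& v : a.st) total += (int)v.next.size();
    return total;
}
```

For "aabb" this produces 6 states with st[3].link = st[4].link = 5, as in step 5 of the walkthrough. For the full string "aabbabd" (n = 7) it produces 10 states and 15 transitions, within the bounds 2n-1 = 13 and 3n-4 = 17.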

Lemma: suffix automata and suffix trees can be converted into each other.

Applications of suffix automata

Existence query
Problem: Given a text T, answer queries of the following form: given a string P, is P a substring of T?
Algorithm: Build the suffix automaton of the text T in O(length(T)).
Explanation: Start from the initial state s and follow the transitions along the characters of P. If all of P can be walked, P is a substring of T; otherwise it is not.
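A minimal sketch of this query is shown below. Sam and State are an illustrative vector-based variant of the construction code, and contains is a name assumed for this example.

```cpp
#include <map>
#include <string>
#include <vector>

struct State {
    int len, link;
    std::map<char, int> next;
};

// Vector-based variant of the incremental construction (illustrative names).
struct Sam {
    std::vector<State> st;
    int last = 0;
    explicit Sam(const std::string& s) {
        st.push_back({0, -1, {}});
        for (char c : s) extend(c);
    }
    void extend(char c) {
        int cur = (int)st.size();
        st.push_back({st[last].len + 1, 0, {}});
        int p = last;
        for (; p != -1 && !st[p].next.count(c); p = st[p].link) st[p].next[c] = cur;
        if (p != -1) {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = (int)st.size();
                st.push_back({st[p].len + 1, st[q].link, st[q].next});
                for (; p != -1 && st[p].next[c] == q; p = st[p].link) st[p].next[c] = clone;
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
};

// True if p is a substring of the text the automaton was built from.
// Runs in O(|p| log k) with map-based transitions.
bool contains(const Sam& a, const std::string& p) {
    int v = 0;  // start at the initial state
    for (char c : p) {
        auto it = a.st[v].next.find(c);
        if (it == a.st[v].next.end()) return false;  // no transition: p does not occur
        v = it->second;
    }
    return true;
}
```

With Sam t("aabbabd"), contains(t, "bba") is true and contains(t, "bbb") is false. The automaton is built once and then reused for every query.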

Number of distinct substrings
Problem: Given a string s, how many distinct substrings does it have?
Algorithm: Build the suffix automaton of s in O(n).
Explanation: Every distinct substring of s corresponds to exactly one path from the initial state of the SAM, so the number of distinct non-empty paths starting at the initial state is the number of distinct substrings.
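The path count can be computed with a small DP over the states. In the sketch below, Sam and State are an illustrative vector-based variant of the construction code and count_distinct_substrings is a name assumed for this example. States are processed in order of decreasing len, which works because every transition leads to a state with a larger len.

```cpp
#include <algorithm>
#include <map>
#include <numeric>
#include <string>
#include <vector>

struct State {
    int len, link;
    std::map<char, int> next;
};

// Vector-based variant of the incremental construction (illustrative names).
struct Sam {
    std::vector<State> st;
    int last = 0;
    explicit Sam(const std::string& s) {
        st.push_back({0, -1, {}});
        for (char c : s) extend(c);
    }
    void extend(char c) {
        int cur = (int)st.size();
        st.push_back({st[last].len + 1, 0, {}});
        int p = last;
        for (; p != -1 && !st[p].next.count(c); p = st[p].link) st[p].next[c] = cur;
        if (p != -1) {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = (int)st.size();
                st.push_back({st[p].len + 1, st[q].link, st[q].next});
                for (; p != -1 && st[p].next[c] == q; p = st[p].link) st[p].next[c] = clone;
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
};

// Number of distinct non-empty substrings = number of non-empty paths from state 0.
long long count_distinct_substrings(const Sam& a) {
    int n = (int)a.st.size();
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int x, int y) { return a.st[x].len > a.st[y].len; });
    std::vector<long long> paths(n, 1);  // every state starts with the empty path
    for (int v : order)
        for (const auto& e : a.st[v].next) paths[v] += paths[e.second];
    return paths[0] - 1;  // drop the empty path
}
```

For example, "aabbabd" has 23 distinct substrings and "abcbc" has 12.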

Total length of distinct substrings
Problem: Given a string s, compute the total length of all its distinct substrings.
Algorithm: Similar to the previous problem; besides the number of paths from each state, also keep track of the total length of those paths.
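One possible way to "add a length" is to carry two values per state: the number of paths that start there and the total length of those paths. The sketch below does this; Sam and State are an illustrative vector-based variant of the construction code, and total_length_of_distinct_substrings is a name assumed for this example.

```cpp
#include <algorithm>
#include <map>
#include <numeric>
#include <string>
#include <vector>

struct State {
    int len, link;
    std::map<char, int> next;
};

// Vector-based variant of the incremental construction (illustrative names).
struct Sam {
    std::vector<State> st;
    int last = 0;
    explicit Sam(const std::string& s) {
        st.push_back({0, -1, {}});
        for (char c : s) extend(c);
    }
    void extend(char c) {
        int cur = (int)st.size();
        st.push_back({st[last].len + 1, 0, {}});
        int p = last;
        for (; p != -1 && !st[p].next.count(c); p = st[p].link) st[p].next[c] = cur;
        if (p != -1) {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = (int)st.size();
                st.push_back({st[p].len + 1, st[q].link, st[q].next});
                for (; p != -1 && st[p].next[c] == q; p = st[p].link) st[p].next[c] = clone;
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
};

// Sum of the lengths of all distinct substrings.
long long total_length_of_distinct_substrings(const Sam& a) {
    int n = (int)a.st.size();
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int x, int y) { return a.st[x].len > a.st[y].len; });
    std::vector<long long> cnt(n, 1);  // paths from v, including the empty one
    std::vector<long long> tot(n, 0);  // total length of those paths
    for (int v : order)
        for (const auto& e : a.st[v].next) {
            cnt[v] += cnt[e.second];
            tot[v] += tot[e.second] + cnt[e.second];  // each path grows by one edge
        }
    return tot[0];
}
```

For "abc" the substrings are a, ab, abc, b, bc, c, so the result is 10.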

Lexicographically k-th smallest substring
Problem: Given a string s and a series of queries, each an integer k_i, find the k_i-th substring of s in lexicographic order.
Complexity requirement: O(length(ans) * alphabet) per query, where ans is the answer to the query and alphabet is the alphabet size.
Algorithm: The basic idea is similar to the previous two problems. The k-th substring in lexicographic order is the k-th path in lexicographic order in the automaton. So once we know the number of different paths starting from each state, we can easily find the k-th path, determining the answer character by character from the initial state.
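The following sketch implements this idea for distinct substrings. Sam and State are an illustrative vector-based variant of the construction code; count_paths and kth_substring are names assumed for this example. For brevity the path counts are recomputed inside each call; with many queries they would be computed once and reused.

```cpp
#include <algorithm>
#include <map>
#include <numeric>
#include <string>
#include <vector>

struct State {
    int len, link;
    std::map<char, int> next;  // std::map iterates characters in increasing order
};

// Vector-based variant of the incremental construction (illustrative names).
struct Sam {
    std::vector<State> st;
    int last = 0;
    explicit Sam(const std::string& s) {
        st.push_back({0, -1, {}});
        for (char c : s) extend(c);
    }
    void extend(char c) {
        int cur = (int)st.size();
        st.push_back({st[last].len + 1, 0, {}});
        int p = last;
        for (; p != -1 && !st[p].next.count(c); p = st[p].link) st[p].next[c] = cur;
        if (p != -1) {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = (int)st.size();
                st.push_back({st[p].len + 1, st[q].link, st[q].next});
                for (; p != -1 && st[p].next[c] == q; p = st[p].link) st[p].next[c] = clone;
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
};

// paths[v] = number of paths starting at v, including the empty path.
std::vector<long long> count_paths(const Sam& a) {
    int n = (int)a.st.size();
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int x, int y) { return a.st[x].len > a.st[y].len; });
    std::vector<long long> paths(n, 1);
    for (int v : order)
        for (const auto& e : a.st[v].next) paths[v] += paths[e.second];
    return paths;
}

// k-th (1-based) distinct substring in lexicographic order, or "" if k is out of range.
std::string kth_substring(const Sam& a, long long k) {
    std::vector<long long> paths = count_paths(a);
    if (k < 1 || k > paths[0] - 1) return "";
    std::string res;
    int v = 0;
    while (k > 0) {
        for (const auto& e : a.st[v].next) {
            long long below = paths[e.second];  // substrings that continue with e.first
            if (k <= below) {                   // the answer goes through this edge
                res += e.first;
                --k;                            // the path ending right here is the first of them
                v = e.second;
                break;
            }
            k -= below;
        }
    }
    return res;
}
```

For "aabb" the distinct substrings in order are a, aa, aab, aabb, ab, abb, b, bb, so k = 5 gives "ab".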

Minimum cyclic shift
Problem: Given a string s, find the lexicographically smallest string among all its cyclic shifts.
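One common approach, which the text above does not spell out, is to build the suffix automaton of s+s and, starting from the initial state, greedily take the smallest available transition length(s) times. Every cyclic shift of s is a substring of s+s, so this greedy path spells the smallest one. In the sketch below, Sam and State are an illustrative vector-based variant of the construction code and min_cyclic_shift is a name assumed for this example.

```cpp
#include <map>
#include <string>
#include <vector>

struct State {
    int len, link;
    std::map<char, int> next;  // std::map keeps the characters sorted
};

// Vector-based variant of the incremental construction (illustrative names).
struct Sam {
    std::vector<State> st;
    int last = 0;
    explicit Sam(const std::string& s) {
        st.push_back({0, -1, {}});
        for (char c : s) extend(c);
    }
    void extend(char c) {
        int cur = (int)st.size();
        st.push_back({st[last].len + 1, 0, {}});
        int p = last;
        for (; p != -1 && !st[p].next.count(c); p = st[p].link) st[p].next[c] = cur;
        if (p != -1) {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = (int)st.size();
                st.push_back({st[p].len + 1, st[q].link, st[q].next});
                for (; p != -1 && st[p].next[c] == q; p = st[p].link) st[p].next[c] = clone;
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
};

// Lexicographically smallest cyclic shift of s.
std::string min_cyclic_shift(const std::string& s) {
    Sam a(s + s);
    std::string res;
    int v = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        auto it = a.st[v].next.begin();  // smallest outgoing character
        res += it->first;
        v = it->second;
    }
    return res;
}
```

For example, min_cyclic_shift("bca") gives "abc".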

Number of occurrences query
Problem: Given a text T, answer queries of the following form: given a string P, how many times does P occur in T as a substring (occurrences may overlap)?
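A standard way to answer this, not described in the text above, is to store for every state the size of its endpos set. Each state created for a new prefix starts with count 1, clones start with 0, and the counts are then added up along the suffix links from longer states to shorter ones. The sketch below follows that idea; Sam, State, the cnt array and count_occurrences are names assumed for this example.

```cpp
#include <algorithm>
#include <map>
#include <numeric>
#include <string>
#include <vector>

struct State {
    int len, link;
    std::map<char, int> next;
};

// Vector-based construction that also records endpos sizes (illustrative names).
struct Sam {
    std::vector<State> st;
    std::vector<long long> cnt;  // after construction: cnt[v] = |endpos(v)| for v != 0
    int last = 0;
    explicit Sam(const std::string& s) {
        st.push_back({0, -1, {}});
        cnt.push_back(0);
        for (char c : s) extend(c);
        // Add the counts up along the suffix links, longest states first.
        std::vector<int> order(st.size());
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](int x, int y) { return st[x].len > st[y].len; });
        for (int v : order)
            if (st[v].link != -1) cnt[st[v].link] += cnt[v];
    }
    void extend(char c) {
        int cur = (int)st.size();
        st.push_back({st[last].len + 1, 0, {}});
        cnt.push_back(1);  // cur corresponds to a new prefix, i.e. one end position
        int p = last;
        for (; p != -1 && !st[p].next.count(c); p = st[p].link) st[p].next[c] = cur;
        if (p != -1) {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = (int)st.size();
                st.push_back({st[p].len + 1, st[q].link, st[q].next});
                cnt.push_back(0);  // a clone contributes no end position of its own
                for (; p != -1 && st[p].next[c] == q; p = st[p].link) st[p].next[c] = clone;
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
};

// Number of (possibly overlapping) occurrences of a non-empty pattern p in the text.
long long count_occurrences(const Sam& a, const std::string& p) {
    int v = 0;
    for (char c : p) {
        auto it = a.st[v].next.find(c);
        if (it == a.st[v].next.end()) return 0;
        v = it->second;
    }
    return a.cnt[v];
}
```

With Sam t("aabbabd"), count_occurrences(t, "ab") is 2 and count_occurrences(t, "b") is 3.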

First occurrence position query
Problem: Given a text T, answer queries of the following form: given a string P, find the position of the first occurrence of P in T.

All occurrence positions query
Problem: Given a text T, answer queries of the following form: given a string P, report all positions at which P occurs in T (occurrences may overlap).

Shortest string that does not occur in the text
Problem: Given a string s and an alphabet, find a shortest string that is not a substring of s.

Longest common substring of two strings
Problem: Given two strings s and t, find their longest common substring, i.e. a longest string x that is a substring of both s and t.
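This is the problem the article started from. One common approach, sketched here under my own naming, builds the suffix automaton of s and then feeds t through it character by character, keeping the current state v and the current match length l. When no transition exists, the walk falls back along the suffix links. Sam and State are an illustrative vector-based variant of the construction code; lcs_length is a name assumed for this example.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

struct State {
    int len, link;
    std::map<char, int> next;
};

// Vector-based variant of the incremental construction (illustrative names).
struct Sam {
    std::vector<State> st;
    int last = 0;
    explicit Sam(const std::string& s) {
        st.push_back({0, -1, {}});
        for (char c : s) extend(c);
    }
    void extend(char c) {
        int cur = (int)st.size();
        st.push_back({st[last].len + 1, 0, {}});
        int p = last;
        for (; p != -1 && !st[p].next.count(c); p = st[p].link) st[p].next[c] = cur;
        if (p != -1) {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = (int)st.size();
                st.push_back({st[p].len + 1, st[q].link, st[q].next});
                for (; p != -1 && st[p].next[c] == q; p = st[p].link) st[p].next[c] = clone;
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
};

// Length of the longest common substring of s and t.
int lcs_length(const std::string& s, const std::string& t) {
    Sam a(s);
    int v = 0, l = 0, best = 0;
    for (char c : t) {
        // Shorten the current match until it can be extended by c (or is empty).
        while (v != 0 && !a.st[v].next.count(c)) {
            v = a.st[v].link;
            l = a.st[v].len;
        }
        auto it = a.st[v].next.find(c);
        if (it != a.st[v].next.end()) {
            v = it->second;
            ++l;
        }
        best = std::max(best, l);
    }
    return best;
}
```

For the X and Y from the beginning of the article, lcs_length("abcfbc", "abfcab") gives 2, the same answer as the dynamic programming solution, but the work is roughly linear in the total length of the two strings (up to the log factor of the map).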

Longest common substring of multiple strings
Problem: Given k strings s_1, ..., s_k, find their longest common substring, i.e. a longest string x that is a substring of every s_i.

Summary

A wide range of string processing problems can be solved with the suffix automaton. Of course, each problem needs its own analysis, and combining the suffix automaton with other techniques, such as dynamic programming, often solves a problem quickly and effectively.

As mentioned above, suffix trees and suffix automata can be converted into each other. A suffix tree can also be built in linear time O(n), the typical method being Ukkonen's algorithm, but its construction code is not a simple matter of 20 lines.

As a kind of automaton, the suffix automaton is both an ingenious application of automata and a deepening of automata theory, and it shows at the application level what automata can do in language processing. The AC automaton mentioned at the beginning of the article is also an automaton. The two are very similar in complexity and in how hard they are to understand, but in terms of ease of implementation I find the suffix automaton the better of the two.

Finally, I would like to thank the authors of the following material for sharing their work. (Everyone's understanding differs in the details, especially for suffix automata, but the core ideas are basically the same.)
1. Suffix automata: O(N) construction and applications
2. Chen Lijie's lecture notes, NOI Winter Camp 2012
3. Applications of the SAM: code for various string processing problems ("Suffix automaton", mu yang)
4. Suffix automata and linear-time construction of the suffix tree
5. Suffix automaton (SAM) learning guide
6. Suffix automaton (fhq+neroysq, completed)
7. Demonstration of the suffix automaton construction process
