Generalized suffix automata

Last Update:2015-05-04 Source: Internet

Author: User


1). Introduction to automata
First, what is an automaton? The job of a finite-state automaton is to recognize strings. For an automaton A, if it can recognize the string s we write A(s) = true; otherwise A(s) = false.
An automaton consists of five parts: alpha, the character set; state, the set of states; init, the initial state; end, the set of end states; and trans, the state transition function.
Let trans(s, ch) denote the state reached from state s after reading the character ch. If the transition trans(s, ch) does not exist, we set it to null for convenience, and null can only transition to null. null denotes a state that does not exist. Likewise, let trans(s, str) denote the state reached from state s after reading the string str.

trans(s, str):
    cur = s
    for i = 0 to length(str) - 1
        cur = trans(cur, str[i])

The value of trans(s, str) is the final cur.
The strings that the automaton A can recognize are then all strings x with trans(init, x) ∈ end. Call this set Reg(A). The strings that can be recognized starting from a state s are all strings x with trans(s, x) ∈ end. Call this set Reg(s).
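The definitions above can be sketched in C++ as follows. This is only a minimal illustration, not code from the original article: the names Automaton, kNull and suffixesOfAb are made up here, states are plain integers, and the example automaton is a hand-built one that accepts the suffixes of "ab".

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical minimal automaton; kNull plays the role of the "null" state.
constexpr int kNull = -1;

struct Automaton {
    int init = 0;                               // initial state
    std::set<int> end;                          // set of end states
    std::map<std::pair<int, char>, int> edges;  // table behind trans(s, ch)

    // trans(s, ch): a missing edge yields null, and null only goes to null.
    int trans(int s, char ch) const {
        if (s == kNull) return kNull;
        auto it = edges.find({s, ch});
        return it == edges.end() ? kNull : it->second;
    }

    // trans(s, str): read the characters of str one by one.
    int trans(int s, const std::string& str) const {
        int cur = s;
        for (char ch : str) cur = trans(cur, ch);
        return cur;
    }

    // A(x) = true exactly when trans(init, x) is an end state.
    bool accepts(const std::string& x) const {
        return end.count(trans(init, x)) > 0;
    }
};

// Hand-built example that recognizes the suffixes of "ab": "", "b" and "ab".
Automaton suffixesOfAb() {
    Automaton a;
    a.end = {0, 2};
    a.edges[{0, 'a'}] = 1;
    a.edges[{1, 'b'}] = 2;
    a.edges[{0, 'b'}] = 2;
    return a;
}
```

With this sketch, suffixesOfAb().accepts("b") holds while accepts("a") does not, because state 1 is not an end state.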
2). Building the suffix automaton
Definition of the suffix automaton: Given a string S, the suffix automaton of S (abbreviated SAM below) is an automaton that recognizes all suffixes of S. That is, SAM(x) = true if and only if x is a suffix of S. As explained later, the suffix automaton can also be used to recognize all substrings of S.
The simplest implementation: Consider the string "aabbabd". We can insert all suffixes of the string into a trie (dictionary tree). The initial state is then the root, the state transition function is the edges of the tree, and the end-state set is all the leaves. Note that this structure has O(N^2) nodes for a string of length N. That is too many nodes, so what can we do?
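To make the size problem concrete, here is a small sketch of the naive suffix trie. It is an illustration only: the struct name SuffixTrie and the restriction to lowercase letters are assumptions of this example, not part of the original text.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Naive suffix trie over 'a'..'z'; node 0 is the root (the initial state).
struct SuffixTrie {
    std::vector<std::array<int, 26>> next;  // next[v][c] == -1 means "no edge"

    explicit SuffixTrie(const std::string& s) {
        newNode();  // root
        // Insert every suffix s[start..] character by character.
        for (std::size_t start = 0; start < s.size(); ++start) {
            int cur = 0;
            for (std::size_t i = start; i < s.size(); ++i) {
                int c = s[i] - 'a';
                assert(c >= 0 && c < 26);  // lowercase input only
                if (next[cur][c] == -1) {
                    int created = newNode();  // may reallocate, so finish it first
                    next[cur][c] = created;
                }
                cur = next[cur][c];
            }
        }
    }

    int newNode() {
        std::array<int, 26> row;
        row.fill(-1);
        next.push_back(row);
        return static_cast<int>(next.size()) - 1;
    }

    std::size_t size() const { return next.size(); }
};
```

Every distinct non-empty substring gets its own node, so the trie for "aabbabd" has 24 nodes (23 distinct substrings plus the root), and a string of N distinct characters needs N(N+1)/2 + 1 nodes.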

Minimal-state suffix automaton: As the name suggests, this is the suffix automaton with the fewest states. We will see later that its size is linear; first we introduce some properties. Suppose we have this minimal-state suffix automaton SAM. Let ST(str) denote trans(init, str), that is, the state reached from the initial state after reading the string str.
Analysis: Let the string be S, let Suf be the set of its suffixes and Fac the set of its contiguous substrings. Let Suffix(a) be the suffix starting at position a, and let S[l, r) denote the substring of S on the interval [l, r). Indices start from 0. For a string s, if s does not belong to Fac then ST(s) = null, because no characters appended to s can ever turn it into a suffix of S, so there is no reason to waste space on it. Conversely, if s belongs to Fac then ST(s) ≠ null: since s is a substring of S, appending some characters to s can give a suffix of S, and because we want to recognize all suffixes we cannot discard this possibility.
We cannot create a separate state for each s ∈ Fac, because the size of Fac is O(N^2). Consider which strings ST(a) can recognize, that is, Reg(ST(a)). A string x is recognized by the automaton if and only if x ∈ Suf. ST(a) recognizes a string x if and only if ax ∈ Suf, because the string a has already been read. If ax is a suffix of S then x is also a suffix of S, so Reg(ST(a)) is a set of suffixes. For a state s, the only thing we care about is Reg(s).
If a occurs in S at position [l, r), then ST(a) can recognize the suffix of S starting at r. So if the set of occurrences of a in S is {[l1, r1), [l2, r2), ..., [ln, rn)}, then Reg(ST(a)) is {Suffix(r1), Suffix(r2), ..., Suffix(rn)}. Let Right(a) = {r1, r2, ..., rn}; then Reg(ST(a)) is completely determined by Right(a).
For two substrings a, b ∈ Fac, if Right(a) = Right(b) then ST(a) = ST(b). So a state s consists of all the strings whose Right set is Right(s). Given any r ∈ Right(s), a substring is determined as soon as its length is given. For a fixed Right set it is easy to prove that if the lengths l and r are both suitable, then every length m with l ≤ m ≤ r is suitable as well. So the suitable lengths form an interval, and we write the interval of s as [Min(s), Max(s)].
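The grouping of substrings by Right set can be checked by brute force on a short string. The sketch below is only for illustration and is far too slow for real input; the function names rightSet, classesByRight and lengthsFormInterval are made up for this example. End positions are exclusive, matching the S[l, r) convention above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Right(a): end positions r of all occurrences S[l, r) of a in S.
std::vector<int> rightSet(const std::string& S, const std::string& a) {
    std::vector<int> right;
    for (std::size_t l = 0; l + a.size() <= S.size(); ++l)
        if (S.compare(l, a.size(), a) == 0)
            right.push_back(static_cast<int>(l + a.size()));
    return right;
}

// Groups all non-empty substrings of S by their Right set.
// Each group corresponds to one non-initial state of the minimal automaton.
std::map<std::vector<int>, std::set<std::string>> classesByRight(const std::string& S) {
    std::map<std::vector<int>, std::set<std::string>> classes;
    for (std::size_t l = 0; l < S.size(); ++l)
        for (std::size_t len = 1; l + len <= S.size(); ++len) {
            std::string a = S.substr(l, len);
            classes[rightSet(S, a)].insert(a);
        }
    return classes;
}

// True if the group holds exactly one string per length and the lengths
// form a contiguous interval [Min, Max].
bool lengthsFormInterval(const std::set<std::string>& members) {
    assert(!members.empty());
    std::set<std::size_t> lengths;
    for (const std::string& a : members) lengths.insert(a.size());
    std::size_t lo = *lengths.begin(), hi = *lengths.rbegin();
    return lengths.size() == members.size() && hi - lo + 1 == lengths.size();
}
```

For S = "aabbab" this produces 8 groups; for instance "bb", "abb" and "aabb" all have Right set {4}, so they share one state with [Min, Max] = [2, 4].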
Properties of substrings: Since every substring is contained in some state of the SAM, a string a is a substring of S if and only if ST(a) ≠ null. So we can use the SAM to solve the substring decision problem, and also to find the number of occurrences of the substring, which is the size of the Right set of its state. Storing the Right set directly in every state takes too much space. Observe that the Right set of a state is the union of the Right sets of all its children in the parent tree and, going further, the union of the Right sets of all leaf nodes among its descendants in the parent tree (a non-leaf state that corresponds to a prefix of S additionally contributes its own end position). So if the nodes are ordered by DFS order, the Right set of a state is the union of the Right sets of the leaf nodes in one contiguous interval, and we can quickly find all occurrences of a substring. DFS order of a tree: the nodes of every subtree form an interval.
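The DFS-order remark can be sketched as follows. This is a generic illustration on an arbitrary rooted tree given as a parent array, not the article's code; the struct name DfsOrder and its members tin and tout are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// DFS numbering of a rooted tree given as a parent array (parent[root] == -1).
// Afterwards the subtree of v is exactly the set of nodes u with
// tin[v] <= tin[u] < tout[v], i.e. one contiguous interval.
struct DfsOrder {
    std::vector<int> tin, tout;

    explicit DfsOrder(const std::vector<int>& parent)
        : tin(parent.size(), 0), tout(parent.size(), 0) {
        std::vector<std::vector<int>> children(parent.size());
        int root = -1;
        for (std::size_t v = 0; v < parent.size(); ++v) {
            if (parent[v] < 0) root = static_cast<int>(v);
            else children[parent[v]].push_back(static_cast<int>(v));
        }
        assert(root >= 0);  // the input must contain a root

        int timer = 0;
        // Iterative DFS; each stack entry is (node, index of next child).
        std::vector<std::pair<int, std::size_t>> stack;
        tin[root] = timer++;
        stack.emplace_back(root, 0);
        while (!stack.empty()) {
            int v = stack.back().first;
            std::size_t i = stack.back().second;
            if (i < children[v].size()) {
                ++stack.back().second;
                int c = children[v][i];
                tin[c] = timer++;
                stack.emplace_back(c, 0);
            } else {
                tout[v] = timer;  // all descendants are numbered now
                stack.pop_back();
            }
        }
    }

    bool inSubtree(int v, int u) const {
        return tin[v] <= tin[u] && tin[u] < tout[v];
    }
};
```

If the end positions of the leaves are stored in an array indexed by tin, the Right set of a state v can then be read from the slice [tin[v], tout[v]) instead of being stored in every state.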
Linear construction algorithm: Our construction algorithm is online, that is, it adds characters from left to right and builds the SAM as it goes. The algorithm is much simpler than suffix tree construction, although it may not be easy to understand. Let us first go back over the properties:
Review of definitions and properties: state s, transition trans, initial state init, end-state set end, parent string S, and the suffix automaton SAM of S (short for Suffix Automaton). Right(str) denotes the set of end positions of all occurrences of str in the parent string S.

A state s represents all substrings that have the same Right set, namely Right(s).
Parent(s) denotes the state x such that Right(s) is a proper subset of Right(x) and the size of Right(x) is the smallest.
The Parent function defines a tree structure called the parent tree.
A Right set together with a length determines a substring.
For a state s, the substring lengths that are valid for Right(s) form an interval [Min(s), Max(s)].
Max(Parent(s)) = Min(s) - 1.
The number of states and the number of edges of the SAM are both O(N).
trans(s, ch) == null means that no edge labeled ch leaves s.
Consider a state s with Right(s) = {r1, r2, ..., rn}. If there is an edge s → t labeled c, consider the Right set of t: since one character is appended, only those ri in Right(s) with S[ri] = c qualify. So the Right set of t is {ri + 1 | S[ri] = c}.
So if s has an outgoing edge labeled x, then Parent(s) must have one as well.
Also, letting f = Parent(s), Right(trans(s, c)) ⊆ Right(trans(f, c)).
An obvious corollary is that Max(t) > Max(s).
We add one character at a time and update the current SAM so that it becomes the SAM that includes the new character. Let the current string be T, the new character x, and the length of T be L: SAM(T) → SAM(Tx).
This adds some new substrings, all of which are suffixes of Tx. Each is a suffix of T with x appended, so we consider all the nodes that represent suffixes of T (that is, whose Right set contains L): v1, v2, v3, ...
There must exist a node p = ST(T) with Right(p) = {L}. Since the Right sets of v1, v2, ..., vk all contain L, these nodes are exactly p and its ancestors in the parent tree, and they can be obtained with the Parent function. When we add the character x, let np denote ST(Tx); then Right(np) = {L + 1}. Order the nodes from descendant to ancestor as v1 = p, v2, ..., vk = root. Consider one of them, v, with Right set {r1, r2, ..., rn = L}. Appending the new character x to it forms a new state nv, and only those ri with S[ri] = x qualify.
We also know that if v has no outgoing edge labeled x (ignoring rn for now), then no ri in the Right set of v satisfies this requirement.
Because the Right sets of v1, v2, v3, ... grow progressively, if vi has an outgoing edge labeled x then vi+1 must have one too.
For a v with no outgoing edge labeled x, only rn in its Right set qualifies, so by the transition rule mentioned earlier we give it an edge labeled x that leads to np.
Let vp be the first of v1, v2, ..., vk that has an edge labeled x.
Consider the Right set of vp, {r1, r2, ..., rn}, and let trans(vp, x) = q. Then the Right set of q is {ri + 1 | S[ri] = x} (note that this is the situation before the update, so rn is not counted).
Note: we cannot always insert L + 1 directly into the Right set of q. If Max(q) = Max(vp) + 1 we can. Otherwise the longer strings in q do not end at L + 1, so we split off a new node nq with Max(nq) = Max(vp) + 1 that takes over the strings of q which do.
Next consider the node nq. The end position L + 1 plays no part in its outgoing transitions, so trans(nq, *) is the same as the original trans(q, *); simply copy it.
Next, now that the node nq exists, the edges that used to lead to q have to be dealt with.
Recall: v1, v2, ..., vk are the nodes whose Right set contains L, ordered from descendant to ancestor, and vp is the first of them that has an edge labeled x, where x is the character added in this round.
vp, ..., vk all have an edge labeled x, and the Right set of the node that is reached grows as the Right set of the starting node grows. So only a contiguous run vp, ..., ve originally reaches the node q through the edge labeled x (q = trans(vp, x)).
Since q is replaced by nq here, we simply set trans(*, x) of vp, ..., ve to nq.
Review of each stage:
Let the current string be T and the character to add be x.
Let p = ST(T), the node with Right(p) = {length(T)}.
Create a new node np = ST(Tx) with Right(np) = {length(T) + 1}.
For every ancestor v of p (including p) that has no edge labeled x, set trans(v, x) = np.
Find the first ancestor vp of p that has an edge labeled x. If there is no such vp, set Parent(np) = root and end the stage.
Let q = trans(vp, x).
If Max(q) = Max(vp) + 1, set Parent(np) = q and end the stage.
Otherwise, create a new node nq with Max(nq) = Max(vp) + 1 and trans(nq, *) = trans(q, *).
Parent(nq) = Parent(q) (its previous value)
Parent(q) = nq
Parent(np) = nq
For every ancestor v of p with trans(v, x) == q, change trans(v, x) to nq.
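The stage reviewed above can be written down compactly. The following is a minimal array-based sketch under a few assumptions of its own: the alphabet is 'a'..'z', states are indices into a vector with -1 standing for null, and the names SuffixAutomaton, extend, walk and isSuffix are chosen for this example rather than taken from the article.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

struct SuffixAutomaton {
    struct State {
        std::array<int, 26> trans;  // trans(s, ch), -1 for null
        int parent = -1;            // Parent(s)
        int max = 0;                // Max(s)
        State() { trans.fill(-1); }
    };
    std::vector<State> st;  // st[0] is root
    int last = 0;           // p = ST(T) for the string T read so far

    explicit SuffixAutomaton(const std::string& s) : st(1) {
        for (char ch : s) extend(ch - 'a');
    }

    int newState(int max) {
        st.emplace_back();
        st.back().max = max;
        return static_cast<int>(st.size()) - 1;
    }

    // One stage: SAM(T) -> SAM(Tx).
    void extend(int x) {
        assert(x >= 0 && x < 26);
        int p = last;
        int np = newState(st[p].max + 1);
        // Ancestors of p without an edge labeled x get an edge to np.
        for (; p != -1 && st[p].trans[x] == -1; p = st[p].parent) st[p].trans[x] = np;
        if (p == -1) {
            st[np].parent = 0;  // no vp exists
        } else {
            int q = st[p].trans[x];  // here p is vp
            if (st[q].max == st[p].max + 1) {
                st[np].parent = q;
            } else {
                int nq = newState(st[p].max + 1);
                st[nq].trans = st[q].trans;  // trans(nq, *) = trans(q, *)
                st[nq].parent = st[q].parent;
                st[q].parent = nq;
                st[np].parent = nq;
                for (; p != -1 && st[p].trans[x] == q; p = st[p].parent) st[p].trans[x] = nq;
            }
        }
        last = np;
    }

    // ST(str); returns -1 (null) when str is not a substring.
    int walk(const std::string& str) const {
        int cur = 0;
        for (char ch : str) {
            cur = st[cur].trans[ch - 'a'];
            if (cur == -1) return -1;
        }
        return cur;
    }

    // The end states are last and all of its ancestors in the parent tree.
    bool isSuffix(const std::string& str) const {
        int s = walk(str);
        for (int v = last; s != -1 && v != -1; v = st[v].parent)
            if (v == s) return true;
        return false;
    }
};
```

For "aabbab" this builds 9 states; walk("bba") is non-null because "bba" is a substring, while isSuffix("bb") is false.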
Illustration: (From: http://hi.baidu.com/myidea/item/142c5cd45901a51820e25039)
We are going to build the string "aabbab" into a suffix automaton.
1. Start with only one empty node, root.

2. Now add the character 'a'. This is the suffix automaton of the string "a":

3. Add the character 'a'. This is the suffix automaton of the string "aa":

4. Add the character 'b'. This is the suffix automaton of the string "aab":

5. Add the character 'b'. This is the suffix automaton of the string "aabb":

(Incorrect)
Every node on the parent chain of the last node is an end state, so node 3 would be an end state and this graph would also accept "ab" as a suffix. But "ab" is not a suffix of "aabb", so we need an auxiliary node and have to build:

(Correct)
When we reach node 0 (S), we find that its transition on 'b' is already occupied by a node. So we create a new node 5, copy all the information of node 3 into it (including the parent), and then update its len value. Here that is node[5]->len = node[5]->parent->len + 1; in general it is one more than the len of the node whose 'b' transition was occupied, which in this example is node 0, also the parent of node 5. Node 5 can therefore represent the empty suffix (the string of node 0) plus the character 'b', that is, the suffix "b". Node 3 is then only an intermediate state, so the transitions of accepting nodes that used to point to node 3 are redirected to node 5. The accepting nodes that point to node 3 must be ancestors of the current node 0 (S), so they can be updated directly by walking up the parent links. After that, node 5 and its ancestors join the current accepting states. To repeat one point: the strings represented by a node and by its parent share a common suffix, and the length decreases toward the parent. Because node 5 is an accepting state, its parent is an accepting state too. Likewise, any node that shares a suffix with the current node and has a smaller length must be an ancestor of the current node; for example, the nodes that share a suffix with node 5 and are shorter than node 5 must be ancestors of node 5 and must be accepting states. To maintain this property, we redefine the parent of node 3 to be node 5. That covers the basic idea.
Then insert 'a' and 'b'; the suffix automata are as follows:

Generalized suffix automata:

The traditional suffix automaton solves matching problems on a single main string, while the generalized suffix automaton can be used to solve matching problems on multiple main strings.

How do we build multiple main strings into a generalized suffix automaton? First build a suffix automaton from one main string. After it is finished, reset last so that last = root, and start the next string from the root node. If the next state does not exist, create a new node according to the rules of the suffix automaton.

If the next state already exists, we can move directly to that state. Arriving at that state means that all suffixes of the string matched so far are in that state, its parent node, the parent of its parent, and so on up to root. So we need to update that state, its parent, the parent of its parent, and so on up to root. What is updated depends on the problem: to count the occurrences of a string, do cnt++; to find the positions where a string occurs, store the position in an array in the node.
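Moving directly to an existing next state q is only safe when Max(q) = Max(last) + 1; otherwise q also holds longer strings and Max values computed later from it would be too large. A commonly used refinement, sketched here as an assumption rather than as the article's method, splits q in that case exactly as in the ordinary construction. The names GeneralizedSam, split, extend, insert and contains are hypothetical, and the alphabet is assumed to be 'a'..'z'.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

struct GeneralizedSam {
    struct State {
        std::array<int, 26> trans;
        int parent = -1;
        int max = 0;
        State() { trans.fill(-1); }
    };
    std::vector<State> st;  // st[0] is root, -1 stands for null

    GeneralizedSam() : st(1) {}

    int newState(int max) {
        st.emplace_back();
        st.back().max = max;
        return static_cast<int>(st.size()) - 1;
    }

    // Splits q = trans(p, x) and returns nq with Max(nq) = Max(p) + 1.
    int split(int p, int x) {
        int q = st[p].trans[x];
        int nq = newState(st[p].max + 1);
        st[nq].trans = st[q].trans;
        st[nq].parent = st[q].parent;
        st[q].parent = nq;
        for (; p != -1 && st[p].trans[x] == q; p = st[p].parent) st[p].trans[x] = nq;
        return nq;
    }

    // Appends character x after state last and returns the new last.
    int extend(int last, int x) {
        assert(x >= 0 && x < 26);
        int q = st[last].trans[x];
        if (q != -1)  // next state exists: reuse it, splitting first if needed
            return st[q].max == st[last].max + 1 ? q : split(last, x);
        int np = newState(st[last].max + 1);
        int p = last;
        for (; p != -1 && st[p].trans[x] == -1; p = st[p].parent) st[p].trans[x] = np;
        int par = 0;
        if (p != -1) {
            int t = st[p].trans[x];
            par = (st[t].max == st[p].max + 1) ? t : split(p, x);
        }
        st[np].parent = par;  // assigned after split, which may reallocate st
        return np;
    }

    void insert(const std::string& s) {
        int last = 0;  // reset last = root for every main string
        for (char ch : s) last = extend(last, ch - 'a');
    }

    int walk(const std::string& t) const {
        int cur = 0;
        for (char ch : t) {
            cur = st[cur].trans[ch - 'a'];
            if (cur == -1) return -1;
        }
        return cur;
    }

    bool contains(const std::string& t) const { return walk(t) != -1; }
};
```

For example, after insert("ab") the strings "ab" and "b" share one state with Max = 2; a following insert("b") splits it, so that the state reached by "b" has Max = 1.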

For example: (given n main strings of length 100 and m target strings of length k, report where each target string occurs in the main strings.)

The code is quite general. If the target strings have arbitrary length, delete every place in the code that uses k. If you want the number of occurrences of a target string, add a counter cnt to the struct and change the xx.push_back() below to cnt++. If the main strings have arbitrary length, you can use if (last == root && weizhi != 0) to record the line number. Other uses are the same as for the ordinary suffix automaton; there are many examples on the Internet, so they are not described here.

Advantages: there is no longer any need to concatenate multiple main strings in order to build a suffix automaton, the space complexity is greatly reduced, and the code is not hard to write, only a little complex to understand.

#include <algorithm>
#include <cstdio>
#include <cstring>
#include <vector>
using namespace std;

/* weizhi: current position, kept global so add() can record it;
   n: number of DNA lines; k: the k value; m: number of k-mer queries */
int weizhi, n, k, m;
int f[30]; /* maps A, C, G, T to 0, 1, 2, 3, which reduces the cost of a single node */

struct Complex { /* a node */
    Complex *tranc[4], *father; /* father links the parent tree, tranc is the next-state lookup */
    int len;                    /* Max of this state's Right set */
    vector<int> ans;            /* stored answers */
} *last, *root, none, *nil = &none;

Complex *newcomplex(int len) /* allocate a new node */
{
    Complex *c = new Complex;
    c->len = len;
    fill(c->tranc, c->tranc + 4, nil);
    c->father = nil;
    return c;
}

void init() /* initialize */
{
    f['A' - 'A'] = 0, f['C' - 'A'] = 1, f['G' - 'A'] = 2, f['T' - 'A'] = 3; /* initialize the mapping f */
    root = last = newcomplex(0); /* create the node root and initialize last */
}

inline void add(int c) /* append character c after last in the generalized SAM */
{
    Complex *np, *p = last;
    if (last->tranc[f[c]] == nil) {
        np = newcomplex(last->len + 1);
        for (; p != nil && p->tranc[f[c]] == nil; p = p->father) p->tranc[f[c]] = np;
        if (p == nil) np->father = root;
        else {
            Complex *q = p->tranc[f[c]];
            if (q->len == p->len + 1) np->father = q;
            else {
                Complex *nq = newcomplex(p->len + 1);
                memcpy(nq->tranc, q->tranc, sizeof(nq->tranc));
                nq->father = q->father;
                np->father = q->father = nq;
                if (nq->len >= k && nq->father->len < k) nq->ans = q->ans;
                for (; p != nil && p->tranc[f[c]] == q; p = p->father) p->tranc[f[c]] = nq;
            }
        }
    } else {
        /* The next state already exists: move to it directly. Note that this
           shortcut assumes its len equals last->len + 1; otherwise the state
           would have to be split as in the nq step above. */
        np = last->tranc[f[c]];
    }
    last = np;
    /* record the position in the state whose length interval contains k */
    for (; np != root; np = np->father) {
        if (np->len < k) break;
        if (np->father->len < k) np->ans.push_back(weizhi);
    }
}

inline int ask(const char *s) /* look up one k-mer and print its positions */
{
    Complex *now = root;
    int length = (int)strlen(s);
    for (int i = 0; i < length; ++i)
        if (now->tranc[f[s[i] - 'A']] != nil) now = now->tranc[f[s[i] - 'A']];
        else return 0;
    int p = 0;
    for (size_t i = 0; i < now->ans.size(); i++) {
        if (now->ans[i] % 100 >= length - 1) {
            /* The answer is printed here; change the output format here if needed. */
            printf("No.%d:\tx:%d\t,y:%d\n", ++p, now->ans[i] / 100 + 1, now->ans[i] % 100 + 2 - k);
        }
    }
    return (int)now->ans.size();
}

void scan_dna() /* read n lines of 100 characters and insert them */
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < 100; i++) {
            int c = getchar(); /* int, so that EOF can be detected */
            if (c == EOF) break;
            if (c >= 'A' && c < 'Z') {
                add(c - 'A');
                weizhi++;
            } else { /* line break or other separator: start a new main string */
                i--;
                if (last != root && weizhi % 100) weizhi += 100 - weizhi % 100;
                last = root;
            }
        }
    }
}

void scan_k_mer_print()
{
    // printf("%lld", (long long)sizeof(Complex)); /* node memory usage test */
}

int K() /* read n and k */
{
    int kk;
    printf("Please input N and k:\n");
    scanf("%d%d", &n, &kk);
    return kk;
}

void scan_m_s() /* manual input of the k-mer queries */
{
    freopen("CON", "r", stdin); /* "CON" is the Windows console; freopen closes the old stream itself */
    printf("Please input m:\n");
    scanf("%d", &m);
    char ss[105];
    for (int i = 0; i < m; ++i) {
        printf("Please input the %dth k-mer index of length %d:\n", i + 1, k);
        scanf("%104s", ss);
        printf("%s", ss);
        int last_ans = ask(ss);
        printf("all:%d\n", last_ans);
    }
}

void autoscan_m_s() /* read the k-mer queries from the current input file */
{
    scanf("%d", &m);
    char ss[105];
    for (int i = 0; i < m; ++i) {
        scanf("%104s", ss);
        printf("%s", ss);
        int last_ans = ask(ss);
        printf("all:%d\n", last_ans);
    }
}

int main()
{
    k = K();                         /* read in the k value */
    freopen("F1out.txt", "r", stdin); /* test data file */
    init();                          /* initialize */
    scan_dna();                      /* read the DNA sequences and build the generalized suffix automaton */
    printf("The Establishment of success!\n");
    // freopen("Out_2.txt", "w", stdout); /* uncomment this line to write the answers to a file */
    /* For console input of the k-mer index use scan_m_s() and comment out autoscan_m_s(). */
    /* For file input of the k-mer index use autoscan_m_s() and comment out scan_m_s(). */
    scan_m_s(); /* manual input */
    // autoscan_m_s(); /* file input; make sure the file format is correct */
    return 0;
}

Compressed suffix automata:

The idea is to compress a chain of nodes that has no branches into a single node; the next pointer of the compressed node's parent should record the state of the first node inside this compressed node.

I have not written this yet; the implementation will be posted once it is done.

Thanks to the great clj for the guidance of his ppt! Thanks to little Peng Yu, yyn and fork elder sister for their guidance! Thanks to modeling for making me think deeply!


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
