[Repost] Suffix automaton

Source: Internet
Author: User

Original address: http://blog.sina.com.cn/s/blog_8fcd775901019mi4.html

After reading this I feel I can finally understand it! It also gives you a feel for what kind of data structure the suffix automaton is...

The reposter's own remarks appear in square brackets (italics in the original post). [They might help you understand, but they might also get in the way of your own understanding, so... read them only if you get stuck.]

Commonly used string processing tools:

1. Whole-word index: sorting + binary search, or a hash table. These solve whole-word matching but do not support prefix search. Combined with RK (Rabin-Karp), a hash table can also handle multi-pattern search when all the patterns have the same fixed length. Overall, a whole-word index does not perform well for substring search. It does have one advantage: it takes little space.

2. Prefix index: KMP, trie, AC automaton. The AC automaton can be seen as a hybrid of KMP and the trie: KMP handles a single pattern and has fail pointers, whereas the trie handles multiple patterns but has no fail pointers. Cross the two and you get the AC automaton, which supports multi-pattern matching with fail pointers and can find all the patterns in a single O(n) scan of the text, which is really powerful.

3. Suffix index: suffix tree, suffix array. A suffix index is usually built over a single string and exposes that string's internal structure (lexicographic order, longest common prefixes, etc.). Of course, you can also concatenate several strings, index the result, and then find the structure shared between those strings.

For indexing a single string, the suffix array is already very powerful and solves a lot of problems. The suffix automaton, on the other hand, I had never heard of; a search online turned up little material on it, so there did not seem to be much to say. Apparently most of what it can do the suffix array can do too, and what it cannot do the suffix array handles anyway. (OK, I'm exaggerating.)

Still, the suffix automaton has a very attractive side: the code is under 50 lines, temptingly short, and it is an O(n) online algorithm with O(1) amortized incremental construction. By contrast, the suffix array's biggest weakness is that it cannot be built incrementally. Some offline tricks work around this, but they are only useful in contests; in a real project, what needs to be online has to be online. As a novice, my taste in algorithms has always been "simple and useful is beautiful", and the suffix automaton fits that, so I took a little time to learn it. This article is mainly a personal summary; for details, see CLJ's slides.

[I very much agree with the following paragraph about data structures.]

We know that a data structure is a pair: data + operations. The soul of a data structure is that its operations keep certain properties invariant while accomplishing tasks efficiently. These properties fall into two kinds: 1. functional properties; 2. performance properties. Take the balanced binary tree as an example: the functional property is the ordering between the left and right subtrees, which supports lookup, insertion and deletion by key; the performance property is keeping the depths of the left and right subtrees balanced, which gives an O(log n) bound per operation. Maintaining the key properties while operations change the structure is, in my view, the core of data structure design.

Now let's look at the suffix automaton. What follows is mainly a personal summary.

What are the functional properties of the suffix automaton?

The suffix automaton is only interested in suffixes. For a string str, let SAM(str) be its suffix automaton; then SAM(str) accepts exactly the suffixes of str. That is, every suffix of str has a valid sequence of transitions in SAM(str) that ends in a final state. As a bonus, the suffix automaton recognizes not only the suffixes but all substrings of str. [The difference: a suffix must end in a final state, while a substring may stop anywhere along the way.]

What are the performance properties of the suffix automaton?

If str has length n, then SAM(str) has O(n) states. Since there are only n suffixes, O(n) states seems reasonable: one state per suffix. But don't forget that the suffix automaton recognizes all substrings of str as well, and there are O(n^2) of those. Now it looks unreasonable: how can O(n) states recognize O(n^2) strings? In a prefix index such as a trie, states and prefixes correspond one to one, so the number of trie states equals the number of distinct prefixes, which is at most the total length of the strings and can be treated as O(n). The suffix automaton cannot do the same: with a one-to-one correspondence the number of states would become O(n^2). So to get O(n) states, some substrings must be mapped to the same state, so that states are reused. That raises new questions: which strings should share a state? And can this work at all?

Key concepts and main observations:

1. Right set: for any substring s of str, Right(s) is the set of end positions of all occurrences of s in str. For example, if s occurs k times in str, with s = str[l1, r1) = str[l2, r2) = ... = str[lk, rk), then Right(s) = {r1, r2, ..., rk}.

2. States and their meaning: strings s1 and s2 are mapped to the same state if and only if they have the same Right set. That is, state(s1) = state(s2) if and only if Right(s1) = Right(s2). (This is how states get reused.)
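A brute-force sketch of definitions 1 and 2 in Python. The function names `right_sets` and `states_by_right` are made up for illustration and do not come from the original post. It enumerates every substring, so it is only meant for checking small examples, not as the way a suffix automaton is actually built.

```python
from collections import defaultdict

def right_sets(s):
    """Map every distinct non-empty substring of s to its set of end positions.

    Positions follow the half-open convention s[l:r], so r ranges over 1..len(s).
    """
    right = defaultdict(set)
    n = len(s)
    for l in range(n):
        for r in range(l + 1, n + 1):
            right[s[l:r]].add(r)
    return right

def states_by_right(s):
    """Group substrings that share a Right set; each group plays the role of one state."""
    groups = defaultdict(list)
    for sub, ends in right_sets(s).items():
        groups[frozenset(ends)].append(sub)
    return groups

if __name__ == "__main__":
    groups = states_by_right("ABCDCDD")
    for ends, subs in sorted(groups.items(), key=lambda kv: sorted(kv[0])):
        print(sorted(ends), sorted(subs, key=len))
```

On "ABCDCDD" this puts "ABCD" and "BCD" in the group {4}, "CD" in {4, 6} and "D" in {4, 6, 7}.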

3. Property of a state (1), the Right set: since strings mapped to the same state have the same Right set, the Right set characterizes the state as well as the strings. For any state x of SAM(str), write Right(x) for that set. The Right set can also be seen as a set of suffixes [namely suffix(r), the suffix that starts at position r, for each r in Right(x)] that can be accepted from x through its successor states. Conversely, any path from x to a final state spells a suffix of str, and that suffix starts at a position in Right(x). In this sense, position r belongs to Right(x) if and only if suffix(r) can be accepted from x through its successor states.

4. Property of a state (2), the strings: Substr(x) denotes the set of all substrings that lead to state x.

5. The relationship between Right(x) and Substr(x):

Substr(x) -> Right(x): given any string s in Substr(x), Right(x) is determined. This is immediate [because Right(x) is by definition the Right set of every string in Substr(x)];

Right(x) -> Substr(x): this direction is less direct. Given some position r in Right(x), how do we determine Substr(x)? The answer is string length! If we know a length len, then str[r - len, r) belongs to Substr(x), and if we know all the lengths, we can determine the whole of Substr(x)!

It can be shown that the lengths of the strings in Substr(x) form exactly one contiguous interval, which we write as [min(x), max(x)].

  [How to prove it?]

Take the string "ABCD" as a first example.

The Right sets of "ABCD", "BCD", "CD" and "D" are all equal, so these strings should be mapped to one state.

So strings with equal Right sets are in fact deeply connected... They are related by containment, and more precisely each one is a suffix of the longer ones!

Now generalize: if you already know the longest string in a state, then clearly the Right set of every suffix of that string contains the Right set of the string itself [this is obvious if you picture the string as a line in your head].

Of course, from some suffix onward there may be extra occurrences. For example, in "ABCDCD", "ABCD" and "BCD" have the same Right set, but "CD" and "D" occur in one more place, so they belong to a different state. As for why the lengths are contiguous, that is also easy to see: shortening a suffix can only enlarge its Right set, so once the Right set has grown it can never shrink back to the old one.

And if you think carefully about the property Right(x) -> Substr(x), you will find that the strings sharing a Right set are exactly the longest string and some of its suffixes, and nothing else! Since the end position is fixed, there are only that many candidates. [This feels verbose, but the author only makes this point much later; in fact it can already be seen here.]
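The contiguous-interval claim can be checked by brute force on small strings. The sketch below is only an illustration under that assumption; `length_intervals` and `all_contiguous` are hypothetical helper names, and the enumeration is far too slow for real inputs.

```python
from collections import defaultdict

def length_intervals(s):
    """Return {Right set: sorted list of substring lengths} for the string s."""
    right = defaultdict(set)
    for l in range(len(s)):
        for r in range(l + 1, len(s) + 1):
            right[s[l:r]].add(r)          # end positions, half-open convention
    lengths = defaultdict(set)
    for sub, ends in right.items():
        lengths[frozenset(ends)].add(len(sub))
    return {ends: sorted(ls) for ends, ls in lengths.items()}

def all_contiguous(s):
    """True if the lengths in every class fill an interval [min, max] with no gaps."""
    return all(ls == list(range(ls[0], ls[-1] + 1))
               for ls in length_intervals(s).values())
```

For "ABCDCDD", the class with Right set {4} has lengths [3, 4] ("BCD", "ABCD"), the class {4, 6} has [2] ("CD") and the class {4, 6, 7} has [1] ("D").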

6. Relationship between states: for two different states x and y, either Right(x) and Right(y) are disjoint, or one is a proper subset of the other. (This property guarantees that the total number of states is linear.)

  [How to prove it?]

Suppose Right(x) = {a1, ..., ak1} and Right(y) = {b1, ..., bk2}.

If some ai = bj = r, take a string a in Substr(x) and a string b in Substr(y). Since both strings end at r, either a is a suffix of b or b is a suffix of a [a != b]; without loss of generality, let a be a suffix of b.

Then a occurs at every end position where b occurs, so Right(x) contains Right(y); and since x != y, the containment is proper.

So if two Right sets intersect, one must contain the other.
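Property 6 says the Right sets form a family in which any two members are either disjoint or nested. A small brute-force check in Python, with hypothetical helper names `right_family` and `disjoint_or_nested`, might look like this; it is a sanity check on tiny inputs, not a proof.

```python
from collections import defaultdict
from itertools import combinations

def right_family(s):
    """Return the set of distinct Right sets over all non-empty substrings of s."""
    right = defaultdict(set)
    for l in range(len(s)):
        for r in range(l + 1, len(s) + 1):
            right[s[l:r]].add(r)
    return {frozenset(ends) for ends in right.values()}

def disjoint_or_nested(family):
    """True if every pair of distinct sets is disjoint or one properly contains the other."""
    for a, b in combinations(family, 2):
        if a & b and not (a < b or b < a):
            return False
    return True
```

The second function is deliberately independent of strings, so it can also be fed a hand-made counterexample such as {1, 2} and {2, 3}, which overlap without nesting.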

7. Parent tree: by property 6, for each state x we can define parent(x) as follows: y = parent(x) if and only if Right(y) is the smallest of all the Right sets that properly contain Right(x); if no such y exists, parent(x) is the initial state. Note that the parent is not a predecessor in the transition graph: a state in a suffix automaton may have several predecessors but has only one parent. Viewed through Right sets, walking the parent tree from the leaves to the root is a process of merging Right sets again and again. So the essential meaning of pointing one state's parent pointer at another is: extend a Right set by adding one set into the other! At the same time, thanks to the parent tree we do not have to store a Right set for each state; the hierarchy represents them implicitly and keeps the space complexity linear.

8. Relationship between parent(s) and s: max(parent(s)) = min(s) - 1; and if trans(s, ch) != null, then trans(parent(s), ch) != null.

  [How to explain it?]

parent(s) has the smallest Right set that contains Right(s), that is, the Right set at the point where the chain of suffixes breaks.

For example, "ABCDCDD" has three such Right sets, {4}, {4,6} and {4,6,7}: "ABCD" and "BCD" belong to {4}, "CD" belongs to {4,6}, and "D" belongs to {4,6,7}. Each is the first break found as the string is shortened. Hence max(parent(s)) = min(s) - 1.

trans(s, ch) is the transition out of s on character ch. Think of the suffixes that start at the positions in the Right set: since Right(parent(s)) is larger, whatever character can follow s can also follow parent(s). So if s has a transition on ch, then parent(s) has one too.
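The first half of property 8 can also be checked by brute force. In this Python sketch the names `classes`, `parent_of` and `check_property_8` are illustrative, and the initial state (the empty string) is assumed to have max = 0; only the length relation is tested, not the claim about transitions.

```python
from collections import defaultdict

def classes(s):
    """Return {Right set: (min_len, max_len)} for every class of substrings of s."""
    right = defaultdict(set)
    for l in range(len(s)):
        for r in range(l + 1, len(s) + 1):
            right[s[l:r]].add(r)
    span = {}
    for sub, ends in right.items():
        key = frozenset(ends)
        lo, hi = span.get(key, (len(sub), len(sub)))
        span[key] = (min(lo, len(sub)), max(hi, len(sub)))
    return span

def parent_of(x, span):
    """Smallest Right set properly containing x, or None for the initial state."""
    supersets = [y for y in span if x < y]
    return min(supersets, key=len) if supersets else None

def check_property_8(s):
    """Check max(parent(x)) == min(x) - 1 for every class x of s."""
    span = classes(s)
    for x, (lo, _) in span.items():
        p = parent_of(x, span)
        parent_max = span[p][1] if p is not None else 0  # assumed: initial state has max 0
        if parent_max != lo - 1:
            return False
    return True
```

Choosing the superset with the fewest elements is safe here because, by property 6, all the supersets of one Right set are nested inside each other, so their sizes are distinct.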

9. Transition characters and predecessors: if trans(x1, c1) = trans(x2, c2) = ... = trans(xk, ck) = x, then c1 = c2 = ... = ck, and the states x1, x2, ..., xk form a contiguous chain in the parent tree (each one is the parent of the one before it). xi is the bottom of this chain (the deepest descendant) if and only if step[xi] + 1 = step[x] (step is the label given to a node when it is created during construction)!

[How to prove it?]

The claim step[xi] + 1 = step[x] also needs a proof; for the moment I do not know how to prove it...

Understanding the construction algorithm of the suffix automaton:

What needs to be updated to go from SAM(T) to SAM(Tx)? What do we base the update on?

First of all, which properties do we need to maintain?

a) Transition validity: accept all suffixes of Tx, and make sure every substring of Tx has a valid sequence of transitions! (So this involves updating the transition table.)

b) State validity: the Right set of every state, old and new, satisfies the definition, that is, all substrings that lead to the same state have the same Right set. (This involves updating the parent chain.)

For a): the substrings of Tx = the substrings of T + the suffixes of Tx [the latter include the new x, the former do not], so we only need to make sure that every suffix of Tx has valid transitions. And the suffixes of Tx are obtained by appending the character x to each suffix of T, so we just need to find the states of the suffixes of T in SAM(T) and make sure each of those states has an x-transition in SAM(Tx)!

How do we find the states of the suffixes of T in SAM(T)? All suffixes of T share the occurrence position r = length(T), so the Right sets of their states have a non-empty intersection, and by property 6 these states must form a parent chain. So the simplest way to find them is to walk back up the parent chain! [Drawing a picture helps a lot here!]

Let the final states of SAM(T), that is, the states of its suffixes, be {v1, v2, ..., vk}. What can happen at a state p as we walk back?

One case is trans(p, x) = null: there is no x-transition, so we must add an x-transition from p to np, the new end state of SAM(Tx), while preserving property b).

And when q = trans(p, x) != null? Then there is already an x-transition, and we only need to preserve property b): expand Right(q), that is, point np's parent pointer at q, and we're done, right? Er... but there is a problem!

Let's first look at the predecessors that have a transition to q. By property 9, q has m predecessors p1, p2, ..., pm, and they satisfy parent[p1] = p2, parent[p2] = p3, ....

Now the question is, where is p in this chain?

If p = p1 (in which case step[p] + 1 = step[q]), the approach above is fine: every string that reaches q from p2, ..., pm is a suffix of a string that reaches q from p, so all of these strings also occur ending at position length(Tx), and setting Right(q) = Right(q) + {length(Tx)} is of course no problem! But if p != p1 we are in trouble: for any node pj that comes before p in the chain, one can show that the strings that reach q from pj do not occur ending at position length(Tx). If we still simply set Right(q) = Right(q) + {length(Tx)}, property b) fails at node q, because some strings lead to q but do not have that Right set! What to do? Notice that p splits the predecessors into two parts: for the part before p, the Right set of q must stay as it is, and for the part from p onward, it must be expanded. The simplest fix is to split q in two, one state for each part of the predecessors, and that is exactly what the construction algorithm does! In fact, the key to understanding this approach is property 9!
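The update from SAM(T) to SAM(Tx) described above can be sketched in Python as follows. This follows the commonly taught construction rather than reproducing CLJ's code; the class and method names (`SuffixAutomaton`, `extend`, `contains`, `is_suffix`) and the name `nq` for the split-off copy of q are illustrative choices, while `p`, `q`, `np`, `step`, `parent` and `trans` mirror the notation in the text.

```python
class SuffixAutomaton:
    """Online suffix automaton; state 0 is the initial state."""

    def __init__(self, text=""):
        self.trans = [{}]    # trans[v][ch] = target state
        self.parent = [-1]   # parent pointer in the parent tree (-1 above the root)
        self.step = [0]      # step[v] = max(v), length of the longest string in v
        self.last = 0        # state of the whole string read so far
        for ch in text:
            self.extend(ch)

    def _new_state(self, step, trans, parent):
        self.trans.append(dict(trans))
        self.parent.append(parent)
        self.step.append(step)
        return len(self.step) - 1

    def extend(self, x):
        """Go from SAM(T) to SAM(Tx)."""
        np = self._new_state(self.step[self.last] + 1, {}, -1)
        p = self.last
        # Case trans(p, x) = null: add the missing x-transition to np.
        while p != -1 and x not in self.trans[p]:
            self.trans[p][x] = np
            p = self.parent[p]
        if p == -1:
            self.parent[np] = 0
        else:
            q = self.trans[p][x]
            if self.step[p] + 1 == self.step[q]:
                # p is the bottom of q's predecessor chain: Right(q) may simply grow.
                self.parent[np] = q
            else:
                # Split q: nq takes over the strings reached from p and its ancestors.
                nq = self._new_state(self.step[p] + 1, self.trans[q], self.parent[q])
                self.parent[q] = nq
                self.parent[np] = nq
                while p != -1 and self.trans[p].get(x) == q:
                    self.trans[p][x] = nq
                    p = self.parent[p]
        self.last = np

    def _walk(self, s):
        v = 0
        for ch in s:
            v = self.trans[v].get(ch)
            if v is None:
                return None
        return v

    def contains(self, s):
        """True if s is a substring of the text read so far."""
        return self._walk(s) is not None

    def is_suffix(self, s):
        """True if s is a suffix: its state lies on the parent chain of `last`."""
        v = self._walk(s)
        t = self.last
        while v is not None and t != -1:
            if t == v:
                return True
            t = self.parent[t]
        return False
```

For example, `SuffixAutomaton("ABCDCDD").contains("DCD")` is true and `is_suffix("CD")` is false, since "CD" never ends at the last position. Final states are not stored here; they are recovered on demand by walking the parent chain from `last`, which keeps `extend` as short as the text suggests.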
