Suffix Automaton in Detail

Last Update:2018-07-25 Source: Internet

Author: User


Original article (in Russian): Suffix_automata

Suffix automaton

A suffix automaton (also called a directed acyclic word graph) is a powerful data structure that makes it possible to solve many string problems.

For example, a suffix automaton can be used to find all occurrences of one string in another, or to count the number of distinct substrings of a string, both in linear time.

Intuitively, a suffix automaton can be understood as a compact representation of all substrings of a given string. An important fact is that it stores this information in compressed form: for a string of length n it needs only O(n) space. It can also be built in O(n) time (if the alphabet size k is treated as a constant; otherwise in O(n log k)).

Historically, the linear size of the suffix automaton was first established in 1983 by Blumer et al., and the first linear-time construction algorithms followed in 1985-1986 (Crochemore, Blumer et al.). See the references at the end of the article for details.

In English the structure is called a "suffix automaton" (plural: "suffix automata"). The term "directed acyclic word graph" (abbreviated "DAWG") is also used.

Definition of the suffix automaton

Definition. The suffix automaton of a given string s is a minimal deterministic finite automaton that accepts all suffixes of the string s.

This definition is explained below:

· The suffix automaton is a directed acyclic graph in which vertices are states and edges are transitions between states.

· One state, t_0, is called the initial state; every other state must be reachable from it.

· Every transition in the automaton, that is, every edge, is labeled with a character. All transitions leaving a given state must have different labels. (On the other hand, a state need not have a transition for every character.)

· One or more states are marked as terminal states. If we walk from the initial state t_0 along any path to a terminal state and write out, in order, the labels of all edges on the path, the resulting string must be a suffix of s.

· Among all automata that satisfy the conditions above, the suffix automaton has the smallest number of vertices. (It is not required to have the smallest number of edges.)

Basic properties of the suffix automaton

The simplest, and at the same time the most important, property of the suffix automaton is that it contains information about all substrings of s. Namely, if we take any path that starts at the initial state t_0 and write out the labels of its edges, the resulting string is a substring of s. Conversely, every substring of s corresponds to some path that starts at the initial state t_0.

To simplify the explanations, we say that a substring corresponds to the path from the initial state whose edge labels spell out that substring. Conversely, we say that a path corresponds to the string spelled by the labels of its edges.

Each state of the suffix automaton is reached by one or more paths from the initial state. We say that the state corresponds to the set of strings spelled by these paths.

Examples of suffix automata

Here are the suffix automata for several simple strings.

The initial state is denoted t_0, and terminal states are marked with an asterisk (*).

s = ""

s = "a"

s = "aa"

s = "ab"

s = "aba"

s = "abb"

s = "abbb"

Building the suffix automaton in linear time

Before describing the construction algorithm, we need to introduce several new concepts and prove a few simple lemmas that are essential for understanding the suffix automaton.

End positions (endpos), their properties, and their connection to the suffix automaton

Consider any non-empty substring t of the string s. The end set endpos(t) is the set of all positions in s at which an occurrence of t ends.

We call two substrings t_1 and t_2 endpos-equivalent if their end sets coincide: endpos(t_1) = endpos(t_2). In this way, all non-empty substrings of s are divided into several classes according to endpos-equivalence.

It turns out that in the suffix automaton, endpos-equivalent substrings correspond to the same state. In other words, the number of states in the suffix automaton equals the number of endpos-equivalence classes over all substrings, plus one for the initial state. Each state corresponds to one or more substrings that share the same end set.
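The grouping described above can be tried out by brute force. The sketch below is only an illustration: the function name `endpos_classes` is made up for this example, positions are 0-based, and the quadratic enumeration is not part of the construction algorithm.

```python
from collections import defaultdict

def endpos_classes(s):
    """Group the non-empty substrings of s by their end-position sets (brute force)."""
    endpos = defaultdict(set)
    n = len(s)
    for i in range(n):
        for j in range(i, n):
            endpos[s[i:j + 1]].add(j)      # the occurrence s[i..j] ends at index j
    classes = defaultdict(list)
    for t, ends in endpos.items():
        classes[frozenset(ends)].append(t)
    return classes

classes = endpos_classes("abcbc")
for ends, subs in sorted(classes.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(ends), sorted(subs, key=len))
# One state per class, plus the initial state t_0.
print("predicted number of states:", len(classes) + 1)
```

For "abcbc" this yields 7 classes, so the statement above predicts an automaton with 8 states.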

We take this statement as an assumption and describe a linear-time algorithm for building a suffix automaton based on it. As we will see shortly, the resulting automaton satisfies all the required properties except minimality (minimality follows from Nerode's theorem; see the references).

We give some simple but important facts about the end set.

Lemma 1. Two non-empty substrings u and w (length(u) <= length(w)) are endpos-equivalent if and only if u occurs in s only as a suffix of w.

The proof is obvious.

Lemma 2. Consider two non-empty substrings u and w (length(u) <= length(w)). Their end sets either do not intersect, or endpos(w) is entirely contained in endpos(u), depending on whether u is a suffix of w: endpos(w) ⊆ endpos(u) if u is a suffix of w, and endpos(u) ∩ endpos(w) = ∅ otherwise.

Proof. Suppose the sets endpos(u) and endpos(w) have at least one common element. Then the strings w and u end at the same position, so u is a suffix of w. But then every occurrence of w also ends an occurrence of its suffix u, which means that endpos(w) is entirely contained in endpos(u).
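Lemma 2 is easy to check exhaustively on small strings. The following is a minimal brute-force sketch; the helper names `endpos` and `check_lemma2` are assumptions made for this illustration.

```python
def endpos(s, t):
    """All 0-based end indices of occurrences of the non-empty string t in s."""
    return {i + len(t) - 1 for i in range(len(s) - len(t) + 1) if s.startswith(t, i)}

def check_lemma2(s):
    """Return True if every pair of substrings of s behaves as Lemma 2 states."""
    subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    for u in subs:
        for w in subs:
            if len(u) > len(w):
                continue
            eu, ew = endpos(s, u), endpos(s, w)
            if w.endswith(u):
                if not ew <= eu:       # endpos(w) must be contained in endpos(u)
                    return False
            elif eu & ew:              # otherwise the end sets must be disjoint
                return False
    return True

print(check_lemma2("abcbc"), check_lemma2("aabab"))
```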

Lemma 3. Consider an endpos-equivalence class and sort its substrings by decreasing length. In the resulting sequence, each substring is exactly one character shorter than the previous one and is a suffix of it. In other words, the substrings in one equivalence class are suffixes of one another, and their lengths take every value in some interval [x, y].

Proof. Fix an endpos-equivalence class. If it contains only one substring, the lemma is obviously true. Suppose now that it contains more than one substring.

By Lemma 1, of two different endpos-equivalent substrings one is always a proper suffix of the other. Therefore two substrings in the same equivalence class cannot have the same length.

Let w be the longest and u the shortest substring in the class. By Lemma 1, u is a proper suffix of w. Now consider any suffix of w whose length lies in the interval [length(u), length(w)]. Every occurrence of this suffix ends with an occurrence of u, which by Lemma 1 lies inside an occurrence of w, so the suffix itself occurs only as a suffix of w. By Lemma 1 it therefore belongs to the same equivalence class.
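As a sanity check, Lemma 3 can also be verified by brute force. The sketch below (the name `check_lemma3` is illustrative only) groups substrings by end set and tests that each class is a chain of suffixes with consecutive lengths.

```python
from collections import defaultdict

def check_lemma3(s):
    """Return True if every endpos class of s is a suffix chain with consecutive lengths."""
    ends = defaultdict(set)
    for i in range(len(s)):
        for j in range(i, len(s)):
            ends[s[i:j + 1]].add(j)
    classes = defaultdict(list)
    for t, e in ends.items():
        classes[frozenset(e)].append(t)
    for members in classes.values():
        members.sort(key=len, reverse=True)
        longest = members[0]
        for k, t in enumerate(members):
            # the k-th member must be the suffix of `longest` that is k characters shorter
            if len(t) != len(longest) - k or not longest.endswith(t):
                return False
    return True

print(check_lemma3("abcbc"), check_lemma3("abbb"))
```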

Suffix links

Consider a state v ≠ t_0. As we now know, the state v corresponds to a class of strings with the same end set. Moreover, if w denotes the longest of these strings, all the others are suffixes of w. We also know that the first few suffixes of w (taken in decreasing order of length) belong to the same equivalence class, while all remaining suffixes (at least the empty suffix) belong to other classes. Let t be the first (longest) such remaining suffix; the suffix link of v leads to the state that contains t.

In other words, the suffix link link(v) leads to the state that corresponds to the longest suffix of w belonging to a different equivalence class.

Here we assume that the initial state t_0 forms its own equivalence class (containing only the empty string), and we set endpos(t_0) = {-1, 0, ..., length(s)-1}.

Lemma 4. The suffix links form a tree rooted at t_0.

Proof. Consider any state v ≠ t_0. Its suffix link link(v) leads to a state whose strings are strictly shorter (this follows from the definition of the suffix link and from Lemma 3).

Hence, by following suffix links we sooner or later arrive at the initial state t_0, which corresponds to the empty string.

Lemma 5. If we build a tree from all existing end sets (so that each child is a subset of its parent), this tree coincides in structure with the tree of suffix links.

Proof. The fact that the end sets can be arranged into a tree follows from Lemma 2 (any two end sets either do not intersect, or one is contained in the other).

Now consider an arbitrary state v ≠ t_0 and its suffix link link(v). From the definition of the suffix link and from Lemma 2 it follows that:

endpos(v) ⊂ endpos(link(v))

Together with the previous lemma, this proves our claim: the tree of suffix links is essentially the tree of end sets.

Example: the tree of suffix links in the suffix automaton built for the string "abcbc" (figure not shown).
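The suffix links for this example string (written in lower case as "abcbc") can be derived directly from the definitions. The sketch below is a brute-force illustration, not the construction algorithm: each state is identified by its longest string, the empty string stands for t_0, and the function name `suffix_links` is an assumption of this example.

```python
from collections import defaultdict

def suffix_links(s):
    """Map longest(v) -> longest(link(v)) for every endpos class; '' stands for t_0."""
    ends = defaultdict(set)
    for i in range(len(s)):
        for j in range(i, len(s)):
            ends[s[i:j + 1]].add(j)
    longest = {}                           # end set -> longest string with that end set
    for t, e in ends.items():
        key = frozenset(e)
        if key not in longest or len(t) > len(longest[key]):
            longest[key] = t
    links = {}
    for key, w in longest.items():
        k = 1
        # skip the suffixes of w that still belong to w's own class
        while k < len(w) and frozenset(ends[w[k:]]) == key:
            k += 1
        t = w[k:]                          # longest suffix in a different class ('' if none)
        links[w] = longest[frozenset(ends[t])] if t else ""
    return links

for w, target in sorted(suffix_links("abcbc").items()):
    print(repr(w), "->", repr(target))
```

In the result, "ab" and "abcb" both link to "b", "abc" and "abcbc" both link to "bc", and "a", "b" and "bc" link to t_0.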

Summary

Before moving on to the algorithm itself, let us summarize the facts established above and introduce two auxiliary notations.

· All substrings of s can be divided into equivalence classes according to their end sets.

· The suffix automaton consists of the initial state t_0 plus one state for each endpos-equivalence class.

· Each state v corresponds to one or more strings. We denote by longest(v) the longest of them and by len(v) its length. We denote by shortest(v) the shortest of them and by minlen(v) its length.

· All strings corresponding to a state v are different suffixes of longest(v), and their lengths cover the whole interval [minlen(v), len(v)].

· For each state v ≠ t_0 a suffix link is defined; it leads to the state that corresponds to the suffix of longest(v) of length minlen(v)-1. The suffix links form a tree rooted at t_0, and this tree is in fact the tree of inclusion relations between the end sets. Consequently, minlen(v) can be expressed through link(v): minlen(v) = len(link(v)) + 1.

· If we start at any state v_0 and follow suffix links, we sooner or later reach the initial state t_0. Along the way we obtain a sequence of disjoint intervals [minlen(v_i), len(v_i)] whose union is one continuous interval.
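The interval property in the last item can be observed on a small example. The following brute-force sketch identifies each state with its longest string; the names `class_intervals` and `interval_chain` are made up for this illustration, and the step to the next state relies on the relation minlen(v) = len(link(v)) + 1 stated above.

```python
from collections import defaultdict

def class_intervals(s):
    """Return {longest(v): (minlen(v), len(v))} for every endpos class of s."""
    ends = defaultdict(set)
    for i in range(len(s)):
        for j in range(i, len(s)):
            ends[s[i:j + 1]].add(j)
    groups = defaultdict(list)
    for t, e in ends.items():
        groups[frozenset(e)].append(t)
    return {max(g, key=len): (min(map(len, g)), max(map(len, g)))
            for g in groups.values()}

def interval_chain(s, w):
    """Follow suffix links from the class whose longest string is w, collecting intervals."""
    iv = class_intervals(s)
    chain = []
    while w:
        lo, hi = iv[w]
        chain.append((lo, hi))
        w = w[len(w) - (lo - 1):]          # suffix of length minlen - 1 = longest(link(v))
    chain.append((0, 0))                   # t_0 holds only the empty string
    return chain

print(interval_chain("abcbc", "abcbc"))    # intervals [3,5], [1,2], [0,0]
```

The three intervals are disjoint and together cover every length from 0 to 5.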

The linear-time construction algorithm

We now describe the algorithm itself. The algorithm is online: it adds the characters of s one at a time and updates the current automaton after each addition.

To keep the memory usage linear, we store for each state only the values len and link and the list of its transitions. We do not maintain terminal-state marks (we will show how to assign them after the automaton has been built, if they are needed).

Initially the automaton consists of the single state t_0, which we number 0 (the remaining states get the numbers 1, 2, ...). For this state we set len = 0 and, for convenience, link = -1 (pointing to a fictitious, non-existent state).

The whole task therefore reduces to implementing one operation: adding a character c to the end of the current string.

Here we describe this operation:

· 1. Let last be the state that corresponds to the entire current string before c is added (initially last = 0; we update last at the end of every addition).

· 2. Create a new state cur with len(cur) = len(last) + 1; the value of link(cur) is not yet known.

· 3. Start at the state last. While the current state has no transition on the character c, add a transition on c leading to cur, then follow the suffix link and check again. If at some point we find a state that already has a transition on c, stop and call that state p.

· 4. If we never found such a state and reached the fictitious state -1 (via the suffix link of t_0), simply set link(cur) = 0 and finish.

· 5. Otherwise we stopped at a state p that has a transition on c; let q be the state this transition leads to. There are two cases: either len(p) + 1 = len(q) or not.

· 6. If len(p) + 1 = len(q), simply set link(cur) = q and finish.

· 7. Otherwise the situation is more complicated. We must create a "clone" of q: create a new state clone and copy all data of q into it (the suffix link and the transitions), except for the value of len, which is set to len(clone) = len(p) + 1.

· 8. After cloning, point the suffix link of cur to clone, and redirect the suffix link of q to clone as well.

· 9. Finally, walk along the suffix links starting from p. For each state, check whether it has a transition on c that leads to q; if so, redirect it to clone (if not, stop the loop).

· 10. In every case, however the addition ended, finish by setting last = cur.
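The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name `SuffixAutomaton`, the helper `contains`, and the choice of a Python dict for the transitions of each state are assumptions of this example.

```python
class SuffixAutomaton:
    """States are integers; state 0 is t_0. Data is kept in parallel lists."""

    def __init__(self, s=""):
        self.len = [0]         # len(v)
        self.link = [-1]       # suffix link of v (-1 is the fictitious state)
        self.next = [{}]       # transitions of v: character -> state
        self.last = 0          # step 1: state of the whole current string
        for c in s:
            self.extend(c)

    def _new_state(self, length, link, trans):
        self.len.append(length)
        self.link.append(link)
        self.next.append(trans)
        return len(self.len) - 1

    def extend(self, c):
        cur = self._new_state(self.len[self.last] + 1, -1, {})    # step 2
        p = self.last
        while p != -1 and c not in self.next[p]:                   # step 3
            self.next[p][c] = cur
            p = self.link[p]
        if p == -1:                                                # step 4
            self.link[cur] = 0
        else:
            q = self.next[p][c]                                    # step 5
            if self.len[p] + 1 == self.len[q]:                     # step 6
                self.link[cur] = q
            else:                                                  # step 7
                clone = self._new_state(self.len[p] + 1,
                                        self.link[q], dict(self.next[q]))
                self.link[cur] = clone                             # step 8
                self.link[q] = clone
                while p != -1 and self.next[p].get(c) == q:        # step 9
                    self.next[p][c] = clone
                    p = self.link[p]
        self.last = cur                                            # step 10

    def contains(self, t):
        """True if t is a substring of the string the automaton was built for."""
        v = 0
        for c in t:
            if c not in self.next[v]:
                return False
            v = self.next[v][c]
        return True

sa = SuffixAutomaton("abcbc")
print(len(sa.len), sa.contains("cbc"), sa.contains("ca"))
```

A dict per state keeps the sketch short; the trade-offs between different ways of storing transitions are discussed in the complexity section of this article.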

If we also need to know which states are terminal, we can find them after the suffix automaton for the entire string has been built. To do so, take the state that corresponds to the entire string (it is stored in the variable last) and follow its suffix links until the initial state is reached, marking every state on the way as terminal. It is easy to see that in this way we mark exactly the states corresponding to all suffixes of s, which are the terminal states we need.
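A possible sketch of this marking procedure is shown below. The function names `build`, `terminal_states` and `accepts` are illustrative assumptions; `build` is a compact version of the construction described in this section, repeated here only so that the example runs on its own.

```python
def build(s):
    """Build the automaton for s; returns (length, link, nxt, last) as plain lists."""
    length, link, nxt, last = [0], [-1], [{}], 0
    for c in s:
        cur = len(length)
        length.append(length[last] + 1); link.append(-1); nxt.append({})
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(length)
                length.append(length[p] + 1); link.append(link[q]); nxt.append(dict(nxt[q]))
                link[q] = link[cur] = clone
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
        last = cur
    return length, link, nxt, last

def terminal_states(link, last):
    """Follow suffix links from `last`; t_0 is included (it matches the empty suffix)."""
    terminal = set()
    v = last
    while v != -1:
        terminal.add(v)
        v = link[v]
    return terminal

def accepts(nxt, terminal, t):
    """True if reading t from t_0 ends in a terminal state, i.e. t is a suffix."""
    v = 0
    for c in t:
        if c not in nxt[v]:
            return False
        v = nxt[v][c]
    return v in terminal

length, link, nxt, last = build("abcbc")
term = terminal_states(link, last)
print([t for t in ["bc", "cbc", "bcb", "ab"] if accepts(nxt, term, t)])
```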

In the next section we consider each step of the algorithm in detail and prove its correctness.

Here we only note that adding one character creates one or two new states. Therefore the number of states is obviously linear.

The linearity of the number of transitions, and the linear running time of the algorithm as a whole, are less obvious; they are shown below, after the proof of correctness.
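The bound on the number of states can be observed experimentally. The sketch below (the names `build_sizes` and the sample strings are assumptions of this example) builds the automaton with the algorithm of this section and reports the number of states and transitions; since each character adds at most two states to the initial one, the number of states is at most 2n + 1.

```python
def build_sizes(s):
    """Build the automaton for s and return (number of states, number of transitions)."""
    length, link, nxt, last = [0], [-1], [{}], 0
    for c in s:
        cur = len(length)
        length.append(length[last] + 1); link.append(-1); nxt.append({})
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(length)
                length.append(length[p] + 1); link.append(link[q]); nxt.append(dict(nxt[q]))
                link[q] = link[cur] = clone
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
        last = cur
    return len(length), sum(len(t) for t in nxt)

for s in ["a", "aa", "ab", "aba", "abb", "abbb", "abcbc"]:
    states, transitions = build_sizes(s)
    assert states <= 2 * len(s) + 1        # one initial state plus at most two per character
    print(repr(s), states, transitions)
```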

Proof of correctness of the algorithm

· We call a transition (p, q) continuous if len(p) + 1 = len(q). Otherwise, that is, when len(p) + 1 < len(q), we call it non-continuous.

· As the description of the algorithm shows, continuous and non-continuous transitions lead to different branches of the algorithm. Continuous transitions are so named because, once they appear, they never change. In contrast, a non-continuous transition may change as new characters are added to the string (the state that the edge points to may change).

· To avoid ambiguity, s denotes the string for which the automaton had been built before the current character c is added.

· At the start of the algorithm we create a new state cur, which will correspond to the entire string s+c. The need for a new state is clear: together with the new character a new equivalence class appears, namely the class of substrings that end at the last position of the new string s+c.

· After creating the new state, the algorithm starts at the state that corresponds to the entire string s and walks along the suffix links, trying at each state to add a transition on the character c leading to cur. A new transition can be added only if it does not conflict with an existing one, so as soon as we meet an existing transition on c we must stop.

· The simplest case: we reached the fictitious state -1, having added a transition on c at every state along the way. This means that the character c did not occur in the string s before. All transitions have been added successfully; it only remains to set the suffix link of cur. It must be 0, because in this case cur corresponds to all suffixes of the string s+c.

· The second case: we came across an existing transition (p, q). This means that we were trying to add to the automaton a string x+c (where x is a suffix of s of length len(p)) that had already been added earlier (that is, x+c already occurs in s as a substring). Since we assume that the automaton for s was built correctly, no new transitions should be added here. The difficulty lies in where to point the suffix link of cur. We need to link it to a state whose longest string is exactly x+c, that is, a state whose len value equals len(p) + 1. But such a state may not exist; in that case we have to split a state.

· So one possibility is that the transition (p, q) turns out to be continuous, that is, len(q) = len(p) + 1. Then everything is simple: no splitting is needed, and we just point the suffix link of cur to q.

· The other, more complicated possibility is that the transition is non-continuous, that is, len(q) > len(p) + 1. This means that the state q corresponds not only to the substring x+c of length len(p) + 1 that we need, but also to longer substrings. We have no choice but to split the state q: divide its strings into two parts so that the first part ends exactly at length len(p) + 1.

How do we perform this split? We clone the state q into a new state clone, setting len(clone) = len(p) + 1. We copy all transitions of q to clone, because we do not want to change in any way the paths that passed through q. The suffix link of clone is set to the old suffix link of q, and the suffix link of q is set to clone.

After cloning, we point the suffix link of cur to clone, which is the reason the clone was made in the first place.

The last step is to redirect some of the transitions that lead to q so that they lead to clone. Which transitions must be redirected? It suffices to redirect those that correspond to suffixes of the string x+c. That is, we continue to move along the suffix links, starting from p, and keep redirecting until we reach the fictitious state -1 or a state whose transition on c leads to a state other than q.
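To see which branch is taken for each added character, one can instrument the construction. The sketch below is an illustration under assumed names (`trace` and the labels "root", "continuous" and "clone" are made up): "root" means the walk reached the fictitious state -1, "continuous" means the transition (p, q) was continuous, and "clone" means q had to be split.

```python
def trace(s):
    """Build the automaton for s and record which case each added character falls into."""
    length, link, nxt, last = [0], [-1], [{}], 0
    events = []
    for c in s:
        cur = len(length)
        length.append(length[last] + 1); link.append(-1); nxt.append({})
        p = last
        while p != -1 and c not in nxt[p]:
            nxt[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
            events.append("root")            # c did not occur in s before
        else:
            q = nxt[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
                events.append("continuous")  # q already has the right len, no split
            else:
                clone = len(length)
                length.append(length[p] + 1); link.append(link[q]); nxt.append(dict(nxt[q]))
                link[q] = link[cur] = clone
                while p != -1 and nxt[p].get(c) == q:
                    nxt[p][c] = clone
                    p = link[p]
                events.append("clone")       # q was split
        last = cur
    return events

print(trace("abb"))    # ['root', 'root', 'clone']
print(trace("aa"))     # ['root', 'continuous']
```

In "abb", the second "b" finds the transition from t_0 on "b" leading to the state of "ab" (len 2), which is longer than len(t_0) + 1 = 1, so that state is split.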

Proof that the number of operations is linear

First of all, we assume that the size of the alphabet is constant. If it is not, we can no longer speak of linear time: the transitions out of each state are then stored in a balanced tree, which supports fast lookup and insertion by key. Consequently, if k denotes the size of the alphabet, the asymptotic running time of the algorithm is O(n log k) with O(n) memory. However, if the alphabet is small enough, we can trade memory for speed: instead of a balanced tree, the transitions of each state are stored in an array of length k (for fast lookup by character) together with a dynamic list (for fast traversal of all existing transitions). The algorithm then runs in O(n) time, but uses O(nk) memory.
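The array-plus-list layout mentioned above might look like the following sketch. The class name `ArrayTransitions`, its methods, and the assumption of a lower-case Latin alphabet (k = 26) are all choices made for this illustration. Python has no built-in balanced tree, so the tree-based variant is not shown; a plain dict (a hash table) is the usual stand-in in Python code.

```python
class ArrayTransitions:
    """Transitions of one state: O(k) memory, O(1) lookup, fast iteration over used keys."""

    def __init__(self, k=26, base="a"):
        self.slots = [-1] * k        # slots[i] is the target state, or -1 if absent
        self.used = []               # characters that actually have a transition
        self.base = ord(base)

    def get(self, c):
        return self.slots[ord(c) - self.base]

    def set(self, c, target):
        i = ord(c) - self.base
        if self.slots[i] == -1:      # first time this character is used
            self.used.append(c)
        self.slots[i] = target

    def items(self):
        """Existing transitions only, without scanning all k slots."""
        return [(c, self.get(c)) for c in self.used]

t = ArrayTransitions()
t.set("b", 3)
t.set("a", 5)
t.set("b", 7)                        # redirecting an existing transition
print(t.get("b"), t.get("c"), t.items())
```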

Therefore we assume that the size of the alphabet is constant, so that every operation on transitions (looking up a transition by character, adding a transition, finding the next transition) takes O(1) time.

If we look at all parts of the algorithm, there are three places whose linear total cost is not obvious:

· First: walking along the suffix links from the state last and adding transitions on the character c.

· Second: copying the transitions when q is cloned into the new state clone.

· Third: redirecting the transitions that lead to q so that they lead to clone.

We use the well-known fact that the size of the suffix automaton (both the number of states and the number of transitions) is linear. (That the number of states is linear follows from the algorithm itself; for

