Suffix Automaton in Detail

Source: Internet
Author: User

Reprint to: http://blog.csdn.net/qq_35649707/article/details/66473069

Suffix automaton

A suffix automaton (also known as a directed acyclic word graph) is a powerful data structure that makes it possible to solve many string problems.

For example, a suffix automaton can be used to find all occurrences of one string in another, or to count the number of distinct substrings of a string, both in linear time.

Intuitively, a suffix automaton can be understood as a compressed representation of all substrings of a string. A remarkable fact is that it holds all of this information for a string of length n in only O(n) space. Moreover, it can be constructed in O(n) time (if the alphabet size k is treated as a constant; otherwise in O(n log k)).

Historically, the linear size of the suffix automaton was first established by Blumer et al. in 1983, and the first linear-time construction algorithms were presented in 1985-1986 (Crochemore, Blumer et al.). More details can be found in the references at the end of the text.

In English the structure is called a "suffix automaton" (plural: "suffix automata"), and the directed acyclic word graph is abbreviated as "DAWG".


Definition of the suffix automaton

Definition. The suffix automaton of a given string s is a minimal deterministic finite automaton that accepts all suffixes of the string s.


This definition is explained below:

· The suffix automaton is a directed acyclic graph in which the vertices are called states and the edges are transitions between states.

· One state t_0 is called the initial state, and all other states must be reachable from it.

· Every transition in the automaton, that is, every edge, is labeled with a character. All transitions leaving a given state must carry different labels. (On the other hand, a state may have no transition on some characters.)

· One or more states are marked as terminal states. If we go from the initial state t_0 along any path to a terminal state and write out the labels of all the edges traversed, the resulting string must be a suffix of s.

· Among all automata that satisfy the above conditions, the suffix automaton has the smallest number of vertices. (It is not required to have the smallest number of edges.)


The simplest properties of the suffix automaton

The simplest and yet most important property of the suffix automaton is that it contains information about all substrings of s. Namely, for any path starting at the initial state t_0, if we write out the labels of the edges along it, the resulting string must be a substring of s. Conversely, every substring of s corresponds to some path starting at the initial state t_0.

To simplify the explanations, we say that a substring "matches" the path from the initial state whose edge labels spell out this substring. Conversely, we say that any path "matches" the string formed by the labels of its edges.

One or more paths from the initial state lead to each state of the suffix automaton. We say that a state corresponds to the set of strings that match these paths.
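As a concrete illustration of paths and matching, here is a minimal Python sketch that writes out by hand the suffix automaton of s = "aba" and walks it. The state names A, B and C, the table layout, and the helper names walk, is_substring and is_suffix are assumptions made for this example. The initial state is treated as terminal on the assumption that the empty string counts as a suffix.

```python
# Hand-built suffix automaton for s = "aba" (state names are illustrative).
# Each entry maps a state to its outgoing transitions, keyed by edge label.
TRANSITIONS = {
    "t0": {"a": "A", "b": "B"},
    "A": {"b": "B"},  # A corresponds to {"a"}
    "B": {"a": "C"},  # B corresponds to {"ab", "b"}
    "C": {},          # C corresponds to {"aba", "ba"}
}
# Terminal states; "t0" is included on the assumption that the empty
# string counts as a suffix of s.
TERMINAL = {"t0", "A", "C"}


def walk(word):
    """Follow the path labeled by word from t0; return the state reached, or None."""
    state = "t0"
    for ch in word:
        state = TRANSITIONS[state].get(ch)
        if state is None:
            return None  # no transition on this character
    return state


def is_substring(word):
    """A string is a substring of s exactly when some path from t0 matches it."""
    return walk(word) is not None


def is_suffix(word):
    """A string is a suffix of s exactly when its path ends in a terminal state."""
    return walk(word) in TERMINAL
```

With this table, is_substring("ab") is true although is_suffix("ab") is false, because the path for "ab" exists but ends in the non-terminal state B.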


Examples of suffix automata

Below are suffix automata built for several simple strings (the diagrams are not reproduced here).

The initial state is denoted t_0, and terminal states are marked with an asterisk (*).

s = ""

s = "a"

s = "aa"

s = "ab"

s = "aba"

s = "abb"

s = "abbb"


A linear-time algorithm for constructing the suffix automaton

Before describing the construction algorithm, we need to introduce several new concepts and prove a few simple lemmas that are essential for understanding the suffix automaton.

End positions endpos, their properties, and their connection to the suffix automaton

Consider any non-empty substring t of the string s. We define the end set endpos(t) as the set of all positions in s at which occurrences of t end. For example, with positions numbered from 0, in s = "abcbc" we have endpos("bc") = {2, 4}.

We call two substrings t_1 and t_2 "endpoint-equivalent" if their end sets coincide: endpos(t_1) = endpos(t_2). Thus, all non-empty substrings of s can be divided into classes according to endpoint equivalence.

It turns out that in the suffix automaton, endpoint-equivalent substrings correspond to the same state. In other words, the number of states in the suffix automaton equals the number of endpoint equivalence classes among all substrings, plus the initial state. Each state corresponds to one or more substrings that have the same end set.
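The grouping of substrings by end set can be checked by brute force on a small string. The following Python sketch is only an illustration of the definition (it takes far more than linear time and is not the construction algorithm). The function name endpos_classes and the use of 0-based positions are choices made for this example.

```python
from collections import defaultdict


def endpos_classes(s):
    """Group the non-empty substrings of s by their set of end positions (0-based)."""
    endpos = defaultdict(set)
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            endpos[s[i:j]].add(j - 1)  # this occurrence ends at index j - 1
    classes = defaultdict(list)
    for t, ends in endpos.items():
        classes[frozenset(ends)].append(t)
    return dict(classes)


classes = endpos_classes("abcbc")
for ends, members in sorted(classes.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(ends), sorted(members, key=len))
# Number of states: one per equivalence class, plus the initial state.
print("states:", len(classes) + 1)
```

For "abcbc" this yields 7 classes, for instance {"c", "bc"} with end set {2, 4}, and therefore 8 states.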

We take this statement as an assumption and describe a linear-time algorithm for constructing the suffix automaton based on it. As we will see shortly, all the required properties of the suffix automaton, except minimality (that is, the smallest number of vertices), will then be satisfied. Minimality follows from Nerode's theorem (see the references).

We now state a few simple but important facts about end sets.


Lemma 1. Two non-empty substrings u and w (length(u) <= length(w)) are endpoint-equivalent if and only if u occurs in s only as a suffix of w.

Proof. The proof is nearly obvious. If u and w have the same end set, then u is a suffix of w and occurs in s only as a suffix of w. Conversely, if u is a suffix of w and occurs in s only as a suffix of w, then their end sets coincide by definition.

Lemma 2. Consider two non-empty substrings u and w (length(u) <= length(w)). Their end sets either do not intersect, or endpos(w) is entirely contained in endpos(u). Which case holds depends on whether u is a suffix of w:

endpos(w) ⊆ endpos(u), if u is a suffix of w;
endpos(u) ∩ endpos(w) = ∅, otherwise.

Proof. Suppose that the sets endpos(u) and endpos(w) have at least one common element. Then the strings w and u end at the same position, so u is a suffix of w. Hence every occurrence of w ends with an occurrence of u, which means that endpos(w) is contained in endpos(u).
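Lemma 2 can be sanity-checked on small inputs by comparing the end sets of every pair of substrings. The Python sketch below is a brute-force test, not a proof. The helper names endpos and check_lemma_2 are assumptions made for this example, and positions are 0-based.

```python
def endpos(s, t):
    """Set of 0-based positions in s at which an occurrence of t ends."""
    return {i + len(t) - 1 for i in range(len(s) - len(t) + 1) if s.startswith(t, i)}


def check_lemma_2(s):
    """Return True if every pair of substrings (u, w) of s behaves as Lemma 2 states."""
    subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    for u in subs:
        for w in subs:
            if len(u) > len(w):
                continue
            eu, ew = endpos(s, u), endpos(s, w)
            if w.endswith(u):
                if not ew <= eu:  # endpos(w) must be a subset of endpos(u)
                    return False
            elif eu & ew:  # otherwise the end sets must be disjoint
                return False
    return True
```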

Lemma 3. Consider an endpoint equivalence class and sort its substrings in decreasing order of length. Then in the resulting sequence each substring is one character shorter than the previous one and is a suffix of it. In other words, the substrings of one endpoint equivalence class are suffixes of one another, and their lengths take all values in some interval [x, y].

Proof. Fix an endpoint equivalence class. If it contains only one substring, the lemma is obviously correct. Suppose now that it contains more than one substring.

According to Lemma 1, of two different endpoint-equivalent substrings one is always a proper suffix of the other. Therefore, an endpoint equivalence class cannot contain two substrings of the same length.

Let w be the longest and u the shortest substring in the equivalence class. According to Lemma 1, u is a proper suffix of w. Now consider any suffix of w whose length lies in the interval [length(u), length(w)]. This suffix can occur in s only as a suffix of w (because the shorter suffix u occurs only as a suffix of w). By Lemma 1, it is therefore endpoint-equivalent to w, so it belongs to the same equivalence class.
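The structure described by Lemma 3 can also be observed directly: each class, sorted by decreasing length, should consist of consecutive suffixes of its longest member. The following Python sketch checks this by brute force. The function names are assumptions made for this example, and positions are 0-based.

```python
from collections import defaultdict


def endpos(s, t):
    """Frozen set of 0-based positions in s at which an occurrence of t ends."""
    return frozenset(i + len(t) - 1 for i in range(len(s) - len(t) + 1) if s.startswith(t, i))


def classes_as_suffix_chains(s):
    """Return each endpoint equivalence class as a list sorted by decreasing length."""
    groups = defaultdict(set)
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            groups[endpos(s, s[i:j])].add(s[i:j])
    return [sorted(g, key=len, reverse=True) for g in groups.values()]


def check_lemma_3(s):
    for chain in classes_as_suffix_chains(s):
        longest = chain[0]
        for k, t in enumerate(chain):
            # The k-th string must be the suffix of longest that is k characters shorter.
            if t != longest[k:]:
                return False
    return True
```

For "abcbc", one of the chains is ["abcb", "bcb", "cb"], whose lengths fill the interval [2, 4].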

Suffix links

Consider a state v ≠ t_0. As we now know, the state v corresponds to a class of strings with the same end set. If we denote by w the longest of these strings, then all the others are suffixes of w. We also know that the first few suffixes of w (taken in decreasing order of length) lie in the same endpoint equivalence class, while the remaining suffixes (at least the empty suffix) lie in other classes. Let t be the first such suffix, that is, the longest suffix of w that lies in another class. We draw a suffix link from v to the state of t.

In other words, the suffix link link(v) leads to the state that corresponds to the longest suffix of w that lies in a different endpoint equivalence class.

Here we assume that the initial state t_0 corresponds to a separate endpoint equivalence class (containing only the empty string), and we set endpos(t_0) = {-1, ..., length(s)-1}.

Lemma 4. The suffix links form a tree rooted at t_0.

Proof. Consider any state v ≠ t_0. The suffix link link(v) leads to a state whose strings are strictly shorter (by the definition of the suffix link and Lemma 3). Therefore, moving along suffix links, we sooner or later reach the initial state t_0, which corresponds to the empty string.

Lemma 5. If we build a tree from all existing end sets (such that a child is a subset of its parent), it will have the same structure as the tree of suffix links.

Proof. The fact that the end sets can be arranged into a tree follows from Lemma 2 (any two end sets are either disjoint, or one contains the other).

Now consider an arbitrary state v ≠ t_0 and its suffix link link(v). From the definition of the suffix link and from Lemma 2 it follows that:

endpos(v) ⊂ endpos(link(v))

Together with the previous lemma, this proves our claim: the tree of suffix links is the same as the tree of end sets.

Example: the tree of suffix links in the suffix automaton built for the string "abcbc" (figure omitted).
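The suffix links for "abcbc" can be recovered directly from the definition. The Python sketch below names each state by its longest string and uses "" for t_0. It is a brute-force illustration under these assumptions, not the linear-time construction, and the function names are hypothetical.

```python
def endpos(s, t):
    """Frozen set of 0-based positions in s at which an occurrence of t ends."""
    return frozenset(i + len(t) - 1 for i in range(len(s) - len(t) + 1) if s.startswith(t, i))


def suffix_links(s):
    """Map longest(v) to longest(link(v)) for every state v != t_0 ("" stands for t_0)."""
    subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    # Longest member of each endpoint equivalence class, keyed by end set.
    longest = {}
    for t in subs:
        key = endpos(s, t)
        if len(t) > len(longest.get(key, "")):
            longest[key] = t
    links = {}
    for key, w in longest.items():
        # Drop leading characters until the suffix leaves the class of w.
        k = 1
        while k < len(w) and endpos(s, w[k:]) == key:
            k += 1
        suffix = w[k:]
        links[w] = longest[endpos(s, suffix)] if suffix else ""
    return links


for state, target in sorted(suffix_links("abcbc").items()):
    print(repr(state), "->", repr(target))
```

For "abcbc" the states "abcbc" and "abc" both link to "bc", the states "abcb" and "ab" link to "b", and the states "a", "b" and "bc" link to t_0.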




Summary

Before moving on to the algorithm itself, we summarize the facts accumulated above and introduce a couple of auxiliary notations.

· All substrings of s can be divided into equivalence classes according to their end sets endpos.

· The suffix automaton consists of the initial state t_0 and one state for each endpoint equivalence class.

· Each state v corresponds to one or more strings. We denote by longest(v) the longest of them and by len(v) its length. We denote by shortest(v) the shortest of them and by minlen(v) its length.

· All strings corresponding to a state v are distinct suffixes of longest(v), and their lengths take every value in the interval [minlen(v), len(v)].

· For each state v ≠ t_0 a suffix link is defined. It leads to the state that corresponds to the suffix of longest(v) of length minlen(v) - 1. The suffix links form a tree rooted at t_0, and this tree is in fact the tree of inclusion relations between the end sets. Consequently, minlen(v) can be expressed through link(v): minlen(v) = len(link(v)) + 1.

· If we start from any state v_0 and move along the suffix links, we reach the initial state t_0 sooner or later. Along the way we obtain a sequence of disjoint intervals [minlen(v_i), len(v_i)] whose union is one continuous interval.
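The last two points can be observed on a small example by following the suffix links from a state and recording its length interval at each step. The Python sketch below does this by brute force from the definitions. The function name interval_chain, the 0-based positions, and the choice to represent t_0 by the interval (0, 0) are assumptions made for this example.

```python
def endpos(s, t):
    """Frozen set of 0-based positions in s at which an occurrence of t ends."""
    return frozenset(i + len(t) - 1 for i in range(len(s) - len(t) + 1) if s.startswith(t, i))


def interval_chain(s, w):
    """Follow suffix links from the state of substring w; list (minlen, len) per state."""
    chain = []
    while w:
        ends = endpos(s, w)
        e = min(ends)  # every string of this class ends at position e
        # Lengths of all strings ending at e that share the end set of w.
        lengths = [n for n in range(1, e + 2) if endpos(s, s[e + 1 - n:e + 1]) == ends]
        minlen, length = min(lengths), max(lengths)
        assert lengths == list(range(minlen, length + 1))  # no gaps in the interval
        chain.append((minlen, length))
        # Move to link(v): the suffix of longest(v) of length minlen - 1.
        w = s[e + 2 - minlen:e + 1]
    chain.append((0, 0))  # t_0 corresponds only to the empty string
    return chain


print(interval_chain("abcbc", "abcbc"))
print(interval_chain("abcbc", "abcb"))
```

Starting from "abcbc" the chain is (3, 5), (1, 2), (0, 0). Each minlen equals the len of the next state plus 1, and together the intervals cover [0, 5] without overlap.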


A linear-time algorithm for constructing the suffix automaton
