[Algorithm Series 24] Suffix Tree

Source: Internet
Author: User

In the previous article ([Algorithm Series 20] Dictionary tree (Trie)) we introduced the dictionary tree in detail. With those basics in place, the suffix tree is easier to understand.

I. An introductory problem: pattern matching

Given a text text[0...n-1] and a pattern string pattern[0...m-1], write a function search(char pattern[], char text[]) that prints all positions where pattern appears in text (assume n > m).
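As a baseline for the problem statement above, here is a minimal brute-force sketch in Python. It returns the positions as a list instead of printing them, which is a choice made here for illustration and not part of the original statement.

```python
def search(pattern, text):
    """Return every index i at which pattern occurs in text (brute force)."""
    n, m = len(text), len(pattern)
    positions = []
    # Try every possible starting position and compare m characters.
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            positions.append(i)
    return positions


print(search("ana", "banana"))  # [1, 3]
```

This takes O(n * m) time in the worst case, which is the cost the preprocessing-based algorithms aim to reduce.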

This problem already has classical solutions such as the KMP algorithm and finite-automaton matching. Both preprocess the pattern string, not the text. After preprocessing, the search runs in O(n) time, where n is the length of the text.
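To make the idea of preprocessing the pattern concrete, here is a minimal sketch of KMP in Python. The function names build_failure and kmp_search are illustrative, and the sketch assumes a non-empty pattern.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(pattern, text):
    """Return all start positions of pattern in text (pattern must be non-empty)."""
    fail = build_failure(pattern)  # preprocessing depends only on the pattern
    positions = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return positions


print(kmp_search("ana", "banana"))  # [1, 3]
```

The text is scanned once from left to right, so the search phase is linear in n.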

A suffix tree instead preprocesses the text: once the suffix tree of the text has been built, any pattern can be searched for in O(m) time, where m is the length of the pattern string.

II. Introduction

The suffix tree is designed to support efficient string matching and querying, such as the problem above. The suffix tree (Suffix Tree) is a data structure that can quickly solve many string problems. The concept was first proposed by Weiner in 1973, and was substantially improved by McCreight in 1976 and by Ukkonen in 1992 and 1995.

In one sentence: the suffix tree of a given text is a compressed dictionary tree of all the suffixes of that text.

In the previous article we discussed the dictionary tree (Trie). Now let's look at the compressed dictionary tree (compressed Trie).

III. Compressed dictionary tree (compressed Trie)

Let's look at a set of words to describe what a compressed dictionary tree is:

{bear, bell, bid, bull, buy, sell, stock, stop}

We build a dictionary tree with the above set of words as follows:

The compressed dictionary tree is derived from the dictionary tree by compressing its single-child chains: every run of edges without branching is merged into a single edge labeled with the concatenated characters. In the example above, the chain e-l-l below s collapses into one edge labeled "ell".
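The compression step can be sketched in Python as follows. The nested-dict representation and the names build_trie and compress are assumptions made for illustration. The sketch also assumes that no word is a prefix of another word, which holds for the word set above, so no end-of-word marker is needed.

```python
def build_trie(words):
    """Plain dictionary tree: one edge (dict key) per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge every chain of single-child nodes into one multi-character edge."""
    compressed = {}
    for label, child in node.items():
        # Follow the chain while the child has exactly one outgoing edge.
        while len(child) == 1:
            next_label, next_child = next(iter(child.items()))
            label += next_label
            child = next_child
        compressed[label] = compress(child)
    return compressed


words = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
tree = compress(build_trie(words))
print(tree)
```

For this word set the root keeps two edges, "b" and "s"; below "s" the edges are "ell" and "to", and below "to" they are "ck" and "p".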

IV. Compressed dictionary tree of suffixes

From the discussion above, we know that the suffix tree is a compressed dictionary tree containing all the suffixes of the text. A suffix tree is built in the following steps:
(1) Generate all suffixes of the given text.
(2) Treat all the suffixes as valid words and build a compressed dictionary tree from them.

We take "banana\0" ('\0' is the end character) as an example. All the suffixes of the string are:

banana\0
anana\0
nana\0
ana\0
na\0
a\0
\0

Suppose we consider that all suffixes of the above string are valid words and construct a dictionary tree as follows:

If we merge the single-child chains, we get the following compressed dictionary tree, which is the suffix tree of the given text "banana\0".
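The two construction steps can be sketched in Python as below. Instead of building the full dictionary tree and merging chains afterwards, this sketch builds the compressed tree directly by grouping suffixes on their first character and taking each group's longest common prefix as the edge label. It is a naive quadratic construction for illustration only, not the linear-time algorithms of McCreight or Ukkonen, and the names build and suffix_tree are made up here.

```python
import os


def build(strings):
    """Compressed dictionary tree (nested dicts keyed by edge label)."""
    groups = {}
    for s in strings:
        if s:  # an empty remainder means the word ended at this node
            groups.setdefault(s[0], []).append(s)
    tree = {}
    for group in groups.values():
        # os.path.commonprefix compares character by character,
        # so it works on arbitrary strings, not only on paths.
        label = os.path.commonprefix(group)
        tree[label] = build([s[len(label):] for s in group])
    return tree


def suffix_tree(text):
    """Step 1: generate all suffixes. Step 2: build the compressed tree."""
    suffixes = [text[i:] for i in range(len(text))]
    return build(suffixes)


tree = suffix_tree("banana\0")
for label, subtree in tree.items():
    print(repr(label), subtree)
```

For "banana\0" the root has four edges: "banana\0", "a", "na" and "\0". The end character guarantees that no suffix is a prefix of another, so every suffix ends at its own leaf.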

So far, we've learned what a suffix tree is.

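V. Applications of the suffix tree

(1) Determine whether the target string S contains the pattern string T (pattern searching)

Approach: build a suffix tree from S, then search for T the same way a substring is searched for in a Trie. Principle: if T occurs in S, then T must be a prefix of some suffix of S. For example, with S = "leconte" and T = "con", T ("con") is a prefix of "conte", which is one of the suffixes of S ("leconte").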
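The principle can be sketched in Python. For brevity this sketch uses an uncompressed suffix trie (one character per edge, O(n^2) space) rather than a true suffix tree; the walk from the root is the same idea. The names build_suffix_trie and contains are illustrative.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """True if pattern is a prefix of some suffix, i.e. a substring of the text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("leconte")
print(contains(trie, "con"))  # True
print(contains(trie, "cot"))  # False
```

Once the trie is built, each query inspects at most m characters of the pattern, independent of the length of the text.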

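(2) Count the number of times the string T occurs in the target string S

Approach: build a suffix tree from S + '$' and search for T; the number of leaf nodes below the node where T ends is the number of occurrences. Principle: if T occurs twice in S, then S has two suffixes with T as a prefix, so the count falls out naturally.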
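A minimal Python sketch of the leaf-counting idea, again on an uncompressed suffix trie for brevity. It assumes that '$' does not occur in S; the function names are illustrative.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """A node without children is a leaf, i.e. the end of one suffix."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def count_occurrences(text, pattern):
    # The terminator makes every suffix end at its own leaf.
    node = build_suffix_trie(text + "$")
    for ch in pattern:
        if ch not in node:
            return 0
        node = node[ch]
    return count_leaves(node)


print(count_occurrences("banana", "ana"))  # 2
```

Each leaf below the node for T corresponds to one suffix that starts with T, and therefore to one starting position of T in S.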

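(3) Find the longest repeated substring of the target string S (finding the longest repeated substring)

Approach: the principle is the same as in (2); concretely, find the deepest non-leaf node. Depth here means the number of characters passed on the way from the root, and the characters along the path to the deepest non-leaf node, concatenated, form the longest repeated substring. Why a non-leaf node? Because the substring must repeat, so the number of leaves below the node must be >= 2.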
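A hedged Python sketch of this search. In an uncompressed suffix trie, the non-leaf nodes of the suffix tree correspond to nodes with at least two children, so the sketch looks for the deepest such branching node. It assumes '$' does not occur in S, and the names are illustrative.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def longest_repeated_substring(text):
    root = build_suffix_trie(text + "$")
    best = ""
    stack = [(root, "")]  # (node, characters read from the root)
    while stack:
        node, path = stack.pop()
        # Two or more children means at least two suffixes share this prefix.
        if len(node) >= 2 and len(path) > len(best):
            best = path
        for ch, child in node.items():
            stack.append((child, path + ch))
    return best


print(longest_repeated_substring("banana"))  # ana
```

The two occurrences may overlap, as "ana" does in "banana" (positions 1 and 3).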

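(4) Find the longest common substring of two target strings S1 and S2 (finding the longest common substring)

Approach: insert S1#S2$ as a single string into the suffix tree, then find the deepest non-leaf node whose leaves include both a suffix that contains # (one that starts in S1) and a suffix that contains $ but no # (one that starts in S2).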
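The approach can be sketched in Python as follows, using an uncompressed suffix trie of S1#S2$ for brevity. The sketch assumes that neither '#' nor '$' occurs in S1 or S2, uses recursion (so it is only suitable for short inputs), and all names are illustrative.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def longest_common_substring(s1, s2):
    root = build_suffix_trie(s1 + "#" + s2 + "$")
    best = ""

    def visit(node, path, seen_hash):
        """Return (has leaf from s1, has leaf from s2) for this subtree."""
        nonlocal best
        if not node:
            # Leaf: the suffix contains '#' exactly when it started in s1
            # (or at the separator itself).
            return seen_hash, not seen_hash
        from_s1 = from_s2 = False
        for ch, child in node.items():
            a, b = visit(child, path + ch, seen_hash or ch == "#")
            from_s1, from_s2 = from_s1 or a, from_s2 or b
        # A candidate must be shared by both strings and must not cross '#'.
        if from_s1 and from_s2 and not seen_hash and len(path) > len(best):
            best = path
        return from_s1, from_s2

    visit(root, "", False)
    return best


print(longest_common_substring("banana", "ananas"))  # anana
```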

(5) Find the longest palindromic substring of the target string T (finding the longest palindrome in a string)

Reference:

Pattern Searching | Set 8 (Suffix Tree Introduction)
Trie
