[Reproduced] Dictionary tree (trie tree), suffix tree

Source: Internet
Author: User

(1) Dictionary tree (trie tree)

A trie is a simple but useful data structure that is commonly used to implement dictionary lookups. An Ajax search box that responds instantly as the user types is a typical place where a trie is at work. Essentially, a trie is a tree that stores multiple strings. Each edge between adjacent nodes represents one character, so each path down from the root spells out a prefix, and a leaf node represents a complete string. Unlike an ordinary tree, strings with the same prefix share the same branch. An example makes this clearest. Given the set of words inn, int, at, age, adv, ant, we get the following trie:

The following can be observed:

    • Each edge corresponds to one letter.
    • Each node corresponds to a prefix. A leaf node corresponds to the longest prefix, which is the word itself.
    • The word inn shares the prefix "in" with the word int, so they share a branch on the left: root -> i -> in. Similarly, at, age, adv, and ant share the prefix "a", so they share the edge from the root node to node "a".
    • Lookup is very simple. For example, to find int, follow the path root -> i -> in -> int.
    • The basic algorithm for building a trie is also very simple: insert each word into the trie one letter at a time. Before inserting a letter, check whether the corresponding prefix already exists. If it does, share it; otherwise create the corresponding node and edge. For example, inserting the word add takes the following steps:
      1. Examine the prefix "a" and find that edge a already exists, so follow edge a to node a.
      2. Examine the prefix "d" of the remaining string "dd" and find that an edge d already leaves node a, so follow edge d to node ad.
      3. Examine the last character "d". No edge d leaves node ad, so create a child node add of node ad and label the edge ad -> add with d.
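
The insertion and lookup steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names TrieNode, Trie, insert, contains, and starts_with are chosen here for the example, and the is_word flag is an added assumption so that a word ending at an inner node (such as at, if ate were also stored) can still be recognized.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # edge label (one character) -> child node
        self.is_word = False   # True if the path from the root spells a stored word


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Insert a word and return how many new nodes had to be created."""
        node = self.root
        created = 0
        for ch in word:
            if ch not in node.children:
                # The prefix does not exist yet: create the node and the edge.
                node.children[ch] = TrieNode()
                created += 1
            # The prefix exists (or was just created): follow the shared edge.
            node = node.children[ch]
        node.is_word = True
        return created

    def _walk(self, s):
        """Follow the edges for s from the root; return the node reached or None."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for w in ["inn", "int", "at", "age", "adv", "ant"]:
    trie.insert(w)

# Inserting "add" reuses the edges a and d and creates only the last node.
new_nodes = trie.insert("add")
```

With the six example words already stored, inserting add creates a single new node, matching the three steps listed above.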

(2) Suffix tree

A suffix tree is a compressed dictionary tree that contains all the suffixes of a string. First, the definition of a suffix. Given a string S = s1 s2 ... si ... sn of length n and an integer i with 1 <= i <= n, the substring si si+1 ... sn is a suffix of S. Take the string S = xmadamyx as an example. Its length is 8, so S[1..8], S[2..8], ..., S[8..8] are all suffixes of S, and the empty string is usually counted as a suffix too. That gives the suffixes listed below. For the suffix S[i..n], we say that the suffix starts at position i.

    1. S[1..8], xmadamyx, which is the string itself, starting position is 1
    2. S[2..8], madamyx, starting position is 2
    3. S[3..8], adamyx, starting position is 3
    4. S[4..8], damyx, starting position is 4
    5. S[5..8], amyx, starting position is 5
    6. S[6..8], myx, starting position is 6
    7. S[7..8], yx, starting position is 7
    8. S[8..8], x, starting position is 8
    9. The empty string, written as $.
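
The list above can be reproduced with a few lines of Python. This is only a sketch: the function name suffixes is made up for the example, and giving the empty suffix the start position n + 1 is an assumption, since the text assigns it no position.

```python
def suffixes(s):
    """Return (start, suffix) pairs with 1-based start positions."""
    n = len(s)
    # S[i..n] in the 1-based notation of the text is s[i - 1:] in Python.
    result = [(i, s[i - 1:]) for i in range(1, n + 1)]
    # The empty suffix, written as "$" in the text (position n + 1 is assumed).
    result.append((n + 1, "$"))
    return result


for start, suffix in suffixes("xmadamyx"):
    print(start, suffix)
```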

All of these suffixes together form a dictionary tree:

Looking closely, we can see many places worth compressing. For example, each path marked with a blue box has only a single branch, so there is no need to spend a separate node on every edge along it. If we allow an edge to carry more than one letter, each such non-branching path can be compressed into a single edge. In addition, the edges already carry enough information about the suffixes, so the nodes no longer need to be labeled with strings. We only need to record the starting position of each suffix at its leaf node. So we get:

Such a structure loses some suffixes. For example, the suffix x disappears, because it happens to be a prefix of the string xmadamyx. To avoid this, we additionally require that no suffix be a prefix of another suffix. Meeting this requirement is actually quite simple: append the empty-string marker $, which does not occur anywhere else in the string, before processing it. For example, before handling xmadamyx we first turn it into xmadamyx$, and then we obtain a proper suffix tree.
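
The effect of the terminator can be checked directly. The helper below is a hypothetical name introduced for this sketch; it lists every non-empty suffix that is a prefix of some other suffix, which are exactly the suffixes that would fail to get their own leaf.

```python
def suffixes_hidden_as_prefixes(s):
    """Return the non-empty suffixes of s that are a prefix of another suffix."""
    sufs = [s[i:] for i in range(len(s))]
    return [a for a in sufs
            if any(a != b and b.startswith(a) for b in sufs)]


without_marker = suffixes_hidden_as_prefixes("xmadamyx")
with_marker = suffixes_hidden_as_prefixes("xmadamyx$")
```

Without the marker the only offending suffix is x; with the marker appended the list is empty, because every suffix now ends in a character that appears nowhere else.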

This is how a suffix tree is formed. As for how to build one, mature algorithms exist (Ukkonen's algorithm is a well-known example) that construct it in O(n) time.
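
To make the compressed structure concrete, here is a deliberately naive sketch that inserts the suffixes one by one and splits an edge wherever two suffixes diverge. It is not one of the linear-time algorithms: it takes O(n^2) time in the worst case. The names Node, build_suffix_tree, insert_suffix, and leaves are assumptions made for this example.

```python
class Node:
    def __init__(self, start=None):
        self.children = {}   # first character of edge label -> (label, child)
        self.start = start   # 1-based suffix start for a leaf, None otherwise


def insert_suffix(root, suffix, start):
    node = root
    while True:
        first = suffix[0]
        if first not in node.children:
            # No edge begins with this character: hang a new leaf here.
            node.children[first] = (suffix, Node(start))
            return
        label, child = node.children[first]
        # Length of the common prefix of the edge label and the suffix.
        k = 0
        while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
            k += 1
        if k == len(label):
            # The whole edge matches: continue below it with the remainder.
            node, suffix = child, suffix[k:]
            continue
        # The suffix diverges inside the edge: split the edge at position k.
        mid = Node()
        mid.children[label[k]] = (label[k:], child)
        mid.children[suffix[k]] = (suffix[k:], Node(start))
        node.children[first] = (label[:k], mid)
        return


def build_suffix_tree(s, terminator="$"):
    if terminator in s:
        raise ValueError("terminator must not occur in the string")
    text = s + terminator
    root = Node()
    for i in range(len(text)):
        insert_suffix(root, text[i:], i + 1)
    return root


def leaves(node, path=""):
    """Return (suffix spelled by the path, start position) for every leaf."""
    if not node.children:
        return [(path, node.start)]
    found = []
    for label, child in node.children.values():
        found.extend(leaves(child, path + label))
    return found


tree = build_suffix_tree("xmadamyx")
for suffix, start in sorted(leaves(tree), key=lambda pair: pair[1]):
    print(start, suffix)
```

Because the terminator is unique, no suffix can end in the middle of an edge or at an inner node, which is why the splitting step never runs out of characters.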

(3) Generalized suffix tree

An ordinary suffix tree can only handle the suffixes of a single word. A generalized suffix tree stores all the suffixes of any number of words. For example, given the strings "abab" and "baba", first join them with special terminator characters, giving "abab$baba#", and build the suffix tree of the joined string. Then traverse the resulting tree and, wherever a special character such as "$" or "#" is encountered, remove the subtree rooted at that node. The tree that remains is the generalized suffix tree of the original group of strings. In essence, all the suffixes of the two strings, namely abab$, bab$, ab$, b$, baba#, aba#, ba#, a#, are put into one dictionary tree, which is then compressed. A common application of generalized suffix trees is judging the similarity of two strings.
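
As a rough illustration of the similarity use case, the sketch below puts all suffixes of several words into one uncompressed dictionary tree and reads off a longest common substring. It simplifies the construction described above: instead of the terminators "$" and "#", each node records which words pass through it (the owners set), and the tree is not compressed, so it needs quadratic space. All names are assumptions made for this example.

```python
def build_generalized_suffix_trie(words):
    """Insert every suffix of every word; tag each node with its owner words."""
    root = {"children": {}, "owners": set()}
    for index, word in enumerate(words):
        for i in range(len(word)):
            node = root
            for ch in word[i:]:
                children = node["children"]
                if ch not in children:
                    children[ch] = {"children": {}, "owners": set()}
                node = children[ch]
                node["owners"].add(index)
    return root


def longest_common_substring(words):
    """Return one longest string that occurs in every word."""
    root = build_generalized_suffix_trie(words)
    everyone = set(range(len(words)))
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["children"].items():
            # Only descend into nodes reached by suffixes of every word.
            if child["owners"] == everyone:
                stack.append((child, path + ch))
    return best


common = longest_common_substring(["abab", "baba"])
print(common, len(common))
```

For "abab" and "baba" the result has length 3. Both aba and bab qualify, and which one is returned depends on the traversal order.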
