Dictionary tree (trie tree) and suffix tree

Source: Internet
Author: User

(1) dictionary tree (trie tree)

A trie is a simple but practical data structure, typically used for dictionary lookups. It comes up naturally when building, for example, an Ajax search box that responds to user input in real time. In essence, a trie is a tree that stores multiple strings. Each edge between adjacent nodes represents one character, so each path down from the root spells out a prefix, and a leaf node represents a complete string. Unlike an ordinary tree, strings with a common prefix share the same branch. An example makes this clearest. Given the words inn, int, at, age, adv, ant, we get the following trie:

We can see that:

  • Each edge corresponds to a letter.
  • Each node corresponds to a prefix. The leaf node corresponds to the longest prefix, that is, the word itself.
  • The words inn and int have the same prefix "in", so they share the left branch, root -> i -> in. Similarly, at, age, adv, and ant share the prefix "a", so they share the edge from the root node to the node a.
  • Lookup is very simple. For example, int is found by following the path i -> in -> int.
  • The basic algorithm for building a trie is also simple: insert each word one letter at a time. Before inserting a letter, check whether the corresponding prefix already exists. If it does, it is shared; otherwise, the node and edge are created. For example, to insert the word "add", perform the following steps:
    1. Test the prefix "a" and find that edge a already exists, so follow edge a to the node a.
    2. Test the first character "d" of the remaining string "dd" and find that edge d already exists from the node a, so follow edge d to the node ad.
    3. Test the last character "d". There is no edge d leaving the node ad, so create the child node add of the node ad and label the edge ad -> add with d.
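
The insertion and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `TrieNode`, `Trie`, `insert`, `contains`, and `starts_with` are made up for this example, and the `is_word` flag is an added assumption so that a word which is also a prefix of a longer word can still be recognized.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # edge letter -> child TrieNode
        self.is_word = False  # assumption: marks the end of a stored word


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Insert word letter by letter; return how many nodes were created."""
        node = self.root
        created = 0
        for ch in word:
            if ch not in node.children:   # prefix not present yet: create it
                node.children[ch] = TrieNode()
                created += 1
            node = node.children[ch]      # prefix present: share it
        node.is_word = True
        return created

    def _walk(self, text):
        """Follow the edges spelled by text; return the node reached or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for w in ["inn", "int", "at", "age", "adv", "ant"]:
    trie.insert(w)

# Inserting "add" reuses the edges a and d and creates one new node.
new_nodes = trie.insert("add")
```

After this runs, `trie.contains("int")` is true, while `trie.contains("in")` is false because "in" is only a shared prefix, not a stored word.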

(2) suffix tree

A suffix tree is a compressed trie that contains all the suffixes of a string. First, the definition of a suffix. Given a string S = s1 s2 ... si ... sn of length n and an integer i with 1 <= i <= n, the substring si si+1 ... sn is a suffix of S. Take the string S = xmadamyx as an example. Its length is 8, so S[1..8], S[2..8], ..., S[8..8] are all suffixes of S. The empty string is usually counted as a suffix as well. This gives the suffixes listed below. The suffix S[i..n] is said to start at position i.

  1. S[1..8], xmadamyx, that is, the string itself, starting from 1
  2. S[2..8], madamyx, starting from 2
  3. S[3..8], adamyx, starting from 3
  4. S[4..8], damyx, starting from 4
  5. S[5..8], amyx, starting from 5
  6. S[6..8], myx, starting from 6
  7. S[7..8], yx, starting from 7
  8. S[8..8], x, starting from 8
  9. The empty string, written as $.
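
The list above can be reproduced mechanically. The following is a small sketch; the function name `suffixes` is hypothetical, and giving the empty suffix the position n + 1 is a convention chosen for this example only.

```python
def suffixes(s):
    """Return (start, suffix) pairs with 1-based start positions.

    The empty suffix is appended last and given position len(s) + 1
    (an assumption made for this sketch).
    """
    pairs = [(i + 1, s[i:]) for i in range(len(s))]
    pairs.append((len(s) + 1, ""))
    return pairs


for start, suffix in suffixes("xmadamyx"):
    print(start, suffix if suffix else "$")
```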

All these suffix strings form a dictionary tree:

On closer inspection, much of this tree can be compressed. For example, the branches marked with blue boxes are chains in which every node has a single child, so there is no need to spend a separate node and edge on each character. If an edge is allowed to carry multiple letters, every non-branching path can be compressed into a single edge. In addition, the edges already carry the full suffix information, so the nodes no longer need string labels. It is enough to record the starting position of each suffix on its leaf node. So we get:

Some suffixes are lost in such a structure. For example, the suffix x disappears because it is a prefix of the string xmadamyx, so it ends in the middle of an edge rather than at a leaf. To avoid this, we also require that no suffix be a prefix of another suffix. This is easy to arrange: append the empty-string marker $, a character that occurs nowhere else in the string. For example, before processing xmadamyx, we first change it to xmadamyx$, and then build the suffix tree.
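
The effect of the terminator can be checked directly. This sketch uses a hypothetical helper, `prefix_conflicts`, that lists every pair of non-empty suffixes in which one is a proper prefix of the other; it assumes that $ does not occur in the input string.

```python
def prefix_conflicts(s):
    """Return (shorter, longer) pairs where one suffix is a prefix of another."""
    sufs = [s[i:] for i in range(len(s))]
    return [(a, b) for a in sufs for b in sufs
            if a != b and b.startswith(a)]


without_marker = prefix_conflicts("xmadamyx")   # the suffix x clashes with xmadamyx
with_marker = prefix_conflicts("xmadamyx$")     # no clashes once $ is appended
```

Without the marker the only conflict is the pair ("x", "xmadamyx"); with the marker the list is empty, so every suffix ends at its own leaf.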

This forms a suffix tree. Mature algorithms, such as Ukkonen's algorithm, can build a suffix tree in O(n) time.
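
To make the compressed structure concrete, here is a naive construction that inserts the suffixes one at a time and splits an edge wherever two suffixes diverge. It is only a sketch: it runs in quadratic time and is not one of the linear-time algorithms. The names `Node`, `build_suffix_tree`, and `leaf_starts` are invented for this example, and the code assumes the terminator does not occur in the input.

```python
class Node:
    def __init__(self):
        self.children = {}  # first letter of edge label -> (label, child Node)
        self.start = None   # 1-based suffix start position, set on leaves only


def build_suffix_tree(text, terminator="$"):
    """Naive O(n^2) construction; assumes terminator is not in text."""
    s = text + terminator
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this letter: hang a new leaf here.
                leaf = Node()
                leaf.start = i + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            # Length of the common prefix of the edge label and the suffix.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # Whole edge matched: continue from the child node.
                node, rest = child, rest[k:]
                continue
            # Partial match: split the edge at position k.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], mid)
            node, rest = mid, rest[k:]
    return root


def leaf_starts(node):
    """Collect the suffix start positions stored on the leaves below node."""
    if not node.children:
        return [node.start]
    starts = []
    for _label, child in node.children.values():
        starts.extend(leaf_starts(child))
    return sorted(starts)


tree = build_suffix_tree("xmadamyx")
```

For xmadamyx$ the root ends up with six outgoing edges (one per distinct first letter, including $), and the leaves carry the start positions 1 through 9, matching the suffix list given earlier.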

(3) generalized suffix tree

A traditional suffix tree can only handle the suffixes of a single word. A generalized suffix tree stores all the suffixes of any number of words. For example, the strings "abab" and "baba" are first joined using special terminators, giving "abab$baba#". Then the suffix tree of the joined string is built and traversed. Wherever a terminator such as "$" or "#" is encountered, everything after it is discarded, that is, the subtree rooted at that node is removed. The result is the generalized suffix tree of the original group of strings. In essence, all the suffixes of the two strings, that is, abab$, bab$, ab$, b$, baba#, aba#, ba#, a#, are put into one trie, which is then compressed. A common application of the generalized suffix tree is judging the similarity of two strings, for example by finding their longest common substring.
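
The idea can be illustrated with a short sketch. `generalized_suffixes` lists the terminated suffixes exactly as in the example above. `longest_common_substring` then puts the suffixes of two strings into one uncompressed trie and, instead of terminators, records on each node which input string passed through it; that owner set is a simplification chosen for this example. Both function names are hypothetical, and the trie is not compressed, so this is far less efficient than a real generalized suffix tree.

```python
def generalized_suffixes(words, terminators="$#"):
    """List every suffix of every word, each ending in that word's terminator."""
    if len(words) > len(terminators):
        raise ValueError("need one distinct terminator per word")
    out = []
    for word, term in zip(words, terminators):
        out.extend(word[i:] + term for i in range(len(word)))
    return out


def longest_common_substring(a, b):
    """Find one longest substring shared by a and b using a suffix trie."""
    root = {"owners": set(), "next": {}}
    for owner, word in enumerate((a, b)):
        for i in range(len(word)):
            node = root
            for ch in word[i:]:
                node = node["next"].setdefault(
                    ch, {"owners": set(), "next": {}})
                node["owners"].add(owner)  # this prefix occurs in `owner`
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["next"].items():
            if len(child["owners"]) == 2:  # path occurs in both strings
                if len(prefix) + 1 > len(best):
                    best = prefix + ch
                stack.append((child, prefix + ch))
    return best


suffix_list = generalized_suffixes(["abab", "baba"])
common = longest_common_substring("abab", "baba")
```

For "abab" and "baba" the longest shared substrings have length 3 ("aba" and "bab"); which of the two is returned depends on the traversal order.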
