Dictionary Tree Trie

Source: Internet
Author: User

I. The dictionary tree

A trie, also known as a prefix tree, word lookup tree, or key tree, is a multi-way tree structure.

The example here is a trie that represents the keyword set {"A", "to", "Tea", "Ted", "Ten", "I", "in", "Inn"}. The basic properties of a trie can be summed up as follows:

    • The root node contains no character; every node other than the root contains exactly one character.
    • Concatenating the characters on the path from the root to a node gives the string that corresponds to that node.
    • The children of any one node all contain different characters.

Typically, when implemented, a flag is set in the node structure to mark whether a word (keyword) is formed at that point.

As you can see, the keywords of a trie are usually strings, and the trie stores each keyword along a path rather than in a single node. In addition, two keywords with a common prefix share the same path for that prefix, which is why a trie is also called a prefix tree.
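The properties above suggest a node layout along the following lines. This is only a minimal C++ sketch: the names `TrieNode`, `is_word`, `insert`, and `count_nodes` are illustrative assumptions rather than anything from the original article, and the `std::map` of children is just one possible representation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical node layout: the character of a node is the map key under
// which its parent stores it, so the root itself holds no character.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;  // distinct characters only
    bool is_word = false;  // flag: does a keyword end at this node?
};

// Walk down from the root, creating missing nodes, then flag the last node.
void insert(TrieNode& root, const std::string& key) {
    TrieNode* node = &root;
    for (char c : key) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child.reset(new TrieNode());
        node = child.get();
    }
    node->is_word = true;
}

// Count all nodes in the subtree, including the node passed in.
std::size_t count_nodes(const TrieNode& node) {
    std::size_t total = 1;
    for (const auto& entry : node.children) total += count_nodes(*entry.second);
    return total;
}
```

Inserting "tea", "ted", and "ten" into an empty trie creates only five nodes below the root (t, e, a, d, n) for nine characters of input, because the shared prefix "te" is stored once.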

II. Advantages and disadvantages of the dictionary tree

Advantages
    1. Insertion and lookup are both efficient: each takes O(m), where m is the length of the string being inserted or queried.

      • On lookup, one might object that a hash table's O(1) time complexity is faster. But the efficiency of a hash lookup depends on the quality of the hash function: a poor hash function causes many collisions, and the lookup is then no faster than a trie.
    2. Different keywords in a trie never collide.

    3. The only collision-like situation in a trie arises when a single keyword is allowed to be associated with multiple values.

    4. A trie does not need to compute a hash value, so it is faster for short strings. Computing a hash value usually requires traversing the whole string anyway.

    5. A trie can output its keywords sorted in dictionary order.

      • Dictionary order (lexicographical order) compares strings character by character, in alphabetical order or from smaller to larger character values, and arranges the strings from smallest to largest.
    6. Each trie can be seen as a simple deterministic finite automaton (DFA): for any given state (①) of the automaton and any character (②) of the automaton's alphabet, the transition function (③) determines the next state. Here:

      • ① Each node of the trie is a state of the automaton.
      • ② The characters of the automaton's alphabet appear as the branches that different characters form in the tree.
      • ③ Moving from the current node to a node on the next level is an application of the state transition function.

      The core idea is to trade space for time: the trie uses the common prefixes of strings to avoid needless string comparisons, which improves query efficiency.

Disadvantages
    1. When the hash function is good, trie lookup is less efficient than hash lookup.

    2. Space consumption is high.

III. Applications of the trie
  1. String retrieval

    Lookup is the most basic trie operation. The idea is to start at the root node and compare the query one character at a time.

    • If a differing character is found along the way, the string is not in the collection.
    • If every character matches, it is still necessary to check the flag on the last node (which marks whether that node ends a keyword).
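The two cases of the lookup can be sketched as follows. This is a minimal illustration under assumptions: `TrieNode`, `is_word`, `insert`, and `contains` are hypothetical names, and children are kept in a `std::map` for brevity.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool is_word = false;  // the flag checked at the end of a lookup
};

void insert(TrieNode& root, const std::string& key) {
    TrieNode* node = &root;
    for (char c : key) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child.reset(new TrieNode());
        node = child.get();
    }
    node->is_word = true;
}

bool contains(const TrieNode& root, const std::string& key) {
    const TrieNode* node = &root;
    for (char c : key) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return false;  // case 1: mismatch on the path
        node = it->second.get();
    }
    return node->is_word;  // case 2: all characters matched, the flag decides
}
```

With only "tea" and "ten" inserted, `contains(root, "te")` returns false even though every character matches, because the node for "te" is not flagged as a keyword.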
  2. Word frequency statistics
    Search engines often use tries for word frequency statistics.
    Idea: to count word frequencies, extend the node structure with an integer variable count. Insert each keyword in turn: if the keyword is already present, add 1 to its count; if not, insert it and set its count to 1.
    (Applications 1 and 2 can also be implemented with a hash table.)
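A minimal sketch of the counting variant, assuming a hypothetical `TrieNode` whose flag is replaced by an integer `count`; the function names `add_word` and `frequency` are illustrative, not from the original article.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    int count = 0;  // how many times the word ending at this node was inserted
};

// Insert one occurrence of the word: a new word ends with count 1,
// a repeated word has its count incremented.
void add_word(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child.reset(new TrieNode());
        node = child.get();
    }
    ++node->count;
}

// Return the recorded frequency, or 0 if the word was never inserted.
int frequency(const TrieNode& root, const std::string& word) {
    const TrieNode* node = &root;
    for (char c : word) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return 0;
        node = it->second.get();
    }
    return node->count;
}
```

A count of 0 doubles as the "not a keyword" marker, so prefixes that were never inserted as whole words report a frequency of 0.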

  3. String sort
    A trie can sort a large number of strings in dictionary order, and the idea is simple: insert every keyword into the trie, keep the children of each node ordered alphabetically, and then output all keywords with a pre-order traversal of the trie.
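The sorting idea can be sketched as below. The sketch assumes children stored in a `std::map`, which iterates in ascending character-code order, and a per-node `count` so that duplicate strings survive; `trie_sort`, `collect`, and the other names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;  // iterated in character order
    int count = 0;  // number of inserted strings that end here
};

void insert(TrieNode& root, const std::string& key) {
    TrieNode* node = &root;
    for (char c : key) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child.reset(new TrieNode());
        node = child.get();
    }
    ++node->count;
}

// Pre-order traversal: emit the string ending at this node first,
// then descend into the children in character order.
void collect(const TrieNode& node, std::string& prefix, std::vector<std::string>& out) {
    for (int i = 0; i < node.count; ++i) out.push_back(prefix);
    for (const auto& entry : node.children) {
        prefix.push_back(entry.first);
        collect(*entry.second, prefix, out);
        prefix.pop_back();
    }
}

std::vector<std::string> trie_sort(const std::vector<std::string>& keys) {
    TrieNode root;
    for (const std::string& key : keys) insert(root, key);
    std::vector<std::string> out;
    std::string prefix;
    collect(root, prefix, out);
    return out;
}
```

Because the order follows character codes, uppercase ASCII letters sort before lowercase ones in this sketch; a case-insensitive order would need the keys normalized first.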

  4. Prefix matching
    For example, to find all strings that start with "ab" in a collection of strings, we only need to build a trie from all the strings and then output the keywords on the paths that begin with a->b->. Trie prefix matching is commonly used for search suggestions: as you type a URL, for example, possible completions can be looked up automatically. When there is no exact match for a search, the most similar candidates sharing a prefix can be returned.
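A minimal sketch of prefix matching: walk down the path spelled by the prefix, then enumerate every keyword in the subtree below it. The names `keys_with_prefix` and `collect` are illustrative assumptions.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool is_word = false;
};

void insert(TrieNode& root, const std::string& key) {
    TrieNode* node = &root;
    for (char c : key) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child.reset(new TrieNode());
        node = child.get();
    }
    node->is_word = true;
}

// Append every keyword in the subtree of `node`; `path` is the string
// spelled from the root down to `node`.
void collect(const TrieNode& node, std::string& path, std::vector<std::string>& out) {
    if (node.is_word) out.push_back(path);
    for (const auto& entry : node.children) {
        path.push_back(entry.first);
        collect(*entry.second, path, out);
        path.pop_back();
    }
}

std::vector<std::string> keys_with_prefix(const TrieNode& root, const std::string& prefix) {
    const TrieNode* node = &root;
    for (char c : prefix) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return {};  // no keyword has this prefix
        node = it->second.get();
    }
    std::vector<std::string> out;
    std::string path = prefix;
    collect(*node, path, out);
    return out;
}
```

The cost of the first phase depends only on the length of the prefix; the second phase is proportional to the size of the matching subtree, which is what makes the approach attractive for search suggestions.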

  5. As an auxiliary structure
    For example, for suffix trees and AC automata (Aho-Corasick automata).
    Finite automata reference: http://blog.csdn.net/yukuninfoaxiom/article/details/6057736

  6. Compared to a hash table
    Advantages:

    • In the worst case, a trie lookup is faster than a lookup in an imperfect hash table (one that chains colliding keys in linked lists). For a trie, the worst case is O(m), where m is the length of the lookup string. An imperfect hash table has key collisions (different keys with the same hash value), so its worst case is O(n), where n is the number of keys stored. Typically, a hash table spends O(m) computing the hash and O(1) on the lookup itself.

    • Different keys in a trie never collide.

    • Buckets in a trie, similar to the hash table buckets that store colliding keys, are needed only when a single key is associated with multiple values.

    • There is no need to provide a hash function, or to change the hash function as more keys are added to the trie.

    • A trie can provide the keys in alphabetical order.

    Disadvantages:

    • Trie lookup is slower than a hash table in some cases, especially when the data lives on disk or another device whose random access time is much higher than that of main memory.

    • When the keys are of certain types (such as floating-point numbers), the prefix chains are long and the prefixes are not particularly meaningful.

    • Some tries consume more memory than a hash table. A trie may allocate memory for each character of each string, whereas most hash tables allocate a single block of memory for the whole entry.

  7. Compared to a binary search tree

    A binary search tree, also known as a binary sort tree, satisfies the following:

    • For any node, if the left subtree is not empty, the values of all nodes in the left subtree are smaller than the value of that node;

    • For any node, if the right subtree is not empty, the values of all nodes in the right subtree are greater than the value of that node;

    • The left and right subtrees of any node are themselves binary search trees;

    • No two nodes have the same value.

    The advantage of a binary search tree lies in the time complexity of lookup and insertion, which is usually only O(log n); many collection types are implemented with it. An insertion essentially adds a new leaf node to the tree and so avoids moving existing nodes. The cost of lookup, insertion, and deletion is proportional to the height of the tree, which is usually O(log n). In the worst case, every node of the tree has only one child, the tree degenerates into a linear list, and the complexity becomes O(n).

    In the worst case, a trie lookup is faster than a binary search tree lookup: if m is the length of the search string, the trie needs only O(m), and because the number of nodes n in the tree is usually much larger than m, this is far less than O(n).

IV. Implementation
    #include <cstdio>

    #define ALPHABET_SIZE 26

    typedef struct trie_node
    {
        int count;                            // number of times the word ending at this node was inserted
        trie_node *children[ALPHABET_SIZE];   // one child per lowercase letter
    } *trie;

    trie_node* create_trie_node()
    {
        trie_node* pNode = new trie_node();
        pNode->count = 0;
        for (int i = 0; i < ALPHABET_SIZE; ++i)
            pNode->children[i] = NULL;
        return pNode;
    }

    // Insert: keys must consist of lowercase letters 'a'..'z' only.
    void trie_insert(trie root, const char* key)
    {
        trie_node* node = root;
        const char* p = key;
        while (*p)
        {
            if (node->children[*p - 'a'] == NULL)
            {
                node->children[*p - 'a'] = create_trie_node();
            }
            node = node->children[*p - 'a'];
            ++p;
        }
        node->count += 1;
    }

    /**
     * Query: returns 0 if the key is absent, otherwise its number of occurrences.
     */
    int trie_search(trie root, const char* key)
    {
        trie_node* node = root;
        const char* p = key;
        while (*p && node != NULL)
        {
            node = node->children[*p - 'a'];
            ++p;
        }
        if (node == NULL)
            return 0;
        else
            return node->count;
    }

    int main()
    {
        // Keyword collection
        char keys[][8] = {"the", "a", "there", "answer", "any", "by", "bye", "their"};
        trie root = create_trie_node();

        // Create the trie
        for (int i = 0; i < 8; i++)
            trie_insert(root, keys[i]);

        // Retrieve strings
        char s[][32] = {"Present in Trie", "not present in Trie"};
        printf("%s --- %s\n", "the", trie_search(root, "the") > 0 ? s[0] : s[1]);
        printf("%s --- %s\n", "these", trie_search(root, "these") > 0 ? s[0] : s[1]);
        printf("%s --- %s\n", "their", trie_search(root, "their") > 0 ? s[0] : s[1]);
        printf("%s --- %s\n", "thaw", trie_search(root, "thaw") > 0 ? s[0] : s[1]);
        return 0;
    }

For tries, we generally implement only the insert and search operations. This code can be used both to retrieve words and to count word frequencies.

V. Trie improvements
    1. Bitwise trie: the principle is the same as that of an ordinary trie, except that the smallest unit an ordinary trie stores is a character, whereas a bitwise trie stores a single bit. Bit access is implemented directly by CPU instructions, so for binary data a bitwise trie is in theory faster than an ordinary trie.
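A minimal sketch of a bitwise trie over 32-bit unsigned integers, assuming keys are consumed from the most significant bit down. The names `BitNode`, `insert`, and `contains` are hypothetical, and the fixed key width is a simplifying assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Each node has at most two children: one for bit 0 and one for bit 1.
struct BitNode {
    std::unique_ptr<BitNode> child[2];
};

// Insert a key by following (and creating) one node per bit, MSB first.
void insert(BitNode& root, std::uint32_t key) {
    BitNode* node = &root;
    for (int bit = 31; bit >= 0; --bit) {
        unsigned b = (key >> bit) & 1u;
        if (!node->child[b]) node->child[b].reset(new BitNode());
        node = node->child[b].get();
    }
}

// Every key is exactly 32 bits long, so reaching depth 32 means the key
// was inserted; no separate end-of-key flag is needed in this sketch.
bool contains(const BitNode& root, std::uint32_t key) {
    const BitNode* node = &root;
    for (int bit = 31; bit >= 0; --bit) {
        unsigned b = (key >> bit) & 1u;
        node = node->child[b].get();
        if (!node) return false;
    }
    return true;
}
```

Each step is a shift and a mask, which is the sense in which bit access maps directly onto CPU instructions.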

    2. Node compression

      • ① Branch compression: a stable trie serves essentially only lookups and reads, so some of its branches can be compressed completely. For example, the right-most branch, Inn, can be compressed directly into a single node "Inn" instead of existing as a regular subtree. The radix tree uses this principle to solve the problem of tries being too deep.

      • ② Node mapping table: this technique also applies when the nodes of the trie are almost completely fixed. If many of the node states in the trie are repeated, the states can be represented through a multidimensional array of numbers (such as a triple-array trie). The space overhead of storing the trie itself then becomes smaller, even though an additional mapping table is introduced.

    3. Double-array trie (DAT)

      The double-array trie is a data structure proposed to improve space utilization while preserving the lookup speed of the trie. It is essentially a deterministic finite automaton (DFA). (A DFA is an automaton whose transitions are fully determined: for any given state of the automaton and any character of its alphabet, it moves to the next state according to a transition function given in advance.)

      In a DAT, each node represents a state of the automaton. The automaton makes a state transition depending on the input, and the query completes when it reaches an end state or when no transition is possible.

      Reference: http://blog.csdn.net/zzran/article/details/8462002
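The state transition of a double-array trie can be sketched as follows. The article does not spell out the array layout, so this sketch assumes the commonly described base/check scheme, in which a transition from state s on character code k goes to t = base[s] + k and is valid only if check[t] == s. The arrays are hand-built for the two keys "ab" and "ac" over the alphabet a..c; `DoubleArray`, `code`, `make_example`, `contains`, and the separate `is_word` array are all illustrative assumptions.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct DoubleArray {
    std::vector<int> base;      // base[s] + code(c) is the candidate next state
    std::vector<int> check;     // check[t] must equal s for the move s -> t to be valid
    std::vector<bool> is_word;  // assumed extra array marking end states
};

// Character codes for the toy alphabet: a=1, b=2, c=3; 0 means "not in alphabet".
int code(char c) { return (c >= 'a' && c <= 'c') ? c - 'a' + 1 : 0; }

// Hand-built arrays for {"ab", "ac"}: root is state 0, "a" is state 2,
// "ab" is state 4, "ac" is state 5; states 1 and 3 are unused.
DoubleArray make_example() {
    DoubleArray da;
    da.base    = {1, 0, 2, 0, 0, 0};
    da.check   = {-1, -1, 0, -1, 2, 2};
    da.is_word = {false, false, false, false, true, true};
    return da;
}

bool contains(const DoubleArray& da, const std::string& key) {
    int s = 0;  // start at the root state
    for (char c : key) {
        int k = code(c);
        if (k == 0) return false;
        int t = da.base[s] + k;
        // No transition is possible: the slot is out of range or owned by another state.
        if (t >= static_cast<int>(da.check.size()) || da.check[t] != s) return false;
        s = t;
    }
    return da.is_word[s];  // accepted only if the walk ends in an end state
}
```

Lookup touches only two flat arrays per character, which is where the space saving over pointer-based nodes comes from; building the arrays for an arbitrary key set is the hard part and is not attempted here.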

VI. Other forms of trie trees

The figure mainly explains the relationships among these algorithms and data structures. The yellow parts of the figure describe some key points of each.

The figure shows the following relationships: extended KMP (EXTEND-KMP) is an extension of KMP; the AC automaton is the multi-string form of KMP and is a finite automaton, whereas the trie graph is a deterministic finite automaton; AC automata, trie graphs, and suffix trees are all kinds of trie; suffix arrays and suffix trees are both data structures built on the set of suffixes of a string; and the suffix pointer in a trie graph and the suffix link in a suffix tree are the same concept.

Performance Comparison of trie trees

Reference Blog http://www.hankcs.com/nlp/performance-comparison-of-several-trie-tree.html

