Detailed description and application of the Trie tree

Last Update:2015-01-16 Source: Internet

Author: User

The Trie tree, also known as the word search tree or key tree, is a tree structure and is sometimes described as a variant of the hash tree. A typical application is counting and sorting a large number of strings (though it is not limited to strings), so search engine systems often use it for word frequency statistics on text. Its advantage is that it minimizes unnecessary string comparisons, so its query efficiency can be higher than that of a hash table. The core idea of the Trie is to trade space for time: the common prefixes of the strings are used to reduce query time.

The power of the Trie lies in its time complexity. Insertion and lookup both take O(k) time, where k is the length of the key, regardless of how many elements are stored in the Trie. A hash table lookup is usually described as O(1), but computing the hash of the key already takes O(k), and collisions must also be handled. The disadvantage of the Trie is its high space consumption.

Trie Tree features:

1) The root node does not contain a character. Every node other than the root contains exactly one character.

2) Concatenating the characters on the path from the root node to a given node yields the string corresponding to that node.

3) The children of any given node all contain different characters.

4) If the alphabet has n characters, each node has up to n outgoing branches. Reserving n child slots per node is the space-for-time trade-off in practice, and it wastes a lot of space.

5) Insertion and search take O(k) time, where k is the length of the string.

Basic idea (taking a tree of letters as an example):
1. Insert process
For a word, start from the root and, for each letter of the word in turn, follow the corresponding branch down the tree, creating any node that does not exist yet. When the whole word has been traversed, mark the last node red to indicate that the word has been inserted into the Trie tree.
2. Query process
Similarly, start from the root and traverse the Trie tree letter by letter. If a required node does not exist, or the last node reached is not marked red after the whole word has been traversed, the word does not exist. If the last node is marked red, the word exists.

Data Structure of the dictionary tree:
Build a dictionary tree from the strings. The dictionary tree stores the common prefixes of the strings, which reduces the cost of query operations.
The following uses a dictionary tree built from English words as an example. Each node in the Trie tree has up to 26 child nodes, because there are 26 English letters (assuming the words consist of lowercase letters only).
The node of the Trie tree can be declared as the following struct:

typedef struct Trie_node
{
    int count;                   // counts how many times a word prefix appears
    struct Trie_node *next[26];  // pointers to the child nodes (subtrees)
    bool exist;                  // whether a word ends at this node
} TrieNode, *Trie;

next is an array of pointers to the child nodes.
For example, given the strings "abc", "ab", "bd", and "dda", a Trie tree is built by inserting them in that order. In the resulting tree the root has three children: a, b, and d. Below a lies the path a, b, c, with b and c marked red (the words "ab" and "abc"). Below b lies the path b, d, with d marked red ("bd"). Below d lies the path d, d, a, with the final a marked red ("dda").

The root node of the Trie tree does not contain any information. The first string is "abc" and its first letter is 'a', so the entry of the root's next array at subscript 'a' - 97 (that is, 0) is not NULL. The remaining letters are handled in the same way. In the constructed Trie tree, a red node indicates that a word ends there. To check whether the word "abc" exists, the search cost is O(len), where len is the length of the string being searched. If the strings are instead compared one by one, the cost is O(len * n), where n is the number of strings. The Trie-based search is clearly much more efficient.
In the example above, the words abc, ab, bd, and dda exist in the Trie tree. In real problems, the color mark can be replaced with another variable that suits the problem, such as count.

Suppose there are n words consisting of lowercase letters, with an average length of 10. The task is to determine whether any string is a prefix of another string.

The following three methods are compared:

1. The most obvious way is, for each string, to scan the string set from beginning to end and check whether it is a prefix of some other string in the set. The complexity is O(n ^ 2) string comparisons.
2. Use hash: store all prefixes of all strings in a hash table. Building the table for all the prefixes costs O(n * len). Querying costs O(n) * O(1) = O(n).

3. Trie: when querying whether the string abc is a prefix of some string, strings starting with b, c, d, ... obviously need not be examined. Only strings starting with a matter, which quickly narrows the search. Building the Trie costs O(n * len), and in a Trie, creation and querying can be done at the same time: the insertion pass itself can serve as the query, which the hash approach above does not do. The total complexity is therefore O(n * len), and a single query costs only O(len).
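
The idea that creation and querying can happen in the same pass can be sketched as follows. This is a minimal illustration, not the article's own code: the names PrefixNode, insertAndCheckPrefix, and hasPrefixPair are hypothetical, and the sketch assumes non-empty words made of lowercase letters only. Two equal strings are treated as prefixes of each other.

```cpp
#include <string>
#include <vector>

// Hypothetical node type for this sketch (lowercase a-z only).
struct PrefixNode {
    PrefixNode* next[26] = {};  // child pointers, all null initially
    bool exist = false;         // true if a word ends at this node
    ~PrefixNode() { for (PrefixNode* child : next) delete child; }
};

// Inserts `word` (assumed non-empty) and reports whether it is a prefix of,
// or has as a prefix, any word inserted earlier. Costs O(len) per word.
bool insertAndCheckPrefix(PrefixNode* root, const std::string& word) {
    PrefixNode* node = root;
    bool conflict = false;
    bool createdNode = false;
    for (char ch : word) {
        int id = ch - 'a';
        if (node->next[id] == nullptr) {
            node->next[id] = new PrefixNode();
            createdNode = true;
        }
        node = node->next[id];
        if (node->exist) conflict = true;  // an earlier word ends on this path
    }
    // If no node had to be created, the whole path already existed, so
    // `word` is a prefix of (or equal to) an earlier word.
    if (!createdNode) conflict = true;
    node->exist = true;
    return conflict;
}

// Returns true if some word in `words` is a prefix of another one.
bool hasPrefixPair(const std::vector<std::string>& words) {
    PrefixNode root;  // the destructor frees the whole tree on return
    bool found = false;
    for (const std::string& w : words) {
        if (insertAndCheckPrefix(&root, w)) found = true;
    }
    return found;
}
```

With the four example words "abc", "ab", "bd", and "dda", hasPrefixPair reports a match because inserting "ab" creates no new node. The whole check is a single pass over the input, which is the O(n * len) total mentioned above.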

Trie Tree operations
There are three main operations on the Trie tree: insert, search, and delete. Deleting a single node is rare in practice, so usually only deletion of the entire tree is needed.
1. Insert
Assume the string is str and the root node of the Trie tree is root. Let i = 0 and p = root.
1) Take str[i] and check whether p->next[str[i] - 97] is null. If it is null, create a node temp, point p->next[str[i] - 97] to temp, and then set p = temp.
If it is not null, set p = p->next[str[i] - 97].
2) Increment i and repeat step 1) with the next str[i] until the terminator '\0' is encountered, then set exist in the current node p to true.
2. Search
Assume the string to be searched is str and the root node of the Trie tree is root. Let i = 0 and p = root.
1) Take str[i] and check whether p->next[str[i] - 97] is null. If it is null, return false. If it is not null, set p = p->next[str[i] - 97] and move on to the next character.
2) Repeat step 1) until the terminator '\0' is encountered. If the current node p is not null and its exist is true, return true; otherwise, return false.
3. Delete

Deletion can be performed recursively.

The following program reads words until a blank line is entered, then prints, for each query line, the number of inserted words that start with it:

#include <cstdio>
#include <cstdlib>
#include <cstring>

typedef struct Trie_node
{
    int count;                   // counts how many times a word prefix occurs
    struct Trie_node *next[26];  // pointers to the child nodes (subtrees)
    bool exist;                  // whether a word ends at this node
} TrieNode, *Trie;

TrieNode *createTrieNode()
{
    TrieNode *node = (TrieNode *)malloc(sizeof(TrieNode));
    node->count = 0;
    node->exist = false;
    memset(node->next, 0, sizeof(node->next)); // initialize all children to null pointers
    return node;
}

void Trie_insert(Trie root, const char *word)
{
    Trie node = root;
    const char *p = word;
    int id;
    while (*p)
    {
        id = *p - 'a';
        if (node->next[id] == NULL)
        {
            node->next[id] = createTrieNode();
        }
        node = node->next[id]; // each inserted character moves the pointer one level down
        ++p;
        node->count += 1; // counts how often each prefix appears (including the whole word)
    }
    node->exist = true; // a word ends here
}

int Trie_search(Trie root, const char *word)
{
    Trie node = root;
    const char *p = word;
    int id;
    while (*p)
    {
        id = *p - 'a';
        node = node->next[id];
        ++p;
        if (node == NULL)
            return 0;
    }
    return node->count;
}

int main(void)
{
    Trie root = createTrieNode(); // initialize the root node of the dictionary tree
    char str[12];                 // up to 10 letters plus newline and terminator
    bool flag = false;
    while (fgets(str, sizeof(str), stdin))
    {
        str[strcspn(str, "\r\n")] = '\0'; // strip the trailing line break
        if (flag)
            printf("%d\n", Trie_search(root, str));
        else
        {
            if (strlen(str) != 0)
            {
                Trie_insert(root, str);
            }
            else
                flag = true;
        }
    }
    return 0;
}
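
The search and delete steps described in this section can be sketched as follows. This is a minimal sketch under stated assumptions rather than the article's own code: the names WordNode, insertWord, containsWord, and destroyTrie are hypothetical, the node layout mirrors the struct declared earlier, and input is assumed to be lowercase letters only. The search checks the exist flag, so it answers whether the exact word was inserted, and the delete frees the whole tree in post-order.

```cpp
// Hypothetical node type mirroring the article's struct (lowercase a-z only).
struct WordNode {
    int count;           // number of inserted words passing through this node
    WordNode* next[26];  // child pointers, one per lowercase letter
    bool exist;          // true if a word ends at this node
};

// Follows or creates one node per character, then marks the last node.
void insertWord(WordNode* root, const char* word) {
    WordNode* node = root;
    for (const char* p = word; *p; ++p) {
        int id = *p - 'a';
        if (node->next[id] == nullptr) {
            node->next[id] = new WordNode();  // value-initialized: all zero
        }
        node = node->next[id];
        node->count += 1;
    }
    node->exist = true;
}

// Returns true only if `word` itself was inserted (not merely a prefix).
bool containsWord(const WordNode* root, const char* word) {
    const WordNode* node = root;
    for (const char* p = word; *p; ++p) {
        node = node->next[*p - 'a'];
        if (node == nullptr) return false;
    }
    return node->exist;
}

// Frees the subtree rooted at `node` recursively (children first) and
// returns the number of nodes that were freed.
int destroyTrie(WordNode* node) {
    if (node == nullptr) return 0;
    int freed = 1;
    for (int i = 0; i < 26; ++i) {
        freed += destroyTrie(node->next[i]);
    }
    delete node;
    return freed;
}
```

For the example words "abc", "ab", "bd", and "dda", the tree holds nine nodes including the root, so destroyTrie on the root returns 9.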

Trie tree applications:

1. String search, word frequency statistics, and popular search engine queries

Store information about some known strings (a dictionary) in a Trie tree in advance, then look up whether, or how often, other unknown strings have occurred.

Example:

1) There is a 1 GB file in which each line contains one word. No word exceeds 16 bytes, and memory is limited to 1 MB. Return the 100 most frequent words.

2) Given a vocabulary list of N words and an article written entirely in lowercase English, output all words that are not in the vocabulary list, in order of first appearance.

3) Given a dictionary of bad words, all in lowercase letters, and a passage of text in which every line consists of lowercase letters, determine whether the text contains any bad word. For example, if rob is a bad word, then the text problem contains a bad word.

4) There are 10 million strings, some of which are duplicates. Remove all duplicates so that each distinct string is kept once.

5) Search for popular queries: a search engine uses log files to record every search string that users submit. The length of each query string is 1-bytes. Suppose there are currently 10 million records, and these query strings are highly repetitive. Although the total number is 10 million, there are no more than 3 million distinct strings once duplicates are removed. The more often a query string repeats, the more users query it and the more popular it is. Find the 10 most popular query strings. Memory use must not exceed 1 GB.
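
The counting step shared by examples 1) and 5) can be sketched as follows. This is a small-scale illustration only: the names FreqNode, addWord, collectCounts, and topFrequent are hypothetical, words are assumed to be lowercase letters only, and the sketch does not address the memory limits stated in the examples. It stores a per-word counter at the node where each word ends, collects the distinct words with a traversal, and sorts them by count. For very large inputs a bounded heap of size k would be a natural replacement for the final sort.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;

// Hypothetical node type for this sketch (lowercase a-z only).
struct FreqNode {
    FreqNode* next[26] = {};
    int wordCount = 0;  // how many times the exact word ending here was added
    ~FreqNode() { for (FreqNode* child : next) delete child; }
};

void addWord(FreqNode* root, const std::string& word) {
    FreqNode* node = root;
    for (char ch : word) {
        int id = ch - 'a';
        if (node->next[id] == nullptr) node->next[id] = new FreqNode();
        node = node->next[id];
    }
    node->wordCount += 1;  // duplicates end at the same node, so they merge
}

// Pre-order traversal that emits every distinct word with its count.
void collectCounts(const FreqNode* node, std::string& prefix,
                   std::vector<WordCount>& out) {
    if (node->wordCount > 0) out.push_back(WordCount(prefix, node->wordCount));
    for (int i = 0; i < 26; ++i) {
        if (node->next[i] != nullptr) {
            prefix.push_back(static_cast<char>('a' + i));
            collectCounts(node->next[i], prefix, out);
            prefix.pop_back();
        }
    }
}

// Returns up to k distinct words, most frequent first.
std::vector<WordCount> topFrequent(const std::vector<std::string>& words,
                                   std::size_t k) {
    FreqNode root;
    for (const std::string& w : words) addWord(&root, w);
    std::vector<WordCount> all;
    std::string prefix;
    collectCounts(&root, prefix, all);
    // stable_sort keeps the lexicographic order among words with equal counts.
    std::stable_sort(all.begin(), all.end(),
                     [](const WordCount& a, const WordCount& b) {
                         return a.second > b.second;
                     });
    if (all.size() > k) all.resize(k);
    return all;
}
```

Because duplicates collapse onto the same node, the list produced by collectCounts is also the de-duplicated string set asked for in example 4).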

2. Longest common prefix of strings

The Trie tree uses the common prefixes of multiple strings to save storage space. Conversely, once a large number of strings are stored in a Trie tree, the common prefix of any of them can be obtained quickly. Example:

1) Given N lowercase English strings and Q queries, each query asks for the length of the longest common prefix of two of the strings. Solution:

First, build the letter tree for all the strings. The length of the longest common prefix of two strings is then the depth of the lowest common ancestor of the nodes where they end (the root, which holds no character, has depth 0). The problem is thus reduced to the offline Lowest Common Ancestor (LCA) problem.

The lowest common ancestor problem is itself a classic problem. The following methods can be used:

1. Use a disjoint-set structure with the classic Tarjan algorithm;

2. Compute the Euler tour sequence of the letter tree and convert the problem into a typical Range Minimum Query (RMQ) problem.
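
The reduction from longest common prefix to lowest common ancestor can be illustrated with a much simpler method than the two listed above. The sketch below is not Tarjan's algorithm or the RMQ approach: it just climbs parent pointers, which costs O(len) per query. The names LcpNode, insertAndLocate, and lcpLength are hypothetical, and the strings are assumed to be lowercase letters only.

```cpp
#include <string>

// Hypothetical node type for this sketch (lowercase a-z only).
struct LcpNode {
    LcpNode* next[26] = {};
    LcpNode* parent = nullptr;
    int depth = 0;  // number of characters on the path from the root
    ~LcpNode() { for (LcpNode* child : next) delete child; }
};

// Inserts `word` and returns the node where it ends.
LcpNode* insertAndLocate(LcpNode* root, const std::string& word) {
    LcpNode* node = root;
    for (char ch : word) {
        int id = ch - 'a';
        if (node->next[id] == nullptr) {
            LcpNode* child = new LcpNode();
            child->parent = node;
            child->depth = node->depth + 1;
            node->next[id] = child;
        }
        node = node->next[id];
    }
    return node;
}

// Length of the longest common prefix of the two strings ending at a and b:
// the depth of their lowest common ancestor, found by climbing to equal
// depth and then moving both pointers up until they meet.
int lcpLength(const LcpNode* a, const LcpNode* b) {
    while (a->depth > b->depth) a = a->parent;
    while (b->depth > a->depth) b = b->parent;
    while (a != b) {
        a = a->parent;
        b = b->parent;
    }
    return a->depth;
}
```

When Q is large and the strings are long, the offline methods listed above answer each query faster than this climb does.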

3. Sorting

The Trie tree is a multi-way tree. A pre-order traversal of the whole tree, visiting the children of each node in alphabetical order, outputs the stored strings in lexicographic order.

Example: given N distinct English names, each consisting of a single word, sort them in ascending lexicographic order.

4. As an auxiliary structure for other data structures and algorithms

Examples include the suffix tree and the Aho-Corasick (AC) automaton.
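
The sorting idea can be sketched as follows. This is a minimal illustration with hypothetical names (SortNode, collectSorted, trieSort), assuming the names are distinct and written in lowercase letters only. A word is emitted before its children are visited, so a name that is a prefix of another one comes first.

```cpp
#include <string>
#include <vector>

// Hypothetical node type for this sketch (lowercase a-z only).
struct SortNode {
    SortNode* next[26] = {};
    bool exist = false;  // true if a name ends at this node
    ~SortNode() { for (SortNode* child : next) delete child; }
};

// Pre-order traversal: emit the word at this node, then visit the children
// in alphabetical order.
void collectSorted(const SortNode* node, std::string& prefix,
                   std::vector<std::string>& out) {
    if (node->exist) out.push_back(prefix);
    for (int i = 0; i < 26; ++i) {
        if (node->next[i] != nullptr) {
            prefix.push_back(static_cast<char>('a' + i));
            collectSorted(node->next[i], prefix, out);
            prefix.pop_back();
        }
    }
}

// Returns the distinct names in ascending lexicographic order.
std::vector<std::string> trieSort(const std::vector<std::string>& names) {
    SortNode root;
    for (const std::string& name : names) {
        SortNode* node = &root;
        for (char ch : name) {
            int id = ch - 'a';
            if (node->next[id] == nullptr) node->next[id] = new SortNode();
            node = node->next[id];
        }
        node->exist = true;
    }
    std::vector<std::string> sorted;
    std::string prefix;
    collectSorted(&root, prefix, sorted);
    return sorted;
}
```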

Advanced implementation of the Trie tree:

A double-array representation can be used. Using a double array can greatly reduce memory usage.
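
The lookup rule of a double array can be illustrated with a tiny hand-built table. This sketch only shows how a transition is checked. It does not show how the arrays are constructed, which is the hard part and is covered by the references in this section. The struct DoubleArrayTrie, the function makeExample, the terminal array, and the character codes (a=1, b=2, ...) are assumptions made for the illustration. A transition from state s on character c goes to t = base[s] + code(c) and is valid only if check[t] == s.

```cpp
#include <string>
#include <vector>

// Hypothetical container for this sketch; state 0 is the root.
struct DoubleArrayTrie {
    std::vector<int> base;       // base[s] + code(c) is the candidate next state
    std::vector<int> check;      // check[t] must equal s for the move to be valid
    std::vector<bool> terminal;  // terminal[t] is true if a word ends in state t

    bool contains(const std::string& word) const {
        int state = 0;
        for (char ch : word) {
            if (ch < 'a' || ch > 'z') return false;
            int next = base[state] + (ch - 'a' + 1);  // codes: a=1, b=2, ...
            if (next < 0 || next >= static_cast<int>(check.size())) return false;
            if (check[next] != state) return false;   // slot belongs to another state
            state = next;
        }
        return terminal[state];
    }
};

// Hand-built arrays for the two words "ab" and "ac":
// state 1 is reached by 'a' from the root, state 2 by 'b' from state 1,
// and state 3 by 'c' from state 1.
DoubleArrayTrie makeExample() {
    DoubleArrayTrie trie;
    trie.base = {0, 0, 0, 0};
    trie.check = {-1, 0, 1, 1};
    trie.terminal = {false, false, true, true};
    return trie;
}
```

Each state here costs two integers and a flag instead of 26 child pointers, which is where the memory saving comes from.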

(5) An Implementation of Double-Array Trie:

http://linux.thai.net/~thep/datrie/datrie.html

(6) An Efficient Implementation of Trie Structures:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.8665&rep=rep1&type=pdf

This article is an English version of an article that was originally written in Chinese on aliyun.com.
