Introduction to the Dictionary Tree (Trie) and a Simple Application

Source: Internet
Author: User

(2010-03-13)


1. Background

Vocabulary lookup and word-frequency statistics are string operations that search engines, text-processing systems, and similar services perform all the time. Consider a simple text-processing example: given an article of 10,000 words, count how many times the word "was" appears in it. Readers who have not taken a data structures course will probably use the simplest but least efficient method, exhaustive traversal: read all the words of the article into a large array of strings and compare each one with "was". Readers who have taken such a course will readily think of the two classic lookup structures, the balanced binary search tree and the hash table. For a balanced binary search tree (or its enhanced variant, the red-black tree), each word is the key and its number of occurrences is the value, and entries are inserted into the tree according to the balancing rules. For a hash table, a string hash function is applied, and words with the same hash value (the same hash address) are stored in the same table entry (hash bucket) so that they can be found later. The lookup efficiency of the three methods is analyzed briefly below (let n be the number of words and d the average word length):
(1) Exhaustive method: all n words must be traversed, and each string comparison takes O(d), so the total time complexity is O(dn);
(2) Balanced binary search tree: looking up a word takes O(log n) comparisons of O(d) each, so the total lookup time complexity is O(d log n);
(3) Hash table: computing the hash value of a word takes O(d); assuming the words hash into m buckets and are evenly distributed among them, the total time is O(d + n/m * d) = O(n/m * d).
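The three approaches analyzed above can be sketched with standard containers. This is a minimal illustration, not code from the original article: the function names (countByScan, buildTreeIndex, buildHashIndex, lookup) are hypothetical, std::map stands in for the balanced tree (it is typically implemented as a red-black tree), and std::unordered_map stands in for the hash table.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// (1) Exhaustive method: compare the target with every word, O(dn) in total.
std::size_t countByScan(const std::vector<std::string>& words,
                        const std::string& target) {
    std::size_t hits = 0;
    for (const std::string& w : words) {
        if (w == target) ++hits;
    }
    return hits;
}

// (2) Balanced binary search tree: word -> number of occurrences.
std::map<std::string, std::size_t> buildTreeIndex(
        const std::vector<std::string>& words) {
    std::map<std::string, std::size_t> index;
    for (const std::string& w : words) ++index[w];
    return index;
}

// (3) Hash table: word -> number of occurrences, grouped into buckets by hash.
std::unordered_map<std::string, std::size_t> buildHashIndex(
        const std::vector<std::string>& words) {
    std::unordered_map<std::string, std::size_t> index;
    for (const std::string& w : words) ++index[w];
    return index;
}

// Look up a word in either index; a word that was never seen counts as 0.
template <typename Index>
std::size_t lookup(const Index& index, const std::string& word) {
    typename Index::const_iterator it = index.find(word);
    return it == index.end() ? 0 : it->second;
}
```

With either index built once, lookup(index, "was") answers the question from the example without rescanning the article.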
Now we introduce a new data structure, the dictionary tree. It is designed specifically for text lookup, and for this kind of processing it outperforms the balanced binary tree and the hash table in both storage space and search efficiency.

2. Concept

Suppose we are looking up a word in a printed English dictionary. Using the index, we first find the set of words that begin with the word's first letter, then within that set the subset whose second letter matches, and so on until the whole word is found. A dictionary tree is organized and searched in the same way as such a physical dictionary.

The dictionary tree (trie), also known as the word lookup tree, is a tree structure for storing a large number of strings. Its advantage is that strings share their common prefixes, which saves storage space and in turn improves search efficiency. Its basic properties are:
(1) The root node contains no character, and every node other than the root contains exactly one character;
(2) Concatenating the characters on the path from the root to a node gives the string that corresponds to that node;
(3) The child nodes of any one node all contain different characters.

3. Construction of the Dictionary Tree

With the concept from Section 2 in mind, the construction method is not hard to work out. For example, given the 7 words b, abc, abd, bcd, abcd, efg, and hii, the dictionary tree built from them is shown in the following figure:

In the figure above, the path from the root to each node spells a string, and a node marked in red indicates that this string exists as a word. Using the notation defined in Section 1, consider efficiency first: if, after the article has been read into the dictionary tree, the number of occurrences of each word is recorded on the node of its last letter, then looking up how many times a word occurs afterwards takes only O(d). As for storage, the three structures in Section 1 all store every word in full: the three words ab, abc, and abde require all 9 of their letters to be stored, whereas in the dictionary tree words with the same prefix share the space for that prefix, so only 5 letters need to be stored. For processing large numbers of strings, the dictionary tree is therefore better than general-purpose data structures in both time efficiency and space.
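The storage comparison can be checked with a small sketch. It is only an illustration under assumptions: the Node type (children kept in a std::map) and the helper names insertWord, flatLetters, and trieLetters are hypothetical and differ from the array-based node used later in this article. Each trie node created corresponds to one stored letter.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Illustrative node: one child per distinct next character.
struct Node {
    std::map<char, std::unique_ptr<Node>> children;
};

// Insert a word and return how many new nodes (letters) had to be created.
std::size_t insertWord(Node& root, const std::string& word) {
    std::size_t created = 0;
    Node* cur = &root;
    for (char c : word) {
        std::unique_ptr<Node>& child = cur->children[c];
        if (!child) {
            child.reset(new Node);  // this letter is not shared with any earlier word
            ++created;
        }
        cur = child.get();
    }
    return created;
}

// Letters stored when every word is kept in full.
std::size_t flatLetters(const std::vector<std::string>& words) {
    std::size_t total = 0;
    for (const std::string& w : words) total += w.size();
    return total;
}

// Letters stored when the words share prefixes in a dictionary tree.
std::size_t trieLetters(const std::vector<std::string>& words) {
    Node root;
    std::size_t total = 0;
    for (const std::string& w : words) total += insertWord(root, w);
    return total;
}
```

For the words ab, abc, and abde, flatLetters gives 9 and trieLetters gives 5, matching the counts in the text.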

The basic operations of the dictionary tree are find, insert, and delete, although deletion is relatively rare. This article implements only deletion of the whole tree; deleting a single word is straightforward. The steps for searching for a keyword in the dictionary tree are:
(1) Start the search from the root node;
(2) Take the first letter of the keyword, select the corresponding subtree according to that letter, and continue the search in that subtree;
(3) In that subtree, take the second letter of the keyword and again select the corresponding subtree to search;
(4) Repeat the process ...
(5) When all the letters of the keyword have been consumed at some node, read the information attached to that node; the search is then complete.
The other operations are handled similarly. Brief code for a simple dictionary tree and its operations follows:

/***************************************************
Name: Trie tree
Description: Basic implementation of a trie tree, including find, insert and delete operations
***************************************************/
#include <algorithm>
#include <cstddef>
#include <iostream>
using namespace std;

const int sonnum = 26, base = 'a';

struct Trie {
    int num;                   // how many inserted words pass through this node, i.e. how many words share the prefix that ends at this node
    bool terminal;             // if terminal == true, a complete word ends at this node
    int count;                 // number of occurrences of the word whose last letter is this node
    struct Trie *son[sonnum];  // child nodes, one slot per letter 'a'..'z'
};

/*********************************
Create a new node
*********************************/
Trie *newtrie()
{
    Trie *temp = new Trie;
    temp->num = 1;
    temp->terminal = false;
    temp->count = 0;
    for (int i = 0; i < sonnum; ++i) temp->son[i] = NULL;
    return temp;
}

/*********************************
Insert a new word into the dictionary tree
pnt: root
s: new word (lowercase letters 'a'..'z' only)
len: length of the new word
*********************************/
void insert(Trie *pnt, const char *s, int len)
{
    Trie *temp = pnt;
    for (int i = 0; i < len; ++i) {
        if (temp->son[s[i] - base] == NULL)
            temp->son[s[i] - base] = newtrie();
        else
            temp->son[s[i] - base]->num++;  // one more word shares this prefix
        temp = temp->son[s[i] - base];
    }
    temp->terminal = true;
    temp->count++;
}

/*********************************
Delete the entire tree
pnt: root
*********************************/
void Delete(Trie *pnt)
{
    if (pnt != NULL) {
        for (int i = 0; i < sonnum; ++i)
            if (pnt->son[i] != NULL) Delete(pnt->son[i]);
        delete pnt;
        pnt = NULL;  // resets only the local copy; the caller must reset its own pointer
    }
}

/*********************************
Find the node at which a word ends in the dictionary tree
pnt: root
s: word
len: length of the word
Returns NULL if the path does not exist; otherwise check terminal/count on the
returned node to see whether the string was inserted as a complete word
*********************************/
Trie *find(Trie *pnt, const char *s, int len)
{
    Trie *temp = pnt;
    for (int i = 0; i < len; ++i) {
        if (temp->son[s[i] - base] != NULL)
            temp = temp->son[s[i] - base];
        else
            return NULL;
    }
    return temp;
}

Apart from the child-node array son, the members of the dictionary tree node structure Trie are simply information stored at each node, and designers can modify them according to their requirements.
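The article does not implement single-word deletion, so the following is only one possible sketch, written under assumptions. The helper names removeWord, countOf, and destroy are hypothetical, and the node is restated here so the example stands alone. It assumes, as in the listing above, that num counts how many inserted words pass through a node and that words contain only lowercase letters.

```cpp
#include <cassert>

const int sonnum = 26;
const int base = 'a';

struct Trie {
    int num;        // inserted words passing through this node
    bool terminal;  // true if a complete word ends here
    int count;      // occurrences of the word ending here
    Trie* son[sonnum];
};

Trie* newtrie() {
    Trie* t = new Trie;
    t->num = 1;
    t->terminal = false;
    t->count = 0;
    for (int i = 0; i < sonnum; ++i) t->son[i] = nullptr;
    return t;
}

void insert(Trie* root, const char* s, int len) {
    Trie* cur = root;
    for (int i = 0; i < len; ++i) {
        assert(s[i] >= 'a' && s[i] <= 'z');  // assumption: lowercase only
        Trie*& child = cur->son[s[i] - base];
        if (child == nullptr) child = newtrie();
        else ++child->num;
        cur = child;
    }
    cur->terminal = true;
    ++cur->count;
}

// Free a node and everything below it.
void destroy(Trie* node) {
    if (node == nullptr) return;
    for (int i = 0; i < sonnum; ++i) destroy(node->son[i]);
    delete node;
}

// Occurrence count of a word, or 0 if it is not stored.
int countOf(const Trie* root, const char* s, int len) {
    const Trie* cur = root;
    for (int i = 0; i < len && cur != nullptr; ++i) cur = cur->son[s[i] - base];
    return cur != nullptr ? cur->count : 0;
}

// Remove one occurrence of a word; returns false if the word is not stored.
bool removeWord(Trie* root, const char* s, int len) {
    if (countOf(root, s, len) == 0) return false;
    Trie* cur = root;
    for (int i = 0; i < len; ++i) {
        Trie*& child = cur->son[s[i] - base];
        if (--child->num == 0) {
            // No remaining word uses this branch, so prune it entirely.
            destroy(child);
            child = nullptr;
            return true;
        }
        cur = child;
    }
    --cur->count;
    cur->terminal = cur->count > 0;
    return true;
}
```

The first pass through countOf ensures the counters along the path are only decremented when the word is really present, so a failed removal leaves the tree unchanged.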

4. Application

As the sections above show, the dictionary tree is mainly used for statistical analysis of massive amounts of string data. Two simple application examples follow:
(1) A simple search engine saves the search terms that users enter every day and periodically compiles statistics to find the most frequent ones. This is essentially a small variation on the example in Section 1. A dictionary tree is built in the search engine's back end; when a user enters a search term, the term is inserted if it is not yet in the tree, and otherwise the counter stored at the term's last node is incremented by 1. During the periodic statistics, the counters stored at the nodes are examined to retrieve the most frequently used terms.
(2) Given n words, they can be sorted in dictionary order with low time complexity. If the n words are read into an ordinary string array, sorting them with a simple sort or with quicksort takes O(n^2) or O(n log n) comparisons; if instead they are read into a dictionary tree (with the children of each node ordered by character from left to right), a traversal of the tree outputs the word list in sorted order, with time complexity O(nd).
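Both applications can be sketched with one pre-order traversal. The code below is an illustration under assumptions rather than the article's own implementation: the node is reduced to a counter plus children, words are assumed to be lowercase 'a'..'z', and the names collect, sortedWords, and mostFrequent are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

const int sonnum = 26;
const char base = 'a';

struct Trie {
    int count;          // occurrences of the word ending at this node
    Trie* son[sonnum];  // children in alphabetical order
    Trie() : count(0) {
        for (int i = 0; i < sonnum; ++i) son[i] = nullptr;
    }
};

void insert(Trie* root, const std::string& word) {
    Trie* cur = root;
    for (char c : word) {
        assert(c >= 'a' && c <= 'z');  // assumption: lowercase only
        Trie*& child = cur->son[c - base];
        if (child == nullptr) child = new Trie();
        cur = child;
    }
    ++cur->count;
}

typedef std::vector<std::pair<std::string, int> > WordCounts;

// Pre-order walk: a node is reported before its children, and children are
// visited from 'a' to 'z', so words come out in dictionary order.
void collect(const Trie* node, std::string& prefix, WordCounts& out) {
    if (node->count > 0) out.emplace_back(prefix, node->count);
    for (int i = 0; i < sonnum; ++i) {
        if (node->son[i] != nullptr) {
            prefix.push_back(static_cast<char>(base + i));
            collect(node->son[i], prefix, out);
            prefix.pop_back();
        }
    }
}

// Application (2): distinct words in dictionary order, with their counts.
WordCounts sortedWords(const Trie* root) {
    WordCounts out;
    std::string prefix;
    collect(root, prefix, out);
    return out;
}

// Application (1): the most frequent word; ties go to the alphabetically first.
std::pair<std::string, int> mostFrequent(const Trie* root) {
    std::pair<std::string, int> best("", 0);
    for (const auto& entry : sortedWords(root)) {
        if (entry.second > best.second) best = entry;
    }
    return best;
}

void destroy(Trie* node) {
    if (node == nullptr) return;
    for (int i = 0; i < sonnum; ++i) destroy(node->son[i]);
    delete node;
}
```

A real search engine would likely keep the top several terms rather than a single one; the same traversal could feed a bounded heap for that purpose.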
