In natural language processing (NLP) research, the n-gram is the most basic and also the most useful method of comparison, where N is the length of the string being compared. The TrieTree introduced here is a data structure closely related to n-grams; it is also called a dictionary tree. A TrieTree is simply a multi-way tree in which each node stores one character. The advantage is that to check an n-gram you only need to walk down a single branch from the root, one character per level; if a character is not found among the current node's children, the traversal stops. This is a little abstract, so let's look at a practical example.
Suppose we have the following words in the dictionary:
Shanghai
Shanghai City
Shanghainese
Shanghai Company
Beijing
Dipper
Willow
Yangpu District
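The example is Chinese, so each word is a short sequence of characters. Three characters hang directly off the root node: 上 ("above", the first character of every Shanghai entry), 北 ("north", the first character of Beijing and Dipper), and 杨 ("Yang", the first character of Willow and Yangpu District).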
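To make the shape of the tree easier to picture, here is a minimal sketch that loads the eight entries into a bare-bones trie and renders it as an indented outline. The OutlineNode and TreeOutline names are hypothetical, and the Chinese spellings of the dictionary entries are my assumption about the original words; this is not the TrieTree class presented later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal node: one child per distinct next character.
public class OutlineNode
{
    public Dictionary<char, OutlineNode> Children = new Dictionary<char, OutlineNode>();
    public bool WordEnded;
}

public static class TreeOutline
{
    // Assumed Chinese spellings of the eight dictionary entries.
    public static readonly string[] Words =
    {
        "上海", "上海市", "上海人", "上海公司", "北京", "北斗", "杨柳", "杨浦区"
    };

    // Insert every word character by character, sharing common prefixes.
    public static OutlineNode Build(IEnumerable<string> words)
    {
        var root = new OutlineNode();
        foreach (string word in words)
        {
            OutlineNode node = root;
            foreach (char ch in word)
            {
                OutlineNode child;
                if (!node.Children.TryGetValue(ch, out child))
                {
                    child = new OutlineNode();
                    node.Children.Add(ch, child);
                }
                node = child;
            }
            node.WordEnded = true;
        }
        return root;
    }

    // Depth-first listing: one character per line, indented two spaces
    // per level; a trailing '*' marks a node where a word ends.
    public static void Render(OutlineNode node, int depth, List<string> lines)
    {
        foreach (var pair in node.Children.OrderBy(p => p.Key))
        {
            lines.Add(new string(' ', depth * 2) + pair.Key + (pair.Value.WordEnded ? "*" : ""));
            Render(pair.Value, depth + 1, lines);
        }
    }
}
```

Calling Render on the root with depth 0 and printing the collected lines gives three unindented lines (the root's children), with the shared prefixes of the Shanghai entries nested under the first one.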
If we take the 3-grams of the phrase "Shanghai City Yangpu District" (上海市杨浦区), we get 上海市, 海市杨, 市杨浦, and 杨浦区. Now we need to know which of these the dictionary recognizes; this is how n-grams are generally used for word segmentation. With the tree, we simply take the characters of each 3-gram in order and compare them starting from the root. For 上海市 we can follow the path 上 -> 海 -> 市, so it matches. For 海市杨 there is no 海 ("sea") hanging on the root node, so we stop immediately. 市杨浦 cannot be matched either, because 市 ("city") is not a child of the root. Finally, 杨浦区 follows the path 杨 -> 浦 -> 区 and matches.
In the end, we can split "Shanghai City Yangpu District" into Shanghai City and Yangpu District.
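The walk-through above can be sketched in a few lines: generate every 3-gram of the phrase, then walk each one down the tree and stop at the first missing character. GramNode and NGramDemo are hypothetical names for this illustration, and the Chinese spellings passed to Build in the usage note are assumed; the sketch is independent of the TrieTree class shown later.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal node for this sketch.
public class GramNode
{
    public Dictionary<char, GramNode> Children = new Dictionary<char, GramNode>();
    public bool WordEnded;
}

public static class NGramDemo
{
    public static GramNode Build(IEnumerable<string> words)
    {
        var root = new GramNode();
        foreach (string word in words)
        {
            GramNode node = root;
            foreach (char ch in word)
            {
                GramNode child;
                if (!node.Children.TryGetValue(ch, out child))
                {
                    child = new GramNode();
                    node.Children.Add(ch, child);
                }
                node = child;
            }
            node.WordEnded = true;
        }
        return root;
    }

    // Every contiguous substring of length n, from left to right.
    public static List<string> NGrams(string text, int n)
    {
        var grams = new List<string>();
        for (int i = 0; i + n <= text.Length; i++)
            grams.Add(text.Substring(i, n));
        return grams;
    }

    // Walk from the root one character at a time; stop at the first miss.
    public static bool Matches(GramNode root, string gram)
    {
        GramNode node = root;
        foreach (char ch in gram)
        {
            GramNode next;
            if (!node.Children.TryGetValue(ch, out next))
                return false;
            node = next;
        }
        return node.WordEnded; // the path must end on a complete word
    }

    // The n-grams of the text that the dictionary recognizes.
    public static List<string> Recognized(GramNode root, string text, int n)
    {
        var hits = new List<string>();
        foreach (string gram in NGrams(text, n))
        {
            if (Matches(root, gram))
                hits.Add(gram);
        }
        return hits;
    }
}
```

With a tree built from 上海, 上海市, 上海人, 上海公司, 北京, 北斗, 杨柳 and 杨浦区, Recognized(root, "上海市杨浦区", 3) keeps 上海市 and 杨浦区, the same split as in the walk-through.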
Although a TrieTree saves a lot of time compared with searching a plain string array, it is not free: the tree has to be built from the dictionary first, and that cost is not low. Of course, once the TrieTree is built it can be reused for the lifetime of the application, so for large-scale comparison the performance gain is considerable.
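One way to pay the construction cost only once is to build the structure lazily and keep it for the life of the process. The sketch below uses .NET's Lazy<T>, which is thread-safe by default. WordIndex and its BuildCount counter are hypothetical names for this illustration, and a HashSet<string> stands in for the tree to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public sealed class WordIndex
{
    static int _buildCount = 0;

    // Lazy<T> runs the factory on first access only (thread-safe by default).
    static readonly Lazy<WordIndex> _instance =
        new Lazy<WordIndex>(() => new WordIndex(new[] { "Shanghai", "Beijing" }));

    readonly HashSet<string> _words;

    private WordIndex(IEnumerable<string> words)
    {
        // Count how many times the (expensive) build step actually runs.
        Interlocked.Increment(ref _buildCount);
        _words = new HashSet<string>(words);
    }

    public static WordIndex Instance
    {
        get { return _instance.Value; }
    }

    public static int BuildCount
    {
        get { return _buildCount; }
    }

    public bool Contains(string word)
    {
        return _words.Contains(word);
    }
}
```

However many times WordIndex.Instance is read, the constructor runs once, so every later lookup reuses the structure that was already built.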
The following is the C# implementation of TrieTree.
public class TrieTree
{
    TrieNode _root = null;
    int charCount; // total number of characters added so far

    private TrieTree()
    {
        // The root is a sentinel node that holds no real character.
        _root = new TrieNode(char.MaxValue, 0);
        charCount = 0;
    }

    static TrieTree _instance = null;

    public static TrieTree GetInstance()
    {
        if (_instance == null)
        {
            _instance = new TrieTree();
        }
        return _instance;
    }

    public TrieNode Root
    {
        get { return _root; }
    }

    public void AddWord(char ch)
    {
        TrieNode newnode = _root.AddChild(ch);
        newnode.IncreaseFrequency();
        newnode.WordEnded = true;
    }

    public void AddWord(string word)
    {
        if (word.Length == 1)
        {
            AddWord(word[0]);
            charCount++;
        }
        else
        {
            char[] chars = word.ToCharArray();
            TrieNode node = _root;
            charCount += chars.Length;
            for (int i = 0; i < chars.Length; i++)
            {
                // Reuse the existing child or create it, then count this pass.
                TrieNode newnode = node.AddChild(chars[i]);
                newnode.IncreaseFrequency();
                node = newnode;
            }
            node.WordEnded = true;
        }
    }

    public int GetFrequency(char ch)
    {
        // GetChildNode returns null when there are no children or no match,
        // so this is safe on an empty tree.
        TrieNode matchedNode = _root.GetChildNode(ch);
        if (matchedNode == null)
        {
            return 0;
        }
        return matchedNode.Frequency;
    }

    public int GetFrequency(string word)
    {
        if (word.Length == 1)
        {
            return GetFrequency(word[0]);
        }
        else
        {
            char[] chars = word.ToCharArray();
            TrieNode node = _root;
            for (int i = 0; i < chars.Length; i++)
            {
                TrieNode matchedNode = node.GetChildNode(chars[i]);
                if (matchedNode == null)
                {
                    return 0; // the path breaks: the word is not in the tree
                }
                node = matchedNode;
            }
            if (node.WordEnded)
                return node.Frequency;
            else
                return -1; // the path exists, but no word ends here
        }
    }
}
The singleton pattern is used here because the TrieTree acts like a cache and does not need to be created more than once. The following is the implementation of TrieNode:
// These using directives belong at the top of the source file.
using System;
using System.Collections.Generic;
using System.Linq;

public class TrieNode
{
    public TrieNode(char ch, int depth)
    {
        this.Character = ch;
        this._depth = depth;
    }

    public char Character;

    int _depth;
    public int Depth
    {
        get { return _depth; }
    }

    TrieNode _parent = null;
    public TrieNode Parent
    {
        get { return _parent; }
        set { _parent = value; }
    }

    public bool WordEnded = false;

    // Created lazily, so leaf nodes do not allocate an empty set.
    HashSet<TrieNode> _children = null;
    public HashSet<TrieNode> Children
    {
        get { return _children; }
    }

    public TrieNode GetChildNode(char ch)
    {
        if (_children != null)
            return _children.FirstOrDefault(n => n.Character == ch);
        else
            return null;
    }

    public TrieNode AddChild(char ch)
    {
        TrieNode matchedNode = null;
        if (_children != null)
        {
            matchedNode = _children.FirstOrDefault(n => n.Character == ch);
        }
        if (matchedNode != null)
        {
            // Found the char in the list; the caller increases the frequency.
            return matchedNode;
        }
        else
        {
            // Not found: create the child one level deeper.
            TrieNode node = new TrieNode(ch, this.Depth + 1);
            node.Parent = this;
            if (_children == null)
                _children = new HashSet<TrieNode>();
            _children.Add(node);
            return node;
        }
    }

    int _frequency = 0;
    public int Frequency
    {
        get { return _frequency; }
    }

    public void IncreaseFrequency()
    {
        _frequency++;
    }

    // Rebuild the string spelled by the path from the root to this node.
    public string GetWord()
    {
        TrieNode tmp = this;
        string result = string.Empty;
        while (tmp.Parent != null) // until root node
        {
            result = tmp.Character + result;
            tmp = tmp.Parent;
        }
        return result;
    }

    public override string ToString()
    {
        return Convert.ToString(this.Character);
    }
}
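The children above are kept in a HashSet<TrieNode> and searched with FirstOrDefault, which scans the set one element at a time. A possible alternative, sketched here under my own names (FreqNode, FreqTrie), keys the children by character in a Dictionary so that each step is a hash lookup. The sketch follows the return convention of GetFrequency for multi-character words: 0 when the path is missing, -1 when the path exists but no word ends there, and otherwise the number of added words that pass through the node. Unlike the original, it treats single-character words the same way as longer ones.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type: children are keyed by their character.
public class FreqNode
{
    public readonly Dictionary<char, FreqNode> Children = new Dictionary<char, FreqNode>();
    public int Frequency;
    public bool WordEnded;
}

public class FreqTrie
{
    readonly FreqNode _root = new FreqNode();

    public void AddWord(string word)
    {
        if (string.IsNullOrEmpty(word)) return; // nothing to add
        FreqNode node = _root;
        foreach (char ch in word)
        {
            FreqNode child;
            if (!node.Children.TryGetValue(ch, out child))
            {
                child = new FreqNode();
                node.Children.Add(ch, child);
            }
            child.Frequency++; // one more added word passes through this node
            node = child;
        }
        node.WordEnded = true;
    }

    public int GetFrequency(string word)
    {
        if (string.IsNullOrEmpty(word)) return 0;
        FreqNode node = _root;
        foreach (char ch in word)
        {
            FreqNode child;
            if (!node.Children.TryGetValue(ch, out child))
                return 0; // the path breaks: not in the tree
            node = child;
        }
        return node.WordEnded ? node.Frequency : -1; // -1: prefix only
    }
}
```

After adding 上海, 上海市, 上海人 and 上海公司 (assumed spellings of the Shanghai entries), GetFrequency("上海") returns 4 because all four words pass through that node, GetFrequency("上海公") returns -1, and GetFrequency("海") returns 0.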