Double-array trie (double array dictionary tree)

Source: Internet
Author: User

 

Http://blog.huang-wei.com/2010/07/15/double-array-trie%ef%bc%88%e5%8f%8c%e6%95%b0%e7%bb%84%e5%ad%97%e5%85%b8%e6%a0%91%ef%bc%89/

 

 

Tries are widely used in ACM programming contests and are a very effective index structure, so I will not dwell on their benefits here.

In essence, a trie is a deterministic finite automaton (DFA). It can be implemented in several ways; the easiest one to write in a contest is the multiway search tree.

The biggest drawback of a trie, however, is that it takes up too much space and can easily exceed the memory limit. There are of course contest tricks for slimming it down, such as capping the height or using a non-random-access structure (which reduces the width) for nodes with few branches, but all of these sacrifice some search efficiency.

Here we introduce another implementation, the double-array trie (DATrie). As the name suggests, it is really just two arrays and has nothing to do with a linked tree structure. It reduces wasted memory to some extent.

It uses two arrays, base[] and check[], indexed by i. If base[i] and check[i] are both 0, cell i is empty. If base[i] is negative, the state is a terminal state (that is, it ends a word). check[i] records the previous state, that is, the state from which state i was reached.

Definition 1. For an input character c that takes state s to state t, the double-array trie satisfies:

check[base[s] + c] = s
base[s] + c = t 

 

 

From Definition 1 we get the lookup algorithm for a given state s and input character c:

t := base[s] + c;
if check[t] = s then
    next state := t
else
    fail
endif
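
The lookup step above can be sketched in C++ as follows. This is only an illustration under stated assumptions: the root is state 0, an empty cell is marked with check = -1 (the hypothetical constant FREE) rather than 0 so that it cannot be confused with a child of the root, letters are coded a=1 to z=26, and the names Dat, step, contains and sampleAc are made up for this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Assumptions for this sketch: root is state 0, FREE (-1) marks an empty
// cell, letters are coded a=1 .. z=26, and a negative base marks a word end.
const int FREE = -1;

struct Dat {
    std::vector<int> base, check;
};

// Follow one transition from state s on character code c; -1 means "fail".
int step(const Dat& d, int s, int c) {
    int t = std::abs(d.base[s]) + c;              // t := base[s] + c
    if (t >= (int)d.check.size() || d.check[t] != s) return -1;
    return t;                                     // check[t] = s, so t is the next state
}

// A word is present if every transition succeeds and the last state is terminal.
bool contains(const Dat& d, const std::string& word) {
    int s = 0;
    for (char ch : word) {
        s = step(d, s, ch - 'a' + 1);
        if (s < 0) return false;
    }
    return d.base[s] < 0;
}

// Hand-built arrays holding the single word "ac":
// root(0) --a--> state 1 --c--> state 4 (terminal).
Dat sampleAc() {
    Dat d;
    d.base  = {0, 1, 0, 0, -4};
    d.check = {0, 0, FREE, FREE, 1};
    return d;
}
```

With these arrays, contains(sampleAc(), "ac") succeeds, while "a" fails because state 1 is not terminal and "ab" fails because cell 3 is not owned by state 1.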

 

Insert operation. Suppose the base value of some state is i, and the codes of the characters that can follow it are c1, c2, c3, ..., cn. Then base[i + c1], check[i + c1], base[i + c2], check[i + c2], base[i + c3], check[i + c3], ..., base[i + cn], check[i + cn] must all be 0 before those cells can be claimed.

In a double array, space is allocated to a new state only when a transition into it is actually made; in other words, only states that are reachable get cells. When the condition above cannot be met, the base value has to be adjusted until it can. The adjustment only moves the nodes one level below the current node, because the address of every child is determined by the starting subscript stored in its parent's base entry.

First, the relocation algorithm:

Procedure Relocate(s : state; b : base_index)
{ Move base for state s to a new place beginning at b }
begin
    foreach input character c for the state s
    { i.e. foreach c such that check[base[s] + c] = s }
    begin
        check[b + c] := s;     { mark owner }
        base[b + c] := base[base[s] + c];     { copy data }
        { the node base[s] + c is to be moved to b + c;
          Hence, for any i for which check[i] = base[s] + c, update check[i] to b + c }
        foreach input character d for the node base[s] + c
        begin
            check[base[base[s] + c] + d] := b + c
        end;
        check[base[s] + c] := none     { free the cell }
    end;
    base[s] := b
end

 

For example, take the two words ac and da, and insert ac first.

When the d of da is inserted, its target address turns out to be occupied by c already, so a relocation is needed.

After the relocation, a and d have been given new positions, and the base value of the root has changed from 0 to 1.
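
The ac / da example can be replayed with a small C++ sketch. It is not the author's implementation: the class TinyDat, the FREE marker (-1 instead of 0 for an empty cell), the a=1 to z=26 coding and the brute-force findBase are all assumptions made for illustration. A new node's base is initialised to its own index, which is what makes c land on cell 4 and collide with d.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Assumptions: root is state 0, FREE marks an empty cell, codes are a=1..z=26,
// a negative base marks a terminal state and |base| is the child offset.
const int FREE = -1;
const int ALPHA = 26;

class TinyDat {
public:
    std::vector<int> base, check;
    TinyDat() : base(1, 0), check(1, 0) {}        // cell 0 is the root

    void insert(const std::string& word) {
        int s = 0;
        for (char ch : word) {
            int c = ch - 'a' + 1;
            int t = std::abs(base[s]) + c;
            grow(t);
            if (check[t] != s) {
                if (check[t] != FREE) {           // cell taken by another state
                    std::vector<int> kids = childCodes(s), need = kids;
                    need.push_back(c);
                    relocate(s, kids, findBase(need));
                    t = std::abs(base[s]) + c;
                    grow(t);
                }
                check[t] = s;                     // claim the cell for s
                base[t] = t;                      // initial offset of the new node
            }
            s = t;
        }
        base[s] = -std::abs(base[s]);             // mark word end
    }

    bool contains(const std::string& word) const {
        int s = 0;
        for (char ch : word) {
            int t = std::abs(base[s]) + (ch - 'a' + 1);
            if (t >= (int)check.size() || check[t] != s) return false;
            s = t;
        }
        return base[s] < 0;
    }

private:
    void grow(int t) {
        if (t >= (int)check.size()) { base.resize(t + 1, 0); check.resize(t + 1, FREE); }
    }
    bool isFree(int i) const { return i >= (int)check.size() || check[i] == FREE; }

    std::vector<int> childCodes(int s) const {
        std::vector<int> out;
        for (int c = 1; c <= ALPHA; ++c) {
            int t = std::abs(base[s]) + c;
            if (t < (int)check.size() && check[t] == s) out.push_back(c);
        }
        return out;
    }
    // Brute force: smallest b >= 1 such that every cell b + c is free.
    int findBase(const std::vector<int>& codes) const {
        for (int b = 1;; ++b) {
            bool ok = true;
            for (int c : codes) if (!isFree(b + c)) { ok = false; break; }
            if (ok) return b;
        }
    }
    // Move the children of s (given by their codes) to the new base b.
    void relocate(int s, const std::vector<int>& kids, int b) {
        int old = std::abs(base[s]);
        for (int c : kids) {
            int from = old + c, to = b + c;
            grow(to);
            check[to] = s;                        // mark owner
            base[to] = base[from];                // copy data
            int off = std::abs(base[from]);
            for (int d = 1; d <= ALPHA; ++d) {    // re-parent grandchildren
                int g = off + d;
                if (g < (int)check.size() && check[g] == from) check[g] = to;
            }
            check[from] = FREE;                   // free the old cell
            base[from] = 0;
        }
        base[s] = (base[s] < 0) ? -b : b;
    }
};
```

After inserting "ac" the root's base is still 0; inserting "da" then finds cell 4 taken by c, moves a from cell 1 to cell 2, and sets the root's base to 1. The child codes are collected before anything is moved, so a freshly written cell is never mistaken for an old child.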

Suppose the trie has n nodes and the character set has size m. A DATrie then takes n + cm space, where c is a coefficient that depends on how sparse the trie is, while a multiway search tree takes nm.
Note that this bound is for the offline algorithm, where the whole dictionary is known before processing starts. For the online algorithm, space usage also depends on the order in which words arrive: the more ordered the words, the less space is used.
Lookup time depends only on the length of the string being looked up, the same as for a multiway search tree.
For insertion, if a relocation happens we must add the cost of scanning the child nodes plus the cost of finding a new base value. With a brute-force search (a plain for loop over the array), insertion takes O(nm + cm^2) time.

In practice a DATrie is harder to code than a multiway search tree, mainly because states are not represented as clearly as in a tree structure and it is easy to mix up the subscripts.
One thing to watch out for: a positive base value is a starting offset, while a negative one marks a terminal state, so when searching for a new base value, make sure the value found is positive.
For example, when inserting d into an empty trie, the first free address is 1, which gives base = 1 - 4 = -3 and breaks the sign convention for base.
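
One way to guard against this is to start the scan far enough into the array that the candidate base can never drop below 1. The sketch below is an assumption-laden illustration, not code from the article: findPositiveBase and the used vector are hypothetical names, and cells past the end of used are treated as free.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch under assumptions: used[i] says whether cell i is occupied, cells
// past the end of `used` count as free, and `codes` is non-empty.
// Returns the smallest base b >= 1 such that every cell b + c is free.
int findPositiveBase(const std::vector<bool>& used, const std::vector<int>& codes) {
    int minCode = *std::min_element(codes.begin(), codes.end());
    // Start at cell minCode + 1 so that b = cell - minCode is at least 1.
    for (int cell = minCode + 1;; ++cell) {
        int b = cell - minCode;
        bool fits = true;
        for (int c : codes) {
            int i = b + c;
            if (i < (int)used.size() && used[i]) { fits = false; break; }
        }
        if (fits) return b;
    }
}
```

For the empty trie with only the root in cell 0 and the single code 4 (the letter d), the scan starts at cell 5 and returns base 1 instead of the invalid -3.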

Optimization:

  1. Search for empty addresses

    When base[i] and check[i] are both 0, the cell is empty. A cell that holds nothing but 0 marks carries very little information, so we can put this space to use: the free cells of the base and check arrays are linked into a doubly linked list.

    Definition 2. Let r[1], r[2], ..., r[cm] be the free addresses in increasing order. The doubly linked list is then defined by:

    check[0] = -r[1]
    check[r[i]] = -r[i+1] ; 1 <= i <= cm-1
    check[r[cm]] = 0
    base[0] = -r[cm]
    base[r[1]] = 0
    base[r[i+1]] = -r[i] ; 1 <= i <= cm-1

    Because base[0] is used as the root node, base[0] can be left out of the linked list, while check[0] serves as the list's head node.

    Let the transition set of the node be P = {c[1], c[2], ..., c[p]}. Using the linked list, we get a new algorithm for finding an empty address:

    {find least free cell s such that s > c[1]}
    s := -check[0];
    while s <> 0 and s <= c[1] do
        s := -check[s]
    end;
    if s = 0 then return FAIL;  {or reserve some additional space}
     
    {continue searching for the row, given that s matches c[1]}
    while s <> 0 do
        i := 2;
        while i <= p and check[s + c[i] - c[1]] < 0 do
            i := i + 1
        end;
        if i = p + 1 then return s - c[1];  {all cells required are free, so return it}
        s := -check[s]
    end;
    return FAIL;  {or reserve some additional space}

    With the linked list, finding an empty address takes O(cm^2) time and relocation takes O(m^2), for a total of O(cm^2 + m^2) = O(cm^2).

    After a node is relocated or deleted, its old address becomes free and can be put back on the linked list. It can then be reused later, for example when a state whose transition set is a subset of the original one comes along.
    Since this optimization simply threads the free cells into a linked list, inserting into and deleting from the list is easy to follow; the time complexity is O(cm).

    t := -check[0];
    while check[t] <> 0 and t < s do
        t := -check[t]
    end;
    {t now points to the cell after s' place}
    check[s] := -t;
    check[-base[t]] := -s;
    base[s] := base[t];
    base[t] := -s;
  2. Array length compression

    When a node is deleted, besides putting it back on the linked list we can also re-determine the base value for the state of the largest non-empty node, because the deletion may have opened up room earlier in the array for its transition set. This may let us drop some empty cells at the end and so shorten the array.

  3. Suffix compression

    This idea borrows from the suffix tree. Suffixes that have no branches can be stored separately, but that structure has to live outside the DATrie, so it is not covered here. For details, see [Aoe1989].
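
The free-cell list from optimization 1 can be sketched in C++ as below. This is a minimal illustration under assumptions: the struct FreeList and its method names are hypothetical, cell 0 is the root and acts as list head through check[0], and a cell counts as free in findBase only when its check value is negative (so the last free cell, whose check is 0, is conservatively skipped). Unlike the insertion pseudocode above, release also handles a cell that belongs at the very end of the list.

```cpp
#include <cassert>
#include <vector>

// For a free cell i: -check[i] is the next free cell (0 = end of list) and
// -base[i] is the previous free cell (0 = the head stored in check[0]).
struct FreeList {
    std::vector<int> base, check;

    explicit FreeList(int n) : base(n, 0), check(n, 0) {
        for (int i = 1; i < n; ++i) {
            check[i] = (i + 1 < n) ? -(i + 1) : 0;
            base[i] = -(i - 1);
        }
        check[0] = (n > 1) ? -1 : 0;
    }

    // Unlink free cell s so that it can hold a trie node.
    void take(int s) {
        int prev = -base[s], next = -check[s];
        check[prev] = -next;                 // prev == 0 updates the head
        if (next != 0) base[next] = -prev;
        base[s] = check[s] = 0;              // caller stores real values next
    }

    // Link a released cell s back in, keeping the list sorted by address.
    void release(int s) {
        int prev = 0, next = -check[0];
        while (next != 0 && next < s) { prev = next; next = -check[next]; }
        check[prev] = -s;
        base[s] = -prev;
        check[s] = -next;
        if (next != 0) base[next] = -s;
    }

    // Visit free cells only: find a base b >= 1 with every cell b + c free.
    // codes must be sorted ascending; returns -1 if nothing fits.
    int findBase(const std::vector<int>& codes) const {
        for (int s = -check[0]; s != 0; s = -check[s]) {
            int b = s - codes[0];
            if (b < 1) continue;
            bool ok = true;
            for (int c : codes) {
                int i = b + c;
                if (i >= (int)check.size() || check[i] >= 0) { ok = false; break; }
            }
            if (ok) return b;
        }
        return -1;
    }

    std::vector<int> freeCells() const {
        std::vector<int> out;
        for (int s = -check[0]; s != 0; s = -check[s]) out.push_back(s);
        return out;
    }
};
```

Because only free cells are visited, the outer loop of findBase skips every occupied cell, which is where the saving over a plain for loop comes from.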

On the whole, the DATrie is somewhat harder to code and slow to insert into, so it is not widely used in ACM contests. In real applications, however, the dictionary is usually fairly stable and the offline algorithm leaves a lot of room for optimization, so the DATrie's space advantage becomes more pronounced. Given how efficient trie lookup is, the structure is well worth studying.
This post is long enough already. If I find the time, I will put together a test report for the DATrie. Alas, my writing seems to be getting worse and worse; I can hardly write a technical post smoothly anymore ~~

The code below is a DATrie with only the empty-address search optimization added. How best to map the character set to codes still deserves further discussion; this code supports English letters only.

#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>
using namespace std;

#define ALPHASIZE 30
#define MAX 10000
#define ALPHAID(x) (1 + (x) - 'a')   // 'a'..'z' -> 1..26
#define IDALPHA(x) ((x) - 1 + 'a')   // 1..26 -> 'a'..'z'
// A free cell stores negated list links in both arrays.
#define EMPTY(x) (basei[x] < 0 && check[x] < 0)
// Unlink cell x from the free list before using it.
#define DELETE_FREE_NODE(x) check[-basei[x]] = check[x]; \
    basei[-check[x]] = basei[x]; \
    maxsize = max(maxsize, x)
// Mark cell x as abandoned (it is not put back on the free list).
#define ADD_FREE_NODE(x) basei[x] = MAX; \
    check[x] = MAX; \
    abnodes++

class DATire
{
public:
    void init() {
        // doubly linked free list over cells 1 .. MAX-1
        for (int i = 1; i < MAX; i++) {
            check[i] = -(i + 1);
            basei[i] = -(i - 1);
        }
        basei[1] = 0;        // so check[0] can be updated
        check[MAX - 1] = 0;  // 0 terminates the free list
        // basei[0] is root-index
        // check[0] points to the first free cell
        basei[0] = 0;
        check[0] = -1;
        // stat
        diffwords = 0;
        maxsize = 0;
        nodes = 1;
        abnodes = 0;
    }
    // Print every stored word below state s; buf holds the prefix of length d.
    void print(int s, char* buf, int d) {
        if (basei[s] < 0)
            puts(buf);
        int si = abs(basei[s]);
        for (int i = si + 1; i <= si + ALPHASIZE; i++) {
            if (check[i] == s) {
                buf[d] = IDALPHA(i - si); buf[d + 1] = '\0';
                print(i, buf, d + 1);
                buf[d] = '\0';
            }
        }
    }
    bool insert(string word) {
        int s = 0, t;
        for (int i = 0; i < (int)word.length(); i++) {
            char ch = word[i];
            t = abs(basei[s]) + ALPHAID(ch);
            if (s == check[t])
                s = t;                      // transition already exists
            else if (EMPTY(t)) {
                DELETE_FREE_NODE(t);        // claim the free cell
                basei[t] = t;
                check[t] = s;
                s = t;
                nodes++;
            }
            else {
                int newb = findb(s, ALPHAID(ch));
                if (newb == -1)
                    return false;
                else {
                    relocate(s, newb);
                    i--;                    // retry the same character
                }
            }
        }
        if (basei[s] > 0)
            diffwords++;
        basei[s] = -abs(basei[s]);          // negative base marks a word end
        return true;
    }
    bool find(string word) {
        int s = 0, t;
        int i;
        for (i = 0; i < (int)word.length(); i++) {
            char ch = word[i];
            t = abs(basei[s]) + ALPHAID(ch);
            if (s == check[t])
                s = t;
            else
                break;
        }
        return (i == (int)word.length() && basei[s] < 0);
    }
protected:
    // Find a new base for state s that fits its children plus the code newc.
    int findb(int s, int newc) {
        ns = 0;
        int i, j;
        int si = abs(basei[s]);
        for (i = si + 1; i <= si + ALPHASIZE; i++) {
            if (check[i] == s)
                sonc[ns++] = i - si;
        }
        sonc[ns++] = newc;
        int minson = min(sonc[0], newc);
        // i < si: the new place must be after the old place
        // i < minson: a negative base value has another meaning
        for (i = -check[0]; i != 0 && (i < si || i < minson); i = -check[i]);
        for (; i != 0; i = -check[i]) {
            for (j = 0; j < ns; j++) {
                if (!EMPTY(i + sonc[j] - minson))
                    break;
            }
            if (j == ns) {
                ns--;                       // drop newc; relocate moves old children only
                assert(i - minson >= 0);
                return i - minson;
            }
        }
        return -1;
    }
    // Move the children of s (collected in sonc by findb) to base b.
    void relocate(int s, int b) {
        for (int i = ns - 1; i >= 0; i--) {
            int news = b + sonc[i];
            int olds = abs(basei[s]) + sonc[i];
            DELETE_FREE_NODE(news);
            check[news] = s;
            basei[news] = basei[olds];
            int isi = abs(basei[olds]);
            for (int j = isi + 1; j <= isi + ALPHASIZE; j++) {
                if (check[j] == olds)
                    check[j] = news;        // re-parent grandchildren
            }
            ADD_FREE_NODE(olds);
        }
        basei[s] = (basei[s] < 0 ? -1 : 1) * b;
    }
protected:
    int basei[MAX];
    int check[MAX];
    // helper
    int sonc[ALPHASIZE];
    int ns;
public:
    // stat
    int maxsize;    // used memory size
    int nodes;      // trie nodes
    int abnodes;    // abandoned trie nodes
    int diffwords;  // diff words
    // free nodes = maxsize - nodes - abnodes
};
