The trie is a search tree whose name comes from the English word "retrieval". It provides an efficient structure for organizing data for lookup and is a common way to implement the dictionary in Chinese dictionary-matching word segmentation algorithms. In essence, it is a deterministic finite automaton (DFA): each node represents a state of the automaton. In a dictionary, a state is either a "word prefix" or a "complete word".
The double-array trie is a simple and effective implementation of the trie. It consists of two integer arrays, base[] and check[]. Let i be an array index. If base[i] and check[i] are both 0, position i is empty. If base[i] is negative, state i is a word. check[] records the previous state: if state i moves to state t on a character whose code is a, then t = base[i] + a and check[t] = i.
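As a minimal sketch of these rules in Python: the function names are hypothetical, the toy arrays are made up for illustration, and using the magnitude of base for a word state is an assumption based on the sign convention above.

```python
def is_empty(base, check, i):
    """A position is empty when base and check are both 0."""
    return base[i] == 0 and check[i] == 0


def is_word(base, i):
    """A negative base value marks a state that is a word."""
    return base[i] < 0


def next_state(base, check, i, a):
    """Return the state reached from state i on character code a, or None.

    The sign of base[i] only marks a word, so its magnitude is used as the
    offset (assumption). The move is valid only if check[t] points back to i.
    """
    t = abs(base[i]) + a
    if t < len(base) and check[t] == i and not is_empty(base, check, t):
        return t
    return None


# Toy arrays (made up): state 1 has base 2, and character code 3 leads
# from it to state 5, which is a word.
base = [0, 2, 0, 0, 0, -5]
check = [0, 0, 0, 0, 0, 1]
print(next_state(base, check, 1, 3))  # 5
print(next_state(base, check, 1, 2))  # None
```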
The following example (taken from <double-array trie data structure and implementation>) shows how to build a word segmentation dictionary with a double-array trie. Assume the word table contains only six words, given here by their English glosses: "Ah", "Argentina", "A-gum", "Arabia", "Arab", and "Egypt". These words can be represented by a trie.
First, we encode the 10 Chinese characters that appear in the word table (each shown here by its English gloss): Ah-1, A-2, E-3, root-4, gum-5, pull-6, and-7, Ting-8, bo-9, man-10. As code sequences, the six words are therefore "Ah" = 1, "Argentina" = 2 4 8, "A-gum" = 2 5, "Arabia" = 2 6 9, "Arab" = 2 6 9 10, and "Egypt" = 3 7. For each character, a base value must be chosen so that all words starting with that character fit into the double array. For example, to determine the base value of "A", suppose the codes of the second characters of the words starting with "A" are a1, a2, a3, ..., an. We must find a value i such that base[i + a1], check[i + a1], base[i + a2], check[i + a2], ..., base[i + an], and check[i + an] are all 0. Once i is found, the base value of "A" is set to i.
The double-array trie is built level by level in this way. After four passes (the longest word has four characters), all words have been placed in the double array. The word table is then traversed once more to adjust the base values, because a negative base value indicates that a position is a word: if state i corresponds to a word and base[i] = 0, set base[i] = (-1) * i; if base[i] is not 0, set base[i] = (-1) * base[i]. This yields the final double array.
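The construction can be sketched in Python as follows. This is an illustration under stated assumptions, not the original author's code: index 0 is used as an implicit root (so a first-level state sits at the index equal to its code), occupied positions are tracked in a set because first-level states temporarily have base = check = 0, and the search for a base value starts at 4 (the hypothetical `first_base` parameter), a choice made only because it reproduces the indices quoted in the text.

```python
from collections import deque


def build(words, first_base=4):
    """Build base/check arrays from words given as lists of character codes."""
    # Map every prefix to the set of codes that can follow it.
    followers = {}
    for w in words:
        for k in range(len(w)):
            followers.setdefault(tuple(w[:k]), set()).add(w[k])

    base, check = [0], [0]   # index 0 acts as the root (assumption)
    state_of = {(): 0}       # prefix -> array index
    occupied = {0}           # positions already assigned to a state

    def grow(n):
        while len(base) <= n:
            base.append(0)
            check.append(0)

    queue = deque([()])      # breadth-first: one trie level at a time
    while queue:
        prefix = queue.popleft()
        codes = sorted(followers.get(prefix, ()))
        if not codes:
            continue         # leaf state: base stays 0 for now
        s = state_of[prefix]
        b = 0 if s == 0 else first_base
        # Find a base value i such that every position i + code is free.
        while any(b + c in occupied for c in codes):
            b += 1
        base[s] = b
        for c in codes:
            t = b + c
            grow(t)
            check[t] = s
            occupied.add(t)
            state_of[prefix + (c,)] = t
            queue.append(prefix + (c,))

    # Final pass over the word table: a negative base marks a word.
    for w in words:
        s = state_of[tuple(w)]
        base[s] = -s if base[s] == 0 else -base[s]
    return base, check


# The six words of the example as code sequences.
WORDS = [[1], [2, 4, 8], [2, 5], [2, 6, 9], [2, 6, 9, 10], [3, 7]]
base, check = build(WORDS)
print(base)   # [0, -1, 4, 4, 0, 0, 0, 0, 4, -9, 4, -11, -12, -4, -14]
print(check)  # [0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 3, 8, 10, 13]
```

With these assumptions the state for codes 2 4 lands at index 8 and "Argentina" at index 12, which agrees with the indices used in the text. A different search start would give a different but equally valid layout.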
In the double array generated by this method, "Ah", "A", "E", "A-root", "A-pull", "A-gum", "Egypt", "Arabia", "Arab", and "Argentina" are all treated as states, and each state corresponds to an index of the array. For example, the index of "A-root" is i = 8, so check[i] holds the index of "A", and base[i] is the base value from which the index of "Argentina" is computed. Since the code of "Ting" is x = 8, the index of "Argentina" is base[i] + x = base[8] + 8 = 12.
II. Basic operations and problems
1. Query
The query process of a trie is really a DFA state-transition process, and it is easy to implement on the double array: just follow the transitions from state to state. For example, to query "Argentina", first find the state "A" at index b = 2 from the code of "A", which is 2. Next, from the code d = 4 of "root", find the index of "A-root": base[b] + d = 8. Since check[base[b] + d] = b, "A-root" is part of a word and the query can continue. Then find the state "Argentina", whose index is y = 12. Here base[y] < 0 and check[y] = base[b] + d = 8, so "Argentina" is in the word table and the query is complete.
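The lookup can be sketched in Python as below. The arrays are what the construction rule yields for the six-word example when the base search starts at 4; only some of these entries are quoted in the text, so the rest should be read as an assumption, as should the use of index 0 as the root and the hypothetical function name `contains`.

```python
# Assumed arrays for the six-word example (index 0 is the root).
BASE = [0, -1, 4, 4, 0, 0, 0, 0, 4, -9, 4, -11, -12, -4, -14]
CHECK = [0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 3, 8, 10, 13]


def contains(codes, base=BASE, check=CHECK):
    """Return True if the code sequence is a word in the dictionary."""
    s = 0  # start at the root; a first character lands at index == code
    for c in codes:
        t = abs(base[s]) + c  # the sign of base only marks a word
        empty = t >= len(base) or (base[t] == 0 and check[t] == 0)
        if empty or check[t] != s:
            return False      # no such transition
        s = t
    return base[s] < 0        # a negative base marks a word


print(contains([2, 4, 8]))  # "Argentina": True
print(contains([2, 4]))     # only a prefix: False
print(contains([2, 7]))     # no such transition: False
```

The loop runs once per character, so the cost grows with the length of the word and not with the number of words stored.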
The query process shows that the query time for a word depends only on the length of the word and not on the size of the dictionary, so for words of bounded length the time complexity is O(1). In Chinese, most words have one or two characters, and words of more than three characters are rare. Therefore, a trie dictionary built with a double array is, in theory, about the fastest dictionary lookup available for Chinese mechanical word segmentation.
2. Insertion and deletion
The disadvantage of the double array is that every state depends on other states during construction. Therefore, inserting or deleting a word in the dictionary often requires global adjustments to the double-array structure, so flexibility is poor.
Inserting a word into an existing double-array trie is equivalent to adding a state to the DFA. First, use the query method to find the position where the new state should go. If the position is empty, insert the state there directly. If it is not empty, we have to re-determine the base value of the longest existing prefix state of the new word, using the same scanning method as during construction, and then recompute in turn the positions of that state's successor nodes. Pay attention to the resulting changes in the check values.
For example, suppose "A-pull-root" (code sequence 2 6 4) becomes a word one day and must be inserted into the trie. By calculation its position should be base[10] + 4 = 8, but 8 is already a state ("A-root"). So we have to re-determine the base value of "A-pull", the longest existing prefix state. Rescanning gives base[10] = 11. State 15 is then "A-pull-root", with base[15] negative (it is a word) and check[15] = 10, and state 20 is "Arabia", with base[20] = -4 and check[20] = 10.
This process is very time-consuming, because every candidate base value must be scanned in order to determine the base value of the longest prefix state. The cost is tolerable during construction: even if building takes a day or two, that is not a problem as long as the result runs efficiently afterwards. With frequent insertions, however, paying such a long running time on every insert is intolerable.
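The insertion procedure, including the relocation of existing child states, can be sketched in Python as follows. The arrays are the ones assumed for the six-word example (only some entries are quoted in the text), the inserted word is written as the code sequence 2 6 4 (inferred from the indices in the example), and `insert` and its `start` parameter are hypothetical names. A collision at the first level is deliberately left unhandled in this sketch.

```python
BASE = [0, -1, 4, 4, 0, 0, 0, 0, 4, -9, 4, -11, -12, -4, -14]
CHECK = [0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 3, 8, 10, 13]


def insert(codes, base, check, start=1):
    """Insert a word (list of codes), relocating child states when needed."""

    def grow(n):
        while len(base) <= n:
            base.append(0)
            check.append(0)

    def free(t):
        return t >= len(base) or (base[t] == 0 and check[t] == 0)

    def children(s):  # only called for s != 0
        return [k for k in range(1, len(base)) if check[k] == s]

    s = 0  # index 0 acts as the root (assumption)
    for c in codes:
        t = abs(base[s]) + c
        if not free(t) and check[t] == s:
            s = t  # the transition already exists
            continue
        if s == 0:
            if not free(t):
                raise ValueError("first-level slot taken; not handled here")
        else:
            kids = children(s)
            if not kids or not free(t):
                # Re-determine the base of s: every old child code and the
                # new code must land on a free position.
                old = abs(base[s])
                need = [k - old for k in kids] + [c]
                b = start
                while not all(free(b + x) and b + x != s for x in need):
                    b += 1
                for k in kids:
                    n = b + (k - old)
                    grow(n)
                    grandkids = children(k)
                    # A leaf word stores -index as base, so renumber it.
                    leaf_word = base[k] == -k and not grandkids
                    base[n] = -n if leaf_word else base[k]
                    check[n] = s
                    for g in grandkids:  # grandchildren follow the move
                        check[g] = n
                    base[k] = check[k] = 0
                base[s] = -b if base[s] < 0 else b
                t = b + c
        grow(t)
        check[t] = s
        s = t
    # Mark the final state as a word.
    base[s] = -s if base[s] == 0 else -abs(base[s])


base, check = list(BASE), list(CHECK)
insert([2, 6, 4], base, check)
print(base[10], base[15], check[15], base[20], check[20], check[14])
# 11 -15 10 -4 10 20
```

The linear scan for `b` and the full-array scans in `children` are where the time goes, which is the cost the paragraph above describes.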
Deletion from a double array is relatively simple to implement: just set the state of the deleted word to empty, that is, set its base and check values to 0. However, this has a space-efficiency problem. For example, when we delete "Egypt" above, state 11 is set to empty. State 3 ("E") then becomes a useless node: it is not a word and cannot be reused when new words are inserted. As deletions continue, the number of empty and useless states grows and space utilization declines.
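A minimal Python sketch of this simple deletion, using the arrays assumed for the six-word example (only some entries are quoted in the text). The names `find` and `delete_naive` are hypothetical, and keeping a state that is still a prefix of another word, by only dropping its word mark, is an added assumption.

```python
BASE = [0, -1, 4, 4, 0, 0, 0, 0, 4, -9, 4, -11, -12, -4, -14]
CHECK = [0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 3, 8, 10, 13]


def find(codes, base, check):
    """Return the state index reached by the code sequence, or None."""
    s = 0  # index 0 acts as the root (assumption)
    for c in codes:
        t = abs(base[s]) + c
        empty = t >= len(base) or (base[t] == 0 and check[t] == 0)
        if empty or check[t] != s:
            return None
        s = t
    return s


def delete_naive(codes, base, check):
    """Delete a word by emptying its state; return True if it was a word."""
    s = find(codes, base, check)
    if s is None or base[s] >= 0:
        return False
    if any(check[t] == s for t in range(1, len(base))):
        base[s] = -base[s]       # still a prefix: only drop the word mark
    else:
        base[s] = check[s] = 0   # leaf: set the state to empty
    return True


base, check = list(BASE), list(CHECK)
delete_naive([3, 7], base, check)   # delete "Egypt" (state 11)
print(base[11], check[11])          # 0 0
print(base[3], check[3])            # 4 0
```

After the call, the parent of the deleted state (index 3 in these arrays) still holds a base value although nothing hangs below it, which is the useless node described above.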
III. Simple optimization
The basic idea of the optimization is to turn the double-array trie into a dynamic retrieval structure, so as to solve the insertion and deletion problems.
1. Insertion optimization
When determining a new base value, we only need to visit the empty states. Encountering a non-empty state simply means that a candidate base value fails, so non-empty states can be skipped. We can therefore link all empty states into a sequence and scan only that sequence when determining a base value.
Let the empty nodes, in increasing order, be r_1, r_2, ..., r_m. The empty sequence can then be constructed as follows:
check[r_i] = −r_{i+1}  (1 ≤ i ≤ m − 1),
check[r_m] = −(da_size + 1)
Here r_1 = e_head is the index of the first empty state, and da_size is the current size of the double array, so the last link points just past its end. When determining a base value we then only need to scan this sequence, which saves the time otherwise spent visiting non-empty states.
This method greatly speeds up insertion as long as the number of empty states is not too large.
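The linked sequence of empty states can be sketched in Python as follows. The arrays are the ones assumed for the six-word example, da_size is taken to be the largest array index, and the function names are hypothetical. In this sketch a negative check value identifies an empty position, and base values below 1 are skipped as an added assumption.

```python
BASE = [0, -1, 4, 4, 0, 0, 0, 0, 4, -9, 4, -11, -12, -4, -14]
CHECK = [0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 3, 8, 10, 13]


def link_empty(base, check):
    """Thread all empty positions through check[]; return e_head."""
    da_size = len(base) - 1  # largest index (index 0 is the root)
    empties = [i for i in range(1, da_size + 1)
               if base[i] == 0 and check[i] == 0]
    # check[r_i] = -r_{i+1}, and the last one points to da_size + 1.
    for r, nxt in zip(empties, empties[1:] + [da_size + 1]):
        check[r] = -nxt
    return empties[0] if empties else da_size + 1


def empty_slots(check, e_head):
    """Yield the empty positions in increasing order."""
    da_size = len(check) - 1
    r = e_head
    while r <= da_size:
        yield r
        r = -check[r]


def find_base(codes, check, e_head):
    """Find a base value for the given child codes by scanning empties only."""
    da_size = len(check) - 1
    first = min(codes)
    for r in empty_slots(check, e_head):
        b = r - first  # put the smallest code on empty position r
        if b >= 1 and all(b + c > da_size or check[b + c] < 0 for c in codes):
            return b
    return max(1, da_size + 1 - first)  # nothing fits: go past the end


check = list(CHECK)
e_head = link_empty(BASE, check)
print(e_head, list(empty_slots(check, e_head)))  # 4 [4, 5, 6, 7]
print(find_base([4, 9], check, e_head))          # 11
```

For the child codes 4 and 9 the scan visits only the four empty positions before settling on 11, the same base value as in the insertion example.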
2. Deletion optimization
1) Useless nodes
Useless nodes produced by deleting a leaf node can be cleared by checking back toward the root one state at a time, so that they can be reused when new words are inserted. For example, if we delete "Argentina" in the previous example, "A-root" no longer has any child states, so it can be set to empty. The state "A" cannot be emptied, because it still has two child states.
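Deletion with this cleanup can be sketched in Python as below. The arrays are the ones assumed for the six-word example, and the function names are hypothetical. The walk toward the root stops at the root, at a state that is itself a word, or at a state that still has children.

```python
BASE = [0, -1, 4, 4, 0, 0, 0, 0, 4, -9, 4, -11, -12, -4, -14]
CHECK = [0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 3, 8, 10, 13]


def find(codes, base, check):
    """Return the state index reached by the code sequence, or None."""
    s = 0  # index 0 acts as the root (assumption)
    for c in codes:
        t = abs(base[s]) + c
        empty = t >= len(base) or (base[t] == 0 and check[t] == 0)
        if empty or check[t] != s:
            return None
        s = t
    return s


def has_children(s, check):
    """True if some state names s (a non-root state) as its parent."""
    return any(check[t] == s for t in range(1, len(check)))


def delete(codes, base, check):
    """Delete a word and empty any ancestors that become useless."""
    s = find(codes, base, check)
    if s is None or base[s] >= 0:
        return False
    if has_children(s, check):
        base[s] = -base[s]  # still a prefix: only drop the word mark
        return True
    while True:
        parent = check[s]
        base[s] = check[s] = 0  # empty this state
        s = parent
        if s == 0 or base[s] < 0 or has_children(s, check):
            return True


base, check = list(BASE), list(CHECK)
delete([2, 4, 8], base, check)   # delete "Argentina"
print(base[12], base[8], base[2])  # 0 0 4
```

Here the state at index 8 is emptied together with the deleted word, while the state at index 2 is kept because the states at 9 and 10 still depend on it.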
2) Compressing the array length
After a state is deleted, we can directly remove the run of consecutive empty states at the end of the array. In addition, we can re-determine the base value of the state with the largest non-empty index, because a smaller base value may have become available after the deletion. Doing so may let us remove some more empty states.
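The first step, dropping the empty states at the end of the array, can be sketched in Python as follows. The arrays are an assumed snapshot of the six-word example after the words at indices 13 and 14 have been deleted and the state at index 10 has been emptied. The name `trim` is hypothetical, and re-determining the base value of the last state is not covered here.

```python
def trim(base, check):
    """Drop the run of empty positions at the end; return how many were cut."""
    removed = 0
    # Index 0 is the root (assumption), so it is never removed.
    while len(base) > 1 and base[-1] == 0 and check[-1] == 0:
        base.pop()
        check.pop()
        removed += 1
    return removed


# Assumed snapshot: positions 10, 13 and 14 have been emptied by deletions.
base = [0, -1, 4, 4, 0, 0, 0, 0, 4, -9, 0, -11, -12, 0, 0]
check = [0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 3, 8, 0, 0]
print(trim(base, check))  # 2
print(len(base) - 1)      # 12
```

Only the trailing run is removed. The empty position at index 10 stays, because later states still rely on their indices.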