HashTree
http://blog.csdn.net/yang_yulei/article/details/46337405
In most data structures (linear lists, trees, and so on), the position of a record in the structure has no fixed relationship to its key. Searching for a record therefore requires a series of comparisons against keys. Such search methods are comparison-based, and their efficiency depends on the number of comparisons performed during the search.
We have previously introduced various comparison-based tree search algorithms. Their efficiency decreases as the number of records grows; some are relatively slow (time complexity O(n)) and some are relatively fast (time complexity O(log n)). Moreover, the average search lengths of these algorithms are derived under ideal conditions. In practice, frequent insertions and deletions keep changing the shape of the structure and can cause some structures to degenerate into a linked list, so performance inevitably declines. The adjustment measures taken to avoid this increase both the complexity of the program and the time spent on each operation.
Hash table
Ideally, we would like to reach the queried record without any comparison. To do so, we must establish a fixed correspondence f between the storage location of a record and its key, so that each key maps to a unique storage location. A search then only needs to compute the image f(K) of the given value K under this correspondence to obtain the record directly. We call this mapping f a hash function, and a table built on this idea is a hash table.
Different keys in a hash table may be mapped to the same hash address. This phenomenon is called a collision (conflict). In general, collisions can only be reduced, not eliminated, because a hash function is a mapping from the set of keys to the set of addresses. The key set is usually large, since it contains every possible key, while the address set contains only the addresses available in the hash table. A hash function is therefore a compressing mapping, which inevitably produces collisions.
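As a minimal illustration of why a compressing mapping collides, the sketch below uses a hypothetical modulo hash function. The names toy_hash and collides and the table size passed by the caller are assumptions made for this example only; they do not come from the original post.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hash function: compresses a 32-bit key into the
   address range 0 .. table_size-1. */
static unsigned toy_hash(uint32_t key, unsigned table_size)
{
    assert(table_size > 0);          /* guard against division by zero */
    return key % table_size;
}

/* Returns 1 when two different keys receive the same hash address. */
static int collides(uint32_t a, uint32_t b, unsigned table_size)
{
    return a != b && toy_hash(a, table_size) == toy_hash(b, table_size);
}
```

With a table of 10 addresses, any two keys that differ by a multiple of 10 (for example 12 and 22) land on the same address, no matter how the keys are chosen.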
The hash tree (HashTree) algorithm provides a way to handle collisions that is effective both in theory and in practice. An ordinary hash algorithm achieves O(1) lookups by trading space for time, which can easily lead to unbounded storage requirements. The HashTree algorithm described in this article uses a few techniques to keep space requirements within bounds: the space needed depends only on the number of objects actually stored and does not expand without limit.
Theoretical Basis of hash tree
[Prime Number Resolution Theorem]
Simply put, n distinct primes can "distinguish" a run of consecutive integers whose length equals the product of those primes. "Distinguish" means that no two integers in the run have exactly the same sequence of remainders with respect to those primes.
(For the proof of this theorem, see: http://wenku.baidu.com/view/16b2c7abd1f34693daef3e58.html)
For example:
Starting from 2, the first 10 consecutive primes can distinguish M(10) = 2*3*5*7*11*13*17*19*23*29 = 6469693230 numbers, which already exceeds the range of the commonly used 32-bit integer. The first 100 primes can distinguish about M(100) = 4.711930 * 10^219 numbers.
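The arithmetic in this example can be checked with a short sketch. The function names resolution and same_remainders are hypothetical helpers introduced here, and the table is limited to the first ten primes so that the product fits in a 64-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* The first ten consecutive primes, as used in the example. */
static const unsigned PRIMES[10] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};

/* M(n): product of the first n primes, i.e. the length of the run of
   consecutive integers that n primes can tell apart (n <= 10 here). */
static uint64_t resolution(int n)
{
    assert(n >= 0 && n <= 10);
    uint64_t product = 1;
    for (int i = 0; i < n; i++)
        product *= PRIMES[i];
    return product;
}

/* Returns 1 when a and b leave identical remainders for each of the
   first n primes, 0 as soon as one remainder differs. */
static int same_remainders(uint64_t a, uint64_t b, int n)
{
    assert(n >= 0 && n <= 10);
    for (int i = 0; i < n; i++)
        if (a % PRIMES[i] != b % PRIMES[i])
            return 0;
    return 1;
}
```

For instance, 5 and 35 differ by 30 = 2*3*5, so the first three primes cannot tell them apart, but adding the prime 7 separates them (5 mod 7 = 5, 35 mod 7 = 0).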
On current CPUs, performing 100 integer divisions to obtain remainders costs almost nothing. In practice, overall speed usually depends on how many times a node's key must be loaded into memory and how long each load takes. The load time is generally determined by the key size and the hardware, so for the same key type and the same hardware, the overall running time is directly related to the number of loads.
Insert
We chose the Prime Number Resolution Algorithm to create a hash tree.
Take the consecutive primes starting from 2 and build a ten-layer hash tree. The first layer is the root node, which has 2 child nodes; each node in the second layer has 3 child nodes; and so on, so that the number of child nodes per node at each layer follows the sequence of consecutive primes. At the tenth layer, each node has 29 child nodes.
The child nodes of a node represent the different possible remainders, from left to right.
For example, a second-layer node has three child nodes. From left to right they correspond to: remainder 0 when divided by 3, remainder 1 when divided by 3, and remainder 2 when divided by 3.
The remainders obtained by dividing the key by each prime in turn determine the path taken through the tree.
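The way remainders select a path can be sketched as below. This is a minimal illustration assuming 32-bit unsigned keys; hash_path and LEVELS are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEVELS 10

/* Branching factor of each layer, from the root downward. */
static const unsigned PRIMES[LEVELS] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};

/* Fills path[i] with the child index taken at layer i (root = layer 0):
   the remainder of key divided by the i-th prime. */
static void hash_path(uint32_t key, unsigned path[LEVELS])
{
    assert(path != NULL);
    for (int i = 0; i < LEVELS; i++)
        path[i] = key % PRIMES[i];
}
```

For the key 100, the path starts with child 0 of the root (100 mod 2 = 0), then child 1 (100 mod 3 = 1), then child 0 (100 mod 5 = 0), and so on. An actual insert or search only follows as much of this path as it needs.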
Node structure: the key of the node (unique within the entire tree), the data object of the node, a flag indicating whether the node is occupied (the key is considered valid only when the flag is true), and the array of child node pointers.
The node structure of the hash tree:
struct Node
{
    keyType key;
    ValueType value;
    bool occupied;             // Indicates whether the node is occupied: true if the node's key is valid, false otherwise.
    struct Node *subNodes[1];  // subNodes[i] holds the address of the node's i-th child. (This technique was introduced with the skip list; see the previous blog post.)
};
(Creating every node up front would consume an enormous amount of computing time and storage space. In practice, only the root node needs to be initialized before the tree can be used; child nodes are created as more data enters the hash tree. The hash tree is therefore a dynamic structure, like other trees.)
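A minimal sketch of on-demand node creation follows. It assumes keyType is a 32-bit unsigned integer and ValueType is int, and it uses a C99 flexible array member in place of the subNodes[1] idiom; child_count and node_new are hypothetical helper names, not part of the original post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define LEVELS 10
static const unsigned PRIMES[LEVELS] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};

typedef uint32_t keyType;   /* assumption: integer keys */
typedef int ValueType;      /* assumption: placeholder value type */

struct Node {
    keyType key;
    ValueType value;
    bool occupied;
    struct Node *subNodes[];    /* sized when the node is allocated */
};

/* Number of children of a node at the given level (root = level 0).
   Nodes below the last prime level are leaves with no children. */
static unsigned child_count(int level)
{
    assert(level >= 0 && level <= LEVELS);
    return level < LEVELS ? PRIMES[level] : 0;
}

/* Allocates one empty node for the given level. Its children are not
   created here; every child pointer starts out as NULL. */
static struct Node *node_new(int level)
{
    unsigned n = child_count(level);
    struct Node *node = malloc(sizeof *node + n * sizeof node->subNodes[0]);
    if (node == NULL)
        return NULL;
    node->key = 0;
    node->value = 0;
    node->occupied = false;
    for (unsigned i = 0; i < n; i++)
        node->subNodes[i] = NULL;
    return node;
}
```

Starting a tree is then a single call, node_new(0), which allocates only the root and its two empty child slots.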
Next, we take the random insertion of 10 numbers as an example to illustrate the insert process of the HashTree. (The original post illustrates the process with a figure.)
Some readers may wonder: what if the collisions never end? If the keys are integers, a ten-layer hash tree is guaranteed to distinguish all of them, as established by the prime number resolution theorem.
(In fact, we could also place all key-value nodes at the leaf nodes of the tenth layer, since the number of leaf positions there is enough to hold every integer. However, all non-leaf nodes would then serve only as an index for the key-value nodes, which would make the tree huge and waste space.)
[Clarification: the figure is built from the consecutive primes starting at 2, so from top to bottom the number of subtrees per node is 2, 3, 5, 7, 11, 13, 17, 19, 23, and 29. Each node in the first layer has 2 subtrees, each node in the second layer has 3 subtrees, and so on.
The number shown on a subtree is its index in the subtree pointer array of its parent node.]
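The insert process described above can be sketched as follows. This is one possible reading, not the author's implementation: it assumes 32-bit unsigned keys and int values, uses a flexible array member for the child pointers, and reuses a node whose occupied flag is false if one lies on the key's path. The names node_new and ht_insert are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define LEVELS 10
static const unsigned PRIMES[LEVELS] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};

typedef struct Node {
    uint32_t key;               /* assumed key type */
    int value;                  /* assumed value type */
    bool occupied;
    struct Node *subNodes[];    /* PRIMES[level] children; leaves have none */
} Node;

static Node *node_new(int level)
{
    unsigned n = level < LEVELS ? PRIMES[level] : 0;
    Node *node = malloc(sizeof *node + n * sizeof node->subNodes[0]);
    if (node == NULL)
        return NULL;
    node->key = 0;
    node->value = 0;
    node->occupied = false;
    for (unsigned i = 0; i < n; i++)
        node->subNodes[i] = NULL;
    return node;
}

/* Inserts key, or updates its value if it is already stored.
   Returns false if a new node was needed and allocation failed. */
static bool ht_insert(Node *root, uint32_t key, int value)
{
    assert(root != NULL);
    Node *cur = root;
    Node *target = NULL;                /* first free node seen on the path */
    for (int level = 0; ; level++) {
        if (cur->occupied && cur->key == key) {
            target = cur;               /* key already present: update it */
            break;
        }
        if (!cur->occupied && target == NULL)
            target = cur;               /* remember a reusable free node */
        if (level == LEVELS)
            break;                      /* all ten primes have been used */
        Node **slot = &cur->subNodes[key % PRIMES[level]];
        if (*slot == NULL) {
            if (target == NULL) {       /* no free node on the path: grow */
                *slot = node_new(level + 1);
                target = *slot;
            }
            break;
        }
        cur = *slot;
    }
    if (target == NULL)
        return false;
    target->key = key;
    target->value = value;
    target->occupied = true;
    return true;
}
```

Inserting 7, 12, 9, and 14 in that order places 7 in the root, 12 in the root's child 0 (12 mod 2 = 0), 9 in the root's child 1 (9 mod 2 = 1), and 14 in child 2 of the node holding 12 (14 mod 2 = 0, then 14 mod 3 = 2).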
Search
Searching a hash tree is similar to inserting into it: take the remainder of the key with respect to each prime in the sequence, and use each remainder to choose the branch to the next node, until the target node is found.
For example, the smallest hash tree described here can find a matching object among 4G objects in no more than 10 comparisons, that is, in at most O(10). In practice, the range of primes can be adjusted so that the number of comparisons generally does not exceed 5, that is, at most O(5). Time and space can therefore be balanced as needed.
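The search procedure can be sketched as below. To keep the example short, every node reserves 29 child slots (MAX_CHILDREN) rather than sizing the array per layer; that simplification, the 32-bit key type, and the name ht_find are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LEVELS 10
#define MAX_CHILDREN 29     /* simplification: same slot count in every node */

static const unsigned PRIMES[LEVELS] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};

typedef struct Node {
    uint32_t key;
    int value;
    bool occupied;
    struct Node *subNodes[MAX_CHILDREN];
} Node;

/* Returns the node that holds key, or NULL if the key is not stored. */
static Node *ht_find(Node *root, uint32_t key)
{
    assert(root != NULL);
    Node *cur = root;
    for (int level = 0; cur != NULL; level++) {
        if (cur->occupied && cur->key == key)
            return cur;
        if (level == LEVELS)
            break;                      /* all ten primes have been used */
        /* An unoccupied node does not end the search: keys inserted
           earlier may still live further down the same path. */
        cur = cur->subNodes[key % PRIMES[level]];
    }
    return NULL;
}
```

The loop stops only when it finds the key, runs into a missing child, or runs out of primes, so its length is bounded by the number of layers rather than by the number of stored elements.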
Delete
Deleting a node from a hash tree is also very simple: no structural adjustment is made.
First find the node to be deleted, then set its occupied flag to false (the node is marked as empty but is not physically removed).
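A minimal sketch of this logical deletion is shown below. As a simplification, every node reserves 29 child slots; that choice, the 32-bit key type, and the name ht_remove are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LEVELS 10
#define MAX_CHILDREN 29     /* simplification: same slot count in every node */

static const unsigned PRIMES[LEVELS] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};

typedef struct Node {
    uint32_t key;
    int value;
    bool occupied;
    struct Node *subNodes[MAX_CHILDREN];
} Node;

/* Marks the node holding key as free. Returns false if key is absent.
   No node is freed and no pointer is changed. */
static bool ht_remove(Node *root, uint32_t key)
{
    assert(root != NULL);
    Node *cur = root;
    for (int level = 0; cur != NULL; level++) {
        if (cur->occupied && cur->key == key) {
            cur->occupied = false;      /* logical delete only */
            return true;
        }
        if (level == LEVELS)
            break;
        cur = cur->subNodes[key % PRIMES[level]];
    }
    return false;
}
```

Because the node stays in place, keys stored beneath it remain reachable through the same remainder path after the deletion.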
Advantages
1. Simple Structure
The structure of a hash tree is very simple: the number of child nodes per node at each layer follows the sequence of consecutive primes. Child nodes can be created at any time, so the hash tree is dynamic and does not require the lengthy initialization that some hash algorithms do. The hash tree does not need to allocate space in advance for keys that do not exist.
Note that the hash tree is a grow-only structure: it grows with the amount of data stored, and even if the amount of data later drops back to its original level, the total number of nodes in the hash tree does not decrease. This avoids the extra cost of structural adjustment.
2. Quick Search
From the algorithm we can see that, for integer keys, the hash tree grows to at most 10 layers. Checking whether an object exists therefore takes at most 10 remainder-and-compare operations. This property is what makes the hash tree fast.
In an ordinary tree structure, the number of comparisons tends to grow as the number of levels and nodes increases, and the maximum number of operations cannot be determined exactly. In a hash tree, by contrast, the number of lookup steps does not depend on the number of elements. As long as the keys fall within the range of a 32-bit integer, the number of comparisons never exceeds 10 and is usually lower.
3. Unchanged Structure
The deletion algorithm shows that the hash tree makes no structural adjustment when a node is deleted, which is another significant advantage. Ordinary tree structures must make certain structural adjustments when elements are added or deleted; otherwise they may degenerate into a linked list and search efficiency drops. The hash tree instead places each key in whatever free position it finds along its path. It never has to worry about degeneration and needs no extra operations to optimize its structure, which greatly reduces operation time.
Disadvantages
1. No Ordering
The hash tree does not support sorting and has no inherent order. Without further improvement, producing sorted data by traversing the tree is far less efficient than with other kinds of data structures.
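To make the cost concrete, the sketch below shows the unimproved approach: visit every node, copy out the occupied keys, and sort them separately. The fixed 29-slot node layout, the 32-bit key type, and the names ht_collect, cmp_u32, and ht_sorted_keys are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_CHILDREN 29     /* simplification: same slot count in every node */

typedef struct Node {
    uint32_t key;
    int value;
    bool occupied;
    struct Node *subNodes[MAX_CHILDREN];
} Node;

/* Appends every occupied key at or below node to out (capacity cap),
   starting at index count, and returns the new count. The keys come
   out in tree order, which is not sorted order. */
static size_t ht_collect(const Node *node, uint32_t *out, size_t count, size_t cap)
{
    assert(out != NULL);
    if (node == NULL)
        return count;
    if (node->occupied && count < cap)
        out[count++] = node->key;
    for (int i = 0; i < MAX_CHILDREN; i++)
        count = ht_collect(node->subNodes[i], out, count, cap);
    return count;
}

/* Three-way comparison for qsort, written to avoid overflow. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Full traversal followed by a separate sort of the collected keys. */
static size_t ht_sorted_keys(const Node *root, uint32_t *out, size_t cap)
{
    size_t n = ht_collect(root, out, 0, cap);
    qsort(out, n, sizeof out[0], cmp_u32);
    return n;
}
```

An ordered structure such as a binary search tree yields its keys in sorted order from a single in-order traversal, whereas here the traversal must touch every node, including unoccupied ones, and the sort is an additional step.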