Search: An Illustrated Explanation of HashTree (Hash Tree)

Last Update:2015-06-10 Source: Internet

Author: User

Tags integer division


Introduction

In various data structures (linear lists, trees, and so on), the position of a record in the structure bears no fixed relation to its keyword. Looking up a record therefore requires comparing the target against a series of keywords. Such search methods are based on comparison, and their efficiency depends on the number of comparisons made during the lookup.

The comparison-based tree lookup algorithms introduced earlier all become less efficient as the number of records grows. Some are slow (time complexity O(n)) and some are faster (time complexity O(log n)). Moreover, the average search lengths quoted for these algorithms assume a relatively ideal situation. In practice, the data structure changes constantly as records are inserted and deleted. These operations can cause some structures to degenerate into linked lists, and performance then degrades. The adjustment measures taken to avoid this add program complexity and extra running time.

Hash Table

Ideally, we would like to reach the desired record in a single access, without any comparison. To do so, we must establish a definite correspondence f between a record's storage location and its keyword, so that each keyword corresponds to a unique storage location. A lookup then only needs to compute f(k) for the given value k to find the record, and no comparison is required to obtain it. We call this correspondence f a hash function, and a table built on this idea a hash table.

In a hash table, different keywords may be mapped to the same hash address; this is called a conflict (collision). In general, conflicts can only be reduced as far as possible, not avoided completely, because a hash function is a mapping from the set of keywords to the set of addresses. The set of keywords is typically large, since it contains every possible keyword, while the set of addresses contains only the address values in the hash table. A hash function is therefore normally a compressing mapping, which inevitably produces conflicts.
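
The inevitability of conflicts can be seen in a minimal sketch. The modulo hash function and the table size of 10 used in the note below are assumptions chosen for illustration; the article does not prescribe a particular hash function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A simple compressing hash function: maps any 32-bit keyword
 * to one of table_size addresses (0 .. table_size - 1). */
uint32_t hash_address(uint32_t key, uint32_t table_size)
{
    assert(table_size > 0); /* modulo by zero is undefined */
    return key % table_size;
}

/* Two different keywords conflict when they map to the same address. */
bool keys_conflict(uint32_t a, uint32_t b, uint32_t table_size)
{
    return a != b && hash_address(a, table_size) == hash_address(b, table_size);
}
```

With a table of 10 addresses, the keywords 3 and 13 both map to address 3, so keys_conflict(3, 13, 10) is true. Any table smaller than the keyword set must contain such pairs.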

The hash tree (HashTree) algorithm aims to provide a method that handles conflicts effectively both in theory and in practice. A general hash lookup is O(1) and essentially trades space for time, which can easily lead to a demand for unbounded storage space. The hash tree algorithm described here uses some techniques in practice to keep the space requirement within a certain range: the space required depends only on the number of objects that need to be stored and does not inflate indefinitely.

The theoretical basis of hash tree

【Prime number resolution theorem】
Put simply: n distinct primes can "distinguish" a run of consecutive integers whose length equals the product of those primes. "Distinguish" means that no two of these consecutive integers have exactly the same sequence of remainders with respect to the primes.
(For a proof of this theorem, see: http://wenku.baidu.com/view/16b2c7abd1f34693daef3e58.html)
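
The theorem can be checked by brute force on a small case. The sketch below is only an illustration, not a proof: it uses the three primes 2, 3 and 5 (product 30), and the function and array names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative choice: three primes whose product is 30. */
static const uint32_t SMALL_PRIMES[3] = {2, 3, 5};

/* True when a and b leave the same remainder for every prime in the list. */
bool same_remainders(uint64_t a, uint64_t b, const uint32_t *primes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a % primes[i] != b % primes[i])
            return false;
    }
    return true;
}

/* Count the pairs in [start, start + len) whose remainder sequences are identical. */
size_t count_indistinguishable_pairs(uint64_t start, uint64_t len,
                                     const uint32_t *primes, size_t n)
{
    assert(primes != NULL && n > 0);
    size_t pairs = 0;
    for (uint64_t a = start; a < start + len; a++) {
        for (uint64_t b = a + 1; b < start + len; b++) {
            if (same_remainders(a, b, primes, n))
                pairs++;
        }
    }
    return pairs;
}
```

Any 30 consecutive integers give a count of zero, while a run of 31 integers contains exactly one indistinguishable pair (its first and last element, which differ by 30).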

For example:
Taking consecutive primes starting from 2, the first 10 primes can distinguish M(10) = 2*3*5*7*11*13*17*19*23*29 = 6,469,693,230 numbers, which already exceeds the range of the commonly used 32-bit integers. The first 100 primes can distinguish about M(100) = 4.711930 × 10^219 numbers.
At current CPU speeds, performing 100 integer remainder operations poses hardly any difficulty. In real-world applications, the overall speed often depends on how many times a node's keyword must be loaded into memory and how long each load takes. In general, the time per load is determined by the keyword size and the hardware, so for the same type of keyword on the same hardware, the overall running time depends largely on the number of loads; the two are proportional.
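
The product of the first ten primes can be verified with 64-bit arithmetic. This is a small sketch; the names FIRST_PRIMES and prime_product are chosen for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static const uint32_t FIRST_PRIMES[10] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};

/* Product of the first `count` primes (count <= 10), computed in 64 bits. */
uint64_t prime_product(size_t count)
{
    assert(count <= 10);
    uint64_t product = 1;
    for (size_t i = 0; i < count; i++)
        product *= FIRST_PRIMES[i];
    return product;
}
```

prime_product(9) still fits in a 32-bit integer, while prime_product(10) is larger than UINT32_MAX. This is why ten primes are enough, and nine are not, to distinguish every 32-bit keyword.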

Insert

We choose the prime number resolution algorithm to build a hash tree.
Take consecutive primes starting at 2 to build a ten-layer hash tree. The first layer is the root node, which has 2 child nodes; each node in the second layer has 3 child nodes; and so on, so that the number of child nodes per node follows the sequence of consecutive primes. In the tenth layer, each node has 29 child nodes.
The child nodes of a given node, from left to right, represent the different remainder results.
For example, each second-layer node has three child nodes. From left to right they represent: remainder 0 when divided by 3, remainder 1 when divided by 3, and remainder 2 when divided by 3.
The remainders with respect to the primes determine the processing path.
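
As a rough sketch of how the path is derived, the child chosen at each layer is simply the keyword modulo that layer's prime. The names HT_LEVELS, HT_PRIMES and ht_child_index are illustrative and do not come from the original article; layers are counted from 0 here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HT_LEVELS 10
static const uint32_t HT_PRIMES[HT_LEVELS] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};

/* Index of the child to follow from a node in layer `level` (0 = root layer).
 * A node in that layer has HT_PRIMES[level] children, numbered left to right
 * by remainder. */
uint32_t ht_child_index(uint32_t key, size_t level)
{
    assert(level < HT_LEVELS);
    return key % HT_PRIMES[level];
}
```

For the keyword 100, the successive child indices are 0, 1, 0, 2, 1, 9, 15, 5, 8, 13, that is, 100 modulo 2, 3, 5, 7, 11, 13, 17, 19, 23 and 29.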

Node structure: the node's keyword (unique throughout the tree), the node's data object, a flag indicating whether the node is occupied (the keyword is considered valid only when the flag is true), and the array of pointers to the node's child nodes.
Node structure of hash tree

struct node {
    keyType      key;
    ValueType    value;
    /* occupied indicates whether the node is in use: set it to true when
       the node's keyword (key) is valid, and to false otherwise. */
    BOOL         occupied;
    /* subnodes[i] holds the address of the node's i-th child node.
       (This technique was introduced with the skip list; see the
       earlier blog post.) */
    struct node* subnodes[1];
};

(Building all nodes at the outset would consume an enormous amount of computation time and disk space. In actual use, only the root node needs to be initialized before work can start. Child nodes are created as more data enters the hash tree, so the hash tree is a dynamic structure like other trees.)
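
A minimal sketch of insertion with lazily created child nodes might look as follows. It makes several assumptions that are not in the original: keywords are 32-bit unsigned integers, values are plain int, the root is an empty node that never holds a keyword itself, and the child pointer array is allocated separately instead of inline in the node. The names ht_node and ht_insert are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define HT_LEVELS 10
static const uint32_t HT_PRIMES[HT_LEVELS] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};

typedef struct ht_node {
    uint32_t key;
    int value;
    bool occupied;
    struct ht_node **subnodes; /* NULL until the first child is needed */
} ht_node;

/* Insert key/value below an empty root; returns false if allocation fails. */
bool ht_insert(ht_node *root, uint32_t key, int value)
{
    ht_node *cur = root;
    for (int level = 0; level < HT_LEVELS; level++) {
        uint32_t p = HT_PRIMES[level];
        if (cur->subnodes == NULL) {       /* create the child array on demand */
            cur->subnodes = calloc(p, sizeof *cur->subnodes);
            if (cur->subnodes == NULL)
                return false;
        }
        ht_node **slot = &cur->subnodes[key % p];
        if (*slot == NULL) {               /* free position: create the node here */
            *slot = calloc(1, sizeof **slot);
            if (*slot == NULL)
                return false;
            (*slot)->key = key;
            (*slot)->value = value;
            (*slot)->occupied = true;
            return true;
        }
        if ((*slot)->key == key) {         /* same keyword: overwrite the value */
            (*slot)->value = value;
            (*slot)->occupied = true;
            return true;
        }
        cur = *slot;                       /* conflict: go one layer deeper */
    }
    /* Ten primes distinguish all 32-bit keywords, so this is never reached. */
    assert(!"unreachable");
    return false;
}
```

Starting from a zero-initialized root, inserting 7 places it at child 1 of the root (7 mod 2 = 1). Inserting 9 afterwards conflicts with 7 at that position and is placed one layer deeper, at child 0 (9 mod 3 = 0) of the node holding 7.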

Let's illustrate the hash tree insertion process with a diagram, taking the insertion of ten random numbers as an example (the clearest diagram in history; one look and you will understand ^_^).

Some readers may wonder what happens if conflicts keep occurring. First of all, if the keywords are integers, our ten-layer hash tree can distinguish them completely, as guaranteed by the prime number resolution theorem.

(We could in fact place every key-value node in the tenth-layer leaf nodes of the hash tree, since the full tenth layer has room for every integer. But then all non-leaf nodes would serve only as an index to the key-value nodes, making the tree structure huge and wasting space.)

Find

Looking up a node in the hash tree is similar to inserting one: take the remainder of the keyword with respect to each prime in the sequence, use the remainder to determine which branch to follow to the next node, and continue until the target node is found.
For example, the minimal hash tree (HashTree) finds a matching object among 4G (about 4 billion) objects in no more than 10 comparisons, that is, at most O(10). In practical applications, the range of primes is adjusted so that the number of comparisons is generally no more than 5, that is, at most O(5). We can therefore strike a balance between time and space according to our own needs.
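
The lookup described above can be sketched as follows, under the same kind of assumptions as any small illustration: 32-bit unsigned keywords, int values, an empty root node, and a separately allocated child pointer array. The names ht_node and ht_find are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HT_LEVELS 10
static const uint32_t HT_PRIMES[HT_LEVELS] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};

typedef struct ht_node {
    uint32_t key;
    int value;
    bool occupied;
    struct ht_node **subnodes; /* HT_PRIMES[layer] child pointers, or NULL */
} ht_node;

/* Return the occupied node holding key, or NULL if the key is absent. */
ht_node *ht_find(ht_node *root, uint32_t key)
{
    assert(root != NULL);
    ht_node *cur = root;
    for (int level = 0; level < HT_LEVELS && cur->subnodes != NULL; level++) {
        cur = cur->subnodes[key % HT_PRIMES[level]];
        if (cur == NULL)
            return NULL;       /* the path ends: key is not in the tree */
        if (cur->occupied && cur->key == key)
            return cur;        /* found after at most HT_LEVELS comparisons */
        /* a different keyword, or an unoccupied node, sits here: keep descending */
    }
    return NULL;
}
```

The loop runs at most HT_LEVELS times regardless of how many keywords the tree holds, which is where the bound of 10 comparisons comes from.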

Delete

Deleting a node from the hash tree is also very simple: the hash tree makes no structural adjustment on deletion.
Simply find the node to be deleted and set its occupied flag to false (that is, the node becomes an empty node but is not physically removed).
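
A sketch of this logical deletion is shown below. As before, the 32-bit keywords, int values, empty root and separately allocated child array are assumptions made for the example, and ht_delete is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HT_LEVELS 10
static const uint32_t HT_PRIMES[HT_LEVELS] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};

typedef struct ht_node {
    uint32_t key;
    int value;
    bool occupied;
    struct ht_node **subnodes; /* HT_PRIMES[layer] child pointers, or NULL */
} ht_node;

/* Mark the node holding key as unoccupied. Returns true if the key was found. */
bool ht_delete(ht_node *root, uint32_t key)
{
    assert(root != NULL);
    ht_node *cur = root;
    for (int level = 0; level < HT_LEVELS && cur->subnodes != NULL; level++) {
        cur = cur->subnodes[key % HT_PRIMES[level]];
        if (cur == NULL)
            return false;            /* key is not in the tree */
        if (cur->occupied && cur->key == key) {
            cur->occupied = false;   /* logical delete: node and children stay */
            return true;
        }
    }
    return false;
}
```

Because the node stays in place, keywords stored below it remain reachable along the same path; only the flag changes.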

Advantages

1. Simple Structure

The structure of a hash tree is very simple: the number of child nodes per node in each layer follows the sequence of consecutive primes, and child nodes can be created at any time. The structure is therefore dynamic and does not require a long initialization process as some hashing algorithms do. A hash tree also does not need to allocate space in advance for keywords that do not exist.
Note that the hash tree is a one-way growing structure: it grows as the amount of stored data grows, but even if the amount of data later shrinks back to its original size, the total number of nodes in the hash tree does not decrease. The purpose is to avoid the extra cost of structural adjustment.

2. Quick Search

From the algorithm we can see that for integers the hash tree grows to at most 10 levels. Therefore, only 10 remainder and comparison operations are needed to know whether an object exists. This gives the hash tree its advantage in algorithmic logic.
In an ordinary tree structure, the number of comparisons tends to grow as the number of nodes and levels increases, and no exact upper bound on the number of operations can be given. In a hash tree, the number of lookups is unrelated to the number of elements. If the keywords fall within the maximum range that the computer's 32-bit integer can express, the number of comparisons is at most 10, and usually below that.

3. Unchanged Structure

As the deletion algorithm shows, the hash tree makes no structural adjustment on deletion, which is another great advantage. Ordinary tree structures must be adjusted when elements are added or deleted; otherwise they may degenerate into a linked list and search efficiency drops. The hash tree is a "fit in wherever there is a gap" (jianfengchazhen) algorithm: it never has to worry about degeneration and needs no extra operations to optimize its structure, which saves a great deal of running time.

Disadvantages

1. No Ordering

The hash tree does not support sorting and has no notion of order. If you try to sort by traversal without making any improvements on this basis, the efficiency will be much lower than that of other data structures.

Questions about super-long strings

What should we do if the keyword is a very long string? Converting such strings to base-26 numbers gives results that are far too large.
We can use MD5 or a similar message digest algorithm to generate a fixed-length integer.

About MD5

Wiki Link: http://zh.wikipedia.org/wiki/MD5
MD5 (Message-Digest Algorithm 5)
A widely used cryptographic hash function that produces a 128-bit (16-byte) hash value, used to verify that information is transmitted completely and consistently.

The MD5 algorithm has the following characteristics:
1. Compressibility: for data of any length, the computed MD5 value has a fixed length.
2. Easy to compute: it is easy to calculate the MD5 value from the original data.
3. Resistance to modification: any change to the original data, even of a single byte, produces a very different MD5 value.
4. Weak collision resistance: given the original data and its MD5 value, it is very difficult to find other data with the same MD5 value (that is, to forge data).
5. Strong collision resistance: it is very difficult to find two different pieces of data with the same MD5 value.
(Weaknesses were demonstrated after 1996 and MD5 can now be broken; for data requiring high security, experts generally recommend switching to other algorithms, such as SHA-1, which has itself since been superseded by the SHA-2 family.)

For very long strings, we can generate a 128-bit integer with the MD5 algorithm and then store this large integer in a radix tree (see the earlier blog post) or in a hash tree. For an integer this large we cannot simply use the computer's native integer division; instead, the program simulates manual long division to obtain the remainder.
Combining MD5 with a choice of larger primes in this way lets the hash tree cover the whole keyword range with fewer layers. This reduces the number of comparison operations and improves overall efficiency.
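
Simulated long division for the remainder can be sketched as below. The sketch assumes the 128-bit value is already available as 16 bytes in big-endian order (computing the MD5 digest itself is outside its scope) and that the prime is smaller than 2^24 so that the intermediate value fits in 32 bits. The name big_mod is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Remainder of a big-endian unsigned integer (for example a 16-byte digest)
 * divided by a small prime, using schoolbook long division with one byte
 * as one "digit" in base 256. */
uint32_t big_mod(const uint8_t *digits, size_t len, uint32_t prime)
{
    assert(digits != NULL && prime > 0 && prime <= 0xFFFFFFu);
    uint32_t rem = 0;
    for (size_t i = 0; i < len; i++) {
        /* rem < prime <= 2^24 - 1, so rem * 256 + digit fits in 32 bits */
        rem = (rem * 256u + digits[i]) % prime;
    }
    return rem;
}
```

Calling big_mod once per layer with that layer's prime yields the child index to follow, just as the % operator does for native integers.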

Application

Hash trees can be used widely wherever large volumes of data must be matched quickly, for example: database index systems, receipt matching for short messages (SMS), matching over large numbers of routes, and information filtering. Because the hash tree needs no extra balancing or degradation-prevention operations, it is very efficient.

Reference

http://baike.baidu.com/view/10403049.htm
http://wenku.baidu.com/view/16b2c7abd1f34693daef3e58.html
