Search: understanding hashing algorithms and implementing hash tables

Last Update:2018-07-24 Source: Internet

Author: User

Tags arrays hash integer division



We like to use arrays for data lookup because an array is a "random access" data structure: the storage location of any element can be computed directly from the array's starting address and the element's index, so access by index takes O(1) time, regardless of how many elements the array holds.

Building on this idea, suppose there were a data structure that let us search by keyword with the same direct access an array offers, so that the time complexity of a lookup drops from O(n) to O(1). That would greatly improve search efficiency. From this idea our predecessors invented hashing, also called the hash method or the keyword address calculation method.

Basic ideas

We want to find a relationship that takes the keyword (key) we want to store and directly computes the position (p) where it should be stored. Once this relationship is established, finding a keyword later only requires computing the value the relationship produces, which gives the keyword's address directly, so the time complexity of the lookup is reduced to O(1). Written as a mathematical relationship:

p (position) = H(key)

Here H is the correspondence, which we call the hash function, and p is called the hash address. The core of hashing is therefore to find a hash function H and to use it both to organize storage and to perform lookups.

Methods of constructing hash functions

Let's look at an example first:
Suppose we have some words (keywords): and, cell, do, flag, and so on. If our hash function returns the position in the alphabet of the keyword's first letter, these keywords map to distinct hash addresses. However, when we apply the same rule to and, ant, apple, cell, do, and so on, an "address conflict" arises, because the first three words begin with the same letter and therefore map to the same address.
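The first-letter rule above can be sketched in a few lines of Python. The function names are our own, chosen only for illustration, and the sketch assumes lowercase or uppercase English words.

```python
def first_letter_hash(word: str) -> int:
    """Return the 1-based alphabet position of the word's first letter."""
    return ord(word[0].lower()) - ord("a") + 1


def addresses(words):
    """Map every word to its hash address under the first-letter rule."""
    return {word: first_letter_hash(word) for word in words}


# No two of these words share a first letter, so no two share an address.
distinct = addresses(["and", "cell", "do", "flag"])
# {'and': 1, 'cell': 3, 'do': 4, 'flag': 6}

# The first three words all start with "a" and collide at address 1.
clashing = addresses(["and", "ant", "apple", "cell", "do"])
# {'and': 1, 'ant': 1, 'apple': 1, 'cell': 3, 'do': 4}
```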

Because a hash function is a compressing mapping (the set of possible keywords is usually much larger than the set of addresses), hash functions that never produce conflicts are rare in practice. How to construct a suitable hash function, one that distributes the nodes evenly and produces as few conflicts as possible, is therefore one of the problems we have to solve.

The guiding principles for a hash function are simplicity and uniformity: the function itself should be simple and quick to compute, and its values must fall within the hash address range and be evenly distributed, so that address conflicts are as few as possible.

There are several common methods of constructing hash functions:

1. Division-remainder method

This is the simplest and most common method. Assume the table length (the number of hash addresses) is m, and let p be the largest prime number less than or equal to m; then the hash function is H(key) = key % p, where H(key) is the position in the hash table.
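As a minimal sketch of the division-remainder method, assuming integer keywords and taking p to be the largest prime not exceeding m, the computation might look like this. The helper names and the sample keys are hypothetical.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for the small table sizes used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def largest_prime_at_most(m: int) -> int:
    """Return the largest prime p with p <= m."""
    for candidate in range(m, 1, -1):
        if is_prime(candidate):
            return candidate
    raise ValueError("there is no prime less than or equal to m")


def division_hash(key: int, p: int) -> int:
    """H(key) = key % p."""
    return key % p


m = 16                          # table length (sample value)
p = largest_prime_at_most(m)    # 13
hash_addresses = [division_hash(k, p) for k in (19, 14, 23, 1, 68)]
# [6, 1, 10, 1, 3]: keys 14 and 1 conflict at address 1
```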

Choosing p to be a prime number reduces the likelihood of address conflicts.

2. Digit analysis method

Assume that each keyword in the keyword set consists of s digits (k1, k2, ..., kn). If the frequency with which each digit value appears in each digit position can be known in advance, analyze the keyword set as a whole and extract several evenly distributed digit positions, or a combination of them, as the hash address.

For example:
H(49646542) = 465, H(49673242) = 732, ...
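In both examples the hash address is formed from the 4th, 5th, and 6th digits of the keyword. A small sketch of that extraction, assuming the chosen positions are counted from the left starting at 1 (the function name and the `positions` parameter are our own):

```python
def digit_select_hash(key: int, positions=(4, 5, 6)) -> int:
    """Build the hash address from the digits at the given 1-based positions."""
    digits = str(key)
    return int("".join(digits[p - 1] for p in positions))


first = digit_select_hash(49646542)    # 465
second = digit_select_hash(49673242)   # 732
```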

Analysis shows that the 4th to 6th digits of the keywords are the most evenly distributed, so the hash function is H(key) = d4d5d6.

3. Mid-square method

Since integer division is usually slower than multiplication, deliberately avoiding the division-remainder method can reduce the running time of the hash computation. The mid-square method works as follows: first square the keyword, which amplifies the differences between similar keywords, then take several digits from the middle of the square, as many as the table length requires, as the hash function value. Because the middle digits of a product depend on every digit of the multiplier, the resulting hash addresses are more uniform.

4. Segment superposition (folding) method

Sometimes the keyword contains so many digits that computing its square is too costly. In that case, divide the keyword into several parts with the same number of digits (the last part may have fewer), add the parts together, and use the sum, discarding any carry out of the highest digit, as the hash address.
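Here is a minimal sketch of the simplest variant, in which the parts are added exactly as they stand. It assumes a non-negative integer keyword and a hash address with as many digits as each part; the function name and the sample keyword are hypothetical.

```python
def shift_fold_hash(key: int, part_len: int) -> int:
    """Split the key's digits into parts, add them, and drop the carry."""
    digits = str(key)
    parts = [digits[i:i + part_len] for i in range(0, len(digits), part_len)]
    total = sum(int(part) for part in parts)
    # Keeping only the lowest part_len digits discards the carry.
    return total % (10 ** part_len)


address = shift_fold_hash(123456789, 3)
# parts 123 + 456 + 789 = 1368, carry dropped -> 368
```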

There are two concrete ways to add the parts: shift superposition and fold superposition.

5. Radix conversion method

First treat the keyword as a number written in a different base, then convert it back to a number in the original base, and finally select several of its digits as the hash address.

For example:
Treat the decimal number 362081 as a base-13 number; converted to decimal, it is 1289744. Assuming the hash table length is 10000, the lower four digits, 9744, can be taken as the hash address.
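This example can be checked with Python's built-in `int(text, base)`, which reads a digit string in a given base. The sketch below assumes every decimal digit of the keyword is a valid digit in the new base, which holds for base 13; the function name and its default parameters are our own.

```python
def radix_hash(key: int, base: int = 13, table_len: int = 10000) -> int:
    """Reinterpret the key's decimal digits in another base, keep the low digits."""
    converted = int(str(key), base)   # read "362081" as a base-13 numeral
    return converted % table_len      # lower digits select the address


converted = int("362081", 13)   # 1289744
address = radix_hash(362081)    # 9744
```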

In general, we should choose the hashing method that suits the actual situation and test its performance. The factors usually considered are: the time required to compute the hash function, the length of the keyword, the size of the hash table, the distribution of the keywords, and the frequency of record lookups.

Ways to handle conflicts

As noted above, address conflicts can rarely be avoided in practice, so what should we do once one occurs? A natural idea is to find another hash address for the conflicting keyword.

Open addressing (rehashing) method

Basic idea:
When the initial hash address H0 = H(key) of a keyword conflicts, compute the next address H1 based on H0; if H1 also conflicts, compute another address H2 based on H0, and so on, until a non-conflicting address Hi is found, and store the element there. This method has a general rehash function of the form:

Hi = (H(key) + di) % m

where H0 = H(key), m is the table length, and di is the increment sequence. Different choices of the increment sequence give different rehashing methods:

1. Linear probing

di = c × i

The simplest case: c = 1
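A minimal sketch of insertion with this increment sequence follows. It assumes integer keywords, H(key) = key % m, and `None` as the marker for an empty cell; these choices and the function name are ours, made only for illustration. With c = 1 the probe visits every cell; other values of c reach every cell only when c and m share no common factor.

```python
def linear_probe_insert(table: list, key: int, c: int = 1) -> int:
    """Store key at the first free address (H(key) + c*i) % m and return it."""
    m = len(table)
    home = key % m                     # assumed hash function H(key)
    for i in range(m):
        address = (home + c * i) % m   # % m wraps around the end of the table
        if table[address] is None:
            table[address] = key
            return address
    raise OverflowError("no free cell found along the probe sequence")


table = [None] * 7
placed = [linear_probe_insert(table, k) for k in (10, 17, 24, 6, 13)]
# [3, 4, 5, 6, 0]: 10, 17 and 24 all hash to 3; 13 wraps from cell 6 to cell 0
```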

Features: when a conflict occurs, the following cells of the table are examined one after another until an empty cell is found or the whole table has been searched. Note that because the % (remainder) operator is used, the table behaves somewhat like a circular queue: the cell after the last one is the first, and the cell before the first is the last.

2. Quadratic probing

di = 1^2, -(1^2), 2^2, -(2^2), ..., k^2, -(k^2) (k <= m/2)
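The alternating sequence can be generated lazily. The sketch below assumes integer keywords, H(key) = key % m, and `None` for an empty cell; the function names are hypothetical. Python's `%` returns a non-negative result for a positive modulus, so the negative increments wrap around correctly.

```python
def quadratic_offsets(m: int):
    """Yield 0, then +1^2, -(1^2), +2^2, -(2^2), ... for k up to m // 2."""
    yield 0
    for k in range(1, m // 2 + 1):
        yield k * k
        yield -(k * k)


def quadratic_probe_insert(table: list, key: int) -> int:
    """Store key at the first free address (H(key) + d) % m and return it."""
    m = len(table)
    home = key % m                  # assumed hash function H(key)
    for d in quadratic_offsets(m):
        address = (home + d) % m
        if table[address] is None:
            table[address] = key
            return address
    raise OverflowError("no free cell found along the probe sequence")


table = [None] * 11
placed = [quadratic_probe_insert(table, k) for k in (3, 14, 25, 36)]
# [3, 4, 2, 7]: all four keys hash to 3, then probe +1, -1, +4
```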

Features: when a conflict occurs, the table is probed in jumps, alternately to the right and to the left of the conflicting address. This is more flexible and less prone to clustering; the disadvantage is that it cannot always probe the entire hash address space.

3. Pseudo-random probing

di = pseudo-random number
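One possible way to realize this, shown here only as a sketch, is to shuffle the increments 1 to m-1 once with a seeded generator and reuse that same sequence for every insertion and lookup. The seed value, the function names, and the choice of H(key) = key % m are assumptions made for illustration.

```python
import random


def make_increments(m: int, seed: int = 42) -> list:
    """Return the increments 1..m-1 in a reproducible pseudo-random order."""
    increments = list(range(1, m))
    random.Random(seed).shuffle(increments)   # same seed -> same sequence
    return increments


def random_probe_insert(table: list, key: int, increments: list) -> int:
    """Store key at the first free address (H(key) + di) % m and return it."""
    m = len(table)
    home = key % m                            # assumed hash function H(key)
    for d in [0] + increments:
        address = (home + d) % m
        if table[address] is None:
            table[address] = key
            return address
    raise OverflowError("hash table is full")


m = 7
increments = make_increments(m)
table = [None] * m
# Every key below hashes to address 3, so each one after the first must probe.
placed = [random_probe_insert(table, k, increments) for k in range(3, 3 + 7 * m, m)]
```

Because the increments are a permutation of 1 to m-1, the probe sequence eventually visits every cell, so the seven conflicting keywords fill the seven cells.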

Features: a pseudo-random number generator is set up and given a number as its starting point (seed).

Chain address method

Basic idea:

Chain all keywords whose hash addresses conflict into the same singly linked list.
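A minimal sketch of this idea, assuming integer keywords and H(key) = key % m. The class and method names are our own, and new nodes are inserted at the head of the list purely for simplicity.

```python
class Node:
    """One cell of a singly linked list."""

    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


class ChainedHashTable:
    def __init__(self, m: int):
        self.m = m
        self.heads = [None] * m        # one head pointer per hash address

    def _address(self, key: int) -> int:
        return key % self.m            # assumed hash function H(key)

    def contains(self, key: int) -> bool:
        node = self.heads[self._address(key)]
        while node is not None:        # walk only the chain for this address
            if node.key == key:
                return True
            node = node.next
        return False

    def insert(self, key: int) -> None:
        if not self.contains(key):
            i = self._address(key)
            self.heads[i] = Node(key, self.heads[i])   # insert at the head

    def chain(self, i: int) -> list:
        """Return the keys stored in the list for address i, head first."""
        keys, node = [], self.heads[i]
        while node is not None:
            keys.append(node.key)
            node = node.next
        return keys


table = ChainedHashTable(7)
for key in (10, 17, 24, 6):
    table.insert(key)
# 10, 17 and 24 all hash to 3 and share one list: table.chain(3) == [24, 17, 10]
```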
If the hash table length is m, the hash table can be defined as an array of m head pointers. A keyword whose hash address is i is inserted into the singly linked list whose head pointer is the i-th cell of the pointer array.

Performance index

The main performance indicator for measuring lookup efficiency is the average lookup length (ASL).

ASL(succ) = (sum of the numbers of comparisons) / (number of keywords)

The number of comparisons for a keyword is the number of hash addresses that had to be checked for an existing value, in order to avoid an address conflict, before the keyword could be placed.
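To make the formula concrete, the sketch below builds a small table and counts the comparisons for each keyword. It assumes linear probing with step 1, H(key) = key % m, and fewer keywords than cells; the function names and sample keys are hypothetical.

```python
def build_and_count(keys, m: int):
    """Insert keys with linear probing; return the table and per-key comparisons."""
    if len(keys) > m:
        raise ValueError("more keys than cells")
    table = [None] * m
    comparisons = []
    for key in keys:
        address = key % m              # assumed hash function H(key)
        probes = 1                     # checking the home address counts as one
        while table[address] is not None:
            address = (address + 1) % m
            probes += 1
        table[address] = key
        comparisons.append(probes)
    return table, comparisons


def asl_success(comparisons) -> float:
    """ASL(succ) = sum of comparisons / number of keywords."""
    return sum(comparisons) / len(comparisons)


table, comparisons = build_and_count([10, 17, 24, 6, 13], 7)
# comparisons == [1, 2, 3, 1, 2], so ASL(succ) = 9 / 5 = 1.8
asl = asl_success(comparisons)
```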

The smaller the ASL, the better the performance.

Hash table

With this foundation in place, let's try to build a hash table ourselves and implement insertion, lookup, and deletion.

Creation of a hash table

1. First, set the keyword of every cell in the table to empty;
2. Insert the given keyword sequence into the hash table one keyword at a time using the insertion algorithm.

Insertion into a hash table

1. Use the lookup algorithm to find the position of the record to be inserted in the table;
2. If the record is found in the table, it does not need to be inserted; if it is not found, the lookup algorithm yields the hash address of an empty cell, and the record is inserted into that cell.

Lookup in a hash table

1. Compute the hash address from the keyword to be found, using the hash function chosen when the table was built;
2. If the cell at that address is empty, the lookup fails; if it is not empty, compare the keyword in the cell with the keyword being sought:
If they are equal, the lookup succeeds;
If not, compute the next address using the conflict-handling method chosen when the table was built.
3. Repeat step 2 until either a cell is empty, in which case the lookup fails, or the keyword in a cell equals the keyword being sought, in which case the lookup succeeds.

Deletion from a hash table

A hash table based on open addressing cannot perform a true deletion; it can only mark the deleted node with a deletion flag. Otherwise, a node that was inserted later than the deleted node and had conflicted with it could no longer be found: a true deletion breaks the lookup path. If true deletion is required, it is best to use a hash table that handles conflicts with the chain address method.

Load factor of a hash table

α = (number of elements stored in the hash table) / (length of the hash table)

The smaller α is, the less likely conflicts are, but the lower the space utilization;
The larger α is, the more likely conflicts are, but the higher the space utilization.

Why is the storage efficiency of a hash table generally only 50%?

From the load factor above, we know that the larger α is, the greater the chance of conflict and the more probes a lookup needs. Now look at the formula for the cost of a lookup, expressed as the expected number of probes:

1/(1-n/m)
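Evaluating this expression for a few fill levels shows how quickly it grows. The sketch assumes n stored elements in a table of m cells with n < m; the function name and the sample values are our own.

```python
def expected_probes(n: int, m: int) -> float:
    """Evaluate 1 / (1 - n/m) for n stored elements in a table of m cells."""
    if not 0 <= n < m:
        raise ValueError("need 0 <= n < m")
    return 1 / (1 - n / m)


m = 100
probes = {n: expected_probes(n, m) for n in (25, 50, 75, 90)}
# roughly 1.33 at 25% full, 2 at 50%, 4 at 75%, and 10 at 90%
```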

Here n/m is the load factor defined above. When the load factor exceeds 1/2, the expected number of probes exceeds 2, which is why we generally say that the storage efficiency of a hash table is only about 50%.
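Pulling together the creation, insertion, lookup, and deletion steps described earlier, here is a minimal sketch of an open-addressing hash table that uses a deletion flag. It assumes integer keywords, H(key) = key % m, and linear probing with step 1; the class name, method names, and the two marker objects are hypothetical, and no resizing is attempted.

```python
EMPTY = object()     # marker: cell has never held a keyword
DELETED = object()   # marker: deletion flag left behind by delete()


class OpenAddressingTable:
    def __init__(self, m: int):
        self.cells = [EMPTY] * m       # creation: every cell starts empty
        self.count = 0

    def _find(self, key: int):
        """Return (address of key or None, first reusable address or None)."""
        m = len(self.cells)
        home = key % m                 # assumed hash function H(key)
        free = None
        for i in range(m):
            address = (home + i) % m
            cell = self.cells[address]
            if cell is EMPTY:          # an empty cell ends the lookup path
                return None, address if free is None else free
            if cell is DELETED:        # a flagged cell does not end the path
                if free is None:
                    free = address
            elif cell == key:
                return address, free
        return None, free

    def contains(self, key: int) -> bool:
        return self._find(key)[0] is not None

    def insert(self, key: int) -> bool:
        found, free = self._find(key)
        if found is not None:
            return False               # already present, nothing to insert
        if free is None:
            raise OverflowError("hash table is full")
        self.cells[free] = key
        self.count += 1
        return True

    def delete(self, key: int) -> bool:
        found, _ = self._find(key)
        if found is None:
            return False
        self.cells[found] = DELETED    # flag the cell instead of emptying it
        self.count -= 1
        return True

    def load_factor(self) -> float:
        return self.count / len(self.cells)


table = OpenAddressingTable(7)
for key in (10, 17, 24):               # all three hash to address 3
    table.insert(key)
table.delete(17)                       # cell 4 now carries the deletion flag
still_found = table.contains(24)       # True: the lookup path is not broken
```

Had `delete` reset cell 4 to empty, the lookup for 24 would have stopped there and wrongly reported a failure, which is the broken lookup path the deletion section warns about.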

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
