We like to use arrays for data lookup because an array is a "random access" data structure: the storage location of any element can be calculated directly from the array's starting address and the element's subscript, so a lookup takes O(1) time regardless of how many elements the array holds.
Building on this idea, suppose there were a data structure that let us search by keyword with the same direct access an array offers, reducing the time complexity from O(n) to O(1). That would greatly improve search efficiency. From this idea our predecessors invented hashing, also called the hash method or the keyword address calculation method.

Basic Ideas
We want a relationship that takes the keyword (key) we want to store and directly calculates the position (p) where it should be stored. Once this relationship is established, finding a keyword only requires evaluating the relationship again to obtain the keyword's address, so the time complexity of the lookup drops to O(1). Written as a mathematical relationship:
p (position) = H(key)
Here H is the correspondence, which we call the hash function, and p is called the hash address. The core of hashing is therefore to find a hash function H and to use it both to organize storage and to look things up.

Methods of Constructing a Hash Function
Let's look at a question first:
Suppose we have some words (keywords): and, cell, do, flag, and so on. If the hash function returns the position of the keyword's first letter in the alphabet, these keywords are spread over the hash addresses in order. However, when we apply the same rule to and, ant, apple, cell, do, and so on, an "address conflict" arises, because the first three words start with the same letter and therefore get the same hash address.
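The conflict described above is easy to reproduce. The following is a minimal sketch, not the article's own code; the names `first_letter_hash` and `print_addresses` are hypothetical, and the addresses are assumed to be numbered 0 to 25.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: maps a word to an address 0..25 by its first
   letter, or returns -1 if the word does not start with a letter. */
int first_letter_hash(const char *word)
{
    unsigned char c = (unsigned char)word[0];
    if (!isalpha(c))
        return -1;
    return tolower(c) - 'a';
}

/* Prints each word with its address so that conflicts are visible. */
void print_addresses(const char *words[], int count)
{
    for (int i = 0; i < count; i++)
        printf("%-6s -> %d\n", words[i], first_letter_hash(words[i]));
}
```

With this rule, "and", "ant" and "apple" all map to address 0, while "cell", "do" and "flag" map to 2, 3 and 5.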
Because a hash function is a compressing mapping (it maps a large set of possible keywords onto a smaller range of addresses), hash functions that never conflict are rare in practice. How to construct a suitable hash function, so that the nodes are "evenly distributed" and conflicts are as few as possible, is therefore one of the problems we have to solve.
The principles for a hash function are simplicity and uniformity: the function itself should be simple and quick to calculate, and its values must fall within the hash address range and be evenly distributed, so that address conflicts are as few as possible.
There are several common methods of constructing hash functions:

1. Division Remainder Method
This is the simplest and most common method. Assume the table length (the number of hash addresses) is m, and let p be the largest prime less than or equal to m. The hash function is then H(key) = key % p, where H(key) is the location in the hash address space.
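A minimal sketch of this method follows, assuming non-negative integer keys. The function names are hypothetical, and choosing p as the largest prime not exceeding m is done here by simple trial division, which is only an illustration.

```c
#include <stdbool.h>

/* Trial division; adequate for the small table sizes used here. */
static bool is_prime(unsigned n)
{
    if (n < 2)
        return false;
    for (unsigned d = 2; d <= n / d; d++)
        if (n % d == 0)
            return false;
    return true;
}

/* Hypothetical helper: largest prime <= m, or 0 if m < 2. */
unsigned largest_prime_at_most(unsigned m)
{
    for (unsigned p = m; p >= 2; p--)
        if (is_prime(p))
            return p;
    return 0;
}

/* H(key) = key % p. The caller must pass p > 0. */
unsigned division_hash(unsigned key, unsigned p)
{
    return key % p;
}
```

For a table of length 16 this picks p = 13, and the key 100 is stored at address 100 % 13 = 9.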
p should be a prime number in order to reduce the likelihood of "address collisions".

2. Digit Analysis Method
Assume every keyword in the set consists of n digits (d1, d2, ..., dn). If the frequency with which each digit value appears in each position can be estimated in advance, analyze the keyword set as a whole and extract the positions whose values are evenly distributed, or a combination of them, as the hash address.
For example:
H(49646542) = 465, H(49673242) = 732, ...
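The two addresses in this example come from the 4th to 6th digits of each keyword, counted from the left. A minimal sketch of that extraction, with the hypothetical name `extract_digits` and the assumption that keys are plain decimal integers:

```c
/* Hypothetical helper: returns decimal digits first..last of key
   (1-based, counted from the left) as a number, or -1 if the range
   is invalid. Assumes the selected digits fit in a long. */
long extract_digits(unsigned long key, int first, int last)
{
    int digits[20];              /* least significant digit first */
    int n = 0;
    do {
        digits[n++] = (int)(key % 10);
        key /= 10;
    } while (key > 0);

    if (first < 1 || last > n || first > last)
        return -1;

    long value = 0;
    for (int pos = first; pos <= last; pos++)
        value = value * 10 + digits[n - pos];
    return value;
}
```

Calling `extract_digits(49646542, 4, 6)` yields 465 and `extract_digits(49673242, 4, 6)` yields 732, matching the example.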
The analysis shows that the values in the 4th to 6th digit positions of the keywords are the most evenly distributed, so the hash function is H(key) = d4d5d6.

3. Mid-Square Method
Integer division usually runs slower than multiplication, so avoiding the division remainder method can improve the running time of the hash algorithm. The mid-square method first squares the keyword, which magnifies the differences between similar keywords, and then takes several digits from the middle of the square, as many as the table length requires, as the hash value. Because the middle digits of a product depend on every digit of the multiplier, the resulting hash addresses are more uniform.

4. Folding Method
Sometimes a keyword has so many digits that squaring it is too expensive. In that case, divide the keyword into parts with the same number of digits (the last part may be shorter), add the parts together, and take the sum with the carry discarded as the hash address.
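A minimal sketch of the simplest variant, in which the parts are added as they are. The name `fold_shift` is hypothetical, and the sketch assumes that the parts are cut starting from the low-order end, so the shorter part is the leading one.

```c
/* Hypothetical helper: split key into groups of `width` decimal digits
   starting from the low-order end, add the groups, and drop any carry
   beyond `width` digits. Assumes 1 <= width <= 9. */
unsigned long fold_shift(unsigned long long key, int width)
{
    unsigned long modulus = 1;
    for (int i = 0; i < width; i++)
        modulus *= 10;           /* modulus = 10^width */

    unsigned long sum = 0;
    while (key > 0) {
        /* Add the lowest group and discard the carry. */
        sum = (sum + (unsigned long)(key % modulus)) % modulus;
        key /= modulus;
    }
    return sum;
}
```

For the key 442205864 and a width of 4, the parts are 5864, 4220 and 4; their sum is 10088, and discarding the carry leaves the address 88.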
There are two concrete ways to add the parts: shift folding and boundary folding.

5. Radix Conversion Method
First treat the keyword as a number written in a different radix, then convert it to a number in the original radix, and finally select several of its digits as the hash address.
For example:
Treat the decimal number 362081 as a base-13 number. Converted to decimal, it is 1289744. Assuming the hash table length is 10000, the lowest four digits, 9744, can be taken as the hash address.
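The arithmetic in this example can be checked with a short sketch. The function names are hypothetical, and the sketch assumes every decimal digit of the key is smaller than the chosen radix.

```c
/* Hypothetical helper: reads the decimal digits of key as if they were
   written in `radix` and returns the resulting value.
   Assumes every digit of key is < radix and the result does not overflow. */
unsigned long reinterpret_in_radix(unsigned long key, unsigned radix)
{
    unsigned long value = 0;
    unsigned long place = 1;     /* radix^0, radix^1, ... */
    while (key > 0) {
        value += (key % 10) * place;
        place *= radix;
        key /= 10;
    }
    return value;
}

/* Keeps the low-order part of the converted value that fits a table
   of table_size addresses. Assumes table_size > 0. */
unsigned long radix_hash(unsigned long key, unsigned radix,
                         unsigned long table_size)
{
    return reinterpret_in_radix(key, radix) % table_size;
}
```

Here `reinterpret_in_radix(362081, 13)` gives 1289744, and `radix_hash(362081, 13, 10000)` gives 9744.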
In general, choose a hashing method to suit the actual situation and test its performance. The factors usually considered are: the time required to calculate the hash function, the length of the keywords, the size of the hash table, the distribution of the keywords, and the frequency of record lookups.

Ways to Handle Conflicts
As noted above, address conflicts can rarely be avoided in practice, so what should we do once one occurs? A natural idea is to find another hash address for the keyword that caused the conflict.

Open Addressing (Re-hashing) Method
Basic idea:
When the initial hash address H0 = H(key) of a keyword conflicts, a next address H1 is generated from H0. If H1 still conflicts, another address H2 is generated from H0, and so on, until a non-conflicting address Hi is found and the element is stored there. This method has a general re-hash function of the form:
Hi = (H(key) + di) % m, i = 1, 2, ...
where H0 = H(key), m is the table length, and di is the increment sequence. Different choices of the increment sequence give different re-hashing methods:

1. Linear Probing
di = c × i
The simplest case: c = 1
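As a minimal sketch of this increment sequence, the function below computes the i-th probe address Hi = (H0 + c × i) % m. The name `linear_probe` and its parameter order are hypothetical.

```c
/* Hypothetical helper: i-th probe address for linear probing.
   h0 is H(key), c is the step, m is the table length (m > 0).
   The product is formed in a wider type to avoid overflow. */
unsigned linear_probe(unsigned h0, unsigned i, unsigned c, unsigned m)
{
    return (unsigned)((h0 + (unsigned long long)c * i) % m);
}
```

With H0 = 9, c = 1 and m = 11, the probes visit addresses 9, 10, 0, 1, and so on: after the last cell the sequence wraps around to the start of the table.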
Features: When a conflict occurs, the following cells in the table are examined in order until an empty cell is found or the whole table has been searched. Note that because the % (remainder) operator is used, the table behaves somewhat like a circular queue: the cell after the last cell is the first cell, and the cell before the first cell is the last.

2. Quadratic Probing
di = +1^2, -1^2, +2^2, -2^2, ..., +k^2, -k^2 (k <= m/2)
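A minimal sketch of this alternating sequence follows. The name `quadratic_probe` is hypothetical, and the mapping from the probe number i to the pair (k, sign) is an assumption made for illustration: odd i gives +k^2, even i gives -k^2, with k = (i + 1) / 2.

```c
/* Hypothetical helper: i-th probe address (i >= 1) for the sequence
   di = +1^2, -1^2, +2^2, -2^2, ...; i == 0 returns H0 itself.
   Assumes m > 0 and that k*k fits in a long long. */
unsigned quadratic_probe(unsigned h0, unsigned i, unsigned m)
{
    long long k = (long long)((i + 1) / 2);
    long long d = (i % 2 == 1) ? k * k : -(k * k);
    long long r = ((long long)(h0 % m) + d) % (long long)m;
    if (r < 0)
        r += (long long)m;   /* C's % can yield a negative remainder */
    return (unsigned)r;
}
```

With H0 = 5 and m = 11, the first probes visit 6, 4, 9, 1, 3 and 7, jumping to either side of the home address.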
Features: When a conflict occurs, probing jumps alternately to the right and to the left of the home address. This is more flexible and less prone to clustering; the disadvantage is that it may not probe the entire hash address space.

3. Random Probing
di = pseudo-random number sequence
Features: A pseudo-random number generator is set up and given a fixed starting value (seed), so that the same probe sequence is produced again at lookup time.

Chain Address Method
Basic idea:
Link all keywords whose addresses conflict into the same singly linked list.
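A minimal sketch of this idea follows, assuming non-negative integer keys, the hash function key % m, and an array holding one list head per hash address. The table length of 7 and all type and function names are hypothetical.

```c
#include <stdlib.h>

#define TABLE_LEN 7              /* assumed table length m */

struct node {
    int key;
    struct node *next;
};

/* One head pointer per hash address; all must start out NULL. */
struct chain_table {
    struct node *heads[TABLE_LEN];
};

static unsigned chain_hash(int key)
{
    return (unsigned)key % TABLE_LEN;
}

/* Returns the node holding key, or NULL if it is absent. */
struct node *chain_find(const struct chain_table *t, int key)
{
    for (struct node *n = t->heads[chain_hash(key)]; n != NULL; n = n->next)
        if (n->key == key)
            return n;
    return NULL;
}

/* Inserts key at the head of its chain.
   Returns 1 if inserted, 0 if already present, -1 if allocation fails. */
int chain_insert(struct chain_table *t, int key)
{
    if (chain_find(t, key) != NULL)
        return 0;
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    unsigned i = chain_hash(key);
    n->key = key;
    n->next = t->heads[i];
    t->heads[i] = n;
    return 1;
}

/* Unlinks and frees the node holding key. Returns 1 if removed, else 0. */
int chain_delete(struct chain_table *t, int key)
{
    struct node **link = &t->heads[chain_hash(key)];
    while (*link != NULL && (*link)->key != key)
        link = &(*link)->next;
    if (*link == NULL)
        return 0;
    struct node *victim = *link;
    *link = victim->next;
    free(victim);
    return 1;
}
```

The keys 10 and 3 both hash to address 3 here, so they end up in the same list; removing one of them leaves the other reachable.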
If the hash table length is m, the hash table can be defined as an array of m head pointers. Every node whose hash address is i is inserted into the singly linked list whose head pointer is the i-th cell of the pointer array.

Performance Index
The main performance indicator for lookup efficiency is the average search length (ASL).
ASL(succ) = (sum of the number of comparisons) / (number of keywords)
The number of comparisons for a keyword is the number of hash addresses that had to be examined, each time checking whether the address already holds a value, before the keyword could be placed without an address conflict.
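To make the formula concrete, the sketch below counts comparisons while inserting keys with H(key) = key % m and linear probing, then divides by the number of keywords. The name `asl_success`, the limit `MAX_LEN` and the use of -1 as the empty marker are all assumptions of this example.

```c
#define MAX_LEN 64               /* assumed upper bound on table length */

/* Hypothetical helper: inserts count distinct non-negative keys into an
   empty table of length m and returns ASL(succ), or -1.0 if the
   arguments are out of range. */
double asl_success(const int keys[], int count, int m)
{
    if (count <= 0 || m <= 0 || m > MAX_LEN || count > m)
        return -1.0;

    int table[MAX_LEN];
    for (int i = 0; i < m; i++)
        table[i] = -1;                 /* -1 marks an empty cell */

    int total = 0;
    for (int k = 0; k < count; k++) {
        int addr = keys[k] % m;
        int comparisons = 1;           /* examining the home address */
        while (table[addr] != -1) {    /* occupied: probe the next cell */
            addr = (addr + 1) % m;
            comparisons++;
        }
        table[addr] = keys[k];
        total += comparisons;
    }
    return (double)total / count;
}
```

For the keys 7, 14, 21 and 3 with m = 7, the comparison counts are 1, 2, 3 and 1, so ASL(succ) = 7 / 4 = 1.75.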
The smaller the ASL, the better the performance.

Hash Table
With this foundation, let's try to build a hash table ourselves and implement insertion, lookup, and deletion.

Creation of a Hash Table
1. First, set the keyword of every node in the table to empty;
2. Insert the given keyword sequence into the hash table one keyword at a time using the insertion algorithm.

Inserting into a Hash Table
1. Use the lookup algorithm to find the position in the table of the record to be inserted;
2. If the record is already in the table, there is no need to insert it; if it is not found, the lookup algorithm yields the address of an empty cell, and the record is inserted into that cell.

Lookup in a Hash Table
1. Calculate the hash address from the keyword to be found, using the hash function that was used when the table was built;
2. If the cell at that address is empty, the lookup fails. If it is not empty, compare the keyword in the cell with the keyword being looked up:
If they are equal, the lookup succeeds;
If not, compute the next address using the conflict-handling method chosen when the table was built.
3. Repeat step 2 until either a cell is empty, in which case the lookup fails, or a cell's keyword equals the keyword being looked up, in which case the lookup succeeds.

Deletion from a Hash Table
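The creation, insertion and lookup steps above can be sketched for open addressing with linear probing as follows. This is a minimal illustration rather than the linked implementation: the table length of 11, the non-negative integer keys and every name are assumptions. The sketch also marks removed cells with a flag instead of emptying them, so that probe paths passing through them stay intact.

```c
#define OA_LEN 11                /* assumed table length m */

enum cell_state { CELL_EMPTY, CELL_USED, CELL_DELETED };

struct oa_table {
    int key[OA_LEN];
    enum cell_state state[OA_LEN];
};

/* Creation: mark every cell empty. */
void oa_init(struct oa_table *t)
{
    for (int i = 0; i < OA_LEN; i++)
        t->state[i] = CELL_EMPTY;
}

/* Lookup: follow the probe sequence from H(key) until the key is found
   or an empty cell ends the path. Flagged cells are skipped.
   Returns the cell index, or -1 if the key is absent. */
int oa_find(const struct oa_table *t, int key)
{
    int h0 = key % OA_LEN;
    for (int i = 0; i < OA_LEN; i++) {
        int addr = (h0 + i) % OA_LEN;
        if (t->state[addr] == CELL_EMPTY)
            return -1;
        if (t->state[addr] == CELL_USED && t->key[addr] == key)
            return addr;
    }
    return -1;
}

/* Insertion: search first; if absent, store the key in the first cell
   on its probe path that is not in use. Returns the cell index holding
   the key, or -1 if the table is full. */
int oa_insert(struct oa_table *t, int key)
{
    int found = oa_find(t, key);
    if (found >= 0)
        return found;
    int h0 = key % OA_LEN;
    for (int i = 0; i < OA_LEN; i++) {
        int addr = (h0 + i) % OA_LEN;
        if (t->state[addr] != CELL_USED) {
            t->key[addr] = key;
            t->state[addr] = CELL_USED;
            return addr;
        }
    }
    return -1;
}

/* Deletion: set a flag rather than emptying the cell.
   Returns 1 if the key was present, else 0. */
int oa_delete(struct oa_table *t, int key)
{
    int addr = oa_find(t, key);
    if (addr < 0)
        return 0;
    t->state[addr] = CELL_DELETED;
    return 1;
}
```

For example, the keys 5, 16 and 27 all hash to address 5 and occupy cells 5, 6 and 7. After 16 is removed, a lookup for 27 still succeeds because cell 6 is flagged rather than empty.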
A hash table based on open addressing cannot truly delete a node; it can only set a deletion flag on the node. Otherwise, a later lookup could fail to find a node that was inserted after the deleted node and had conflicted with it. In other words, a true delete would break the lookup path. If the hash table must support true deletion, it is best to handle conflicts with the chain address method.

Load Factor of a Hash Table
α = (number of elements stored in the hash table) / (length of the hash table)
The smaller α is, the less likely conflicts are, but the lower the space utilization;
The larger α is, the more likely conflicts are, but the higher the space utilization.

Why Is the Storage Efficiency of a Hash Table Generally Only 50%
From the load factor above we know that the larger α is, the greater the chance of conflict and the more cells a lookup has to examine. Now look at the formula for the cost of a lookup (for open addressing, this is the usual estimate of the expected number of probes in an unsuccessful lookup):
1/(1 - n/m)
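To see how quickly this expression grows, it can be evaluated for a few table fill levels. The helper name `expected_probes` is hypothetical, and the sketch simply evaluates the formula as written.

```c
/* Hypothetical helper: evaluates 1 / (1 - n/m), where n is the number
   of stored elements and m the table length.
   Valid only for m > 0 and 0 <= n < m. */
double expected_probes(int n, int m)
{
    double alpha = (double)n / m;    /* load factor */
    return 1.0 / (1.0 - alpha);
}
```

A half-full table gives 2, a table that is three-quarters full gives 4, and a table that is nine-tenths full gives about 10.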
Here n/m is the load factor α defined above. When the load factor exceeds 1/2, the value of this formula exceeds 2, which is why we generally say the storage efficiency of a hash table is only 50%.

Implementation Code for a Hash Table (C Language)
GitHub Link: Hash table