I. Integer Hash Functions
There are three common methods: the direct remainder method, the product integer method, and the mid-square method. They are discussed in turn below. In what follows, the key is $k$, the capacity of the hash table is $M$, and the hash function is $h(k)$.
1. Direct remainder method
We divide the key $k$ by the table capacity $M$ and take the remainder as the position in the hash table. The function can be written as follows:
$$h(k) = k \bmod M$$
For example, with table capacity $M = 12$ and key $k = 100$, $h(k) = 100 \bmod 12 = 4$. This method is easy to implement and fast, and it is very widely used. However, if $M$ is chosen poorly and the data happens to have a special pattern, many keys collide in the hash table and efficiency suffers.
From experience, the capacity $M$ is generally chosen as a prime number that is not too close to a power of two. If the range of key values is small, $M$ is generally chosen between 1.1 and 1.6 times the size of that range. For example, the prime 701 used in the test below would suit a range of roughly 440 to 640 key values. During a competition, you can write a prime number generator or simply pick a number that "looks prime" yourself.
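As a rough illustration of the advice above, the following sketch implements the direct remainder hash together with a small trial-division helper that finds the first prime at or above a given bound. The names `mod_hash`, `is_prime`, and `next_prime` are hypothetical, and the helper does not check whether the result is close to a power of two; that check is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Direct remainder method: h(k) = k mod M. */
unsigned mod_hash(unsigned k, unsigned m)
{
    assert(m > 0);
    return k % m;
}

/* Trial division; adequate for the table sizes used in contests. */
static bool is_prime(unsigned n)
{
    if (n < 2) return false;
    for (unsigned d = 2; d <= n / d; d++)
        if (n % d == 0) return false;
    return true;
}

/* Smallest prime >= n (hypothetical helper for picking a capacity). */
unsigned next_prime(unsigned n)
{
    while (!is_prime(n)) n++;
    return n;
}
```

For instance, `next_prime(700)` returns 701, and `mod_hash(100, 12)` returns 4.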
I inserted 4000 keys into a hash table with a capacity of 701 using this method. The result is:
| Test data             | Random data | Continuous data |
| Minimum unit capacity | 0           | 5               |
| Maximum unit capacity | 15          | 6               |
| Expected capacity     | 5.70613     | 5.70613         |
| Standard deviation    | 2.4165      | 0.455531        |
It can be seen that for random data, the maximum unit capacity (the largest number of keys in one slot) under the direct remainder method is nearly three times the expected capacity. On my machine (Pentium III 866 MHz, 128 MB RAM), the function takes about 39 ns per call, that is, about 35 clock cycles.
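The continuous-data column can be checked with a short program. The sketch below assumes that "continuous data" means the consecutive keys 0 to n - 1 (the exact keys are not stated here); the names `load_stats` and `consecutive_load` are hypothetical. It reports the variance of the per-slot counts, whose square root is the standard deviation.

```c
#include <assert.h>

struct load_stats {
    unsigned min, max;   /* smallest / largest number of keys in one slot */
    double mean;         /* n / m */
    double variance;     /* population variance of the per-slot counts */
};

/* Insert keys 0..n-1 with h(k) = k mod m and summarise the slot loads. */
struct load_stats consecutive_load(unsigned n, unsigned m)
{
    enum { MAX_SLOTS = 4096 };
    static unsigned count[MAX_SLOTS];
    struct load_stats s;

    assert(m > 0 && m <= MAX_SLOTS);
    for (unsigned i = 0; i < m; i++) count[i] = 0;
    for (unsigned k = 0; k < n; k++) count[k % m]++;

    s.min = s.max = count[0];
    s.mean = (double)n / m;
    double sum_sq = 0.0;
    for (unsigned i = 0; i < m; i++) {
        if (count[i] < s.min) s.min = count[i];
        if (count[i] > s.max) s.max = count[i];
        double d = count[i] - s.mean;
        sum_sq += d * d;
    }
    s.variance = sum_sq / m;
    return s;
}
```

Under that assumption, `consecutive_load(4000, 701)` gives a minimum of 5, a maximum of 6, and a variance of about 0.2075, whose square root is about 0.4555.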
2. Product Integer Method
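We multiply the key $k$ by a real number $A$ in $(0, 1)$ (preferably an irrational number) to obtain a real number in $(0, k)$. We take the fractional part of this number, multiply it by the table capacity $M$, and take the integer part of the result as the position in the hash table. The function can be written as follows:
$$h(k) = \lfloor M \, (kA \bmod 1) \rfloor$$
where $kA \bmod 1$ denotes the fractional part of $kA$, that is, $kA - \lfloor kA \rfloor$. For example, with table capacity $M = 12$, seed $A = (\sqrt{5} - 1)/2 \approx 0.618$ (a good choice), and key $k = 100$, we get $kA \approx 61.803$ and $h(k) = \lfloor 12 \times 0.803 \rfloor = 9$.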
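A minimal sketch of the product integer method follows. The function name `mult_hash` is hypothetical, and the seed is assumed to be $(\sqrt{5} - 1)/2$, a commonly suggested value; any other seed in $(0, 1)$ could be substituted. The fractional part is taken by subtracting the truncated value, which avoids a dependency on the math library.

```c
#include <assert.h>

/* Product integer method: h(k) = floor(M * frac(k * A)), with 0 < A < 1. */
unsigned mult_hash(unsigned k, unsigned m)
{
    const double A = 0.6180339887498949;  /* (sqrt(5) - 1) / 2, an assumed seed */
    assert(m > 0);

    double x = k * A;
    double frac = x - (double)(unsigned long long)x;  /* fractional part of k*A */
    return (unsigned)(m * frac);                      /* integer part of M*frac */
}
```

With these assumptions, `mult_hash(100, 12)` returns 9.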
Inserting 4000 keys into a hash table with a capacity of 701 using this method gives the following result:
| Test data             | Random data | Continuous data |
| Minimum unit capacity | 0           | 4               |
| Maximum unit capacity | 15          | 7               |
| Expected capacity     | 5.70613     | 5.70613         |
| Standard deviation    | 2.5069      | 0.619999        |
The formula shows that this method is affected very little by the choice of table capacity, so it performs well when the capacity is poorly suited to the direct remainder method. However, the test above shows that its results are not entirely satisfactory, and it runs slowly because of the many floating-point operations. Even after repeated optimization, one computation still takes 892 ns on my machine, that is, 810 clock cycles, 23 times as long as the direct remainder method.
3. Mid-square method
We square the key $k$ and return the middle $\log_2 M$ bits of the square as the hash value, where $M$ is the table capacity. Because every digit of $k$ influences the middle digits of its square, this method also works well. However, it is not ideal for small values of $k$, and it is more cumbersome to implement. To make full use of the space of the hash table, $M$ is best chosen as an integer power of 2. For example, with table capacity $M = 2^4 = 16$ and key $k = 100$, the square is $10000 = 10011100010000_2$, and its middle four bits, $1000_2$, give $h(k) = 8$.
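One possible reading of "the middle bits" is sketched below: measure the bit length of the square and drop an equal number of bits from each end (one extra from the high end when the difference is odd). The function name `mid_square_hash` and this centering rule are assumptions. A version that shifts by a fixed amount would avoid the loop and run faster, at the cost of handling small keys less evenly.

```c
#include <assert.h>
#include <stdint.h>

/* Mid-square sketch: square k and return the middle `bits` bits of the
   square, so the table capacity is M = 2^bits. */
unsigned mid_square_hash(uint32_t k, unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    uint64_t sq = (uint64_t)k * k;

    unsigned len = 0;                        /* bit length of the square */
    for (uint64_t t = sq; t != 0; t >>= 1) len++;

    if (len <= bits) return (unsigned)sq;    /* square already fits in the table */

    unsigned shift = (len - bits) / 2;       /* number of low bits to drop */
    return (unsigned)((sq >> shift) & ((UINT64_C(1) << bits) - 1));
}
```

For example, `mid_square_hash(100, 4)` squares 100 to get 10000 (14 bits), drops the 5 lowest bits, and keeps the next 4, returning 8.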
Inserting 4000 keys into a hash table with a capacity of 512 (note that 701 is not used here, so that the space of the hash table is fully used) gives the following result:
| Test data             | Random data | Continuous data |
| Minimum unit capacity | 0           | 1               |
| Maximum unit capacity | 17          | 17              |
| Expected capacity     | 7.8125      | 7.8125          |
| Standard deviation    | 2.95804     | 2.64501         |
The result is worse than expected, especially for continuous data. However, since only multiplication and bitwise operations are needed, this function is the fastest. On my machine, one computation takes only 23 ns, that is, 19 clock cycles, which is faster than the direct remainder method.
Comparison of the three methods:
| Method                  | Implementation difficulty | Actual effect | Running speed | Other applications     |
| Direct remainder method | Easy                      | Good          | Fast          | Strings                |
| Product integer method  | Fairly easy               | Fairly good   | Slow          | Floating-point numbers |
| Mid-square method       | Medium                    | Fairly good   | Fast          | None                   |
From this table, we can easily see that the direct remainder method is the most cost-effective, so it is also the method used most often in competitions.
For hash functions on real numbers, we can use the product integer method directly. For hash functions whose keys are other types of data, we can first convert the keys into integers and then insert them into the hash table. Next we study how to convert other types of data into integers.
II. String Hash Functions
A string can itself be regarded as a large integer in base 256 (base 128 for an ANSI string). We can therefore apply the direct remainder method and compute the hash value in linear time. To keep the distribution good, we still should not choose a table capacity $M$ that is close to a power of two. In particular, when the string is treated as a base-256 number, choosing $M = 255$ makes every permutation of the string hash to the same value. (Think about it: why?)
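The linear-time computation can be sketched as follows, assuming the string is read as a base-256 integer with one byte per digit. The name `str_mod_hash` is hypothetical. The remainder is reduced after every character, so the large integer is never built explicitly.

```c
#include <assert.h>

/* Treat the string as a base-256 integer and reduce it modulo m
   one character at a time (Horner's rule). */
unsigned str_mod_hash(const char *s, unsigned m)
{
    assert(m > 0 && m <= 0x00FFFFFFu);  /* keeps h * 256 + c within 32 bits */
    unsigned h = 0;
    while (*s)
        h = (h * 256u + (unsigned char)*s++) % m;
    return h;
}
```

Trying a poorly chosen capacity makes the problem visible: `str_mod_hash("abc", 255)` and `str_mod_hash("cba", 255)` both return 39, whereas with a capacity of 1237 the two strings hash to 496 and 444.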
Common string hash functions such as ELFHash and APHash are simple and effective. These functions use bitwise operations so that every character affects the final value. There are also hash functions represented by MD5 and SHA-1, for which finding a collision is almost impossible (although MD5 was broken some time ago).
I randomly selected 1000 different words and 1000 different sentences from one of Mark Twain's novels as test data for short and long strings. I then used different hash functions to convert them into integers and inserted them into a hash table with a capacity of 1237 using the direct remainder method; on a conflict, the new string overwrites the old one. By counting the strings that remain in the table, we can roughly judge the actual effect of each hash function.
| Hash function         | Short strings | Long strings | Average | Coding difficulty |
| Direct remainder      | 667           | 676          | 671.5   | Easy              |
| P. J. Weinberger hash | 683           | 676          | 679.5   | Hard              |
| ELF hash              | 683           | 676          | 679.5   | Hard              |
| SDBM hash             | 694           | 680          | 687.0   | Easy              |
| BKDR hash             | 665           | 710          | 687.5   | Fairly easy       |
| DJB hash              | 694           | 683          | 688.5   | Fairly easy       |
| AP hash               | 684           | 698          | 691.0   | Hard              |
| RS hash               | 691           | 693          | 692.0   | Hard              |
| JS hash               | 684           | 708          | 696.0   | Fairly easy       |
For reference, when 1000 random numbers are inserted into a hash table with a capacity of 1237 using the direct remainder method, the number of covered units also reaches 694. The later methods in the table have therefore reached the limit, and their randomness is excellent. However, the choice is difficult because no solution is at once perfect, simple, and practical. I generally choose JS hash or SDBM hash as the string hash function. The code for these two functions is as follows:
unsigned int JSHash(char *str)
{
    unsigned int hash = 1315423911; // nearly a prime - 1315423911 = 3 * 438474637

    while (*str)
    {
        hash ^= ((hash << 5) + (*str++) + (hash >> 2));
    }

    return (hash & 0x7FFFFFFF);
}

unsigned int SDBMHash(char *str)
{
    unsigned int hash = 0;

    while (*str)
    {
        // equivalent to: hash = 65599*hash + (*str++);
        hash = (*str++) + (hash << 6) + (hash << 16) - hash;
    }

    return (hash & 0x7FFFFFFF);
}
JSHash uses more complex operations. If the requirements on hashing quality are not particularly high, SDBMHash is a good choice.
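The overwrite experiment described earlier can be approximated with the sketch below. The names `sdbm_hash`, `count_surviving`, and `TABLE_SIZE` are hypothetical; the SDBM function is repeated, with a const-qualified parameter, only so that the example compiles on its own. Each string goes to slot `hash % 1237`, a later string overwrites an earlier one, and the number of occupied slots is counted at the end.

```c
#include <assert.h>
#include <stddef.h>

/* SDBM hash as listed above, repeated with a const-qualified parameter. */
unsigned int sdbm_hash(const char *str)
{
    unsigned int hash = 0;
    while (*str)
        hash = (*str++) + (hash << 6) + (hash << 16) - hash;
    return hash & 0x7FFFFFFF;
}

enum { TABLE_SIZE = 1237 };

/* Insert every string at slot hash(s) % TABLE_SIZE, overwriting on
   conflict, and return how many slots end up occupied. */
size_t count_surviving(const char **strs, size_t n,
                       unsigned int (*hash)(const char *))
{
    const char *table[TABLE_SIZE] = {0};
    size_t occupied = 0;

    for (size_t i = 0; i < n; i++)
        table[hash(strs[i]) % TABLE_SIZE] = strs[i];
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (table[i] != NULL) occupied++;
    return occupied;
}
```

Passing a different function pointer, such as a JS hash with the same signature, allows the same word list to be compared across hash functions.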