Analysis and extension of the BKDRHash algorithm for hash tables

Source: Internet
Author: User

BKDRHash is a string hash algorithm. Like APHash, DJBHash, JSHash, RSHash, SDBMHash, PJWHash, ELFHash and others, it is one of the classics. From the article at http://blog.csdn.net/wanglx_/article/details/40300363 (string hash functions), we know that BKDRHash is one of the better ways to compute a hash value. The following explains how the BKDRHash function is derived and implemented.

When I first saw the BKDRHash code, I could not help wondering: there is a constant seed with a value such as 31 or 131, so why these values and not others? And why is each character added in and the result multiplied by the seed? What does it all mean? I thought about it for a long time without finding the answer, and got stuck on the idea that it had something to do with prime numbers. Finally, with guidance from an expert, it became clear. My reasoning and the derivation are recorded below.

Derivation of the BKDRHash formula

To compute a hash value for a string (for example, "ad") with few collisions, every character in the string should take part in the calculation, so that the hash shows an avalanche effect: changing even one byte of the string should change the final hash value substantially. The simplest idea is to add up all the characters and use the sum as the hash value, for example sum("ad") = a + d. But from the ASCII table we know that a (97) + d (100) = b (98) + c (99), so "ad" and "bc" collide. Plain summation collides very easily. What can we do? We can widen the gap between characters by multiplying each one by a coefficient:

sum("ad") = coefficient1 * a + coefficient2 * d

sum("bc") = coefficient1 * b + coefficient2 * c

Because coefficient1 is not equal to coefficient2, the probability that sum("ad") equals sum("bc") is greatly reduced.
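The collision and its fix can be checked with a short C++ sketch. The function names `additive_hash` and `two_coefficient_hash`, and the coefficient values 3 and 5, are assumptions chosen for illustration; they are not part of BKDRHash.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Plain summation: every character has the same weight (1).
std::uint32_t additive_hash(const std::string& s) {
    std::uint32_t sum = 0;
    for (unsigned char c : s) sum += c;
    return sum;
}

// Two-character strings only: a different coefficient per position.
// The values 3 and 5 are arbitrary picks for illustration.
std::uint32_t two_coefficient_hash(const std::string& s) {
    assert(s.size() == 2);
    return 3u * static_cast<unsigned char>(s[0]) +
           5u * static_cast<unsigned char>(s[1]);
}
```

With these definitions, `additive_hash("ad")` and `additive_hash("bc")` both give 197, while `two_coefficient_hash` gives 791 for "ad" and 789 for "bc".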

However, a string is rarely just two or three characters long, and we cannot assign a coefficient to each position by hand. What a string does have is an order of positions: in "ab", b is at position 0 and a is at position 1. So we can use the n-th power of a single coefficient as the weight of the character at position n, provided that the coefficient is not 1:

sum("ad") = coefficient^1 * a + coefficient^0 * d

sum("bc") = coefficient^1 * b + coefficient^0 * c

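This greatly reduces collisions. In general, suppose p is a character array with n elements. Then

sum(p) = coefficient^(n-1) * p[0] + coefficient^(n-2) * p[1] + ... + coefficient^0 * p[n-1]

That is, each character p[i] is multiplied by coefficient^(n-1-i), and the products are added up.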
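A minimal sketch of the positional sum, to show why the coefficient must not be 1. The names `weighted_sum` and `check_weighted_sum` are hypothetical, and the coefficient 3 is just an example value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// sum(p) = coefficient^(n-1) * p[0] + ... + coefficient^0 * p[n-1],
// evaluated term by term from the last character to the first.
std::uint32_t weighted_sum(const std::string& p, std::uint32_t coefficient) {
    std::uint32_t sum = 0;
    std::uint32_t power = 1;  // coefficient^0 belongs to the last character
    for (std::size_t i = p.size(); i-- > 0;) {
        sum += power * static_cast<unsigned char>(p[i]);
        power *= coefficient;
    }
    return sum;
}

void check_weighted_sum() {
    // Coefficient 1 degenerates to plain summation, so "ad" and "bc" collide again.
    assert(weighted_sum("ad", 1) == weighted_sum("bc", 1));
    // Coefficient 3: 3*97 + 100 = 391 versus 3*98 + 99 = 393.
    assert(weighted_sum("ad", 3) == 391u);
    assert(weighted_sum("bc", 3) == 393u);
}
```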


This raises the question of the coefficient's value: what should it be? From the analysis above, it can be anything other than 1. Every integer is either odd or even, and to make the analysis easier we split the even numbers into powers of 2 and even numbers that are not powers of 2. That gives three cases to discuss.

Derivation of the coefficient

Our task now is to derive the value of the coefficient. We discuss three cases: powers of 2, even numbers that are not powers of 2, and odd numbers.

A. Take a power of 2

If we take 32, which is 2^5, then sum("ad") and sum("bc") come out as follows:

sum("ad") = 32 * 97 + 100 = 3204

sum("bc") = 32 * 98 + 99 = 3235

The results differ, so this collision is avoided.

But further testing reveals a problem. If we compute sum("ahijklmn") and sum("hijklmn"), then sum("abhijklmn") and sum("abchijklmn"), then sum("abcdefghijklmn") and sum("123456hijklmn"), we find that as long as the trailing characters "hijklmn" are unchanged, the resulting hash is the same no matter what comes before them. A complete collision! Why is that?

First, what type is used to store the hash value sum? An unsigned int, of course, because the value can become very large. An unsigned int is 32 bits wide, and the calculation can overflow. On overflow the CPU simply discards the high bits: if an operation on two unsigned int values produces a 33-bit result, the 33rd (highest) bit is discarded. Now let us work through the case above.

Calculate sum("ahijklmn") and sum("bhijklmn"):

sum("ahijklmn") = 32^7*a + 32^6*h + 32^5*i + 32^4*j + 32^3*k + 32^2*l + 32^1*m + 32^0*n

sum("bhijklmn") = 32^7*b + 32^6*h + 32^5*i + 32^4*j + 32^3*k + 32^2*l + 32^1*m + 32^0*n

Rewrite 32 as 2^5:

sum("ahijklmn") = 2^35*a + 2^30*h + 2^25*i + 2^20*j + 2^15*k + 2^10*l + 2^5*m + 2^0*n

sum("bhijklmn") = 2^35*b + 2^30*h + 2^25*i + 2^20*j + 2^15*k + 2^10*l + 2^5*m + 2^0*n

Both sum("ahijklmn") and sum("bhijklmn") exceed the maximum value an unsigned int can hold, so the high bits are discarded. This is the same as taking the remainder modulo 0x100000000 (that is, 2^32). By the rules of modular arithmetic:

(a+b)%m = (a%m + b%m)%m

(a*b)%m = (a%m * b%m)%m

we get

sum("ahijklmn") % 2^32 = (2^35*a % 2^32 + 2^30*h % 2^32 + ... + 2^0*n % 2^32) % 2^32

sum("bhijklmn") % 2^32 = (2^35*b % 2^32 + 2^30*h % 2^32 + ... + 2^0*n % 2^32) % 2^32

Since 2^35 is a multiple of 2^32, both 2^35*a % 2^32 and 2^35*b % 2^32 are zero: the overflow is discarded by the CPU. So

sum("ahijklmn") % 2^32 = (2^30*h % 2^32 + ... + 2^0*n % 2^32) % 2^32

sum("bhijklmn") % 2^32 = (2^30*h % 2^32 + ... + 2^0*n % 2^32) % 2^32

Ultimately, their hash values are

sum("ahijklmn") = (2^30*h + 2^25*i + 2^20*j + 2^15*k + 2^10*l + 2^5*m + 2^0*n) % 2^32

sum("bhijklmn") = (2^30*h + 2^25*i + 2^20*j + 2^15*k + 2^10*l + 2^5*m + 2^0*n) % 2^32

So sum("ahijklmn") equals sum("bhijklmn"). This is why, as long as "hijklmn" is unchanged, whatever precedes it is discarded and the same hash value results. Here the coefficient is 32 = 2^5, but the same happens with any 2^n, whatever n is: once the string reaches a certain length, the leading characters are discarded and a collision results.
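The collision can be reproduced with a small C++ sketch. It assumes a 32-bit hash (`std::uint32_t`) and builds the sum incrementally, which gives the same result as the term-by-term formula because unsigned arithmetic wraps modulo 2^32. The names `poly_hash` and `check_power_of_two_collision` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Positional sum with a configurable coefficient, wrapping modulo 2^32.
std::uint32_t poly_hash(const std::string& s, std::uint32_t coefficient) {
    std::uint32_t hash = 0;
    for (unsigned char c : s) hash = hash * coefficient + c;
    return hash;
}

void check_power_of_two_collision() {
    // 32^7 = 2^35 is a multiple of 2^32, so any character 7 or more
    // positions from the end contributes nothing to the result.
    assert(poly_hash("ahijklmn", 32) == poly_hash("bhijklmn", 32));
    assert(poly_hash("ahijklmn", 32) == poly_hash("hijklmn", 32));
    assert(poly_hash("abcdefghijklmn", 32) == poly_hash("123456hijklmn", 32));
}
```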

B. Take an even number that is not a power of 2

Since a power of 2 does not work, let us try an even number that is not a power of 2, say 6. Because 6 = 2 * 3, the same reasoning as for powers of 2 applies: once the string is longer than 32 characters, the coefficient of a leading character is at least 6^32 = 3^32 * 2^32. That is a multiple of 2^32, so the term is discarded on overflow. As long as the last 32 characters are unchanged, whatever characters come before them are discarded and the computed hash value is the same.

From the two cases above, an even coefficient is not feasible.
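A short sketch of the even case, assuming a 32-bit hash. It checks that 6^32 wraps to zero while 6^31 does not, and that two strings sharing their last 32 characters collide. The helper names `wrapping_pow`, `poly_hash` and `check_even_coefficient`, and the filler character 'x', are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// base^exp with 32-bit unsigned wraparound (that is, modulo 2^32).
std::uint32_t wrapping_pow(std::uint32_t base, unsigned exp) {
    std::uint32_t result = 1;
    for (unsigned i = 0; i < exp; ++i) result *= base;
    return result;
}

std::uint32_t poly_hash(const std::string& s, std::uint32_t coefficient) {
    std::uint32_t hash = 0;
    for (unsigned char c : s) hash = hash * coefficient + c;
    return hash;
}

void check_even_coefficient() {
    assert(wrapping_pow(32, 7) == 0u);   // 2^35
    assert(wrapping_pow(6, 32) == 0u);   // 3^32 * 2^32
    assert(wrapping_pow(6, 31) != 0u);   // still has a surviving bit
    const std::string tail(32, 'x');     // 32 unchanged trailing characters
    assert(poly_hash("a" + tail, 6) == poly_hash("b" + tail, 6));
}
```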

C. Take an odd number (greater than 1)

Suppose we take 9 = 2^3 + 1. Then 9^2 = 81 = 80 + 1, 9^3 = 729 = 728 + 1, ..., 9^n = (9^n - 1) + 1. Every power of 9 is odd, so 9^n - 1 is even. By the reasoning above, once the string is long enough the even part of a coefficient can lose its high bits to overflow, but the trailing 1 is never discarded. An odd coefficient therefore never reduces to zero modulo 2^32, and every character always takes part in the calculation. An odd number greater than 1 is feasible.
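The odd case can be checked in the same way. This sketch assumes a 32-bit hash; `powers_stay_odd`, `poly_hash` and `check_odd_coefficient` are hypothetical names, and the exponent limit of 1000 is an arbitrary test bound.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// True if coefficient^k (modulo 2^32) is odd, hence nonzero,
// for every k from 0 to max_exponent.
bool powers_stay_odd(std::uint32_t coefficient, unsigned max_exponent) {
    std::uint32_t power = 1;
    for (unsigned k = 0; k <= max_exponent; ++k) {
        if ((power & 1u) == 0u) return false;
        power *= coefficient;
    }
    return true;
}

std::uint32_t poly_hash(const std::string& s, std::uint32_t coefficient) {
    std::uint32_t hash = 0;
    for (unsigned char c : s) hash = hash * coefficient + c;
    return hash;
}

void check_odd_coefficient() {
    assert(powers_stay_odd(9, 1000));
    assert(!powers_stay_odd(6, 1000));
    // The leading character still matters, even in a long string.
    assert(poly_hash("ahijklmn", 9) != poly_hash("bhijklmn", 9));
    const std::string tail(40, 'x');
    assert(poly_hash("a" + tail, 9) != poly_hash("b" + tail, 9));
}
```

This only shows that no character is silently dropped. It does not show that collisions are impossible, which no fixed-width hash can guarantee.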

Conclusion

The three steps above show that the coefficient should be an odd number greater than 1, which reduces the probability of collisions well. We can now implement the formula derived above in code.

The initial code for BKDRHash is as follows:

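    #include <cstdio>
    #include <cstring>
    #include <iostream>

    // Integer power; unsigned arithmetic wraps around on overflow.
    // (Floating-point pow would lose precision for large results.)
    unsigned int upow(unsigned int base, unsigned int exp) {
        unsigned int result = 1;
        while (exp--) result *= base;
        return result;
    }

    unsigned int str_hash_1(const char* s) {
        const unsigned char* p = (const unsigned char*) s;
        unsigned int hash = 0;
        unsigned int seed = 3;  // 3, 5, 7, 9, ..., any odd number
        unsigned int nIndex = 0;
        unsigned int nLen = (unsigned int) strlen(s);
        while (*p) {
            // coefficient^(n-1-i) * p[i]
            hash = hash + upow(seed, nLen - nIndex - 1) * (*p);
            ++p;
            nIndex++;
        }
        return hash;
    }

    int main(int argc, char* argv[]) {
        std::cout << str_hash_1("hijklmn") << std::endl;
        std::cout << str_hash_1("bhijklmn") << std::endl;
        getchar();
        return 0;
    }
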
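In fact, the code can be simplified: the same sum can be built up step by step, multiplying the running hash by the seed and adding the next character each time (Horner's rule), so no explicit power is needed. When using BKDRHash you will also find that many implementations choose a seed of the form 2^n - 1, such as 31, because the CPU can compute the product with a relatively fast shift and subtraction. The code is as follows: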

    #include <cstdio>
    #include <iostream>

    unsigned int bkdr_hash(const char* key) {
        // Read characters as unsigned so bytes above 127 are not negative.
        const unsigned char* str = (const unsigned char*) key;
        unsigned int seed = 31;  // 31 131 1313 13131 131313 etc.; also 37
        unsigned int hash = 0;
        while (*str) {
            hash = hash * seed + (*str++);
        }
        return hash;
    }

    int main(int argc, char* argv[]) {
        std::cout << bkdr_hash("hijklmn") << std::endl;
        std::cout << bkdr_hash("bhijklmn") << std::endl;
        getchar();
        return 0;
    }
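The shift-and-subtract remark can be illustrated with a sketch for the seed 31 = 2^5 - 1, where hash * 31 equals (hash << 5) - hash under 32-bit wraparound. The names `bkdr_hash_mul`, `bkdr_hash_shift` and `check_shift_matches_multiply` are hypothetical. Whether the shift form is actually faster depends on the compiler and CPU; an optimizing compiler may well generate similar code for both.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// BKDRHash with seed 31, written with a multiplication.
std::uint32_t bkdr_hash_mul(const std::string& s) {
    std::uint32_t hash = 0;
    for (unsigned char c : s) hash = hash * 31u + c;
    return hash;
}

// The same hash, with hash * 31 rewritten as (hash << 5) - hash.
std::uint32_t bkdr_hash_shift(const std::string& s) {
    std::uint32_t hash = 0;
    for (unsigned char c : s) hash = (hash << 5) - hash + c;
    return hash;
}

void check_shift_matches_multiply() {
    const char* samples[] = {"", "ad", "hijklmn", "bhijklmn"};
    for (const char* s : samples) {
        assert(bkdr_hash_mul(s) == bkdr_hash_shift(s));
    }
    assert(bkdr_hash_mul("ad") == 3107u);  // 97 * 31 + 100
}
```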

Extension

Note: even though full BKDRHash values rarely collide, they are very large and cannot be used directly as indexes into the hash array. The usual approach is to take the hash value modulo the size of the hash array and use the remainder as the index. This introduces another kind of conflict: two different BKDRHash values can give the same index after the modulo operation, although with a well-sized table this is uncommon. Collisions cannot be eliminated completely in a hash table; we can only reduce their likelihood. To round out the picture, here are a few points to note for improving the efficiency of a hash table:

1. The choice of hash function

The purpose of a hash function is to produce a hash value from an input such as a string. A good hash function gives different inputs different hash values as often as possible, and one that never produces the same value for two different inputs is a perfect hash function.

2. The method of handling conflicts

There are many ways to handle conflicts, such as the zipper method (separate chaining) and linear probing. I prefer the zipper method.
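As a rough sketch of the zipper method, the class below keeps one chain (a vector of keys) per bucket and picks the bucket as the BKDRHash value modulo the bucket count. The class name `ChainedStringSet` and its members are hypothetical, and the seed 31 and the bucket count 7 used in the note are example values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// BKDRHash with seed 31, wrapping modulo 2^32.
std::uint32_t bkdr_hash(const std::string& key) {
    std::uint32_t hash = 0;
    for (unsigned char c : key) hash = hash * 31u + c;
    return hash;
}

// Minimal string set using the zipper method (separate chaining).
class ChainedStringSet {
public:
    explicit ChainedStringSet(std::size_t bucket_count) : buckets_(bucket_count) {
        assert(bucket_count > 0);
    }

    // Index address: hash value modulo the table size.
    std::size_t index_of(const std::string& key) const {
        return bkdr_hash(key) % buckets_.size();
    }

    // Returns false if the key was already present.
    bool insert(const std::string& key) {
        if (contains(key)) return false;
        buckets_[index_of(key)].push_back(key);
        return true;
    }

    bool contains(const std::string& key) const {
        for (const std::string& stored : buckets_[index_of(key)]) {
            if (stored == key) return true;
        }
        return false;
    }

private:
    std::vector<std::vector<std::string>> buckets_;
};
```

With 7 buckets, "ad" (hash 3107) and "ak" (hash 3114) have different hash values but the same index 6. Both end up in the same chain and can still be found individually.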

3. The size of the hash table

The size of a hash table is fixed at any given time, but it can be adjusted dynamically: create a new array, loop over the old one to recompute each key's position and place it in the new array, then delete the old one. It is best, however, to choose a sufficient initial size based on the expected data volume so that resizing does not happen often, because resizing costs a lot of time and space. In addition, the table size is best set to a prime number. Why a prime? Because a prime has only two divisors, 1 and itself, so when a key's BKDRHash value is taken modulo the table size, the range of remainders is not narrowed by a common divisor. If the range of remainders shrinks, the chance of collisions increases.
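The effect of a common divisor can be seen in a small sketch. It uses a deliberately bad, hypothetical input (hash values that are all multiples of 4) and counts how many distinct indexes are reached for different table sizes. Real BKDRHash values need not have this structure; the function names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Number of distinct index addresses reached by the given hash values.
std::size_t used_bucket_count(const std::vector<std::uint32_t>& hashes,
                              std::uint32_t table_size) {
    assert(table_size > 0);
    std::set<std::uint32_t> used;
    for (std::uint32_t h : hashes) used.insert(h % table_size);
    return used.size();
}

// Hypothetical worst-case input: 0, 4, 8, ..., all multiples of 4.
std::vector<std::uint32_t> multiples_of_four(std::uint32_t count) {
    std::vector<std::uint32_t> values;
    for (std::uint32_t i = 0; i < count; ++i) values.push_back(4u * i);
    return values;
}
```

For 100 such values, a table of size 8 uses only 2 of its 8 indexes and a table of size 12 only 3 of 12, while a table of prime size 7 uses all 7.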

4. The load factor, that is, how full the hash table is

Generally speaking, the smaller the load factor, the fewer the collisions and the faster the hash table, but at the cost of wasted space. With a load factor of 0.1, only 10% of the table's space is actually used and the remaining 90% is wasted. This is the trade-off between time and space. As a balance, 0.75 is now commonly used as the load factor: once the table reaches a load factor of 0.75, its size is increased dynamically.
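The load-factor rule and the resize step from point 3 could be sketched as follows. The names `Buckets`, `load_factor`, `should_grow` and `rehash` are hypothetical, the table stores raw hash values to keep the example short, and growing exactly when the load factor reaches 0.75 (rather than when it exceeds it) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One chain of stored hash values per bucket.
using Buckets = std::vector<std::vector<std::uint32_t>>;

double load_factor(std::size_t element_count, std::size_t bucket_count) {
    assert(bucket_count > 0);
    return static_cast<double>(element_count) / static_cast<double>(bucket_count);
}

// Assumption: grow as soon as the load factor reaches max_load.
bool should_grow(std::size_t element_count, std::size_t bucket_count,
                 double max_load = 0.75) {
    return load_factor(element_count, bucket_count) >= max_load;
}

// Build a new table and recompute every element's index from its hash.
Buckets rehash(const Buckets& old_buckets, std::size_t new_bucket_count) {
    assert(new_bucket_count > 0);
    Buckets fresh(new_bucket_count);
    for (const std::vector<std::uint32_t>& chain : old_buckets) {
        for (std::uint32_t h : chain) {
            fresh[h % new_bucket_count].push_back(h);
        }
    }
    return fresh;
}
```

For example, the values 1, 5 and 9 all share index 1 in a table of size 4. With 3 elements in 4 buckets the load factor is 0.75, and after rehashing into 11 buckets each value has its own index.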
