Popular Science: Why the String hashCode method uses the number 31 as its multiplier


https://segmentfault.com/a/1190000010799123?utm_source=tuicool&utm_medium=referral

1. Background

One day, while writing code, I happened to open the source of String's hashCode method. A quick read showed that the implementation is not complicated, but one odd number in it caught my eye: 31, the protagonist of this article. The number is not declared as a named constant, so its purpose cannot be inferred from the code itself. Curious, I searched the Internet for an explanation, and after reading what I found I thought: so that is how it works. So what is the reason? In the following sections, let us work through the puzzle of the number 31 together.

2. Reasons for choosing the number 31

Before going into why String's hashCode method uses 31 as its multiplier, let's look at how the method is implemented:

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

The code above is the implementation of String's hashCode method, and it is simple. The core of the method is just the three lines of the for loop. From that loop we can derive a formula, which is also given in the method's documentation comment:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Here the array s in the formula is the array val in the source code, that is, the char array that String maintains internally. A short derivation of the formula:

Suppose n = 3
i=0 -> h = 31 * 0 + val[0]
i=1 -> h = 31 * (31 * 0 + val[0]) + val[1]
i=2 -> h = 31 * (31 * (31 * 0 + val[0]) + val[1]) + val[2]
       h = 31*31*31*0 + 31*31*val[0] + 31*val[1] + val[2]
       h = 31^(n-1)*val[0] + 31^(n-2)*val[1] + val[2]
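
The equivalence of the loop and the formula can also be checked mechanically. The following is a minimal sketch, not part of the JDK source: the class name HashFormulaCheck and both method names are made up for illustration. It computes the hash once with the loop and once term by term from the formula, and prints both next to String.hashCode(). Because int arithmetic wraps modulo 2^32 in both versions, they should agree even when intermediate values overflow.

```java
class HashFormulaCheck {
    // Loop form: h = 31 * h + s[i], as in the for loop of hashCode.
    static int loopHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Polynomial form: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1].
    // Each power is built by repeated int multiplication, so it wraps
    // modulo 2^32 exactly like the loop form does.
    static int polynomialHash(String s) {
        int n = s.length();
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int power = 1;
            for (int k = 0; k < n - 1 - i; k++) {
                power *= 31;
            }
            sum += s.charAt(i) * power;
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] samples = {"", "a", "abc", "hashCode", "a fairly long sample string"};
        for (String s : samples) {
            System.out.println("\"" + s + "\": loop=" + loopHash(s)
                    + " polynomial=" + polynomialHash(s)
                    + " String.hashCode=" + s.hashCode());
        }
    }
}
```

For every sample the three printed values should be identical.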

The formula and its derivation are not the focus of this article; it is enough to understand them. The focus is the reason for choosing 31. According to material found online, there are generally two reasons:

First, 31 is a prime of moderate size and one of the preferred primes for a hashCode multiplier. Similar primes such as 37, 41, and 43 are also good choices. So why 31? See the second reason.

Second, multiplication by 31 can be optimized by the JVM: 31 * i = (i << 5) - i.
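
The identity itself is easy to check. The sketch below (the class and method names are hypothetical) compares 31 * i with (i << 5) - i for a few values, including the extremes of int. It only verifies the arithmetic; whether a particular JIT compiler actually performs this rewrite is not something the sketch can show.

```java
class ShiftIdentity {
    static int byMultiply(int i) {
        return 31 * i;
    }

    // i << 5 is 32 * i (modulo 2^32); subtracting i leaves 31 * i.
    static int byShift(int i) {
        return (i << 5) - i;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 97, 123456789, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int i : samples) {
            System.out.println(i + ": 31*i=" + byMultiply(i) + " (i<<5)-i=" + byShift(i));
        }
    }
}
```

The two columns should match on every line, because both expressions are evaluated modulo 2^32.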

Of these two reasons, the second is straightforward and needs no further explanation, so let me explain the first. Hashing algorithms are commonly designed around a specially chosen prime. I believe the point of choosing a prime is to reduce the collision rate of the hash, but the underlying reason is a question for mathematicians, and my own mathematics is not up to explaining it. As stated above, 31 is a prime of moderate size and a good multiplier. Why are other primes such as 2 and 101 (or larger primes) not good multipliers? The analysis follows.

Start with the prime 2. Assume n = 6, substitute the multiplier 2 and n into the formula above, and look only at the highest-order term: its coefficient is 2^5 = 32, which is very small. So when the string is short, a multiplier of 2 produces hash values that are not very large. In other words, the hash values fall in a small range and are poorly spread, which may ultimately raise the conflict rate.

So the prime 2 confines hash values to a small range. What happens with a larger prime such as 101? From the analysis above the result is easy to guess: the hash values are no longer confined to a small range, because 101^5 = 10,510,100,501. But this value is too large. It exceeds the int maximum of 2,147,483,647, so if the hash value is stored in an int, the result overflows and numeric information is lost. Although losing that information does not necessarily raise the conflict rate, we tentatively take 101 (and larger primes) to be a poor choice. Finally, look at 31: 31^5 = 28,629,151, a moderate value compared with 32 and 10,510,100,501.
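
The three leading coefficients can be reproduced with a few lines of code. This is a minimal sketch under the same assumption as the text (n = 6, so the exponent is 5); the class and helper names are made up for illustration. It computes each power exactly in a long, reports whether the value fits in an int, and shows what remains after narrowing to int.

```java
class LeadingTerm {
    // Exact multiplier^exponent in long arithmetic (large enough for these inputs).
    static long power(long multiplier, int exponent) {
        long result = 1;
        for (int k = 0; k < exponent; k++) {
            result *= multiplier;
        }
        return result;
    }

    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int n = 6; // string length assumed in the text
        for (long m : new long[] {2, 31, 101}) {
            long exact = power(m, n - 1);
            System.out.println(m + "^" + (n - 1) + " = " + exact
                    + ", fits in int: " + fitsInInt(exact)
                    + ", narrowed to int: " + (int) exact);
        }
    }
}
```

Only the value for 101 fails the fitsInInt check; the narrowed value printed for it is what an int-based hash would actually keep.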

The rough calculation above suggests that 31 is a prime of moderate size and one of the preferred primes for a hashCode multiplier. Next I will verify this conclusion with detailed experiments. Before that, let's look at the discussion of this question on Stack Overflow: Why does Java's hashCode() in String use 31 as a multiplier? One of the top answers quotes a passage from Effective Java, which I quote here as well:

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

In plain terms:

The number 31 was chosen because it is an odd prime. If an even number had been chosen and the multiplication overflowed, numeric information would be lost, because multiplying by 2 is equivalent to a shift. The advantage of choosing a prime is not especially clear, but it is traditional. The number 31 also has a useful property: the multiplication can be replaced by a shift and a subtraction for better performance, 31 * i == (i << 5) - i, and modern Java virtual machines perform this optimization automatically.
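
The information loss with an even multiplier can be made concrete. The sketch below uses a generalized hash with a configurable multiplier (the class name and the two input strings are hypothetical, chosen only for illustration). With multiplier 32, the first character of an 8-character string is scaled by 32^7 = 2^35, which is 0 modulo 2^32, so it no longer affects the int result. With 31 the scale factor 31^7 is odd, so the difference survives.

```java
class EvenMultiplierLoss {
    // Same loop as String.hashCode, but with a configurable multiplier.
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // Two made-up 8-character inputs that differ only in the first character.
        String a = "a1234567";
        String b = "b1234567";
        System.out.println("multiplier 32, same hash: " + (hash(a, 32) == hash(b, 32)));
        System.out.println("multiplier 31, same hash: " + (hash(a, 31) == hash(b, 31)));
    }
}
```

More generally, under multiplier 32 any two strings that share their last 7 characters hash to the same int, which is one way to read the "information would be lost" remark in the quotation.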

The second-ranked answer reads:

As Goodrich and Tamassia point out, if you take over 50,000 English words (formed as the union of the word lists provided in two variants of Unix), using the constants 31, 33, 37, 39, and 41 will produce less than 7 collisions in each case. Knowing this, it should come as no surprise that many Java implementations choose one of these constants.

In plain terms:

As Goodrich and Tamassia point out, if you hash more than 50,000 English words (the union of the word lists from two variants of Unix) using the constants 31, 33, 37, 39, and 41 as the multiplier, each constant produces fewer than 7 hash collisions. It is therefore no surprise that many Java implementations choose one of these constants, 31 among them.
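
The kind of collision count described in that answer can be sketched on a toy data set. Everything here is an assumption made for illustration: the class name, the helper names, and above all the data, which is simply every lowercase string of length 3 rather than a real word list. The number of conflicts is taken as the number of inputs minus the number of distinct hash values.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CollisionCount {
    // Same loop as String.hashCode, but with a configurable multiplier.
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    // Conflicts = number of words minus number of distinct hash values.
    static int countConflicts(List<String> words, int multiplier) {
        Set<Integer> distinct = new HashSet<>();
        for (String word : words) {
            distinct.add(hash(word, multiplier));
        }
        return words.size() - distinct.size();
    }

    // Stand-in data set (not a dictionary): all 26^3 lowercase strings of length 3.
    static List<String> threeLetterStrings() {
        List<String> words = new ArrayList<>();
        for (char a = 'a'; a <= 'z'; a++) {
            for (char b = 'a'; b <= 'z'; b++) {
                for (char c = 'a'; c <= 'z'; c++) {
                    words.add("" + a + b + c);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> words = threeLetterStrings();
        for (int multiplier : new int[] {2, 31, 32, 101}) {
            System.out.println("multiplier=" + multiplier
                    + " conflicts=" + countConflicts(words, multiplier)
                    + " of " + words.size());
        }
    }
}
```

On this toy set, multiplier 31 yields no conflicts at all, because 26 letters fit inside one step of 31, while multiplier 2 squeezes all inputs into a few hundred values. The numbers say nothing about real dictionaries; they only illustrate the counting method.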

These two answers explain well why the number 31 appears in the Java source code. Next I will verify the second answer; please read on.

3. Experiments and data visualization

In this section I hash more than 230,000 English words using different numbers as the multiplier and compute the conflict rate of each. I also visualize the distribution of hash values for each multiplier so that the distribution can be seen at a glance. The data set is the English dictionary file on Unix/Linux, at /usr/share/dict/words.

3.1 Hash value conflict rate calculation

Computing the conflict rate of a hash algorithm is not difficult. For example, compute the hash codes of all the words and put them into a Set, which discards duplicates. The number of words minus set.size() is the number of conflicts, and from that the conflict rate follows. With the stream API provided by JDK 8 this is even more convenient. The code snippet is as follows:

// Requires: java.util.Comparator, java.util.List
public static Integer hashCode(String str, Integer multiplier) {
    int hash = 0;
    for (int i = 0; i < str.length(); i++) {
        hash = multiplier * hash + str.charAt(i);
    }

    return hash;
}

/**
 * Computes the hash code conflict rate, along with the maximum and
 * minimum hash values, and prints them.
 *
 * @param multiplier the multiplier that produced the hash values
 * @param hashs      the hash values of all words
 */
public static void calculateConflictRate(Integer multiplier, List<Integer> hashs) {
    Comparator<Integer> cp = (x, y) -> x > y ? 1 : (x < y ? -1 : 0);
    int maxHash = hashs.stream().max(cp).get();
    int minHash = hashs.stream().min(cp).get();

    // Compute the number of conflicts and the conflict rate
    int uniqueHashNum = (int) hashs.stream().distinct().count();
    int conflictNum = hashs.size() - uniqueHashNum;
    double conflictRate = (conflictNum * 1.0) / hashs.size();

    System.out.println(String.format("multiplier=%4d, minHash=%11d, maxHash=%10d, conflictNum=%6d, conflictRate=%.4f%%",
            multiplier, minHash, maxHash, conflictNum, conflictRate * 100));
}

The results are as follows:

As the results show, the conflict rate is high when a small prime is used as the multiplier. The prime 2 is the worst case, with a conflict rate of 55.14%. Looking at the distribution of hash values with 2 as the multiplier, the values are not widely spread: they fall only in the positive half of the hash space, [0, 2^31 - 1], and none fall in the negative half, [-2^31, -1]. This supports the earlier assertion that with 2 as the multiplier, short strings produce poorly distributed hash values. Next, the three primes 31, 37, and 41 all perform well, with fewer than 7 conflicts each. The primes 101 and 199 also perform well, with very low conflict rates, which shows that overflow of the hash value does not necessarily raise the conflict rate. But because these two overflow, we do not consider them the best multipliers for the hash algorithm. Finally, the two even numbers 32 and 36 do not perform well. With 32 in particular the conflict rate exceeds 50%. 36 does better, but its conflict rate is still high compared with 31 and 37. Of course, not every even multiplier gives a high conflict rate; interested readers can verify this for themselves.

3.2 Hash value distribution visualization

The previous section analyzed the conflict rate for different multipliers; this section analyzes the distribution of hash values for different multipliers. Before the detailed analysis, a word about how the visualization was produced. I first intended to plot all hash values on a one-dimensional scatter chart, but after searching the Internet for a while I found no suitable plotting tool. On reflection, a one-dimensional scatter chart may not suit this data anyway, because there are more than 230,000 hash values. That means more than 230,000 points on the chart; unless they are processed somehow, the points would crowd together, possibly into one large black block, and the visualization would lose its meaning.

So I chose a chart type that shows the data better: the scatter chart with smooth lines in Excel (hereinafter simply the scatter chart). Plotting 230,000 points there is not practical either, so for the actual drawing I divide the hash space into 64 partitions and count the number of hash values that fall in each. The partition number goes on the x axis and the number of hash values on the y axis, which gives the two-dimensional scatter chart I want. Take partition 0 as an example. Its interval is [-2147483648, -2080374784). We count the hash values that fall in this range to obtain a <partition number, hash count> pair, and plotting these pairs in the coordinate system completes the chart. The partition code is as follows:

// Requires: java.util.ArrayList, java.util.LinkedHashMap, java.util.List, java.util.Map
/**
 * Divides the entire hash space into 64 parts and counts the number
 * of hash values in each part.
 *
 * @param hashs the hash values of all words
 */
public static Map<Integer, Integer> partition(List<Integer> hashs) {
    // step = 2^32 / 64 = 2^26
    final int step = 67108864;
    List<Integer> nums = new ArrayList<>();
    Map<Integer, Integer> statistics = new LinkedHashMap<>();
    int start = 0;
    // i is a long so that the loop variable itself cannot overflow
    for (long i = Integer.MIN_VALUE; i <= Integer.MAX_VALUE; i += step) {
        final long min = i;
        final long max = min + step;
        int num = (int) hashs.parallelStream()
                .filter(x -> x >= min && x < max).count();

        statistics.put(start++, num);
        nums.add(num);
    }

    // To guard against a calculation error, verify the total here
    int hashNum = nums.stream().reduce((x, y) -> x + y).get();
    assert hashNum == hashs.size();

    return statistics;
}
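
As an aside, the partition number of a single hash value can also be computed directly, without scanning the list once per partition. The following is a minimal sketch of that alternative, not the code used for the experiment; the class and method names are made up for illustration. It shifts the int range to [0, 2^32) by subtracting Integer.MIN_VALUE in long arithmetic and then divides by 2^26 with a right shift.

```java
class PartitionIndex {
    static final int PARTITIONS = 64;
    static final int SHIFT = 26; // 2^32 / 64 = 2^26 values per partition

    // Maps a hash value to its partition number in 0..63.
    static int partitionOf(int hash) {
        long offset = (long) hash - Integer.MIN_VALUE; // now in [0, 2^32)
        return (int) (offset >> SHIFT);
    }

    // Counts how many of the given hash values fall into each partition.
    static int[] histogram(int[] hashes) {
        int[] counts = new int[PARTITIONS];
        for (int h : hashes) {
            counts[partitionOf(h)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -2080374784, -1, 0, 67108864, Integer.MAX_VALUE};
        for (int h : samples) {
            System.out.println(h + " -> partition " + partitionOf(h));
        }
    }
}
```

This makes one pass over the data instead of 64, at the cost of being slightly less literal than the interval filter shown above.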

The hash values in this article are of type int, whose range is [-2147483648, 2147483647], an interval of size 2^32. We can therefore divide it into 64 partitions of size 2^26 each. The partition table is as follows:

Partition  Lower bound  Upper bound  Partition  Lower bound  Upper bound
0          -2147483648  -2080374784  32         0            67108864
1          -2080374784  -2013265920  33         67108864     134217728
2          -2013265920  -1946157056  34         134217728    201326592
3          -1946157056  -1879048192  35         201326592    268435456
4          -1879048192  -1811939328  36         268435456    335544320
5          -1811939328  -1744830464  37         335544320    402653184
6          -1744830464  -1677721600  38         402653184    469762048
7          -1677721600  -1610612736  39         469762048    536870912
8          -1610612736  -1543503872  40         536870912    603979776
9          -1543503872  -1476395008  41         603979776    671088640
10         -1476395008  -1409286144  42         671088640    738197504
11         -1409286144  -1342177280  43         738197504    805306368
12         -1342177280  -1275068416  44         805306368    872415232
13         -1275068416  -1207959552  45         872415232    939524096
