Popular Science: Why String's hashCode method uses 31 as its multiplier

Last Update:2018-07-27 Source: Internet

Author: User


Https://segmentfault.com/a/1190000010799123?utm_source=tuicool&utm_medium=referral

1. Background

One day, while writing code, I happened to open the source of String's hashCode method. A quick look at the implementation showed that it is not very complicated, but I noticed a strange number in it: 31, the protagonist of this article. The number is not declared as a named constant, so its purpose cannot be inferred from a name. Curious, I searched the Internet for an explanation, and after reading what I found I thought to myself: so that is how it is. So what exactly is the reason? In the following sections, let us uncover the mystery of the number 31 together.

2. Why the number 31 was chosen

Before explaining in detail why String's hashCode method uses 31 as its multiplier, let's look at how the method is implemented:

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

The code above is the implementation of String's hashCode method, and it is simple. The core of the method is just three lines of computation, namely the for loop. From that loop we can derive a closed formula, which is already given in the method's Javadoc comment:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Here s is the val array in the source code, that is, the char array maintained inside String, and n is its length. A short derivation of the formula:

Suppose n=3
i=0 -> h = 31 * 0 + val[0]
i=1 -> h = 31 * (31 * 0 + val[0]) + val[1]
i=2 -> h = 31 * (31 * (31 * 0 + val[0]) + val[1]) + val[2]
       h = 31*31*31*0 + 31*31*val[0] + 31*val[1] + val[2]
       h = 31^(n-1)*val[0] + 31^(n-2)*val[1] + val[2]
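The derivation can also be checked mechanically. The following is a minimal sketch, not part of the JDK: the class name HornerCheck and both method names are made up for illustration. It computes the hash once with the loop and once with the expanded formula, so the two can be compared with String.hashCode().

```java
class HornerCheck {
    // Iterative form used by the for loop: h = 31 * h + s[i]
    static int iterative(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Expanded form: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    static int expanded(String s) {
        int n = s.length();
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int power = 1;
            for (int k = 0; k < n - 1 - i; k++) {
                power *= 31; // int arithmetic wraps modulo 2^32, exactly as in the loop form
            }
            sum += s.charAt(i) * power;
        }
        return sum;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "a", "abc", "hello world"}) {
            System.out.println("\"" + s + "\": loop=" + iterative(s)
                    + ", formula=" + expanded(s)
                    + ", String.hashCode=" + s.hashCode());
        }
    }
}
```

Because int addition and multiplication both wrap modulo 2^32, the two forms agree even when intermediate values overflow.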

The formula and its derivation are not the focus of this article; it is enough to understand them. The focus is the reason for choosing 31. According to what I found on the Internet, there are two main reasons:

First, 31 is a prime of moderate size, which makes it one of the preferred primes for a hash code multiplier. Other primes of similar size, such as 37, 41, and 43, are also good choices. So why 31? See the second reason.

Second, multiplication by 31 can be optimized by the JVM: 31 * i = (i << 5) - i.
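The identity behind the second reason is easy to confirm. The sketch below only checks the arithmetic; it does not show whether a particular JVM actually performs this rewrite. The class and method names are hypothetical.

```java
class ShiftIdentity {
    static int byMultiply(int i) {
        return 31 * i;
    }

    // i << 5 is i * 32, so (i << 5) - i is i * 32 - i = i * 31
    static int byShift(int i) {
        return (i << 5) - i;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int i : samples) {
            System.out.println(i + ": " + byMultiply(i) + " == " + byShift(i)
                    + " ? " + (byMultiply(i) == byShift(i)));
        }
    }
}
```

The two expressions agree for every int, including values where the multiplication overflows, because both sides are computed modulo 2^32.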

Of these two reasons, the second is straightforward and needs no further discussion, while the first needs some explanation. In general, a hashing algorithm is designed around a specially chosen prime. I believe the point of choosing a prime is to reduce the collision rate of the algorithm, but the underlying reason is a question for mathematicians, and my own mathematics is not good enough to explain it. As stated above, 31 is a prime of moderate size and a good multiplier. Why are other primes such as 2 and 101 (or larger primes) not good multipliers? The analysis follows.

Consider the prime 2 first. Assume n = 6 and substitute the multiplier 2 and n into the formula above, looking only at the highest-order term: its coefficient is 2^5 = 32, which is very small. So when the string is not very long, the hash values computed with 2 as the multiplier are not very large. In other words, the hash values fall in a narrow range and are poorly distributed, which may eventually raise the conflict rate.

So 2 as a multiplier squeezes the hash values into a narrow range. What happens with a larger prime such as 101? Following the analysis above, you can probably guess the result: the hash values will not be confined to a narrow range, because 101^5 = 10,510,100,501. Note, however, that this number is too large. If the hash value is stored in an int, the result overflows and numeric information is lost. Although that loss does not necessarily raise the conflict rate, for now we take it that 101 (or a larger prime) is not a good choice. Finally, look at the prime 31: 31^5 = 28,629,151, a moderate value compared with 32 and 10,510,100,501. Isn't that nice?
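The three leading coefficients compared above can be reproduced with a few lines of Java. This is only an illustrative sketch with hypothetical names (LeadingTermSize, power, fitsInInt); it computes each power exactly in a long and then shows what remains after narrowing to int.

```java
class LeadingTermSize {
    // Exact power in long arithmetic (large enough for the values used here)
    static long power(long base, int exponent) {
        long result = 1;
        for (int k = 0; k < exponent; k++) {
            result *= base;
        }
        return result;
    }

    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        for (int multiplier : new int[] {2, 31, 101}) {
            long exact = power(multiplier, 5); // coefficient of s[0] when n = 6
            System.out.println(multiplier + "^5 = " + exact
                    + ", fits in int: " + fitsInInt(exact)
                    + ", value after narrowing to int: " + (int) exact);
        }
    }
}
```

Only 101^5 fails to fit in an int, which is the overflow the paragraph above refers to.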

The rough calculation above suggests that 31 is a prime of moderate size and one of the preferred hashCode multipliers. Next I will verify this conclusion with detailed experiments, but before that, let's look at the discussion of this question on Stack Overflow: Why does Java's hashCode() in String use 31 as a multiplier? One of the top answers quotes a passage from Effective Java, which I also quote here:

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

In plain terms:

The number 31 was chosen because it is an odd prime. Had an even number been chosen, information would be lost whenever the multiplication overflowed, because multiplying by 2 is equivalent to a shift. The advantage of choosing a prime is not particularly clear, but it is traditional. In addition, 31 has a nice property: the multiplication can be replaced by a shift and a subtraction for better performance, 31 * i == (i << 5) - i, and modern Java virtual machines perform this optimization automatically.
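The information loss caused by an even multiplier can be made concrete. The sketch below is my own illustration, not taken from Effective Java, and the class name and sample strings are arbitrary. With multiplier 32, the first character of an 8-character string is multiplied by 32^7 = 2^35, which is 0 modulo 2^32, so that character is shifted out of the int entirely.

```java
class EvenMultiplierLoss {
    // Same loop as String.hashCode, but with a configurable multiplier
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // Two strings that differ only in their first character
        String a = "Xbcdefgh";
        String b = "Ybcdefgh";
        for (int multiplier : new int[] {32, 31}) {
            boolean same = hash(a, multiplier) == hash(b, multiplier);
            System.out.println("multiplier " + multiplier + ": "
                    + (same ? "same hash" : "different hashes"));
        }
    }
}
```

With 32 the two strings collide because the first character no longer influences the result. With 31 they do not, because 31^7 is odd and therefore never becomes 0 modulo 2^32.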

The second-ranked answer reads as follows:

As Goodrich and Tamassia point out, If you take over 50,000 English words (formed as the union of the word lists provided in two variants of Unix), using the constants 31, 33, 37, 39, and 41 will produce less than 7 collisions in each case. Knowing this, it should come as no surprise that many Java implementations choose one of these constants.

In plain terms:

As Goodrich and Tamassia point out, if you hash more than 50,000 English words (the union of the word lists shipped with two variants of Unix) using the constants 31, 33, 37, 39, and 41 as the multiplier, each constant produces fewer than 7 hash collisions. It is therefore not surprising that Java implementations choose one of these constants, such as 31.

These two answers explain well why the number 31 appears in the Java source code. Next, I will verify the second answer; please read on.

3. Experiments and data visualization

In this section, I hash more than 230,000 English words using different numbers as the multiplier and compute the conflict rate for each. I also visualize the distribution of the hash values for each multiplier so that the distribution can be seen at a glance. The data for this experiment is the English dictionary file on Unix/Linux platforms, located at /usr/share/dict/words.

3.1 Hash value conflict rate calculation

Computing the conflict rate of a hash algorithm is not difficult. For example, you can compute the hash codes of all the words in one pass and put them into a Set, which discards duplicates. The number of words minus set.size() is then the number of conflicts, and from that the conflict rate follows. With the Stream API introduced in JDK 8 the computation is even more convenient. The code is as follows:

public static Integer hashCode(String str, Integer multiplier) {
    int hash = 0;
    for (int i = 0; i < str.length(); i++) {
        hash = multiplier * hash + str.charAt(i);
    }
    return hash;
}

/**
 * Compute the hash code conflict rate, along with the maximum and minimum hash codes, and print them
 * @param multiplier
 * @param hashs
 */
public static void calculateConflictRate(Integer multiplier, List<Integer> hashs) {
    Comparator<Integer> cp = (x, y) -> x > y ? 1 : (x < y ? -1 : 0);
    int maxHash = hashs.stream().max(cp).get();
    int minHash = hashs.stream().min(cp).get();

    // Compute the number of conflicts and the conflict rate
    int uniqueHashNum = (int) hashs.stream().distinct().count();
    int conflictNum = hashs.size() - uniqueHashNum;
    double conflictRate = (conflictNum * 1.0) / hashs.size();

    System.out.println(String.format("multiplier=%4d, minHash=%11d, maxHash=%10d, conflictNum=%6d, conflictRate=%.4f%%",
            multiplier, minHash, maxHash, conflictNum, conflictRate * 100));
}
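For comparison, the Set-based approach mentioned earlier can be sketched as follows. This is a separate, minimal illustration rather than the code used for the experiment: the class name, the method names, and the four-word toy list are all assumptions. If a file path is passed as the first argument, it is read as one word per line instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ConflictCountSketch {
    static int hash(String str, int multiplier) {
        int hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = multiplier * hash + str.charAt(i);
        }
        return hash;
    }

    // Number of words minus number of distinct hash codes.
    // Assumes the word list itself contains no duplicate words.
    static int countConflicts(List<String> words, int multiplier) {
        Set<Integer> unique = new HashSet<>();
        for (String word : words) {
            unique.add(hash(word, multiplier));
        }
        return words.size() - unique.size();
    }

    public static void main(String[] args) throws IOException {
        // Toy list by default; pass a path such as /usr/share/dict/words to use a real dictionary
        List<String> words = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList("ab", "ac", "ba", "bb");
        for (int multiplier : new int[] {2, 31}) {
            int conflicts = countConflicts(words, multiplier);
            System.out.printf("multiplier=%d, conflicts=%d, rate=%.4f%%%n",
                    multiplier, conflicts, 100.0 * conflicts / words.size());
        }
    }
}
```

On the toy list, "ac" and "ba" collide under multiplier 2 (both hash to 293) but not under 31. A list this small says nothing about real conflict rates; it only shows the counting technique.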

The results are as follows:

As the results show, the conflict rate is high when a small prime is used as the multiplier. The prime 2 is the worst case, with a conflict rate of 55.14%. Looking at the distribution of the hash values for the multiplier 2, we also see that the range is not wide: the values fall only in the non-negative half of the hash space, [0, 2^31 - 1], and none fall in the negative half, [-2^31, -1]. This confirms the earlier claim that with 2 as the multiplier the hashes of short strings are poorly distributed. Next, the three primes 31, 37, and 41 all perform well, each with fewer than 7 conflicts. The primes 101 and 199 also perform very well, with very low conflict rates, which shows that overflow of the hash value does not necessarily raise the conflict rate. However, because the hash values for these two overflow, we do not consider them optimal multipliers. Finally, consider the two even numbers 32 and 36. Their results are not good, especially 32, whose conflict rate exceeds 50%. 36 does better, but its conflict rate is still higher than that of 31 and 37. Of course, not every even multiplier leads to a high conflict rate; interested readers can verify this for themselves.

3.2 Hash value distribution visualization

The previous section analyzed the conflict rate for different multipliers; this section analyzes the distribution of the hash values for different multipliers. Before the detailed analysis, let me describe how the hash values were visualized. I originally planned to plot all the hash values on a one-dimensional scatter chart, but after searching the Internet I did not find a suitable plotting tool. On reflection, a one-dimensional scatter chart may not suit this data anyway, because there are more than 230,000 hash values. That many points would crowd together on the chart and might merge into one large black block, defeating the purpose of the visualization. So I chose a chart type that works better: Excel's two-dimensional scatter chart with smooth lines (hereinafter the scatter chart). Plotting 230,000 points on it is still too much data, so in practice I divide the hash space into 64 partitions and count the number of hash values that fall into each. The partition number is then the x axis and the number of hash values the y axis, which gives the two-dimensional scatter chart I want. Take partition 0 as an example of the process. Its interval is [-2147483648, -2080374784). We count the hash values that fall within this interval to obtain a <partition number, number of hash values> pair, and then plot that pair in the coordinate system. The partitioning code is as follows:

/**
 * Divides the entire hash space into 64 equal partitions and counts the number of hash values in each
 * @param hashs
 */
public static Map<Integer, Integer> partition(List<Integer> hashs) {
    // step = 2^32 / 64 = 2^26
    final int step = 67108864;
    List<Integer> nums = new ArrayList<>();
    Map<Integer, Integer> statistics = new LinkedHashMap<>();
    int start = 0;
    for (long i = Integer.MIN_VALUE; i <= Integer.MAX_VALUE; i += step) {
        final long min = i;
        final long max = min + step;
        int num = (int) hashs.parallelStream()
                .filter(x -> x >= min && x < max).count();

        statistics.put(start++, num);
        nums.add(num);
    }

    // To guard against a calculation error, verify the total here
    int hashNum = nums.stream().reduce((x, y) -> x + y).get();
    assert hashNum == hashs.size();

    return statistics;
}

The hash values in this article are of type int, whose range is [-2147483648, 2147483647], an interval of size 2^32. Dividing this interval into 64 partitions gives partitions of size 2^26 each. The detailed partition table is as follows:

Partition number	Lower bound	Upper bound	Partition number	Lower bound	Upper bound
0	-2147483648	-2080374784	32	0	67108864
1	-2080374784	-2013265920	33	67108864	134217728
2	-2013265920	-1946157056	34	134217728	201326592
3	-1946157056	-1879048192	35	201326592	268435456
4	-1879048192	-1811939328	36	268435456	335544320
5	-1811939328	-1744830464	37	335544320	402653184
6	-1744830464	-1677721600	38	402653184	469762048
7	-1677721600	-1610612736	39	469762048	536870912
8	-1610612736	-1543503872	40	536870912	603979776
9	-1543503872	-1476395008	41	603979776	671088640
10	-1476395008	-1409286144	42	671088640	738197504
11	-1409286144	-1342177280	43	738197504	805306368
12	-1342177280	-1275068416	44	805306368	872415232
13	-1275068416	-1207959552	45	872415232	939524096
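The bounds in the table follow directly from the step size 2^26, so they can be computed rather than looked up. The following sketch uses hypothetical names (PartitionBounds, lowerBound, upperBound, partitionOf) and works in long arithmetic so that the upper bound of the last partition, 2^31, does not overflow.

```java
class PartitionBounds {
    static final long STEP = 1L << 26; // 2^32 / 64 = 67108864

    // Inclusive lower bound of a partition numbered 0..63
    static long lowerBound(int partition) {
        return Integer.MIN_VALUE + partition * STEP;
    }

    // Exclusive upper bound of a partition numbered 0..63
    static long upperBound(int partition) {
        return lowerBound(partition) + STEP;
    }

    // Partition number that a given hash value falls into
    static int partitionOf(int hash) {
        return (int) (((long) hash - Integer.MIN_VALUE) / STEP);
    }

    public static void main(String[] args) {
        for (int p : new int[] {0, 13, 32, 45, 63}) {
            System.out.println(p + "\t" + lowerBound(p) + "\t" + upperBound(p));
        }
        System.out.println("\"abc\".hashCode() falls into partition "
                + partitionOf("abc".hashCode()));
    }
}
```

Computing the partition number directly from a hash value, as partitionOf does, is an alternative to filtering the whole list once per partition; it is offered only as a design note, not as the method used in the article.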

