HashCode Performance Optimization

Last Update:2014-04-04 Source: Internet

Author: User

Tags bitset crc32

This article discusses how different hashCode() implementations affect application performance.

The main purpose of the hashCode() method is to allow an object to be used as a HashMap key or stored in a HashSet. For this to work, the class must also implement equals(Object), and the two implementations must be consistent:

If a.equals(b), then a.hashCode() == b.hashCode().
If hashCode() is called twice on the same object, it must return the same value, provided the object has not been modified in between.
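
As a minimal sketch of the contract above, the hypothetical GridPoint class below (not part of the original article) derives both equals() and hashCode() from the same two fields, so equal instances always hash alike and can find each other in a HashSet.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical value class used only for illustration.
final class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof GridPoint)) return false;
        GridPoint other = (GridPoint) o;
        return x == other.x && y == other.y;
    }

    // Uses exactly the fields that equals() compares, so equal points
    // always produce equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}

final class ContractDemo {
    // True when a second, equal instance is found in a set holding the first.
    static boolean foundByEqualInstance() {
        Set<GridPoint> points = new HashSet<>();
        points.add(new GridPoint(1, 2));
        return points.contains(new GridPoint(1, 2));
    }
}
```

If hashCode() used a field that equals() ignores (or the reverse), the lookup in foundByEqualInstance() could fail even though the two points are equal.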

HashCode Performance

From a performance perspective, the main goal of hashCode() is to give different objects different hash codes wherever possible. All hash-based collections in the JDK store their entries in an array. The hash code determines the initial lookup position in that array; equals() is then called to compare the given key with the keys stored at that position. The more distinct the hash codes, the lower the probability of collisions. Conversely, if all keys share one hash code, the HashMap (or HashSet) degenerates into a list: each operation becomes O(n), and inserting n entries takes O(n^2) time.
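
To make the "initial lookup position" concrete, here is a simplified sketch of how a hash code can be turned into an array index. It mirrors the bit-spreading and masking steps found in java.util.HashMap in OpenJDK 8 and later (earlier versions mixed the bits differently); the class and method names are illustrative, not JDK API.

```java
final class BucketIndexSketch {
    // Folds the upper 16 bits into the lower 16 so that small tables
    // still see some influence from the high bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // tableLength is assumed to be a power of two, so the mask
    // (tableLength - 1) keeps the result within [0, tableLength).
    static int bucketIndex(Object key, int tableLength) {
        return spread(key.hashCode()) & (tableLength - 1);
    }
}
```

Keys with equal hash codes, such as the strings "Aa" and "BB", always land in the same bucket, and only equals() can then tell them apart.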

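For more details, read about collision resolution in hash maps. java.util.HashMap uses separate chaining (sometimes called the zipper method): all entries whose hash codes map to the same bucket are stored in a linked list. The alternative technique is open addressing, where a colliding entry is placed in another slot of the same array.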

Let's see what difference hash code quality makes. We compare a normal String with a String wrapper class that overrides hashCode() so that every instance returns the same value.

private static class SlowString
{
    public final String m_str;

    public SlowString( final String str ) {
        this.m_str = str;
    }

    @Override
    public int hashCode() {
        return 37;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        final SlowString that = ( SlowString ) o;
        return !(m_str != null ? !m_str.equals(that.m_str) : that.m_str != null);
    }
}

The following is the test method. We will reuse it later, so here is a brief description: it receives a list of objects and, for each element in the list, calls Map.put() followed by Map.containsKey().

import java.util.HashMap;
import java.util.List;
import java.util.Map;

private static void testMapSpeed( final List<?> lst, final String name )
{
    final Map<Object, Object> map = new HashMap<Object, Object>( lst.size() );
    int cnt = 0;
    final long start = System.currentTimeMillis();
    for ( final Object obj : lst )
    {
        // insert the element, then immediately look it up again
        map.put( obj, obj );
        if ( map.containsKey( obj ) )
            ++cnt;
    }
    final long time = System.currentTimeMillis() - start;
    System.out.println( "Time for " + name + " is " + time / 1000.0 + " sec, cnt = " + cnt );
}

Both the String and the SlowString objects are generated in the form "ABCD" + i. Processing 100,000 String objects takes 0.041 seconds; processing the same number of SlowString objects takes 82.5 seconds.
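
Wall-clock timings depend on the machine and the JDK version, so a complementary way to see the same effect is to count equals() calls. The sketch below is an illustration, not the article's benchmark: the Key class and the counter are hypothetical, and it runs the same put/containsKey loop with either a constant hash code or String's own.

```java
import java.util.HashMap;
import java.util.Map;

final class EqualsCallCounter {
    static long equalsCalls = 0;

    // Hypothetical key whose hash code is either constant or delegated to String.
    static final class Key {
        final String str;
        final boolean constantHash;

        Key(String str, boolean constantHash) {
            this.str = str;
            this.constantHash = constantHash;
        }

        @Override
        public int hashCode() {
            return constantHash ? 37 : str.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            ++equalsCalls;
            return o instanceof Key && str.equals(((Key) o).str);
        }
    }

    // Runs the put/containsKey loop for n keys and returns how many
    // times equals() was invoked along the way.
    static long countEqualsCalls(int n, boolean constantHash) {
        equalsCalls = 0;
        Map<Key, Key> map = new HashMap<>(n);
        for (int i = 0; i < n; i++) {
            Key key = new Key("ABCD" + i, constantHash);
            map.put(key, key);
            map.containsKey(key);
        }
        return equalsCalls;
    }
}
```

Calling countEqualsCalls(1000, true) and countEqualsCalls(1000, false) side by side shows how many more comparisons the constant hash code forces; the exact numbers vary between JDK versions.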

The results show that the hashCode() method of the String class is of clearly higher quality. Let's run another test. We create a list of strings in which the first half has the form "ABCdef*&" + i and the second half has the form "ABCdef*&" + i + "ghi" (to check that varying the middle of a string while keeping the tail constant does not degrade hash code quality). We create 1M, 5M, 10M, and 20M strings and count how many hash codes are shared by more than one string and how many strings share each of them. The test results are as follows:

Number of duplicate hashCodes for 1000000 strings = 0
Number of duplicate hashCodes for 5000000 strings = 196
Number of hashCode duplicates = 2 count = 196
Number of duplicate hashCodes for 10000000 strings = 1914
Number of hashCode duplicates = 2 count = 1914
Number of duplicate hashCodes for 20000000 strings = 17103
Number of hashCode duplicates = 2 count = 17103

Very few strings share a hash code, and in these runs every shared hash code was shared by exactly two strings. Your own data may behave differently, so run a test like this on your own set of keys.
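
The test program itself is not reproduced here, so the following is only a sketch of how such a count could be made. The class and method names are hypothetical, and the keys are assumed to be distinct strings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HashCodeDuplicates {
    // Returns, for each multiplicity greater than 1, how many distinct
    // hash codes are shared by exactly that many keys.
    static Map<Integer, Integer> duplicateHistogram(List<String> keys) {
        // Step 1: count the keys per hash code.
        Map<Integer, Integer> perHash = new HashMap<>();
        for (String key : keys) {
            perHash.merge(key.hashCode(), 1, Integer::sum);
        }
        // Step 2: group the shared hash codes by how many keys share them.
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (int count : perHash.values()) {
            if (count > 1) {
                histogram.merge(count, 1, Integer::sum);
            }
        }
        return histogram;
    }
}
```

For the list "Aa", "BB", "C" the result maps 2 to 1, because "Aa" and "BB" have the same hash code and "C" does not collide with anything.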

Autogenerated hashCode() for long Fields

It is worth looking at how most IDEs generate hashCode() for fields of type long. The following is a generated hashCode() method for a class with two long fields.

public int hashCode() {
    int result = (int) (val1 ^ (val1 >>> 32));
    result = 31 * result + (int) (val2 ^ (val2 >>> 32));
    return result;
}

For a similar class with two int fields, the generated method is as follows:

public int hashCode() {
    int result = val1;
    result = 31 * result + val2;
    return result;
}

As you can see, long fields are handled differently. java.util.Arrays.hashCode(long[] a) uses the same approach. In fact, the hash code distribution is much better if you split each long into its upper and lower 32 bits and treat the two halves as separate int values. The following is the improved hashCode() method for the class with two long fields. Note that this method is slower to compute than the original, but its hash codes are of higher quality, so hash-based collections perform better overall even though hashCode() itself is slower.

public int hashCode() {
    int result = (int) val1;
    result = 31 * result + (int) (val1 >>> 32);
    result = 31 * result + (int) val2;
    return 31 * result + (int) (val2 >>> 32);
}

The following are the results of running testMapSpeed on 10M objects of each of the three kinds. All three were initialized with the same values.

Two longs with original hashCode	Two longs with modified hashCode	Two ints
2.596 sec	1.435 sec	0.737 sec

The modified hashCode() method makes a measurable difference. The gain is not dramatic, but it is worth considering in performance-critical code.

What Can a High-Quality String.hashCode() Do for You?
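
One way to see why the XOR-based variant can distribute poorly is to feed both variants a deliberately adversarial input. The sketch below is an illustration rather than the article's benchmark, and its class and method names are hypothetical. It uses long values whose upper and lower 32 bits are identical, so the XOR fold cancels to zero.

```java
import java.util.HashSet;
import java.util.Set;

final class LongHashComparison {
    // IDE-style hash: folds each long into an int with XOR.
    static int xorFoldHash(long val1, long val2) {
        int result = (int) (val1 ^ (val1 >>> 32));
        result = 31 * result + (int) (val2 ^ (val2 >>> 32));
        return result;
    }

    // Modified hash: treats each 32-bit half as a separate int field.
    static int splitHalvesHash(long val1, long val2) {
        int result = (int) val1;
        result = 31 * result + (int) (val1 >>> 32);
        result = 31 * result + (int) val2;
        return 31 * result + (int) (val2 >>> 32);
    }

    // Counts distinct hash codes over n values whose upper and lower
    // halves are identical.
    static int distinctHashes(int n, boolean splitHalves) {
        Set<Integer> seen = new HashSet<>();
        for (long k = 1; k <= n; k++) {
            long v = (k << 32) | k;
            seen.add(splitHalves ? splitHalvesHash(v, v) : xorFoldHash(v, v));
        }
        return seen.size();
    }
}
```

For 1000 such values, the XOR fold produces a single hash code and the split-halves variant produces 1000 distinct ones. Real data is rarely this extreme, which is consistent with the moderate speedup measured above.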

Suppose we have a map from String identifiers to some values. The keys (the String identifiers) are stored only in this map (at most, a few of them are also stored elsewhere). Suppose also that all map entries have already been collected, for example in the first phase of a two-phase algorithm. In the second phase, we look up values by key, and we only ever query keys that already exist in the map.

How can we improve the performance of this map? As shown earlier, String.hashCode() returns almost entirely distinct values. We can scan all keys, compute their hash codes, and find out which hash codes are not unique:

// dict (the original map) and max (the initial capacity) are defined elsewhere
final Map<Integer, Integer> cnt = new HashMap<Integer, Integer>( max );
for ( final String s : dict.keySet() )
{
    final int hash = s.hashCode();
    final Integer count = cnt.get( hash );
    if ( count != null )
        cnt.put( hash, count + 1 );
    else
        cnt.put( hash, 1 );
}
//keep only not unique hash codes
final Map<Integer, Integer> mult = new HashMap<Integer, Integer>( 100 );
for ( final Map.Entry<Integer, Integer> entry : cnt.entrySet() )
{
    if ( entry.getValue() > 1 )
        mult.put( entry.getKey(), entry.getValue() );
}

Now we can create two new maps. For simplicity, assume that the values stored in the map are of type Object. We create a Map<Integer, Object> and a Map<String, Object> (TIntObjectHashMap is recommended in a production environment). The first map stores the unique hash codes and their corresponding values; the second stores the strings whose hash codes are not unique, along with their values.

final Map<Integer, Object> unique = new HashMap<Integer, Object>( 1000 );
final Map<String, Object> not_unique = new HashMap<String, Object>( 1000 );
//dict - original map
for ( final Map.Entry<String, Object> entry : dict.entrySet() )
{
    final int hashCode = entry.getKey().hashCode();
    if ( mult.containsKey( hashCode ) )
        not_unique.put( entry.getKey(), entry.getValue() );
    else
        unique.put( hashCode, entry.getValue() );
}

Now, to find a value, we first look up the hash code in the unique map. If nothing is found there, we look up the key itself in the non-unique map:

public Object get( final String key )
{
    final int hashCode = key.hashCode();
    Object value = m_unique.get( hashCode );
    if ( value == null )
        value = m_not_unique.get( key );
    return value;
}
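
Putting the counting, splitting, and lookup steps together, a self-contained sketch might look like the following. The class name HashKeyedLookup and its accessors are hypothetical, and plain java.util.HashMap is used rather than a primitive-keyed map to keep the example dependency-free.

```java
import java.util.HashMap;
import java.util.Map;

final class HashKeyedLookup {
    private final Map<Integer, Object> unique = new HashMap<>();
    private final Map<String, Object> notUnique = new HashMap<>();

    HashKeyedLookup(Map<String, Object> dict) {
        // Count how many keys share each hash code.
        Map<Integer, Integer> counts = new HashMap<>();
        for (String key : dict.keySet()) {
            counts.merge(key.hashCode(), 1, Integer::sum);
        }
        // Route each entry to the map that fits its hash code.
        for (Map.Entry<String, Object> e : dict.entrySet()) {
            int hash = e.getKey().hashCode();
            if (counts.get(hash) > 1) {
                notUnique.put(e.getKey(), e.getValue());
            } else {
                unique.put(hash, e.getValue());
            }
        }
    }

    // Only valid for keys that were present in the original map:
    // an unknown key with a matching hash code returns a wrong value.
    Object get(String key) {
        Object value = unique.get(key.hashCode());
        return value != null ? value : notUnique.get(key);
    }

    int uniqueSize() { return unique.size(); }

    int notUniqueSize() { return notUnique.size(); }
}
```

With the keys "Aa", "BB", and "hello", the first two collide and stay in the String-keyed map, while "hello" is stored under its hash code alone.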

In some rare cases, the non-unique map may still contain too many entries. If so, first try replacing the hashCode() computation with java.util.zip.CRC32 or java.util.zip.Adler32 (Adler32 is faster than CRC32, but its distribution is somewhat worse). If that does not help, compute a 64-bit key using two different functions: one for the lower 32 bits and another for the upper 32 bits. The candidate hash functions are Object.hashCode(), java.util.zip.CRC32, and java.util.zip.Adler32.

The benefit of this approach is a more compact map. For example, for a map with 1 million String keys, only int or long keys and a handful of strings remain after compression.

Set Compression Is Even More Effective
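
As a rough sketch of the 64-bit variant, the example below places String.hashCode() in the upper 32 bits and a CRC32 checksum in the lower 32 bits. The class name, the UTF-8 encoding, and which function goes in which half are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class WideHash {
    // 32-bit CRC of the UTF-8 bytes of the string.
    static int crc32(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return (int) crc.getValue();
    }

    // 64-bit key: String.hashCode() in the upper half, CRC32 in the lower half.
    static long hash64(String s) {
        return ((long) s.hashCode() << 32) | (crc32(s) & 0xFFFFFFFFL);
    }
}
```

Strings such as "Aa" and "BB" collide under String.hashCode() but receive different 64-bit keys, because their CRC32 values differ. Swapping in java.util.zip.Adler32 only requires changing the checksum class.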

In the previous example, we discussed how to get rid of the keys in a map. For sets, the effect of this optimization is even more pronounced. Two scenarios come to mind. The first is splitting an original set into several subsets and then checking which subset an identifier belongs to. The second is writing a spellchecker: many of the queried values are unexpected (for example, misspellings), and an occasional wrong answer does little harm (if another word has the same hash code as a misspelling, the misspelled word is simply accepted as correct). The optimization applies to both scenarios.

If we extend the preceding approach, we get a Set<Integer> of unique hash codes and a Set<String> of the strings whose hash codes are not unique. At the very least, this saves the space taken by most of the strings.

If we can restrict hash code values to a certain range (for example, 2^20), we can use a BitSet instead of a Set<Integer>, as described in the BitSet article. Generally, if the size of the original set is known in advance, there is enough room to tune the hash code range.

The next step is to determine how many identifiers share a hash code. If there are too many collisions, improve your hash function or widen the range of hash code values. Ideally, all identifiers have unique hash codes (which is not hard to achieve). The payoff is that you only need a BitSet instead of a large set of strings.
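
A minimal sketch of the BitSet idea, assuming a 2^20 range obtained by masking String.hashCode(), is shown below. The class and method names are hypothetical, and the name mightContain is chosen to signal that false positives are possible.

```java
import java.util.BitSet;

final class HashBitSet {
    private static final int BITS = 1 << 20;   // 2^20 possible hash values
    private final BitSet bits = new BitSet(BITS);

    // Masks the hash code down to the range [0, 2^20).
    private static int index(String s) {
        return s.hashCode() & (BITS - 1);
    }

    void add(String s) {
        bits.set(index(s));
    }

    // True for every added string; may also be true for a string that
    // was never added but maps to the same bit.
    boolean mightContain(String s) {
        return bits.get(index(s));
    }
}
```

After adding "Aa", a query for "BB" also answers true, because both strings have the same hash code. This is exactly the kind of error the spellchecker scenario tolerates.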

Summary

Improve the distribution of your hashCode() method. This matters far more than optimizing its execution speed. Never write a hashCode() method that returns a constant.

The implementation of String.hashCode() is already very good, so you can often use a string's hash code in place of the string itself. If you are working with a set of strings, try to reduce it to a BitSet. This can greatly improve the performance of your program.

When reprinting this article, please credit the source: http://it.deepinmind.com

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
