A deep discussion on the working principle of Java HashMap

Last Update:2015-09-06 Source: Internet

Author: User

Most Java developers use maps, especially HashMap. HashMap is a simple but powerful way to store and retrieve data. But how many developers know how HashMap works internally? A few days ago, I read a large part of the source code of java.util.HashMap (in both Java 7 and Java 8) to get a deeper understanding of this fundamental data structure. In this article, I'll explain the implementation of java.util.HashMap, describe what is new in the Java 8 implementation, and discuss performance, memory, and some known issues when using HashMap.

Internal storage

The Java HashMap class implements the Map<K, V> interface. The main methods of this interface are:

V put(K key, V value)
V get(Object key)
V remove(Object key)
boolean containsKey(Object key)

HashMap uses an inner class, Entry<K, V>, to store data. This inner class is a simple key-value pair with two extra pieces of data:

A reference to another entry, so that HashMap can store entries as a singly linked list.
A hash value that represents the key. Storing it avoids recomputing the hash of the key every time HashMap needs it.

Below is part of the Entry<K, V> code in Java 7:

    static class Entry<K,V> implements Map.Entry<K,V> {
        final K key;
        V value;
        Entry<K,V> next;
        int hash;
        ...
    }

HashMap stores data in multiple singly linked lists of entries (also called buckets or bins). All the lists are registered in an array of entries (Entry<K, V>[]), and the default length of this internal array is 16.

The following illustration depicts the internal storage of a HashMap instance: an array of nullable entries, where each entry can link to another entry to form a linked list.

All keys that have the same hash value will be placed in the same list (bucket). Keys that have different hash values may end up in the same bucket.

When the user calls put(K key, V value) or get(Object key), the map computes the index of the bucket in which the entry should be. It then iterates through the corresponding list to find the entry with the same key (using the equals() method of the key).

In the case of get(), the map returns the value associated with the entry, if the entry exists.

In the case of put(K key, V value), if the entry already exists, the map replaces its value with the new one; otherwise it creates a new entry (from the key and value passed as arguments) at the head of the singly linked list.
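
The lookup and insertion logic described above can be sketched as a stripped-down map. This is only an illustration and not the JDK code: the class name SimpleChainedMap, the fixed table of 16 buckets, and the absence of rehashing and resizing are simplifying assumptions.

```java
import java.util.Objects;

// Hypothetical, simplified chained hash map (fixed size, no rehash, no resize).
final class SimpleChainedMap<K, V> {

    // One link of a bucket's singly linked list, similar to Entry<K,V> in Java 7.
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;
        final int hash;

        Entry(int hash, K key, V value, Entry<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16];

    private int indexFor(int hash) {
        return hash & (table.length - 1);
    }

    // Walks the bucket's list and returns the value, or null if the key is absent.
    V get(K key) {
        int hash = Objects.hashCode(key);
        for (Entry<K, V> e = table[indexFor(hash)]; e != null; e = e.next) {
            if (e.hash == hash && Objects.equals(e.key, key)) {
                return e.value;
            }
        }
        return null;
    }

    // Returns the previous value, or null if the key was absent.
    V put(K key, V value) {
        int hash = Objects.hashCode(key);
        int index = indexFor(hash);
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (e.hash == hash && Objects.equals(e.key, key)) {
                V old = e.value;
                e.value = value; // key already present: replace the value
                return old;
            }
        }
        // key absent: the new entry becomes the head of the bucket's list
        table[index] = new Entry<>(hash, key, value, table[index]);
        return null;
    }

    public static void main(String[] args) {
        SimpleChainedMap<String, Integer> map = new SimpleChainedMap<>();
        map.put("a", 1);
        map.put("a", 2);                        // replaces the value 1
        System.out.println(map.get("a"));       // 2
        System.out.println(map.get("missing")); // null
    }
}
```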

The map generates the index of the bucket (linked list) in 3 steps:

It first gets the hash code of the key.
It rehashes the hash code to protect against a bad hash function from the key, which could otherwise put all the data at the same index (bucket) of the internal array.
It takes the rehashed hash code and bit-masks it with the length of the array minus 1. This operation ensures that the index cannot be greater than the size of the array. You can think of it as a computationally optimized modulo function.

The following is the source code that generates the index:

    // the "rehash" function in Java 7 that takes the hash code of the key
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // the "rehash" function in Java 8 that directly takes the key
    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // the function that returns the index from the rehashed hash
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }
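
To see the three steps work together, here is a small sketch that chains them for a few keys. The class BucketIndexDemo and the helper names spread and bucketOf are made up for this illustration; the formulas are the Java 8 ones quoted above.

```java
// Sketch of the three steps: hashCode(), rehash (Java 8 style), bit mask.
final class BucketIndexDemo {

    // Same formula as the Java 8 "rehash": mixes the high 16 bits into the low 16 bits.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // Bit mask; only valid when length is a power of two.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    static int bucketOf(Object key, int length) {
        return indexFor(spread(key), length);
    }

    public static void main(String[] args) {
        System.out.println(bucketOf(42, 16));    // 10, because 42 & 15 == 10
        System.out.println(bucketOf(65541, 16)); // 4: the rehash flips the lowest bit (a plain mask would give 5)
        System.out.println(bucketOf(null, 16));  // 0: a null key always goes to bucket 0
        System.out.println(bucketOf("hello", 16));
    }
}
```

The second call shows what the rehash is for: the bits above position 16 influence the bucket index even though the mask only keeps the 4 lowest bits.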

To work efficiently, the size of the internal array must be a power of 2. Let's see why.

Assume that the length of the array is 17. The mask value is then 16 (array length - 1). The binary representation of 16 is 0...010000, so for any value H, the result of "H & 16" is either 16 or 0. This means that an array of length 17 would only ever use two buckets, the one at index 0 and the one at index 16, which is not very efficient. But if the length of the array is a power of 2, such as 16, the bitwise index computation becomes "H & 15". The binary representation of 15 is 0...001111, so the index formula can output any value from 0 to 15 and the array of length 16 is fully used. For example:

If H = 952, its binary representation is 0...01110111000 and the corresponding index is 0...01000 = 8
If H = 1576, its binary representation is 0...011000101000 and the corresponding index is 0...01000 = 8
If H = 12356146, its binary representation is 0...0101111001000101000110010 and the corresponding index is 0...00010 = 2
If H = 59843, its binary representation is 0...01110100111000011 and the corresponding index is 0...00011 = 3
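
The four examples, and the problem with a length of 17, can be checked with a few lines of code. The class MaskDemo and its method names are hypothetical and exist only for this check.

```java
import java.util.Set;
import java.util.TreeSet;

// Checks the worked examples and compares a power-of-two length with length 17.
final class MaskDemo {

    static int index(int h, int length) {
        return h & (length - 1);
    }

    // Collects every index produced for the hashes 0 .. sampleCount - 1.
    static Set<Integer> reachableIndexes(int length, int sampleCount) {
        Set<Integer> indexes = new TreeSet<>();
        for (int h = 0; h < sampleCount; h++) {
            indexes.add(index(h, length));
        }
        return indexes;
    }

    public static void main(String[] args) {
        int[] hashes = {952, 1576, 12356146, 59843};
        for (int h : hashes) {
            System.out.println(h + " = " + Integer.toBinaryString(h) + " -> index " + index(h, 16));
        }
        System.out.println(reachableIndexes(17, 1000)); // [0, 16]: only two buckets are used
        System.out.println(reachableIndexes(16, 1000).size()); // 16: every bucket is used
    }
}
```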

This mechanism is transparent to the developer: if they ask for a HashMap with a capacity of 37, the map automatically picks the next power of 2 above 37 (that is, 64) as the length of its internal array.
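
One possible way to compute such a rounded-up capacity is sketched below. The helper nextPowerOfTwo is hypothetical and is not the method the JDK uses internally; the cap at 2^30 is an assumption made here to avoid integer overflow.

```java
// Hypothetical helper: rounds a requested capacity up to a power of two.
final class TableSize {

    static final int MAXIMUM_CAPACITY = 1 << 30; // assumed upper bound to avoid int overflow

    // Smallest power of two that is >= requested, capped at MAXIMUM_CAPACITY.
    static int nextPowerOfTwo(int requested) {
        if (requested <= 1) {
            return 1;
        }
        if (requested >= MAXIMUM_CAPACITY) {
            return MAXIMUM_CAPACITY;
        }
        // highestOneBit(requested - 1) is the largest power of two strictly below "requested"
        // when "requested" is itself a power of two, hence the "- 1" before doubling.
        return Integer.highestOneBit(requested - 1) << 1;
    }

    public static void main(String[] args) {
        System.out.println(nextPowerOfTwo(37)); // 64
        System.out.println(nextPowerOfTwo(16)); // 16
        System.out.println(nextPowerOfTwo(17)); // 32
    }
}
```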

Automatic resizing

After getting the index, the get(), put(), or remove() method visits the corresponding linked list to see whether an entry already exists for the given key. Left as is, this mechanism could cause performance problems, because the method has to iterate over the whole list to find out whether the entry exists. Suppose the internal array keeps its default length of 16 and you need to store 2,000,000 records. In the best case, each linked list will hold 125,000 entries (2,000,000 / 16), so each get(), remove(), and put() will need up to 125,000 iterations. To avoid this, HashMap can increase the length of its internal array in order to keep the linked lists very short.

When you create a HashMap, you can specify an initial capacity and a load factor with the following constructor:

    public HashMap(int initialCapacity, float loadFactor)

If you do not specify these arguments, the default initialCapacity is 16 and the default loadFactor is 0.75. The initialCapacity represents the length of the internal array of linked lists.

Each time you add a new key-value pair to the map with put(...), the method checks whether the length of the internal array needs to be increased. To do so, the map stores 2 pieces of data:

The size of the map: the number of entries in the HashMap. This value is updated each time an entry is inserted or removed.
A threshold: it is equal to (length of the internal array) * loadFactor, and it is refreshed after each resize of the internal array.

Before adding a new entry, put(...) checks whether the current size of the map is greater than the threshold. If it is, it creates a new array with twice the length of the current one. Because the size of the array has changed, the result of the index function (the bitwise operation "hash(key) & (array length - 1)") changes as well. Resizing therefore creates twice as many buckets (linked lists) and redistributes all the existing entries into them. The goal of resizing is to shorten the linked lists so that put(), remove(), and get() stay fast. All entries whose keys have the same hash value stay in the same bucket after the resize. However, two entries whose keys have different hash values and that used to share a bucket are not guaranteed to still share one after the resize.
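
The effect of doubling the array on the index function can be illustrated with a few hash values. In the sketch below, the class ResizeSplitDemo and the sample hashes 5, 21, 37, and 53 are chosen for the illustration; they all share bucket 5 in an array of length 16.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Shows how the same hashes are redistributed when the array length doubles.
final class ResizeSplitDemo {

    // Groups the hashes by bucket index for the given (power-of-two) array length.
    static Map<Integer, List<Integer>> distribute(int[] hashes, int length) {
        Map<Integer, List<Integer>> buckets = new TreeMap<>();
        for (int h : hashes) {
            buckets.computeIfAbsent(h & (length - 1), k -> new ArrayList<>()).add(h);
        }
        return buckets;
    }

    public static void main(String[] args) {
        int[] hashes = {5, 21, 37, 53};
        System.out.println(distribute(hashes, 16)); // {5=[5, 21, 37, 53]}
        System.out.println(distribute(hashes, 32)); // {5=[5, 37], 21=[21, 53]}
    }
}
```

After the resize, each entry either stays at its old index or moves to the old index plus the old length (here 5 + 16 = 21), because the mask gains exactly one extra bit.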

This picture shows the internal array before and after a resize. Before the resize, the map has to iterate through a list of 5 elements to get entry E. After the resize, the same get() only traverses a list of 2 elements, so it is about twice as fast.

Thread Safety

If you're already familiar with HashMap, you know that it is not thread-safe, but why? For example, suppose you have a writer thread that only puts new data into the map and a reader thread that reads data from the map. Why shouldn't that work?

Because during automatic resizing, a thread that tries to put or get an object may use the old index value and therefore fail to find the new bucket in which the entry now resides.

The worst case is when 2 threads put data at the same time and the 2 put() calls both trigger a resize. Since both threads modify the linked lists at the same time, the map may end up with an inner loop (a cycle) in one of its linked lists. If you then try to get a value from a list that contains such a loop, get() never ends.

Hashtable is a thread-safe implementation that prevents this from happening. However, since all of its CRUD methods are synchronized, it is very slow. For example, if thread 1 calls get(key1), thread 2 calls get(key2), and thread 3 calls get(key3), only one thread at a time can get its value, whereas all 3 of them could safely access the data at the same time.

Since Java 5 there has been a better thread-safe HashMap implementation: ConcurrentHashMap. In ConcurrentHashMap, only the buckets are synchronized, so multiple threads can call get(), remove(), or put() at the same time as long as they do not access the same bucket or trigger a resize of the internal array. In a multithreaded application, it is the better choice.
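
As a rough illustration of concurrent use, the sketch below lets several threads update the same ConcurrentHashMap. The class ConcurrentCounterDemo, the key names, and the thread counts are made up for the example; merge() is used because it updates a single key atomically.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Several threads increment counters stored in one shared ConcurrentHashMap.
final class ConcurrentCounterDemo {

    static Map<String, Integer> countInParallel(int threads, int incrementsPerThread) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    // atomic read-modify-write for this key
                    counts.merge("key" + (i % 4), 1, Integer::sum);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // wait until every writer is done
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // 4 threads x 1000 increments spread over 4 keys: each key ends at 1000
        System.out.println(countInParallel(4, 1000));
    }
}
```

With a plain HashMap in place of the ConcurrentHashMap, the same code would have no such guarantee: updates could be lost, and a concurrent resize could corrupt the map as described above.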

Key immutability

Why are strings and integers a good choice of keys for HashMap? Mostly because they are immutable! If you write your own key class and do not make it immutable, you may lose data inside the HashMap.

Let's look at the following use case:

You have a key whose internal value is "1".
You put an object into the HashMap with this key.
The HashMap generates a hash from the hash code of the key (computed from "1").
The map stores this hash in the newly created entry.
You change the internal value of the key to "2".
The hash value of the key has changed, but the HashMap does not know it (because the old hash value is stored).
You try to get your object with the modified key.
The map computes the new hash of the key (computed from "2") to find the linked list (bucket) in which the entry should be.
Case 1: Since you modified the key, the map looks for the entry in the wrong bucket and does not find it.
Case 2: You are lucky, and the modified key maps to the same bucket as the old key. The map then iterates through the linked list to find the entry with the same key. But to find the key, the map first compares the hash values and only then calls equals(). Since the modified key no longer has the hash that is stored in the entry, the map does not find the entry in the linked list.

Here is a concrete example in Java. We put two key-value pairs into a map, then modify the first key and try to get the two values. Only the second value is returned by the map; the first one is "lost" in the HashMap:

    import java.util.HashMap;
    import java.util.Map;

    public class MutableKeyTest {

        public static void main(String[] args) {

            // deliberately mutable key: its hash code changes when i changes
            class MyKey {
                Integer i;

                public void setI(Integer i) {
                    this.i = i;
                }

                public MyKey(Integer i) {
                    this.i = i;
                }

                @Override
                public int hashCode() {
                    return i;
                }

                @Override
                public boolean equals(Object obj) {
                    if (obj instanceof MyKey) {
                        return i.equals(((MyKey) obj).i);
                    } else
                        return false;
                }
            }

            Map<MyKey, String> myMap = new HashMap<>();
            MyKey key1 = new MyKey(1);
            MyKey key2 = new MyKey(2);

            myMap.put(key1, "test " + 1);
            myMap.put(key2, "test " + 2);

            // modifying key1
            key1.setI(3);

            String test1 = myMap.get(key1);
            String test2 = myMap.get(key2);

            System.out.println("test1=" + test1 + " test2=" + test2);
        }
    }

The output of the above code is "test1=null test2=test 2". As expected, the map could not retrieve the string "test 1" with the modified key1.
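
For contrast, here is a sketch of an immutable variant of such a key. The class ImmutableKeyDemo is hypothetical; the point is only that a final field without a setter cannot change after insertion, so the stored hash and the recomputed hash always agree.

```java
import java.util.HashMap;
import java.util.Map;

// Immutable key: the field is final and there is no setter.
final class ImmutableKeyDemo {

    static final class MyKey {
        private final int i;

        MyKey(int i) {
            this.i = i;
        }

        @Override
        public int hashCode() {
            return i;
        }

        @Override
        public boolean equals(Object obj) {
            return obj instanceof MyKey && ((MyKey) obj).i == i;
        }
    }

    // Stores a value under one key instance and reads it back with an equal instance.
    static String lookup() {
        Map<MyKey, String> map = new HashMap<>();
        map.put(new MyKey(1), "test 1");
        return map.get(new MyKey(1));
    }

    public static void main(String[] args) {
        System.out.println(lookup()); // test 1
    }
}
```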

Improvements in Java 8

The internal representation of HashMap changed a lot in Java 8. Indeed, the Java 7 implementation takes about 1,000 lines of code, whereas the Java 8 implementation takes about 2,000. Most of what I described earlier is still true in Java 8, except for the linked lists of entries. In Java 8 there is still an array, but it now stores Node objects, which contain exactly the same information as the former Entry objects and therefore also form linked lists.

The following is part of the Node implementation in Java 8:

    static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;
        final K key;
        V value;
        Node<K,V> next;

So what's the big difference with Java 7? Well, a Node can be extended into a TreeNode. A TreeNode is a node of a red-black tree; it stores more information so that an element can be added, removed, or retrieved in O(log(n)). For reference, here is the full list of data stored in a TreeNode:

    static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
        final int hash;            // inherited from Node<K,V>
        final K key;               // inherited from Node<K,V>
        V value;                   // inherited from Node<K,V>
        Node<K,V> next;            // inherited from Node<K,V>
        Entry<K,V> before, after;  // inherited from LinkedHashMap.Entry<K,V>
        TreeNode<K,V> parent;
        TreeNode<K,V> left;
        TreeNode<K,V> right;
        TreeNode<K,V> prev;
        boolean red;

A red-black tree is a self-balancing binary search tree. Its internal mechanisms ensure that its height stays in O(log(n)) no matter how many nodes are added or removed. The main benefit of this kind of tree shows when many entries of the internal table fall into the same index (bucket): searching the tree costs O(log(n)), whereas the same operation on a linked list costs O(n).
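
To get a feel for the difference, the sketch below compares the worst-case number of nodes visited in a linked list with an upper bound on the height of a red-black tree, using the classic bound of 2 * log2(n + 1). The class LookupCost and its method names are hypothetical, and the numbers are bounds, not measurements.

```java
// Worst-case lookup cost: linked list versus red-black tree (upper bounds only).
final class LookupCost {

    // A lookup in a linked list of n entries may have to visit all of them.
    static int worstCaseListSteps(int n) {
        return n;
    }

    // Upper bound on the height of a red-black tree with n nodes: 2 * ceil(log2(n + 1)).
    static int maxRedBlackHeight(int n) {
        int ceilLog2 = 32 - Integer.numberOfLeadingZeros(n); // equals ceil(log2(n + 1)) for n >= 0
        return 2 * ceilLog2;
    }

    public static void main(String[] args) {
        int n = 125_000; // the bucket size from the resizing example earlier in the article
        System.out.println(worstCaseListSteps(n)); // 125000
        System.out.println(maxRedBlackHeight(n));  // 34
    }
}
```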

As you can see, a tree takes noticeably more space than a linked list. Through inheritance, the internal table can contain both Node (linked list) and TreeNode (red-black tree) instances. Oracle decided to use the two data structures according to the following rules:

- For a given index (bucket) in the internal table, if there are more than 8 nodes, the linked list is converted into a red-black tree.

- For a given index (bucket) in the internal table, if there are fewer than 6 nodes, the red-black tree is converted back into a linked list.
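
The two rules can be read as a small state machine with a gap between the thresholds, so that a bucket hovering around 7 nodes does not flip back and forth. The sketch below follows the wording of the rules above; the class BinPolicy is hypothetical, and the real implementation has further conditions that are not modeled here.

```java
// Hypothetical model of the list/tree decision for one bucket, following the two rules above.
final class BinPolicy {

    static final int TREEIFY_THRESHOLD = 8;   // more nodes than this: list becomes tree
    static final int UNTREEIFY_THRESHOLD = 6; // fewer nodes than this: tree becomes list

    enum Shape { LIST, TREE }

    // Returns the shape the bucket should have, given its current shape and node count.
    static Shape next(Shape current, int nodesInBucket) {
        if (current == Shape.LIST && nodesInBucket > TREEIFY_THRESHOLD) {
            return Shape.TREE;
        }
        if (current == Shape.TREE && nodesInBucket < UNTREEIFY_THRESHOLD) {
            return Shape.LIST;
        }
        return current; // between the two thresholds, the bucket keeps its shape
    }

    public static void main(String[] args) {
        System.out.println(next(Shape.LIST, 9)); // TREE
        System.out.println(next(Shape.TREE, 7)); // TREE (no conversion back yet)
        System.out.println(next(Shape.TREE, 5)); // LIST
    }
}
```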

This picture shows the internal array of a Java 8 HashMap with both a tree (at bucket 0) and linked lists (at buckets 1, 2, and 3). Bucket 0 is a tree because it holds more than 8 nodes.

Memory Overhead

Java 7

Using a HashMap comes at a cost in terms of memory. In Java 7, a HashMap wraps key-value pairs in Entry objects. An Entry has:

A reference to the next entry
A precomputed hash (an integer)
A reference to the key
A reference to the value

Moreover, a Java 7 HashMap uses an internal array of Entry. Assuming that a Java 7 HashMap contains N elements and that its internal array has a capacity CAPACITY, the extra memory cost is approximately:

sizeOf(integer) * N + sizeOf(reference) * (3 * N + CAPACITY)

where:

The size of an integer is 4 bytes
The size of a reference depends on the JVM, the operating system, and the processor, but is often 4 bytes.

This means that the overhead is often 16 * N + 4 * CAPACITY bytes.

Note: After an automatic resize of the map, CAPACITY is the smallest power of two greater than N.

Note: Since Java 7, HashMap uses lazy initialization. This means that even if you specify a size for the HashMap, the internal array (which costs 4 * CAPACITY bytes) is not allocated in memory until the first call to put().

Java 8

In the Java 8 implementation, computing the memory usage becomes a bit more complicated, because a Node can hold the same data as an Entry, or that data plus 6 references and a boolean (the red/black color flag of a TreeNode).

If all the nodes are plain Nodes, the Java 8 HashMap consumes the same amount of memory as the Java 7 HashMap.

If all the nodes are TreeNodes, the memory consumed by the Java 8 HashMap becomes:

N * sizeOf(integer) + N * sizeOf(boolean) + sizeOf(reference) * (9 * N + CAPACITY)

On most standard JVMs, with 4-byte integers and references, this is 40 * N + N * sizeOf(boolean) + 4 * CAPACITY bytes, that is, about 44 * N + 4 * CAPACITY bytes if each boolean occupies 4 bytes with padding.
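
The two overhead formulas are easy to evaluate for a concrete map. The sketch below is a back-of-the-envelope estimate, not a measurement: the class MapOverhead is hypothetical, the 4-byte sizes are the assumptions stated in the text, and the size of a boolean is left as a parameter because it depends on padding.

```java
// Rough evaluation of the overhead formulas (4-byte int and reference are assumptions).
final class MapOverhead {

    static final int INT_BYTES = 4;
    static final int REF_BYTES = 4;

    // Entry (Java 7) or plain Node (Java 8): one int and three references per element.
    static long plainNodes(long n, long capacity) {
        return INT_BYTES * n + REF_BYTES * (3 * n + capacity);
    }

    // All TreeNodes (Java 8): one int, one boolean and nine references per element.
    static long allTreeNodes(long n, long capacity, int booleanBytes) {
        return n * INT_BYTES + n * booleanBytes + REF_BYTES * (9 * n + capacity);
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        long capacity = 1 << 20; // 1,048,576 buckets
        System.out.println(plainNodes(n, capacity));      // 20194304 bytes, i.e. 16 * N + 4 * CAPACITY
        System.out.println(allTreeNodes(n, capacity, 4)); // 48194304 bytes, i.e. 44 * N + 4 * CAPACITY
    }
}
```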

Performance issues

Skewed HashMap vs well-balanced HashMap

In the best case, get() and put() have a time complexity of O(1). But if you do not take care of the hash function of the key, you might end up with very slow put() and get() calls. Good performance of put() and get() depends on the data being spread across the different indexes (buckets) of the internal array. If the hash function of your key is badly designed, you get a skewed distribution, no matter how large the internal array is. Every put() and get() that hits the longest linked list is slow, because it has to iterate over all the entries of that list. In the worst case (if most of the data ends up in the same bucket), the time complexity degrades to O(n).

Here is a visual example. The first picture shows a skewed HashMap and the second one a well-balanced HashMap.

In this skewed HashMap, get() and put() operations on bucket 0 are costly. Getting entry K takes 6 iterations.

In this well-balanced HashMap, getting entry K takes only 3 iterations. Both HashMaps store the same amount of data and have internal arrays of the same size. The only difference is the hash function of the key, which distributes the entries across the buckets.
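
The difference between the two pictures can be reproduced numerically by counting how long the longest chain gets under a given hash function. The sketch below is a simplified model: the class BucketBalance is hypothetical, the keys are the integers 0 to n - 1, and the rehash step is left out to keep the arithmetic obvious.

```java
import java.util.function.IntUnaryOperator;

// Measures how skewed a hash function is by the length of its longest bucket chain.
final class BucketBalance {

    // Longest chain when keys 0 .. n-1 are placed into "length" buckets (length is a power of two).
    static int longestChain(int n, int length, IntUnaryOperator hashFunction) {
        int[] sizes = new int[length];
        int longest = 0;
        for (int key = 0; key < n; key++) {
            int index = hashFunction.applyAsInt(key) & (length - 1);
            longest = Math.max(longest, ++sizes[index]);
        }
        return longest;
    }

    public static void main(String[] args) {
        System.out.println(longestChain(1600, 16, key -> 1));   // 1600: everything in one bucket
        System.out.println(longestChain(1600, 16, key -> key)); // 100: evenly spread
    }
}
```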

Here is an extreme example in Java, in which I use a hash function that puts all the data into the same bucket and then add 2,000,000 elements.

    import java.util.Date;
    import java.util.HashMap;
    import java.util.Map;

    public class Test {

        public static void main(String[] args) {

            class MyKey {
                Integer i;

                public MyKey(Integer i) {
                    this.i = i;
                }

                @Override
                public int hashCode() {
                    return 1; // every key lands in the same bucket
                }

                @Override
                public boolean equals(Object obj) {
                    ...
                }
            }

            Date begin = new Date();
            Map<MyKey, String> myMap = new HashMap<>(2_500_000, 1);
            for (int i = 0; i < 2_000_000; i++) {
                myMap.put(new MyKey(i), "test " + i);
            }
            Date end = new Date();
            System.out.println("Duration (ms) " + (end.getTime() - begin.getTime()));
        }
    }

On my Core i5-2500K @ 3.6 GHz, it takes more than 45 minutes with Java 8u40 (I stopped the process after 45 minutes). Now, if I run the same code with the following hash function:

    @Override
    public int hashCode() {
        int key = 2097152 - 1;
        return key + 2097152 * i;
    }

it takes 46 seconds, which is a lot better! This hash function spreads the keys across buckets better than the previous one, so the put() calls are faster. And if I run the same code with the following hash function, which spreads the keys even better:

    @Override
    public int hashCode() {
        return i;
    }

It only takes 2 seconds!
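
One way to see why the three timings differ so much is to count how many distinct buckets each hash function actually uses. The sketch below assumes the Java 8 rehash and a table of 2^22 buckets (the next power of two above the requested 2,500,000); the class HashSpreadDemo is hypothetical, and it only counts buckets without measuring time.

```java
import java.util.BitSet;
import java.util.function.IntUnaryOperator;

// Counts the distinct buckets used by a hash function for the keys 0 .. n-1.
final class HashSpreadDemo {

    static final int TABLE_LENGTH = 1 << 22; // assumed table size for a requested capacity of 2_500_000

    static int usedBuckets(int n, IntUnaryOperator hashCode) {
        BitSet used = new BitSet(TABLE_LENGTH);
        for (int i = 0; i < n; i++) {
            int h = hashCode.applyAsInt(i);
            h ^= (h >>> 16);                  // Java 8 rehash
            used.set(h & (TABLE_LENGTH - 1)); // bucket index
        }
        return used.cardinality();
    }

    public static void main(String[] args) {
        int n = 2_000_000;
        System.out.println(usedBuckets(n, i -> 1));                     // 1
        System.out.println(usedBuckets(n, i -> 2097151 + 2097152 * i)); // 2048
        System.out.println(usedBuckets(n, i -> i));                     // 2000000
    }
}
```

Under these assumptions, the first function piles all 2,000,000 keys into a single bucket, the second spreads them over 2,048 buckets (roughly a thousand keys each), and the third gives every key its own bucket, which matches the ordering of the three timings.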

I hope you realize how important the hash function is. If you ran the same test on Java 7, the results would be worse for the first and second cases, because the time complexity of put() is O(n) in Java 7 versus O(log(n)) in Java 8.

When using a HashMap, you need to find a hash function for your keys that spreads the keys into as many buckets as possible. To do so, you need to avoid hash collisions. String is a good key because it has a good hash function. Integer is also good, because its hash code is its own value.

Cost of resizing

If you need to store a lot of data, you should create your HashMap with an initial capacity close to the expected volume.

If you don't, the map takes the default size of 16 with a load factor of 0.75. The first 11 put() calls will be very fast, but the 12th (16 * 0.75) will create a new internal array (with its associated linked lists or trees) of length 32. The 13th to 23rd calls will be fast, but the 24th (32 * 0.75) will again create a costly new internal array of twice the length. The internal resize then occurs at the 48th, 96th, 192nd, ... call to put(). With a small volume of data, rebuilding the internal array is fast, but with a large volume it can take from seconds to minutes. By setting the expected size up front, you can avoid these costly operations.
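
The resize points listed above, and a capacity that avoids them, can be computed with a short sketch. The class ResizeSchedule and its helper names are hypothetical; capacityFor simply divides the expected number of entries by the load factor so that the threshold is not exceeded.

```java
import java.util.ArrayList;
import java.util.List;

// Computes resize thresholds and a presizing value for a given load factor.
final class ResizeSchedule {

    // Thresholds (capacity * loadFactor) met while the array doubles from initialCapacity to maxCapacity.
    static List<Integer> thresholds(int initialCapacity, float loadFactor, int maxCapacity) {
        List<Integer> result = new ArrayList<>();
        for (long capacity = initialCapacity; capacity <= maxCapacity; capacity *= 2) {
            result.add((int) (capacity * loadFactor));
        }
        return result;
    }

    // Initial capacity to request so that "expected" entries stay within the threshold.
    static int capacityFor(int expected, float loadFactor) {
        return (int) Math.ceil(expected / (double) loadFactor);
    }

    public static void main(String[] args) {
        System.out.println(thresholds(16, 0.75f, 256));    // [12, 24, 48, 96, 192]
        System.out.println(capacityFor(2_000_000, 0.75f)); // 2666667
    }
}
```

A map created with new HashMap<>(capacityFor(2_000_000, 0.75f)) should then be able to hold the 2,000,000 entries without growing its internal array.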

But there is a drawback: if you set a very large array size such as 2^28 whereas you only use 2^26 buckets of the array, you will waste a lot of memory (approximately 2^30 bytes in this case).

Conclusion

For simple use cases, you don't need to know how HashMap works, since you won't see the difference between O(1), O(n), and O(log(n)) operations. But it's always better to understand the mechanics behind one of the most used data structures. Moreover, this is a typical interview question for Java developer positions.

With large volumes of data, it becomes important to know how HashMap works and to understand the importance of the hash function of the key.

I hope this article will help you to have an in-depth understanding of HashMap's implementation.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
