Most Java developers use maps, especially HashMap. HashMap is a simple yet powerful way to store and retrieve data. But how many developers know how HashMap works internally? A few days ago, I read a large part of the source code of java.util.HashMap (in Java 7 and then Java 8) to get a deep understanding of this fundamental data structure. In this article, I'll explain the implementation of java.util.HashMap, describe the new features added in Java 8, and discuss performance, memory, and some known issues when using a HashMap.
Internal storage
The Java HashMap class implements the Map<K, V> interface. The main methods of this interface are:
V put(K key, V value)
V get(Object key)
V remove(Object key)
boolean containsKey(Object key)
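A minimal usage sketch of these four methods (the map contents here are purely illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class BasicUsage {
    public static void main(String[] args) {
        Map<String, Integer> ages = new HashMap<>();
        ages.put("Alice", 30);                    // V put(K key, V value)
        ages.put("Bob", 25);
        Integer age = ages.get("Alice");          // V get(Object key)
        boolean hasBob = ages.containsKey("Bob"); // boolean containsKey(Object key)
        Integer removed = ages.remove("Bob");     // V remove(Object key)
        System.out.println(age + " " + hasBob + " " + removed + " " + ages.size());
        // prints: 30 true 25 1
    }
}
```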
HashMap uses an inner class Entry<K, V> to store its data. This inner class is a simple key-value pair with two extra pieces of data:
A reference to another Entry, so that a HashMap can store entries as a singly linked list.
The hash value of the key. Storing this value avoids recomputing the hash every time the HashMap needs it.
Here is a part of the Entry<K, V> implementation in Java 7:
static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    Entry<K,V> next;
    int hash;
    ...
}
HashMap stores data in multiple singly linked lists of entries (sometimes called buckets or bins). All the lists are registered in an array of Entry (Entry<K, V>[]), and the default capacity of this internal array is 16.
The following diagram depicts the internal storage of a HashMap instance: an array of nullable entries, where each Entry can link to another Entry, forming a linked list.
All the keys with the same hash value are put in the same linked list (bucket). Keys with different hash values can also end up in the same bucket.
When a user calls put(K key, V value) or get(Object key), the method computes the index of the bucket in which the Entry should be. It then iterates through the corresponding list, looking for the Entry with the same key (using the key's equals() method).
In the case of get(), the method returns the value associated with the Entry (if the Entry exists).
In the case of put(K key, V value), if the Entry already exists, the method replaces its value with the new one; otherwise it creates a new Entry (from the key and value arguments) and puts it at the head of the singly linked list.
The index of the bucket (the linked list) is generated by the map in 3 steps:
It first gets the hash code of the key.
It rehashes the hash code to defend against a bad hash function of the key, which could otherwise put all the data at the same index (bucket) of the internal array.
It takes the rehashed hash code and masks it with (array length - 1). This operation guarantees that the index can't be greater than the size of the array. You can see it as a computationally optimized modulo function.
Here is the Java 7 and Java 8 source code that deals with the index:
// the "rehash" function in JAVA 7 that takes the hashcode of the key
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
// the "rehash" function in JAVA 8 that directly takes the key
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
// the function that returns the index from the rehashed hash
static int indexFor(int h, int length) {
    return h & (length - 1);
}
In order to work efficiently, the size of the internal array needs to be a power of 2. Let's see why:
Imagine the array length is 17; the mask value would then be 16 (length - 1). The binary representation of 16 is 0…010000, so for any hash value H the index formula "H & 16" can only produce 16 or 0. This means the array of length 17 would only use two buckets: the one at index 0 and the one at index 16, which is not very efficient. But if the length is a power of 2, such as 16, the bitwise index formula becomes "H & 15". The binary representation of 15 is 0…001111, so the index formula can output values from 0 to 15, and the array of length 16 is fully used. For example:
If H = 952, whose binary representation is 0…01110111000, the associated index is 0…1000 = 8
If H = 1576, whose binary representation is 0…011000101000, the associated index is 0…1000 = 8
If H = 12356146, whose binary representation is 0…0101111001000101000110010, the associated index is 0…0010 = 2
If H = 59843, whose binary representation is 0…01110100111000011, the associated index is 0…0011 = 3
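These four examples can be checked directly with the bitmask formula (a minimal sketch; indexFor mirrors the Java 7 helper shown above and is only valid for power-of-two lengths):

```java
public class IndexForDemo {
    // mirrors Java 7's indexFor(h, length); only valid when length is a power of two
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16; // default internal array size, so the mask is 15
        System.out.println(indexFor(952, length));      // 8
        System.out.println(indexFor(1576, length));     // 8
        System.out.println(indexFor(12356146, length)); // 2
        System.out.println(indexFor(59843, length));    // 3
    }
}
```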
This mechanism is transparent for the developer: if he chooses a HashMap with a capacity of 37, the map automatically chooses the next power of 2 greater than 37 (64) for the length of its internal array.
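The rounding itself can be sketched with the bit-smearing trick that the JDK 8 source uses in its tableSizeFor() helper (shown here without the bounds checks of the real method):

```java
public class CapacityRounding {
    // rounds cap up to the next power of two (valid for 1 <= cap <= 2^30);
    // smearing the highest set bit rightwards fills all lower bits with 1s,
    // so adding 1 yields the next power of two
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(37)); // 64
        System.out.println(tableSizeFor(16)); // 16: powers of two are unchanged
    }
}
```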
Automatic sizing
After getting the index, the get(), put(), or remove() method visits the corresponding linked list to see whether the Entry for the given key exists. Without modification, this mechanism could lead to performance issues, because the method needs to iterate through the entire list. Imagine that the internal array has the default length of 16 and you need to store 2,000,000 records. In the best case, each linked list would hold 125,000 Entry objects (2,000,000/16), so each get(), remove(), and put() call would iterate 125,000 times. To avoid this, a HashMap can grow its internal array so that the linked lists stay short.
When you create a HashMap, you can specify an initial capacity and a load factor with the following constructor:

public HashMap(int initialCapacity, float loadFactor)

If you don't specify arguments, the default initialCapacity is 16 and the default loadFactor is 0.75. initialCapacity represents the length of the internal array of linked lists.
Every time you add a new key-value pair with put(...), the method checks whether it needs to increase the length of the internal array. To do so, the map stores 2 pieces of data:
The size of the map: it represents the number of entries in the HashMap. It is updated each time an Entry is added or removed.
A threshold: it is equal to (length of the internal array) * loadFactor, and it is refreshed each time the internal array is resized.
Before adding a new Entry, put(...) checks whether the current size is greater than the threshold. If it is, it creates a new array twice the length of the current one. Since the size of the new array has changed, the indexing function (the bitwise "hash(key) & (array length - 1)") changes too. So resizing the array doubles the number of buckets (linked lists) and redistributes all the existing entries into them. The goal of the resize is to decrease the size of the linked lists so that the time cost of put(), remove(), and get() stays small. After the resize, all entries whose keys have the same hash value stay in the same bucket, but two entries with different key hashes that shared a bucket before are not guaranteed to share one afterwards.
This picture depicts the internal array before and after a resize. Before the resize, getting Entry E required iterating through a list of 5 elements. After the resize, the same get() only iterates through a list of 2 elements, so it runs twice as fast.
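The doubling behaviour can be sketched as follows. Starting from the default capacity of 16 and load factor 0.75, the successive resize thresholds are 12, 24, 48, 96, …:

```java
public class ResizeThresholds {
    public static void main(String[] args) {
        int capacity = 16;        // default initial capacity
        float loadFactor = 0.75f; // default load factor
        // print the size at which each of the first few resizes is triggered
        for (int i = 0; i < 5; i++) {
            int threshold = (int) (capacity * loadFactor);
            System.out.println("capacity=" + capacity + " threshold=" + threshold);
            capacity *= 2; // the internal array doubles at each resize
        }
    }
}
```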
Thread Safety
If you're already familiar with HashMap, you know that it is not thread safe, but why? Imagine, for example, that you have a writer thread that only inserts new data into the map and a reader thread that only reads data from it: why shouldn't that work?
Because during the auto-resizing mechanism, if a thread tries to put or get an object, the map might use the old index values and therefore fail to find the new bucket the Entry now lives in.
The worst case is when 2 threads put data at the same time and both put() calls trigger a resize simultaneously. Since both threads modify the linked lists at the same time, the map might end up with an inner loop in one of its linked lists. If you then try to get data from a list with an inner loop, the get() never ends.
The Hashtable class offers a thread-safe implementation that prevents such situations. But since all its CRUD methods are synchronized, this implementation is very slow. For example, if thread 1 calls get(key1), thread 2 calls get(key2), and thread 3 calls get(key3), only one thread at a time can get its value, even though all 3 of them could have accessed the data concurrently.
Since Java 5, there is a better thread-safe HashMap implementation: ConcurrentHashMap. In a ConcurrentHashMap, only the buckets are synchronized, so multiple threads can call get(), remove(), or put() concurrently as long as they don't touch the same bucket or resize the internal array. In a multithreaded application, it is the better choice.
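A minimal sketch of swapping in ConcurrentHashMap; the example uses merge(), which ConcurrentHashMap performs atomically, so concurrent increments from the two writer threads are not lost:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        Runnable writer = () -> {
            for (int i = 0; i < 1000; i++) {
                // merge() is atomic on ConcurrentHashMap: no lost updates
                counts.merge("hits", 1, Integer::sum);
            }
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counts.get("hits")); // prints 2000
    }
}
```

The same code with a plain HashMap would be broken: it could lose increments, or worse, corrupt the internal lists during a concurrent resize.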
Invariance of Keys
Why are Strings and Integers a good implementation of keys for a HashMap? Mostly because they are immutable! If you choose to create your own key class and don't make it immutable, you might lose data inside the HashMap.
Look at the following use case:
You have a key whose internal value is "1".
You put an object into the HashMap with this key.
The HashMap generates a hash from the hash code of the key (i.e. from "1").
The map stores this hash in the newly created Entry.
You change the internal value of the key to "2".
The hash value of the key has changed, but the HashMap doesn't know it (the old hash value is stored).
You try to get your object with the modified key.
The map computes the new hash of your key (from "2") to find which linked list (bucket) the Entry is in.
Case 1: since you modified your key, the map looks for the Entry in the wrong bucket and doesn't find it.
Case 2: luckily, the modified key generates the same bucket as the old key. The map then iterates through the linked list to find an Entry with the same key. But to find the key, the map first compares the hash values before calling equals(). Since the modified key's hash differs from the old hash stored in the Entry, the map has no way of finding the Entry in the list.
Here is a Java example: I put two key-value pairs in a map, then I modify the first key and try to get the two objects. Only the second object is returned; the first one is "lost" in the HashMap:
import java.util.HashMap;
import java.util.Map;

public class MutableKeyTest {

    public static void main(String[] args) {

        class MyKey {
            Integer i;

            public void setI(Integer i) {
                this.i = i;
            }

            public MyKey(Integer i) {
                this.i = i;
            }

            @Override
            public int hashCode() {
                return i;
            }

            @Override
            public boolean equals(Object obj) {
                if (obj instanceof MyKey) {
                    return i.equals(((MyKey) obj).i);
                } else {
                    return false;
                }
            }
        }

        Map<MyKey, String> myMap = new HashMap<>();
        MyKey key1 = new MyKey(1);
        MyKey key2 = new MyKey(2);
        myMap.put(key1, "test " + 1);
        myMap.put(key2, "test " + 2);

        // modifying key1
        key1.setI(3);

        String test1 = myMap.get(key1);
        String test2 = myMap.get(key2);
        System.out.println("test1= " + test1 + " test2=" + test2);
    }
}
The output of this code is "test1= null test2=test 2". As expected, the map wasn't able to retrieve the string associated with the modified key 1.
Improvements in Java 8
In Java 8, the internal implementation of HashMap has been heavily modified. Indeed, the Java 7 implementation takes about 1000 lines of code while the Java 8 one takes about 2000. Most of what I described earlier remains true in Java 8, except for the way entries are stored. In Java 8 there is still an internal array, but it now holds Node objects. A Node contains the same information as an Entry, and linked lists of Nodes are still used.
Here is part of the Node implementation in Java 8:
static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Node<K,V> next;
    ...
}
So what's the big difference from Java 7? Well, a Node can be extended into a TreeNode. A TreeNode is a node of a red-black tree that stores more information, so that the map can add, delete, or get an element in O(log(n)). Here is an exhaustive list of the information stored in a TreeNode:
static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
    final int hash;           // inherited from Node<K,V>
    final K key;              // inherited from Node<K,V>
    V value;                  // inherited from Node<K,V>
    Node<K,V> next;           // inherited from Node<K,V>
    Entry<K,V> before, after; // inherited from LinkedHashMap.Entry<K,V>
    TreeNode<K,V> parent;
    TreeNode<K,V> left;
    TreeNode<K,V> right;
    TreeNode<K,V> prev;
    boolean red;
}
The red-black tree is a self-balancing binary search tree. Its internal mechanisms guarantee that its depth is always in log(n), whether nodes are added or removed. The main advantage of this kind of tree is that when many entries end up at the same index (bucket) of the internal table, searching the tree costs O(log(n)), whereas the same lookup in a linked list costs O(n).
As you can see, the tree stores more data per node than the linked list does. By the inheritance principle, the internal table can contain both Nodes (linked lists) and TreeNodes (red-black trees). Oracle decided to use both data structures according to the following rules:
- If, for a given index (bucket) of the internal table, there are more than 8 nodes, the linked list is transformed into a red-black tree.
- If, for a given index (bucket) of the internal table, there are fewer than 6 nodes, the red-black tree is transformed back into a linked list.
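In the OpenJDK 8 source of java.util.HashMap, these cut-offs are named constants. Note that the source additionally requires the table itself to hold at least 64 slots before any bin is treeified (below that, it resizes instead):

```java
// the thresholds as they appear in java.util.HashMap (OpenJDK 8)
class TreeifyThresholds {
    static final int TREEIFY_THRESHOLD = 8;     // a bin is treeified above this size
    static final int UNTREEIFY_THRESHOLD = 6;   // a tree shrinks back to a list below this
    static final int MIN_TREEIFY_CAPACITY = 64; // minimum table size before treeifying
}
```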
This picture depicts an internal array of a Java 8 HashMap that contains both a tree (for bucket 0) and linked lists (for buckets 1, 2, and 3). Bucket 0 is a tree because it has more than 8 nodes.
Memory Overhead
JAVA 7
Using a HashMap comes at a memory cost. In Java 7, HashMap wraps each key-value pair in an Entry object, and an Entry contains the following information:
A reference to the next record
A computed hash value (integer)
A reference that points to a key
A reference that points to a value
Moreover, a Java 7 HashMap uses an internal array of Entry objects. Assuming a Java 7 HashMap contains N elements and its internal array has a capacity CAPACITY, the additional memory cost is approximately:
sizeOf(integer) * N + sizeOf(reference) * (3 * N + CAPACITY)
Where:
The size of an integer is 4 bytes
The size of the reference depends on the JVM, the operating system, and the processor, but is typically 4 bytes.
This means that the total memory cost is usually 16 * N + 4 * CAPACITY bytes.
Note: after an automatic resize of the map, the CAPACITY of the internal array equals the next power of 2 greater than N.
Note: since Java 7, the HashMap class lazily initializes its storage. This means that even if you specify a size for the HashMap, the internal array of entries (which costs 4 * CAPACITY bytes) is not allocated in memory until the first call to put().
JAVA 8
In the Java 8 implementation, computing the memory usage becomes a little more complicated, because a Node may store the same data as an Entry, or that data plus 6 more references and a boolean (when it is a TreeNode).
If all the nodes are plain Nodes, a Java 8 HashMap consumes the same memory as a Java 7 HashMap.
If all nodes are TreeNode, then the memory consumed by the Java 8 HashMap becomes:
N * sizeOf(integer) + N * sizeOf(boolean) + sizeOf(reference) * (9 * N + CAPACITY)
In most standard JVMs (counting the boolean field as a padded 4 bytes), this equals 44 * N + 4 * CAPACITY bytes.
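The two formulas can be turned into a small estimator. This is a sketch under the stated assumptions (4-byte references and ints, boolean fields padded to 4 bytes); the helper names java7Bytes and java8TreeBytes are mine, not from the JDK:

```java
public class HashMapMemoryEstimate {
    static final int INT = 4, BOOL = 4, REF = 4; // assumed field sizes in bytes

    // Java 7: one cached hash (int) plus 3 references per Entry,
    // plus one reference per slot of the internal array
    static long java7Bytes(long n, long capacity) {
        return INT * n + REF * (3 * n + capacity);
    }

    // Java 8 worst case (all TreeNodes): int hash + boolean colour
    // + 9 references per node, plus the array slots
    static long java8TreeBytes(long n, long capacity) {
        return INT * n + BOOL * n + REF * (9 * n + capacity);
    }

    public static void main(String[] args) {
        // 1,000,000 entries in a resized array of capacity 2^21
        System.out.println(java7Bytes(1_000_000, 2_097_152));     // 16N + 4C
        System.out.println(java8TreeBytes(1_000_000, 2_097_152)); // 44N + 4C
    }
}
```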
Performance issues
Skewed HashMap vs. well-balanced HashMap
In the best case, the get() and put() methods have O(1) complexity. But if you don't take care over the hash function of the key, your put() and get() calls can become very slow. Good performance of put() and get() depends on the data being spread across the different indexes of the internal array (the buckets). If the hash function of the key is ill-designed, you get a skewed distribution (no matter how big the internal array is). All the put() and get() calls that hit the longest linked lists will be slow, because they have to iterate through the whole list. In the worst case (if most of the data ends up in the same bucket), the complexity becomes O(n).
Here is a visual example. The first picture shows a skewed HashMap and the second one a well-balanced HashMap.
With the skewed HashMap, a get() or put() on bucket 0 is costly: retrieving the entry K takes 6 iterations.
With the well-balanced HashMap, retrieving K takes only 3 iterations. Both HashMaps store the same amount of data and have internal arrays of the same size. The only difference is the hash function of the keys, which distributes the entries across the buckets.
Here is an extreme example in Java, where I use a hash function that puts all the data in the same linked list (bucket), and then add 2,000,000 entries:
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

public class Test {

    public static void main(String[] args) {

        class MyKey {
            Integer i;

            public MyKey(Integer i) {
                this.i = i;
            }

            @Override
            public int hashCode() {
                return 1;
            }

            @Override
            public boolean equals(Object obj) {
                if (obj instanceof MyKey) {
                    return i.equals(((MyKey) obj).i);
                } else {
                    return false;
                }
            }
        }

        Date begin = new Date();
        Map<MyKey, String> myMap = new HashMap<>(2_500_000, 1);
        for (int i = 0; i < 2_000_000; i++) {
            myMap.put(new MyKey(i), "test " + i);
        }

        Date end = new Date();
        System.out.println("Duration (ms) " + (end.getTime() - begin.getTime()));
    }
}
On my machine (Core i5-2500K @ 3.6GHz), running this under Java 8u40 takes more than 45 minutes (I stopped the process after 45 minutes). Now, if I run the same code but with the following hash function instead:
@Override
public int hashCode() {
    int key = 2097152 - 1;
    return key + 2097152 * i;
}
it takes 46 seconds, which is way better! This hash function spreads the entries across the buckets more evenly than the previous one, so the put() calls are faster. And if I run the same code with the following hash function, which provides an even better distribution:
@Override
public int hashCode() {
    return i;
}
Now it only takes 2 seconds!
I hope you realize how important the hash function is. If the same tests were run on Java 7, the first and second cases would be even worse (since the put() complexity is O(n) in Java 7 versus O(log(n)) in Java 8).
When using a HashMap, you need to find a hash function for your keys that spreads the keys across the largest possible number of buckets. To do so, you need to avoid hash collisions. String is a very good key type because it has a good hash function, and Integer is good too because its hash value is its own value.
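A sketch of a safe key class in the spirit of the earlier MyKey, but with a final field, so its hash can never change after insertion (the class name ImmutableKey is mine):

```java
import java.util.HashMap;
import java.util.Map;

public final class ImmutableKey {
    private final int i; // final: the hash is fixed for the object's lifetime

    public ImmutableKey(int i) { this.i = i; }

    @Override
    public int hashCode() { return i; }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof ImmutableKey && ((ImmutableKey) obj).i == i;
    }

    public static void main(String[] args) {
        Map<ImmutableKey, String> map = new HashMap<>();
        map.put(new ImmutableKey(1), "test1");
        // an equal key always finds the entry: no setter can desynchronise the hash
        System.out.println(map.get(new ImmutableKey(1))); // prints test1
    }
}
```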
Cost of resizing
If you need to store a lot of data, you should create your HashMap with an initial capacity close to your expected volume.
If you don't, the map takes the default capacity of 16 with a loadFactor of 0.75. The first 11 calls to put() will be fast, but the 12th (16 * 0.75) will recreate a new internal array (with its associated linked lists/trees) of length 32. The 13th to 23rd calls will be fast again, but the 24th (32 * 0.75) will once more recreate a costly new representation with double the array length. The internal resizing operation then triggers at the 48th, 96th, 192nd, … call of put(). At low volume the full recreation of the internal array is fast, but at high volume it can take seconds to minutes. By specifying your expected size at initialization, you can avoid these costly operations.
But there is a drawback: if you set the array size very high, such as 2^28, while you only use 2^26 buckets, you waste a lot of memory (approximately 2^30 bytes in this case, since each array slot is a 4-byte reference).
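A sketch of presizing so that no resize happens while loading: the capacity passed to the constructor must be large enough that the expected size never exceeds capacity * loadFactor (the helper name capacityFor is mine):

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    // smallest initial capacity that avoids any resize for expectedSize entries;
    // HashMap rounds it up to a power of two internally
    static int capacityFor(int expectedSize, float loadFactor) {
        return (int) (expectedSize / loadFactor) + 1;
    }

    public static void main(String[] args) {
        int expected = 1_000_000;
        Map<Integer, String> map =
                new HashMap<>(capacityFor(expected, 0.75f), 0.75f);
        for (int i = 0; i < expected; i++) {
            map.put(i, "v" + i); // no intermediate resize while loading
        }
        System.out.println(map.size()); // 1000000
    }
}
```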
Conclusion
For simple use cases, you don't need to know how HashMap works, because you won't see the difference between O(1), O(n), and O(log(n)) operations. But it is always better to understand the mechanism behind such a heavily used data structure. Besides, for a Java developer position, this is a typical interview question.
At high volume, it becomes important to know how HashMap works and to understand the importance of the key's hash function.
I hope this article helped you gain a deeper understanding of the HashMap implementation.