Implementation principle of HashMap and HashMap

Last Update:2015-07-31 Source: Internet

Author: User

Tags concurrentmodificationexception


1. HashMap Overview:
HashMap is an unsynchronized, hash-table-based implementation of the Map interface. It provides all of the optional map operations and permits null values and the null key. The class makes no guarantees about the order of its mappings; in particular, it does not guarantee that the order will remain constant over time.
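As a quick illustration of the null handling described above, here is a minimal sketch that uses the real java.util.HashMap. The class name NullKeyDemo and the sample strings are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; only java.util.HashMap itself is real API.
class NullKeyDemo {
    // Builds a map that uses the null key and a null value.
    static Map<String, String> build() {
        Map<String, String> map = new HashMap<>();
        map.put(null, "value for null key"); // the single null key is allowed
        map.put("k", null);                  // null values are allowed too
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> map = build();
        System.out.println(map.get(null));        // value for null key
        System.out.println(map.containsKey("k")); // true, although the value is null
        System.out.println(map.get("k"));         // null
    }
}
```

Because get returns null both for a missing key and for a key mapped to null, containsKey is the reliable way to tell the two cases apart.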

2. Data Structure of HashMap:
In the Java programming language there are two basic building blocks for data structures: the array and the reference (a simulated pointer). Every data structure can be built from these two, and HashMap is no exception. HashMap is in fact a "chained hash table", that is, a combination of an array and linked lists.

    /**
     * The table, resized as necessary. Length MUST always be a power of two.
     */
    transient Entry[] table;

    static class Entry<K,V> implements Map.Entry<K,V> {
        final K key;
        V value;
        Entry<K,V> next;
        final int hash;
        ......
    }

As the code shows, each element of the array is an Entry, and each Map.Entry is a key-value pair. It also holds a reference to the next Entry, which is what forms the linked list.

3. Access Implementation of HashMap:

1) Storage:

    public V put(K key, V value) {
        // HashMap allows null keys and null values to be stored.
        // When the key is null, call putForNullKey to place the value
        // in the first position of the array.
        if (key == null)
            return putForNullKey(value);
        // Recalculate the hash value based on the key's hashCode.
        int hash = hash(key.hashCode());
        // Find the index in the table that corresponds to the hash value.
        int i = indexFor(hash, table.length);
        // If the Entry at index i is not null, walk the chain through e.next.
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }
        // No Entry with this key exists at index i.
        modCount++;
        // Add the key and value at index i.
        addEntry(hash, key, value, i);
        return null;
    }

From the source code above, we can see that when we put an element into a HashMap, we first recompute the hash value from the key's hashCode and use it to find the element's position (index) in the array. If that position already holds other elements, the elements at that position are stored as a linked list: the newly added element is placed at the head of the chain, and the elements added earliest end up at its tail. If the position is empty, the element is placed directly at that position.
The addEntry(hash, key, value, i) method places the key-value pair at index i of the table array, based on the computed hash value. addEntry is a package-private method of HashMap. Its code is as follows:

    void addEntry(int hash, K key, V value, int bucketIndex) {
        // Get the Entry currently at bucketIndex.
        Entry<K,V> e = table[bucketIndex];
        // Put the newly created Entry at bucketIndex and point it at the original Entry.
        table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
        // If the number of key-value pairs in the map reaches the threshold,
        if (size++ >= threshold)
            // double the length of the table.
            resize(2 * table.length);
    }

When HashMap decides where to store a key-value pair, it does not consider the value in the Entry at all. The storage location of each Entry is computed from the key alone. We can think of the value as an attachment to the key: once the storage location of the key is determined, the value is simply saved there with it.
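The storing logic described so far (find the bucket, overwrite on an equal key, otherwise insert at the head of the chain) can be sketched as follows. This is a simplified teaching sketch, not the JDK class: SketchMap, Node, and chainKeys are hypothetical names, and it assumes non-null keys, skips the supplemental hash, and never resizes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified chained hash table (not java.util.HashMap).
class SketchMap<K, V> {
    static final class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;

        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node<K, V>[] table;

    // Assumption: the caller passes a power-of-two length.
    @SuppressWarnings("unchecked")
    SketchMap(int powerOfTwoLength) {
        table = (Node<K, V>[]) new Node[powerOfTwoLength];
    }

    private int indexFor(Object key) {
        return key.hashCode() & (table.length - 1);
    }

    // Returns the previous value for the key, or null if there was none.
    V put(K key, V value) {
        int i = indexFor(key);
        for (Node<K, V> e = table[i]; e != null; e = e.next) {
            if (e.key.equals(key)) { // same key: overwrite the value only
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        // New key: the new node becomes the head and points at the old head.
        table[i] = new Node<>(key, value, table[i]);
        return null;
    }

    // Lists the keys of one bucket from head to tail.
    List<K> chainKeys(int index) {
        List<K> keys = new ArrayList<>();
        for (Node<K, V> e = table[index]; e != null; e = e.next) {
            keys.add(e.key);
        }
        return keys;
    }

    public static void main(String[] args) {
        SketchMap<Integer, String> map = new SketchMap<>(16);
        map.put(1, "a");
        map.put(17, "b"); // 17 & 15 == 1, same bucket as key 1
        map.put(33, "c"); // 33 & 15 == 1 as well
        System.out.println(map.chainKeys(1)); // [33, 17, 1]
    }
}
```

The keys 1, 17, and 33 were chosen because Integer.hashCode() returns the value itself, so all three share the low four bits and land in bucket 1; the most recently added key sits at the head of the chain.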

The hash(int h) method recomputes a hash from the key's hashCode. The algorithm mixes the high-order bits into the low-order bits, which prevents collisions between hash codes that have identical low-order bits and differ only in their high-order bits.

    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }
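To see what this bit mixing buys, the following sketch copies the hash and indexFor methods quoted in this article into a hypothetical HashSpreadDemo class and compares two hash codes that differ only in their high-order bits. The sample values 0x10000 and 0x20000 and the table length 16 are chosen for illustration.

```java
// Hypothetical demo class; hash and indexFor are copied from the article's listings.
class HashSpreadDemo {
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int a = 0x10000; // the low 16 bits are all zero
        int b = 0x20000; // differs from a only in the high bits

        // Without mixing, both hash codes fall into bucket 0.
        System.out.println(indexFor(a, 16)); // 0
        System.out.println(indexFor(b, 16)); // 0

        // After mixing, the high bits influence the bucket index.
        System.out.println(indexFor(hash(a), 16)); // 1
        System.out.println(indexFor(hash(b), 16)); // 2
    }
}
```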

To find an element in a HashMap, we need to locate its position in the array from the hash value of its key, so the way the position is computed is the hash algorithm itself. As mentioned above, the data structure of HashMap combines an array with linked lists. We therefore want the elements to be distributed as evenly as possible, ideally with only one element at each position. Then, when the hash algorithm gives us a position, we know immediately that the element there is the one we want, without traversing a linked list, which greatly improves query efficiency.
For any given object, as long as its hashCode() returns the same value, the hash code computed by hash(int h) is always the same. The first approach that comes to mind is to take the hash value modulo the array length, which distributes elements fairly evenly. However, the modulo operation is relatively expensive. HashMap instead calls indexFor(int h, int length) to compute the index into the table array. The code of indexFor(int h, int length) is as follows:

    static int indexFor(int h, int length) {
        return h & (length-1);
    }

This method is very clever. It uses h & (table.length-1) to obtain the storage position of the object, and the length of HashMap's underlying array is always a power of 2. This is a speed optimization in HashMap. The HashMap constructor contains the following code:

    int capacity = 1;
    while (capacity < initialCapacity)
        capacity <<= 1;

This code ensures that the capacity of a HashMap after initialization is always a power of 2, that is, the length of the underlying array is always a power of 2.
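The rounding behavior of that loop is easy to check in isolation. The sketch below wraps the same loop in a hypothetical helper named roundUpToPowerOfTwo; the sample inputs are arbitrary.

```java
// Hypothetical helper around the constructor loop quoted above.
class CapacityRounding {
    // Returns the smallest power of two that is >= initialCapacity (at least 1).
    static int roundUpToPowerOfTwo(int initialCapacity) {
        int capacity = 1;
        while (capacity < initialCapacity)
            capacity <<= 1;
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOfTwo(10)); // 16
        System.out.println(roundUpToPowerOfTwo(16)); // 16
        System.out.println(roundUpToPowerOfTwo(17)); // 32
    }
}
```

So a requested capacity of 17 already costs an array of 32 slots, which is worth keeping in mind when choosing an initial capacity.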

When length is a power of 2, the operation h & (length-1) is equivalent to taking h modulo length (h % length for non-negative h), but & is more efficient than %.
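The equivalence can be checked by brute force. The sketch below compares the mask against Math.floorMod (available since Java 8), which is the mathematical modulo; the class and method names are hypothetical. Note that Java's % operator can return negative results for negative h, so floorMod is the fair comparison.

```java
// Hypothetical check that masking equals the modulo for power-of-two lengths.
class MaskVsModulo {
    // True if h & (length - 1) equals Math.floorMod(h, length) for every h in [from, to].
    static boolean agrees(int length, int from, int to) {
        for (int h = from; h <= to; h++) {
            if ((h & (length - 1)) != Math.floorMod(h, length)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agrees(16, -1000, 1000)); // true: 16 is a power of two
        System.out.println(agrees(15, 0, 1000));     // false: 15 is not
        System.out.println(-7 % 16);                 // -7, the remainder keeps the sign
        System.out.println(-7 & 15);                 // 9, always a valid array index
    }
}
```

Besides being cheaper, the mask therefore also guarantees a non-negative index even when the hash value is negative.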

This looks simple, but there is more to it than meets the eye. Here is an example to illustrate:

Assume array lengths of 15 and 16, and hash values (after the supplemental hash) of 4 and 5. The results of the & operation are as follows:

    h & (table.length-1)    hash     table.length-1     result
    4 & (15-1):             0100  &  1110            =  0100
    5 & (15-1):             0101  &  1110            =  0100
    --------------------------------------------------------
    4 & (16-1):             0100  &  1111            =  0100
    5 & (16-1):             0101  &  1111            =  0101

As the example shows, when 4 and 5 are ANDed with 15-1 (1110), they yield the same result, so both land at the same position in the array and collide. 4 and 5 are then stored at that position as a linked list, and a query has to traverse the list to find 4 or 5, which lowers query efficiency. We can also see that when the array length is 15, every hash value is ANDed with 15-1 (1110), so the last bit of the index is always 0. The positions 0001, 0011, 0101, 0111, 1001, 1011, and 1101 can therefore never hold an element, which wastes a lot of space. Worse, the number of usable positions is then much smaller than the array length, which further increases the collision probability and slows down queries. When the array length is 16, a power of 2, every bit of 2^n - 1 is 1, so the & operation preserves the low-order bits of the original hash unchanged. Combined with the hash(int h) method, which further mixes the key's hashCode, only keys whose hash values share the same low-order bits end up at the same position and form a linked list.

Therefore, when the array length is a power of 2, different keys are less likely to compute the same index, and the data is spread more evenly over the array. In other words, collisions are less likely, queries seldom need to traverse a linked list at a given position, and query efficiency is higher.
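The wasted slots can be counted directly. The following sketch, with a hypothetical BucketUsage class, maps the hash values 0 through 15 with h & (length - 1) and counts how many distinct indexes are actually used for lengths 15 and 16.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper that counts how many array positions are reachable.
class BucketUsage {
    // Number of distinct indexes produced by h & (length - 1) for h in [0, hashCount).
    static int distinctIndexes(int length, int hashCount) {
        Set<Integer> used = new HashSet<>();
        for (int h = 0; h < hashCount; h++) {
            used.add(h & (length - 1));
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctIndexes(15, 16)); // 8: only the even positions 0..14
        System.out.println(distinctIndexes(16, 16)); // 16: every position is used
    }
}
```

With length 15, sixteen hash values are squeezed into eight positions, so every used position holds a chain of two; with length 16 each value gets its own position.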

According to the source code of the put method, when a program puts a key-value pair into a HashMap, it first determines the storage location of the Entry from the hashCode() of the key: if the keys of two Entries return the same hashCode() value, they are stored at the same location. If the two keys also compare as equal through equals, the value of the newly added Entry overwrites the value of the existing Entry, but the key is not replaced. If equals returns false, the new Entry forms an Entry chain with the existing Entry, and the new Entry sits at the head of the chain. For details, see the description of the addEntry() method above.
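The "value is overwritten, key is not" rule can be observed with the real java.util.HashMap by using two key objects that are equal but not identical. The class name OverwriteDemo and the key text "id" are made up for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class using the real java.util.HashMap.
class OverwriteDemo {
    // Returns true if the map still holds the first key instance after an
    // equal key was put a second time with a different value.
    static boolean keepsOriginalKey() {
        String first = new String("id");  // two distinct objects...
        String second = new String("id"); // ...that are equal via equals()

        Map<String, Integer> map = new HashMap<>();
        map.put(first, 1);
        Integer old = map.put(second, 2); // overwrites the value, returns the old one

        String storedKey = map.keySet().iterator().next();
        return old == 1 && map.get("id") == 2 && map.size() == 1 && storedKey == first;
    }

    public static void main(String[] args) {
        System.out.println(keepsOriginalKey()); // true
    }
}
```

The identity comparison storedKey == first is deliberate here: it shows that the Entry kept the original key object even though the value changed.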

2) Read:

    public V get(Object key) {
        if (key == null)
            return getForNullKey();
        int hash = hash(key.hashCode());
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
                return e.value;
        }
        return null;
    }

With the hash algorithm from the storage section as background, this code is easy to understand. When we get an element from a HashMap, we first compute the hash from the key's hashCode and find the corresponding position in the array, and then use the key's equals method to find the required element in the linked list at that position.
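The division of labor between hashCode (pick the bucket) and equals (pick the entry within the bucket) shows up clearly when every key is forced into the same bucket. The sketch below uses the real java.util.HashMap with a hypothetical CollidingKey class whose hashCode is deliberately constant.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key type: every instance reports the same hash code.
class CollidingKey {
    final String name;

    CollidingKey(String name) {
        this.name = name;
    }

    @Override
    public int hashCode() {
        return 42; // deliberately bad: all keys collide
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof CollidingKey && ((CollidingKey) other).name.equals(name);
    }

    static Map<CollidingKey, Integer> sample() {
        Map<CollidingKey, Integer> map = new HashMap<>();
        map.put(new CollidingKey("a"), 1);
        map.put(new CollidingKey("b"), 2);
        map.put(new CollidingKey("c"), 3);
        return map;
    }

    public static void main(String[] args) {
        Map<CollidingKey, Integer> map = sample();
        // All three entries share one bucket; equals() selects the right one.
        System.out.println(map.get(new CollidingKey("b"))); // 2
        System.out.println(map.get(new CollidingKey("x"))); // null
    }
}
```

Lookups remain correct, but each one has to compare against the colliding entries, which is exactly the slow path that a well-distributed hashCode avoids.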

3) To sum up, HashMap treats each key-value pair as a single unit at the underlying layer, namely an Entry object. Internally, HashMap uses an Entry[] array to store all key-value pairs. When an Entry needs to be stored, its position in the array is determined by the hash algorithm, and its place in the linked list at that array position is determined with the equals method. When an Entry needs to be retrieved, its position in the array is again found by the hash algorithm, and the Entry is then picked out of the linked list at that position with the equals method.

4. Resize (rehash) of HashMap:

As more and more elements are added to a HashMap, the probability of hash collisions rises, because the length of the array is fixed. To keep queries efficient, the HashMap array therefore has to be expanded. Array expansion also occurs in ArrayList and is a common operation. Expanding the HashMap array brings with it the most expensive step: the position of every element in the original array must be recalculated and the element placed into the new array. This is the resize.

When is a HashMap resized? When the number of elements in the HashMap exceeds array size * loadFactor, the array is expanded. The default loadFactor is 0.75, which is a compromise value. By default the array size is 16, so when the number of elements exceeds 16 * 0.75 = 12, the array is expanded to 2 * 16 = 32, that is, its capacity is doubled, and the position of every element in the array is recalculated. This is a very expensive operation. If we can predict the number of elements a HashMap will hold, presetting the capacity accordingly can effectively improve its performance.
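As a rough sketch of the presizing advice above: to hold n elements without a resize, the requested capacity should be at least n / loadFactor. The helper name capacityFor and the sample numbers are assumptions made for this example; only the HashMap(int) constructor is real API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for choosing an initial capacity.
class Presize {
    // Smallest capacity whose threshold (capacity * loadFactor) is >= expectedSize.
    static int capacityFor(int expectedSize, float loadFactor) {
        return (int) Math.ceil(expectedSize / (double) loadFactor);
    }

    public static void main(String[] args) {
        int expected = 100;
        int initialCapacity = capacityFor(expected, 0.75f); // 134, rounded up to 256 internally
        Map<Integer, String> map = new HashMap<>(initialCapacity);
        for (int i = 0; i < expected; i++) {
            map.put(i, "v" + i); // no rehash expected while filling
        }
        System.out.println(initialCapacity); // 134
        System.out.println(map.size());      // 100
    }
}
```

Passing the expected element count itself as the capacity is a common mistake: new HashMap<>(100) yields a table of 128 with a threshold of 96, so the map would still resize before reaching 100 elements.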

5. Performance Parameters of HashMap:

HashMap provides the following constructors:

HashMap(): constructs a HashMap with an initial capacity of 16 and a load factor of 0.75.

HashMap(int initialCapacity): constructs a HashMap with an initial capacity of initialCapacity and a load factor of 0.75.

HashMap(int initialCapacity, float loadFactor): constructs a HashMap with the specified initial capacity and load factor.

The basic constructor, HashMap(int initialCapacity, float loadFactor), has two parameters: the initial capacity initialCapacity and the load factor loadFactor.

initialCapacity: the initial capacity of the HashMap, that is, the length of the underlying array (rounded up to a power of 2).

loadFactor: the load factor is defined as the actual number of elements in the hash table (n) divided by the capacity of the hash table (m).

The load factor measures how fully the space of a hash table is used. The larger the load factor, the fuller the hash table, and vice versa. For a hash table that resolves collisions by chaining, the average time to look up an element is O(1 + a), where a is the load factor. A larger load factor therefore uses space more fully but reduces lookup efficiency, while a load factor that is too small leaves the hash table sparse and wastes a great deal of space.

In the implementation of HashMap, the maximum number of elements the map may hold before it is resized is determined by the threshold field:

threshold = (int)(capacity * loadFactor);

Following the definition of the load factor, threshold is the maximum number of elements allowed for the given loadFactor and capacity. When this number is exceeded, the map is resized to lower the actual load factor. The default load factor of 0.75 is a balance between space and time efficiency. When the threshold is exceeded, the capacity of the HashMap after the resize is twice the previous capacity:

    if (size++ >= threshold)
        resize(2 * table.length);
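The interplay of threshold and doubling can be traced with a small simulation. This is a sketch that only replays the two lines quoted above together with the threshold formula; ResizeSim and capacityAfter are hypothetical names, and every insert is assumed to add a new, distinct key.

```java
// Hypothetical simulation of the growth rule quoted in this article.
class ResizeSim {
    // Table length after the given number of distinct keys has been inserted,
    // starting from the default capacity 16 and load factor 0.75.
    static int capacityAfter(int inserts) {
        float loadFactor = 0.75f;
        int capacity = 16;
        int threshold = (int) (capacity * loadFactor);
        int size = 0;
        for (int n = 0; n < inserts; n++) {
            if (size++ >= threshold) {       // same check as in addEntry
                capacity = 2 * capacity;     // resize(2 * table.length)
                threshold = (int) (capacity * loadFactor);
            }
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(capacityAfter(12)); // 16: the threshold of 12 is not yet exceeded
        System.out.println(capacityAfter(13)); // 32: the 13th insert triggers the first resize
        System.out.println(capacityAfter(25)); // 64: the new threshold of 24 is exceeded
    }
}
```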

6. Fail-Fast mechanism:

We know that java.util.HashMap is not thread-safe. Therefore, if the map is structurally modified while an iterator is in use, whether by another thread or by the same thread through anything other than the iterator itself, a ConcurrentModificationException is thrown. This is the so-called fail-fast policy.

In the source code, this policy is implemented through the modCount field. As the name suggests, modCount is the modification count: every structural modification of the HashMap increments it, and its value is copied into the iterator's expectedModCount when the iterator is initialized.

    HashIterator() {
        expectedModCount = modCount;
        if (size > 0) { // advance to first entry
            Entry[] t = table;
            while (index < t.length && (next = t[index++]) == null)
                ;
        }
    }

Note that in this version of the source, modCount is declared volatile to ensure that modifications are visible across threads.

During iteration, the iterator checks whether modCount and expectedModCount are equal. If they are not, the map has been modified outside the iterator:

    final Entry<K,V> nextEntry() {
        if (modCount != expectedModCount)
            throw new ConcurrentModificationException();
        ......
    }

The HashMap API documentation points out:

The iterators returned by all of this class's "collection view methods" are fail-fast: if the map is structurally modified at any time after the iterator is created, in any way except through the iterator's own remove method, the iterator will throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.

Note: the fail-fast behavior of an iterator cannot be guaranteed. Generally speaking, it is impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it is wrong to write a program that depends on this exception for its correctness; the fail-fast behavior of iterators should be used only to detect bugs.
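Fail-fast behavior does not need a second thread to show up. The sketch below uses the real java.util.HashMap in a single thread; the class FailFastDemo and its method names are made up for this example. It first modifies the map directly during iteration, then removes entries the supported way through the iterator's own remove method.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical demo class using the real java.util.HashMap.
class FailFastDemo {
    // Adds a new key while iterating; the iterator's next call detects the change.
    static boolean throwsOnDirectModification() {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        try {
            for (String key : map.keySet()) {
                map.put(key + "-copy", 0); // structural modification outside the iterator
            }
        } catch (ConcurrentModificationException e) {
            return true;
        }
        return false;
    }

    // Removes entries with even values through the iterator; returns the new size.
    static int removeEvenValues(Map<String, Integer> map) {
        Iterator<Map.Entry<String, Integer>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() % 2 == 0) {
                it.remove(); // keeps the iterator's expected count in sync
            }
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(throwsOnDirectModification()); // true

        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        map.put("d", 4);
        System.out.println(removeEvenValues(map)); // 2
    }
}
```

In line with the note above, the caught exception here only demonstrates the mechanism; production code should avoid the modification rather than catch the exception.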

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
