HashMap Implementation Principle and Source Code Analysis



A hash table (also called a hash map) is a very important data structure with a wide range of application scenarios: many caching technologies (such as memcache) are, at their core, a large hash table maintained in memory, and the implementation principle of HashMap comes up constantly in interview questions, so its importance is evident. This article explains the principle behind hash tables and their corresponding implementation in the Java collections framework, HashMap, and then analyzes the HashMap source code of JDK 7.

What is a hash table?

Before discussing hash tables, let's first look at how other data structures perform on basic operations such as insertion and search.

  Array: uses a contiguous block of storage units to store data. Looking up an element by a given index takes O(1) time; searching for a given value requires traversing the array and comparing it with each element one by one, which takes O(n). Of course, for a sorted array, binary search, interpolation search, or Fibonacci search can reduce the search complexity to O(log n). Ordinary insert and delete operations, which on average require moving array elements, have complexity O(n).

  Linked list: for insertion and deletion (once the target position has been found), only the references between nodes need to be adjusted, so the time complexity is O(1). Search, however, requires traversing the list and comparing nodes one by one, so its complexity is O(n).

  Binary tree: in a reasonably balanced ordered binary tree, insert, search, and delete all have an average complexity of O(log n).

  Hash table: compared with the data structures above, the performance of insertion, deletion, and search in a hash table is extremely high. Ignoring hash conflicts, each of these operations needs only a single addressing step, so the time complexity is O(1). Next we will look at how the hash table achieves this amazing constant-time O(1).

We know that data structures have only two physical storage structures: sequential storage and chained storage (stacks, queues, trees, graphs, and so on are abstractions at the logical level that are ultimately mapped onto memory in one of these two physical forms). As mentioned above, an array lets you locate an element by its index in a single step; the hash table exploits exactly this feature: the backbone of a hash table is an array.

For example, to add or search for an element, we can map the element's key to a position in the array through a function, and then locate it in one step via the array index.

        storage location = f(key)

 Specifically, this function f is generally called the hash function, and its design directly affects the quality of the hash table. For example, suppose we want to perform an insert operation on the hash table:

  

Similarly, a search uses the hash function to compute the actual storage address, and then retrieves the element at that address in the array.
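As a toy sketch of this idea (hypothetical names, not HashMap's real code), a direct-addressing table maps a key straight to an array slot, so both put and get need only a single index operation:

```java
// Toy direct-addressing table: storage location = f(key).
// All names here (f, TABLE, put, get) are made up for illustration.
public class DirectAddressDemo {
    static final String[] TABLE = new String[16];

    // Hypothetical hash function mapping a key to a slot in the array
    static int f(String key) {
        return Math.abs(key.hashCode()) % TABLE.length;
    }

    static void put(String key, String value) {
        TABLE[f(key)] = value;   // one array write, O(1)
    }

    static String get(String key) {
        return TABLE[f(key)];    // one array read, O(1)
    }

    public static void main(String[] args) {
        put("apple", "red");
        System.out.println(get("apple")); // prints "red"
    }
}
```

Note this toy ignores conflicts entirely: two keys with the same f(key) would overwrite each other, which is exactly the problem the next section addresses.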

  Hash conflict

However, nothing is perfect. What if the hash function maps two different elements to the same actual storage address? That is, when we hash an element, obtain a storage address, and try to insert it, we find the address is already occupied by another element. This is called a hash conflict, also known as a hash collision. As mentioned earlier, the design of the hash function is crucial: a good hash function keeps the computation simple and distributes hash addresses uniformly. But we must be clear that the array is a contiguous, fixed-length block of memory, and even an excellent hash function cannot guarantee that the computed storage addresses never collide. So how are hash conflicts resolved? There are several approaches: open addressing (on a conflict, keep probing for the next unoccupied address), rehashing with additional hash functions, and separate chaining (the linked-list method). HashMap adopts separate chaining, that is, the array + linked list approach.

HashMap implementation principle

The backbone of HashMap is an Entry array. Entry is the basic unit of HashMap; each Entry contains a key-value pair.

    // The backbone array of HashMap; as can be seen, it is an Entry array.
    // Its initial value is the empty array {}. The length of this array must
    // always be a power of 2 -- why will be analyzed in detail later.
    transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE;

Entry is a static inner class in HashMap. The code is as follows:

    static class Entry<K,V> implements Map.Entry<K,V> {
        final K key;
        V value;
        Entry<K,V> next; // reference to the next Entry; a singly linked list structure
        int hash;        // the hash of the key's hashCode, stored in the Entry to avoid repeated calculation

        /**
         * Creates new entry.
         */
        Entry(int h, K k, V v, Entry<K,V> n) {
            value = v;
            next = n;
            key = k;
            hash = h;
        }
    }

Therefore, the overall structure of HashMap is as follows:

  

  In short, HashMap consists of an array plus linked lists. The array is the body of HashMap, while the linked lists exist mainly to resolve hash conflicts. If the located array slot contains no linked list (the current entry's next points to null), then search, add, and other operations need only a single addressing step. If the located slot does contain a linked list, an add is still O(1), because the newest Entry is inserted at the head of the list by simply rewiring the references; a search, however, must traverse the list and compare entries one by one using the key object's equals method. Therefore, for performance, the fewer linked lists appear in a HashMap, the better.

Other important fields

    // the number of key-value pairs actually stored
    transient int size;

    // threshold: when table == EMPTY_TABLE, this holds the initial capacity (16 by default);
    // once memory has been allocated for the table, threshold is generally
    // capacity * loadFactor. HashMap consults this value when deciding whether
    // to resize, as detailed later.
    int threshold;

    // load factor, representing how full the table is allowed to get; 0.75 by default
    final float loadFactor;

    // used for fail-fast behavior: because HashMap is not thread-safe, if the
    // structure of the map is changed by another thread during iteration
    // (e.g. put, remove), a ConcurrentModificationException is thrown
    transient int modCount;

HashMap has four constructors. If initialCapacity and loadFactor are not passed in, the other constructors fall back to the default values.

initialCapacity defaults to 16, and loadFactor defaults to 0.75.
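As a quick check of what these defaults imply (a sketch with a made-up helper, not HashMap's internal code): with capacity 16 and load factor 0.75, the map resizes once it holds 16 × 0.75 = 12 entries.

```java
// Sketch of the resize threshold arithmetic; threshold() is a hypothetical
// helper mirroring how HashMap derives threshold from capacity * loadFactor.
public class ThresholdDemo {
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        System.out.println(threshold(16, 0.75f)); // 12: default map resizes at 12 entries
        System.out.println(threshold(32, 0.75f)); // 24: after one doubling
    }
}
```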

Let's take a look at one of them.

    public HashMap(int initialCapacity, float loadFactor) {
        // validate the given initial capacity; it cannot exceed MAXIMUM_CAPACITY = 1 << 30 (2^30)
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " + loadFactor);

        this.loadFactor = loadFactor;
        threshold = initialCapacity;

        init(); // init() has an empty body in HashMap itself; subclasses such as LinkedHashMap override it
    }

From the code above we can see that in the regular constructors, no memory is allocated for the table array (except in the constructor that takes a Map as a parameter); the table array is actually built only when the first put operation is executed.

OK. Let's take a look at the implementation of the put operation.

    public V put(K key, V value) {
        // if the table is still the empty array {}, inflate it (allocate actual
        // memory for the table), passing threshold as the argument; at this point
        // threshold holds initialCapacity, which defaults to 1 << 4 (2^4 = 16)
        if (table == EMPTY_TABLE) {
            inflateTable(threshold);
        }
        // a null key is handled specially
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key);                 // further hash the key's hashCode to spread it evenly
        int i = indexFor(hash, table.length); // obtain the actual position in the table
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            // if an entry with an equal key already exists, overwrite it:
            // replace the old value with the new one and return the old value
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }
        modCount++;                    // fail-fast bookkeeping: the structure of the map has changed
        addEntry(hash, key, value, i); // add a new entry
        return null;
    }

Let's take a look at the inflateTable method.

    private void inflateTable(int toSize) {
        int capacity = roundUpToPowerOf2(toSize); // capacity must be a power of 2
        // assign threshold here: the minimum of capacity * loadFactor and
        // MAXIMUM_CAPACITY + 1; capacity * loadFactor will not exceed
        // MAXIMUM_CAPACITY unless loadFactor is greater than 1
        threshold = (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
        table = new Entry[capacity];
        initHashSeedAsNeeded(capacity);
    }

The inflateTable method allocates storage in memory for the backbone array table. Through roundUpToPowerOf2(toSize), capacity becomes the smallest power of two greater than or equal to toSize; for example, if toSize = 13, capacity = 16; if toSize = 16, capacity = 16; if toSize = 17, capacity = 32.

    private static int roundUpToPowerOf2(int number) {
        // assert number >= 0 : "number must be non-negative";
        return number >= MAXIMUM_CAPACITY
                ? MAXIMUM_CAPACITY
                : (number > 1) ? Integer.highestOneBit((number - 1) << 1) : 1;
    }

roundUpToPowerOf2 ensures the array length is a power of two. Integer.highestOneBit returns the value of its argument's leftmost (highest) one bit, with all other bits set to 0.
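The rounding behavior can be checked with a small standalone sketch that copies the shape of the JDK 7 method shown above (MAXIMUM_CAPACITY is redefined locally just for the demo):

```java
// Standalone check of roundUpToPowerOf2: smallest power of two >= number.
public class RoundUpDemo {
    // Same constant HashMap uses: 1 << 30
    static final int MAXIMUM_CAPACITY = 1 << 30;

    static int roundUpToPowerOf2(int number) {
        return number >= MAXIMUM_CAPACITY
                ? MAXIMUM_CAPACITY
                : (number > 1) ? Integer.highestOneBit((number - 1) << 1) : 1;
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOf2(13)); // 16: highestOneBit(12 << 1 = 24) = 16
        System.out.println(roundUpToPowerOf2(16)); // 16: highestOneBit(15 << 1 = 30) = 16
        System.out.println(roundUpToPowerOf2(17)); // 32: highestOneBit(16 << 1 = 32) = 32
    }
}
```

The `(number - 1) << 1` trick is what keeps an exact power of two (like 16) from being rounded up to the next one.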

Hash Function

    // This is a somewhat magical function: through a series of shift and XOR
    // operations, it further mixes the bits of the key's hashCode so that the
    // final storage positions are distributed as evenly as possible.
    final int hash(Object k) {
        int h = hashSeed;
        if (0 != h && k instanceof String) {
            return sun.misc.Hashing.stringHash32((String) k);
        }
        h ^= k.hashCode();
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

The value computed by the hash function above is then processed by indexFor to obtain the actual storage position.

    /**
     * Returns the array index for hash code h.
     */
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

h & (length - 1) guarantees that the resulting index is always within the bounds of the array. For example, with the default capacity 16, length - 1 = 15 and, say, h = 18; converted to binary, the calculation is:

        1  0  0  1  0    &   0  1  1  1  1    __________________        0  0  0  1  0    = 2

The final computed index is 2. Some versions use a modulo operation here, which likewise keeps the index within the array bounds, but bitwise operations are much cheaper for the computer (HashMap uses a great deal of bit manipulation).
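A minimal sketch demonstrating both points — the mask keeps the index in range, and for power-of-two lengths it agrees exactly with the modulo:

```java
// Check that h & (length - 1) equals h % length when length is a power of two.
public class IndexForDemo {
    // Same shape as JDK 7's indexFor
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        System.out.println(indexFor(18, 16)); // 2: 10010 & 01111 = 00010
        // For power-of-two lengths the mask is equivalent to the modulo:
        for (int h = 0; h < 100; h++) {
            if (indexFor(h, 16) != h % 16)
                throw new AssertionError("mask and modulo disagree at h=" + h);
        }
        System.out.println("mask == modulo for all h"); // never throws above
    }
}
```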

Therefore, the final process for determining the storage location is: key --> hashCode --> hash() --> indexFor() --> array index.
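This locate pipeline can be sketched as follows (the hash step copies the bit-spreading lines shown above, omitting the hashSeed and String special cases, so it is a simplified stand-in, not the full JDK 7 implementation):

```java
// Condensed sketch of the locate pipeline: key -> hashCode -> hash -> index.
public class LocateDemo {
    // Simplified version of JDK 7's hash(): spreads high bits into low bits
    static int hash(int hashCode) {
        int h = hashCode;
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        String key = "example";
        int bucket = indexFor(hash(key.hashCode()), 16);
        // whatever the key, the mask guarantees the bucket lies inside the table
        System.out.println(bucket >= 0 && bucket < 16); // true
    }
}
```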

Let's take a look at the implementation of addEntry:

    void addEntry(int hash, K key, V value, int bucketIndex) {
        if ((size >= threshold) && (null != table[bucketIndex])) {
            // when size reaches the threshold AND a hash conflict is about to
            // occur at the target bucket, expand the array to twice its length
            resize(2 * table.length);
            hash = (null != key) ? hash(key) : 0;
            bucketIndex = indexFor(hash, table.length);
        }
        createEntry(hash, key, value, bucketIndex);
    }

The code above shows that the array is expanded when the size exceeds the threshold and a hash conflict occurs at the target bucket. Resizing creates a new array twice the length of the old one and then transfers all elements of the current Entry array into it. Because every element must be rehashed and moved, resizing is a relatively expensive operation.

Why must the table array length of HashMap be the power of 2?

Let's continue to look at the resize method mentioned above.

    void resize(int newCapacity) {
        Entry[] oldTable = table;
        int oldCapacity = oldTable.length;
        if (oldCapacity == MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return;
        }
        Entry[] newTable = new Entry[newCapacity];
        transfer(newTable, initHashSeedAsNeeded(newCapacity));
        table = newTable;
        threshold = (int) Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1);
    }

When the array is expanded its length changes, and since the storage index is h & (length - 1), the index may change as well and must be recomputed. Let's look at the transfer method.

    void transfer(Entry[] newTable, boolean rehash) {
        int newCapacity = newTable.length;
        // traverse the old table entry by entry, recompute each index, and copy
        // the old data into the new array (the array stores references, not the
        // actual objects, so only references are copied)
        for (Entry<K,V> e : table) {
            while (null != e) {
                Entry<K,V> next = e.next;
                if (rehash) {
                    e.hash = null == e.key ? 0 : hash(e.key);
                }
                int i = indexFor(e.hash, newCapacity);
                // point the current entry's next at the new slot; newTable[i]
                // may be empty or may already hold an entry chain -- if it holds
                // a chain, the entry is simply inserted at the head of the list
                e.next = newTable[i];
                newTable[i] = e;
                e = next;
            }
        }
    }

This method traverses the data in the old array one by one and places it into the new, expanded array. The array index is computed by mixing the bits of the key's hashCode with the hash function and then ANDing the result with length - 1 to obtain the final index.

The length of the HashMap array must be a power of two. For example, the binary form of 16 is 10000, so length - 1 is 15, binary 01111; after expansion the array length is 32, binary 100000, so length - 1 is 31, binary 011111. Notice that all the low bits of the mask are 1, and after expansion the only difference is one extra 1 in the next higher bit. Consequently, in h & (length - 1), as long as that extra bit of h is 0, the new index equals the old index, which greatly reduces how many already-placed elements must move to a different slot after a resize.

  

In addition, because the array length is a power of two and all the low bits of length - 1 are 1, the resulting array indices are distributed more evenly. For example:

As shown above, the upper bits of h do not affect the result of the & operation (the hash function uses its various bit operations precisely to scatter the low bits well), so we only need to focus on the low bits. If all the low bits of the mask are 1, then any change in the low bits of h changes the result: to obtain, say, the storage location index = 21, there is exactly one low-bit pattern of h that produces it. This is why the array length is designed to be a power of two. If it were not, the low bits of length - 1 would not all be 1: the low-bit patterns of h mapping to index = 21 would no longer be unique, so the probability of hash conflicts rises; at the same time, any index bit aligned with a 0 in the mask can never be 1, so the corresponding array slots are wasted.
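The wasted slots can be demonstrated directly. Using a hypothetical non-power-of-two length 15 (which HashMap never actually allows, so this is purely a sketch), the mask is 14 (binary 1110): the lowest index bit is forced to 0, so odd buckets are unreachable.

```java
// Count how many distinct buckets indexFor can ever produce for a given length.
public class MaskDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    static int reachableBuckets(int length) {
        java.util.Set<Integer> seen = new java.util.HashSet<Integer>();
        for (int h = 0; h < 1024; h++)
            seen.add(indexFor(h, length));
        return seen.size();
    }

    public static void main(String[] args) {
        // length 16 -> mask 01111: every one of the 16 buckets is reachable
        System.out.println(reachableBuckets(16)); // 16
        // hypothetical length 15 -> mask 01110: the lowest bit is always 0,
        // so only the 8 even indices {0,2,4,...,14} are ever produced
        System.out.println(reachableBuckets(15)); // 8
    }
}
```

Nearly half the buckets of the length-15 table can never hold an entry, and collisions pile up in the remaining ones — exactly the waste described above.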

Get Method

    public V get(Object key) {
        // a null key is always stored in table[0], so retrieve it directly from there
        if (key == null)
            return getForNullKey();
        Entry<K,V> entry = getEntry(key);
        return null == entry ? null : entry.getValue();
    }

The get method returns the value corresponding to the key; if the key is null, the value is fetched directly from table[0]. Let's look at the getEntry method.

    final Entry<K,V> getEntry(Object key) {
        if (size == 0) {
            return null;
        }
        // compute the hash value from the key's hashCode
        int hash = (key == null) ? 0 : hash(key);
        // indexFor (h & (length - 1)) yields the final array index;
        // then walk the chain at that slot to find the matching entry
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash &&
                ((k = e.key) == key || (key != null && key.equals(k))))
                return e;
        }
        return null;
    }

As can be seen, the get method is relatively simple: key (hashCode) --> hash --> indexFor --> final index; look up table[i] at that position, and if a linked list is present, traverse it and find the matching entry with the key's equals method. One detail deserves attention: some people think the check e.hash == hash is unnecessary once the slot has been located and the list is being traversed, and that equals alone would suffice. In fact, if the passed-in key object overrides equals but not hashCode, and happens to locate the same array slot, equals alone might report it equal even though its hashCode differs from the stored entry's. By the hashCode contract of Object, the stored entry must not be returned in that case; null should be returned instead. The example below illustrates this further.

This is why, when overriding the equals method, you must override the hashCode method as well.

The source code analysis of HashMap ends here. Finally, let's look at a common question. All kinds of materials mention: "hashCode must be overridden when equals is overridden." Let's use a small example to see what goes wrong if equals is overridden but hashCode is not.

    /**
     * Created by chengxiao on 2016/11/15.
     */
    public class MyTest {
        private static class Person {
            int idCard;
            String name;

            public Person(int idCard, String name) {
                this.idCard = idCard;
                this.name = name;
            }

            @Override
            public boolean equals(Object o) {
                if (this == o) {
                    return true;
                }
                if (o == null || getClass() != o.getClass()) {
                    return false;
                }
                Person person = (Person) o;
                // two Person objects are considered equivalent if their idCard matches
                return this.idCard == person.idCard;
            }
        }

        public static void main(String[] args) {
            HashMap<Person, String> map = new HashMap<Person, String>();
            Person person = new Person(1234, "Qiao Feng");
            // put the entry into the map
            map.put(person, "Tianlong Babu");
            // logically, this get should output "Tianlong Babu"
            System.out.println("Result: " + map.get(new Person(1234, "Xiao Feng")));
        }
    }

Actual output result:

Result: null

With some understanding of how HashMap works, this result is easy to explain. Although the keys used in the put and get operations are logically equal (equals returns true), hashCode was not overridden. So during put the pipeline is key (hashcode1) --> hash --> indexFor --> final index, while during get it is key (hashcode2) --> hash --> indexFor --> final index. Because hashcode1 does not equal hashcode2, a different array slot is located and the logically incorrect value null is returned. (The two keys might also happen to land in the same slot, but as mentioned in the get method above, the entry's hash value is then compared and found unequal.)

Therefore, when overriding the equals method you must also override hashCode, and ensure that two objects judged equal by equals return the same integer from hashCode. For two objects that equals judges unequal, their hashCode values may still be the same (this merely produces a hash conflict, which should be avoided as far as possible).
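As a sketch of the fix (FixedPerson is a hypothetical class, not from the article's source), overriding hashCode consistently with equals makes the earlier lookup succeed:

```java
// The fix for the MyTest example: equal objects (same idCard) now produce the
// same hashCode, so put and get locate the same bucket.
public class FixedPerson {
    final int idCard;
    final String name;

    FixedPerson(int idCard, String name) {
        this.idCard = idCard;
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        return this.idCard == ((FixedPerson) o).idCard;
    }

    // consistent with equals: equality is based on idCard, so hash on idCard
    @Override
    public int hashCode() {
        return idCard;
    }

    public static void main(String[] args) {
        java.util.HashMap<FixedPerson, String> map =
                new java.util.HashMap<FixedPerson, String>();
        map.put(new FixedPerson(1234, "Qiao Feng"), "Tianlong Babu");
        // same idCard, different name: equals AND hashCode now agree
        System.out.println(map.get(new FixedPerson(1234, "Xiao Feng"))); // Tianlong Babu
    }
}
```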

Summary

This article explained the implementation principle of HashMap, analyzed it further through its source code, and touched on the design reasons behind some of the details. Finally, it briefly covered why hashCode must be overridden when equals is overridden. I hope this article helps you; discussion and corrections are welcome. Thank you for your support!
