Transferred from: http://annegu.iteye.com/blog/539465
HashMap is a very common and widely used data type. I recently looked into it, so this post is a review. There are already many articles about HashMap online, but this one is a summary of my own study, and I am sharing it here for discussion.
1. Data structure of HashMap
To understand HashMap, we first need to understand its data structure. In the Java programming language there are two basic structures: the array and the simulated pointer (the reference). All data structures can be built from these two, and HashMap is no exception. HashMap is actually a combination of an array and linked lists (in data-structure terms, commonly called a "chained hash table"). In the original figure, the horizontal direction is the array and the vertical direction shows that each array element is actually a linked list.
As we can see, the backbone of a HashMap is an array, and when a new HashMap is created, an array is initialized. Let's look at the Java code:
/**
 * The table, resized as necessary. Length MUST always be a power of two.
 * FIXME: pay attention to this sentence; the reason is explained later.
 */
transient Entry[] table;

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    final int hash;
    Entry<K,V> next;
    ..........
}
The Entry above is the element type of the array. Each Entry holds a reference to the next element, and this is what forms the linked list.
When we put an element into the HashMap, the hash of the key is first used to compute the element's position (index) in the array, and the element is then placed at that position. If other elements already occupy that position, the elements at the same position are stored as a linked list: newly added elements go at the head of the chain and the earliest ones end up at the tail. When we get an element from the HashMap, the hashcode of the key is computed first to find the corresponding position in the array, and then the key's equals method is used to find the required element in the linked list at that position. From this we can see that if the linked list at each position held only one element, HashMap's get would be at its most efficient. The ideal is always nice, though, and reality is always harder, haha~
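To make the put and get steps above concrete, here is a minimal sketch of a chained hash table. It is not the real java.util.HashMap source: the class name SimpleChainedMap, the Node class, and the fixed table length of 16 are assumptions made for illustration, and the sketch supports neither null keys nor resizing.

```java
// A simplified sketch of chained hashing; NOT the real java.util.HashMap source.
// SimpleChainedMap, Node and the fixed capacity are hypothetical names/choices.
class SimpleChainedMap<K, V> {
    // One node of a bucket's linked list.
    private static class Node<K, V> {
        final K key;
        V value;
        final int hash;
        Node<K, V> next;

        Node(K key, V value, int hash, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.hash = hash;
            this.next = next;
        }
    }

    // Fixed-size table whose length is a power of two; this sketch never resizes.
    @SuppressWarnings("unchecked")
    private final Node<K, V>[] table = (Node<K, V>[]) new Node[16];

    private int indexFor(int hash) {
        return hash & (table.length - 1);
    }

    // Returns the previous value for the key, or null if there was none.
    V put(K key, V value) {
        int hash = key.hashCode(); // null keys are not supported in this sketch
        int i = indexFor(hash);
        for (Node<K, V> n = table[i]; n != null; n = n.next) {
            if (n.hash == hash && n.key.equals(key)) {
                V old = n.value;
                n.value = value;
                return old;
            }
        }
        // A new entry goes to the head of the chain.
        table[i] = new Node<>(key, value, hash, table[i]);
        return null;
    }

    // Walks the chain at the key's position and compares keys with equals.
    V get(K key) {
        int hash = key.hashCode();
        for (Node<K, V> n = table[indexFor(hash)]; n != null; n = n.next) {
            if (n.hash == hash && n.key.equals(key)) {
                return n.value;
            }
        }
        return null;
    }
}
```

With Integer keys, 1 and 17 both land at index 1 of the 16-slot table, so they share one chain and get has to tell them apart with equals.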
2. Hash algorithm
As we have seen, to find an element in a HashMap we need to derive its position in the array from the hash value of the key. How this position is calculated is the hash algorithm. As said earlier, HashMap's data structure combines an array with linked lists, so we naturally want the elements to be distributed as evenly as possible, ideally with only one element at each position. Then, once the hash algorithm gives us the position, we immediately know that the element there is the one we want, without having to traverse a linked list.
So the first idea that comes to mind is to take the hashcode modulo the array length, which spreads the elements fairly evenly. However, the modulo operation is relatively expensive. Is there a faster, cheaper way? This is how Java does it:
static int indexFor(int h, int length) {
    return h & (length-1);
}
First the hashcode of the key is obtained, and then it is combined with (array length - 1) using the "and" operation (&). It looks very simple, but there is a subtlety in it. For example, if the array length is 2 to the 4th power, the hashcode is ANDed with 2 to the 4th power minus 1. Many people ask why HashMap is most efficient when its array size is a power of 2. Let me use 2 to the 4th power as an example to explain why a power-of-2 array size gives HashMap its best access performance.
In the figure, the two groups on the left use an array length of 16 (2 to the 4th power), and the two groups on the right use an array length of 15. In both cases the hashcodes are 8 and 9. With length 15, both are ANDed with 1110 and produce the same result, which means they are placed at the same position in the array. This is a collision: 8 and 9 end up on the same linked list, so a query has to traverse that list to get 8 or 9, which lowers query efficiency. We can also see that when the array length is 15, every hashcode is ANDed with 14 (1110), so the last bit is always 0. The positions 0001, 0011, 0101, 0111, 1001, 1011 and 1101 can therefore never store an element. The wasted space is considerable, and worse, the number of usable positions is much smaller than the array length, which further raises the probability of collisions and slows down queries!
So when the array length is 2 to the power of n, different keys are less likely to compute the same index. The data is then distributed more evenly across the array, collisions are less likely, and a query is correspondingly less likely to have to traverse the linked list at a position, so query efficiency is higher.
With that in mind, let's look back at the default array size in HashMap. The source code shows that it is 16. Why 16 and not 15 or 20? After annegu's explanation above the answer is clear: 16 is an integer power of 2. With small amounts of data, 16 causes fewer collisions between keys than 15 or 20 would, and so speeds up queries.
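The comparison between length 16 and length 15 can be checked with a short sketch. The class name BucketSpreadDemo and the helper usedIndexes are assumptions made for illustration; indexFor has the same shape as the method quoted in the text, and the hash codes 0 to 99 are simply a convenient sample.

```java
import java.util.TreeSet;

class BucketSpreadDemo {
    // Same shape as the indexFor method quoted in the text.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Collects every index that the hash codes 0..99 land on for a given table length.
    static TreeSet<Integer> usedIndexes(int length) {
        TreeSet<Integer> used = new TreeSet<>();
        for (int h = 0; h < 100; h++) {
            used.add(indexFor(h, length));
        }
        return used;
    }

    public static void main(String[] args) {
        // Length 16: the mask is 1111, so 8 and 9 land in different buckets.
        System.out.println(indexFor(8, 16) + " " + indexFor(9, 16)); // prints "8 9"
        // Length 15: the mask is 1110, so 8 and 9 collide in bucket 8.
        System.out.println(indexFor(8, 15) + " " + indexFor(9, 15)); // prints "8 8"

        System.out.println(usedIndexes(16).size()); // prints 16
        System.out.println(usedIndexes(15));        // prints [0, 2, 4, 6, 8, 10, 12, 14]

        // For non-negative h and a power-of-two length, & gives the same result as %.
        for (int h = 0; h < 100; h++) {
            if (indexFor(h, 16) != h % 16) {
                throw new IllegalStateException("mismatch at h=" + h);
            }
        }
    }
}
```

With length 15 only the eight even indexes are ever used, so the same keys are squeezed into roughly half of the array.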
Therefore, when storing large amounts of data, it is best to specify the HashMap size in advance as an integer power of 2. If the specified value is not a power of 2, HashMap is initialized with the smallest power of 2 that is greater than or equal to the specified value, as follows (in the HashMap constructor):
// Find a power of 2 >= initialCapacity
int capacity = 1;
while (capacity < initialCapacity)
    capacity <<= 1;
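As a quick check of the loop above, the following sketch wraps it in a method and tries a few values. The class name CapacityRounding and the method name roundUpToPowerOfTwo are assumptions made for illustration, and the sketch assumes the requested capacity is no larger than 2^30.

```java
class CapacityRounding {
    // Mirrors the loop quoted above: the smallest power of 2 that is >= initialCapacity.
    // Assumes initialCapacity <= 2^30; larger values would overflow this sketch.
    static int roundUpToPowerOfTwo(int initialCapacity) {
        int capacity = 1;
        while (capacity < initialCapacity) {
            capacity <<= 1;
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOfTwo(15));   // prints 16
        System.out.println(roundUpToPowerOfTwo(16));   // prints 16
        System.out.println(roundUpToPowerOfTwo(20));   // prints 32
        System.out.println(roundUpToPowerOfTwo(1000)); // prints 1024
    }
}
```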
3. HashMap's resize
As more and more elements are added to a HashMap, the probability of collisions grows, because the length of the array is fixed. To keep queries efficient, the HashMap array has to be expanded. Array expansion also happens in ArrayList, so it is a common operation. Many people have expressed doubts about its performance, but if we think of the "amortization" principle we can relax. After the HashMap array is expanded, the most performance-consuming step appears: every element in the original array must have its position in the new array recalculated and be moved there. This is resize.
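The need to recalculate positions can be seen in a small sketch. The class name ResizeDemo, the helper positions, and the three hash values are assumptions chosen for illustration; indexFor has the same shape as the method shown in section 2.

```java
import java.util.Arrays;

class ResizeDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Recomputes the bucket index of every hash for a table of the given length.
    static int[] positions(int[] hashes, int length) {
        int[] result = new int[hashes.length];
        for (int i = 0; i < hashes.length; i++) {
            result[i] = indexFor(hashes[i], length);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] hashes = {5, 21, 37}; // hypothetical hash values chosen for illustration
        // In a table of length 16 all three share position 5.
        System.out.println(Arrays.toString(positions(hashes, 16))); // prints [5, 5, 5]
        // After doubling to 32, the element with hash 21 moves to position 21.
        System.out.println(Arrays.toString(positions(hashes, 32))); // prints [5, 21, 5]
    }
}
```

Because the index depends on the table length, no element can simply be copied across; each one has to be visited and re-indexed, which is where the cost of resize comes from.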
So when does HashMap expand? When the number of elements in the HashMap exceeds array size * loadFactor, the array is expanded. The default loadFactor is 0.75, so with the default array size of 16, once the number of elements exceeds 16 * 0.75 = 12 the array is expanded to 2 * 16 = 32, that is, doubled, and the position of every element in the array is then recalculated. This is a very expensive operation, so if we can predict the number of elements in the HashMap, presetting the capacity can effectively improve HashMap's performance. For example, with 1000 elements we might write new HashMap(1000). In theory new HashMap(1024) would be more appropriate, but as annegu said above, even if we pass 1000, HashMap automatically rounds it up to 1024. Yet new HashMap(1024) is still not the best choice, because 0.75 * 1024 = 768 < 1000. To make 0.75 * size > 1000 we need new HashMap(2048), which both satisfies the & requirement and avoids the resize problem.
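The sizing arithmetic can be written as a small helper. This is only a sketch: the class name PresizeDemo and the method capacityFor are assumptions made for illustration and are not part of the HashMap API. The helper picks the smallest power-of-2 capacity whose threshold (capacity * loadFactor) is at least the expected number of elements.

```java
import java.util.HashMap;
import java.util.Map;

class PresizeDemo {
    // Smallest power-of-two capacity whose threshold (capacity * loadFactor)
    // is at least expectedSize. Hypothetical helper, not part of java.util.
    static int capacityFor(int expectedSize, float loadFactor) {
        int capacity = 1;
        while (capacity * loadFactor < expectedSize) {
            capacity <<= 1;
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(capacityFor(12, 0.75f));   // prints 16   (16 * 0.75 = 12)
        System.out.println(capacityFor(1000, 0.75f)); // prints 2048 (1024 * 0.75 = 768 is too small)

        // Pass the computed capacity to the HashMap constructor.
        Map<String, Integer> map = new HashMap<>(capacityFor(1000, 0.75f));
        for (int i = 0; i < 1000; i++) {
            map.put("key" + i, i);
        }
        System.out.println(map.size()); // prints 1000
    }
}
```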
4. Overriding the key's hashCode and equals methods
In the first part, on HashMap's data structure, annegu described how the get method works: it first computes the hashcode of the key and finds the corresponding position in the array, and then uses the key's equals method to find the required element in the linked list at that position. The hashCode and equals methods are therefore the two key methods for locating an element.
A HashMap key can be an object of any type, for example a User object. To ensure that two User objects with the same properties have the same hashcode, we need to override the hashCode method, for example by tying the hashcode calculation to the ID of the User object. Then, as long as two User objects have the same ID, their hashcodes are the same, and the position in the HashMap array can be found. If there are several elements at that position, the key's equals method is also needed to find the required element in the linked list there. So overriding hashCode alone is not enough; equals must be overridden as well. Of course, following normal logic, equals is usually defined according to the actual business content, for example by deciding whether two User objects are equal based on their IDs.
When overriding the equals method, the following three properties must hold:
(1) Reflexivity: a.equals(a) must be true.
(2) Symmetry: if a.equals(b) is true, then b.equals(a) must also be true.
(3) Transitivity: if a.equals(b) is true and b.equals(c) is true, then a.equals(c) must also be true.
By overriding the equals and hashCode methods of the key object, we can use any business object as the key of a map (provided that you really have such a need).
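A minimal sketch of such a key class follows. The User class, its id and name fields, and the demo class UserKeyDemo are assumptions made for illustration; the only point is that equals and hashCode are both derived from the same ID.

```java
import java.util.HashMap;
import java.util.Map;

class UserKeyDemo {
    public static void main(String[] args) {
        Map<User, String> roles = new HashMap<>();
        roles.put(new User(42L, "alice"), "admin");

        // A different instance with the same id finds the same entry.
        System.out.println(roles.get(new User(42L, "alice (reloaded)"))); // prints admin
        // A user with another id is not found.
        System.out.println(roles.get(new User(7L, "bob")));               // prints null
    }
}

// Hypothetical User class for illustration; equality is based on id only.
final class User {
    private final long id;
    private final String name;

    User(long id, String name) {
        this.id = id;
        this.name = name;
    }

    String getName() {
        return name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;             // reflexive
        if (!(o instanceof User)) return false; // also rejects null
        return id == ((User) o).id;             // symmetric and transitive on id
    }

    @Override
    public int hashCode() {
        // Derived from the same field as equals, so equal users share a hashcode.
        return Long.hashCode(id);
    }
}
```

If only equals were overridden, the two User instances with ID 42 would usually get different hashcodes from Object.hashCode, and the lookup would most likely search the wrong position.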
Summary:
This article describes the structure of HashMap, the implementation of the hash function in HashMap and its characteristics, the root cause of the performance cost of resize, and the basic requirements for using an ordinary domain model object as a key. The implementation of the hash function in particular can be called the essence of the whole HashMap. Only when you really understand this hash function can you claim some understanding of HashMap.
Deep understanding of HashMap