How HashMap resolves hash collisions

Source: Internet
Author: User
Tags: array length

In Java there are two basic building blocks for data structures: the array and the reference (which plays the role of a pointer). Every data structure can be built from these two, and HashMap is no exception. Suppose a program puts several key-value pairs into a HashMap, as in the following snippet:

HashMap<String, Object> m = new HashMap<String, Object>();
m.put("A", "rrr1");
m.put("B", "TT9");
m.put("C", "Tt8");
m.put("D", "G7");

HashMap uses a hash algorithm to decide where each element is stored. When the program calls map.put(String, Object), HashMap calls the key's hashCode() method to obtain its hash code. Every Java object has a hashCode() method that returns this value. Once it has the key's hash code, HashMap uses it to determine where the element is stored. The source code is as follows:
public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key.hashCode());
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        // Check whether an entry with the same hash and the same key already exists
        // at this index. If so, the new value overwrites the old value and the old
        // value is returned.
        // Keys that map to the same index but are not equal to each other are a
        // hash collision.
        // Because of collisions, a single bucket holds not one Entry but a chain
        // of Entries.
        // The loop walks the chain entry by entry until it finds the Entry it is
        // looking for. If that Entry is at the very end of the chain (it was the
        // first one placed in the bucket), the loop has to run to the last node.
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            return oldValue;
        }
    }
    modCount++;
    addEntry(hash, key, value, i);
    return null;
}

The program above uses an important nested interface, Map.Entry: each Map.Entry is one key-value pair. The code also shows that when HashMap decides where to store a key-value pair, it ignores the value entirely and computes the storage location from the key alone. This supports the earlier conclusion: the value in a Map can be treated as an attachment of the key, and once the location of the key is determined, the value is stored there with it. To make hash collisions visible, I modified the HashMap code: the initial capacity is 16, I put more than 16 elements into the map, and I disabled its resize() method so that the table could not grow. The structure of the underlying Entry[] table array is as follows:

Each bucket in the HashMap has become a singly linked list. One problem every hash table must solve is the collision of hash values, and there are two usual approaches: separate chaining (the linked-list method) and open addressing. Separate chaining collects the objects whose hashes map to the same slot into a linked list attached to that slot. Open addressing is a probing scheme: when a slot is already occupied, it keeps looking for the next available slot. java.util.HashMap uses separate chaining, and the list is singly linked. The core code that builds the list is as follows:

void addEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K,V> e = table[bucketIndex];
    table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
    if (size++ >= threshold)
        resize(2 * table.length);
}

The code is short, but it contains a deliberate design: the newly added Entry is always placed at index bucketIndex of the table array. If an Entry already exists at that index, the new Entry points to the existing one, which creates an Entry chain. If there is no Entry at that index (the variable e in the code above is null), the new Entry points to null and no chain is formed.

When there are no hash collisions and no linked lists have formed, HashMap lookups are fast: get() goes directly to the element. Once a bucket holds an Entry chain rather than a single Entry, the lookup must walk the chain in order until it finds the Entry it wants. If that Entry is at the very end of the chain (it was the first one placed in the bucket), the loop has to reach the last node to find it.
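The head insertion and the cost of walking a chain can be sketched in a few lines. This is only an illustration, not the JDK code: the Node class and the helper names addFirst, describe and stepsToFind are invented for the example, and a node holds nothing but a key.

```java
// Minimal sketch of one bucket's chain (hypothetical Node class, not the JDK's Entry).
class ChainDemo {
    static final class Node {
        final String key;
        final Node next;
        Node(String key, Node next) { this.key = key; this.next = next; }
    }

    // Mirrors table[bucketIndex] = new Entry(hash, key, value, e):
    // the new node becomes the head and points to the previous head.
    static Node addFirst(Node head, String key) {
        return new Node(key, head);
    }

    // Renders the chain from head to tail.
    static String describe(Node head) {
        StringBuilder sb = new StringBuilder();
        for (Node n = head; n != null; n = n.next) {
            if (sb.length() > 0) sb.append(" -> ");
            sb.append(n.key);
        }
        return sb.toString();
    }

    // Number of nodes visited before the key is found, or -1 if it is absent.
    static int stepsToFind(Node head, String key) {
        int steps = 0;
        for (Node n = head; n != null; n = n.next) {
            steps++;
            if (n.key.equals(key)) return steps;
        }
        return -1;
    }

    public static void main(String[] args) {
        Node bucket = null;
        bucket = addFirst(bucket, "A");
        bucket = addFirst(bucket, "B");
        bucket = addFirst(bucket, "C");
        System.out.println(describe(bucket));         // C -> B -> A
        System.out.println(stepsToFind(bucket, "C")); // 1
        System.out.println(stepsToFind(bucket, "A")); // 3
    }
}
```

The key inserted first ends up at the tail, so it is the most expensive one to find.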

When a HashMap is created it has a load factor, 0.75 by default, which is a tradeoff between time and space. Increasing the load factor reduces the memory used by the hash table (the Entry array) but increases the time cost of lookups, and lookups are the most frequent operation (both get() and put() perform one). Decreasing the load factor speeds up lookups but increases the memory used by the hash table.

I. Overview of HashMap

HashMap is a hash-table-based implementation of the Map interface. It provides all of the optional map operations and permits null values and the null key. (The HashMap class is roughly equivalent to Hashtable, except that it is unsynchronized and permits nulls.) The class makes no guarantees about the order of the map; in particular, it does not guarantee that the order will remain constant over time.

Note that HashMap is not thread-safe. If you need a thread-safe map, you can obtain a synchronized wrapper through the static synchronizedMap method of the Collections class:

Map map = Collections.synchronizedMap(new HashMap());
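A short usage sketch of the wrapper follows. The class name SyncMapDemo, the fill helper and the thread counts are made up for the example. It relies on one documented detail of Collections.synchronizedMap: single calls such as put are synchronized for you, but iterating over the map's views must be wrapped in a synchronized block on the wrapper.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class SyncMapDemo {
    // Starts `threads` workers that each put `perThread` distinct keys, then returns the map size.
    static int fill(final Map<String, Integer> map, int threads, final int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < perThread; i++) {
                        map.put(id + "-" + i, i); // each put is synchronized by the wrapper
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return map.size();
    }

    public static void main(String[] args) {
        Map<String, Integer> map =
                Collections.synchronizedMap(new HashMap<String, Integer>());
        System.out.println(fill(map, 4, 1000)); // 4000

        // Iteration is a sequence of calls, so it needs manual locking on the wrapper.
        synchronized (map) {
            long sum = 0;
            for (int v : map.values()) sum += v;
            System.out.println(sum);
        }
    }
}
```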

II. Data structure of HashMap

Internally, HashMap is built on an array and linked lists. Lookups are fast because the storage location is determined by computing a hash code. HashMap computes the hash from the key's hashCode, so keys with the same hashCode always get the same hash. When many objects are stored, different objects may end up with the same hash value, which is a hash collision. Anyone who has studied data structures knows there are several ways to resolve hash collisions; HashMap resolves them with linked lists.

In the figure, the purple part is the hash table, also called the hash array. Each element of the array is the head node of a singly linked list. The lists resolve collisions: keys that map to the same array position are placed in the same list.

Let's look at the code of the Entry class in HashMap:

/**
 * Entry is a node of a singly linked list.
 * It is the list used by HashMap's chained storage.
 * It implements the Map.Entry interface, that is, getKey(), getValue(),
 * setValue(V value), equals(Object o) and hashCode().
 **/
static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    // Points to the next node
    Entry<K,V> next;
    final int hash;

    // Constructor. The parameters are the hash value (h), the key (k),
    // the value (v) and the next node (n).
    Entry(int h, K k, V v, Entry<K,V> n) {
        value = v;
        next = n;
        key = k;
        hash = h;
    }

    public final K getKey() {
        return key;
    }

    public final V getValue() {
        return value;
    }

    public final V setValue(V newValue) {
        V oldValue = value;
        value = newValue;
        return oldValue;
    }

    // Two Entries are equal if both their keys and their values are equal.
    public final boolean equals(Object o) {
        if (!(o instanceof Map.Entry))
            return false;
        Map.Entry e = (Map.Entry) o;
        Object k1 = getKey();
        Object k2 = e.getKey();
        if (k1 == k2 || (k1 != null && k1.equals(k2))) {
            Object v1 = getValue();
            Object v2 = e.getValue();
            if (v1 == v2 || (v1 != null && v1.equals(v2)))
                return true;
        }
        return false;
    }

    // Implements hashCode()
    public final int hashCode() {
        return (key == null ? 0 : key.hashCode()) ^
               (value == null ? 0 : value.hashCode());
    }

    public final String toString() {
        return getKey() + "=" + getValue();
    }

    // Called by put() when the value of an existing entry is overwritten.
    // Does nothing here.
    void recordAccess(HashMap<K,V> m) {
    }

    // Called when the entry is removed from the HashMap. Does nothing here.
    void recordRemoval(HashMap<K,V> m) {
    }
}


HashMap is essentially an array of Entry objects. Each Entry holds a key and a value, plus a next field that is also an Entry; next is what handles hash collisions by linking entries into a list.

III. HashMap source analysis

1. Key attributes

Let's take a look at some of the key properties in the HashMap class:

transient Entry[] table;   // the array that stores the entries

transient int size;        // the number of key-value pairs stored

int threshold;             // resize threshold; the table grows when size exceeds it (threshold = load factor * capacity)

final float loadFactor;    // load factor

transient int modCount;    // number of times the map has been modified


Here loadFactor, the load factor, indicates how full the hash table is allowed to get.

The larger the load factor, the fuller the table gets. The advantage is high space utilization; the drawback is that collisions become more likely, the linked lists grow longer, and lookups become slower.

Conversely, the smaller the load factor, the emptier the table stays. The advantage is that collisions become less likely; the drawback is wasted space: the data in the table is sparse, and the table is expanded while much of it is still unused.

The more likely a collision is, the higher the cost of a lookup.

So a balance has to be struck between the chance of collisions and space utilization. This is essentially the familiar time-space tradeoff in data structures.

If memory is plentiful and you want faster lookups, you can set the load factor lower; if memory is tight and lookup speed is not critical, you can set it higher. In general there is no need to change it: the default of 0.75 works well.
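As a rough illustration of the tradeoff, the sketch below applies the threshold formula from the field list (threshold = load factor * capacity) to a 16-slot table. The class and method names are made up for the example.

```java
class ThresholdDemo {
    // Number of entries a table of the given capacity holds before it is expanded.
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        System.out.println(threshold(16, 0.5f));  // 8:  grows early, short chains, more memory
        System.out.println(threshold(16, 0.75f)); // 12: the default compromise
        System.out.println(threshold(16, 1.0f));  // 16: grows late, longer chains, less memory
    }
}
```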

2. Constructors

Here are some of HashMap's constructors:

 1  public HashMap(int initialCapacity, float loadFactor) {
 2      // Ensure the arguments are valid
 3      if (initialCapacity < 0)
 4          throw new IllegalArgumentException("Illegal initial capacity: " +
 5                                             initialCapacity);
 6      if (initialCapacity > MAXIMUM_CAPACITY)
 7          initialCapacity = MAXIMUM_CAPACITY;
 8      if (loadFactor <= 0 || Float.isNaN(loadFactor))
 9          throw new IllegalArgumentException("Illegal load factor: " +
10                                             loadFactor);
11
12      // Find a power of 2 >= initialCapacity
13      int capacity = 1; // starting capacity
14      while (capacity < initialCapacity) // make capacity the smallest power of 2 that is >= initialCapacity
15          capacity <<= 1;
16
17      this.loadFactor = loadFactor;
18      threshold = (int)(capacity * loadFactor);
19      table = new Entry[capacity];
20      init();
21  }
22
23  public HashMap(int initialCapacity) {
24      this(initialCapacity, DEFAULT_LOAD_FACTOR);
25  }
26
27  public HashMap() {
28      this.loadFactor = DEFAULT_LOAD_FACTOR;
29      threshold = (int)(DEFAULT_INITIAL_CAPACITY * DEFAULT_LOAD_FACTOR);
30      table = new Entry[DEFAULT_INITIAL_CAPACITY];
31      init();
32  }

As we can see, the first constructor is the one invoked when both the load factor and the initial capacity are specified at construction time; otherwise the defaults are used. The default initial capacity is 16 and the default load factor is 0.75. Lines 13-15 of the code above (the while loop) make sure the capacity is a power of two: the smallest power of two that is greater than or equal to initialCapacity. Why the capacity must be a power of two is explained later.
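The rounding loop can be tried in isolation. The sketch below copies only that loop into a hypothetical helper named roundUpToPowerOfTwo; it assumes the argument has already been capped at the maximum capacity, as the constructor does before the loop runs.

```java
class CapacityDemo {
    // Smallest power of two that is >= initialCapacity (same loop as in the constructor).
    static int roundUpToPowerOfTwo(int initialCapacity) {
        int capacity = 1;
        while (capacity < initialCapacity)
            capacity <<= 1;
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOfTwo(0));    // 1
        System.out.println(roundUpToPowerOfTwo(10));   // 16
        System.out.println(roundUpToPowerOfTwo(16));   // 16
        System.out.println(roundUpToPowerOfTwo(17));   // 32
        System.out.println(roundUpToPowerOfTwo(1000)); // 1024
    }
}
```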

The two most frequently used HashMap methods are put and get, so the analysis focuses on them.

3. Storing data

Let's see how HashMap stores data, starting with the put method:

public V put(K key, V value) {
    // If the key is null, store the key-value pair in table[0].
    if (key == null)
        return putForNullKey(value);
    // If the key is not null, compute its hash and add the pair to the list for that hash.
    int hash = hash(key.hashCode());
    // Find the index in the table for this hash
    int i = indexFor(hash, table.length);
    // Walk the chain at table[i]. If the key already exists, replace the old value
    // with the new one and return.
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) { // same key: overwrite and return the old value
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    // Modification count +1
    modCount++;
    // Add the key-value pair at table[i]
    addEntry(hash, key, value, i);
    return null;
}


Let's walk through this method step by step. The check at the top handles the case where the key is null. Here is the putForNullKey(value) method:

private V putForNullKey(V value) {
    for (Entry<K,V> e = table[0]; e != null; e = e.next) {
        if (e.key == null) { // an entry with a null key already exists: overwrite its value
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(0, null, value, 0); // a null key has hash 0
    return null;
}


Note: if the key is null, its hash is 0 and the entry is stored at index 0 of the array, i.e. table[0].
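From the caller's side the null key behaves like any other key. The following sketch uses only the public java.util.HashMap API; the class name NullKeyDemo and the run helper are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

class NullKeyDemo {
    // Returns the results of two puts and one get on the null key, plus the final size.
    static Object[] run() {
        Map<String, String> m = new HashMap<String, String>();
        String firstPut = m.put(null, "first");   // no previous mapping
        String secondPut = m.put(null, "second"); // overwrites, returns the old value
        return new Object[] { firstPut, secondPut, m.get(null), m.size() };
    }

    public static void main(String[] args) {
        Object[] r = run();
        System.out.println(r[0]); // null
        System.out.println(r[1]); // first
        System.out.println(r[2]); // second
        System.out.println(r[3]); // 1
    }
}
```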

Going back to the put method, the statement int hash = hash(key.hashCode()) computes the hash from the key's hashCode value. The function that computes it is as follows:

// Computes the hash from the hashCode of the key
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
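To see what the extra mixing buys, the sketch below feeds sixteen hash codes that differ only in their upper bits through a 16-slot index calculation, once directly and once through the hash function. The shift amounts 20, 12, 7 and 4 are the ones used in the JDK 6 source; the class name, the distinctBuckets helper and the choice of test values are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

class HashSpreadDemo {
    // Supplemental hash with the JDK 6 shift amounts.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Counts the distinct bucket indexes hit by hash codes whose low 16 bits are all zero.
    static int distinctBuckets(boolean spread, int length) {
        Set<Integer> seen = new HashSet<Integer>();
        for (int i = 0; i < 16; i++) {
            int hashCode = i << 16;
            int h = spread ? hash(hashCode) : hashCode;
            seen.add(indexFor(h, length));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctBuckets(false, 16)); // 1:  every key lands in bucket 0
        System.out.println(distinctBuckets(true, 16));  // 16: the keys are spread out
    }
}
```

Without the mixing step, the index depends only on the low-order bits of hashCode, so keys that differ only in the high bits all collide.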


Once the hash is available, it is used to compute the index in the array where the entry should be stored. The function that computes the index is as follows:

// Computes the array index from the hash and the array length
static int indexFor(int h, int length) {
    // hash & (length-1) is chosen deliberately: it guarantees that the
    // computed index stays within the bounds of the array.
    return h & (length - 1);
}

This deserves a closer look. The natural way to map a hash to a table slot is to take the hash modulo the table length (the division method), and Hashtable does exactly this. The approach spreads elements fairly evenly, but modulo requires a division, which is relatively slow. HashMap replaces the modulo with h & (length-1), which achieves the same even distribution at a much lower cost. This is one of HashMap's improvements over Hashtable.

Next, let's analyze why the capacity of the hash table must be a power of two. First, when length is a power of two, h & (length-1) is equivalent to h modulo length (for non-negative h), which keeps the distribution even while being faster. Second, a power of two is even, so length-1 is odd and its lowest bit is 1. The lowest bit of h & (length-1) can therefore be either 0 or 1, depending on h, so the result can be even or odd and the whole table is usable. If length were odd, length-1 would be even and its lowest bit would be 0, so the lowest bit of h & (length-1) would always be 0. The index would always be even, every hash would land on an even position of the array, and nearly half of the space would be wasted. With length a power of two, different hash values are less likely to collide and the elements are spread evenly over the table.
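The first claim is easy to check by brute force. The sketch below compares the mask with the % operator over a range of non-negative hashes; the class name and the agreesWithModulo helper are invented for the example.

```java
class MaskVsModDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // True if h & (length-1) equals h % length for every h in [0, samples).
    static boolean agreesWithModulo(int length, int samples) {
        for (int h = 0; h < samples; h++) {
            if (indexFor(h, length) != h % length) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreesWithModulo(16, 100000)); // true:  16 is a power of two
        System.out.println(agreesWithModulo(15, 100000)); // false: 1 & 14 is 0, but 1 % 15 is 1
    }
}
```

For negative values the two differ even with a power-of-two length, because Java's % can return a negative result while the mask cannot.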

This looks simple, but there is more to it than meets the eye. Let's illustrate with an example:

Suppose the array length is 15 in one case and 16 in the other, and the hashes (after optimization) are 8 and 9. The results of the & operation are as follows:

       h & (table.length-1)      hash         table.length-1       result
       8 & (15-1):               1000    &    1110            =    1000
       9 & (15-1):               1001    &    1110            =    1000
       --------------------------------------------------------------------
       8 & (16-1):               1000    &    1111            =    1000
       9 & (16-1):               1001    &    1111            =    1001

As the example shows, when 8 and 9 are ANDed with 15-1 (1110) they produce the same result, so they map to the same position in the array. This is a collision: 8 and 9 are placed in the same slot and form a linked list, and a lookup has to traverse the list to find 8 or 9, which makes it slower. We can also see that when the array length is 15, the lowest bit of any hash ANDed with 15-1 (1110) is always 0, so the positions 0001, 0011, 0101, 0111, 1001, 1011 and 1101 can never hold an element. A lot of space is wasted, and worse, the number of usable positions is much smaller than the array length, which further increases the chance of collisions and slows lookups down. When the array length is 16, that is, a power of two, every bit of the binary form of 2^n - 1 is 1, so the & operation keeps the low-order bits of the hash unchanged. Combined with the hash(int h) method, which mixes the high-order bits of the key's hashCode into the low-order bits, only keys whose hashes agree in those low-order bits end up in the same array position and form a linked list.

So when the array length is a power of two, different keys are less likely to compute the same index, the data is spread more evenly over the array, and collisions are less likely. Lookups then seldom have to traverse a linked list at a given position, so they are more efficient.
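The wasted positions can be counted directly. The sketch below marks every index that h & (length-1) can produce for a range of hashes; the class name and the reachableBuckets helper are made up for the example.

```java
class BucketUsageDemo {
    // Counts how many array positions h & (length-1) can ever select.
    static int reachableBuckets(int length) {
        boolean[] used = new boolean[length];
        int count = 0;
        for (int h = 0; h < 1024; h++) {
            int i = h & (length - 1);
            if (!used[i]) {
                used[i] = true;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(reachableBuckets(15)); // 8:  only the even positions 0, 2, ..., 14
        System.out.println(reachableBuckets(16)); // 16: every position is usable
    }
}
```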

   

According to the source of the put method above, when the program puts a key-value pair into the HashMap, it first determines where the Entry is stored from the hashCode() return value of the key: if the keys of two Entries have the same hashCode() value, they are stored at the same location. If the two keys also compare as equal through equals(), the value of the new Entry overwrites the value of the Entry already in the collection, but the key is not replaced. If equals() returns false, the new Entry forms an Entry chain with the existing Entry, with the new Entry at the head of the chain. The addEntry() method shows the details:

1  void addEntry(int hash, K key, V value, int bucketIndex) {
2      Entry<K,V> e = table[bucketIndex]; // if the slot already holds an entry, it becomes the next of the new Entry
3      table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
4      if (size++ >= threshold) // once the size reaches the threshold, grow the table
5          resize(2 * table.length); // double the capacity
6  }

The bucketIndex parameter is the index computed by the indexFor function. Line 2 fetches the Entry currently at index bucketIndex of the array. Line 3 builds a new Entry from hash, key and value, places it at index bucketIndex, and sets the Entry that was previously there as the next of the new one, forming a linked list.

Lines 4 and 5 check whether size has reached the threshold; if so, the table is resized to twice its previous capacity.
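The overwrite rule of put (the value is replaced, the key object is kept) can be observed through the public API. The sketch below uses two distinct String objects that are equal to each other; the class name and the keepsOriginalKey helper are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

class PutOverwriteDemo {
    // True if a second put with an equal key replaces the value but keeps the first key object.
    static boolean keepsOriginalKey() {
        Map<String, Integer> m = new HashMap<String, Integer>();
        String first = new String("k");  // two different objects ...
        String second = new String("k"); // ... that are equal according to equals()
        m.put(first, 1);
        Integer old = m.put(second, 2);  // same key by equals(): only the value changes
        String stored = m.keySet().iterator().next();
        return old == 1
                && m.get("k") == 2
                && m.size() == 1
                && stored == first;      // reference comparison on purpose
    }

    public static void main(String[] args) {
        System.out.println(keepsOriginalKey()); // true
    }
}
```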

4. Resizing

The resize() method is as follows:

resize() changes the capacity of the HashMap; newCapacity is the new capacity.

 1  void resize(int newCapacity) {
 2      Entry[] oldTable = table;
 3      int oldCapacity = oldTable.length;
 4      if (oldCapacity == MAXIMUM_CAPACITY) {
 5          threshold = Integer.MAX_VALUE;
 6          return;
 7      }
 8
 9      Entry[] newTable = new Entry[newCapacity];
10      transfer(newTable); // move all elements of the old table into newTable
11      table = newTable; // then make newTable the current table
12      threshold = (int)(newCapacity * loadFactor); // recompute the threshold
13  }


A new, larger array is created. Line 10 of the code above calls the transfer method, which moves every element of the old table into the new one and recomputes each element's index in the new array.

As more and more elements are added to a HashMap, hash collisions become increasingly likely because the length of the array is fixed. To keep lookups efficient, the array has to be expanded. Array expansion also occurs in ArrayList and is a common operation. For HashMap, however, it is also the most expensive one: every element in the old array must have its position in the new array recomputed and be placed there. This is what resize does.
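Because the length is a power of two and is doubled, recomputing an index has a simple consequence: an element either keeps its old index or moves up by exactly the old capacity. The sketch below checks this by brute force over sample hashes; it does not reproduce the transfer method itself, and the class and helper names are assumptions made for the example.

```java
class RehashDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // True if, for every sampled hash, the index after doubling is either the
    // old index or the old index plus oldCapacity.
    static boolean staysOrMovesByOldCapacity(int oldCapacity) {
        int newCapacity = 2 * oldCapacity;
        for (int h = 0; h < 4096; h++) {
            int oldIndex = indexFor(h, oldCapacity);
            int newIndex = indexFor(h, newCapacity);
            if (newIndex != oldIndex && newIndex != oldIndex + oldCapacity) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(indexFor(5, 16) + " -> " + indexFor(5, 32));   // 5 -> 5
        System.out.println(indexFor(21, 16) + " -> " + indexFor(21, 32)); // 5 -> 21
        System.out.println(staysOrMovesByOldCapacity(16));                // true
    }
}
```

Two keys that shared bucket 5 in the 16-slot table can therefore end up in different buckets after the resize, which is how expansion shortens the chains.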

So when does HashMap expand? When the number of elements exceeds array size * loadFactor. The default loadFactor is 0.75, a compromise value. With the default array size of 16, once the number of elements exceeds 16 * 0.75 = 12, the array is expanded to 2 * 16 = 32, that is, doubled, and the position of every element is recomputed. Expansion requires copying the array, which is expensive, so if the number of elements is known in advance, presetting the capacity accordingly can noticeably improve HashMap performance.
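The arithmetic behind presizing can be sketched as follows. The helpers capacityFor and doublingsFromDefault are hypothetical names, and the calculation assumes the rule described in this article: the table doubles whenever the element count exceeds capacity * load factor, starting from a capacity of 16.

```java
class PresizeDemo {
    // Smallest power-of-two capacity whose threshold is at least `expected`.
    static int capacityFor(int expected, float loadFactor) {
        int capacity = 1;
        while ((int) (capacity * loadFactor) < expected)
            capacity <<= 1;
        return capacity;
    }

    // Number of doublings needed when starting from the default capacity of 16
    // with load factor 0.75 and adding `expected` elements.
    static int doublingsFromDefault(int expected) {
        int capacity = 16;
        int doublings = 0;
        while ((int) (capacity * 0.75f) < expected) {
            capacity <<= 1;
            doublings++;
        }
        return doublings;
    }

    public static void main(String[] args) {
        System.out.println(capacityFor(12, 0.75f));     // 16
        System.out.println(capacityFor(13, 0.75f));     // 32
        System.out.println(capacityFor(1000, 0.75f));   // 2048
        System.out.println(doublingsFromDefault(1000)); // 7
    }
}
```

For 1000 elements, a capacity of 1024 is not enough, because 1024 * 0.75 = 768 is below 1000; passing 2048 to the constructor avoids all seven intermediate resizes.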

5. Reading data

public V get(Object key) {
    if (key == null)
        return getForNullKey();
    int hash = hash(key.hashCode());
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
            return e.value;
    }
    return null;
}

With the storage algorithm above as background, this code is easy to follow. To get an element from a HashMap, the method first computes the hash from the key's hashCode, finds the corresponding position in the array, and then uses the key's equals method to find the wanted element in the linked list at that position.
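Because get relies on both hashCode and equals, a key class has to override the two together. The sketch below contrasts a hypothetical Point key that overrides both with a RawPoint that inherits the identity-based versions from Object; all names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

class KeyContractDemo {
    // Key type that overrides hashCode() and equals() consistently.
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        @Override public int hashCode() { return 31 * x + y; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }
    }

    // Same fields, but hashCode() and equals() are inherited from Object.
    static final class RawPoint {
        final int x, y;
        RawPoint(int x, int y) { this.x = x; this.y = y; }
    }

    static String lookupWithContract() {
        Map<Point, String> m = new HashMap<Point, String>();
        m.put(new Point(1, 2), "found");
        return m.get(new Point(1, 2)); // a different but equal object
    }

    static String lookupWithoutContract() {
        Map<RawPoint, String> m = new HashMap<RawPoint, String>();
        m.put(new RawPoint(1, 2), "found");
        return m.get(new RawPoint(1, 2)); // equals() is identity, so this never matches
    }

    public static void main(String[] args) {
        System.out.println(lookupWithContract());    // found
        System.out.println(lookupWithoutContract()); // null
    }
}
```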

6. HashMap performance parameters

HashMap provides the following constructors:

HashMap(): constructs a HashMap with an initial capacity of 16 and a load factor of 0.75.

HashMap(int initialCapacity): constructs a HashMap with an initial capacity of initialCapacity and a load factor of 0.75.
