Java data structures and algorithms: HashMap, hash table, hash function

Source: Internet
Author: User
Tags: array length, modulus, rehash, static class
1. HashMap Overview

HashMap is a non-synchronized implementation of the Map interface based on a hash table (Hashtable is similar to HashMap; the main difference is that Hashtable's methods are thread-safe, i.e. synchronized). This implementation provides all of the optional map operations and permits null values and the null key. The class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.

2. Four points of attention on HashMap

Attention point                      Conclusion
Does HashMap allow null?             Both the key and the value may be null
Does HashMap allow duplicate data?   A duplicate key overwrites the old entry; duplicate values are allowed
Is HashMap ordered?                  No; the order seen when traversing a HashMap is almost certainly not the order in which the entries were put
Is HashMap thread-safe?              No, HashMap is not thread-safe
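A minimal sketch demonstrating these four behaviors (the class name and keys below are illustrative, not from the original article):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class HashMapBehaviorDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();

        // null key and null value are both allowed
        map.put(null, "value-for-null-key");
        map.put("k1", null);

        // putting the same key again overwrites the old value...
        map.put("k2", "first");
        map.put("k2", "second");
        System.out.println(map.get("k2"));   // prints "second"

        // ...while the same value may appear under several keys
        map.put("k3", "second");

        // iteration order is not the insertion order and must not be relied upon
        for (Map.Entry<String, String> e : map.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }

        // HashMap is not thread-safe; wrap it if it must be shared across threads
        Map<String, String> syncMap = Collections.synchronizedMap(map);
    }
}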
3. Data structure of HashMap

In the Java programming language the two most basic structures are the array and the simulated pointer (reference); all other data structures can be built from these two, and HashMap is no exception. HashMap is in fact an "array of linked lists": each element of the array stores the head node of a linked list, so it is a combination of an array and linked lists.

As can be seen from the figure above, the bottom layer of HashMap is an array, and each item in the array is a linked list. When a new HashMap is created, the array is initialized. The source code is as follows:

/**
 * The table, resized as necessary. Length MUST Always be a power of two.
 */
transient Entry[] table;

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    Entry<K,V> next;
    final int hash;
    ...
}

As you can see, Entry is the type of the array's elements. Each Entry is in fact a key-value pair that also holds a reference to the next element, and it is this reference that forms the linked list.

The bottom layer of HashMap is built mainly on the array and the linked lists, and lookups are fast because the storage location is determined by computing a hash code. HashMap computes the hash value from the key's hashCode, so keys with the same hashCode always produce the same hash value. As more objects are stored, different objects may end up with the same hash value, which is the so-called hash conflict (collision). Anyone who has studied data structures knows there are many ways to resolve hash conflicts; at the bottom layer HashMap resolves them with linked lists.

In the figure, the purple part represents the hash table, also called the hash array. Each element of the array is the head node of a singly linked list, and the linked list is used to resolve conflicts: if different keys map to the same position in the array, they are put into the same singly linked list.

HashMap and its subclasses use a hash algorithm to decide where an element is stored in the collection. When a HashMap is initialized, the system creates an Entry array of length capacity. Each location where an element can be stored is called a "bucket"; every bucket has its own index, and the system can quickly access the element stored in a bucket by that index.

Each bucket of a HashMap stores only one element (that is, one Entry). However, because an Entry object contains a reference variable (the last parameter of the Entry constructor) that points to the next Entry, it is possible that a bucket holds a single Entry which in turn points to another Entry, and so on; this is how an Entry chain is formed.
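To make the "array of linked lists" structure concrete before looking at the constructors, here is a minimal hand-rolled sketch of a chained hash table. It is not the JDK source; the names Node and buckets and the fixed capacity of 16 are illustrative assumptions.

// A minimal chained hash table: an array whose slots are heads of singly linked lists.
class SimpleChainedMap<K, V> {
    static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;   // reference to the next node in the same bucket
        Node(K key, V value, Node<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] buckets = (Node<K, V>[]) new Node[16]; // power-of-two length

    private int indexOf(Object key) {
        // mask with (length - 1), as described above; bucket 0 is used for the null key
        return key == null ? 0 : key.hashCode() & (buckets.length - 1);
    }

    public void put(K key, V value) {
        int i = indexOf(key);
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (key == null ? n.key == null : key.equals(n.key)) {
                n.value = value;                     // same key: overwrite the old value
                return;
            }
        }
        buckets[i] = new Node<>(key, value, buckets[i]); // new node becomes the head of the chain
    }

    public V get(Object key) {
        for (Node<K, V> n = buckets[indexOf(key)]; n != null; n = n.next) {
            if (key == null ? n.key == null : key.equals(n.key)) return n.value;
        }
        return null;
    }
}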

4. HashMap's constructor

HashMap provides three constructors:

HashMap(): constructs an empty HashMap with the default initial capacity (16) and the default load factor (0.75).
HashMap(int initialCapacity): constructs an empty HashMap with the specified initial capacity and the default load factor (0.75).
HashMap(int initialCapacity, float loadFactor): constructs an empty HashMap with the specified initial capacity and load factor.

Two parameters are mentioned here: the initial capacity and the load factor. Both are important parameters that affect the performance of a HashMap. The capacity is the number of buckets in the hash table, and the initial capacity is the capacity at the moment the hash table is created. The load factor is a measure of how full the hash table is allowed to become before its capacity is automatically increased; it measures how much of the table's space is in use. A larger load factor means the hash table is more heavily loaded, a smaller one means it is more lightly loaded. For a hash table that resolves conflicts by chaining, the average time to find an element is O(1 + a). A larger load factor therefore makes fuller use of the space at the cost of lower lookup efficiency, while a load factor that is too small leaves the data in the hash table too sparse and wastes a lot of space. The default load factor is 0.75, and in general there is no need to change it.

In short: the larger the load factor, the more elements are packed in. The advantage is higher space utilization, but the chance of collisions increases, the linked lists grow longer, and lookup efficiency drops.
Conversely, the smaller the load factor, the fewer elements are packed in. The advantage is a lower chance of collisions, but more space is wasted: the data in the table becomes too sparse, and the table starts expanding earlier.
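A short usage sketch of the three constructors; the capacity of 32 and load factor of 0.5f are arbitrary example values:

import java.util.HashMap;
import java.util.Map;

public class HashMapConstructorDemo {
    public static void main(String[] args) {
        Map<String, Integer> a = new HashMap<>();          // default capacity 16, load factor 0.75
        Map<String, Integer> b = new HashMap<>(32);        // initial capacity 32, default load factor 0.75
        Map<String, Integer> c = new HashMap<>(32, 0.5f);  // initial capacity 32, load factor 0.5
    }
}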

When the number of entries in the hash table exceeds capacity * load factor (which is in effect the actual capacity of the HashMap), the hash table is rehashed and expanded to twice the original number of buckets.

5. HashMap Access Implementation

5.1 Storage

public V put(K key, V value) {
    // When key is null, invoke the putForNullKey method, which stores the value at the
    // first position in the table; this is why HashMap allows a null key
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key.hashCode());          // (1) calculate the hash value of the key
    int i = indexFor(hash, table.length);     // (2) calculate the position of the hash value in the table array
    // Iterate from table[i]; if the same key already exists on the chain,
    // overwrite its value directly and return the old value
    for (Entry<K, V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;                               // increase the modification count by 1
    addEntry(hash, key, value, i);            // add key, value at position i
    return null;
}

From the source we can clearly see how HashMap stores data: first it checks whether the key is null; if so, it calls putForNullKey directly and places the value at the first position of the array. If the key is not null, the hash value is recomputed from the key's hashCode, and the hash value determines the element's position (index) in the table array. If the table already has elements at that position, the chain is compared to see whether the same key is already present: if it is, the original value for that key is overwritten; otherwise the element is stored at the head of the chain (the element saved first ends up at the end of the chain). If the table has no element at that position, the element is placed directly at that position in the array. The process looks simple, but there is quite a lot beneath the surface. A few points:

1. First look at the iteration. The reason for iterating is to check whether the same key already exists: if an identical key is found, HashMap replaces the old value with the new one and does not touch the key, which explains why there are never two identical keys in a HashMap. Also note how key equality is compared: the hash codes are compared first, and only if they are the same is equals evaluated, which greatly improves HashMap's efficiency.

2. Now look at points (1) and (2). Here lies the essence of HashMap. First is the hash method, a pure mathematical calculation that computes the hash value h. The algorithm mixes in the high-order bits to prevent the hash conflicts that arise when the low-order bits stay constant while the high-order bits change.

static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

Why perform such an operation? This is one of HashMap's clever touches. Take an example: the decimal number 32768 (binary 1000 0000 0000 0000) becomes 35080 (binary 1000 1001 0000 1000) after the calculation above. Perhaps that alone does not show much, so take another number, 61440 (binary 1111 0000 0000 0000); the result of the operation is 65263 (binary 1111 1110 1110 1111). Now it should be obvious: the purpose is to spread the 1 bits out as evenly as possible, so that the hashing distributes keys as uniformly as possible.
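The two worked examples can be checked with a small sketch that simply copies the supplemental hash formula shown above (class and method names here are illustrative):

public class HashSpreadDemo {
    // Copy of the JDK 1.6 style supplemental hash shown above
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int[] samples = {32768, 61440};
        for (int h : samples) {
            System.out.println(h + " (" + Integer.toBinaryString(h) + ") -> "
                    + hash(h) + " (" + Integer.toBinaryString(hash(h)) + ")");
        }
        // prints 35080 and 65263, matching the worked examples in the text
    }
}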

We know that for HashMap's table the data needs to be distributed as evenly as possible (ideally each slot holds only one element, so it can be found directly), neither too dense nor too sparse: too dense slows down queries, too sparse wastes space. After calculating the hash value, how is an even distribution over the table guaranteed? The natural idea is the modulo operation, but because modulo is relatively expensive, HashMap instead calls the indexFor method:

static int indexFor(int h, int length) {
    return h & (length - 1);
}

The length of HashMap's underlying array is always a power of two; the constructor contains capacity <<= 1, which guarantees this. When length is a power of two, h & (length - 1) is equivalent to taking h modulo length, i.e. h % length, but the & operation is more efficient than a direct modulo, which is one of HashMap's speed optimizations. Why the length must be a power of two is explained below.
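A quick sketch confirming that h & (length - 1) equals h % length whenever length is a power of two; the sample values are arbitrary non-negative examples:

public class IndexForDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;   // a power of two, as in HashMap
        for (int h : new int[]{5, 21, 37, 1234567}) {
            System.out.println(h + ": " + indexFor(h, length) + " == " + (h % length));
        }
        // For lengths that are not powers of two (e.g. 15) the two expressions differ,
        // which is exactly why the table length must be a power of two.
    }
}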

Returning to the indexFor method, it contains only one statement, h & (length - 1). Besides the modulo-like operation above, it has another very important responsibility: distributing the table data evenly and making full use of the space. Suppose length is 16 (a power of two) and 15, and h is 5, 6 and 7.

When length = 15, h = 6 and h = 7 produce the same result, so they are stored at the same position in the table, i.e. they collide; 6 and 7 then form a linked list at one position, which slows down queries. Of course there are only three numbers here, so let us look at all of 0 to 15.

From the chart we can see that there are 8 such collisions, and also a lot of wasted space: positions 1, 3, 5, 7, 9, 11, 13 and 15 never receive any data. That is because when these hash values are ANDed with 14, the last bit of the result is always 0, so the positions 0001, 0011, 0101, 0111, 1001, 1011, 1101 and 1111 can never store data. Less usable space further increases the chance of collisions and thus slows queries. When the array length is 16, i.e. a power of two, length - 1 has all of its low bits set to 1 (for example 2^4 - 1 = 1111 in binary), so the & operation preserves the low bits of the original hash; combined with the hash(int h) method, which further mixes the key's hashCode by bringing in the high-order bits, only keys with exactly the same hash value end up at the same array position and form a linked list. So when length = 2^n the probability that different hash values collide is relatively small, the data in the table array is distributed more evenly, and queries are faster.
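The missing chart can be regenerated with a small sketch that prints h & (length - 1) for h from 0 to 15 against table lengths of 15 and 16:

public class BucketDistributionDemo {
    public static void main(String[] args) {
        for (int length : new int[]{15, 16}) {
            System.out.print("length = " + length + ": ");
            for (int h = 0; h <= 15; h++) {
                System.out.print((h & (length - 1)) + " ");
            }
            System.out.println();
        }
        // With length = 15 (mask 1110) every odd index is unreachable and pairs such as
        // 6 and 7 collide; with length = 16 (mask 1111) each h from 0 to 15 gets its own slot.
    }
}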

Let us review the put process: when we add a key-value pair to a HashMap, the system first calculates the hash value of the key and then determines the storage position in the table from that hash value. If there is no element at that position, the pair is inserted directly. Otherwise, the element chain at that position is iterated and the hash of each node is compared with that of the key. If the two hash values are equal and the keys are equal (e.hash == hash && ((k = e.key) == key || key.equals(k))), the value of the existing node is overwritten with the new value. If the hash values are equal but the keys are not, the new node is inserted at the head of the linked list. The concrete implementation is the addEntry method, shown below:

void addEntry(int hash, K key, V value, int bucketIndex) {
    // Get the Entry currently at bucketIndex
    Entry<K, V> e = table[bucketIndex];
    // Place the newly created Entry at bucketIndex and let the new Entry point to the original one
    table[bucketIndex] = new Entry<K, V>(hash, key, value, e);
    // If the number of elements in the HashMap exceeds the threshold, double the capacity
    if (size++ >= threshold)
        resize(2 * table.length);
}

There are two points to note in this method:

6. The production of the Entry chain

This is a very elegant design. The system always places the newly added Entry object at bucketIndex. If there is already an object at bucketIndex, the newly added Entry points to the original Entry, forming an Entry chain; if there is no Entry object at bucketIndex, i.e. e == null, the newly added Entry points to null and no Entry chain is produced.

7. Expansion issues

As the number of elements in a HashMap grows, the probability of collisions increases and the resulting linked lists get longer, which inevitably affects the HashMap's speed. To keep the HashMap efficient, the system must expand it at a certain critical point. That critical point is reached when the number of elements in the HashMap equals table array length * load factor. Expansion, however, is time-consuming, because the positions of the data in the new table array must be recalculated and the data copied over. So if we can predict the number of elements a HashMap will hold, presetting that number can noticeably improve the HashMap's performance.
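A sketch of pre-sizing under the default load factor of 0.75; the expectedSize of 1000 is an arbitrary example value:

import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    public static void main(String[] args) {
        int expectedSize = 1000;   // number of entries we plan to store
        // Choose an initial capacity large enough that expectedSize stays below
        // capacity * loadFactor, so no rehash happens while filling the map.
        int initialCapacity = (int) (expectedSize / 0.75f) + 1;
        Map<Integer, String> map = new HashMap<>(initialCapacity);
        for (int i = 0; i < expectedSize; i++) {
            map.put(i, "value-" + i);
        }
    }
}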

According to the source of the put method above, when the program tries to put a key-value pair into the HashMap, it first determines the Entry's storage location from the hashCode() return value of the key: if two Entry keys return the same hashCode(), they are stored at the same location. If the keys of these two Entry objects also return true via equals, the value of the newly added Entry overwrites the value of the Entry already in the collection, but the key does not change. If the keys of these two Entry objects return false via equals, the newly added Entry forms an Entry chain with the existing Entry in the collection, and the newly added Entry sits at the head of that chain.
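A sketch that makes this concrete with a small key class whose hashCode is deliberately constant, so every instance lands in the same bucket; whether a put overwrites or chains then depends only on equals (the class name FixedHashKey is made up for this example):

import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    // Every instance returns the same hashCode, so all keys map to the same bucket.
    static class FixedHashKey {
        final String name;
        FixedHashKey(String name) { this.name = name; }
        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            return o instanceof FixedHashKey && ((FixedHashKey) o).name.equals(name);
        }
    }

    public static void main(String[] args) {
        Map<FixedHashKey, String> map = new HashMap<>();
        map.put(new FixedHashKey("a"), "first");
        map.put(new FixedHashKey("a"), "second");   // equals is true  -> value overwritten
        map.put(new FixedHashKey("b"), "third");    // equals is false -> chained in the same bucket
        System.out.println(map.size());                      // prints 2
        System.out.println(map.get(new FixedHashKey("a")));  // prints "second"
    }
}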

8. Read

Compared with storage, retrieval from a HashMap is simpler: the hash value of the key is used to find the Entry at the corresponding index in the table array, and the value corresponding to the key is then returned.

public V get(Object key) {
    // If the key is null, invoke the getForNullKey method to return the corresponding value
    if (key == null)
        return getForNullKey();
    // Calculate the hash code according to the hashCode value of the key
    int hash = hash(key.hashCode());
    // Iterate the chain at the computed index in the table array
    for (Entry<K, V> e = table[indexFor(hash, table.length)]; e != null; e = e.next) {
        Object k;
        // If the stored key is the same as the lookup key, return the corresponding value
        if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
            return e.value;
    }
    return null;
}

With the stored hash algorithm above as a basis, this code is easy to understand. From the source we can see that to get an element from a HashMap, we first compute the key's hashCode to find the corresponding position in the array, and then use the key's equals method to locate the required element in the linked list at that position.

When each bucket of a HashMap stores only a single Entry, i.e. no Entry chain has been produced through the next references, the HashMap performs best: to retrieve the value for a key, the system only needs to compute the key's hashCode(), find the key's index in the table array from that return value, take out the Entry at that index, and return that key's value.

As the code above shows, if each bucket of a HashMap contains only one Entry, the HashMap can quickly retrieve the Entry in a bucket by index. In the presence of a hash conflict, however, a single bucket stores not one Entry but an Entry chain, and the system must iterate through each Entry in turn until it finds the one being searched for. If the Entry being searched for happens to be at the very end of the chain (because it was the first one placed in the bucket), the system has to loop all the way to the end to find it.

To sum up, HashMap treats each key-value pair as a whole, and that whole is an Entry object. The bottom layer of HashMap uses an Entry[] array to hold all the key-value pairs. When an Entry object needs to be stored, its position in the array is determined by the hash algorithm, and its position in the linked list at that array slot is determined by the equals method. When an Entry needs to be retrieved, its position in the array is again found with the hash algorithm, and the Entry is then taken from the linked list at that position using the equals method.

9. Further discussion on the importance of hashCode

As mentioned earlier, HashMap re-hashes the key's hashCode to guard against the bad hashCodes produced by poor hash algorithms. Why guard against bad hashCodes?

A bad hashCode means hash conflicts: several different keys may end up with the same hashCode. A bad hash algorithm means the probability of hash conflicts increases, and that means HashMap's performance drops, in two ways:

(1) Suppose there are 10 keys and perhaps 6 of them have the same hashCode; the entries for the other 4 keys are evenly spread across the table, while one position has a chain of 6 entries hanging off it. This defeats the purpose of HashMap: the high performance of this data structure rests on entries being evenly distributed across the table, but now the distribution is 1 1 1 1 6. We therefore want hashCode to have strong randomness, to keep the entry distribution as random as possible and HashMap as efficient as possible.

(2) Consider the code HashMap uses to traverse the linked list at a table position:

if (e.hash == hash && ((k = e.key) == key || key.equals(k)))

Notice that because of the short-circuiting && operator, the hash codes are compared first; if they differ, the check fails immediately and equals is never evaluated. Comparing hash codes is very fast because it is just an int comparison, whereas equals often compares a series of fields and is slower. A high probability of hash conflicts means the number of equals comparisons is bound to increase, which reduces the efficiency of the HashMap.
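In practice this means a key class should derive its hashCode from the same fields that equals compares, so that hash values are well spread and equals is only reached for genuine candidates. A conventional sketch (the PersonKey class is an illustrative example, not from the article):

import java.util.Objects;

// A key class with a reasonable hashCode: it mixes all the fields that equals uses,
// so distinct keys are spread across buckets and the cheap int comparison
// (e.hash == hash) filters out most candidates before equals is ever called.
public final class PersonKey {
    private final String firstName;
    private final String lastName;

    public PersonKey(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    @Override public int hashCode() {
        return Objects.hash(firstName, lastName);
    }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PersonKey)) return false;
        PersonKey other = (PersonKey) o;
        return Objects.equals(firstName, other.firstName)
                && Objects.equals(lastName, other.lastName);
    }
}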

Original link: Blog Park @ ordinary Greek http://www.cnblogs.com/xiaoxi/p/5822209.html
