Analyzing the Source Code and Performance Optimization of the HashMap Data Structure in Java

Source: Internet
Author: User

Storage structure
First, HashMap is based on a hash table. Internally it holds an array; when an element is to be stored, the hash value of its key is computed first, and that hash determines the element's index in the array. If the slot at that index is empty, the element is placed there directly. If the slot already holds an element (call it A), the new element is linked in front of A and then placed into the array. So in HashMap, the array actually holds the head node of each linked list. Here is a diagram from Baidu Encyclopedia:

As shown above, each element is an Entry object that holds the element's key and value, plus a pointer to the next Entry. All keys with the same hash value (that is, the keys that collide) are strung together in a linked list; this is the separate-chaining ("zipper") method.

Internal variables

// Default initial capacity
static final int DEFAULT_INITIAL_CAPACITY = 16;
// Maximum capacity
static final int MAXIMUM_CAPACITY = 1 << 30;
// Default load factor
static final float DEFAULT_LOAD_FACTOR = 0.75f;
// The hash table
transient Entry<K,V>[] table;
// The number of key-value pairs
transient int size;
// Resize threshold
int threshold;
// The load factor of the hash table
final float loadFactor;

In the variables above, capacity refers to the length of the hash table, i.e. the size of table, which defaults to 16. The load factor loadFactor measures how "full" the hash table is allowed to become, as the JDK documentation says:

The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
In other words, the load factor measures how full a hash table may become before it is enlarged. When the number of key-value pairs exceeds the product of the current capacity and the load factor, the table is rehashed (its internal data structure is rebuilt) into roughly twice its original capacity.

From the variable definitions above we can see that the default load factor DEFAULT_LOAD_FACTOR is 0.75. The larger this value, the higher the space utilization, but the slower queries become (both get and put). Once the load factor is understood, threshold is easy: it is simply capacity * load factor.
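As a quick arithmetic check (a minimal sketch, not JDK code; the class name is mine), the default table of 16 buckets therefore resizes once it holds 12 entries:

```java
public class ThresholdDemo {
    public static void main(String[] args) {
        int capacity = 16;                 // DEFAULT_INITIAL_CAPACITY
        float loadFactor = 0.75f;          // DEFAULT_LOAD_FACTOR
        int threshold = (int) (capacity * loadFactor);
        // With the defaults, the table is resized once size reaches 12
        System.out.println(threshold);     // prints 12
    }
}
```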

Constructors

public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " + loadFactor);

    // Find a power of 2 >= initialCapacity
    int capacity = 1;
    while (capacity < initialCapacity)  // compute the smallest power of 2 >= the specified capacity
        capacity <<= 1;

    this.loadFactor = loadFactor;
    threshold = (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
    table = new Entry[capacity];        // allocate space for the hash table
    useAltHashing = sun.misc.VM.isBooted() &&
            (capacity >= Holder.ALTERNATIVE_HASHING_THRESHOLD);
    init();
}

There are several constructors, and they all eventually invoke the one above. It accepts two parameters: the initial capacity and the load factor. First the arguments are validated, and an exception is thrown if they are illegal. The important part is the capacity calculation that follows: its logic is to compute the smallest power of 2 that is greater than or equal to initialCapacity. The goal is to make the capacity at least as large as the specified initial capacity while keeping it a power of 2, which is a key design point. The main reason for this lies in how hash values are mapped to indexes. Let's look at HashMap's hashing methods:

final int hash(Object k) {
    int h = 0;
    if (useAltHashing) {
        if (k instanceof String) {
            return sun.misc.Hashing.stringHash32((String) k);
        }
        h = hashSeed;
    }

    h ^= k.hashCode();

    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

static int indexFor(int h, int length) {
    return h & (length - 1);
}

The hash() method recomputes the key's hash value using a fairly involved sequence of bit operations. I am not entirely clear on the exact reasoning, but it is certainly designed to spread the bits around and reduce collisions.

The indexFor() method below computes an element's index in the hash table from its hash value. A hash table ordinarily obtains the index by taking the hash value modulo the table length. When length (that is, the capacity) is a power of 2, h & (length - 1) has exactly the same effect. Moreover, a power of 2 is always even, so length - 1 is odd and its lowest binary bit is 1. The lowest bit of h & (length - 1) can then be either 1 or 0, so elements hash evenly across even and odd indexes alike. If length were odd, length - 1 would be even with a lowest bit of 0; the lowest bit of h & (length - 1) could then only be 0, every resulting index would be even, and half of the hash table would be wasted. That is why the capacity of a HashMap must be a power of 2. You can see that both defaults, DEFAULT_INITIAL_CAPACITY = 16 and MAXIMUM_CAPACITY = 1 << 30, satisfy this.
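To see why the mask works, here is a small sketch (the class name is mine; indexFor mirrors the JDK method shown above) comparing the mask with an ordinary modulo for a power-of-2 length:

```java
public class IndexForDemo {
    // Same logic as HashMap's indexFor(): valid only when length is a power of 2
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        for (int h : new int[]{0, 5, 31, 123456789}) {
            // For non-negative h, the mask gives the same result as h % length
            System.out.println(indexFor(h, length) == h % length); // prints true
        }
        // Unlike %, the mask also maps negative hashes into [0, length)
        System.out.println(indexFor(-1, length)); // prints 15
    }
}
```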

Entry objects
The key-value pairs in a HashMap are encapsulated in Entry objects, an inner class of HashMap. Here is its implementation:

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    Entry<K,V> next;
    int hash;

    Entry(int h, K k, V v, Entry<K,V> n) {
        value = v;
        next = n;
        key = k;
        hash = h;
    }

    public final K getKey() { return key; }

    public final V getValue() { return value; }

    public final V setValue(V newValue) {
        V oldValue = value;
        value = newValue;
        return oldValue;
    }

    public final boolean equals(Object o) {
        if (!(o instanceof Map.Entry))
            return false;
        Map.Entry e = (Map.Entry) o;
        Object k1 = getKey();
        Object k2 = e.getKey();
        if (k1 == k2 || (k1 != null && k1.equals(k2))) {
            Object v1 = getValue();
            Object v2 = e.getValue();
            if (v1 == v2 || (v1 != null && v1.equals(v2)))
                return true;
        }
        return false;
    }

    public final int hashCode() {
        return (key == null ? 0 : key.hashCode()) ^
               (value == null ? 0 : value.hashCode());
    }

    public final String toString() {
        return getKey() + "=" + getValue();
    }

    void recordAccess(HashMap<K,V> m) {}

    void recordRemoval(HashMap<K,V> m) {}
}

The implementation of this class is simple and easy to understand. It provides getKey(), getValue() and related methods for callers, and its equals() requires both key and value to be equal.

Put operation
You have to put before you can get, so let's look at put() first:

public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key);
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }

    modCount++;
    addEntry(hash, key, value, i);
    return null;
}

This method first checks whether the key is null and, if so, invokes the putForNullKey() method; this means HashMap allows null keys (and in fact null values too). If the key is not null, the hash value is computed and the table index derived from it. The corresponding linked list is then searched to see whether the same key already exists; if it does, its value is simply updated. Otherwise the addEntry() method is invoked to insert a new entry.
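The update-and-return-old-value behavior is easy to observe from the public API (a minimal sketch; the class name is mine):

```java
import java.util.HashMap;

public class PutDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        System.out.println(map.put("k", 1)); // prints null: no previous mapping
        System.out.println(map.put("k", 2)); // prints 1: the old value is returned
        System.out.println(map.get("k"));    // prints 2
    }
}
```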

Take a look at the putForNullKey() method:

private V putForNullKey(V value) {
    for (Entry<K,V> e = table[0]; e != null; e = e.next) {
        if (e.key == null) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(0, null, value, 0);
    return null;
}

As you can see, when the key is null the entry always goes to index 0. If a null key already exists its value is updated; otherwise addEntry() inserts it.
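The null-key behavior can be checked with a small sketch (the class name is mine):

```java
import java.util.HashMap;

public class NullKeyDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        map.put(null, 1);                  // the null key lives in bucket 0
        map.put(null, 2);                  // same null key: the value is replaced
        map.put("a", null);                // null values are allowed too
        map.put("b", null);                // ...and there can be many of them
        System.out.println(map.get(null)); // prints 2
        System.out.println(map.size());    // prints 3
    }
}
```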

The following is the implementation of the addEntry() method:

void addEntry(int hash, K key, V value, int bucketIndex) {
    if ((size >= threshold) && (null != table[bucketIndex])) {
        resize(2 * table.length);
        hash = (null != key) ? hash(key) : 0;
        bucketIndex = indexFor(hash, table.length);
    }

    createEntry(hash, key, value, bucketIndex);
}

void createEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K,V> e = table[bucketIndex];
    table[bucketIndex] = new Entry<>(hash, key, value, e);
    size++;
}

It first decides whether to resize (a resize recomputes the indexes and copies the elements over), recomputes the bucket index if it did, and finally createEntry() inserts the element at the head of the bucket's linked list.
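Head insertion itself can be sketched outside HashMap with a hypothetical Node class (none of these names are JDK code):

```java
public class HeadInsertDemo {
    static class Node {
        final String key;
        final Node next;
        Node(String key, Node next) { this.key = key; this.next = next; }
    }

    public static void main(String[] args) {
        Node[] table = new Node[4];
        int bucket = 1;
        // Each new node is linked in front of the current head, as createEntry() does
        table[bucket] = new Node("first", table[bucket]);
        table[bucket] = new Node("second", table[bucket]);
        System.out.println(table[bucket].key);      // prints second
        System.out.println(table[bucket].next.key); // prints first
    }
}
```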

Get operation

public V get(Object key) {
    if (key == null)
        return getForNullKey();
    Entry<K,V> entry = getEntry(key);

    return null == entry ? null : entry.getValue();
}

private V getForNullKey() {
    for (Entry<K,V> e = table[0]; e != null; e = e.next) {
        if (e.key == null)
            return e.value;
    }
    return null;
}

final Entry<K,V> getEntry(Object key) {
    int hash = (key == null) ? 0 : hash(key);
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k))))
            return e;
    }
    return null;
}

This is simpler than put(): it likewise checks whether the key is null, then traverses the corresponding linked list.

Performance optimization
HashMap is an efficient and versatile data structure that can be seen everywhere in Java programs. Let's first cover some basics. As you may know, HashMap uses the key's hashCode() and equals() methods to divide values into buckets. The number of buckets is usually slightly larger than the number of records in the map, so that each bucket holds only a few values (ideally one). When looking up by key, we can locate the right bucket very quickly (hashCode() modulo the number of buckets) and then find the target object in constant time.

You probably knew all of that already. You may also know that hash collisions have a disastrous effect on HashMap performance. If multiple hashCode() values fall into the same bucket, the values are stored in a linked list. In the worst case all keys map to the same bucket and the HashMap degenerates into a linked list: lookups go from O(1) to O(n). Let's first test HashMap's performance in Java 7 and Java 8 under normal conditions. To fully control the behavior of hashCode(), we define a Key class as follows:

class Key implements Comparable<Key> {
    private final int value;

    Key(int value) {
        this.value = value;
    }

    @Override
    public int compareTo(Key o) {
        return Integer.compare(this.value, o.value);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass())
            return false;
        Key key = (Key) o;
        return value == key.value;
    }

    @Override
    public int hashCode() {
        return value;
    }
}

The Key class is well behaved: it overrides equals() and provides a reasonable hashCode(). To avoid excessive GC, I cache immutable Key objects rather than creating them from scratch every time:

Class Key implements comparable<key> {public
class Keys {public
static final int max_key = 10_000_000;
   
    private static final key[] Keys_cache = new Key[max_key];
static {for
(int i = 0; i < Max_key; ++i) {
Keys_cache[i] = new KEY (i);
}
public static Key of (int value) {return
keys_cache[value];
}


   

Now we can start testing. Our benchmark creates HashMaps of different sizes (powers of 10, from 1 up to 1 million) filled with consecutive keys. It then looks entries up by key and measures how long that takes for each map size:

import java.util.HashMap;

import com.google.caliper.Param;
import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

public class MapBenchmark extends SimpleBenchmark {
    private HashMap<Key, Integer> map;

    @Param
    private int mapSize;

    @Override
    protected void setUp() throws Exception {
        map = new HashMap<>(mapSize);
        for (int i = 0; i < mapSize; ++i) {
            map.put(Keys.of(i), i);
        }
    }

    public void timeMapGet(int reps) {
        for (int i = 0; i < reps; i++) {
            map.get(Keys.of(i % mapSize));
        }
    }
}

Interestingly, on this simple HashMap.get() benchmark, Java 8 is 20% faster than Java 7. Overall performance is also quite good: even with 1 million records in the HashMap, a single lookup takes less than 10 nanoseconds, roughly 20 CPU cycles on my machine. Quite impressive! But that is not what we set out to measure.

Now suppose we have a bad key whose hashCode() always returns the same value. This is the worst-case scenario, one in which HashMap should not be used at all:

class Key implements Comparable<Key> {
    // ...

    @Override
    public int hashCode() {
        return 0;
    }
}

The Java 7 results are as expected. As the HashMap grows, the cost of get() grows with it. Because all records sit in one very long linked list in the same bucket, looking up a record requires traversing half the list on average. As the chart shows, the time complexity is O(n).

But Java 8 performs far better! The curve is logarithmic, so its performance is better by several orders of magnitude. Even though this is the worst case, with severe hash collisions, the same benchmark on JDK 8 shows O(log n) time complexity. Viewed on its own, the JDK 8 curve makes the log-linear distribution even clearer:

Why such a large performance boost, even in big-O terms (big O describes an asymptotic upper bound)? This optimization was introduced in JEP 180. If a bucket holds too many records (currently TREEIFY_THRESHOLD = 8), HashMap dynamically replaces its list with a dedicated tree implementation, giving a good O(log n) instead of a bad O(n). How does it work? Records whose keys collided used to be simply appended to a linked list that could only be searched by traversal. Beyond the threshold, however, HashMap upgrades the list to a binary tree, using the hash value as the ordering criterion: if two hashes in the same bucket differ, the larger one goes into the right subtree. If the hash values are equal, HashMap hopes the key type implements the Comparable interface so that the keys can be ordered consistently. Implementing Comparable is not required for HashMap keys, but it is certainly best if possible: without it you should not expect any performance boost under severe hash collisions.
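The treeification behavior is easy to exercise with a key that always collides, like the benchmark's bad key (a minimal sketch; BadKey is my own name). Correctness is unchanged either way; only the lookup cost differs between JDK 7 and 8:

```java
import java.util.HashMap;

public class CollisionDemo {
    // Hypothetical key whose hashCode always collides, like the benchmark's bad key
    static final class BadKey implements Comparable<BadKey> {
        final int value;
        BadKey(int value) { this.value = value; }
        @Override public int hashCode() { return 0; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).value == value;
        }
        @Override public int compareTo(BadKey o) {
            return Integer.compare(value, o.value);
        }
    }

    public static void main(String[] args) {
        HashMap<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 1000; i++)        // all 1000 keys land in one bucket
            map.put(new BadKey(i), i);
        // On JDK 8+ the overloaded bucket becomes a red-black tree ordered by
        // compareTo, so lookups stay O(log n); the map behaves correctly either way
        System.out.println(map.get(new BadKey(500))); // prints 500
        System.out.println(map.size());               // prints 1000
    }
}
```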

Why does this improvement matter? Consider a malicious program that knows which hash algorithm we use: it could send a flood of requests engineered to produce severe hash collisions. Repeatedly accessing those keys would then significantly degrade server performance, producing a denial-of-service (DoS) attack. The leap from O(n) to O(log n) in JDK 8 effectively mitigates such attacks and also makes HashMap's performance a little more predictable. I hope this improvement will finally convince your boss to agree to an upgrade to JDK 8.

The test environment: Intel Core i7-3635QM @ 2.4 GHz, 8 GB of RAM, an SSD drive, running 64-bit Windows 8.1 with default JVM parameters.

Summary
The basic implementation of HashMap has been analyzed above; to conclude:

    • HashMap internally uses Entry objects to store key-value pairs; it is backed by a hash table and resolves collisions by separate chaining (the "zipper" method).
    • The default capacity of a HashMap is 16 and the default load factor is 0.75. You can specify a capacity, but it will be rounded up to a power of 2 so that hashing stays uniform.
    • HashMap keys and values may be null; of course only one key can be null, while any number of values can be null.
    • When the number of key-value pairs exceeds capacity * load factor, the HashMap resizes to roughly double its previous capacity. Resizing rehashes the table, so element positions may change; this is a time-consuming operation.
    • HashMap is not thread-safe.