Studying the Hash Storage Mechanism by analyzing the JDK source code -- Reprinting

Source: Internet
Author: User
This article analyzes the source code of HashMap and HashSet to explain their hash storage mechanism.

Collections and references

Consider an array of a reference type: when we put a Java object into the array, we do not actually put the object itself into the array. We only store a reference to the object, and each array element is a reference variable. Collections work the same way.

In fact, HashSet and HashMap have a lot in common. For HashSet, the system uses a hash algorithm to determine where each element is stored, so that elements can be saved and retrieved quickly. For HashMap, the system treats each key-value pair as a unit and always computes its storage location with the hash algorithm, so that the map's key-value pairs can be saved and retrieved quickly.

Before looking at how collections store data, note that although a collection is said to store Java objects, it does not actually contain the objects themselves. It only keeps references to them. In other words, a Java collection is a group of reference variables, each of which points to an actual Java object.
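The point about references can be checked with a small program. This is only an illustrative sketch: the class name ReferenceDemo and the use of StringBuilder as the element type are choices made here, not part of the original article.

```java
import java.util.HashSet;
import java.util.Set;

final class ReferenceDemo {
    // Returns true if the set holds the very same object that was added.
    static boolean holdsSameObject() {
        StringBuilder sb = new StringBuilder("a");
        Set<StringBuilder> set = new HashSet<StringBuilder>();
        set.add(sb);      // stores a reference to sb, not a copy
        sb.append("b");   // mutate the object through the original variable
        StringBuilder stored = set.iterator().next();
        // Same identity, and the change is visible through the set
        return stored == sb && stored.toString().equals("ab");
    }

    public static void main(String[] args) {
        System.out.println(holdsSameObject());
    }
}
```

StringBuilder is used because it does not override hashCode(), so mutating it does not disturb its position in the set.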

 


Storage implementation of HashMap

Suppose a program puts several key-value pairs into a HashMap, as in the following code snippet:

    HashMap<String, Double> map = new HashMap<String, Double>();
    map.put("language", 80.0);
    map.put("math", 89.0);
    map.put("English", 78.2);

HashMap uses a so-called "hash algorithm" to determine the storage location of each element.

When the program executes map.put("language", 80.0), the system calls the hashCode() method of "language" to obtain its hash code. Every Java object has a hashCode() method that returns this value. Having obtained the hash code of the object, the system uses it to determine where the element is stored.

Let's look at the source code of the put(K key, V value) method of the HashMap class:

    public V put(K key, V value) {
        // If the key is null, let putForNullKey handle it
        if (key == null)
            return putForNullKey(value);
        // Compute the hash from the key's hashCode()
        int hash = hash(key.hashCode());
        // Find the index in the table that corresponds to this hash
        int i = indexFor(hash, table.length);
        // If the entry at index i is not null, walk the chain through e.next
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            // An existing key equals the key being put
            // (same hash, and equals() returns true)
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }
        // No matching entry was found at index i
        modCount++;
        // Add the key and value at index i
        addEntry(hash, key, value, i);
        return null;
    }
JDK source code

You can find a src.zip archive in the JDK installation directory. It contains the source files of the Java base class library. If you are interested, you can open this archive at any time to read the source code of the Java class library, which is a great way to improve your programming skills. Note that the source code in src.zip does not contain comments like the ones shown above; those comments were added by the author.

The code above uses an important nested interface: Map.Entry. Each Map.Entry is a key-value pair. As the code shows, when the system decides where to store a key-value pair in a HashMap, it does not consider the value at all; it computes the storage location of each Entry from the key alone. This supports the earlier conclusion: we can regard the value in a Map as an attachment to the key. Once the system has determined where the key is stored, the value is stored with it.

The put method relies on a method that computes a hash from the return value of hashCode(): hash(). This method is pure bit arithmetic:

    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }
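To see what these shifts achieve, the sketch below feeds two hash codes that differ only in their high bits through the same function. The class name HashSpreadDemo, the helper low4, and the two sample values are assumptions chosen for illustration; only hash(int h) is copied from the code quoted above.

```java
final class HashSpreadDemo {
    // Copied from the hash(int h) method quoted above
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Low four bits of a value, i.e. a slot number in the range 0..15
    static int low4(int h) {
        return h & 15;
    }

    public static void main(String[] args) {
        int a = 0x10000;
        int b = 0x20000;   // a and b differ only above bit 15
        // Without the extra mixing, both values share the same low four bits
        System.out.println(low4(a) + " " + low4(b));             // 0 0
        // After mixing, the high bits influence the low four bits
        System.out.println(low4(hash(a)) + " " + low4(hash(b))); // 1 2
    }
}
```

The mixing step matters because only the low bits of the hash are used to pick a slot in a small table.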

For any given object, as long as its hashCode() returns the same value, the hash computed by hash(int h) is always the same. Next, the program calls indexFor(int h, int length) to work out which index of the table array the object should be stored at. The code for indexFor(int h, int length) is as follows:

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

This method is very clever: it always uses h & (table.length - 1) to obtain the storage location of the object, and the length of HashMap's underlying array is always a power of 2. For more information, see the discussion of the HashMap constructor later in this article.

When length is always a power of 2, h & (length - 1) is a very neat design. Suppose h = 5 and length = 16: h & (length - 1) gives 5. If h = 6 and length = 16, it gives 6 ... if h = 15 and length = 16, it gives 15. But when h = 16 and length = 16, it gives 0, and when h = 17 and length = 16, it gives 1 ... This ensures that the computed index always falls within the bounds of the table array.
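The arithmetic in this paragraph can be reproduced directly. In the sketch below, indexFor is the method quoted earlier; the class name IndexForDemo and the comparison with Math.floorMod (Java 8 and later) are additions made here for illustration.

```java
final class IndexForDemo {
    // Same expression as the indexFor method quoted above
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        int[] samples = {5, 6, 15, 16, 17};
        for (int h : samples) {
            // For a power-of-two length, the mask equals a non-negative modulo
            System.out.println("h=" + h
                + " index=" + indexFor(h, length)
                + " floorMod=" + Math.floorMod(h, length));
        }
    }
}
```

For a power-of-two length the mask gives the same result as a non-negative remainder, but with a single bitwise operation, and it stays in range even for negative hash values.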

According to the source code of the put method, when the program puts a key-value pair into a HashMap, it first determines the Entry's storage location from the hashCode() return value of the key: if the keys of two Entry objects return the same hashCode() value, they are stored in the same location. If the two keys also compare as equal through equals(), the value of the newly added Entry overwrites the value of the existing Entry, but the key is not replaced. If equals() returns false, the newly added Entry forms an Entry chain with the existing Entry, and the newly added Entry sits at the head of the chain. For more information, see the description of the addEntry() method.

When a key-value pair is added to a HashMap, the hashCode() return value of its key determines the storage location of the pair (that is, of the Entry object). When the keys of two Entry objects return the same hashCode() value, the result of comparing the keys with equals() decides whether the existing value is overwritten (equals() returns true) or an Entry chain is formed (equals() returns false).
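The overwrite case is easy to observe from outside, because put() returns the previous value for the key. A minimal sketch using java.util.HashMap follows; the class name PutOverwriteDemo and the sample values are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

final class PutOverwriteDemo {
    // Returns {first put result, second put result, final value, final size}
    static Object[] run() {
        Map<String, Double> map = new HashMap<String, Double>();
        Double first = map.put("math", 89.0);   // no previous mapping: null
        Double second = map.put("math", 95.0);  // equal key: old value returned
        return new Object[] { first, second, map.get("math"), map.size() };
    }

    public static void main(String[] args) {
        for (Object o : run()) {
            System.out.println(o);
        }
    }
}
```

The map still contains a single entry after the second put, which matches the branch of put() that updates e.value and returns oldValue.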

The put method also calls addEntry(hash, key, value, i). addEntry is a package-private method of HashMap that is used only to add a key-value pair. Its code is as follows:

    void addEntry(int hash, K key, V value, int bucketIndex) {
        // Get the Entry currently at bucketIndex
        Entry<K,V> e = table[bucketIndex];     // ①
        // Put the newly created Entry at bucketIndex and
        // make it point to the original Entry
        table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
        // If the number of key-value pairs in the map has reached the threshold
        if (size++ >= threshold)
            // double the length of the table
            resize(2 * table.length);          // ②
    }

The code of this method is very short, but it contains an elegant design: the system always places the newly added Entry at index bucketIndex of the table array. If an Entry already exists at bucketIndex, the new Entry points to the existing one (which produces an Entry chain). If there is no Entry at bucketIndex, the variable e at code ① is null, so the new Entry points to null and no chain is produced.
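The head-insertion behavior can be modeled in a few lines. The sketch below is a simplified stand-in, not the JDK class: ChainDemo, its Entry type (key only, no hash or value), the fixed table size of 4 and the chain() helper are all hypothetical names introduced for illustration.

```java
final class ChainDemo {
    // Simplified entry: a key plus a reference to the next entry in the chain
    static final class Entry {
        final String key;
        final Entry next;
        Entry(String key, Entry next) {
            this.key = key;
            this.next = next;
        }
    }

    private final Entry[] table = new Entry[4];

    // Mirrors the first two statements of addEntry: new entry becomes the head
    void addEntry(String key, int bucketIndex) {
        Entry e = table[bucketIndex];
        table[bucketIndex] = new Entry(key, e);
    }

    // Renders the chain in a bucket from head to tail, e.g. "second->first"
    String chain(int bucketIndex) {
        StringBuilder sb = new StringBuilder();
        for (Entry e = table[bucketIndex]; e != null; e = e.next) {
            if (sb.length() > 0) {
                sb.append("->");
            }
            sb.append(e.key);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ChainDemo demo = new ChainDemo();
        demo.addEntry("first", 1);
        demo.addEntry("second", 1);
        System.out.println(demo.chain(1)); // second->first
    }
}
```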

 


Hash algorithm performance options

From the code above we can see that when several entries are stored in the same bucket as an Entry chain, the most recently added Entry is always at the head of the chain, and the Entry that was placed in the bucket first is at the end of the chain.

The code above uses two variables:

  • size: the number of key-value pairs contained in the HashMap.
  • threshold: the maximum number of key-value pairs the HashMap can hold before it is resized. Its value equals the capacity of the HashMap multiplied by the load factor.

Code ② in the addEntry method shows that when size++ >= threshold, HashMap automatically calls the resize method to expand its capacity. Each expansion doubles the capacity of the HashMap.
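The growth rule can be simulated without touching HashMap internals. The sketch below only replays the size++ >= threshold check from addEntry; the class name ResizeDemo and the method capacityAfter are hypothetical, and the real resize() also rehashes the entries, which is not modeled here.

```java
final class ResizeDemo {
    // Simulates the capacity after a number of distinct keys have been added,
    // using the same "size++ >= threshold" check as addEntry.
    static int capacityAfter(int adds, int capacity, float loadFactor) {
        int threshold = (int) (capacity * loadFactor);
        int size = 0;
        for (int k = 0; k < adds; k++) {
            if (size++ >= threshold) {
                capacity *= 2;                              // doubles each time
                threshold = (int) (capacity * loadFactor);  // recomputed limit
            }
        }
        return capacity;
    }

    public static void main(String[] args) {
        // With capacity 16 and load factor 0.75 the threshold is 12
        System.out.println(capacityAfter(12, 16, 0.75f)); // 16
        System.out.println(capacityAfter(13, 16, 0.75f)); // 32
        System.out.println(capacityAfter(25, 16, 0.75f)); // 64
    }
}
```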

The table used in the code above is simply an ordinary array. Every array has a fixed length, and the length of this array is the capacity of the HashMap. HashMap provides the following constructors:

  • HashMap(): constructs a HashMap with an initial capacity of 16 and a load factor of 0.75.
  • HashMap(int initialCapacity): constructs a HashMap with an initial capacity of initialCapacity and a load factor of 0.75.
  • HashMap(int initialCapacity, float loadFactor): constructs a HashMap with the specified initial capacity and load factor.

When a HashMap is created, the system automatically creates a table array to hold the entries of the HashMap. The following is the code of one of the HashMap constructors:

    // Create a HashMap with the specified initial capacity and load factor
    public HashMap(int initialCapacity, float loadFactor) {
        // The initial capacity cannot be negative
        if (initialCapacity < 0)
            throw new IllegalArgumentException(
                "Illegal initial capacity: " + initialCapacity);
        // If the initial capacity exceeds the maximum capacity, use the maximum
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        // The load factor must be a number greater than 0
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException(
                "Illegal load factor: " + loadFactor);
        // Find the smallest power of 2 that is not less than initialCapacity
        int capacity = 1;
        while (capacity < initialCapacity)
            capacity <<= 1;
        this.loadFactor = loadFactor;
        // Set the threshold to capacity * load factor
        threshold = (int) (capacity * loadFactor);
        // Initialize the table array
        table = new Entry[capacity];            // ①
        init();
    }

The while loop in the code above is a concise implementation: it finds the smallest power of 2 that is greater than or equal to initialCapacity and uses it as the actual capacity of the HashMap (stored in the capacity variable). For example, if initialCapacity is 10, the actual capacity of the HashMap is 16.
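The rounding step can be tried in isolation. The loop in the sketch below is the same as the one in the constructor; the class name CapacityDemo and the method name roundUpToPowerOfTwo are hypothetical names used only for this illustration.

```java
final class CapacityDemo {
    // Same loop as in the HashMap constructor quoted above
    static int roundUpToPowerOfTwo(int initialCapacity) {
        int capacity = 1;
        while (capacity < initialCapacity) {
            capacity <<= 1;
        }
        return capacity;
    }

    public static void main(String[] args) {
        int[] requested = {1, 10, 16, 17};
        for (int n : requested) {
            System.out.println(n + " -> " + roundUpToPowerOfTwo(n));
        }
    }
}
```

A requested capacity of 10 becomes 16, 16 stays 16, and 17 becomes 32.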

initialCapacity and the capacity of the hash table

The initialCapacity specified when a HashMap is created is not necessarily the actual capacity of the HashMap. In general the actual capacity is larger than initialCapacity, unless initialCapacity is itself a power of 2. Once you understand how HashMap allocates its capacity, you can pass a power of 2 as initialCapacity when creating a HashMap, which saves the system a little computation.

At code ① we can see that table is essentially an array whose length is capacity.

For HashMap and its subclasses, a hash algorithm determines where each element of the collection is stored. When the system initializes a HashMap, it creates an Entry array of length capacity. Each slot of this array is called a "bucket"; every bucket has its own index, and the system can quickly access the element stored in a bucket through that index.

At any time, each bucket of a HashMap stores only one element (that is, one Entry). Because an Entry object can hold a reference variable (the last parameter of the Entry constructor) that points to the next Entry, a bucket may hold a single Entry that in turn points to another Entry, which forms an Entry chain, as shown in Figure 1:

Figure 1. Storage diagram of HashMap

 


Read implementation of HashMap

HashMap performs best when each bucket stores only a single Entry, that is, when no Entry chain has been formed through the next reference. When the program looks up a value by key, the system only needs to compute the hashCode() return value of the key, use it to find the key's index in the table array, fetch the Entry at that index, and return the value that corresponds to the key. Here is the code of the get(Object key) method of the HashMap class:

    public V get(Object key) {
        // If the key is null, call getForNullKey to fetch the value
        if (key == null)
            return getForNullKey();
        // Compute the hash from the key's hashCode()
        int hash = hash(key.hashCode());
        // Start from the entry at the computed index of the table array
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             // move on to the next Entry of the chain
             e = e.next) {        // ①
            Object k;
            // If the key of this Entry equals the key being searched for
            if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
                return e.value;
        }
        return null;
    }

From the code above we can see that if each bucket of the HashMap holds only one Entry, HashMap can fetch that Entry directly by index. When a "hash collision" occurs, a bucket holds not a single Entry but an Entry chain, and the system has to walk the entries one by one until it finds the one it is looking for. If that Entry is at the end of the chain (it was the first one placed in the bucket), the system must loop all the way to the end to find it.
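A collision is easy to provoke with strings: "Aa" and "BB" have the same hashCode() (2112), so they always land in the same bucket, yet both remain retrievable because the lookup falls back on equals(). The class name CollisionDemo and the stored values below are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

final class CollisionDemo {
    // Stores two keys with identical hash codes and looks one of them up
    static double lookup(String key) {
        Map<String, Double> map = new HashMap<String, Double>();
        map.put("Aa", 1.0);
        map.put("BB", 2.0);   // same hashCode() as "Aa", different key
        return map.get(key);
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        System.out.println(lookup("Aa"));                       // 1.0
        System.out.println(lookup("BB"));                       // 2.0
    }
}
```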

In summary, HashMap treats each key-value pair as a unit at the lowest level, and that unit is an Entry object. Internally HashMap uses an Entry[] array to store all key-value pairs. When an Entry needs to be stored, the hash algorithm determines its storage location; when an Entry needs to be retrieved, the hash algorithm finds that location again and the Entry is fetched directly. HashMap can store and retrieve its entries quickly for the same reason our mothers taught us in everyday life: put different things in different places, and you can find them quickly when you need them.

When a HashMap is created, it has a default load factor of 0.75, which is a compromise between time and space costs. Increasing the load factor reduces the memory occupied by the hash table (that is, the Entry array) but increases the time cost of lookups, and lookup is the most frequent operation (both the get() and put() methods of HashMap perform one). Reducing the load factor improves lookup performance but increases the memory occupied by the hash table.

With this in mind, we can adjust the load factor as needed when creating a HashMap. If the program cares most about space and memory is tight, the load factor can be increased somewhat. If the program cares most about time and memory is plentiful, the load factor can be reduced somewhat. In general, programmers do not need to change the load factor.

If you know from the start that a HashMap will hold many key-value pairs, you can give it a larger initial capacity when creating it. If the number of entries never exceeds the threshold (capacity * load factor), HashMap never needs to call resize() to reallocate the table array, which gives better performance. Of course, setting the initial capacity too high wastes space (the system has to create an Entry array of length capacity), so the initial capacity should be chosen with care.
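One way to apply this advice is to derive the initial capacity from the expected number of entries. The helper below is a hypothetical sketch, not a JDK method: SizingDemo and initialCapacityFor are names invented here, and the formula simply inverts threshold = capacity * load factor.

```java
import java.util.HashMap;
import java.util.Map;

final class SizingDemo {
    // Smallest capacity whose threshold (capacity * loadFactor) is at least
    // expectedEntries; HashMap then rounds this up to a power of 2 itself.
    static int initialCapacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / (double) loadFactor);
    }

    public static void main(String[] args) {
        int expected = 100;
        float loadFactor = 0.75f;
        int initialCapacity = initialCapacityFor(expected, loadFactor);
        System.out.println(initialCapacity); // 134
        Map<Integer, String> map =
            new HashMap<Integer, String>(initialCapacity, loadFactor);
        for (int i = 0; i < expected; i++) {
            map.put(i, "value" + i);
        }
        System.out.println(map.size()); // 100
    }
}
```

For 100 entries at the default load factor this requests 134, which HashMap rounds up to 256, giving a threshold of 192.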

 


Implementation of HashSet

HashSet is implemented on top of HashMap: internally a HashSet uses a HashMap to store all of its elements, so the implementation of HashSet is fairly simple. Looking at the source code of HashSet, you can see the following:

    public class HashSet<E>
        extends AbstractSet<E>
        implements Set<E>, Cloneable, java.io.Serializable
    {
        // Use the keys of a HashMap to hold all elements of the HashSet
        private transient HashMap<E, Object> map;
        // A dummy object used as the value for every key of the HashMap
        private static final Object PRESENT = new Object();
        ...
        // Initialize the HashSet by creating the underlying HashMap
        public HashSet() {
            map = new HashMap<E, Object>();
        }
        // Create a HashSet with the specified initialCapacity and loadFactor;
        // in fact this creates a HashMap with those values
        public HashSet(int initialCapacity, float loadFactor) {
            map = new HashMap<E, Object>(initialCapacity, loadFactor);
        }
        public HashSet(int initialCapacity) {
            map = new HashMap<E, Object>(initialCapacity);
        }
        HashSet(int initialCapacity, float loadFactor, boolean dummy) {
            map = new LinkedHashMap<E, Object>(initialCapacity, loadFactor);
        }
        // Call keySet() of the map to iterate over all keys
        public Iterator<E> iterator() {
            return map.keySet().iterator();
        }
        // Call size() of the HashMap: the number of entries
        // is the number of elements in the set
        public int size() {
            return map.size();
        }
        // Call isEmpty() of the HashMap: when the HashMap is empty,
        // the HashSet is empty too
        public boolean isEmpty() {
            return map.isEmpty();
        }
        // Call containsKey() of the HashMap to check for an element;
        // all elements of the HashSet are stored as HashMap keys
        public boolean contains(Object o) {
            return map.containsKey(o);
        }
        // Put the element into the HashMap as a key
        public boolean add(E e) {
            return map.put(e, PRESENT) == null;
        }
        // Call remove() of the HashMap to delete the Entry,
        // which removes the element from the HashSet
        public boolean remove(Object o) {
            return map.remove(o) == PRESENT;
        }
        // Call clear() of the map to remove all entries,
        // which removes all elements from the HashSet
        public void clear() {
            map.clear();
        }
        ...
    }

From the source code above we can see that the implementation of HashSet is indeed very simple. It just wraps a HashMap object that stores all the set elements: every element of a HashSet is stored as a key of the HashMap, and the value for each key is PRESENT, a static final Object.

Most HashSet methods are implemented by calling the corresponding HashMap methods, so HashSet and HashMap are essentially the same in implementation.
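The delegation pattern is small enough to reproduce as a runnable sketch. TinyHashSet below is a hypothetical, cut-down class written for this illustration; it omits iteration, serialization and the other constructors of the real HashSet.

```java
import java.util.HashMap;

final class TinyHashSet<E> {
    // Shared dummy value stored for every key, as in HashSet
    private static final Object PRESENT = new Object();
    private final HashMap<E, Object> map = new HashMap<E, Object>();

    // put() returns null only when the key was not present before
    boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }

    boolean contains(Object o) {
        return map.containsKey(o);
    }

    // remove() returns the stored value (PRESENT) if the key existed
    boolean remove(Object o) {
        return map.remove(o) == PRESENT;
    }

    int size() {
        return map.size();
    }

    public static void main(String[] args) {
        TinyHashSet<String> set = new TinyHashSet<String>();
        System.out.println(set.add("a"));      // true
        System.out.println(set.add("a"));      // false
        System.out.println(set.contains("a")); // true
        System.out.println(set.size());        // 1
    }
}
```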

put() of HashMap and add() of HashSet

Because the add() method of HashSet simply calls the put() method of HashMap to add a key-value pair, consider what happens when the key of the newly added Entry is the same as the key of an existing Entry (the hashCode() return values are equal and equals() also returns true): the value of the new Entry overwrites the value of the existing Entry, but the key does not change. Therefore, if you add an element that already exists to a HashSet, the newly added element (stored as a HashMap key underneath) does not replace the existing one.
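This can be verified by adding two distinct but equal objects and checking which instance the set keeps. The sketch below uses two separately allocated String objects; the class name KeepFirstDemo is an assumption made for this example.

```java
import java.util.HashSet;
import java.util.Set;

final class KeepFirstDemo {
    // Returns true if the set keeps the first of two equal instances
    static boolean keepsFirstInstance() {
        String s1 = new String("abc");
        String s2 = new String("abc");   // equal to s1, but a different object
        Set<String> set = new HashSet<String>();
        boolean addedFirst = set.add(s1);    // true: new element
        boolean addedSecond = set.add(s2);   // false: equal element exists
        // The stored key is still the first instance
        return addedFirst && !addedSecond && set.iterator().next() == s1;
    }

    public static void main(String[] args) {
        System.out.println(keepsFirstInstance());
    }
}
```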

With this theory in hand, let's look at a sample program to test whether you have really understood how HashMap and HashSet work.

    import java.util.HashSet;
    import java.util.Set;

    class Name {
        private String first;
        private String last;
        public Name(String first, String last) {
            this.first = first;
            this.last = last;
        }
        // Note: hashCode() is deliberately NOT overridden here
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (o != null && o.getClass() == Name.class) {
                Name n = (Name) o;
                return n.first.equals(first)
                    && n.last.equals(last);
            }
            return false;
        }
    }

    public class HashSetTest {
        public static void main(String[] args) {
            Set<Name> s = new HashSet<Name>();
            s.add(new Name("abc", "123"));
            System.out.println(
                s.contains(new Name("abc", "123")));
        }
    }

After adding a new Name("abc", "123") object to the HashSet, the program immediately checks whether the HashSet contains a new Name("abc", "123") object. At first glance it is easy to think the program will output true.

When actually run, the program outputs false. HashSet considers two objects equal only if their hashCode() return values are equal in addition to equals() returning true. The program above does not override the hashCode() method of the Name class, so the two Name objects return different hashCode() values, HashSet treats them as two different objects, and the program outputs false.

It follows that when we want to use objects of a class as HashMap keys, or store them in a HashSet, it is very important to override both the equals(Object obj) method and the hashCode() method of that class, and the two methods must be consistent: whenever two objects of the class compare as equal through equals(), their hashCode() return values must also be the same. In general, every key attribute used to compute the return value of hashCode() should also be used as a criterion in the equals() comparison.
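One common way to keep the two methods consistent is to compute both from the same fields. The sketch below is one possible variant, not the article's own example: the class name FullName is hypothetical, and java.util.Objects.hash requires Java 7 or later.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class FullName {
    private final String first;
    private final String last;

    FullName(String first, String last) {
        this.first = first;
        this.last = last;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || o.getClass() != FullName.class) {
            return false;
        }
        FullName n = (FullName) o;
        return first.equals(n.first) && last.equals(n.last);
    }

    // Uses exactly the fields that equals() compares
    @Override
    public int hashCode() {
        return Objects.hash(first, last);
    }

    // Adds one object and looks up an equal but distinct object
    static boolean foundAfterAdd() {
        Set<FullName> set = new HashSet<FullName>();
        set.add(new FullName("abc", "123"));
        return set.contains(new FullName("abc", "123"));
    }

    public static void main(String[] args) {
        System.out.println(foundAfterAdd()); // true
    }
}
```

With hashCode() overridden this way, the lookup that failed in HashSetTest succeeds.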

hashCode() and equals()

For details on how to correctly override the hashCode() and equals() methods of a class, refer to the relevant sections of the book "Crazy Java Handout" in the Crazy Java series.

The following program correctly overrides both the hashCode() and equals() methods of the Name class:

    import java.util.HashSet;

    class Name {
        private String first;
        private String last;
        public Name(String first, String last) {
            this.first = first;
            this.last = last;
        }
        // Two Name objects are equal if their first values are equal
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (o != null && o.getClass() == Name.class) {
                Name n = (Name) o;
                return n.first.equals(first);
            }
            return false;
        }
        // Compute hashCode() from the hashCode() of first
        public int hashCode() {
            return first.hashCode();
        }
        public String toString() {
            return "Name[first=" + first + ", last=" + last + "]";
        }
    }

    public class HashSetTest2 {
        public static void main(String[] args) {
            HashSet<Name> set = new HashSet<Name>();
            set.add(new Name("abc", "123"));
            set.add(new Name("abc", "456"));
            System.out.println(set);    // ①
        }
    }

The program above provides a Name class that overrides the equals(), hashCode() and toString() methods. Both equals() and hashCode() depend only on the first instance variable of the Name class: when two Name objects have the same first value, their hashCode() return values are the same and equals() returns true.

The main method first adds a Name object whose first instance variable is "abc" to the HashSet. It then tries to add a second Name object whose first is also "abc". The second object cannot be added: because its first is "abc", HashSet judges it to be the same as the existing Name object. When the program prints the set at code ①, you will see that it contains only one Name object, the first one, whose last is "123".

 

Address: http://www.ibm.com/developerworks/cn/java/j-lo-hash/
