ConcurrentHashMap Principle Analysis (Java Collections)


I. Background

Thread-unsafe HashMap. In a multithreaded environment, concurrent put operations on a HashMap can produce an endless loop (a corrupted linked list that threads chase forever), driving CPU utilization to nearly 100%, so HashMap cannot be used in concurrent scenarios.

Inefficient Hashtable. Hashtable uses synchronized to guarantee thread safety, but it performs poorly under intense lock contention: while one thread is inside a synchronized method of the Hashtable, every other thread calling any of its synchronized methods blocks or spins. If thread 1 is adding an element with put, thread 2 can neither add an element with put nor fetch one with get, so the heavier the contention, the lower the throughput.

Lock segmentation. The reason Hashtable is slow in a highly contended environment is that all threads accessing it compete for the same lock. If the container instead holds multiple locks, each guarding a portion of its data, then threads accessing data in different portions never contend for a lock, which greatly improves the efficiency of concurrent access. This is the lock segmentation technique used by ConcurrentHashMap: the data is split into segments, each segment is given its own lock, and while one thread holds the lock of one segment, the data of the other segments can still be accessed by other threads. Some methods, such as size() and containsValue(), span segments and may need to lock the entire table rather than a single segment; they acquire the locks of all segments in order and release them in order when the operation completes. The "in order" is important, because acquiring the locks in an arbitrary order can easily deadlock; fixing the acquisition order guarantees that no deadlock occurs. Inside ConcurrentHashMap the segment array is final and its members are effectively final as well, but simply declaring an array final does not make the array's elements final; that has to be guaranteed by the implementation.

ConcurrentHashMap is composed of a Segment array and HashEntry arrays. Segment is a reentrant lock (it extends ReentrantLock) and plays the role of the lock in ConcurrentHashMap; HashEntry stores the key-value pairs. A ConcurrentHashMap contains an array of Segments. Each Segment is structured much like a HashMap: an array of buckets, each holding a linked list. A Segment contains a HashEntry array, each HashEntry is a node of a linked list, and each Segment guards the elements of its own HashEntry array, so any modification of that array's data must first acquire the corresponding Segment lock.

II. Application scenario

When a large array must be shared across many threads, consider splitting it into several parts and locking each part separately, avoiding one big lock, and use a hash function to decide which part a given key belongs to. A minimal sketch of this idea follows.
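To make the idea concrete, here is a minimal sketch of lock striping, assuming a fixed stripe count and per-stripe HashMaps guarded by plain monitor locks. The class and field names are illustrative; this is only a sketch of the technique, not ConcurrentHashMap's real implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Lock striping sketch: the key space is split into a fixed number of stripes,
// each guarded by its own lock, so threads touching different stripes never contend.
public class StripedMap<K, V> {
    private static final int STRIPES = 16;                 // like the default concurrency level
    private final Object[] locks = new Object[STRIPES];
    private final List<Map<K, V>> buckets = new ArrayList<>(STRIPES);

    public StripedMap() {
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new Object();
            buckets.add(new HashMap<>());
        }
    }

    // Hash the key to a stripe; spreading the hash keeps the stripes balanced.
    private int stripeFor(Object key) {
        int h = key.hashCode();
        h ^= (h >>> 16);
        return (h & 0x7fffffff) % STRIPES;
    }

    public V put(K key, V value) {
        int s = stripeFor(key);
        synchronized (locks[s]) {                          // only this stripe is locked
            return buckets.get(s).put(key, value);
        }
    }

    public V get(Object key) {
        int s = stripeFor(key);
        synchronized (locks[s]) {                          // the real ConcurrentHashMap avoids even this lock for reads
            return buckets.get(s).get(key);
        }
    }

    // A cross-stripe operation must visit every stripe; acquiring the locks in a
    // fixed order avoids deadlock. Locking one stripe at a time, as here, yields
    // only an approximate total; locking all stripes at once would give an exact one.
    public int size() {
        int total = 0;
        for (int i = 0; i < STRIPES; i++) {
            synchronized (locks[i]) {
                total += buckets.get(i).size();
            }
        }
        return total;
    }
}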
In fact, this is not limited to threading. When designing transactions over database tables (a transaction is, in a sense, also a synchronization mechanism), you can think of a table as an array that needs to be synchronized; if the table grows too large, consider splitting what a transaction has to touch (which is also why huge tables should be avoided), for example by splitting off columns vertically or partitioning rows horizontally.

III. Source code interpretation

ConcurrentHashMap (JDK 1.7 and earlier) has three main entity classes: ConcurrentHashMap (the whole hash table), Segment (the buckets), and HashEntry (the nodes). The original article includes a diagram showing how they relate.
/**
 * The segments, each of which is a specialized hash table.
 */
final Segment<K,V>[] segments;
Immutable (final) and volatile fields. ConcurrentHashMap allows multiple read operations to proceed concurrently, and reads do not need to lock. With a conventional implementation such as HashMap's, if elements could be added to or removed from the middle of a hash chain, an unlocked read could observe inconsistent data. The technique ConcurrentHashMap uses is to make HashEntry almost immutable. HashEntry represents a node in a hash chain; its structure is as follows:
static final class HashEntry<K,V> {
    final K key;
    final int hash;
    volatile V value;
    final HashEntry<K,V> next;
}
You can see that everything except value is final, which means you cannot add or remove a node in the middle or at the tail of a hash chain, because that would require modifying a next reference; all structural modifications can only start from the head. A put can simply add its node at the head of the chain. A remove, however, may need to delete a node from the middle of the chain, which requires cloning all the nodes in front of the deleted node, with the last cloned node pointing at the node that follows the deleted one. This is described in detail with the remove operation below. To ensure that reads see the most recent value, value is declared volatile, which avoids locking.

To speed up locating a segment, and a hash slot within a segment, the number of hash slots in each segment is 2^n, so both the segment and the slot can be located with bit operations. When the concurrency level is the default 16 (that is, 16 segments), the high 4 bits of the hash value determine which segment an entry goes to. But remember the lesson from Introduction to Algorithms: a table whose size is 2^n can distribute keys unevenly across the slots, which is why the hash value must be re-hashed once. (This aside is arguably superfluous.)

Positioning operation:
final Segment<K,V> segmentFor(int hash) {
    return segments[(hash >>> segmentShift) & segmentMask];
}
Since ConcurrentHashMap uses the segment lock Segment to protect the data of each segment, it must first locate the Segment by hashing when inserting or retrieving an element. ConcurrentHashMap first re-hashes an element's hashCode with a variant of the Wang/Jenkins hash. The purpose of the re-hash is to reduce collisions so that elements distribute evenly over the segments, improving the container's access efficiency. If the hash quality were extremely poor, all elements would land in one Segment: access would be slow and the segmented lock would lose its meaning. Here is a small test that uses the hash values directly, without re-hashing:

System.out.println(Integer.parseInt("0001111", 2) & 15);
System.out.println(Integer.parseInt("0011111", 2) & 15);
System.out.println(Integer.parseInt("0111111", 2) & 15);
System.out.println(Integer.parseInt("1111111", 2) & 15);

All four outputs are 15. This shows that without the re-hash, collisions can be severe: as long as the low bits are the same, the result is the same no matter what the high bits are. Re-hashing the same four values gives the following results (padded to 32 bits with leading zeros and split every four bits with a vertical bar for readability):

0100|0111|0110|0111|1101|1010|0100|1110
1111|0111|0100|0011|0000|0001|1011|1000
0111|0111|0110|1001|0100|0110|0011|1110
1000|0011|0000|0000|1100|1000|0001|1010

Every bit of the data now takes part in the hash, which reduces collisions. ConcurrentHashMap locates the Segment with the hashing code below. By default segmentShift is 28 and segmentMask is 15; the re-hashed value is a 32-bit integer, and the unsigned right shift by 28 bits means the high 4 bits take part in the computation. The results of (hash >>> segmentShift) & segmentMask for the four values above are 4, 15, 7 and 8, so the hashes no longer collide.
final Segment<K,V> segmentFor(int hash) {
    return segments[(hash >>> segmentShift) & segmentMask];
}
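For clarity, here is a small runnable demo of the positioning arithmetic above. The rehash method is a single-word bit spreader written in the spirit of the Wang/Jenkins variant the text mentions, reproduced here only for illustration rather than as the exact JDK source, so its concrete output indexes may differ from the 4, 15, 7 and 8 quoted above; the point is that re-hashing separates keys that share the same low bits.

public class SegmentIndexDemo {
    static final int segmentShift = 28;   // 32 - 4, since there are 16 = 2^4 segments
    static final int segmentMask = 15;

    // Illustrative bit spreader (not guaranteed to be the exact JDK code).
    static int rehash(int h) {
        h += (h << 15) ^ 0xffffcd7d;
        h ^= (h >>> 10);
        h += (h << 3);
        h ^= (h >>> 6);
        h += (h << 2) + (h << 14);
        return h ^ (h >>> 16);
    }

    static int segmentIndex(int hash) {
        return (hash >>> segmentShift) & segmentMask;   // high 4 bits pick the segment
    }

    public static void main(String[] args) {
        int[] keys = {
            Integer.parseInt("0001111", 2),
            Integer.parseInt("0011111", 2),
            Integer.parseInt("0111111", 2),
            Integer.parseInt("1111111", 2)
        };
        for (int k : keys) {
            // Without re-hashing all four keys map to the same index (15).
            System.out.println("low bits only: " + (k & 15)
                    + "   segment after re-hash: " + segmentIndex(rehash(k)));
        }
    }
}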

All members of this data structure are final; segmentShift and segmentMask exist mainly to locate segments, see the segmentFor method above. We will not discuss the basic data structure of a hash table at length here. One very important aspect of a hash table is how it resolves collisions; ConcurrentHashMap and HashMap handle this the same way, chaining nodes with the same hash value into a hash chain. Unlike HashMap, ConcurrentHashMap uses multiple sub-hash tables, the segments (Segment). Each Segment is equivalent to a sub-hash table, and its data members are as follows:
static final class Segment<K,V> extends ReentrantLock implements Serializable {
    /**
     * The number of elements in this segment's region.
     */
    transient volatile int count;

    /**
     * Number of updates that alter the size of the table. This is
     * used during bulk-read methods to make sure they see a
     * consistent snapshot: if modCounts change during a traversal
     * of segments computing size or checking containsValue, then
     * we might have an inconsistent view of state so (usually)
     * must retry.
     */
    transient int modCount;

    /**
     * The table is rehashed if its size exceeds this threshold.
     * (The value of this field is always <tt>(int)(capacity * loadFactor)</tt>.)
     */
    transient int threshold;

    /**
     * The per-segment table.
     */
    transient volatile HashEntry<K,V>[] table;

    /**
     * The load factor for the hash table. Even though this value
     * is the same for all segments, it is replicated to avoid needing
     * links to the outer object.
     * @serial
     */
    final float loadFactor;
}
count records the number of entries in this segment. It is volatile and is used to coordinate modifications with reads, so that reads can see (almost) the latest modifications. The coordination works like this: every modifying operation that makes a structural change, such as adding or removing a node (changing a node's value is not a structural change), writes the count value as its last step, and every read operation starts by reading count. This exploits the strengthened volatile semantics of Java 5: a write to a volatile variable happens-before a subsequent read of the same variable. modCount counts how many times the segment's structure has changed, mainly so that cross-segment operations can detect whether a segment changed while several segments were being traversed; this is described in detail with the cross-segment operations. threshold is the value beyond which a rehash is needed. table holds the buckets; each array element is a hash chain expressed with HashEntry. table is also volatile, which makes it possible to read the latest table value without synchronization. loadFactor is the load factor.

remove operation: remove(key)
public V remove(Object key) {
    int hash = hash(key.hashCode());
    return segmentFor(hash).remove(key, hash, null);
}

The whole operation locates the Segment and then delegates to that Segment's remove. When several remove operations run concurrently, they can proceed in parallel as long as they target different segments. Below is Segment's implementation of remove:
V remove(Object key, int hash, Object value) {
    lock();
    try {
        int c = count - 1;
        HashEntry<K,V>[] tab = table;
        int index = hash & (tab.length - 1);
        HashEntry<K,V> first = tab[index];
        HashEntry<K,V> e = first;
        while (e != null && (e.hash != hash || !key.equals(e.key)))
            e = e.next;

        V oldValue = null;
        if (e != null) {
            V v = e.value;
            if (value == null || value.equals(v)) {
                oldValue = v;
                // All entries following removed node can stay
                // in list, but all preceding ones need to be cloned.
                ++modCount;
                HashEntry<K,V> newFirst = e.next;
                for (HashEntry<K,V> p = first; p != e; p = p.next)      // (*)
                    newFirst = new HashEntry<K,V>(p.key, p.hash,        // (*)
                                                  newFirst, p.value);
                tab[index] = newFirst;
                count = c; // write-volatile
            }
        }
        return oldValue;
    } finally {
        unlock();
    }
}
The whole operation runs while holding the segment lock. The lines before the blank line mainly locate the node e to be deleted. If the node does not exist, null is returned directly; otherwise every node in front of e must be copied, with the last of the copies pointing at the node after e. The nodes behind e need no copying and can be reused.

What does the for loop marked (*) do? From the code, it clones all the entries in front of e and splices them back onto the front of the chain. Is that necessary? Must the preceding elements be cloned every time an element is deleted? This is forced by the immutability of the entry: looking back at the HashEntry definition, every attribute except value is final, which means that once the next field has been set it can never change again, so the nodes in front of the deleted one must be cloned rather than relinked. As to why HashEntry is made immutable: immutable objects can be read without synchronization, which saves time. The original article illustrates this with two figures, a hash chain before a deletion and the same chain after deleting element 3. (The second figure is actually slightly off: among the copied nodes, the one holding 2 should come first and the one holding 1 after it, that is, exactly the reverse of the original order; fortunately this does not affect the discussion.)

The remove implementation is not complicated, but a couple of points deserve attention. First, when the node to be deleted exists, decrementing count is the last step. It must be the last step, otherwise a read operation might fail to see the structural modification the segment just made. Second, remove starts by assigning table to a local variable tab, because table is volatile and reading or writing a volatile variable is comparatively costly; the compiler cannot optimize volatile accesses, whereas subsequent accesses through the non-volatile local variable can be optimized freely.

get operation. ConcurrentHashMap's get is delegated directly to the Segment's get method, so let us look straight at Segment's get:
V get(Object key, int hash) {
    if (count != 0) { // read-volatile: is this segment empty?
        HashEntry<K,V> e = getFirst(hash); // get the head node of the chain
        while (e != null) {
            if (e.hash == hash && key.equals(e.key)) {
                V v = e.value;
                if (v != null)
                    return v;
                return readValueUnderLock(e); // recheck under lock
            }
            e = e.next;
        }
    }
    return null;
}
The get operation does not need to lock; only when the value it reads turns out to be null does it lock and reread. We know that Hashtable's get method must lock, so how does ConcurrentHashMap's get avoid it? The reason is that every shared variable used in its get method is declared volatile.

The first step reads the count variable, which is volatile; since every modifying operation writes count as its very last step when it changes the structure, this mechanism guarantees that get sees (almost) the latest structural updates. For non-structural updates, that is, changes to a node's value, the HashEntry value field is volatile, so the latest value is read as well.

The next step traverses the hash chain according to hash and key to find the node; if it is not found, null is returned directly. The chain can be traversed without locking because the next pointers are final. The head pointer, however, is not final: it is returned by getFirst(hash) out of the table array, so getFirst(hash) may return an out-of-date head node. For example, while get is executing, just after getFirst(hash) has returned, another thread may perform a delete and update the head node, in which case the head node used inside get is no longer the latest one. This is allowed: through the coordination on the count variable, get reads almost the latest data, though perhaps not the very latest. To always read the latest data, full synchronization would be required.

Finally, if the desired node is found and its value is non-null, the value is returned directly; otherwise it is read again while holding the lock. This may seem puzzling, since in theory a node's value can never be null: put checks for null and throws a NullPointerException. The only possible source of a null value is the default value of a freshly constructed HashEntry, because value is not final and an unsynchronized read may observe it before it is initialized. Look closely at the statement in put, tab[index] = new HashEntry<K,V>(key, hash, first, value): the assignment to value inside the HashEntry constructor and the assignment to tab[index] may be reordered, which can make the node's value appear null to a reader. When v is null, some other thread may be in the middle of modifying the node; since the preceding read did not lock, by the Bernstein conditions a read-after-write or write-after-read race can yield inconsistent data, so e's value is read again under the lock to guarantee a correct result.
V readValueUnderLock(HashEntry<K,V> e) {
    lock();
    try {
        return e.value;
    } finally {
        unlock();
    }
}
The shared variables in question are the count field that tracks the current Segment's size and the value field of HashEntry that stores the value. Variables declared volatile stay visible across threads and can be read concurrently without returning stale values, but they may only be written by a single thread at a time (or by multiple threads when the value written does not depend on the previous value). The get operation only reads, and never writes, the shared variables count and value, so it does not need to lock. The reason stale values are not read rests on the happens-before rule of the Java memory model: a write to a volatile field happens-before a subsequent read of it, so even if two threads modify and read the volatile variable at the same time, get obtains the latest value. This is the classic scenario of replacing a lock with volatile.

put operation. A put is likewise delegated to the Segment's put method. Here is the Segment's put:
V put(K key, int hash, V value, boolean onlyIfAbsent) {
    lock();
    try {
        int c = count;
        if (c++ > threshold) // ensure capacity
            rehash();
        HashEntry<K,V>[] tab = table;
        int index = hash & (tab.length - 1);
        HashEntry<K,V> first = tab[index];
        HashEntry<K,V> e = first;
        while (e != null && (e.hash != hash || !key.equals(e.key)))
            e = e.next;

        V oldValue;
        if (e != null) {
            oldValue = e.value;
            if (!onlyIfAbsent)
                e.value = value;
        }
        else {
            oldValue = null;
            ++modCount;
            tab[index] = new HashEntry<K,V>(key, hash, first, value);
            count = c; // write-volatile
        }
        return oldValue;
    } finally {
        unlock();
    }
}
This method also runs while holding the lock (the whole Segment is locked), which is of course for concurrency safety: modifications must not race. It must first check whether the capacity would be exceeded, to guarantee that a rehash happens when needed. Next, the chain is searched for a node with the same key; if one exists, its value is simply replaced. Otherwise a new node is created and added at the head of the hash chain, and in that case modCount and count must be updated, with the write to count again being the last step. The put method calls rehash; the rehash method is quite sophisticated, mainly exploiting the fact that the table size is 2^n, and is not covered here.

The slightly harder line to understand is int index = hash & (tab.length - 1): the Segment itself is the real hash table, that is, each Segment is a classic hash table. This line finds which slot of the table the entry belongs in; the entry at that slot is the head node of the chain. If e != null, a matching node was found and its value is replaced (when onlyIfAbsent == false); otherwise a new entry is needed, its successor is first, and tab[index] is made to point at it, which simply means the new entry is inserted at the head of the chain. The rest is easy to follow. Because put writes to shared variables, it must lock around those writes for thread safety. The put method first locates the Segment and then performs the insertion inside it. The insertion takes two steps: first decide whether the HashEntry array inside the Segment needs to grow, then locate the element's position and place it in the HashEntry array. (A small demo of the index computation follows the two points below.)
    • Whether to expand. Before inserting an element, the Segment checks whether its HashEntry array exceeds the capacity threshold, and expands the array if it does. It is worth noting that this check is more sensible than HashMap's: HashMap only checks whether the capacity has been reached after inserting the element, and expands if so, but it is quite possible that no further element is ever inserted after that expansion, in which case HashMap has expanded for nothing.
    • How to expand. To expand, first create an array twice the size of the original, then re-hash the elements of the original array into the new array. For efficiency, ConcurrentHashMap does not expand the whole container, only the particular Segment.
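As promised above, here is a small demo of the index computation hash & (tab.length - 1): when the table length is a power of two, the mask picks out exactly the low bits, so the result equals hash % length for non-negative hashes while needing only a single AND instruction. The class name and sample values are illustrative.

public class IndexMaskDemo {
    public static void main(String[] args) {
        int length = 16;                           // a power of two, like a segment's table
        int[] hashes = {7, 23, 40, 1234567};
        for (int h : hashes) {
            int byMask = h & (length - 1);         // what the Segment code computes
            int byMod  = h % length;               // the equivalent (slower) modulo
            System.out.println("hash=" + h + "  &(len-1)=" + byMask + "  %len=" + byMod);
        }
    }
}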
Another operation is containsKey; its implementation is much simpler because it never needs to read a value:
boolean containsKey(Object key, int hash) {
    if (count != 0) { // read-volatile
        HashEntry<K,V> e = getFirst(hash);
        while (e != null) {
            if (e.hash == hash && key.equals(e.key))
                return true;
            e = e.next;
        }
    }
    return false;
}
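Both get and containsKey rely on the coordination described earlier: a writer publishes a structural change by writing the volatile count as its last step, and a reader begins by reading count, so the volatile write/read pair gives the reader visibility of that change. Below is a minimal sketch of this pattern under simplified assumptions; the class and field names are illustrative, synchronized stands in for the Segment's ReentrantLock, and the put here always prepends rather than replacing an existing key. It is a sketch of the idea, not the JDK source.

class SegmentSketch<K, V> {
    static final class Node<K, V> {
        final K key;
        final int hash;
        volatile V value;
        final Node<K, V> next;
        Node(K key, int hash, V value, Node<K, V> next) {
            this.key = key; this.hash = hash; this.value = value; this.next = next;
        }
    }

    transient volatile int count;                     // written last by writers, read first by readers
    transient volatile Node<K, V>[] table;

    @SuppressWarnings("unchecked")
    SegmentSketch(int capacity) {
        table = (Node<K, V>[]) new Node[capacity];    // capacity assumed to be a power of two
    }

    synchronized V put(K key, V value) {
        int c = count + 1;
        Node<K, V>[] tab = table;
        int hash = key.hashCode();
        int index = hash & (tab.length - 1);
        // Always prepends; a real implementation would first search for an existing key.
        tab[index] = new Node<>(key, hash, value, tab[index]); // structural change first...
        count = c;                                             // ...write-volatile last
        return value;
    }

    V get(Object key) {
        if (count != 0) {                             // read-volatile first
            Node<K, V>[] tab = table;
            int hash = key.hashCode();
            for (Node<K, V> e = tab[hash & (tab.length - 1)]; e != null; e = e.next) {
                if (e.hash == hash && key.equals(e.key))
                    return e.value;                   // value is volatile, so not stale
            }
        }
        return null;
    }
}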
size() operation. To count the number of elements in the whole ConcurrentHashMap, the sizes of all the segments must be summed. Each Segment's count field is volatile, so in a multithreaded scenario, can we simply add up every Segment's count to obtain the total size? No: although each addition reads the latest value of that Segment's count, a count read earlier in the summation may already have changed by the time the later counts are accumulated, so the sum is not a consistent snapshot. The safest approach would be to lock every Segment's put, remove and clean methods while counting, but that is obviously very inefficient.

Because the probability that counts accumulated earlier change during the summation is small, ConcurrentHashMap first tries twice to compute the size of each Segment without locking the segments; only if the container's counts change during the statistics does it lock all the segments and count again. How does ConcurrentHashMap detect, while counting, that the container has changed? With the modCount variable: modCount is incremented before the put, remove and clean methods modify anything, so comparing modCount before and after computing the size reveals whether the container changed. A sketch of this strategy follows.
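Below is a sketch of that retry-then-lock strategy. The Segment class here only carries the count and modCount fields discussed above and extends ReentrantLock as described earlier; the retry constant and overall shape are illustrative rather than the JDK source.

import java.util.concurrent.locks.ReentrantLock;

public class SizeSketch {
    static class Segment extends ReentrantLock {
        transient volatile int count;
        transient int modCount;
    }

    static int size(Segment[] segments) {
        final int RETRIES_BEFORE_LOCK = 2;            // illustrative retry count
        // First, try a fixed number of passes without locking: sum the counts and use
        // the (monotonically increasing) modCounts to detect whether any segment was
        // structurally modified while we were summing.
        for (int attempt = 0; attempt < RETRIES_BEFORE_LOCK; attempt++) {
            int sum = 0;
            int mcBefore = 0, mcAfter = 0;
            for (Segment s : segments) {
                mcBefore += s.modCount;
                sum += s.count;
            }
            for (Segment s : segments)
                mcAfter += s.modCount;
            if (mcBefore == mcAfter)                  // nothing changed during this pass
                return sum;
        }
        // The unlocked passes kept racing with writers: lock every segment in a fixed
        // order (to avoid deadlock), sum the counts, then unlock them all.
        for (Segment s : segments)
            s.lock();
        try {
            int sum = 0;
            for (Segment s : segments)
                sum += s.count;
            return sum;
        } finally {
            for (Segment s : segments)
                s.unlock();
        }
    }
}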
