Java CAS: in-depth analysis of the principle and the implementation of the concurrent package

Source: Internet
Author: User
Tags: CAS, volatile

The java.util.concurrent package is built entirely on CAS; without CAS, this package would not exist. This shows how important CAS is.

CAS

CAS: compare and swap.

In the java.util.concurrent package, CAS is used to implement a kind of optimistic lock, as distinct from the synchronized lock.

This article starts with the applications of CAS and then analyzes the principle.

CAS Applications

CAS has 3 operands: the memory value V, the expected old value A, and the new value B. If and only if the expected value A equals the memory value V, the memory value V is changed to B; otherwise nothing is done.
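The three operands can be seen directly with AtomicInteger.compareAndSet. The following is a minimal sketch; the class name CasSemanticsDemo and the concrete values are only illustrative.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CasSemanticsDemo {
    /** Runs two CAS attempts and reports "firstResult secondResult finalValue". */
    static String run() {
        AtomicInteger v = new AtomicInteger(5);   // memory value V = 5

        // Expected value A = 5 equals V, so V is changed to B = 6.
        boolean first = v.compareAndSet(5, 6);

        // Expected value A = 5 no longer equals V (now 6), so nothing is done.
        boolean second = v.compareAndSet(5, 7);

        return first + " " + second + " " + v.get();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "true false 6"
    }
}
```

The second call fails without blocking and without changing the value; the caller decides whether to retry.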

Non-blocking algorithms

In a non-blocking algorithm, the failure or suspension of one thread should not cause the failure or suspension of other threads.

Modern CPUs provide special instructions that update shared data atomically and can detect interference from other threads; compareAndSet() uses these instructions instead of locks.

Take AtomicInteger as an example to study how the data stays correct without a lock.

private volatile int value;

First of all, when there is no lock, the volatile keyword is needed to ensure that the data is visible (shared) between threads.

This way, reading the variable always returns its latest value directly.

public final int get() {
    return value;
}

Then look at how ++i is done.

public final int incrementAndGet() {
    for (;;) {
        int current = get();
        int next = current + 1;
        if (compareAndSet(current, next))
            return next;
    }
}

A CAS operation is used here: each iteration reads the current value from memory, computes that value plus 1, and attempts a CAS with the result. If the CAS succeeds the result is returned; otherwise the loop retries until it succeeds.
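The same read, compute, CAS, retry loop works for updates other than +1. As a sketch, the hypothetical helper updateMax below (not part of the JDK) uses it to keep a running maximum across threads.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CasRetryLoopDemo {
    /** Atomically raises value to at least candidate and returns the resulting value. */
    static int updateMax(AtomicInteger value, int candidate) {
        for (;;) {
            int current = value.get();                 // read the current memory value
            if (candidate <= current) {
                return current;                        // already large enough, nothing to write
            }
            if (value.compareAndSet(current, candidate)) {
                return candidate;                      // CAS succeeded
            }
            // CAS failed: another thread changed the value in between, so re-read and retry.
        }
    }

    /** Four threads each propose 1000 candidates; the largest candidate overall is 4000. */
    static int runConcurrently() throws InterruptedException {
        AtomicInteger max = new AtomicInteger(0);
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            final int base = t * 1000;
            threads[t] = new Thread(() -> {
                for (int i = 1; i <= 1000; i++) {
                    updateMax(max, base + i);
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        return max.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runConcurrently()); // prints 4000
    }
}
```

Only the computation of the new value differs from incrementAndGet; the retry structure is the same.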

compareAndSet in turn uses JNI to execute the CPU instruction.

public final boolean compareAndSet(int expect, int update) {
    return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
}

That is the whole process: the CPU's CAS instruction is used, through JNI, to implement Java's non-blocking algorithm. Other atomic operations are implemented in a similar way.

Here, the call

unsafe.compareAndSwapInt(this, valueOffset, expect, update);

behaves like the following pseudocode, in which "this" stands for the value stored in the field:

if (this == expect) {
    this = update;
    return true;
} else {
    return false;
}

So the question is: the successful path takes 2 steps, the comparison this == expect and the replacement this = update. How does compareAndSwapInt make these two steps atomic? See the principle of CAS below.

CAS principle

CAS is implemented by calling native code through JNI. JNI (Java Native Interface) is Java's mechanism for native calls; it allows Java to call code written in other languages.

compareAndSwapInt is implemented in C++ code that invokes low-level CPU instructions.

The following analysis uses a commonly used CPU (Intel x86) to explain the implementation of CAS.

The following is the source code of the compareAndSwapInt() method of the sun.misc.Unsafe class:

public final native boolean compareAndSwapInt(Object o, long offset,
                                              int expected,
                                              int x);

You can see that this is a native method call. The C++ code that this native method calls, in order, in OpenJDK is: unsafe.cpp, atomic.cpp and atomic_windows_x86.inline.hpp. The final implementation of this native method is located at the following location in OpenJDK: openjdk-7-fcs-src-b147-27_jun_2011\openjdk\hotspot\src\os_cpu\windows_x86\vm\atomic_windows_x86.inline.hpp (for the Windows operating system on an x86 processor). Here is a fragment of the source code for the Intel x86 processor:

// Adding a lock prefix to an instruction on MP machine
// VC++ doesn't like the lock prefix to be on a single line
// so we can't insert a label after the lock prefix.
// By emitting a lock prefix, we can define a label after it.
#define LOCK_IF_MP(mp) __asm cmp mp, 0  \
                       __asm je L0      \
                       __asm _emit 0xF0 \
                       __asm L0:

inline jint     Atomic::cmpxchg    (jint     exchange_value, volatile jint*     dest, jint     compare_value) {
  // alternative for InterlockedCompareExchange
  int mp = os::is_MP();
  __asm {
    mov edx, dest
    mov ecx, exchange_value
    mov eax, compare_value
    LOCK_IF_MP(mp)
    cmpxchg dword ptr [edx], ecx
  }
}

As the source code above shows, the program decides whether to add the lock prefix to the cmpxchg instruction based on the type of the current processor. If the program is running on a multiprocessor, the lock prefix is added to the cmpxchg instruction (lock cmpxchg). Conversely, if the program is running on a single processor, the lock prefix is omitted (a single processor maintains sequential consistency within itself and does not need the memory barrier effect provided by the lock prefix).

Intel's manual describes the lock prefix as follows:

1. It ensures that the read-modify-write operation on memory is performed atomically. On the Pentium and earlier processors, an instruction with the lock prefix locks the bus during execution, so that other processors temporarily cannot access memory through the bus. Obviously, this is expensive. Starting with the Pentium 4, Intel Xeon and P6 processors, Intel added a significant optimization on top of the bus lock: if the memory area being accessed is already locked in the processor's internal cache while the lock-prefixed instruction executes (that is, the cache line containing the memory area is currently in the exclusive or modified state), and the memory area is fully contained in a single cache line, then the processor executes the instruction directly. Because the cache line stays locked while the instruction executes, other processors cannot read or write the memory area the instruction accesses, which guarantees the atomicity of the instruction. This procedure is called cache locking. Cache locking greatly reduces the execution overhead of lock-prefixed instructions, but the bus is still locked when contention between processors is high or when the memory address accessed by the instruction is not aligned.

2. It prevents this instruction from being reordered with the read and write instructions before and after it.

3. It flushes all data in the write buffer to memory.

Background knowledge:

There are 3 kinds of locking on CPUs:

3.1 Processor automatically guarantees the atomicity of basic memory operations

First, the processor automatically guarantees the atomicity of basic memory operations. The processor guarantees that reading or writing one byte of system memory is atomic, meaning that while one processor reads a byte, other processors cannot access the memory address of that byte. The P6 family and newer processors automatically guarantee that a single processor's 16/32/64-bit operations within the same cache line are atomic, but the processor does not automatically guarantee the atomicity of complex memory operations, such as accesses that span the bus width, span multiple cache lines, or span page tables. However, the processor provides two mechanisms, bus locking and cache locking, to ensure the atomicity of complex memory operations.

3.2 Use of the bus lock to ensure atomicity

The first mechanism is to guarantee atomicity by means of a bus lock. If multiple processors perform a read-modify-write operation on a shared variable at the same time (i++ is the classic read-modify-write operation), the shared variable is manipulated by multiple processors simultaneously, so the read-modify-write operation is not atomic, and the value of the shared variable after the operation may differ from what is expected. For example: if i=1 and we perform i++ twice, we expect the result to be 3, but the result may be 2.

The reason is that multiple processors may read the variable i from their respective caches at the same time, each add one to it, and then each write the result back to system memory. So if you want the read-modify-write of a shared variable to be atomic, you must ensure that while CPU1 performs the read-modify-write of the shared variable, CPU2 cannot operate on the cache that holds the memory address of that shared variable.
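The lost update can be replayed deterministically in a single thread by writing out the bad interleaving by hand. This is only a sketch: the local variables cpu1Read and cpu2Read stand in for the two processors' cached copies and are purely illustrative.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class LostUpdateDemo {
    /** Both "CPUs" read i before either one writes, as in the i++ example above. */
    static int plainIncrementTwice() {
        int i = 1;
        int cpu1Read = i;      // CPU1 reads 1
        int cpu2Read = i;      // CPU2 also reads 1
        i = cpu1Read + 1;      // CPU1 writes 2
        i = cpu2Read + 1;      // CPU2 writes 2 as well: one increment is lost
        return i;
    }

    /** Same interleaving, but every write is a CAS against the value that was read. */
    static int casIncrementTwice() {
        AtomicInteger i = new AtomicInteger(1);
        int cpu1Read = i.get();
        int cpu2Read = i.get();
        i.compareAndSet(cpu1Read, cpu1Read + 1);              // succeeds: 1 -> 2
        if (!i.compareAndSet(cpu2Read, cpu2Read + 1)) {       // fails: the value is 2, not 1
            cpu2Read = i.get();                               // re-read the current value, 2
            i.compareAndSet(cpu2Read, cpu2Read + 1);          // retry succeeds: 2 -> 3
        }
        return i.get();
    }

    public static void main(String[] args) {
        System.out.println(plainIncrementTwice()); // prints 2
        System.out.println(casIncrementTwice());   // prints 3
    }
}
```

With CAS the stale write is detected and retried instead of silently overwriting the other update.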

The processor uses a bus lock to solve this problem. A bus lock uses the LOCK# signal provided by the processor: when one processor asserts this signal on the bus, the requests of other processors are blocked, and that processor can use the shared memory exclusively.

3.3 Use of cache locks to guarantee atomicity

The second mechanism is to guarantee atomicity through cache locking. At any given moment we only need the operation on one memory address to be atomic, but a bus lock locks all communication between the CPUs and memory, so that while it is held other processors cannot operate on data at other memory addresses either. The cost of a bus lock is therefore relatively high, and recent processors use cache locking instead of bus locking in some situations as an optimization.

Frequently used memory is cached in the processor's L1, L2 and L3 caches, so atomic operations can be performed directly in the processor's internal cache without asserting a bus lock. The P6 family and newer processors can use "cache locking" to implement complex atomic operations. "Cache locking" means that if the memory area cached in the processor's cache line is locked for the duration of the lock operation, then when the processor writes the result of the lock operation back to memory, it does not assert the LOCK# signal on the bus; instead it modifies the memory address internally and relies on its cache coherency mechanism to guarantee the atomicity of the operation, because the cache coherency mechanism prevents data in a memory area cached by two or more processors from being modified simultaneously. When another processor writes back data for a cache line that has been locked, that cache line is invalidated. In the i++ example above, when CPU1 modifies i in its cache line using a cache lock, CPU2 cannot cache the cache line containing i at the same time.

However, there are two cases in which the processor does not use cache locking. In the first case, the processor falls back to a bus lock when the data being operated on cannot be cached inside the processor, or when the data spans multiple cache lines. In the second case, some processors do not support cache locking. On Intel486 and Pentium processors, a bus lock is used even if the locked memory area is in the processor's cache line.

Both mechanisms are available through the many lock-prefixed instructions that Intel processors provide, for example the bit test and modify instructions BTS, BTR and BTC, the exchange instructions XADD and CMPXCHG, and other arithmetic and logical instructions such as ADD and OR. The memory area operated on by these instructions is locked, so that other processors cannot access it at the same time.

CAS disadvantages

Although CAS is an efficient way to perform atomic operations, it still has three major problems: the ABA problem, the high cost of long spin times, and the fact that it can only guarantee the atomic operation of a single shared variable.

1. The ABA problem. CAS checks whether the value has changed before updating it and performs the update only if it has not changed. But if a value was originally A, was changed to B, and was then changed back to A, the CAS check will find that the value has not changed, although in fact it has. The solution to the ABA problem is to use a version number: attach a version number to the variable and increment it on every update, so that A-B-A becomes 1A-2B-3A.

Starting with Java 1.5, the JDK atomic package provides the class AtomicStampedReference to address the ABA problem. The compareAndSet method of this class first checks whether the current reference equals the expected reference and whether the current stamp equals the expected stamp; if both are equal, it atomically sets the reference and the stamp to the given update values.

Reference document on ABA issues: http://blog.hesey.net/2011/09/resolve-aba-by-atomicstampedreference.html
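The difference between a plain CAS and a stamped CAS can be sketched as follows. The class name AbaDemo and the string values are illustrative; string literals are used because both classes compare references with ==, and equal literals are the same interned object.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicStampedReference;

final class AbaDemo {
    /** A plain CAS cannot tell that the value went A -> B -> A in between. */
    static boolean plainCasAfterAba() {
        AtomicReference<String> ref = new AtomicReference<>("A");
        String seen = ref.get();                 // this thread sees "A"
        ref.compareAndSet("A", "B");             // meanwhile another update: A -> B
        ref.compareAndSet("B", "A");             // and back again: B -> A
        return ref.compareAndSet(seen, "C");     // succeeds although the value did change
    }

    /** With a version stamp, A-B-A becomes 1A-2B-3A and the stale CAS is rejected. */
    static boolean stampedCasAfterAba() {
        AtomicStampedReference<String> ref = new AtomicStampedReference<>("A", 1);
        int[] stampHolder = new int[1];
        String seen = ref.get(stampHolder);      // this thread sees "A" with stamp 1
        int seenStamp = stampHolder[0];
        ref.compareAndSet("A", "B", 1, 2);       // 1A -> 2B
        ref.compareAndSet("B", "A", 2, 3);       // 2B -> 3A
        // The reference matches, but the stamp is 3 rather than the expected 1, so this fails.
        return ref.compareAndSet(seen, "C", seenStamp, seenStamp + 1);
    }

    public static void main(String[] args) {
        System.out.println(plainCasAfterAba());   // prints true
        System.out.println(stampedCasAfterAba()); // prints false
    }
}
```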

2. Long spin time is expensive. If a spinning CAS is unsuccessful for a long time, it imposes a very large execution cost on the CPU. If the JVM can support the pause instruction provided by the processor, efficiency improves to some degree. The pause instruction has two effects. First, it delays the pipelined execution of instructions (de-pipeline), so that the CPU does not consume too many execution resources; the delay depends on the specific implementation, and on some processors it is zero. Second, it avoids the CPU pipeline flush caused by a memory order violation when exiting the loop, thereby improving CPU execution efficiency.
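Java 9 and later expose a spin hint as Thread.onSpinWait(). Whether a given JVM turns it into the processor's pause instruction is an implementation detail and is assumed rather than guaranteed here. The sketch below uses it inside a hypothetical CAS-based spin lock (SpinWaitDemo is not a JDK class).

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class SpinWaitDemo {
    private final AtomicBoolean locked = new AtomicBoolean(false);
    private int counter; // only accessed while "locked" is held

    void increment() {
        // Spin until the CAS from false to true succeeds.
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait(); // hint to the runtime that this thread is busy-waiting (Java 9+)
        }
        try {
            counter++;
        } finally {
            locked.set(false);   // volatile write releases the lock
        }
    }

    /** Four threads perform 1000 increments each under the spin lock. */
    static int run() throws InterruptedException {
        SpinWaitDemo demo = new SpinWaitDemo();
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    demo.increment();
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        return demo.counter;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run()); // prints 4000
    }
}
```

The hint does not change the result; it only gives the runtime a chance to make the failed spins cheaper.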

3. Only the atomic operation of a single shared variable can be guaranteed. When operating on one shared variable, we can use a CAS loop to guarantee atomicity, but when operating on multiple shared variables, a CAS loop cannot guarantee the atomicity of the operation. In that case you can use a lock, or use a trick: combine the multiple shared variables into one shared variable. For example, given two shared variables i=2 and j=a, combine them into ij=2a and then use CAS to operate on ij. Starting with Java 1.5, the JDK provides the AtomicReference class to guarantee the atomicity of updates to an object reference, so you can put multiple variables into one object and perform the CAS operation on it.
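As a sketch of the AtomicReference approach, the hypothetical immutable class Pair below holds i and j together, so that a single CAS on the reference replaces both values at once.

```java
import java.util.concurrent.atomic.AtomicReference;

final class CombinedStateDemo {
    /** Immutable holder that merges the two shared variables i and j into one object. */
    static final class Pair {
        final int i;
        final char j;

        Pair(int i, char j) {
            this.i = i;
            this.j = j;
        }
    }

    private final AtomicReference<Pair> state = new AtomicReference<>(new Pair(2, 'a'));

    /** Atomically advances i and j together with a CAS loop on the reference. */
    void advanceBoth() {
        for (;;) {
            Pair current = state.get();
            Pair next = new Pair(current.i + 1, (char) (current.j + 1));
            if (state.compareAndSet(current, next)) {
                return;
            }
            // Another thread replaced the pair first; retry against the new pair.
        }
    }

    /** Two threads advance the pair 5 times each, starting from i=2, j=a. */
    static String run() throws InterruptedException {
        CombinedStateDemo demo = new CombinedStateDemo();
        Runnable task = () -> {
            for (int n = 0; n < 5; n++) {
                demo.advanceBoth();
            }
        };
        Thread first = new Thread(task);
        Thread second = new Thread(task);
        first.start();
        second.start();
        first.join();
        second.join();
        Pair result = demo.state.get();
        return result.i + "" + result.j;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run()); // prints "12k"
    }
}
```

Because Pair is immutable, no thread can ever observe an i from one update combined with a j from another.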

Implementation of the concurrent package

Because Java's CAS has the memory semantics of both a volatile read and a volatile write, there are now four ways for Java threads to communicate:

1. Thread A writes a volatile variable, and then thread B reads that volatile variable.
2. Thread A writes a volatile variable, and then thread B updates that volatile variable with CAS.
3. Thread A updates a volatile variable with CAS, and then thread B updates that volatile variable with CAS.
4. Thread A updates a volatile variable with CAS, and then thread B reads that volatile variable.

Java's CAS uses the efficient machine-level atomic instructions provided by modern processors. These atomic instructions perform read-modify-write operations on memory atomically, which is the key to implementing synchronization on a multiprocessor (in essence, a computing machine that supports atomic read-modify-write instructions is an asynchronous equivalent of the Turing machine for sequential computation, so any modern multiprocessor supports some atomic instruction that can perform an atomic read-modify-write operation on memory). At the same time, reads and writes of volatile variables and CAS can implement communication between threads. Together, these features form the cornerstone of the entire concurrent package. If we carefully analyze the source code of the concurrent package, we find a general implementation pattern: first, declare the shared variable volatile; then, use the atomic conditional update of CAS to synchronize between threads; at the same time, use volatile reads/writes and the volatile read and write memory semantics of CAS to implement communication between threads.
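The pattern can be sketched with a volatile field and an AtomicIntegerFieldUpdater. The class VolatilePlusCasDemo and its "claim" semantics are hypothetical and only illustrate the pattern; this is not how any particular JDK class is written.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

final class VolatilePlusCasDemo {
    // Step 1: declare the shared variable volatile (0 = free, 1 = claimed).
    private volatile int state = 0;

    // Step 2: apply the atomic conditional update (CAS) to that volatile field.
    private static final AtomicIntegerFieldUpdater<VolatilePlusCasDemo> STATE =
            AtomicIntegerFieldUpdater.newUpdater(VolatilePlusCasDemo.class, "state");

    /** Only the thread whose CAS changes state from 0 to 1 gets true. */
    boolean tryClaim() {
        return STATE.compareAndSet(this, 0, 1);
    }

    /** Other threads learn about the claim through a plain volatile read. */
    boolean isClaimed() {
        return state == 1;
    }

    /** Lets threadCount threads race for the claim and counts how many of them won. */
    static int countWinners(int threadCount) throws InterruptedException {
        VolatilePlusCasDemo shared = new VolatilePlusCasDemo();
        AtomicInteger winners = new AtomicInteger();
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                if (shared.tryClaim()) {
                    winners.incrementAndGet();
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        return shared.isClaimed() ? winners.get() : -1;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countWinners(8)); // prints 1
    }
}
```

The CAS provides the synchronization (exactly one winner), and the volatile read in isClaimed provides the communication (every thread sees the result).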

AQS, the non-blocking data structures, and the atomic variable classes (the classes in the java.util.concurrent.atomic package) are the base classes of the concurrent package, and all of them are implemented using this pattern; the high-level classes in the concurrent package in turn depend on these base classes. Overall, the concurrent package is layered: volatile variable reads/writes and CAS at the bottom, the base classes built on them in the middle, and the high-level classes on top.


