After reading a pile of articles, I have finally sorted out the principle of Java CAS in depth. Thanks to Google's powerful search; and a jab at Baidu here: you can hardly learn anything relying on Baidu!
Reference Documentation:
http://www.blogjava.net/xylz/archive/2010/07/04/325206.html
http://blog.hesey.net/2011/09/resolve-aba-by-atomicstampedreference.html
http://www.searchsoa.com.cn/showcontent_69238.htm
http://ifeve.com/atomic-operation/
http://www.infoq.com/cn/articles/java-memory-model-5
The java.util.concurrent package is built entirely on CAS; without CAS this package would not exist, which shows how important CAS is.
CAS
CAS: Compare And Swap, i.e. "compare and exchange".
In the java.util.concurrent package, CAS is used to implement a kind of optimistic lock, as distinguished from the pessimistic synchronized lock.
This article starts with the application of CAS and then analyzes its principle in depth.
CAS Applications
CAS has three operands: a memory value V, an old expected value A, and a new value B to write. If and only if the expected value A equals the memory value V, the memory value V is changed to B; otherwise nothing is done.
Non-blocking algorithms (nonblocking algorithms)
An algorithm is non-blocking if the failure or suspension of one thread cannot cause the failure or suspension of another thread.
Modern CPUs provide special instructions that atomically update shared data and can detect interference from other threads, and compareAndSet() uses these instead of a lock.
Let's take AtomicInteger apart and study how its data stays correct without any lock.
private volatile int value;
First of all, in the absence of a lock mechanism, the volatile primitive is needed to ensure that data is visible (shared) between threads.
This way, the current value of the variable is read directly whenever it is fetched.
public final int get() {
    return value;
}
Then let's see how ++i is done.
public final int incrementAndGet() {
    for (;;) {
        int current = get();
        int next = current + 1;
        if (compareAndSet(current, next))
            return next;
    }
}
The CAS operation is used here: each iteration reads the current value from memory and then attempts a CAS with the value plus one. If it succeeds, the result is returned; otherwise it retries until it succeeds.
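To see this retry loop in action, here is a minimal self-contained demo (the class name AtomicIncrementDemo and the thread/iteration counts are our own, not from the original): four threads each increment a shared AtomicInteger 10,000 times, and no increments are lost even though no lock is taken.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicIncrementDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger(0);
        int threads = 4, perThread = 10_000;
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.incrementAndGet(); // lock-free ++i via CAS retry loop
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        System.out.println(counter.get()); // always 40000, never less
    }
}
```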
compareAndSet, in turn, uses JNI to invoke CPU instructions.
public final boolean compareAndSet(int expect, int update) {
    return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
}
The whole process works like this: using the CPU's CAS instruction, together with JNI, Java's non-blocking algorithm is implemented. The other atomic operations are done in a similar way.
In which
unsafe.compareAndSwapInt(this, valueOffset, expect, update);
is similar to:
if (this == expect) {
    this = update;
    return true;
} else {
    return false;
}
So here is the problem: a successful update takes two steps, comparing this == expect and then assigning this = update. How does compareAndSwapInt make these two steps atomic? Refer to the CAS principle.
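The required semantics can be sketched in pure Java by letting synchronized stand in for the hardware guarantee that the compare and the swap happen as one indivisible step. This is only an illustration (the class name SimulatedCAS is ours); the real Unsafe.compareAndSwapInt compiles down to a single CPU instruction instead of taking a lock.

```java
// A sketch of CAS semantics: synchronized plays the role of the hardware
// guarantee. Both the comparison and the write happen inside one critical
// section, so no other thread can slip in between them.
public class SimulatedCAS {
    private int value;

    public SimulatedCAS(int initial) { value = initial; }

    public synchronized int get() { return value; }

    public synchronized boolean compareAndSet(int expect, int update) {
        if (value == expect) {
            value = update;
            return true;  // compare succeeded, swap performed
        }
        return false;     // value changed in the meantime, nothing done
    }
}
```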
CAS Principle
CAS is implemented by invoking JNI code. JNI, the Java Native Interface, allows Java to call code written in other languages.
compareAndSwapInt is implemented in C++ and invokes the CPU's underlying instructions.
The following analysis uses a common CPU (Intel x86) to explain how CAS is implemented.
The following is the source code of the compareAndSwapInt() method of the sun.misc.Unsafe class:
public final native boolean compareAndSwapInt(Object o, long offset, int expected, int x);
You can see that this is a native method call. The native method in turn calls C++ code in OpenJDK: unsafe.cpp, atomic.cpp, and atomic_windows_x86.inline.hpp. The final implementation of this native method is located in OpenJDK at openjdk-7-fcs-src-b147-27June2011\openjdk\hotspot\src\os_cpu\windows_x86\vm\atomic_windows_x86.inline.hpp (corresponding to the Windows operating system and an x86 processor). The following is a fragment of the source code corresponding to the Intel x86 processor:
// Adding a lock prefix to an instruction on MP machine
// VC++ doesn't like the lock prefix to be on a single line
// so we can't insert a label after the lock prefix.
// By emitting a lock prefix, we can define a label after it.
#define LOCK_IF_MP(mp) __asm cmp mp, 0  \
                       __asm je L0      \
                       __asm _emit 0xF0 \
                       __asm L0:

inline jint Atomic::cmpxchg(jint exchange_value, volatile jint* dest, jint compare_value) {
  // alternative for InterlockedCompareExchange
  int mp = os::is_MP();
  __asm {
    mov edx, dest
    mov ecx, exchange_value
    mov eax, compare_value
    LOCK_IF_MP(mp)
    cmpxchg dword ptr [edx], ecx
  }
}
As the source above shows, the program decides whether to add the lock prefix to the cmpxchg instruction based on the current processor type. If the program is running on a multiprocessor, the lock prefix is added (lock cmpxchg). Conversely, if the program is running on a uniprocessor, the lock prefix is omitted (a single processor maintains sequential consistency within itself and does not need the memory-barrier effect provided by the lock prefix).
The Intel manual describes the lock prefix as follows:
- It ensures that a read-modify-write operation on memory is executed atomically. On processors up to and including the Pentium, an instruction with a lock prefix locks the bus during execution, so other processors temporarily cannot access memory through the bus. Obviously, this is costly. Starting with the Pentium 4, Intel Xeon, and P6 processors, Intel made a significant optimization over the original bus lock: if the memory area being accessed is already cached inside the processor during the lock-prefixed instruction (that is, the cache line containing the memory area is currently in the exclusive or modified state) and the area is fully contained in a single cache line, the processor executes the instruction directly. Because the cache line is locked during instruction execution, other processors cannot read or write the memory area the instruction accesses, which guarantees the atomicity of the instruction. This procedure is called cache locking. Cache locking significantly reduces the execution overhead of lock-prefixed instructions, but the bus will still be locked when contention between processors is high or when the memory address accessed by the instruction is misaligned.
- It forbids reordering this instruction with preceding and subsequent read and write instructions.
- It flushes all data in the write buffer to memory.
Background knowledge:
There are three locking-related mechanisms on the CPU:
3.1 The processor automatically guarantees the atomicity of basic memory operations
First, the processor automatically guarantees the atomicity of basic memory operations. The processor guarantees that reading or writing a single byte in system memory is atomic, meaning that while one processor reads a byte, other processors cannot access that byte's memory address. P6-family and newer processors can also automatically guarantee that 16/32/64-bit accesses within a single cache line are atomic, but the processor cannot automatically guarantee the atomicity of complex memory operations, such as accesses that cross the bus width, cross multiple cache lines, or cross page tables. However, the processor provides two mechanisms, bus locking and cache locking, to guarantee the atomicity of complex memory operations.
3.2 Using a bus lock to ensure atomicity
The first mechanism guarantees atomicity through a bus lock. If multiple processors read-modify-write a shared variable at the same time (i++ is the classic read-modify-write operation), the shared variable is manipulated by several processors simultaneously, the read-modify-write is no longer atomic, and the value of the shared variable after the operation may not match expectations. For example: if i = 1 and we perform i++ twice, we expect the result to be 3, but it may well be 2.
The reason is that multiple processors may read the variable i from their respective caches at the same time, each add one separately, and then write back to system memory. To ensure that a read-modify-write of the shared variable is atomic, we must ensure that while CPU1 is reading and modifying the shared variable, CPU2 cannot touch the cache line that caches the shared variable's memory address.
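The lost-update effect is easy to reproduce from Java (the class name LostUpdateDemo and the iteration counts are ours): a plain i++ from two threads typically loses increments, while the CAS-backed AtomicInteger never does.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LostUpdateDemo {
    static int plain = 0;                        // unsynchronized counter
    static AtomicInteger atomic = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                plain++;                         // read-modify-write, not atomic
                atomic.incrementAndGet();        // CAS-backed, atomic
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // atomic is always 200000; plain is usually less, because concurrent
        // increments read the same old value and overwrite each other
        System.out.println("plain=" + plain + " atomic=" + atomic.get());
    }
}
```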
The processor uses a bus lock to solve this problem. A bus lock is a LOCK# signal provided by the processor: when one processor asserts this signal on the bus, the requests of the other processors are blocked, and that processor can use the shared memory exclusively.
3.3 Using a cache lock to ensure atomicity
The second mechanism guarantees atomicity through cache locking. Often we only need the operation on one memory address to be atomic, but a bus lock seals off all communication between the CPU and memory, so while it is held other processors cannot touch data at any other memory address either, which makes bus locking relatively expensive. In some situations, recent processors therefore use cache locking instead of bus locking as an optimization.
Frequently used memory is cached in the processor's L1, L2, and L3 caches, so an atomic operation can be performed directly in the processor's internal cache without declaring a bus lock; on P6-family and most recent processors, "cache locking" can be used to achieve complex atomicity. A "cache lock" means that if the memory area is held in the processor's cache line and locked during the lock operation, then when the processor writes back to memory as part of the locked operation it does not assert the LOCK# signal on the bus; instead it modifies the memory address internally and lets its cache coherence mechanism guarantee the atomicity of the operation. The cache coherence mechanism prevents a memory region cached by two or more processors from being modified simultaneously, and a cache line that has been locked becomes invalid on other processors when its data is written back. In the i++ example above, when CPU1 modifies i in its cache line using a cache lock, CPU2 cannot simultaneously cache i's cache line.
However, there are two cases in which the processor will not use cache locking. The first is when the data being operated on cannot be cached inside the processor, or the operation spans multiple cache lines; then the processor invokes a bus lock. The second is that some processors do not support cache locking: on Intel 486 and Pentium processors, a bus lock is invoked even if the locked memory area lies within a single cache line of the processor.
The two mechanisms above can be triggered through the many lock-prefixed instructions of Intel processors, for example the bit test-and-modify instructions BTS, BTR, and BTC, the exchange instructions XADD and CMPXCHG, and some other operand and logic instructions such as ADD and OR. The memory area operated on by these instructions is locked, so other processors cannot access it at the same time.
Drawbacks of CAS
CAS is an efficient way to implement atomic operations, but it still has three major problems: the ABA problem, long spin times with high overhead, and the fact that atomicity can only be guaranteed for a single shared variable.
1. The ABA problem. Because CAS checks whether the value has changed before updating it, if a value was A, then became B, and then became A again, the CAS check will conclude that the value has not changed when in fact it has. The solution to the ABA problem is to use a version number: attach a version number to the variable and increment it on every update, so A-B-A becomes 1A-2B-3A.
Since Java 1.5, the JDK's atomic package has provided the class AtomicStampedReference to solve the ABA problem. Its compareAndSet method first checks whether the current reference equals the expected reference and whether the current stamp equals the expected stamp; only if both are equal does it atomically set the reference and the stamp to the given update values.
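A minimal sketch of detecting an A-B-A change with AtomicStampedReference (the class name AbaDemo and the values are ours):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class AbaDemo {
    public static void main(String[] args) {
        AtomicStampedReference<String> ref =
                new AtomicStampedReference<>("A", 0);

        int[] stampHolder = new int[1];
        String initial = ref.get(stampHolder);   // read value and stamp together
        int initialStamp = stampHolder[0];

        // Another thread performs A -> B -> A, bumping the stamp each time.
        ref.compareAndSet("A", "B", 0, 1);
        ref.compareAndSet("B", "A", 1, 2);

        // A plain value comparison would still see "A" here, but the stamp
        // check fails, so the ABA change is detected and the update refused.
        boolean updated = ref.compareAndSet(initial, "C",
                                            initialStamp, initialStamp + 1);
        System.out.println(updated); // false: the stamp moved from 0 to 2
    }
}
```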
For more on the ABA problem, see: http://blog.hesey.net/2011/09/resolve-aba-by-atomicstampedreference.html
2. Long spin times with high overhead. A spinning CAS that keeps failing imposes a very large execution overhead on the CPU. If the JVM can use the pause instruction provided by the processor, efficiency improves somewhat. The pause instruction does two things: first, it delays the pipeline (de-pipelines) so that the CPU does not consume excessive execution resources; the delay depends on the specific implementation, and on some processors it is zero. Second, it prevents the CPU pipeline from being flushed (CPU pipeline flush) when leaving the loop due to a memory order violation, which improves CPU execution efficiency.
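Since Java 9, Java exposes a related hint directly: Thread.onSpinWait(), which the JIT can compile to the x86 pause instruction. A minimal spin lock using it (the class name SpinWaitDemo is ours; this is an illustration of the hint, not how the JDK implements its locks):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinWaitDemo {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        // Spin until the CAS succeeds. Thread.onSpinWait() (Java 9+) tells
        // the JIT this is a busy-wait loop, so it can emit pause and reduce
        // the spin overhead described above.
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait();
        }
    }

    public void unlock() {
        locked.set(false);
    }

    public boolean isLocked() {
        return locked.get();
    }
}
```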
3. Only atomic operations on a single shared variable can be guaranteed. When operating on one shared variable, we can use a CAS loop to guarantee atomicity, but a CAS loop cannot guarantee atomicity across multiple shared variables. In that case we can use a lock, or there is a trick: combine the multiple shared variables into one. For example, with two shared variables i = 2 and j = a, merge them into ij = 2a and then use CAS to operate on ij. Since Java 1.5, the JDK has provided the AtomicReference class to guarantee atomicity of reference objects, so you can put multiple variables into one object and perform CAS on it.
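A minimal sketch of the merge trick with AtomicReference (class and field names are ours): two logical variables are packed into one immutable object, so a single CAS updates both atomically.

```java
import java.util.concurrent.atomic.AtomicReference;

public class PairCasDemo {
    // Immutable holder: two logical variables combined into one reference.
    static final class Pair {
        final int i;
        final String j;
        Pair(int i, String j) { this.i = i; this.j = j; }
    }

    public static void main(String[] args) {
        AtomicReference<Pair> ref = new AtomicReference<>(new Pair(2, "a"));

        Pair current = ref.get();
        Pair next = new Pair(current.i + 1, "b");

        // Succeeds only if no other thread replaced the pair in the meantime;
        // both i and j change together or not at all.
        boolean swapped = ref.compareAndSet(current, next);
        System.out.println(swapped + " " + ref.get().i + ref.get().j); // true 3b
    }
}
```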
Implementation of the concurrent package
Because CAS in Java has both volatile-read and volatile-write memory semantics, communication between Java threads now takes one of the following four forms:
- Thread A writes a volatile variable, then thread B reads that volatile variable.
- Thread A writes a volatile variable, then thread B updates that volatile variable with CAS.
- Thread A updates a volatile variable with CAS, then thread B updates that volatile variable with CAS.
- Thread A updates a volatile variable with CAS, then thread B reads that volatile variable.
CAS in Java uses the efficient machine-level atomic instructions available on modern processors, which atomically perform read-modify-write operations on memory; this is the key to achieving synchronization on a multiprocessor (essentially, a computer that supports atomic read-modify-write instructions is, computationally, an asynchronous machine equivalent to a sequential Turing machine, so any modern multiprocessor supports some atomic instruction that performs an atomic read-modify-write on memory). At the same time, reads/writes of volatile variables and CAS can implement communication between threads. Taken together, these features form the cornerstone of the entire concurrent package. If we carefully analyze the source code of the concurrent package, we find a generalized implementation pattern:
- First, declare the shared variable volatile;
- Then, use CAS's atomic conditional update to implement synchronization between threads;
- At the same time, use volatile reads/writes together with the volatile memory semantics of CAS to implement communication between threads.
AQS, the non-blocking data structures, and the atomic variable classes (the classes in the java.util.concurrent.atomic package) are the base classes of the concurrent package, and all of them are implemented with this pattern; the higher-level classes in the concurrent package in turn rely on these base classes. Overall, the concurrent package is implemented as follows:
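The pattern above can be sketched with the classic Treiber stack, a minimal non-blocking data structure built from exactly one volatile shared variable (inside an AtomicReference) updated only through CAS retry loops. This sketch is ours for illustration, not the JDK's implementation:

```java
import java.util.concurrent.atomic.AtomicReference;

public class ConcurrentStack<E> {
    private static final class Node<E> {
        final E item;
        Node<E> next;
        Node(E item) { this.item = item; }
    }

    // The single shared variable of the pattern: a volatile reference to the top.
    private final AtomicReference<Node<E>> top = new AtomicReference<>();

    public void push(E item) {
        Node<E> newHead = new Node<>(item);
        Node<E> oldHead;
        do {
            oldHead = top.get();            // volatile read: see others' pushes
            newHead.next = oldHead;
        } while (!top.compareAndSet(oldHead, newHead)); // retry on contention
    }

    public E pop() {
        Node<E> oldHead;
        Node<E> newHead;
        do {
            oldHead = top.get();
            if (oldHead == null) return null;
            newHead = oldHead.next;
        } while (!top.compareAndSet(oldHead, newHead));
        return oldHead.item;
    }
}
```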