Multithreading mechanisms in Java: cache consistency and CAS
I. Bus locking and cache consistency
These are two concepts at the hardware/operating-system level. With the arrival of the multicore era, concurrency has become the norm, so the system must provide mechanisms and primitives that make basic operations atomic; for example, the processor must guarantee that reading or writing a single byte is atomic. How is that achieved? There are two mechanisms: bus locking and cache consistency.
Communication between the CPU and physical memory is much slower than the CPU's processing speed, so the CPU has its own internal cache, reading in-memory data into that cache according to certain rules to speed up frequent accesses. If a machine had only one CPU and one cache, every process and thread would see the same cached values and there would be no problem. But servers today usually have multiple CPUs, and more commonly multiple cores per CPU, with each core maintaining its own cache. Under multithreaded concurrency this produces cache inconsistency, which can cause serious problems.
Take i++ as an example, with i initially 0. At the start, each core's cache holds the value 0 for i. When the first core executes i++, the value in its cache becomes 1; even if it writes back to main memory immediately, the value of i in the second core's cache is still 0. When that core executes i++ and writes back to memory, it overwrites the first core's update, making the final result 1 instead of the expected 2.
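The lost update can be reproduced in Java itself (a minimal sketch; the class name is illustrative, and since the race depends on thread scheduling, the final value varies from run to run):

```java
// Two threads each increment a plain int 100,000 times.
// i++ is read / add / write back, so increments can be lost.
public class LostUpdate {
    static int i = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int n = 0; n < 100_000; n++) {
                i++; // not atomic: concurrent writes overwrite each other
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // At most 200,000, and typically less when updates are lost.
        System.out.println("i = " + i);
    }
}
```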
So how is this solved? One mechanism the system provides is bus locking. The front-side bus (also called the CPU bus) is the backbone connecting the CPUs to the chipset; it handles the CPU's communication with the outside world, including cache, memory, and the north bridge. Its control bus sends control signals to the various components, the address bus specifies which component to access, and the data bus transfers data in both directions. When CPU1 performs i++, it asserts a LOCK# signal on the bus so that other processors cannot touch the cache line backing the shared variable's memory address; in other words, the other CPUs are blocked, and this processor has exclusive use of that shared memory.
But we only need the operation on this one shared variable to be atomic, whereas a bus lock locks all CPU-memory communication: while it is held, other processors cannot access data at any other memory address either, so the cost is high. Later CPUs therefore provide a cache consistency mechanism; Intel introduced this optimization in processors after the 486 generation.
With cache consistency, broadly speaking, when one CPU operates on data in its cache, it notifies the other CPUs to discard their cached copies, or to re-read the data from main memory.
This is specified in detail by the MESI protocol, which is widely used in Intel processor families.
The MESI protocol is named after the four states (Modified, Exclusive, Shared, Invalid) that a cache line (the basic unit of caching, typically 64 bytes on Intel CPUs) can be in. The protocol maintains two state bits per cache line, so each line is in one of the four states M, E, S, or I, with the following meanings:
M: Modified. The data is cached only in this CPU and in no other; it has been modified relative to the value in memory and has not yet been written back.
E: Exclusive. The data is cached only in this CPU and has not been modified; it is consistent with memory.
S: Shared. The data is cached in multiple CPUs and is consistent with memory.
I: Invalid. The cached copy in this CPU is invalid.
The protocol also defines how each cache must snoop (listen to) the bus:
A cache line in the M state must snoop all attempts to read its corresponding main-memory address; if one is observed, it must write the data in the cache line back to memory before that read is allowed to proceed.
A cache line in the S state must snoop requests to invalidate the line or to gain exclusive access to it; if one is observed, it must set the line's state to I.
A cache line in the E state must snoop other attempts to read its corresponding main-memory address; if one is observed, it must set the line's state to S.
When a CPU needs to read data: if its cache line is in state I, it must read from memory and set the state to S; otherwise it can read the cached value directly. Before doing so, however, it must wait for the other CPUs' snoop results; for example, if another CPU also caches the data in state M, it must wait for that CPU to write its cache back to memory before reading.
When a CPU needs to write data, it can proceed only if its cache line is in state M or E; otherwise it must issue a special RFO (Request For Ownership, a bus transaction) telling the other CPUs to invalidate (I) their copies, which is relatively expensive. Once the write completes, the line's state is set to M.
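The rules above can be sketched as a toy state machine (a deliberate simplification with illustrative names; a real cache also tracks whether other caches hold the line, so a local read miss would enter E rather than S when no other cache has a copy):

```java
// Toy model of the MESI transitions described above.
enum MesiState { MODIFIED, EXCLUSIVE, SHARED, INVALID }

class MesiLine {
    MesiState state = MesiState.INVALID;

    // This CPU reads the line: an invalid line is fetched from memory.
    MesiState localRead() {
        if (state == MesiState.INVALID) state = MesiState.SHARED;
        return state; // M, E and S lines can be read directly
    }

    // This CPU writes the line: S and I lines must first issue an RFO
    // (invalidating other copies); the line then becomes Modified.
    MesiState localWrite() {
        state = MesiState.MODIFIED;
        return state;
    }

    // Another CPU reads this address: an M line is written back first,
    // and both M and E lines drop to Shared.
    MesiState remoteRead() {
        if (state == MesiState.MODIFIED || state == MesiState.EXCLUSIVE) {
            state = MesiState.SHARED;
        }
        return state;
    }

    // Another CPU's RFO (it wants to write): our copy becomes Invalid.
    MesiState remoteWrite() {
        state = MesiState.INVALID;
        return state;
    }
}
```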
So if a variable is frequently modified by only one thread during some period, the CPU's internal cache suffices and no bus transactions are involved. But if the cache line keeps bouncing between CPUs, each taking exclusive ownership in turn, RFO requests are issued continuously and concurrency performance suffers. Note that frequent exclusive ownership is not simply a matter of more threads making it more likely; it depends on how the CPUs coordinate. This is somewhat like the fact that multithreading does not always improve efficiency: when the suspend-and-schedule overhead exceeds the cost of executing the task itself, threads hurt rather than help, and likewise with multiple CPUs, unreasonable scheduling can make the RFO overhead exceed the cost of the task. Of course, this is not something the programmer normally needs to handle; the system makes the relevant decisions about memory addresses, which is beyond the scope of this article.
Cache consistency is not used in all cases. If the data being operated on cannot be cached inside the CPU, or the operation spans multiple cache lines (so the state cannot be tracked), the processor falls back to a bus lock; and when the CPU does not support cache locking at all, such as 486-era and older CPUs, only bus locking is available.
II. CAS (Compare and Swap)
With bus locking and cache consistency covered in the previous section, CAS is easier to understand. It is not Java-specific; it is something the hardware must guarantee. The CAS instruction, known as CMPXCHG on Intel CPUs, compares the content of a specified memory address with a given expected value: if they are equal, it replaces the content with the new value supplied in the instruction; if not, the update fails. This compare-and-exchange is atomic and cannot be interrupted, and its atomicity rests on the bus locking and cache consistency mechanisms described above. Note that CAS still involves three steps, read, compare (itself an operation), and write, which is not much different from the earlier i++. Right, as operations they are no different; but CAS's atomicity is guaranteed by a hardware instruction while i++'s is not, and hardware-level atomicity is far faster than anything a high-level language can achieve at the software level. Although CAS comprises multiple operations, those operations are fixed (a single comparison), so the locking overhead is minimal.
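The instruction's semantics can be mimicked in plain Java (an emulation only: real CAS is a single atomic CPU instruction, and the synchronized keyword here merely stands in for that hardware-provided atomicity; the class name is illustrative):

```java
// Emulation of CAS semantics: "if value == expect, store update".
class SimulatedCAS {
    private int value;

    SimulatedCAS(int initial) { value = initial; }

    // Atomically compare the current value with expect; on a match,
    // store update and report success, otherwise change nothing.
    synchronized boolean compareAndSwap(int expect, int update) {
        if (value == expect) {
            value = update;
            return true;
        }
        return false;
    }

    synchronized int get() { return value; }
}
```

For example, compareAndSwap(2, 3) on a cell holding 2 succeeds and stores 3; a subsequent compareAndSwap(2, 4) fails because the current value is no longer 2.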
With the rise of the internet industry and the advance of multi-CPU/multi-core hardware, high concurrency has become increasingly common, and CAS has been used more and more widely, in the Java world as well. JDK 1.4 was released in February 2002, when hardware was far less advanced and multi-CPU/multi-core machines were not yet widespread, so before JDK 5, synchronized achieved thread synchronization by suspending threads and waiting for rescheduling, which was expensive. As hardware kept improving, JDK 5 (September 2004) introduced the CAS mechanism, compare and swap, to address this: in general a thread no longer needs to be suspended (see the discussion of lock levels later; suspension happens only once a lock inflates to a heavyweight lock), but simply retries. It is an optimistic locking mechanism built on the underlying CPU instruction: optimistic with respect to memory, because before updating it compares the current value with the value read before the update; if they match, it updates; if not, it loops (so-called spinning) until they do.
Take the AtomicInteger class in java.util.concurrent.atomic as an example. The source of its getAndIncrement() method (get, then increment, i.e., i++) is as follows:
/**
 * Atomically increments by one the current value.
 *
 * @return the previous value
 */
public final int getAndIncrement() {
    for (;;) {
        int current = get();
        int next = current + 1;
        if (compareAndSet(current, next))
            return current;
    }
}

/**
 * Atomically sets the value to the given updated value
 * if the current value {@code ==} the expected value.
 *
 * @param expect the expected value
 * @param update the new value
 * @return true if successful. False return indicates that
 * the actual value was not equal to the expected value.
 */
public final boolean compareAndSet(int expect, int update) {
    return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
}
It calls the compareAndSet(int expect, int update) method, where expect is the expected value, i.e., the original value before the operation, and update is the value after the operation. With i = 2 as an example, expect = 2 and update = 3. This in turn calls sun.misc.Unsafe's compareAndSwapInt method, whose code is as follows:
/**
 * Compares the value of the integer field at the specified offset
 * in the supplied object with the given expected value, and updates
 * it if they match. The operation of this method should be atomic,
 * thus providing an uninterruptible way of updating an integer field.
 *
 * @param obj the object containing the field to modify.
 * @param offset the offset of the integer field within <code>obj</code>.
 * @param expect the expected value of the field.
 * @param update the new value of the field if it equals <code>expect</code>.
 * @return true if the field was changed.
 */
public native boolean compareAndSwapInt(Object obj, long offset,
                                        int expect, int update);
This is a native method that uses CAS to guarantee atomicity. If the operation fails, the loop simply retries until it succeeds. That is the biggest difference from the pre-JDK 5 approach: a failed thread no longer has to be suspended and rescheduled, but can retry immediately without obstruction, which greatly reduces suspend-and-schedule overhead. (Of course, if CAS keeps failing for a long time it can also burn a lot of CPU, so it depends on the application scenario.)
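Putting it together, the CAS-backed AtomicInteger fixes the lost-update i++ scenario from the first section (a minimal sketch; the class name is illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Two threads each perform 100,000 atomic increments; every
// incrementAndGet() spins on compareAndSet until it wins, so no
// increment is ever lost.
public class AtomicCounter {
    static final AtomicInteger i = new AtomicInteger(0);

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int n = 0; n < 100_000; n++) {
                i.incrementAndGet(); // atomic i++
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("i = " + i.get()); // always 200000
    }
}
```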
There are a few things to note about the CAS strategy:
When threads contend for a resource especially frequently (relative to how fast the CPU can execute the operation), spins can last a long time and waste significant CPU.
There is the ABA problem: the value before the update is A, but during the operation another thread updates it to B and then back to A. The current thread's CAS then succeeds even though the state has actually changed. If that inconsistency matters to the program (such scenarios are rare, and usually arise only when the variable also serves as a flag for something else, such as marking that initialization has run), you need AtomicStampedReference, which compares not only the value read before the update but also a stamp (version) recorded before the update.
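A sketch of detecting ABA with AtomicStampedReference (the class and method names here are real JDK API; the scenario is simulated in one thread, and the Integer values are kept in the autobox cache range because AtomicStampedReference compares references with ==):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class AbaDemo {
    static boolean demo() {
        // Value 100 with initial stamp (version) 0.
        AtomicStampedReference<Integer> ref =
                new AtomicStampedReference<>(100, 0);

        int observedStamp = ref.getStamp(); // this thread saw stamp 0

        // Meanwhile "another thread" performs A -> B -> A, bumping the stamp:
        ref.compareAndSet(100, 101, 0, 1);
        ref.compareAndSet(101, 100, 1, 2);

        // The value is back to 100, but the stale stamp makes the CAS fail,
        // exposing the ABA change a plain AtomicInteger would miss.
        return ref.compareAndSet(100, 102, observedStamp, observedStamp + 1);
    }

    public static void main(String[] args) {
        System.out.println("updated = " + demo()); // prints "updated = false"
    }
}
```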
CAS can atomically manipulate only one variable. If you need atomic operations over several variables as a whole, use AtomicReference: put the variables into one object and perform the atomic operation on that object.
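A sketch of treating two related fields as one atomic unit via AtomicReference (the class and field names are illustrative):

```java
import java.util.concurrent.atomic.AtomicReference;

public class PairUpdate {
    // An immutable holder for two values that must stay consistent.
    static final class Range {
        final int low, high;
        Range(int low, int high) { this.low = low; this.high = high; }
    }

    static final AtomicReference<Range> range =
            new AtomicReference<>(new Range(0, 10));

    // Atomically shift both bounds together; a thread that loses the
    // CAS race simply rebuilds the object from the fresh value and retries.
    static void shift() {
        Range current, next;
        do {
            current = range.get();
            next = new Range(current.low + 1, current.high + 1);
        } while (!range.compareAndSet(current, next));
    }

    public static void main(String[] args) {
        shift();
        Range r = range.get();
        System.out.println(r.low + ".." + r.high); // prints "1..11"
    }
}
```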
CAS is widely used in JDK 5's java.util.concurrent (j.u.c.) package, and from JDK 6 it is also applied in the JVM's implementation of synchronized. Hence in JDK 5, j.u.c. was much more efficient than synchronized, while by JDK 6 the two are comparable; and since synchronized is simpler to use and less error-prone, it is the expert group's recommended first choice, unless you need j.u.c.'s special features (such as waiting with a timeout instead of blocking indefinitely).