Java concurrency mechanism and underlying implementation principles


Java code is compiled into bytecode, the bytecode is loaded into the JVM by the class loader, and the JVM translates the bytecode into assembly instructions that execute on the CPU. Java's concurrency mechanisms therefore depend both on the JVM's implementation and on the CPU's instructions.

The third edition of the Java Language Specification defines volatile as follows: the Java programming language allows threads to access shared variables; to ensure that a shared variable is updated accurately and consistently, a thread would ordinarily have to obtain an exclusive lock on it. As an alternative, the Java language provides volatile: if a field is declared volatile, the Java memory model ensures that all threads see a consistent value for the variable. Unlike locking, volatile causes no thread context switching or scheduling.
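
A minimal sketch of the visibility guarantee (the class and field names here are illustrative, not from the original text): without volatile, the worker thread may keep reading a stale cached value of the flag and spin forever; declaring it volatile guarantees the write becomes visible to all threads.

public class VolatileVisibility {
    // volatile guarantees every thread sees a consistent value of this flag
    private static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                while (running) {
                    // busy-wait until the main thread clears the flag
                }
                System.out.println("worker saw running == false and stopped");
            }
        });
        worker.start();
        Thread.sleep(100);
        running = false; // volatile write: written back to system memory, other caches invalidated
        worker.join();
    }
}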

CPU Terminology

Term | English | Description
Memory barrier | memory barriers | A set of processor instructions used to impose ordering constraints on memory operations
Cache line | cache line | The smallest storage unit that can be allocated in the cache. When the processor fills a cache line, it loads the entire line, which requires multiple main-memory read cycles
Atomic operation | atomic operations | An operation, or series of operations, that cannot be interrupted
Cache line fill | cache line fill | When the processor recognizes that an operand read from memory is cacheable, it reads the entire cache line into the appropriate cache
Cache hit | cache hit | If the memory location of a previous cache line fill is still the source the next time the processor accesses that address, the processor reads the operand from the cache instead of from memory
Write hit | write hit | When the processor writes an operand back to a cached memory area, it first checks whether that address's cache line is present; if a valid cache line exists, the processor writes the operand back to the cache line rather than to memory. This is called a write hit
Write miss | write misses the cache | A write to a memory area whose valid cache line is not present in the cache

Suppose instance is a volatile variable: instance = new Singleton();

Translated into assembly, it becomes:

0x01a3de1d: movb $0x0,0x1104800(%esi);

0x01a3de24: lock addl $0x0,(%esp);

When a shared variable modified by volatile is written, a second line of assembly code with a lock prefix appears. On a multi-core processor, the lock-prefixed instruction causes two things: 1) the data in the current processor's cache line is written back to system memory; 2) this write-back invalidates the data cached at that memory address in every other CPU.

To improve processing speed, the processor does not communicate with memory directly; it first reads system memory data into its internal cache before operating on it, but it does not know when that data will be written back to memory. If a write is performed on a variable declared volatile, the JVM sends a lock-prefixed instruction to the processor, writing the data in the cache line containing the variable back to system memory. On multiprocessor systems, to keep each processor's cache consistent, a cache coherence protocol is implemented: each processor checks whether the value in its cache is stale by sniffing the data propagated on the bus. When a processor finds that the memory address corresponding to one of its cache lines has been modified, it sets that cache line to the invalid state; when it next operates on that data, it re-reads it from system memory into its cache.
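
The instance = new Singleton() line above is typically taken from the double-checked-locking idiom; here is a minimal sketch of it, assuming a Singleton class like the one in the snippet:

public class Singleton {
    // volatile forbids reordering the construction of the object with the
    // publication of the reference, so no thread can observe a partially
    // constructed instance
    private static volatile Singleton instance;

    private Singleton() {}

    public static Singleton getInstance() {
        if (instance == null) {                  // first check, without locking
            synchronized (Singleton.class) {
                if (instance == null) {          // second check, under the lock
                    instance = new Singleton();  // the volatile write emits the lock-prefixed instruction
                }
            }
        }
        return instance;
    }
}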

  

Volatile has two implementation principles:

1) The lock-prefixed instruction causes the processor's cache to be written back to memory. The lock prefix causes the processor's LOCK# signal to be asserted while the instruction executes. In a multiprocessor environment, the LOCK# signal ensures that the processor can use the shared memory exclusively for as long as the signal is asserted. Asserting LOCK# generally locks the cache rather than the bus: on Intel 486 and Pentium processors the LOCK# signal is always asserted on the bus during a lock operation, but on P6 and more recent processors, if the memory area being accessed is already cached inside the processor, the LOCK# signal is not asserted. Instead the processor locks the cache line for that memory area and writes it back to memory, relying on the cache coherence mechanism to ensure the atomicity of the modification. This is known as "cache locking"; the cache coherence mechanism prevents data in a memory area cached by two or more processors from being modified simultaneously.

2) One processor's cache being written back to memory invalidates the caches of the other processors. IA-32 and Intel 64 processors use the MESI (Modified, Exclusive, Shared, Invalid) protocol to maintain coherence between their internal caches and the caches of other processors. On multi-core processor systems, IA-32 and Intel 64 processors can sniff other processors' accesses to system memory and to their internal caches, and they use this sniffing technique to keep their internal caches, system memory, and the data cached by other processors coherent on the bus. For example, on Pentium and P6 family processors, if sniffing detects that another processor intends to write to a memory address currently in the shared state, the sniffing processor invalidates its own cache line, forcing a cache line fill the next time it accesses the same memory address.

  

Optimization of volatile

JDK 7's concurrency package adds a new queue collection class, LinkedTransferQueue, which optimizes enqueue and dequeue performance by appending bytes when using volatile variables. LinkedTransferQueue uses an inner class, PaddedAtomicReference<QNode>, to define the head and tail nodes of the queue; relative to its parent class AtomicReference, this inner class does only one thing: it pads the shared variable out to 64 bytes. This matters because processors such as the Intel Core i7, Core, Atom, and NetBurst, as well as Core Solo and Pentium M, have L1, L2, or L3 cache lines that are 64 bytes wide and do not support partially filled cache lines. If the head and tail nodes of the queue are each less than 64 bytes, the processor reads them into the same cache line, and under multiple processors each processor caches the same head and tail nodes. When one processor tries to modify the head node, the entire cache line is locked, so under the cache coherence mechanism the other processors cannot access the tail node in their own caches. Since enqueue and dequeue operations constantly modify the head and tail nodes, this seriously hurts enqueue and dequeue efficiency on multiprocessor systems. Appending bytes to fill the cache line out to 64 bytes prevents the head and tail nodes from being loaded into the same cache line, so that modifying the head and the tail no longer lock each other.
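
A minimal sketch of the padding trick, modeled on the JDK 7 source (the field names are illustrative; the real PaddedAtomicReference is a private inner class of LinkedTransferQueue):

import java.util.concurrent.atomic.AtomicReference;

// Pads the reference out past a 64-byte cache line by appending unused
// Object fields, so head and tail never share a cache line.
class PaddedAtomicReference<T> extends AtomicReference<T> {
    // enough unused references to push the instance size past 64 bytes;
    // note that, as the text warns, a smarter JIT may eliminate unused fields
    Object p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, pa, pb, pc, pd, pe;

    PaddedAtomicReference(T initial) {
        super(initial);
    }
}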

In the following two scenarios, volatile variables should not be padded to 64 bytes:

1) Processors whose cache line is not 64 bytes wide, such as the P6 series and Pentium processors, whose L1 and L2 cache lines are 32 bytes wide.

2) Shared variables that are not written frequently. Padding requires the processor to read more bytes into the cache, which carries some performance cost, and if the shared variable is rarely written there is little lock contention on its cache line to avoid. Note also that the padding may not take effect under Java 7, because Java 7 is smarter and may eliminate or rearrange unused fields.

The implementation principle of synchronized

Every object in Java can be used as a lock, which manifests in the following 3 ways:

1. For a normal synchronized method, the lock is the current instance object.

2. For a static synchronized method, the lock is the Class object of the current class.

3. For a synchronized block, the lock is the object configured in the synchronized parentheses. The three forms are illustrated in the sketch below.
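
A brief illustration of the three forms (the class and method names are illustrative):

public class SyncForms {

    private final Object lock = new Object();

    // 1. Normal synchronized method: the lock is the current instance (this)
    public synchronized void instanceMethod() {
    }

    // 2. Static synchronized method: the lock is SyncForms.class
    public static synchronized void staticMethod() {
    }

    // 3. Synchronized block: the lock is the object in the parentheses;
    //    the block compiles to monitorenter/monitorexit instructions
    public void blockMethod() {
        synchronized (lock) {
        }
    }
}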

The JVM specification shows that the JVM implements method synchronization and code-block synchronization based on entering and exiting monitor objects, though the implementation details of the two differ. Code-block synchronization is implemented with the monitorenter and monitorexit instructions, while method synchronization is implemented in a different way whose details the specification does not spell out. After compilation, the monitorenter instruction is inserted at the start of the synchronized block, and monitorexit is inserted at the end of the block and at every exception exit; the JVM guarantees that every monitorenter has a matching monitorexit. Every object has a monitor associated with it, and while the monitor is held the object is in the locked state. When a thread reaches the monitorenter instruction, it attempts to take ownership of the monitor corresponding to the object, that is, it attempts to acquire the object's lock.

The lock used by synchronized is stored in the Java object header. If the object is an array type, the virtual machine stores the object header in 3 word widths; if the object is a non-array type, the object header is stored in 2 word widths. In a 32-bit virtual machine, 1 word width equals 4 bytes, that is, 32 bits.

Layout of the Java object header

Length | Content | Description
32/64 bit | Mark Word | Stores the object's hashCode, lock information, etc.
32/64 bit | Class Metadata Address | A pointer to the object's type metadata
32/64 bit | Array length | The length of the array (present for array objects only)

In the Java object header, the Mark Word stores the object's hashCode, generational age, and lock flag bits by default.

Default storage structure of the 32-bit Mark Word

Lock state | 25bit | 4bit | 1bit (biased lock?) | 2bit (lock flag)
No lock | object's hashCode | object's generational age | 0 | 01

Mark Word's state changes

Lock state | 30bit (23bit + 2bit + 4bit + 1bit) | 2bit (lock flag)
Lightweight lock | pointer to the lock record in the stack | 00
Heavyweight lock | pointer to the mutex (heavyweight lock) | 10
GC mark | empty | 11
Biased lock | thread ID (23bit), epoch (2bit), object generational age (4bit), biased flag 1 (1bit) | 01

Under a 64-bit virtual machine, the Mark Word is 64 bits:

Lock state | 56bit | 1bit (cms_free) | 4bit (generational age) | 1bit (biased lock?) | 2bit (lock flag)
No lock | unused (25bit), hashCode (31bit) | | | 0 | 01
Biased lock | ThreadID (54bit), Epoch (2bit) | | | 1 | 01

To reduce the performance cost of acquiring and releasing locks, Java SE 1.6 introduced the "biased lock" and the "lightweight lock". In Java SE 1.6 a lock has 4 states which, from lowest to highest, are: the no-lock state, the biased-lock state, the lightweight-lock state, and the heavyweight-lock state. A lock can be upgraded but never downgraded, a policy chosen to improve the efficiency of acquiring and releasing locks.

  

Biased lock:

When a thread accesses a synchronized block and acquires the lock, it stores the biased thread ID in the lock record in the object header and in the stack frame. Afterwards the thread does not need a CAS operation to lock or unlock on entering and exiting the synchronized block; it simply tests whether the Mark Word in the object header stores a bias toward the current thread. If the test succeeds, the thread has acquired the lock. If it fails, the thread then tests whether the biased-lock flag in the Mark Word is set to 1: if not, it uses CAS to compete for the lock; if so, it tries to use CAS to point the object header's bias at the current thread.

A biased lock uses a mechanism that releases the lock only when contention appears, so a thread holding a biased lock releases it only once another thread tries to compete for it. Revoking a biased lock requires waiting for a global safepoint (a point in time at which no bytecode is executing). The JVM first pauses the thread holding the biased lock, then checks whether that thread is still alive. If the thread is not active, the object header is set to the no-lock state. If the thread is still alive, the stack holding the biased lock is walked and the lock records of the biased object are traversed; the lock records in the stack and the Mark Word in the object header are then either re-biased to another thread, or reverted to the no-lock state, or the object is marked as unsuitable for biased locking. Finally the paused thread is woken up.

Biased locking is enabled by default in Java 6 and Java 7, but it is only activated a few seconds after the application starts; the JVM parameter -XX:BiasedLockingStartupDelay=0 turns off the delay. If the locks in your application are usually contended, you can turn off biased locking entirely with the JVM parameter -XX:-UseBiasedLocking, in which case the program enters the lightweight-lock state by default.

  

Lightweight lock

Before a thread executes a synchronized block, the JVM first creates space for storing a lock record in the current thread's stack frame and copies the Mark Word in the object header into that record, officially called the Displaced Mark Word. The thread then attempts to use CAS to replace the Mark Word in the object header with a pointer to the lock record. If this succeeds, the current thread acquires the lock; if it fails, other threads are competing for the lock, and the current thread tries to acquire it by spinning.

When a lightweight lock is unlocked, an atomic CAS operation is used to replace the Displaced Mark Word back into the object header. If it succeeds, no contention occurred; if it fails, the lock is contended, and it inflates into a heavyweight lock.

  

Because spinning consumes CPU, to avoid useless spinning, once a lock has been upgraded to a heavyweight lock it is never restored to the lightweight state. While a lock is in the heavyweight state, any other thread that tries to acquire it is blocked; when the thread holding the lock releases it, those threads are woken up, and the awakened threads begin a new round of lock contention.

Lock | Advantages | Disadvantages | Applicable scenarios
Biased lock | Locking and unlocking require no extra cost; the gap to running an unsynchronized method is on the order of nanoseconds | If threads contend for the lock, revoking the bias brings extra cost | Only one thread ever accesses the synchronized block
Lightweight lock | Contending threads do not block, which improves program responsiveness | A thread that never wins the lock contention consumes CPU by spinning | Pursuing response time; the synchronized block executes very quickly
Heavyweight lock | Thread contention does not use spinning and does not consume CPU | Threads block, and response time is slow | Pursuing throughput; the synchronized block executes for a long time

The implementation principle of atomic operations

An atomic operation is an operation, or series of operations, that cannot be interrupted.

Term | English | Explanation
Cache line | cache line | The smallest unit of operation of the cache
Compare and swap | Compare And Swap | A CAS operation takes two values, an old value and a new value; during the operation it compares whether the old value has changed, writes the new value if it has not, and does not perform the exchange if it has
CPU pipeline | CPU pipeline | A CPU pipeline works like an assembly line in industrial production: in the CPU, an instruction-processing pipeline is formed from five or six circuit units with different functions, and an x86 instruction is split into five or six steps executed by these units in turn, so that one instruction can complete per CPU clock cycle, increasing the CPU's speed
Memory order violation | memory order violation | Memory order violations are generally caused by false sharing, i.e., multiple CPUs simultaneously modifying different parts of the same cache line, invalidating one CPU's copy; the CPU must flush its pipeline when such a conflict occurs

32-bit IA-32 processors implement atomic operations between multiple processors based on cache locking or bus locking. First, the processor automatically guarantees the atomicity of basic memory operations: it guarantees that reading or writing a single byte in system memory is atomic, meaning that while one processor reads a byte, no other processor can access that byte's memory address. Pentium 6 and newer processors also automatically guarantee that single-processor 16-, 32-, and 64-bit operations within the same cache line are atomic. However, the processor cannot automatically guarantee the atomicity of complex memory operations, such as accesses that cross bus widths, cross multiple cache lines, or cross page tables. For those, the processor provides two mechanisms, bus locking and cache locking, to guarantee the atomicity of complex memory operations.

If multiple processors read and modify a shared variable at the same time (i++, for example), the read-modify-write sequence is not atomic, and after both processors finish the shared variable may hold a wrong value. To guarantee that a read-modify-write of a shared variable is atomic, we must ensure that while one CPU modifies the variable, the other CPUs cannot operate on the cache holding that variable's memory address. Bus locking uses a LOCK# signal provided by the processor: when one processor asserts this signal on the bus, the requests of the other processors are blocked, so that processor can use the shared memory exclusively.

The second mechanism guarantees atomicity through cache locking. At any given moment we only need the operation on one memory address to be atomic, but a bus lock locks all communication between the CPUs and memory, so while it is held no other processor can operate on the data of any other memory address; the overhead of bus locking is therefore large, and current processors use cache locking instead of bus locking for optimization in some scenarios. Frequently used memory is cached in the processor's L1, L2, and L3 caches, so atomic operations can be performed directly in the processor's internal cache without asserting a bus lock; on Pentium 6 and current processors, the "cache locking" approach can achieve complex atomicity. Cache locking means that if a memory area is cached in the processor's cache line and is locked for the duration of the lock operation, then when the processor executes the lock operation and writes back to memory, it does not assert the LOCK# signal on the bus; it modifies the memory address internally and lets the cache coherence mechanism guarantee the atomicity of the operation. The cache coherence mechanism prevents data in a memory area cached by two or more processors from being modified simultaneously, and it invalidates a cache line when another processor writes back data for a cache line that has been locked.

There are two situations in which the processor does not use cache locking. First, when the data being operated on cannot be cached inside the processor, or the operation spans multiple cache lines, the processor invokes bus locking. Second, some processors do not support cache locking: on Intel 486 and Pentium processors, bus locking is invoked even when the locked memory area is in the processor's cache line.

In Java, atomic operations can be implemented with locks and with looping CAS.

CAS operations in the JVM are implemented using the CMPXCHG instruction provided by the processor. The basic idea of a spin-CAS implementation is to loop the CAS operation until it succeeds. The following code implements a CAS-based thread-safe counter method, safeCount, alongside a non-thread-safe counter method, count.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class Counter {

    private AtomicInteger atomicI = new AtomicInteger(0);
    private int i = 0;

    public static void main(String[] args) {
        final Counter cas = new Counter();
        List<Thread> ts = new ArrayList<Thread>(600);
        long start = System.currentTimeMillis();
        for (int j = 0; j < 100; j++) {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < 10000; i++) {
                        cas.count();
                        cas.safeCount();
                    }
                }
            });
            ts.add(t);
        }
        for (Thread t : ts) {
            t.start();
        }
        // wait for all threads to finish
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        System.out.println(cas.i);             // unsafe count: usually less than 1,000,000
        System.out.println(cas.atomicI.get()); // safe count: always 1,000,000
        System.out.println(System.currentTimeMillis() - start);
    }

    /** Thread-safe counter implemented with spin CAS. */
    private void safeCount() {
        for (;;) {
            int i = atomicI.get();
            boolean suc = atomicI.compareAndSet(i, ++i);
            if (suc) {
                break;
            }
        }
    }

    /** Non-thread-safe counter. */
    private void count() {
        i++;
    }
}

CAS-based atomic operations have three major problems:

The ABA problem. CAS checks, when operating on a value, whether the value has changed, and updates it only if it has not. But if a value was A, changed to B, and then changed back to A, a CAS check will find the value unchanged even though it actually changed. The solution to the ABA problem is to use version numbers: prepend a version number to the variable and increment it on every update, so A-B-A becomes 1A-2B-3A. Starting with Java 1.5, the JDK's atomic package provides the AtomicStampedReference class to address the ABA problem. Its compareAndSet method first checks whether the current reference equals the expected reference and whether the current stamp equals the expected stamp; only if both are equal does it atomically set the reference and the stamp to the given new values.

public boolean compareAndSet(
        V expectedReference,
        V newReference,
        int expectedStamp,
        int newStamp)
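
A minimal sketch of using AtomicStampedReference to detect an ABA change (the values are illustrative):

import java.util.concurrent.atomic.AtomicStampedReference;

public class AbaDemo {
    public static void main(String[] args) {
        // pair the value 100 with an initial stamp (version) of 0
        AtomicStampedReference<Integer> ref = new AtomicStampedReference<Integer>(100, 0);

        int stamp = ref.getStamp();          // remember the current version
        Integer value = ref.getReference();  // remember the current value

        // simulate an A -> B -> A change; each update bumps the stamp
        // (small Integers are cached by the JVM, so reference comparison works here)
        ref.compareAndSet(100, 101, stamp, stamp + 1);
        ref.compareAndSet(101, 100, stamp + 1, stamp + 2);

        // the value looks unchanged, but the stale stamp makes this CAS fail
        boolean success = ref.compareAndSet(value, 102, stamp, stamp + 1);
        System.out.println("CAS with stale stamp succeeded? " + success); // prints false
    }
}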

Long spin times bring large execution overhead. If a spin CAS fails to succeed for a long time, it imposes a very large execution cost on the CPU. If the JVM could support the pause instruction provided by the processor, efficiency would improve somewhat. The pause instruction has two functions: first, it delays the pipelined execution of instructions (de-pipeline) so that the CPU does not consume excessive execution resources; the delay depends on the implementation version, and on some processors it is zero. Second, it prevents the CPU pipeline from being flushed (CPU pipeline flush) by a memory order violation when exiting the loop, which improves CPU execution efficiency.
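
As an aside, later JDKs expose this hint to Java code: since Java 9, Thread.onSpinWait() tells the JIT that the thread is in a busy-wait loop, and on x86 it is typically compiled to the pause instruction. A minimal sketch (class and field names are illustrative):

import java.util.concurrent.atomic.AtomicBoolean;

public class SpinWaitDemo {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    // busy-waits until publish() is called, hinting the CPU that it is spinning
    void awaitReady() {
        while (!ready.get()) {
            Thread.onSpinWait(); // typically emits PAUSE on x86 (Java 9+)
        }
    }

    void publish() {
        ready.set(true);
    }
}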

Atomicity can only be guaranteed for one shared variable. When operating on a single shared variable, looping CAS can guarantee atomicity; for operations spanning multiple shared variables, looping CAS cannot, and a lock can be used instead. Alternatively, there is a trick: combine the multiple shared variables into one shared variable to operate on. Starting with Java 1.5, the JDK provides the AtomicReference class to guarantee atomicity between referenced objects, so multiple variables can be placed in a single object for CAS operations.
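
A minimal sketch of that trick, combining two variables into one immutable object and swapping it with a single CAS (the class names are illustrative):

import java.util.concurrent.atomic.AtomicReference;

public class MultiVarCas {

    // immutable pair: a fresh instance is swapped in atomically
    static final class Position {
        final int x;
        final int y;
        Position(int x, int y) { this.x = x; this.y = y; }
    }

    private final AtomicReference<Position> pos =
            new AtomicReference<Position>(new Position(0, 0));

    // atomically updates both coordinates, retrying on contention
    void move(int dx, int dy) {
        for (;;) {
            Position old = pos.get();
            Position next = new Position(old.x + dx, old.y + dy);
            if (pos.compareAndSet(old, next)) {
                return;
            }
        }
    }
}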

  

Using the lock mechanism to implement atomic operations

The lock mechanism ensures that only the thread that has obtained the lock can operate on the locked memory area. The JVM implements many kinds of locks internally: biased locks, lightweight locks, and mutex locks. The JVM implements locking with looping CAS: a thread uses looping CAS to obtain the lock when it enters a synchronized block, and uses looping CAS to release the lock when it exits the block.
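
A user-level sketch of the same idea (the JVM's actual lock code is native; this illustrative spin lock merely shows acquire-by-CAS-loop and release-by-CAS):

import java.util.concurrent.atomic.AtomicReference;

public class SpinLock {
    // null means unlocked; otherwise holds the owning thread
    private final AtomicReference<Thread> owner = new AtomicReference<Thread>();

    public void lock() {
        Thread current = Thread.currentThread();
        // loop the CAS until this thread installs itself as owner
        while (!owner.compareAndSet(null, current)) {
            // spin
        }
    }

    public void unlock() {
        // only the owner can release: CAS the owner back to null
        owner.compareAndSet(Thread.currentThread(), null);
    }
}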
