How to implement the volatile keyword in Java



We know that the role of the volatile keyword is to guarantee the visibility of a variable across multiple threads. It is at the core of the java.util.concurrent package; without volatile there would not be so many concurrent classes for us to use.

This article explains in detail how the volatile keyword guarantees the visibility of a variable across threads. Before that, it is necessary to cover some background on CPU caches; grasping this material will let us better understand how volatile works, and therefore use the volatile keyword more correctly.

CPU Cache

The CPU cache mainly solves the mismatch between CPU speed and memory read/write speed: the CPU operates much faster than memory can be read or written. For example:

    • A main-memory access typically takes between dozens and hundreds of clock cycles
    • An L1 cache read or write takes only one or two clock cycles
    • An L2 cache read or write takes only a few clock cycles

Without a cache, this large difference in access speed would leave the CPU waiting a long time for data to arrive or to be written to memory.

Because of this, modern CPUs mostly do not read and write memory directly (the CPU has no pins connected directly to memory). Instead they go through the CPU cache, a temporary store located between the CPU and memory. Its capacity is much smaller than memory's, but its exchange speed is much faster. The data in the cache is only a small portion of the data in memory, but it is the portion the CPU is about to access in the near term; when the CPU needs data, it can read from the cache first, which speeds up access.

The CPU cache can be divided into levels, by lookup order and closeness to the CPU:

    • Level 1 cache: the L1 cache, located next to the CPU core, is the cache most tightly coupled to the CPU
    • Level 2 cache: the L2 cache, divided into on-chip and off-chip variants; on-chip L2 runs at the CPU's clock frequency, while off-chip L2 runs at only half that frequency
    • Level 3 cache: the L3 cache, present only on some high-end CPUs

The data stored at each cache level is a subset of the data at the next level. Of the three caches, the technical difficulty and manufacturing cost decrease from L1 to L3, so their capacities increase in the same order.

When the CPU is about to read a piece of data, it first looks in the L1 cache; if it is not found there, it looks in the L2 cache; if it is still not found, it looks in the L3 cache or in memory. In general, the hit rate of each cache level is about 80%, meaning 80% of the data is found in the L1 cache and only 20% of the total needs to be read from the L2 cache, the L3 cache, or memory.
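The effect of these hit rates can be sketched with a back-of-the-envelope average-access-time calculation. The following is a minimal sketch in Java; the class and method names (AmatEstimate, amat) and the latency numbers are illustrative assumptions for this article, not measurements of any particular CPU:

```java
public class AmatEstimate {
    // Hypothetical latencies in CPU cycles; real numbers vary by microarchitecture.
    static final double L1_CYCLES = 2, L2_CYCLES = 10, MEM_CYCLES = 100;

    // Average memory access time: each level is consulted only on a miss above it.
    static double amat(double l1HitRate, double l2HitRate) {
        return L1_CYCLES
                + (1 - l1HitRate) * (L2_CYCLES + (1 - l2HitRate) * MEM_CYCLES);
    }

    public static void main(String[] args) {
        // With ~80% hits per level the average stays close to L1 speed:
        // 2 + 0.2 * (10 + 0.2 * 100) = 8 cycles.
        System.out.printf("AMAT = %.1f cycles%n", amat(0.8, 0.8));
    }
}
```

Even a modest per-level hit rate keeps the average access close to cache speed, which is why the hierarchy pays off.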

Problems with CPU caching

The following figure shows how the CPU, the CPU cache, and main memory relate when data is read:

When the system runs, the CPU performs a calculation as follows:

    1. The program and its data are loaded into main memory
    2. Instructions and data are loaded into the CPU cache
    3. The CPU executes the instructions and writes the results to the cache
    4. The data in the cache is written back to main memory

If the server has a single-core CPU, these steps pose no problem; but if the server has a multicore CPU, problems arise. Take the cache model of the Intel Core i7 processor as an example (image from "Computer Systems: A Programmer's Perspective"):

Consider the following scenario:

    1. Core 0 reads a byte; according to the principle of locality, its neighboring bytes are read into core 0's cache along with it
    2. Core 3 does the same work, so core 0 and core 3 now hold the same data in their caches
    3. Core 0 modifies that byte; after the modification, the byte is written back to core 0's cache, but the change is not written back to main memory
    4. Core 3 accesses that byte; since core 0 has not written the data back to main memory, the data is out of sync

To solve this problem, CPU manufacturers made a rule: when one CPU modifies a byte in its cache, the other CPUs in the machine are notified and the corresponding entries in their caches are marked invalid. Thus, in the case above, core 3 finds that its cached copy is invalid, core 0 immediately writes its data back to main memory, and core 3 then re-reads the data.
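In Java, the visibility guarantee that fixes this scenario is exactly what volatile provides. The following is a minimal sketch; the class name and the sleep/join timeouts are arbitrary choices for illustration, and if the volatile modifier were removed, the reader thread below could spin forever on some JITs because the flag read may be hoisted out of the loop:

```java
public class VisibilityDemo {
    // With volatile, the writer's update is forced out to main memory and the
    // reader's cached copy is invalidated; without it, the reader may never
    // observe the change.
    private static volatile boolean stop = false;

    // Returns true if the reader thread observed the write and terminated.
    static boolean runDemo() {
        Thread reader = new Thread(() -> {
            while (!stop) { /* spin until the volatile write becomes visible */ }
        });
        reader.start();
        try {
            Thread.sleep(50);   // let the reader enter its loop
            stop = true;        // volatile write
            reader.join(1000);  // with volatile this returns almost immediately
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !reader.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("reader terminated: " + runDemo());
    }
}
```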

Disassembling Java bytecode: what the assembly level does with the volatile keyword

With the theoretical groundwork above, we can look at how the volatile keyword is actually implemented. First, write a simple piece of code:

    /**
     * @author May Cangjie
     */
    public class LazySingleton {

        private static volatile LazySingleton instance = null;

        public static LazySingleton getInstance() {
            if (instance == null) {
                instance = new LazySingleton();
            }
            return instance;
        }

        public static void main(String[] args) {
            LazySingleton.getInstance();
        }

    }
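As an aside, the listing above guarantees visibility but is still racy: two threads can both pass the null check and create two instances. The standard fix is double-checked locking, which keeps the volatile modifier (that is what stops another thread from observing a non-null but half-constructed object) and adds a synchronized block. The sketch below is not from the original article; the class name is invented for illustration:

```java
public class DclSingleton {
    // volatile prevents reordering of "allocate -> run constructor -> publish";
    // without it another thread could see a non-null, half-constructed instance.
    private static volatile DclSingleton instance = null;

    private DclSingleton() { }

    public static DclSingleton getInstance() {
        if (instance == null) {                    // first check, lock-free fast path
            synchronized (DclSingleton.class) {
                if (instance == null) {            // second check, under the lock
                    instance = new DclSingleton();
                }
            }
        }
        return instance;
    }
}
```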

First decompile the. class file for this code and look at the resulting bytecode:

There is nothing special there. Note that bytecode instructions such as getstatic, ifnonnull, and new are ultimately translated, at the operating-system level, into machine instructions for execution. The CPUs in our PCs and application servers usually use the IA-32 architecture, whose instruction set is CISC (a complex instruction set); assembly language is the mnemonic form of that instruction set.

So, since the bytecode level shows no clue, let's look at what the code becomes when converted into assembly instructions. Viewing the assembly generated for code on Windows is not difficult (a small rant: for something "not difficult", I searched through all kinds of material and was almost ready to install a Linux virtual machine). Download the hsdis tool, unpack it, and place the two files hsdis-amd64.dll and hsdis-amd64.lib under the %JAVA_HOME%\jre\bin\server path, like so:

Then run the main function. Before running it, add the following virtual machine parameters:

-server -Xcomp -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:CompileCommand=compileonly,*LazySingleton.getInstance

Running the main function produces the following assembly instructions for the code:

 1 Java HotSpot(TM) 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output
 2 CompilerOracle: compileonly *LazySingleton.getInstance
 3 Loaded disassembler from D:\JDK\jre\bin\server\hsdis-amd64.dll
 4 Decoding compiled method 0x0000000002931150:
 5 Code:
 6 Argument 0 is unknown. RIP: 0x29312a0 Code size: 0x00000108
 7 [Disassembling for mach='amd64']
 8 [Entry Point]
 9 [Verified Entry Point]
10 [Constants]
11   # {method} 'getInstance' '()Lorg/xrq/test/design/singleton/LazySingleton;' in 'org/xrq/test/design/singleton/LazySingleton'
12   #           [sp+0x20]  (sp of caller)
13   0x00000000029312a0: mov    dword ptr [rsp+0ffffffffffffa000h],eax
14   0x00000000029312a7: push   rbp
15   0x00000000029312a8: sub    rsp,10h            ;*synchronization entry
16                                                 ; - LazySingleton::getInstance (line 13)
17   0x00000000029312ac: mov    r10,7ada9e428h     ;   {oop(a 'java/lang/Class' = 'org/xrq/test/design/singleton/LazySingleton')}
18   0x00000000029312b6: mov    r11d,dword ptr [r10+58h]
19                                                 ;*getstatic instance
20                                                 ; - LazySingleton::getInstance (line 13)
21   0x00000000029312ba: test   r11d,r11d
22   0x00000000029312bd: je     29312e0h
23   0x00000000029312bf: mov    r10,7ada9e428h     ;   {oop(a 'java/lang/Class' = 'org/xrq/test/design/singleton/LazySingleton')}
24   0x00000000029312c9: mov    r11d,dword ptr [r10+58h]
25   0x00000000029312cd: mov    rax,r11
26   0x00000000029312d0: shl    rax,3h             ;*getstatic instance
27                                                 ; - LazySingleton::getInstance (line 17)
28   0x00000000029312d4: add    rsp,10h
29   0x00000000029312d8: pop    rbp
30   0x00000000029312d9: test   dword ptr [330000h],eax  ;   {poll_return}
31   0x00000000029312df: ret
32   0x00000000029312e0: mov    rax,qword ptr [r15+60h]
33   0x00000000029312e4: mov    r10,rax
34   0x00000000029312e7: add    r10,10h
35   0x00000000029312eb: cmp    r10,qword ptr [r15+70h]
36   0x00000000029312ef: jnb    293135bh
37   0x00000000029312f1: mov    qword ptr [r15+60h],r10
38   0x00000000029312f5: prefetchnta byte ptr [r10+0c0h]
39   0x00000000029312fd: mov    r11d,0e07d00b2h    ;   {oop('org/xrq/test/design/singleton/LazySingleton')}
40   0x0000000002931303: mov    r10,qword ptr [r12+r11*8+0b0h]
41   0x000000000293130b: mov    qword ptr [rax],r10
42   0x000000000293130e: mov    dword ptr [rax+8h],0e07d00b2h
43                                                 ;   {oop('org/xrq/test/design/singleton/LazySingleton')}
44   0x0000000002931315: mov    dword ptr [rax+0ch],r12d
45   0x0000000002931319: mov    rbp,rax            ;*new  ; - LazySingleton::getInstance (line 14)
46   0x000000000293131c: mov    rdx,rbp
47   0x000000000293131f: call   2907c60h           ; OopMap{rbp=Oop off=132}
48                                                 ;*invokespecial <init>
49                                                 ; - LazySingleton::getInstance (line 14)
50                                                 ;   {optimized virtual_call}
51   0x0000000002931324: mov    r10,rbp
52   0x0000000002931327: shr    r10,3h
53   0x000000000293132b: mov    r11,7ada9e428h     ;   {oop(a 'java/lang/Class' = 'org/xrq/test/design/singleton/LazySingleton')}
54   0x0000000002931335: mov    dword ptr [r11+58h],r10d
55   0x0000000002931339: mov    r10,7ada9e428h     ;   {oop(a 'java/lang/Class' = 'org/xrq/test/design/singleton/LazySingleton')}
56   0x0000000002931343: shr    r10,9h
57   0x0000000002931347: mov    r11d,20b2000h
58   0x000000000293134d: mov    byte ptr [r11+r10],r12l
59   0x0000000002931351: lock add dword ptr [rsp],0h  ;*putstatic instance
60                                                 ; - LazySingleton::getInstance (line 14)
61   0x0000000002931356: jmp    29312bfh
62   0x000000000293135b: mov    rdx,703e80590h     ;   {oop('org/xrq/test/design/singleton/LazySingleton')}
63   0x0000000002931365: nop
64   0x0000000002931367: call   292fbe0h           ; OopMap{off=204}
65                                                 ;*new  ; - LazySingleton::getInstance (line 14)
66                                                 ;   {runtime_call}
67   0x000000000293136c: jmp    2931319h
68   0x000000000293136e: mov    rdx,rax
69   0x0000000002931371: jmp    2931376h
70   0x0000000002931373: mov    rdx,rax            ;*new  ; - LazySingleton::getInstance (line 14)
71   0x0000000002931376: add    rsp,10h
72   0x000000000293137a: pop    rbp
73   0x000000000293137b: jmp    2932b20h           ;   {runtime_call}
74 [Stub Code]
75   0x0000000002931380: mov    rbx,0h             ;   {no_reloc}
76   0x000000000293138a: jmp    293138ah           ;   {runtime_call}
77 [Exception Handler]
78   0x000000000293138f: jmp    292fca0h           ;   {runtime_call}
79 [Deopt Handler Code]
80   0x0000000002931394: call   2931399h
81   0x0000000002931399: sub    qword ptr [rsp],5h
82   0x000000000293139e: jmp    2909000h           ;   {runtime_call}
83   0x00000000029313a3: hlt
84   0x00000000029313a4: hlt
85   0x00000000029313a5: hlt
86   0x00000000029313a6: hlt
87   0x00000000029313a7: hlt

With such a long stretch of assembly code you may not know where the CPU does its trick, but it is not hard to find: locate lines 59 and 60:

0x0000000002931351: lock add dword ptr [rsp],0h   ;*putstatic instance
                                                  ; - LazySingleton::getInstance (line 14)

The reason for locating these two lines is the comment "(line 14)": line 14 of the source is where the volatile variable instance is assigned. The instruction after the prefix, add dword ptr [rsp],0h, is an ordinary assembly statement meaning "add 0 to the doubleword at the stack-pointer address", which by itself changes nothing. The key is the lock prefix in front of the add. Below we analyze in detail what the lock instruction does and why it guarantees the memory visibility of the volatile keyword.

What does the lock instruction do?

I mentioned the IA-32 architecture earlier; if you are interested in CPU architecture you can look it up further. Here we consult the IA-32 manual's description of the lock instruction (the copy I used was a Chinese translation of the manual).

Here are some of the manual's descriptions of what the lock instruction does (they are scattered throughout the manual rather than collected in one chapter or section):

When modifying memory, the lock prefix invokes a locked read-modify-write operation, used for reliable communication between processors in multiprocessor systems, as follows: (1) on the Pentium and earlier IA-32 processors, the lock prefix causes a LOCK# signal to be asserted while the processor executes the current instruction, which always produces an explicit bus lock; (2) on the Pentium 4, Intel Xeon, and P6 family processors, the lock operation is handled by either a cache lock or a bus lock. If the memory access is cacheable and affects only a single cache line, a cache lock is invoked and the system bus and the actual location in system memory are not locked. Meanwhile, other Pentium 4, Intel Xeon, or P6 processors on the bus write back any modified data and invalidate their caches to ensure system memory consistency. If the memory access is uncacheable and/or crosses a cache-line boundary, the processor asserts the LOCK# signal and does not respond to bus-control requests for the duration of the locked operation.
The 32-bit IA-32 processors support locked atomic operations on a region of system memory. These operations are commonly used to manage shared data structures such as semaphores, segment descriptors, system segments, or page tables, where two or more processors may modify the same field or flag at the same time. The processor uses three interdependent mechanisms to implement locked atomic operations: (1) guaranteed atomic operations; (2) bus locking, using the LOCK# signal and the lock instruction prefix; (3) the cache coherence protocol, which ensures atomic operations on cached data structures (cache locking). This mechanism exists in the Pentium 4, Intel Xeon, and P6 family processors.
The IA-32 processor provides a LOCK# signal that is automatically asserted during certain critical memory operations to lock the system bus. While this output signal is asserted, control requests from other processors or bus agents are blocked. Software can specify other situations that require lock semantics by adding the lock prefix before an instruction. On the Intel386, Intel486, and Pentium processors, an explicitly locked instruction causes the LOCK# signal to be asserted. The hardware designer ensures the availability of the LOCK# signal in the system hardware to control memory access among processors. For the Pentium 4, Intel Xeon, and P6 family processors, if the memory area being accessed is cached inside the processor, the LOCK# signal is normally not asserted; instead, the lock is applied only to the processor's cache.
To explicitly force lock semantics, software can use the lock prefix with the following instructions when they modify a memory area. An invalid-opcode exception (#UD) is raised when the lock prefix is placed before any other instruction, or when the instruction does not write to memory (that is, when the target operand is a register).
(1) the bit test-and-modify instructions (BTS, BTR, BTC)
(2) the exchange instructions (XADD, CMPXCHG, CMPXCHG8B)
(3) XCHG, which automatically assumes a lock prefix
(4) the following single-operand arithmetic and logic instructions: INC, DEC, NOT, NEG
(5) the following two-operand arithmetic and logic instructions: ADD, ADC, SUB, SBB, AND, OR, XOR
A locked instruction guarantees that the memory region containing the target operand is locked, though the system may treat the locked region as slightly larger.
Software should access a semaphore (shared memory used for signaling between processors) with the same address and operand length. For example, if one processor accesses the semaphore with a word, the other processors should not access it with a byte.
The integrity of a bus lock is not affected by the alignment of the memory region. The lock semantics persist for as many bus cycles as it takes to update the entire operand. However, it is recommended that locked accesses be aligned on their natural boundaries to improve system performance:
(1) any boundary for an 8-bit access (locked or not)
(2) a 16-bit boundary for a locked word access
(3) a 32-bit boundary for a locked doubleword access
(4) a 64-bit boundary for a locked quadword access
For all other memory operations and all visible external events, a locked operation is atomic. All instruction fetches and page-table operations can pass a locked instruction. Locked instructions can be used to synchronize one processor writing data while another processor reads it.
The IA-32 architecture provides several mechanisms for strengthening or weakening the memory-ordering model to handle special programming scenarios, including: (1) I/O instructions, locking instructions, the lock prefix, and serializing instructions, which force strong ordering on the processor; (2) the SFENCE instruction (introduced in the Pentium III) and the LFENCE and MFENCE instructions (introduced in the Pentium 4 and Intel Xeon processors), which provide ordering and serialization for certain special types of memory operations... (there are two more). These mechanisms can be used as follows. Memory-mapped devices and other I/O devices on the bus are often sensitive to the order of writes to their buffers, and I/O instructions (the IN and OUT instructions) force strong write ordering for such accesses in this way: before executing an I/O instruction, the processor waits for all previous instructions to finish executing and for all buffered writes to reach memory. Only instruction fetches and page-table queries can pass an I/O instruction; subsequent instructions wait until the I/O instruction has finished executing.

Mulling over these sections of the IA-32 manual, we can derive several functions of the lock instruction:

    1. It locks the bus: other CPUs' read and write requests to memory are blocked until the lock is released. In practice, later processors use a cache lock instead of a bus lock, because a bus lock is relatively expensive: while the bus is locked, no other CPU can access memory at all
    2. The write that follows lock writes the modified data back while invalidating the related cache lines of other CPUs, so that they reload the latest data from main memory
    3. It is not a memory barrier, but it achieves a similar effect: it blocks instruction reordering across it
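These lock-prefixed read-modify-write instructions (XADD, CMPXCHG and friends, from the manual's list above) are also what the java.util.concurrent.atomic classes typically compile down to on x86. A small sketch; the class name and iteration counts are arbitrary choices for illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LockPrefixDemo {
    // Two threads increment a shared counter 10,000 times each.
    static int count() {
        AtomicInteger counter = new AtomicInteger(0);
        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) {
                // On x86, HotSpot typically JIT-compiles this to a
                // lock-prefixed xadd (or a lock cmpxchg loop).
                counter.incrementAndGet();
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter.get();
    }

    public static void main(String[] args) {
        // Always 20000; a plain (even volatile) int++ could lose updates,
        // because ++ is a read-modify-write, not a single atomic operation.
        System.out.println(count());
    }
}
```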

Point (1) noted that, for efficiency, later processors replaced the bus lock with a cache lock. In this scenario, consistency of the data held in multiple caches is guaranteed by the cache coherence protocol, so let's look at what that protocol is.

Cache Coherence Protocol

Before discussing cache coherence, let's introduce the concept of a cache line:

    • A cache is divided into segments (lines). A segment corresponds to a block of storage space, which we call a cache line; it is the smallest storage unit a CPU cache can allocate. Its size is 32, 64, or 128 bytes depending on the CPU architecture, usually 64 bytes. When the CPU sees an instruction that reads memory, it passes the memory address to the L1 data cache, which checks whether it holds a cache segment for that address; if not, it loads the entire cache segment from memory (or from a higher-level cache). Note that the whole segment is loaded at once: this is the principle of locality mentioned above
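The practical impact of whole-line loads can be seen by traversing a 2-D array in row-major versus column-major order: the former uses every element of each loaded line before moving on, while the latter touches a different line on almost every step. The array size and the timing printout below are illustrative choices; absolute numbers depend on the machine:

```java
public class CacheLineDemo {
    static final int SIZE = 2048;
    static final int[][] grid = new int[SIZE][SIZE];

    // Row-major traversal: consecutive elements share cache lines, so each
    // loaded line is fully consumed before it is evicted.
    static long sumRowMajor() {
        long sum = 0;
        for (int r = 0; r < SIZE; r++)
            for (int c = 0; c < SIZE; c++)
                sum += grid[r][c];
        return sum;
    }

    // Column-major traversal touches a different row (usually a different
    // cache line) on every step, wasting most of each line that was loaded.
    static long sumColMajor() {
        long sum = 0;
        for (int c = 0; c < SIZE; c++)
            for (int r = 0; r < SIZE; r++)
                sum += grid[r][c];
        return sum;
    }

    public static void main(String[] args) {
        for (int[] row : grid) java.util.Arrays.fill(row, 1);
        long t0 = System.nanoTime();
        long a = sumRowMajor();
        long t1 = System.nanoTime();
        long b = sumColMajor();
        long t2 = System.nanoTime();
        // Both sums are identical; the row-major pass is usually much faster.
        System.out.printf("row-major: %d us, col-major: %d us, sums equal: %b%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000, a == b);
    }
}
```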

We said above that LOCK# locks the bus, which is in practice unrealistic because bus locking is inefficient. What we really want is this: use multiple caches, but have them behave as if they were a single cache. The cache coherence protocol is designed to do exactly that; as the name implies, such protocols keep the contents of multiple caches consistent.

There are many cache coherence protocols, but most commonly used computing devices use a snooping protocol. Its basic idea is:

All memory transfers happen on a shared bus that all processors can see: the caches themselves are independent, but memory is a shared resource, and all memory accesses are arbitrated (within a given bus cycle, only one CPU cache may read or write memory).
A CPU cache does not only talk to the bus during its own memory transfers; it constantly snoops on the traffic crossing the bus, keeping track of what the other caches are doing. So whenever one cache reads or writes memory on behalf of its processor, the other processors are notified, and they use this to keep their own caches in sync. As soon as one processor writes to memory, the other processors immediately know that the corresponding cache segment in their own caches is now stale.

The MESI protocol is the most mainstream cache coherence protocol. In MESI, each cache line has one of 4 states, which can be encoded in 2 bits:

    • M (Modified): the line has been modified locally and is out of sync with main memory (dirty)
    • E (Exclusive): the line matches main memory and no other cache holds a copy
    • S (Shared): the line matches main memory, and other caches may hold copies
    • I (Invalid): the line is invalid / not loaded

The I, S, and M states correspond to concepts we already have: invalid/not-loaded, clean, and dirty cache segments. So the only new piece here is the E state, exclusive access, which solves the problem of "we need to tell the other processors before we start modifying a block of memory": a processor may write a cache line only when it is in the E or M state, that is, only when the processor holds it exclusively. If a processor wants to write a cache line it does not hold exclusively, it must first send an "I want exclusive access" request on the bus, which tells the other processors to invalidate their copies of the same cache segment (if any). Only after exclusive access is acquired can the processor begin modifying the data; at that point it knows that only its own cache holds a copy of the line, so there can be no conflict.
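The transitions described above can be condensed into a toy, single-line model. This is purely illustrative and far simpler than real hardware (which exchanges bus messages among all caches); the class and method names are invented for this sketch:

```java
public class MesiLine {
    // The four MESI states of a cache line (2 bits in hardware).
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    State state = State.INVALID;

    // A local write requires exclusive ownership first: a "request for
    // ownership" goes on the bus and other copies become INVALID.
    void localWrite() {
        if (state == State.SHARED || state == State.INVALID) {
            state = State.EXCLUSIVE;    // ownership acquired
        }
        state = State.MODIFIED;         // the line is now dirty
    }

    // Another cache snoops a read of this line: fall back to SHARED,
    // writing the dirty data back to memory first if we had modified it.
    void remoteRead() {
        if (state == State.MODIFIED) {
            // write-back to main memory happens here
        }
        if (state != State.INVALID) state = State.SHARED;
    }

    // Another cache requests ownership to write: our copy becomes invalid.
    void remoteWrite() { state = State.INVALID; }
}
```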

Conversely, if another processor wants to read this cache line (it learns of such reads immediately because it is snooping the bus), an exclusive or modified cache line must first return to the "shared" (S) state. If it is a modified cache line, its content must be written back to memory first.

From the lock instruction back to volatile variable reads and writes

With the explanation of lock above, the implementation principle of the volatile keyword should now be clear at a glance. First, look at a picture:

Working memory here is in fact an abstraction over the CPU registers and caches; each thread's working memory can be understood, simply, as the CPU registers plus the cache.

So when two threads, Thread-A and Thread-B, operate simultaneously on a volatile variable i in main memory, and Thread-A writes the variable i, then:

    • Thread-A issues a LOCK# instruction
    • The LOCK# instruction locks the bus (or the cache line) and invalidates the content of the corresponding cache line in Thread-B's cache
    • Thread-A writes the newly modified i back to main memory

When Thread-B reads the variable i, then:

    • Thread-B finds that the cache line for the corresponding address is locked; it waits for the lock to be released, and the cache coherence protocol guarantees that it then reads the latest value

From this we can see that a volatile read is basically no different from a normal variable read; the difference lies mainly in the variable's write operation.


I personally used to be somewhat confused about the role of the volatile keyword; after digging into how it works, I feel I understand volatile much better. I believe that you who have read this far, as long as you are willing to think and study, will have the same sudden, enlightened feeling that I did ^_^

