The MESI protocol and the RFO request
A typical CPU microarchitecture has three levels of cache, with each core owning its private L1 and L2 caches. When you write multithreaded programs, what happens if another thread needs the data sitting in the current core's L1 or L2 cache lines?
It has been suggested that core 2 could simply reach into core 1's cache line directly. That is possible, but not fast enough. A cross-core access has to go through the memory controller (see the previous article), and in the typical case core 2 keeps accessing that same data on core 1, paying the cross-core cost every time. Worse, cores 1 and 2 may not even sit in the same socket. Besides, the memory controller's bus bandwidth is limited and cannot carry that much traffic. CPU designers therefore prefer another approach: if core 2 needs the data, core 1 sends it over directly, so the data travels only once.
So when does a cache-line transfer occur? The answer is simple: it occurs when one core needs to read a dirty cache line held by another core. But how does the former know that the latter's cache line has been dirtied (written to)?
These questions are answered in detail below. First we need to cover a protocol, the MESI protocol (link), which mainstream processors use to guarantee the coherence of caches and memory. M, E, S, and I denote the four states a cache line can be in under the MESI protocol:
- M (Modified): the local processor has modified the cache line; it is a dirty line, its contents differ from what is in memory, and this cache holds the only copy (exclusive ownership).
- E (Exclusive): the cache line's contents match memory, and no other processor's cache holds this line.
- S (Shared): the cache line's contents match memory, and other processors may also hold copies of this line.
- I (Invalid): the cache line is invalid and cannot be used.
The diagram of these four states comes from kernel developer Ulrich Drepper's famous paper What Every Programmer Should Know About Memory (download), which briefly shows the cache line's four state transitions. The paper does not spell out how each transition happens, however, so I will walk through them with a few short paragraphs.
Initially, the cache line holds no data, so it is in the I state.

Local write: if the local processor writes data to a cache line in the I state, the line's state becomes M.

Local read: if the local processor reads a cache line in the I state, the cache clearly holds no data for it. There are two cases: (1) no other processor's cache holds this line either, so the data is loaded from memory into the cache line, which is then set to the E state, meaning "only I have this data, no other processor does"; (2) another processor's cache does hold this line, in which case the local cache line is set to the S (shared) state.

P.S. If a cache line is in the M state and the local processor then reads or writes it, the state does not change.

Remote read: suppose there are two processors, C1 and C2. If C2 needs to read a cache line held by C1, C1 sends the line's contents through the memory controller to C2; when C2 receives it, it sets the corresponding cache line to the S state. Before that happens, memory also picks the data up from the bus and stores it.

Remote write: this is not really a remote write; rather, C2 obtains C1's data not to read it but to write it. It is still a local write, but C1 also holds a copy of the data, so what happens? C2 issues an RFO (Request For Ownership): it asks for ownership of the line, and the corresponding cache line in every other processor is set to I; apart from C2 itself, nobody may touch the line. This guarantees the safety of the data, but processing the RFO request and performing the invalidations makes write operations to shared lines expensive.
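To make these transitions concrete, here is a toy model of the rules just described. This is my own sketch, not something from Drepper's paper: it tracks a single line across two caches with plain method calls, whereas real hardware does this with bus snooping.

```java
// Toy model of the MESI transitions described above: one cache line,
// two caches. For intuition only; real hardware snoops the bus and
// handles many more corner cases.
public class MesiToy {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    final State[] line = { State.INVALID, State.INVALID }; // state per core

    // Local read: load as E if no other cache holds the line; otherwise
    // both copies end up in S (an M copy is written back first).
    void read(int core) {
        int other = 1 - core;
        if (line[core] == State.INVALID) {
            if (line[other] == State.INVALID) {
                line[core] = State.EXCLUSIVE;  // the only cached copy
            } else {
                line[other] = State.SHARED;    // remote read: owner shares it
                line[core] = State.SHARED;
            }
        }
        // reads of a line in M, E or S leave its state unchanged
    }

    // Local write: if the line might be cached elsewhere, an RFO message
    // invalidates the other copy; the writer's line becomes M (dirty).
    void write(int core) {
        int other = 1 - core;
        if (line[core] == State.SHARED || line[core] == State.INVALID) {
            line[other] = State.INVALID;       // effect of the RFO request
        }
        line[core] = State.MODIFIED;
    }

    public static void main(String[] args) {
        MesiToy t = new MesiToy();
        t.read(0);   // core 0 loads the line: E
        t.read(1);   // core 1 reads it too: both S
        t.write(1);  // core 1 writes: RFO, core 0 -> I, core 1 -> M
        System.out.println(t.line[0] + " " + t.line[1]); // INVALID MODIFIED
    }
}
```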
Only some of the state transitions are covered here, as groundwork for what follows. Describing all of them would take a great deal of text; you can refer to the diagram mentioned above and consult it for the more detailed workings of the MESI protocol.
False sharing
As the previous section shows, writes are expensive, especially when an RFO message must be sent. When we write programs, when do RFO requests occur? There are two cases:
1. A thread's work migrates from one processor to another, and every cache line it operates on must move to the new processor. If a cache line is then written while copies of it still exist on multiple cores, an RFO request must be sent.
2. Two different processors genuinely need to operate on the same cache line.
As we know from the previous article, the elements of an array in a Java program are contiguous in the cache. In fact, adjacent member variables of a Java object are also loaded into the same cache line. If multiple threads manipulate different member variables that happen to sit in the same cache line, the false sharing problem arises. The diagram and the experiment below are borrowed from a post by the lead of the Disruptor project (I am being lazy by reusing them, but I add a more detailed profiling method).
A thread running on core 1 wants to update the value of variable x, while a thread running on core 2 wants to update the value of variable y. Unfortunately, these two frequently written variables lie in the same cache line. The two threads then take turns sending RFO messages to claim ownership of that line. When core 1 gains ownership and starts updating x, the corresponding cache line on core 2 must be set to the I state; when core 2 gains ownership and starts updating y, the corresponding cache line on core 1 must be set to the I (invalid) state. Trading ownership back and forth not only generates a flood of RFO messages; when a thread then needs to read the line, both its L1 and L2 caches hold stale data and only the L3 cache has the up-to-date copy. We know from the previous article that reading from L3 already hurts performance badly. Worse still, a read across sockets misses L3 as well and can only be served from memory.
On the surface, x and y are each manipulated by an independent thread, and the two operations have nothing to do with each other; they merely share a cache line. Yet all of the contention originates from that sharing.
Experiment and analysis
To cite Martin's example with a minor change, the code is as follows:
```java
public final class FalseSharing implements Runnable {
    public static int NUM_THREADS = 4; // change
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;
    private static VolatileLong[] longs;

    public FalseSharing(final int arrayIndex) {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception {
        Thread.sleep(10000);
        System.out.println("starting....");
        if (args.length == 1) {
            NUM_THREADS = Integer.parseInt(args[0]);
        }

        longs = new VolatileLong[NUM_THREADS];
        for (int i = 0; i < longs.length; i++) {
            longs[i] = new VolatileLong();
        }
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException {
        Thread[] threads = new Thread[NUM_THREADS];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new FalseSharing(i));
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    public void run() {
        long i = ITERATIONS + 1;
        while (0 != --i) {
            longs[arrayIndex].value = i;
        }
    }

    public final static class VolatileLong {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; // comment out to reproduce false sharing
    }
}
```
The logic of the code: by default, four threads modify different elements of the same array. The element type, VolatileLong, has a single long member, value, plus six unused long members. value is declared volatile so that modifications to it are visible to all threads. Running it on a Westmere machine (Xeon E5620, 8core*2) gives:
duration = 9316356836
Now comment out the padding line in VolatileLong (the six long fields marked "comment out" in the code above) and look at the result:
duration = 59791968514
Two programs with identical logic: the former takes only 9 seconds, while the latter runs for nearly a minute. Incredible! Let us analyze it with false-sharing theory. In the second program, because VolatileLong has only a single long member, the four elements of the longs array are small enough that they all end up in the same cache line; with four threads manipulating that cache line at once, false sharing quietly sets in. Readers can test with 2, 4, 8, and 16 threads to see the effect and the trend.
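You can check how the objects are actually laid out with OpenJDK's JOL (Java Object Layout) tool. This is an extra step of my own, not part of the original experiment, and it assumes the org.openjdk.jol:jol-core library is on the classpath:

```java
import org.openjdk.jol.info.ClassLayout;

// Prints field offsets and the total instance size of VolatileLong, so
// you can see how many instances fit into a single 64-byte cache line.
public class LayoutCheck {
    public static void main(String[] args) {
        System.out.println(
                ClassLayout.parseClass(FalseSharing.VolatileLong.class)
                           .toPrintable());
    }
}
```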
So how do we avoid false sharing? The uncommented version of the code shows the way. We know a cache line is 64 bytes, while a Java object header takes 8 bytes (32-bit JVM) or 12 bytes (64-bit JVM with pointer compression on, the default; 16 bytes with compression off); see the link for details. We only need to pad with six unused longs, 6*8 = 48 bytes, so that different VolatileLong objects land in different cache lines, and false sharing is avoided (on 64-bit systems it does not matter if the object exceeds 64 bytes, as long as different threads do not operate on the same cache line). This technique is called padding.
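As an aside not covered in the original post: since JDK 8 the JVM can insert this padding for you via the @Contended annotation; for user classes it only takes effect when you run with -XX:-RestrictContended. A minimal sketch of a drop-in replacement for VolatileLong:

```java
// JDK 8 sketch: @sun.misc.Contended asks the JVM to pad the annotated
// field; run with -XX:-RestrictContended or it is ignored outside the JDK.
// (In JDK 9+ the annotation moved to jdk.internal.vm.annotation.Contended.)
public final static class ContendedLong {
    @sun.misc.Contended
    public volatile long value = 0L;
}
```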
How can we observe the effect of this optimization at the system level? Unfortunately, because microarchitectures differ so much, there is no tool that detects false-sharing events directly (Intel VTune and Valgrind included); all the tools observe it from the side. Below we demonstrate it with the Linux tool OProfile. The array in the program above is only 64 * 4 = 256 bytes, laid out in contiguous physical memory, so logically the data should keep hitting in the L1 cache and should not be brought into the L2 cache, except when false sharing occurs. We can therefore prove the point by watching the L2 cache lines-in event. The steps are as follows:
```
# set up capture of the L2 cache lines-in event
$ sudo opcontrol --setup --event=l2_lines_in:100000 --image=`which java`
```
Compare the results of the two versions. First, the slow version:
```
$ opreport -l
CPU: Intel Westmere microarchitecture, speed 2400.2 MHz (estimated)
Counted l2_lines_in events with a unit mask of 0x07 (any L2 lines allocated) count 100000
samples  %        image name                                              symbol name
34085    99.8447  anon (tgid:18051 range:0x7fcdee53d000-0x7fcdee7ad000)   anon (tgid:18051 range:0x7fcdee53d000-0x7fcdee7ad000)
51        0.1494  anon (tgid:16054 range:0x7fa485722000-0x7fa485992000)   anon (tgid:16054 range:0x7fa485722000-0x7fa485992000)
2         0.0059  anon (tgid:2753 range:0x7f43b317e000-0x7f43b375e000)    anon (tgid:2753 range:0x7f43b317e000-0x7f43b375e000)
```
Fast version:
```
$ opreport -l
CPU: Intel Westmere microarchitecture, speed 2400.2 MHz (estimated)
Counted l2_lines_in events with a unit mask of 0x07 (any L2 lines allocated) count 100000
samples  %        image name                                              symbol name
22       88.0000  anon (tgid:18873 range:0x7f3e3fa8a000-0x7f3e3fcfa000)   anon (tgid:18873 range:0x7f3e3fa8a000-0x7f3e3fcfa000)
3        12.0000  anon (tgid:2753 range:0x7f43b317e000-0x7f43b375e000)    anon (tgid:2753 range:0x7f43b317e000-0x7f43b375e000)
```
The slow version, because of false sharing, triggers the L2 lines-in event 34,085 times, while the fast version triggers it only 22 times.
Summary
False sharing arises easily in multicore programming and is quite subtle. For example, the JDK's LinkedBlockingQueue has a reference head pointing to the head of the queue and a reference last pointing to its tail. This kind of queue is frequently used in asynchronous programming, and the two references are generally modified by different threads, yet they are likely to sit in the same cache line, producing false sharing. The more threads and the more cores there are, the greater the negative impact on performance.
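A sketch of the padding-based mitigation for this kind of queue, using hypothetical class and field names of my own (the Disruptor applies the same idea to its sequence counters):

```java
// Hypothetical sketch: HotSpot may reorder fields within one class, so
// the padding is placed in a superclass; superclass fields are laid out
// before subclass fields, keeping head and last on different cache lines.
class HeadRef {
    volatile Object head;              // modified by taking (consumer) threads
}
class HeadPad extends HeadRef {
    long p1, p2, p3, p4, p5, p6, p7;   // 56 bytes of padding after head
}
class PaddedQueueRefs extends HeadPad {
    volatile Object last;              // modified by putting (producer) threads
}
```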
Some Java compilers will optimize away unused padding data, such as the six long fields in the sample code, at compile (JIT) time; you can add some code to the program to prevent that optimization:
```java
public static long preventFromOptimization(VolatileLong v) {
    return v.p1 + v.p2 + v.p3 + v.p4 + v.p5 + v.p6;
}
```
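For this to help, the returned sum should itself be consumed somewhere the JIT cannot prove dead; the fragment below is my own illustration, not from the original post, and would go at the end of main:

```java
// Hypothetical usage: accumulate and print the padding sums so the JIT
// treats the p1..p6 fields as live data.
long sink = 0L;
for (VolatileLong v : longs) {
    sink += preventFromOptimization(v);
}
System.out.println("sink = " + sink);
```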
Also, because of Java's garbage collection, the location of objects in memory, and hence their mapping onto CPU cache lines, can change over time, so be aware of GC's effects when using padding.