Reprinted from: http://ifeve.com/from-javaeye-cpu-cache/ and http://ifeve.com/from-javaeye-false-sharing/
The CPU is the brain of the computer: it executes the program's instructions, while memory holds the data, including the program itself. Memory is far slower than the CPU; fetching a single piece of data from main memory now costs roughly 200 CPU cycles, whereas reading a CPU register usually takes just 1 cycle.
To speed things up, a web browser caches previously visited pages on local storage, and a traditional or NoSQL database usually keeps a cache in memory to reduce slow disk I/O. In the same way, memory is too far from the CPU, so CPU designers added a cache (the CPU cache) next to the CPU. If the same batch of data is operated on many times, keeping it in a cache closer to the CPU gives the program a big speed boost. For example, in a loop that counts something, the counter variable can stay in the cache instead of being fetched from memory on every iteration. The structure of the CPU cache is outlined below.
With the rise of multicore, the CPU cache is split into three levels: L1, L2, and L3. The lower the level, the closer it sits to the CPU, the faster it is, and the smaller its capacity. L1 is closest to the CPU, smallest (for example 32KB), and fastest; each core has its own L1 cache (strictly speaking, two of them: an L1d cache for data and an L1i cache for instructions). L2 is larger (for example 256KB) and slower, and each core generally still has its own separate L2 cache. L3 is the largest of the three levels (for example 12MB) and also the slowest; all cores in the same CPU socket share one L3 cache.
Just as with a database cache, a lookup first tries the fastest cache. On a miss (cache miss) it falls to the next level, and only if the data is not found in any of the three levels does it go out to memory. The farther down the data is found, the more time the access costs.
For efficiency, the cache is not filled one value at a time. It is organized into cache lines, and a typical cache line is 64 bytes; the CPU reads and writes the cache in units of whole lines. A Java long is 8 bytes, so one cache line holds 8 long values. When you access a long array and one element is loaded into the cache, the 7 neighboring elements are loaded along with it, which is why iterating over such an array is very fast.
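To make the cache-line effect concrete, here is a rough sketch (not from the original article; the class name and sizes are made up, and actual timings depend entirely on the machine): summing a long[][] matrix row by row touches 8 consecutive longs per 64-byte line, while summing it column by column jumps to a new cache line on almost every access, which is typically much slower.

    // Toy demonstration of cache-line effects; timings are machine-dependent.
    public class CacheLineDemo {
        private static final int N = 2048;
        private static final long[][] matrix = new long[N][N];

        public static void main(String[] args) {
            long sum = 0;
            long start = System.nanoTime();
            // Row-major: consecutive longs share a 64-byte cache line (8 longs per line).
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    sum += matrix[i][j];
            System.out.println("row-major:    " + (System.nanoTime() - start) / 1_000_000 + " ms");

            start = System.nanoTime();
            // Column-major: each access lands on a different cache line.
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    sum += matrix[i][j];
            System.out.println("column-major: " + (System.nanoTime() - start) / 1_000_000 + " ms");
            System.out.println(sum); // use the result so the loops cannot be optimized away
        }
    }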
Today's typical CPU microarchitecture has this three-level cache, with each core owning its private L1 and L2. So in a multithreaded program, what happens when a thread on another core wants the data sitting in the current core's L1 or L2 cache lines?
One option would be for core 2 to reach into core 1's cache line directly. This is possible, but it is not fast enough: the cross-core access has to go through the memory controller, and every time core 2 touches core 1's data it pays that cross-core cost. Worse, cores 1 and 2 may not even be in the same socket, and the memory controller's bus bandwidth is limited, so it cannot carry that much traffic. CPU designers therefore prefer another approach: if core 2 needs the data, core 1 sends it over, so the data only has to be transferred once.
So when does this transfer of cache lines happen? The answer is simple: it happens when one core needs to read a cache line that another core has dirtied. But how does the former know that the latter's cache line has been dirtied (written to)?
The questions above are answered in detail below. First we need to talk about a protocol: the MESI protocol, which mainstream processors now use to keep their caches coherent with each other and with memory. M, E, S, and I stand for the four states a cache line can be in under MESI:
M (Modified): The local processor has modified the cache line. It is a dirty line; its contents differ from what is in memory, and this cache holds the only copy (it is exclusive to this cache).
E (Exclusive): The cache line's contents match memory, and no other processor holds this line.
S (Shared): The cache line's contents match memory, and other processors may also hold copies of this line.
I (Invalid): The cache line is invalid and cannot be used.
Here is a simple explanation of how a cache line moves between these four states (a toy state-machine sketch follows the list):
Initial: at the start the cache line holds no data, so it is in the I state.
Local write: if the local processor writes data into a cache line that is in the I state, the line's state becomes M.
Local read: if the local processor reads a cache line that is in the I state, the cache obviously has no data for it yet. There are two cases: (1) no other processor's cache has this line either, so the data is loaded from memory into this cache line and its state is set to E, meaning "only I have this data, nobody else does"; (2) another processor's cache does have this line, so the data is loaded into this cache line and its state is set to S, meaning it is shared. P.S. If a cache line in the M state is then read or written by the local processor, its state does not change.
Remote read: suppose there are two processors, C1 and C2. If C2 needs to read a cache line held by the other processor C1, C1 sends the contents of that cache line to C2 through the memory controller, and C2 then sets its corresponding cache line to the S state. Before that, main memory also picks this data up from the bus and stores it.
Remote write: this is not really a remote write. Rather, C2 gets C1's data not to read it but to write it, which is again a local write; the complication is that C1 still holds a copy of the data, so what then? C2 issues an RFO (Request For Owner) message: it needs ownership of this line of data, and the corresponding cache line in every other processor is set to I. From then on, nobody but C2 touches this line. This guarantees the safety of the data, but handling the RFO request and invalidating the other copies makes the write operation expensive.
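The transitions just listed can be collapsed into a small state machine. The following is only a toy, single-cache-line model written for this reprint (the MesiState and CacheLine names are made up, and real hardware handles many more corner cases), but it mirrors the local read, local write, remote read, and remote write rules above.

    // Toy model of MESI transitions for a single cache line (not how real hardware is implemented).
    enum MesiState { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    class CacheLine {
        MesiState state = MesiState.INVALID;      // initial: I

        void localRead(boolean otherCoreHasCopy) {
            if (state == MesiState.INVALID) {
                // Load from memory: E if we are the only owner, S if another cache also has it.
                state = otherCoreHasCopy ? MesiState.SHARED : MesiState.EXCLUSIVE;
            }
            // M/E/S: the read hits locally and the state is unchanged.
        }

        void localWrite() {
            // Writing dirties the line; if copies exist elsewhere, an RFO invalidates them first.
            state = MesiState.MODIFIED;
        }

        void remoteRead() {
            if (state == MesiState.MODIFIED || state == MesiState.EXCLUSIVE) {
                // The line is sent to the requesting core (and written back); both copies become S.
                state = MesiState.SHARED;
            }
        }

        void remoteWrite() {
            // Another core issued an RFO for ownership; our copy can no longer be used.
            state = MesiState.INVALID;
        }
    }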
As you can see, write operations are expensive, especially when an RFO message has to be sent. When does a program's write trigger an RFO request? There are two cases:
1. A thread's work migrates from one processor to another, and all the cache lines it operates on need to move to the new processor. If a cache line is then written to while copies of it still exist on different cores, an RFO request has to be sent.
2. Two threads on different processors genuinely need to operate on the same cache line.
In a Java program, the elements of an array are laid out contiguously in the cache, and adjacent member variables of a Java object are likewise loaded into the same cache line. If multiple threads operate on different member variables that happen to sit in the same cache line, the false sharing problem occurs.
For example: a thread running on core 1 wants to update the variable x, while a thread running on core 2 wants to update the variable y, and these two frequently modified variables sit in the same cache line. The two threads take turns sending RFO messages to claim ownership of that line. When core 1 gets ownership and starts updating x, core 2's copy of the line has to be set to the I state; when core 2 gets ownership and starts updating y, core 1's copy has to be set to the I (invalid) state. This tug-of-war over ownership not only generates a lot of RFO messages; when a thread then needs to read the line, its L1 and L2 hold stale data and only L3 has the up-to-date copy, and reading from L3 already hurts performance badly. Worse still, if the read crosses sockets, L3 misses too, and the data can only be loaded from memory.
On the surface, x and y are each manipulated by their own thread, and the two operations have nothing to do with each other. Their only connection is that they share a cache line, yet all of the contention stems from that sharing.
So how do we avoid false sharing? A cache line is 64 bytes, while a Java object header is 8 bytes on a 32-bit JVM, and on a 64-bit JVM it is 12 bytes with compressed pointers (the default) or 16 bytes without compression. By simply padding with 6 unused long fields, 6*8 = 48 bytes, we can push different variables onto different cache lines and avoid false sharing (on a 64-bit JVM the object may end up larger than the 64-byte cache line, but that does not matter, as long as different threads never operate on the same cache line). This technique is called padding. For example:
    public final static class VolatileLong {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6;
    }
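To watch the padding pay off, a rough (non-rigorous) harness like the one below can be used: two threads each hammer the value field of a different VolatileLong element of a shared array, and deleting the p1..p6 padding fields typically makes the run noticeably slower on a multicore machine. The class is repeated inside the sketch only so it compiles on its own; a serious measurement would use a proper benchmarking harness such as JMH.

    // Rough sketch: two threads writing to adjacent VolatileLong elements of one array.
    public class FalseSharingDemo implements Runnable {
        // The padded class from the article, nested here so the sketch is self-contained.
        public final static class VolatileLong {
            public volatile long value = 0L;
            public long p1, p2, p3, p4, p5, p6;    // remove these to observe false sharing
        }

        private static final VolatileLong[] longs = { new VolatileLong(), new VolatileLong() };
        private static final long ITERATIONS = 100_000_000L;
        private final int index;

        public FalseSharingDemo(int index) { this.index = index; }

        @Override
        public void run() {
            for (long i = 0; i < ITERATIONS; i++) {
                longs[index].value = i;            // each thread writes only its own element
            }
        }

        public static void main(String[] args) throws InterruptedException {
            long start = System.nanoTime();
            Thread t1 = new Thread(new FalseSharingDemo(0));
            Thread t2 = new Thread(new FalseSharingDemo(1));
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
        }
    }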
False sharing is easy to run into in multicore programming and is quite subtle. For example, the JDK's LinkedBlockingQueue has a reference pointing to the head of the queue and another pointing to its tail. This kind of queue is common in asynchronous programming; the two references are frequently modified by different threads, yet they may well end up in the same cache line, producing false sharing. The more threads and the more cores, the bigger the hit to performance.
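As a purely hypothetical illustration of that situation (this is not LinkedBlockingQueue's actual source; the class and field names are made up), a queue whose head and tail references are updated by different threads could keep the two references on separate cache lines with the same padding trick:

    // Hypothetical sketch: keeping two hot references out of the same cache line.
    public class PaddedQueueEnds<E> {
        static class Node<E> { E item; Node<E> next; }

        Node<E> head;                    // typically modified by the consuming thread
        long p1, p2, p3, p4, p5, p6;     // padding so head and tail do not share a cache line
        Node<E> tail;                    // typically modified by the producing thread
    }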
Some Java compilers may optimize away padding fields that are never used, including the 6 long fields in the VolatileLong example above. You can add a little code to the program to keep the padding from being optimized out:
    public static long preventFromOptimization(VolatileLong v) {
        return v.p1 + v.p2 + v.p3 + v.p4 + v.p5 + v.p6;
    }
Also, because of Java's GC, the position of data in memory, and therefore the cache line it maps to, can change, so keep the GC's impact in mind when using padding.
Understanding CPU caches and false sharing from a Java perspective