False sharing: the silent performance killer of concurrent programming

Source: Internet
Author: User


In concurrent programming, most of our attention goes to controlling access to shared variables at the code level, but few people pay attention to the influence of the underlying hardware and the JVM. Some time ago I studied Disruptor, a high-performance asynchronous processing framework billed as the "fastest message framework"; its LMAX architecture can process 6 million orders per second on a single thread. While reading about why Disruptor is so fast, I came across the concept of false sharing: write contention on a cache line is the most important limiting factor for the scalability of parallel threads running on an SMP system. Since it is hard to tell from the code alone whether false sharing will occur, it has been described as a silent performance killer.

This article merely merges and organizes material I have learned; I have not yet studied or practiced it in depth, and I hope it helps you understand false sharing from scratch.

A non-rigorous definition of false sharing: the cache system stores data in units of cache lines, and when multiple threads modify mutually independent variables that happen to share the same cache line, they inadvertently hurt each other's performance. That is false sharing.

The following sections describe the causes and consequences of false sharing. First, we need to know what the cache system is.

CPU Cache

Baidu Encyclopedia defines the CPU cache as follows:

The cache memory is a small, fast memory that sits between the CPU and main memory. Its capacity is much smaller than main memory, but its access speed is much higher.
The cache exists mainly to bridge the gap between the CPU's speed and memory read/write speed: the CPU runs so much faster than memory that it would otherwise spend a long time waiting for data to arrive or be written back.
The data in the cache is a small part of main memory, but it is the part the CPU is about to access. When the CPU needs a large amount of data, it can often read it from the cache instead of going to memory, which speeds up access.

There are several levels of cache between the CPU and main memory, because even direct access to main memory is very slow. If you perform the same operation on a piece of data multiple times, it makes sense to keep that data close to the CPU while the operation runs.

By access order and proximity to the CPU, the cache is divided into a level 1 cache (L1) and a level 2 cache (L2); many CPUs also have a level 3 cache (L3). The data stored at each level is a subset of the data at the next level, and the closer a cache is to the CPU, the faster and smaller it is. The L1 cache is very small but very fast, and is private to the CPU core that uses it. L2 is larger and slower, and is typically still private to a single core. L3 is common on modern multi-core machines: larger and slower again, and shared by all CPU cores in a single socket. Finally, main memory is shared by all cores in all sockets. On a CPU with three cache levels, the caches can achieve roughly a 95% hit rate; less than 5% of accesses have to go to main memory.
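As a rough illustration of why cache proximity matters, here is a small sketch (the class name and array sizes are my own choices, not from the original article). It performs the same number of long reads over a small array that fits in L1 and over a large array that overflows a typical L3; on most machines the first loop runs several times faster even though both do identical arithmetic.

```java
// Illustrative sketch: the same number of memory reads is much cheaper
// when the working set fits in cache.
public class WorkingSetDemo {

    // Sum `total` long reads over array `a`, wrapping around its end.
    public static long sum(long[] a, int total) {
        long s = 0;
        int idx = 0;
        for (int i = 0; i < total; i++) {
            s += a[idx];
            idx = (idx + 1 == a.length) ? 0 : idx + 1;
        }
        return s;
    }

    public static void main(String[] args) {
        int total = 1 << 23;                  // ~8M reads in both cases
        long[] small = new long[4 * 1024];    // 32 KB: fits in a typical L1
        long[] big = new long[1 << 23];       // 64 MB: overflows a typical L3
        java.util.Arrays.fill(small, 1L);
        java.util.Arrays.fill(big, 1L);

        long t0 = System.nanoTime();
        long s1 = sum(small, total);
        long t1 = System.nanoTime();
        long s2 = sum(big, total);
        long t2 = System.nanoTime();

        // Same result, very different cost: the second loop keeps missing
        // in cache and has to go to main memory.
        System.out.printf("small: %d ms, big: %d ms, sums: %d / %d%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
    }
}
```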

(The original article includes a figure showing the storage hierarchy of a multi-core machine.)

MESI protocol and RFO request

From the previous section we know that each core has its own private L1 and L2 caches. In multi-threaded programming, what happens when a thread on another core wants to access data sitting in the current core's L1 or L2 cache lines?

One option would be for core 1 to access core 2's cache lines directly, but this is not fast enough. Cross-core access has to go through the memory controller (the component of the computer system that controls memory internally; all data exchange between memory and the CPU passes through it). In the typical case, core 2 repeatedly accesses core 1's data, paying the cross-core cost every time. Worse, core 1 and core 2 may not even be in the same socket, and the memory controller's bus bandwidth is limited and cannot handle that much traffic. CPU designers therefore prefer another approach: if core 2 needs the data, core 1 sends it the cache line's contents directly, so the data is transferred only once.

So when does this transfer of cache lines occur? The answer is simple: when one core needs to read a cache line that another core has modified (a dirty line). But how does the reader determine that the other core's cache line is dirty?

The next sections answer these questions in detail. First we need to introduce the MESI protocol, which mainstream processors currently use to keep caches and memory coherent. M, E, S and I are the four states a cache line can be in under MESI:

M (Modified): the local processor has modified the cache line; it is a dirty line, its contents differ from memory, and this cache holds the only copy (it is private).
E (Exclusive): the cache line's contents match memory, and no other processor caches this line.
S (Shared): the cache line's contents match memory, and other processors may also hold copies of this line.
I (Invalid): the cache line is invalid and cannot be used.

The following describes how transitions among these four states occur:

Initial: at the start, the cache line holds no data, so it is in the I state.
Local Write: if the local processor writes to a cache line in the I state, the line's state becomes M.
Local Read: if the local processor reads a cache line in the I state, the cache obviously has no data for it. There are two cases: (1) if no other processor's cache holds this line, the data is loaded from memory into this cache line and it is set to the E state, meaning "only I have this data; no other processor does"; (2) if another processor's cache does hold this line, the line is set to the S state. (Note: a line in the M state does not change state on local reads or writes.)
Remote Read: suppose we have two processors, c1 and c2. If c2 needs to read a cache line held by another processor c1, c1 sends the line's contents to c2 through the memory controller. After receiving it, c2 sets the corresponding line to the S state; before that, memory also pulls the data off the bus and stores it, so memory stays consistent.
Remote Write: this is not really a remote write; rather, after c2 obtains the data from c1, it wants to write it, not just read it. That makes it a local write, but c1 still holds a copy of the data. What then? c2 sends an RFO (Request For Ownership), which claims ownership of this line of data; the corresponding cache line in every other processor is set to I, so no one but c2 can change the line. This guarantees data safety, but the process of issuing RFO requests and invalidating other copies makes write operations expensive.
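The transitions above can be sketched as a simple state function. This is an illustrative model only; the real protocol lives in cache-controller hardware, and the class and method names here are my own.

```java
// Illustrative sketch of the MESI states and the transitions described
// above, from the point of view of one cache's copy of one line.
public class MesiSketch {
    public enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    // Local write: the writer ends up with the only modified copy.
    // (If the line was S, an RFO is issued first to invalidate other copies.)
    public static State localWrite(State s) {
        return State.MODIFIED;
    }

    // Local read of an INVALID line: E if no other cache holds it, else S.
    public static State localRead(State s, boolean otherCachesHaveLine) {
        if (s == State.INVALID) {
            return otherCachesHaveLine ? State.SHARED : State.EXCLUSIVE;
        }
        return s; // M/E/S lines can be read with no state change
    }

    // Another core reads our line: a dirty or exclusive copy becomes SHARED
    // (for M, the data is also written back so memory stays consistent).
    public static State remoteRead(State s) {
        return (s == State.MODIFIED || s == State.EXCLUSIVE) ? State.SHARED : s;
    }

    // Another core issues an RFO for this line: our copy is invalidated.
    public static State remoteWrite(State s) {
        return State.INVALID;
    }

    public static void main(String[] args) {
        State s = State.INVALID;
        s = localRead(s, false);   // load from memory -> EXCLUSIVE
        s = localWrite(s);         // write -> MODIFIED (dirty)
        s = remoteRead(s);         // another core reads -> SHARED
        s = remoteWrite(s);        // another core's RFO -> INVALID
        System.out.println(s);     // prints INVALID
    }
}
```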

Two situations commonly trigger these state transitions:

1. A thread's work is migrated from one processor to another, so all the cache lines it operates on must move to the new processor. If a line is written afterwards while copies exist on several cores, an RFO request must be sent. 2. Two different processors genuinely need to operate on the same cache line.

Next, we need to know what a cache line is.

Cache line

As mentioned at the start of this article, the cache system stores data in units of cache lines. A cache line is usually 64 bytes (this article assumes 64 bytes; other sizes such as 32 bytes are not our focus here), and it effectively caches a block of addresses in main memory. A Java long is 8 bytes, so 8 long variables fit in one cache line. Therefore, when you access a long array and one element is loaded into the cache, another seven are loaded along with it, which is why you can traverse the array so quickly. In fact, you can quickly traverse any data structure allocated in contiguous memory. If the items of your data structure are not adjacent in memory (a linked list, for example), you get no benefit from this free cache loading, and each item may incur a cache miss.
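To see the effect of contiguous layout, here is a small sketch (class name and sizes are my own choices) comparing traversal of a long[] with traversal of a LinkedList<Long> holding the same values. The array version is usually several times faster, because each 64-byte line loaded brings eight useful longs with it, while each list node can miss in cache.

```java
import java.util.LinkedList;
import java.util.List;

// Illustrative sketch: contiguous array traversal vs. pointer-chasing
// through LinkedList nodes scattered on the heap.
public class TraversalDemo {
    static final int N = 1 << 21; // ~2M elements

    public static long sumArray(long[] a) {
        long s = 0;
        for (long v : a) s += v;
        return s;
    }

    public static long sumList(List<Long> list) {
        long s = 0;
        for (long v : list) s += v;
        return s;
    }

    public static void main(String[] args) {
        long[] a = new long[N];
        List<Long> list = new LinkedList<>();
        for (int i = 0; i < N; i++) {
            a[i] = i;
            list.add((long) i);
        }

        long t0 = System.nanoTime();
        long s1 = sumArray(a);
        long t1 = System.nanoTime();
        long s2 = sumList(list);
        long t2 = System.nanoTime();

        // Identical sums; typically very different timings.
        System.out.printf("array: %d ms, list: %d ms, sums equal: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1 == s2);
    }
}
```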

Now suppose multiple threads operate on different member variables that happen to sit in the same cache line. What happens then? That's right: the false sharing problem occurs. A classic diagram from the Disruptor project illustrates exactly this scenario.

False sharing

Okay, let's verify this with code.

public class FalseShareTest implements Runnable {
    public static int NUM_THREADS = 4;
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;
    private static VolatileLong[] longs;
    public static long SUM_TIME = 0L;

    public FalseShareTest(final int arrayIndex) {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception {
        Thread.sleep(10000);
        for (int j = 0; j < 10; j++) {
            System.out.println(j);
            if (args.length == 1) {
                NUM_THREADS = Integer.parseInt(args[0]);
            }
            longs = new VolatileLong[NUM_THREADS];
            for (int i = 0; i < longs.length; i++) {
                longs[i] = new VolatileLong();
            }
            final long start = System.nanoTime();
            runTest();
            final long end = System.nanoTime();
            SUM_TIME += end - start;
        }
        System.out.println("Average time consumption: " + SUM_TIME / 10);
    }

    private static void runTest() throws InterruptedException {
        Thread[] threads = new Thread[NUM_THREADS];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new FalseShareTest(i));
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    public void run() {
        long i = ITERATIONS + 1;
        while (0 != --i) {
            longs[arrayIndex].value = i;
        }
    }

    public final static class VolatileLong {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; // comment out this line
    }
}

The logic of the code above is simple: four threads modify different elements of an array. The element type is VolatileLong, which has one long member, value, plus six unused long members. value is declared volatile so that modifications are visible to all threads. The program is run in two configurations: with the padding line of six longs present, and with that line commented out. To make the numbers "relatively" reliable, the program averages 10 runs. The results are as follows (environment: 32-bit Windows, quad-core, 8 GB memory):

(The original article shows screenshots of the timings for the unpadded and padded runs.)

The two versions have identical logic, yet the unpadded one takes about 2.5 times as long as the padded one, which is remarkable. Analyzing this with the false sharing theory: in the unpadded case the longs array has four elements, and since VolatileLong then contains only a single long member, the whole array fits in one cache line; with four threads writing to that one cache line at once, false sharing happens quietly.

Based on this, we have reason to believe that, within a certain range of thread counts (note the emphasis on the number of threads), more threads means more frequent false sharing and longer execution times. To test this, I ran the padded and unpadded versions with 1, 2, 4 and 8 threads on the same machine, again averaging 10 runs. (The original article presents the results in a chart.)

This brings us to the most important question: how do we avoid false sharing?

One solution is to keep the objects operated on by different threads on different cache lines. How? The answer is in the line of code we commented on: cache line padding. In the example above, a cache line holds 64 bytes, and a Java object header occupies 8 bytes (32-bit JVM) or 12 bytes (64-bit JVM with compressed oops, which is the default; 16 bytes without compression), so we only need to add six unused longs, 6 * 8 = 48 bytes, to push different VolatileLong objects onto different cache lines and avoid false sharing. (It does not matter that on a 64-bit JVM the object then exceeds the 64 bytes of one cache line, as long as different threads never operate on the same line.)
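For completeness, here is a sketch of the padding-by-inheritance pattern (class names are my own): since superclass fields are laid out before subclass fields, surrounding the hot field with padding in a parent and a child class prevents the JVM from reordering all the padding to one side. Since JDK 8 there is also the @sun.misc.Contended annotation (enabled for user code with -XX:-RestrictContended), which asks the JVM to pad the annotated field automatically.

```java
// Illustrative sketch: isolate `value` on its own cache line by
// surrounding it with padding fields. The JVM may reorder fields
// within one class, but superclass fields are laid out before
// subclass fields, so the two padding groups stay on either side.
class LhsPadding {
    protected long p1, p2, p3, p4, p5, p6, p7; // 56 bytes before value
}

class Value extends LhsPadding {
    protected volatile long value;
}

public class PaddedLong extends Value {
    protected long p9, p10, p11, p12, p13, p14, p15; // 56 bytes after value

    public void set(long v) { value = v; }
    public long get() { return value; }

    public static void main(String[] args) {
        PaddedLong p = new PaddedLong();
        p.set(42L);
        System.out.println(p.get()); // prints 42
    }
}
```

This is the same idea as the manual p1..p6 fields above, just hardened against field reordering; Disruptor's Sequence class uses this layout.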

False sharing arises easily in multi-core programming and is very well hidden. For example, JDK's LinkedBlockingQueue has a reference head pointing to the head of the queue and a reference tail pointing to its end. This kind of queue is often used in asynchronous programming, and the two references are typically modified by different threads, yet they may well sit on the same cache line, producing false sharing. The more threads and the more cores, the greater the negative impact on performance.

Because of certain Java compiler optimization strategies, the unused padding fields may be optimized away at compile time. We can add some code to the program that reads them, to prevent this optimization:

public static long preventFromOptimization(VolatileLong v) {          return v.p1 + v.p2 + v.p3 + v.p4 + v.p5 + v.p6;  }

Another technique is to use compiler directives to force each variable to be aligned. The following code explicitly asks the compiler for alignment with __declspec(align(n)), where n = 64, so each variable is aligned on a cache line boundary.

__declspec(align(64)) int thread1_global_variable;
__declspec(align(64)) int thread2_global_variable;

When an array is used, pad at the end of the structure so that each array element starts on a cache line boundary. If the structure cannot be aligned exactly to the cache line boundary, pad it so that its size is a multiple of the cache line size. The following code explicitly pads the data structure so that it is cache-line aligned; the __declspec(align(n)) statement ensures the array itself is aligned as well. If the array is allocated dynamically, you can increase the allocation size and adjust the pointer to land on a cache line boundary.

struct ThreadParams
{
    // For the following 4 variables: 4*4 = 16 bytes
    unsigned long thread_id;
    unsigned long v; // Frequently read/written variable
    unsigned long start;
    unsigned long end;
    // Expand to 64 bytes to avoid false sharing:
    // (4 unsigned long variables + 12 padding ints) * 4 = 64
    int padding[12];
};

In addition, there is a good deal of research on false sharing, including some proposed solutions based on data fusion; interested readers can look into it further.

What should we do about false sharing in actual development?

From the above introduction, we now know how false sharing affects a program. So, in real production development, must we solve every potential false sharing problem by padding cache lines?

Not necessarily.

First of all, as repeatedly stressed, false sharing is very well hidden, and for now we have no tools that detect false sharing events at the system level. Second, different machines have different microarchitectures (for example, object layouts differ between 32-bit and 64-bit JVMs), so an exact padding scheme only applies to one specific platform, making it even harder to get right. Furthermore, cache resources are limited; padding wastes precious cache space and is not suitable for wide application. Finally, on mainstream Intel microarchitectures the L1 cache already achieves hit rates above 80%.

To sum up, not every system warrants spending a great deal of energy on potential false sharing problems.

 

 

Appendix

Reference 1: Understanding False Sharing from the Java perspective

Reference 2: Avoiding and identifying false sharing among threads

Reference 3: A data-fusion-based method to improve locality and reduce false sharing

 
