Why do I need a memory barrier? [repost]


Transferred from: http://blog.csdn.net/chen19870707/article/details/39896655

    • Author: Echo Chen (Chenbin)

    • Email: [email protected]

    • Blog: blog.csdn.net/chen19870707

    • Date: September 30th, 2014

According to an article from outside the wall [1], the best way to understand how to use memory barriers is to understand why they exist. To speed up instruction execution, CPU hardware designers added two buffers: the store buffer and the invalidate queue. These buffers save the CPU from unnecessary waiting in some cases, but they also introduce new problems.

To analyze this problem carefully, you need to understand how the cache works.

A modern CPU cache works much like the hash table used in software programming, a design known as "N-way set associative": the number of sets plays the role of the hash modulus (the number of hash buckets), and the "N ways" are the maximum length of each bucket's chain. Each entry in a chain is called a cache line, a fixed-size block of memory. Read operations are straightforward, so they are not covered here. If a CPU wants to write a data item, the item must first be removed from the other CPUs' caches; this is called invalidation. Once invalidation is complete, the CPU can safely modify the data. If the data item is present in this CPU's cache but read-only, the process is called a "write miss". Once the CPU has removed the item from the other CPUs' caches, it can read and write it repeatedly. If another CPU then tries to access the item, it takes a "cache miss", because the first CPU has invalidated that CPU's copy. This kind of cache miss is called a "communication miss", because the items that cause it are typically used for communication between CPUs; locks are a classic example.
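To make the hash-table analogy concrete, here is a minimal sketch of the set/offset arithmetic. The geometry (64-byte lines, 16 sets, 8 ways) is hypothetical, chosen only for illustration:

   #include <stdint.h>
   #include <stdio.h>

   #define LINE_SIZE 64   /* bytes per cache line                 */
   #define NUM_SETS  16   /* the "hash modulus": number of sets   */
   #define NUM_WAYS  8    /* max "chain length" within each set   */

   int main(void)
   {
       uintptr_t addr = 0x12345678;

       /* Low bits pick the byte within a line; the next bits pick
        * the set, exactly like taking a hash modulus. */
       uintptr_t offset = addr % LINE_SIZE;
       uintptr_t set    = (addr / LINE_SIZE) % NUM_SETS;

       printf("address 0x%lx -> set %lu, offset %lu (one of %d ways)\n",
              (unsigned long)addr, (unsigned long)set,
              (unsigned long)offset, NUM_WAYS);
       return 0;
   }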

To keep caches consistent in a multiprocessor environment, a protocol is needed to prevent data inconsistency and loss. The protocol in use today is the MESI protocol, whose name combines the first letters of its four states: Modified, Exclusive, Shared, and Invalid. Under this protocol, each cache line carries a 2-bit tag indicating its current state (a sketch of the states as a C enum follows the list below).

Modified state: the cache line holds data modified by this CPU; that data appears in no other CPU's cache, and the copy in memory is stale, so this cache holds the only up-to-date copy.
Exclusive state: like Modified, except the data has not been modified, so the copy in memory is up to date. Evicting the line at this point requires no write-back to memory.
Shared state: copies of the data may exist in other CPUs' caches, so the CPU must coordinate with the other CPUs before it can write to the line.
Invalid state: the cache line is empty.
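As a conceptual model only (not any real hardware interface), the 2-bit tag and its four states can be written down as a C enum:

   #include <stdint.h>

   enum mesi_state {          /* fits in the 2-bit tag      */
       MESI_MODIFIED,         /* dirty, sole copy           */
       MESI_EXCLUSIVE,        /* clean, sole copy           */
       MESI_SHARED,           /* clean, copies may exist    */
       MESI_INVALID,          /* line holds no valid data   */
   };

   /* Hypothetical cache-line layout, for illustration only. */
   struct cache_line {
       uint64_t        tag;       /* which memory block is cached  */
       enum mesi_state state;     /* current MESI state            */
       uint8_t         data[64];  /* the cached block itself       */
   };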

MESI switches between the above states by passing messages; see [1] for the detailed transitions. If the CPUs share a common bus, the following messages suffice (a conceptual model in C follows the list):

Read: contains the physical address of the cache line to be read.
Read Response: contains the data requested by a Read; it may be supplied by memory or by another cache.
Invalidate: contains the physical address of the cache line to invalidate; all other caches must remove the corresponding line.
Invalidate Ack: the reply confirming the removal.
Read Invalidate: contains the physical address of the cache line to be read, while all other caches remove their copies. It requires a Read Response plus a set of Invalidate Ack messages.
Writeback: contains the address and data to be written back; it flushes a line in the Modified state back to memory, making room for other data.
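Again purely as a sketch (real coherence traffic lives in hardware), the message vocabulary above might be modeled like this:

   #include <stdint.h>

   enum mesi_msg_type {
       MSG_READ,             /* request a line by physical address     */
       MSG_READ_RESPONSE,    /* the requested data (memory or cache)   */
       MSG_INVALIDATE,       /* other caches must drop the line        */
       MSG_INVALIDATE_ACK,   /* confirmation that the line was dropped */
       MSG_READ_INVALIDATE,  /* read the line AND make others drop it  */
       MSG_WRITEBACK,        /* flush a Modified line back to memory   */
   };

   struct mesi_msg {
       enum mesi_msg_type type;
       uint64_t phys_addr;   /* which cache line the message refers to */
       uint8_t  data[64];    /* payload for READ_RESPONSE / WRITEBACK  */
   };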

In the words of reference [1]:

Interestingly enough, a shared-memory multiprocessor really is a message-passing computer under the covers. This means that clusters of SMP machines that use distributed shared memory are using message passing to implement shared memory at two different levels of the system architecture.

Although the protocol guarantees data consistency, it is inefficient in some cases. For example, if CPU0 wants to update a data item that sits in CPU1's cache, it must wait for the cache line to travel from CPU1's cache to CPU0's cache before it can perform the write. A cache-to-cache transfer takes a long time, orders of magnitude longer than a simple register operation. And the wait is pointless: whatever data arrives from CPU1's cache, CPU0 is going to overwrite it anyway. To solve this, hardware designers introduced the store buffer, which sits between the CPU and the cache. On a write, the CPU puts the data straight into the store buffer and no longer waits for the other CPU's messages. But this design leads to an obvious error case.

Consider the following code:

   a = 1;
   b = a + 1;
   assert(b == 2);

Assume that a and b both start at 0, that a is in CPU1's cache, and that b is in CPU0's cache. Suppose the code executes in the following order:

1. CPU0 executes a = 1.
2. Because a is in CPU1's cache, CPU0 sends a "read invalidate" message to take ownership of the data.
3. CPU0 puts the new value of a into its store buffer.
4. CPU1 receives the "read invalidate" message, passes the cache line over, and removes it from its own cache.
5. CPU0 begins executing b = a + 1.
6. CPU0 receives the cache line from CPU1, which still says "a = 0".
7. CPU0 reads the value of a from its cache and gets "0".
8. CPU0 applies the store-buffer entry to the cache line, which now says "a = 1".
9. CPU0 adds 1 to the "0" it read for a and writes the result "1" into b (b is in CPU0's cache, so this write is direct).
10. CPU0 executes assert(b == 2), which fails.

The problem is that we had two copies of a: one in the cache line and one in the store buffer. The hardware designers' solution is "store forwarding": a load consults both the cache and the store buffer. That is, when a load executes, if the data is available in the store buffer, the CPU takes it straight from the store buffer without going through the cache. Because store forwarding is implemented in hardware, software does not need to do anything about it.
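As a purely conceptual model of store forwarding (hardware does this transparently; every name below is invented for the sketch), a load first searches the store buffer and only then falls back to the cache:

   #include <stdbool.h>
   #include <stdint.h>

   struct sb_entry {          /* a pending write, not yet in the cache */
       uint64_t addr;
       uint64_t value;
       bool     valid;
   };

   #define SB_SIZE 8
   static struct sb_entry store_buffer[SB_SIZE];

   /* Stub standing in for a normal cache lookup. */
   static uint64_t cache_read(uint64_t addr) { (void)addr; return 0; }

   /* Store forwarding: the load checks the store buffer first, so the
    * CPU sees its own pending writes instead of a stale cached value. */
   uint64_t load(uint64_t addr)
   {
       for (int i = SB_SIZE - 1; i >= 0; i--)   /* newest entry wins */
           if (store_buffer[i].valid && store_buffer[i].addr == addr)
               return store_buffer[i].value;
       return cache_read(addr);                 /* fall back to cache */
   }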

There is another failure case. Consider the following code:

   void foo(void)
   {
       a = 1;
       b = 1;
   }

   void bar(void)
   {
       while (b == 0) continue;
       assert(a == 1);
   }

Suppose that variable a is in CPU1's cache and b is in CPU0's cache. CPU0 executes foo() and CPU1 executes bar(), with the program running in the following order:

1. CPU0 executes a = 1. Because a is not in CPU0's cache, CPU0 puts the new value of a into its store buffer and sends a "read invalidate" message.
2. CPU1 executes while (b == 0) continue. Because b is not in CPU1's cache, it sends a "read" message.
3. CPU0 executes b = 1. Because it already owns the cache line for b, it stores the new value directly into that cache line.
4. CPU0 receives the "read" message, passes the cache line containing the updated b to CPU1, and marks the line Shared.
5. CPU1 receives the cache line containing b and installs it into its own cache.
6. CPU1 can now finish while (b == 0) continue: since b = 1, the loop ends.
7. CPU1 executes assert(a == 1). Because a is in CPU1's cache with the old value 0, the assertion fails.
8. CPU1 receives the "read invalidate" message, passes the cache line containing a to CPU0, and marks its own copy Invalid. But it's too late.

In other words, b can become visible before a, so another CPU may observe b = 1 while a = 0. Hardware designers cannot fix this class of problem on their own, because the CPU has no way of knowing which variables are related. So they provide memory barrier instructions, which let software tell the CPU about such relationships. The fix is to modify the code as follows:

   void foo(void)
   {
       a = 1;
       smp_mb();
       b = 1;
   }

The smp_mb() instruction forces the CPU to flush its store buffer before performing subsequent stores. In the example above, the barrier guarantees that the entry for a in CPU0's store buffer has been flushed to the cache by the time b = 1 executes, which means a in CPU1's cache must already be marked Invalid. For the code running on CPU1, this guarantees that once b == 0 turns false, a is no longer in CPU1's cache, so the new value "1" must be fetched from CPU0's cache. See [1] for the detailed sequence.
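smp_mb() is a Linux-kernel primitive. For comparison (my mapping, not the article's), the closest portable user-space equivalent is a sequentially consistent fence from C11's <stdatomic.h>:

   #include <stdatomic.h>

   atomic_int a, b;

   void foo(void)
   {
       atomic_store_explicit(&a, 1, memory_order_relaxed);
       /* Full fence: roughly the ordering smp_mb() provides in the kernel. */
       atomic_thread_fence(memory_order_seq_cst);
       atomic_store_explicit(&b, 1, memory_order_relaxed);
   }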

The example above is one setting where a memory barrier is needed. The other setting involves the second buffer mentioned at the start, which is really a queue: the "invalidate queue".

Store buffers are generally small, so a CPU can fill one after just a few stores. The CPU must then wait for invalidate ack messages to free up buffer space: an entry whose invalidate ack has arrived is synchronized to the cache and removed from the store buffer. The same stall occurs right after executing a memory barrier, when every subsequent store has to wait for the outstanding invalidations to complete, whether or not those stores would miss the cache. The remedy is simple: add an "invalidate queue", enqueue each incoming invalidate message, and return the invalidate ack immediately. But this approach has problems of its own.
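Conceptually (a hypothetical model again, not real hardware or kernel code), the invalidate queue simply decouples acknowledging a message from acting on it:

   #include <stdint.h>

   #define IQ_SIZE 16
   static uint64_t invalidate_queue[IQ_SIZE];  /* addresses awaiting invalidation */
   static int iq_head, iq_tail;

   static void send_invalidate_ack(void) { /* stub */ }
   static void cache_invalidate(uint64_t addr) { (void)addr; /* stub */ }

   /* On receiving an invalidate message: queue it and ack at once,
    * WITHOUT touching the cache. The sender stops waiting, but the
    * stale line stays readable until the queue is drained. */
   void on_invalidate_msg(uint64_t addr)
   {
       invalidate_queue[iq_tail] = addr;
       iq_tail = (iq_tail + 1) % IQ_SIZE;
       send_invalidate_ack();
   }

   /* Drained later; a read memory barrier forces this to happen first. */
   void drain_invalidate_queue(void)
   {
       while (iq_head != iq_tail) {
           cache_invalidate(invalidate_queue[iq_head]);
           iq_head = (iq_head + 1) % IQ_SIZE;
       }
   }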

Consider the following scenario:

   void foo(void)
   {
       a = 1;
       smp_mb();
       b = 1;
   }

   void bar(void)
   {
       while (b == 0) continue;
       assert(a == 1);
   }

Suppose a is in the Shared state and b is in CPU0's cache. CPU0 executes foo() and CPU1 executes bar(), performing the following actions:

1. CPU0 executes a = 1. The cache line is in the Shared state, so the new value goes into the store buffer and an "invalidate" message is sent to notify CPU1.
2. CPU1 executes while (b == 0) continue. Because b is not in CPU1's cache, it sends a "read" message.
3. CPU1 receives CPU0's "invalidate" message, queues it, and immediately returns an ack.
4. CPU0 receives CPU1's ack, then executes smp_mb(), moving a from the store buffer to the cache line.
5. CPU0 executes b = 1. Because it already owns the cache line, the new value of b is written directly into it.
6. CPU0 receives the "read" message, passes the cache line containing the new value of b to CPU1, and marks it Shared.
7. CPU1 receives the cache line containing b.
8. CPU1 continues executing while (b == 0) continue; the condition is now false, so it proceeds to the next statement.
9. CPU1 executes assert(a == 1). Because the old value of a is still in CPU1's cache, the assertion fails.
10. Only after the assertion has failed does CPU1 process the queued "invalidate" message and actually invalidate the cache line containing a. But it's too late.

As you can see, the problem is that after a CPU queues an invalidate message, it may read the data that message refers to before the message has been processed, even though that data should already have been invalidated.

The fix is to add a memory barrier to bar() as well:

   void bar(void)
   {
       while (b == 0) continue;
       smp_mb();
       assert(a == 1);
   }

Here the role of smp_mb() is to process the messages pending in the invalidate queue. By the time assert(a == 1) executes, the cache line containing a in CPU1 is Invalid, and the new value is re-read from CPU0's cache.

Memory barriers can be further divided into "write memory barriers (wmb)" and "read memory barriers (rmb)". An rmb only processes the invalidate queue; a wmb only processes the store buffer.

You can use rmb and wmb to rewrite the example above:

   void foo(void)
   {
       a = 1;
       smp_wmb();
       b = 1;
   }

   void bar(void)
   {
       while (b == 0) continue;
       smp_rmb();
       assert(a == 1);
   }

Finally, a note on memory barriers on x86. x86 CPUs keep stores in order automatically, so the smp_wmb() primitive does nothing; loads, however, may be reordered, so smp_rmb() and smp_mb() expand to a "lock; addl" instruction.
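For illustration, a classic form of that full barrier on 32-bit x86 (older Linux kernels defined it roughly like this; the exact definition varies by kernel version and architecture) is a locked add of zero to the top of the stack, plus a compiler barrier:

   #define my_smp_mb() \
       __asm__ __volatile__("lock; addl $0,0(%%esp)" ::: "memory")

The addition changes no data, but the lock prefix orders memory operations globally, and the "memory" clobber stops the compiler from reordering accesses across the macro.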

[1] http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf
[2] http://en.wikipedia.org/wiki/Memory_barrier
[3] http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt
[4] http://sstompkins.wordpress.com/2011/04/12/why-memory-barrier%ef%bc%9f/

Echo Chen: blog.csdn.net/chen19870707
