Understanding Memory Barriers


The examples in this article were verified on Linux (g++) on the x86-64 processor architecture. All of the Linux kernel code quoted is likewise for x86-64.

This article first explains memory barriers by example (and with kernel code), then introduces a lock-free ring buffer implemented with memory barriers.

Memory Barrier Introduction

The order in which a program actually accesses memory at run time and the order in which the accesses are written in the program code are not necessarily consistent; this is out-of-order memory access. Out-of-order memory access happens in order to improve the performance of the program, and it arises mainly at two stages: at compile time, where compiler optimizations reorder instructions, and at run time, where interactions among multiple CPUs cause memory accesses to appear out of order.

A memory barrier makes the CPU or the compiler access memory in order: memory access operations before the barrier must complete before those after it. Memory barriers come in two types: compiler memory barriers and CPU memory barriers.

Most of the time, reordering by the compiler or the CPU is not a problem. But in some special cases, the correctness of the program logic depends on the memory access order; there, out-of-order access introduces logic errors, for example:

Thread 1
while (!ok);
do_something(x);

Thread 2
x = 42;
ok = 1;

In this code, ok is initialized to 0. Thread 1 waits for ok to be set to 1 and then calls do_something. If thread 2 writes to memory out of order, that is, if the assignment to ok takes effect before the assignment to x, then the actual argument received by do_something may well not be the 42 the programmer expects.

Compile-time Out-of-order Memory Access

At compile time, the compiler may change the order in which instructions are actually executed when it optimizes the code (for example, -O2 or -O3 in g++ can change the actual instruction order):

test.cpp
int x, y, r;
void f()
{
    x = r;
    y = 1;
}

Compiler optimization may cause y = 1 to complete before x = r. First, compile this source file without optimization:

g++ -S test.cpp

The relevant assembly code is as follows:

movl    r(%rip), %eax
movl    %eax, x(%rip)
movl    $1, y(%rip)

Here we see that x = r and y = 1 are not reordered. Now compile the code with optimization level O2 (or O3), g++ -O2 -S test.cpp, and the generated assembly is as follows:

movl    r(%rip), %eax
movl    $1, y(%rip)
movl    %eax, x(%rip)

We can clearly see that after compiler optimization, movl $1, y(%rip) is performed before movl %eax, x(%rip). The way to avoid compile-time reordering is to use a compiler barrier (also known as an optimization barrier). The Linux kernel provides the macro barrier() to make the compiler ensure that memory accesses before it complete before those after it. The kernel implements barrier() as follows (x86-64 architecture):

#define barrier() __asm__ __volatile__("" ::: "memory")

Now add this compiler barrier to the code:

int x, y, r;
void f()
{
    x = r;
    __asm__ __volatile__("" ::: "memory");
    y = 1;
}

This avoids the out-of-order memory access caused by compiler optimization (if you are interested, look at the generated assembly again). In this case we can also use the keyword volatile to avoid compile-time reordering (it does not help against the run-time reordering discussed later). The volatile keyword prevents reordering among accesses to volatile variables, so modifying the definitions of x and y also solves the problem:

volatile int x, y;
int r;
void f()
{
    x = r;
    y = 1;
}

With the volatile keyword added, the accesses to x and y are now ordered relative to each other. In the Linux kernel, the macro ACCESS_ONCE() is provided to prevent the compiler from reordering accesses made through ACCESS_ONCE() instances. Its implementation is as follows:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

This code simply treats the access to the variable x as a volatile access. Now we have a third fix:

int x, y, r;
void f()
{
    ACCESS_ONCE(x) = r;
    ACCESS_ONCE(y) = 1;
}
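
As an aside, for int variables the macro above simply expands to a cast to a volatile lvalue; writing the expansion out by hand (a hypothetical equivalent, shown only for illustration):

void f()
{
    (*(volatile int *)&x) = r;  /* what ACCESS_ONCE(x) = r expands to */
    (*(volatile int *)&y) = 1;  /* what ACCESS_ONCE(y) = 1 expands to */
}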

This basically covers compile-time out-of-order memory access. Next we describe run-time out-of-order memory access.

Run-time Out-of-order Memory Access

At run time, the CPU may execute instructions out of order, yet on a single CPU the hardware ensures that all memory access operations appear to be performed in the order in which the program code is written, so no memory barrier is needed there (disregarding compiler optimization). Let us look at the CPU's out-of-order execution behavior. Under out-of-order execution, the order in which a processor actually executes instructions is determined by the availability of input data, not by the order in which the programmer wrote them.
Early processors were in-order processors. An in-order processor typically handles an instruction in these steps:

1. Fetch the instruction.
2. If the instruction's input operands are available (for example, already in registers), dispatch it to the appropriate functional unit. If one or more operands are not available (usually because they must be fetched from memory), the processor waits until they are.
3. The instruction is executed by the appropriate functional unit.
4. The functional unit writes the result back to the register file (the register file is the set of registers in a CPU).

In contrast, an out-of-order processor handles an instruction in these steps:

1. Fetch the instruction.
2. Dispatch the instruction to an instruction queue.
3. The instruction waits in the queue until its input operands are available (once they are, it may leave the queue even if earlier instructions have not yet executed).
4. The instruction is issued to the appropriate functional unit and executed.
5. The execution result is placed in a queue (rather than written immediately to the register file).
6. Only after the results of all earlier instructions have been written to the register file is this instruction's result written back (results are reordered so that execution appears to be in order).

As these steps show, out-of-order execution is more efficient than in-order execution because it avoids waiting for unavailable operands (step 2 of in-order execution). On modern machines the processor runs much faster than memory, so an out-of-order processor can process a large number of instructions during the time an in-order processor would spend waiting for data.
Considering how an out-of-order processor handles instructions, we can draw a few conclusions: for a single CPU, instruction fetch is in order (implemented via the queue), and for a single CPU, instruction results are written back to the register file in order (also via the queue).

It follows that on a single CPU, disregarding reordering caused by compiler optimization, multithreaded execution does not suffer from out-of-order memory access problems. We can draw a similar conclusion from the kernel source (abridged):

#ifdef CONFIG_SMP
#define smp_mb() mb()
#else
#define smp_mb() barrier()
#endif

As you can see, on SMP, smp_mb() is defined as mb(), a CPU memory barrier (discussed later); on non-SMP (a single processor), the compiler barrier is used instead.

On a multi-CPU machine the problem is different. Each CPU has its own cache (caches exist mainly to compensate for the speed gap between the CPU and memory). When a particular piece of data is first accessed by a particular CPU, the data is not yet in that CPU's cache (this is a cache miss), which means the CPU must fetch it from memory (a process that makes the CPU wait for hundreds of cycles); the data is then loaded into the CPU's cache so that later accesses hit the cache. When a CPU performs a write, it must ensure that the other CPUs have removed this data from their caches (to guarantee consistency); only after that removal completes may this CPU safely modify the data. Clearly, with multiple caches we need a cache coherence protocol to avoid data inconsistency, and the communication this requires can give rise to out-of-order accesses, that is, run-time out-of-order memory access. The details are not discussed further here; it is a rather involved topic, and interested readers can study http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf, which analyzes the whole process in detail.

Now an example is given to illustrate out-of-order memory access on multiple CPUs:

test2.cpp
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <assert.h>
#include <stddef.h>

// -------------------
int cpu_thread1 = 0;
int cpu_thread2 = 1;

volatile int x, y, r1, r2;

void start() { x = y = r1 = r2 = 0; }
void end()   { assert(!(r1 == 0 && r2 == 0)); }

void run1()  { x = 1; r1 = y; }
void run2()  { y = 1; r2 = x; }
// -------------------

static pthread_barrier_t barrier_start;
static pthread_barrier_t barrier_end;

static void* thread1(void*)
{
    while (1) {
        pthread_barrier_wait(&barrier_start);
        run1();
        pthread_barrier_wait(&barrier_end);
    }
    return NULL;
}

static void* thread2(void*)
{
    while (1) {
        pthread_barrier_wait(&barrier_start);
        run2();
        pthread_barrier_wait(&barrier_end);
    }
    return NULL;
}

int main()
{
    assert(pthread_barrier_init(&barrier_start, NULL, 3) == 0);
    assert(pthread_barrier_init(&barrier_end, NULL, 3) == 0);

    pthread_t t1;
    pthread_t t2;
    assert(pthread_create(&t1, NULL, thread1, NULL) == 0);
    assert(pthread_create(&t2, NULL, thread2, NULL) == 0);

    cpu_set_t cs;
    CPU_ZERO(&cs);
    CPU_SET(cpu_thread1, &cs);
    assert(pthread_setaffinity_np(t1, sizeof(cs), &cs) == 0);
    CPU_ZERO(&cs);
    CPU_SET(cpu_thread2, &cs);
    assert(pthread_setaffinity_np(t2, sizeof(cs), &cs) == 0);

    while (1) {
        start();
        pthread_barrier_wait(&barrier_start);
        pthread_barrier_wait(&barrier_end);
        end();
    }
    return 0;
}

Here, two threads are created to run the test code (the code under test is placed in the run functions). I used a pthread barrier (not to be confused with the memory barriers discussed in this article) mainly to make the two child threads run their run functions simultaneously. The code repeatedly tries to run the two threads' run functions at the same time in order to provoke the result we expect. Before each round, the start function is called (to initialize the data), and after each round the end function is called (to check the result). Which CPUs run1 and run2 run on is controlled by the two variables cpu_thread1 and cpu_thread2.
First compile the program: g++ -lpthread -o test2 test2.cpp (no optimization here, to avoid interference from compiler reordering). Note that the two threads run on two different CPUs (CPU 0 and CPU 1). As long as memory accesses are not reordered, r1 and r2 cannot both be 0, so an assertion failure indicates out-of-order memory access. Running the program, you will find that the assertion fails with a certain probability. To illustrate the point further, change the value of cpu_thread2 to 0, in other words, let the two threads run on the same CPU; running the program again, the assertion no longer fails.

Finally, we use a CPU memory barrier to solve the out-of-order access problem (x86-64 architecture):

int cpu_thread1 = 0;
int cpu_thread2 = 1;

void run1()
{
    x = 1;
    __asm__ __volatile__("mfence" ::: "memory");
    r1 = y;
}

void run2()
{
    y = 1;
    __asm__ __volatile__("mfence" ::: "memory");
    r2 = x;
}
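
As an aside, on a C++11 toolchain the same guarantee can be expressed portably with std::atomic instead of inline assembly. This is a minimal sketch of that alternative, not part of the original program; sequentially consistent atomics rule out the r1 == r2 == 0 outcome:

#include <atomic>

std::atomic<int> x(0), y(0);
int r1, r2;

void run1()
{
    x.store(1, std::memory_order_seq_cst);  // compiler emits the needed fence on x86-64
    r1 = y.load(std::memory_order_seq_cst);
}

void run2()
{
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}
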
Preparing to Use Memory Barriers

Common occasions for memory barriers include: implementing synchronization primitives; implementing lock-free data structures; writing device drivers.

In actual application development, a developer can write correct multithreaded programs without knowing about memory barriers, mainly because the various synchronization mechanisms already imply them (though with subtle differences from raw memory barriers), which makes it unnecessary to use memory barriers directly. But if you want to write things such as lock-free data structures, memory barriers are essential.

Typically, on a single CPU, data dependency implies ordering of memory accesses:

q = p;
d = *q;

Here the memory read operations are ordered by the data dependency. On the Alpha CPU, however, even data-dependent reads are not necessarily ordered and require a data dependency barrier (not explained in detail here, since Alpha is uncommon).
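
For reference, the Linux kernel's data dependency barrier is smp_read_barrier_depends(); a minimal sketch of its use with the same p, q, d variables, modeled on the kernel's memory-barriers.txt (it expands to a real barrier only on Alpha and is a no-op on x86-64):

q = p;
smp_read_barrier_depends();  /* needed on Alpha; a no-op on x86-64 */
d = *q;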

In the Linux kernel, besides the compiler barriers mentioned earlier, barrier() and ACCESS_ONCE(), there are these CPU memory barriers: general barriers, which order both reads and writes, mb() and smp_mb(); write barriers, which order only writes, wmb() and smp_wmb(); and read barriers, which order only reads, rmb() and smp_rmb().
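
For reference, on x86-64 these have historically been defined in terms of the fence instructions, roughly as below; this is a sketch, and the exact definitions vary across kernel versions:

#define mb()  __asm__ __volatile__("mfence" ::: "memory")
#define rmb() __asm__ __volatile__("lfence" ::: "memory")
#define wmb() __asm__ __volatile__("sfence" ::: "memory")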

Note that all CPU memory barriers (except the data dependency barrier) imply a compiler barrier. A barrier whose name begins with smp_ expands, depending on the configuration, to a compiler barrier on a single processor and to the corresponding CPU memory barrier (mb(), wmb(), rmb()) on SMP; recall the kernel code quoted above.

The last thing to note is that some CPU memory barriers need to be used in pairs, or errors can occur. Specifically, a write barrier must be paired with a read (or data dependency) barrier on the other side (a general barrier also works), and vice versa, as sketched below.
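
Here is a minimal sketch of such pairing between a writer CPU and a reader CPU; the variable names data and flag are hypothetical:

/* CPU 0 (writer) */
data = 42;
smp_wmb();           /* pairs with the reader's smp_rmb() */
flag = 1;

/* CPU 1 (reader) */
while (!flag)
    ;                /* wait for the writer to publish */
smp_rmb();           /* pairs with the writer's smp_wmb() */
assert(data == 42);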

Example of Memory Barrier

Reading kernel code is a good way to learn more about using memory barriers. The Linux kernel's kfifo is a lock-free ring buffer (lock-free only with exactly one reader thread and one writer thread) that uses memory barriers; its implementation is as follows:

/*
 * A simple kernel FIFO implementation.
 *
 * Copyright (C) 2004 Stelian Pop <stelian@popies.net>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 */

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/kfifo.h>
#include <linux/log2.h>

/**
 * kfifo_init - allocates a new FIFO using a preallocated buffer
 * @buffer: the preallocated buffer to be used.
 * @size: the size of the internal buffer, this has to be a power of 2.
 * @gfp_mask: get_free_pages mask, passed to kmalloc()
 * @lock: the lock to be used to protect the fifo buffer
 *
 * Do NOT pass the kfifo to kfifo_free() after use! Simply free the
 * &struct kfifo with kfree().
 */
struct kfifo *kfifo_init(unsigned char *buffer, unsigned int size,
                         gfp_t gfp_mask, spinlock_t *lock)
{
    struct kfifo *fifo;

    /* size must be a power of 2 */
    BUG_ON(!is_power_of_2(size));

    fifo = kmalloc(sizeof(struct kfifo), gfp_mask);
    if (!fifo)
        return ERR_PTR(-ENOMEM);

    fifo->buffer = buffer;
    fifo->size = size;
    fifo->in = fifo->out = 0;
    fifo->lock = lock;

    return fifo;
}
EXPORT_SYMBOL(kfifo_init);

/**
 * kfifo_alloc - allocates a new FIFO and its internal buffer
 * @size: the size of the internal buffer to be allocated.
 * @gfp_mask: get_free_pages mask, passed to kmalloc()
 * @lock: the lock to be used to protect the fifo buffer
 *
 * The size will be rounded-up to a power of 2.
 */
struct kfifo *kfifo_alloc(unsigned int size, gfp_t gfp_mask, spinlock_t *lock)
{
    unsigned char *buffer;
    struct kfifo *ret;

    /*
     * round up to the next power of 2, since our 'let the indices
     * wrap' technique works only in this case.
     */
    if (!is_power_of_2(size)) {
        BUG_ON(size > 0x80000000);
        size = roundup_pow_of_two(size);
    }

    buffer = kmalloc(size, gfp_mask);
    if (!buffer)
        return ERR_PTR(-ENOMEM);

    ret = kfifo_init(buffer, size, gfp_mask, lock);

    if (IS_ERR(ret))
        kfree(buffer);

    return ret;
}
EXPORT_SYMBOL(kfifo_alloc);

/**
 * kfifo_free - frees the FIFO
 * @fifo: the fifo to be freed.
 */
void kfifo_free(struct kfifo *fifo)
{
    kfree(fifo->buffer);
    kfree(fifo);
}
EXPORT_SYMBOL(kfifo_free);

/**
 * __kfifo_put - puts some data into the FIFO, no locking version
 * @fifo: the fifo to be used.
 * @buffer: the data to be added.
 * @len: the length of the data to be added.
 *
 * This function copies at most @len bytes from the @buffer into
 * the FIFO depending on the free space, and returns the number of
 * bytes copied. Note that with only one concurrent reader and one
 * concurrent writer, you don't need extra locking to use these
 * functions.
 */
unsigned int __kfifo_put(struct kfifo *fifo,
                         const unsigned char *buffer, unsigned int len)
{
    unsigned int l;

    len = min(len, fifo->size - fifo->in + fifo->out);

    /*
     * Ensure that we sample the fifo->out index -before- we
     * start putting bytes into the kfifo.
     */
    smp_mb();

    /* first put the data starting from fifo->in to buffer end */
    l = min(len, fifo->size - (fifo->in & (fifo->size - 1)));
    memcpy(fifo->buffer + (fifo->in & (fifo->size - 1)), buffer, l);

    /* then put the rest (if any) at the beginning of the buffer */
    memcpy(fifo->buffer, buffer + l, len - l);

    /*
     * Ensure that we add the bytes to the kfifo -before-
     * we update the fifo->in index.
     */
    smp_wmb();

    fifo->in += len;

    return len;
}
EXPORT_SYMBOL(__kfifo_put);

/**
 * __kfifo_get - gets some data from the FIFO, no locking version
 * @fifo: the fifo to be used.
 * @buffer: where the data must be copied.
 * @len: the size of the destination buffer.
 *
 * This function copies at most @len bytes from the FIFO into the
 * @buffer and returns the number of copied bytes. Note that with
 * only one concurrent reader and one concurrent writer, you don't
 * need extra locking to use these functions.
 */
unsigned int __kfifo_get(struct kfifo *fifo,
                         unsigned char *buffer, unsigned int len)
{
    unsigned int l;

    len = min(len, fifo->in - fifo->out);

    /*
     * Ensure that we sample the fifo->in index -before- we
     * start removing bytes from the kfifo.
     */
    smp_rmb();

    /* first get the data from fifo->out until the end of the buffer */
    l = min(len, fifo->size - (fifo->out & (fifo->size - 1)));
    memcpy(buffer, fifo->buffer + (fifo->out & (fifo->size - 1)), l);

    /* then get the rest (if any) from the beginning of the buffer */
    memcpy(buffer + l, fifo->buffer, len - l);

    /*
     * Ensure that we remove the bytes from the kfifo -before-
     * we update the fifo->out index.
     */
    smp_mb();

    fifo->out += len;

    return len;
}
EXPORT_SYMBOL(__kfifo_get);

To better understand the source above, a few implementation techniques it uses, not directly related to the topic, are worth noting (a small sketch follows this paragraph). An AND operation is used to map positions into the ring buffer, which is much more efficient than a modulo operation; the premise is that the ring buffer size must be a power of 2, that is, a binary number containing a single 1 bit, in which case index & (size - 1) yields the buffer index (this is not hard to see). The two indices in and out always increase monotonically (a clever practice), which avoids some complex conditional judgments (in some implementations, in == out cannot distinguish whether the buffer is empty or full).
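
To make the index arithmetic concrete, here is a minimal standalone sketch (the values are hypothetical, and this is not kernel code):

#include <assert.h>

int main(void)
{
    unsigned int size = 8;          /* must be a power of 2 */
    unsigned int in = 10, out = 6;  /* monotonically increasing; both have passed the buffer end */

    assert(in - out == 4);            /* bytes in use: correct even across unsigned wraparound */
    assert((in  & (size - 1)) == 2);  /* physical write index: 10 % 8 == 2 */
    assert((out & (size - 1)) == 6);  /* physical read index:  6 % 8 == 6 */
    return 0;
}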

Here, the indices in and out are accessed by both threads. They indicate the boundaries of the actual data in the buffer, so accesses to the buffer's data must be ordered with respect to accesses to in and out; since no synchronization mechanism is used, memory barriers provide that ordering. Each of in and out is modified by only one thread but read by both. __kfifo_put first uses in and out to determine how much data can be written into the buffer; the out index must be read before any data from the user buffer is actually written into the ring buffer, hence the smp_mb(). Correspondingly, __kfifo_get also uses an smp_mb() to ensure the data in the buffer has been fully read and written to the user buffer before the out index is updated. As for the in index: in __kfifo_put, smp_wmb() ensures the data is written into the buffer before the in index is updated; since only writes need to be ordered there, a write barrier suffices. In __kfifo_get, smp_rmb() ensures the in index is read first (the in index determines how much readable data the buffer actually holds) before the buffer's data is read (and copied to the user buffer); since only reads need to be ordered there, a read barrier suffices.
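
Condensing the two functions into a sketch of just the barrier placement (not compilable on its own; distilled from the source above):

/* writer: __kfifo_put */
len = min(len, fifo->size - fifo->in + fifo->out);  /* samples fifo->out */
smp_mb();   /* sample out -before- putting bytes into the buffer */
/* ... two memcpy() calls copy the data in ... */
smp_wmb();  /* finish writing the bytes -before- publishing the new in */
fifo->in += len;

/* reader: __kfifo_get */
len = min(len, fifo->in - fifo->out);               /* samples fifo->in */
smp_rmb();  /* read in -before- getting bytes from the buffer */
/* ... two memcpy() calls copy the data out ... */
smp_mb();   /* finish reading the bytes -before- publishing the new out */
fifo->out += len;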

This concludes our introduction to memory barriers.


Original link: http://blog.csdn.net/world_hello_100/article/details/50131497
