http://www.linuxforum.net/forum/gshowflat.php?Cat=&Board=linuxk&Number=428239&page=5&view=collapsed&sb=5&o=all&fpart=
Note: the barriers discussed here are wmb(), rmb(), and mb().
I cannot find a clear statement of the relationship between barriers and cache coherence. The basic usage of barriers is described in LDD and ULK; LDD focuses on the role of barriers when talking to peripherals. Another example is using barriers to implement lock-free linked-list operations.
These barrier-based lock-free operations are confusing. Another example is the big-reader lock (brlock.c, brlock.h): I cannot find any stated reason or condition for why a barrier must be used there. There is also softirq.c, where void init_bh(int nr, void (*routine)(void)) uses barrier() as well.
My understanding so far: barriers force the CPU to enforce a strong ordering, while cache coherence is about keeping the copies of a memory location consistent across the caches of multiple CPUs. The questions are:
Does cache coherence require barriers to participate?
What mechanisms cause a CPU to reorder memory reads/writes? The write buffer is one of them, but is data sitting in the write buffer kept cache-coherent? (It should be. If so, that means barriers and cache coherence are essentially two different things: cache coherence is completely transparent to software, while barriers require software participation.)
I think barriers are needed in scenarios like these: two CPUs operate on the same data (writes need to be mutually exclusive), or they read and write two pieces of data that are related, for example when the operation on one datum is decided based on the value of the other (a minimal sketch of this pattern follows below).
This description is not satisfactory; it is pure reasoning (of course, it makes sense that per-CPU data never needs barriers, since reordering should not affect a single CPU (fix me)).
In addition, when barriers are discussed, some books say they "make changes visible to all CPUs immediately." That may be inaccurate. How should this be understood?
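As a concrete illustration of the "two related data" scenario above, here is a minimal sketch (my own, using made-up names data, ready, producer, and consumer, and assuming the kernel's wmb()/rmb() primitives) of the classic flag/data pairing:
int data;
int ready;                         /* shared; initially ready == 0 */

void producer(void)                /* runs on CPU 0 */
{
        data = 42;                 /* (1) fill in the payload           */
        wmb();                     /* order the store to data ...       */
        ready = 1;                 /* (2) ... before the store to ready */
}

int consumer(void)                 /* runs on CPU 1 */
{
        if (ready) {
                rmb();             /* order the load of ready ...       */
                return data;       /* ... before the load of data, so a */
        }                          /* reader that saw ready == 1 also   */
        return -1;                 /* sees data == 42                   */
}
Without the barrier pair, either CPU could reorder its two accesses and the reader could see ready == 1 but a stale data.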
Why is there mb()?
My understanding is similar to yours:
1. Cache coherence does not require barrier participation; the hardware coherence protocol guarantees it.
2. Mechanisms that cause a CPU to reorder memory accesses include load forwarding and load scheduling.
3. A memory-ordering problem arises only when there are at least two memory addresses and at least two agents (CPUs or peripherals) accessing memory.
4. The inappropriate phrase "make changes visible to all CPUs immediately" should instead be: make the two memory operations visible to other CPUs in a certain order (release, acquire, or fence).
Reordering problems also affect a single CPU.
The course "Advanced Computer Architecture" I took last semester had a classic example of this.
On a PowerPC processor, a load instruction that immediately follows a store may be executed out of order, ahead of the store. If a memory barrier is not used in the following program, an error occurs:
1 while (TDE == 0);
2 TDR = char1;
3 asm("eieio");   // memory barrier
4 while (TDE == 0);
5 TDR = char2;
Program description: assume TDE above is the status register of a peripheral. 0 means the peripheral is busy; 1 means the peripheral can accept and process one character, at which point the program may write the character to be processed into the peripheral's TDR register. After the write, TDE becomes 0; processing the character takes some time, after which TDE changes from 0 back to 1.
Suppose the barrier on line 3 is removed. When the load on line 1 sees TDE become 1, the program continues. But line 2 is a store (TDR = char1;) immediately followed by the load on line 4 (while (TDE == 0);); the CPU may execute the load first and the store afterwards, so the value loaded on line 4 is still 1 and line 5 executes immediately. As a result, the peripheral receives the second character while it is still busy, which will certainly cause an error.
Therefore, line 3 must be added to guarantee that the store is executed before the load.
This kind of reordering problem involving a peripheral (not an SMP problem between CPUs) is easy to understand. The key point is that the impact on program design in an SMP environment is not as intuitive.
If our understanding is correct, we should be able to use it to explain the init_bh and ksoftirqd cases, but that still does not seem intuitive.
I am also eager to understand this.
In addition, there is a passage in the interrupt chapter of LDD that I cannot understand:
This code, though simple, represents the typical job of an interrupt handler. It, in turn, calls short_incr_bp, which is defined as follows:
static inline void short_incr_bp(volatile unsigned long *index, int delta)
{
        unsigned long new = *index + delta;
        barrier();  /* Don't optimize these two together */
(Why is barrier() used here?)
        *index = (new >= (short_buffer + PAGE_SIZE)) ? short_buffer : new;
}
This function has been carefully written to wrap a pointer into the circular buffer
without ever exposing an incorrect value. By assigning only the final value and
placing a barrier to keep the compiler from optimizing things, it is possible to
manipulate the circular buffer pointers safely without locks.
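One common way to read that passage (my interpretation, not from LDD itself): by assigning only the final value, the shared index never holds a pointer past the end of the buffer, and barrier() keeps the compiler from rearranging the computation around that single assignment. A hypothetical unsafe variant, which updates *index in two steps, would look like this:
/* Hypothetical unsafe variant (not from LDD): the shared index is updated
 * first and fixed up afterwards, so a concurrent reader of *index may
 * briefly observe a pointer past the end of the circular buffer. */
static inline void short_incr_bp_unsafe(volatile unsigned long *index,
                                        int delta)
{
        *index += delta;                          /* readers may see this value ... */
        if (*index >= short_buffer + PAGE_SIZE)   /* ... before this fix-up runs    */
                *index = short_buffer;
}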
Experts, please advise.
All atomic operations in Linux include both a memory barrier and a compiler barrier (to prevent gcc from reordering instructions). On x86, operations such as test_and_set_bit are implemented with volatile, the lock prefix, and a "memory" clobber.
The reason is also obvious:
spin_lock(lock);
some read/write
...........
If there were no memory barrier there, some of those reads/writes could be executed before the spin_lock operation, which of course must not be allowed.
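For concreteness, here is a minimal sketch of that x86 idiom (not the exact kernel source; my_test_and_set_bit is a made-up name). The lock prefix makes the read-modify-write atomic and, on x86, fully ordered, while the "memory" clobber acts as a compiler barrier, just like barrier():
/* Sketch of an x86-64 test_and_set_bit-style primitive. */
static inline int my_test_and_set_bit(long nr, volatile unsigned long *addr)
{
        int oldbit;

        asm volatile("lock; btsq %2, %1\n\t"   /* atomically set bit nr          */
                     "sbbl %0, %0"             /* oldbit = CF ? -1 : 0           */
                     : "=r" (oldbit), "+m" (*addr)
                     : "r" (nr)
                     : "memory", "cc");        /* "memory": compiler barrier     */
        return oldbit != 0;
}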
The following article describes how to implement lock-free searching. It is very helpful for understanding barriers, and studying it is far more valuable than a translation.
Data Dependencies and wmb()
Version 1.0
Goal
Support lock-free algorithms without inefficient and ugly read-side code.
Obstacle
Some CPUs do not support synchronous invalidation in hardware.
Example Code
Insertion into an unordered lock-free circular singly linked list,
while allowing concurrent searches.
Data Structures
The data structures used in all these examples are
a list element, a header element, and a lock.
struct el {
        struct el *next;
        long key;
        long data;
};
struct el head;
spinlock_t mutex;
Search and Insert Using Global Locking
The familiar globally locked implementations of search() and insert() are as follows:
struct el *insert(long key, long data)
{
        struct el *p;
        p = kmalloc(sizeof(*p), GFP_ATOMIC);
        spin_lock(&mutex);
        p->next = head.next;
        p->key = key;
        p->data = data;
        head.next = p;
        spin_unlock(&mutex);
}
struct el *search(long key)
{
        struct el *p;
        p = head.next;
        while (p != &head) {
                if (p->key == key) {
                        return (p);
                }
                p = p->next;
        }
        return (NULL);
}
/* Example use. */
spin_lock(&mutex);
p = search(key);
if (p != NULL) {
        /* do stuff with p */
}
spin_unlock(&mutex);
These implementations are quite straightforward, but are subject to locking bottlenecks.
Search and Insert Using wmb() and rmb()
The existing wmb() and rmb() primitives can be used to do lock-free insertion. The
searching task will either see the new element or not, depending on the exact timing,
just like the locked case. In any case, we are guaranteed not to see an invalid pointer,
regardless of timing, again, just like the locked case. The problem is that wmb() is
guaranteed to enforce ordering only on the writing CPU --
the reading CPU must use rmb() to keep the ordering.
struct el *insert(long key, long data)
{
        struct el *p;
        p = kmalloc(sizeof(*p), GFP_ATOMIC);
        spin_lock(&mutex);
        p->next = head.next;
        p->key = key;
        p->data = data;
        wmb();
        head.next = p;
        spin_unlock(&mutex);
}
struct el *search(long key)
{
        struct el *p;
        p = head.next;
        while (p != &head) {
                rmb();
                if (p->key == key) {
                        return (p);
                }
                p = p->next;
        }
        return (NULL);
}
(Note: see read-copy update for information on how to delete elements from this list
while still permitting lock-free searches.)
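By analogy with the locked example earlier, a usage sketch for the reader side (not part of the original article; key is assumed to be defined as before) would be:
/* Example use (reader side, no lock taken). */
struct el *p;

p = search(key);        /* lock-free: relies on the wmb()/rmb() pairing */
if (p != NULL) {
        /* do stuff with p -- but see the RCU note above before deleting elements */
}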
The rmb()s in search() cause unnecessary performance degradation on CPUs (such as the
i386, IA64, PPC, and SPARC) where data dependencies result in an implied rmb(). In
addition, code that traverses a chain of pointers would have to be broken up in order to
insert the needed rmb()s. For example:
d = p->next->data;
would have to be rewritten as:
q = p->next;
rmb();
d = q->data;
One could imagine much uglier code expansion where there are more dereferences in a
single expression. The inefficiencies and code bloat could be avoided if there were a
primitive like wmb() that allowed read-side data dependencies to act as implicit rmb()
invocations.
Why do You Need the rmb()?
Many CPUs have single instructions that cause other CPUs to see preceding stores before
subsequent stores, without the reading CPUs needing an explicit rmb() if a data dependency
forces the ordering.
However, some CPUs have no such instruction, the Alpha being a case in point. On these
CPUs, a wmb() only guarantees that the invalidate requests corresponding to the writes
will be emitted in order. The wmb() does not guarantee that the reading CPU will process
these invalidates in order.
For example, consider a CPU with a partitioned cache, described as follows:
Here, even-numbered cachelines are maintained in cache bank 0, and odd-numbered cache
lines are maintained in cache bank 1. Suppose that head was maintained in cache bank 0,
and that a newly allocated element was maintained in cache bank 1. The insert() code's
wmb() would guarantee that the invalidates corresponding to the writes to the next, key,
and data fields would appear on the bus before the write to head->next.
But suppose that the reading CPU's cache bank 1 was extremely busy, with lots of pending
invalidates and outstanding accesses, and that the reading CPU's cache bank 0 was idle.
The invalidation corresponding to head->next could then be processed before that of the
three fields. If search() were to be executing just at that time, it would pick up the
new value of head->next, but, since the invalidates corresponding to the three fields
had not yet been processed, it could pick up the old (garbage!) value corresponding to
these fields, possibly resulting in an oops or worse.
Placing an rmb() between the access to head->next and the three fields fixes this
problem. The rmb() forces all outstanding invalidates to be processed before any
subsequent reads are allowed to proceed. Since the invalidate corresponding to the three
fields arrived before that of head->next, this will guarantee that if the new value of
head->next was read, then the new value of the three fields will also be read.
No oopses (or worse).
However, all the rmb()s add complexity, are easy to leave out, and hurt performance of
all architectures. And if you forget a needed rmb(), you end up with very intermittent
and difficult-to-diagnose memory-corruption errors. Just what we don't need in Linux!
So, there is strong motivation for a way of eliminating the need for these rmb()s.
Solutions for lock-free search and insertion
Search and Insert Using wmbdd()
It would be much nicer (and faster, on many architectures) to have a primitive similar to
wmb(), but that allowed read-side data dependencies to substitute for an explicit rmb().
It is possible to do this (see patch). With such a primitive, the code looks as follows:
struct el *insert(long key, long data)
{
        struct el *p;
        p = kmalloc(sizeof(*p), GFP_ATOMIC);
        spin_lock(&mutex);
        p->next = head.next;
        p->key = key;
        p->data = data;
        wmbdd();
        head.next = p;
        spin_unlock(&mutex);
}
struct el *search(long key)
{
        struct el *p;
        p = head.next;
        while (p != &head) {
                if (p->key == key) {
                        return (p);
                }
                p = p->next;
        }
        return (NULL);
}
This code is much nicer: no rmb()s are required, searches proceed
fully in parallel with no locks or writes, and no intermittent data corruption.
Search and Insert Using read_barrier_depends()
Introduce a new primitive read_barrier_depends() that is defined to be an rmb() on
Alpha, and a nop on other architectures. This removes the read-side performance
problem for non-Alpha architectures, but still leaves the read-side
read_barrier_depends(). It is almost possible for the compiler to do a good job of
generating these (assuming that a "lockfree" gcc struct-field attribute is created
and used), but, unfortunately, the compiler cannot reliably tell when the relevant lock
is held. (If the lock is held, the read_barrier_depends() calls should not be generated.)
After discussions in lkml about this, it was decided that putting an explicit
read_barrier_depends() is the appropriate thing to do in the linux kernel. Linus also
suggested that the barrier names be made more explicit. With such a primitive,
the code looks as follows:
struct el *insert(long key, long data)
{
        struct el *p;
        p = kmalloc(sizeof(*p), GFP_ATOMIC);
        spin_lock(&mutex);
        p->next = head.next;
        p->key = key;
        p->data = data;
        write_barrier();
        head.next = p;
        spin_unlock(&mutex);
}
struct el *search(long key)
{
        struct el *p;
        p = head.next;
        while (p != &head) {
                read_barrier_depends();
                if (p->key == key) {
                        return (p);
                }
                p = p->next;
        }
        return (NULL);
}
A preliminary patch for this is barriers-2.5.7-1.patch. The future releases of this
patch can be found along with the RCU package here.
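The idea behind the primitive can be sketched roughly as follows (a simplification, not the actual kernel headers; the real definitions live in the per-architecture barrier headers):
#ifdef CONFIG_ALPHA
#define read_barrier_depends()  rmb()             /* Alpha: reader must process invalidates in order */
#else
#define read_barrier_depends()  do { } while (0)  /* elsewhere: the data dependency itself orders the reads */
#endif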
Other Approaches Considered
Just make wmb() work like wmbdd(), so that data dependencies act as implied rmb()s.
Although wmbdd()'s semantics are much more intuitive, there are a number of uses of
wmb() in Linux that do not require the stronger semantics of wmbdd(), and strengthening
the semantics would incur unnecessary overhead on many CPUs--or require many changes to
the code, and thus a much larger patch.
Just drop support for Alpha. After all, Compaq seems to be phasing it out, right? There
are nonetheless a number of Alphas out there running Linux, and Compaq (or perhaps HP)
will be manufacturing new Alphas for quite a few years to come. Microsoft would likely
focus quite a bit of negative publicity on Linux's dropping support for anything (never
mind that they currently support only two CPU architectures). And the code to make Alpha
work correctly is not all that complex, and it does not impact performance of other CPUs.
Besides, we cannot be 100% certain that there won't be some other CPU lacking a
synchronous invalidation instruction...