Memory Barriers/Fences


Translated from: Martin Thompson, "Memory Barriers/Fences"

In this article, I will discuss one of the most fundamental techniques in concurrent programming, known as memory barriers or fences, which make the memory state within a processor visible to other processors.

CPUs employ many techniques to cope with the fact that the performance of their execution units far exceeds that of main memory. In my "Write Combining" article I touched on just one of these techniques. The most common technique a CPU uses to hide memory latency is to pipeline instructions, and then to spend significant effort and resources trying to keep those pipelines full by scheduling instructions out of order, minimising the latency associated with cache misses.

When a program executes, it does not matter in what order its instructions are carried out, provided the same end result is achieved. For example, within a loop it does not matter when the loop counter is updated if no operation inside the loop uses it. The compiler and CPU are free to reorder instructions to best utilise the CPU, provided the counter is updated by the time the next iteration is about to commence. Also, for the duration of the loop, this variable may be held in a register and never pushed out to cache or main memory, so it is never visible to another CPU.
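The visibility point above can be sketched in Java. The class and method names below are hypothetical, chosen just for illustration: a worker thread spins on a flag, and declaring the flag `volatile` is what forbids the JIT from caching it in a register, so the worker actually observes the writer's update.

```java
// Sketch (hypothetical names): a worker spins on a flag until another
// thread clears it. Because the flag is volatile, every iteration does a
// real load; were it a plain field, the JIT could hoist it into a register
// and the loop might never terminate.
public class FlagVisibility {
    private static volatile boolean running = true;

    // Returns true if the worker observed the flag change and exited.
    public static boolean demo() {
        Thread worker = new Thread(() -> {
            while (running) {
                // volatile read each iteration; the value is re-fetched,
                // never cached in a register across iterations
            }
        });
        worker.start();
        try {
            Thread.sleep(10);      // let the worker spin for a moment
            running = false;       // volatile write: becomes visible to the worker
            worker.join(2000);
        } catch (InterruptedException e) {
            return false;
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + demo());
    }
}
```

If `running` were a plain `boolean`, this program would be allowed to hang forever; with `volatile` it terminates promptly.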

A CPU core contains multiple execution units. For example, a modern Intel core contains six execution units which can perform a combination of arithmetic, conditional logic, and memory operations. These execution units operate in parallel, allowing instructions to be executed in parallel. This introduces another layer of non-determinism in program order if observed from another CPU.

Finally, when a cache miss occurs, a modern CPU can speculate on the likely result of a memory load, and continue executing based on that assumption until the load actually completes.

Provided "program order" is preserved, the CPU and compiler are free to do whatever they see fit to improve performance.


Loads and stores to the caches and main memory are buffered and reordered using load, store, and write-combining buffers. These buffers are associative queues that allow fast lookup. The lookup is needed when a later load must read the value of an earlier store that has not yet reached the cache. The figure in the original article depicts a simplified view of a modern multi-core CPU, showing how the execution units use local registers and buffers to manage memory while it is in flight back and forth with the cache subsystem.

In a multi-threaded environment, techniques are required to make program results visible in a timely manner. I will not cover cache coherency in this article; just assume that once memory has been pushed to the cache, a protocol of messages will occur to ensure all caches of the shared data are coherent. The technique that makes memory visible across processor cores is known as a memory barrier or fence.

Memory barriers provide two properties. First, they preserve externally visible program order by ensuring that all instructions on either side of the barrier appear in the correct program order when observed from another CPU. Second, they make memory visible by ensuring the data is propagated to the cache subsystem.
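Since Java 9, these two properties can be requested explicitly through the static fence methods on `java.lang.invoke.VarHandle`. The sketch below is single-threaded, so it only demonstrates the API shape, not a race; the class name is hypothetical.

```java
import java.lang.invoke.VarHandle;

// Sketch: explicit fences (Java 9+). releaseFence() keeps prior stores from
// sinking below it (ordering for publication); acquireFence() keeps later
// loads from rising above it; fullFence() is both, an mfence on x86.
public class FencePublish {
    static int payload;        // deliberately plain (non-volatile) fields
    static boolean published;

    static int demo() {
        payload = 42;
        VarHandle.releaseFence();   // payload store ordered before the flag store
        published = true;

        boolean seen = published;
        VarHandle.acquireFence();   // payload load ordered after the flag load
        return seen ? payload : -1;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

On x86 the acquire and release fences compile to no-ops (the hardware model is already strong enough); their real effect is to constrain compiler reordering.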

Memory barriers are a complex subject, and they are implemented very differently across CPU architectures. Intel CPUs have a relatively strong memory model; this article will be based on x86 CPUs.

Store Barrier

A store barrier, the "sfence" instruction on x86, forces all store instructions prior to the barrier to happen before the barrier, and has the store buffers flushed to the cache of the CPU issuing it. This makes the program state visible to other CPUs so they can act on it if necessary. A good example is the following simplified code from the BatchEventProcessor of the Disruptor. When the sequence is updated, other consumers and producers know how far this consumer has progressed and can react appropriately. All memory updates that happened before the barrier are now visible.

    private volatile long sequence = RingBuffer.INITIAL_CURSOR_VALUE;

    // from inside the run() method
    T event = null;
    long nextSequence = sequence + 1L;
    while (running)
    {
        try
        {
            // barrier.waitFor() reads the other sequences, so a load barrier is issued
            final long availableSequence = barrier.waitFor(nextSequence);
            while (nextSequence <= availableSequence)
            {
                event = ringBuffer.get(nextSequence);
                boolean endOfBatch = nextSequence == availableSequence;
                eventHandler.onEvent(event, nextSequence, endOfBatch);
                nextSequence++;
            }
            sequence = nextSequence - 1L; // store barrier inserted here !!!
        }
        catch (final Exception ex)
        {
            exceptionHandler.handle(ex, nextSequence, event);
            sequence = nextSequence; // store barrier inserted here !!!
            nextSequence++;
        }
    }
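The Disruptor snippet above depends on classes not shown here, so it does not compile on its own. The following is a minimal, runnable sketch of the same publish pattern under hypothetical names: a producer fills a slot with plain writes and then publishes it by writing a volatile cursor (the store barrier); the consumer reads the volatile cursor (a load barrier) before reading the slot.

```java
// Sketch (hypothetical names): publishing slots via a volatile cursor,
// the same pattern the BatchEventProcessor uses with its sequence.
public class SequencePublish {
    static final long[] slots = new long[8];
    static volatile long cursor = -1;   // volatile: writes act as publication

    public static long demo() {
        Thread producer = new Thread(() -> {
            for (int i = 0; i < slots.length; i++) {
                slots[i] = i * 10L;     // plain store into the slot
                cursor = i;             // volatile store publishes the slot
            }
        });
        producer.start();

        long sum = 0;
        for (long next = 0; next < slots.length; next++) {
            while (cursor < next) {     // volatile load: wait for publication
                Thread.onSpinWait();
            }
            sum += slots[(int) next];   // safe: ordered by the volatile cursor
        }
        try { producer.join(); } catch (InterruptedException e) { }
        return sum;                     // 0 + 10 + ... + 70
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The volatile write/read pair on `cursor` establishes a happens-before edge, so the consumer is guaranteed to see each slot's contents once it observes the corresponding cursor value.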
Load Barrier

A load barrier, the "lfence" instruction on x86, forces all load instructions after the barrier to happen after the barrier, and then waits on the load buffer of the issuing CPU to drain. This makes program state exposed by other CPUs visible to this CPU before it makes further progress. A good example is when the sequence of the BatchEventProcessor above is read by a producer or another consumer: the Disruptor issues an equivalent instruction.

Full Barrier

A full barrier, the "mfence" instruction on x86, is a composite of the load and store barriers on a CPU.

Java Memory Model

In the Java Memory Model, a volatile field has a store barrier inserted after a write to it, and a load barrier inserted before a read of it. Qualified final fields of a class have a store barrier inserted after their initialisation, to ensure these fields are visible once the constructor completes and a reference to the object becomes available.

Atomic Instructions and Software Locks

Atomic instructions, such as the "lock ..." instructions on x86, are effectively full barriers. They lock the memory subsystem to perform an operation, and have a guaranteed total order, even across CPUs. Software locks usually employ memory barriers or atomic instructions to achieve visibility and preserve program order.
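In Java, the `java.util.concurrent.atomic` classes expose these atomic instructions. The sketch below (hypothetical class name) uses `AtomicLong.incrementAndGet`, which on x86 is typically compiled down to a lock-prefixed instruction, so concurrent increments are never lost.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: four threads each increment a shared counter 10,000 times.
// incrementAndGet maps to an atomic, lock-prefixed instruction on x86,
// acting as a full barrier with a total order across CPUs.
public class AtomicCounter {
    public static long demo() {
        final AtomicLong counter = new AtomicLong();
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 10000; i++) {
                    counter.incrementAndGet();   // atomic read-modify-write
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) {
            try { t.join(); } catch (InterruptedException e) { }
        }
        return counter.get();   // always 40000; a plain long++ could lose updates
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Replacing the `AtomicLong` with a plain `long` and `counter++` would make the result non-deterministic, since `long++` is a racy read-modify-write with no barrier.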

Performance Impact of Memory Barriers

Memory barriers prevent the CPU from employing many of the techniques it uses to hide memory latency, so they carry a significant performance cost that must be taken into account. To achieve maximum performance, it is best to model the problem so the processor can work in units, and have the necessary memory barriers occur only on the boundaries of these work units. This approach allows the processor to optimise each unit of work without restriction. It also helps to group the necessary barriers: the buffer flushes after the first one will cost less, because no work needs to be done to refill them.
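One concrete way to amortise the barrier cost in Java is `AtomicLong.lazySet`, which performs an ordered store without the full store-barrier cost of a volatile write, issued once per batch rather than once per element. A minimal sketch under hypothetical names:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: publish progress once per batch of work instead of per element.
// lazySet is an ordered store that avoids the full store barrier of a
// volatile write, so the store buffer is not flushed on every update.
public class BatchedPublish {
    static final AtomicLong published = new AtomicLong(-1);

    public static long demo() {
        long value = -1;
        for (int batch = 0; batch < 10; batch++) {
            for (int i = 0; i < 100; i++) {
                value++;                 // plain work inside the batch
            }
            published.lazySet(value);    // one ordered store per batch
        }
        return published.get();          // last published value
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

This mirrors the work-unit advice above: all the cheap plain updates happen inside the unit, and the barrier-bearing publication happens only at its boundary.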

For more on the Disruptor, see my earlier article: http://coderbee.net/index.php/open-source/20130812/400.

Original article address: http://coderbee.net/index.php/concurrent/20131211/624

