CPU Out-of-Order Execution


Source: http://blog.163.com/zhaojie_ding/blog/static/1729728952007925111324379/?suggestedreading (out-of-order and concurrent execution of the processor)

Modern high-performance processors, in order to improve the utilization of their internal logic units and thereby their speed, commonly use techniques such as multi-issue pipelines and out-of-order execution. Today's superscalar processors can execute several instructions concurrently within a single instruction cycle: after prefetching a batch of instructions from the L1 I-cache, the processor analyzes which instructions have no dependencies on each other and dispatches them to several independent execution units to run concurrently. Take the following code (assuming the compiler does not optimize it):

z = x + y;
p = m + n;

The CPU may well dispatch these two unrelated lines of code to two arithmetic units and execute them simultaneously. An embedded processor such as Freescale's MPC8541 can, in a single instruction cycle, fetch 4 instructions, issue 2 instructions into the pipeline, and execute concurrently on 5 independent execution units.

Typically, a memory access instruction (executed by the LSU, the load/store unit) may take many cycles to complete (possibly dozens or even hundreds), whereas ordinary arithmetic instructions usually finish in a single instruction cycle. So while one memory access instruction is still waiting to complete, the other execution units may already have executed a number of logically unrelated arithmetic instructions that come after it in the code, producing an out-of-order sequence.

In addition, memory access instructions can be reordered among themselves. An advanced CPU may reorder access instructions according to the organization of its cache: accesses to contiguous addresses may be performed first, because their cache hit rate is higher. Some processors also allow non-blocking accesses: if an earlier access instruction misses the cache and triggers a long-latency storage access, later access instructions that hit the cache are allowed to complete first. Because reordered writes can have erroneous consequences, the processor usually has a dedicated mechanism (typically a write buffer) that, when an exception or error occurs, allows the results of write instructions past the exception point to be discarded.

The processor's branch prediction capability can also lead to concurrent execution. The branch prediction unit may directly prefetch the instructions of both branches and execute them concurrently, wait until the branch condition is resolved, and then discard the results of the wrong branch. In many cases this achieves a 0-cycle jump. Consider this code (again assuming the compiler does not optimize):


if (z < 0)
    p = m + n;
else
    p = m - n;

It looks as though nothing can proceed until z has been computed. In fact, the CPU may compute all three additions (x + y, m + n, and m - n) at the same time, and then simply pick the correct value of p once the result of z = x + y is known.

Therefore, even if the assembly code is in the correct order, the actual execution order is unpredictable. The processor guarantees that concurrent and out-of-order execution does not produce incorrect results for ordinary code; but if certain operations, such as accesses to hardware registers, must not be reordered, the programmer has to tell the CPU. This is done through a set of synchronization instructions provided by the CPU, whose usage is normally described in the CPU's documentation. The memory barriers (rmb()/wmb()/mb()) in the system function library are in fact implemented with these synchronization instructions. So in C code, placing a memory barrier is enough to tell the CPU which accesses must not be reordered.

Compiler reordering optimization

Constrained by the capacity of its prefetch unit, the processor can only analyze a small window of instructions at a time for concurrency; if independent instructions are far apart, it can do nothing. The compiler, however, can analyze a much larger range of code, identify instructions that can run concurrently across that range, and arrange them close together so that the processor can more easily prefetch and execute them concurrently, taking full advantage of the processor's out-of-order capabilities. Modern high-performance compilers therefore reorder instructions when optimizing the target code. They may further reorder memory access instructions, eliminating logically unnecessary accesses and maximizing the cache hit rate and the throughput of the CPU's LSU (load/store unit). So after compiler optimizations are turned on, it is normal to see generated assembly that does not strictly follow the logical order of the source code. As with the processor, there are ways to tell the compiler not to apply certain reordering optimizations. The usual one is the volatile keyword, which suppresses (note: not prohibits) the compiler's access optimizations on the related variables. As an example:





*p = 1;
*p = 2;
*q = *p;

Here the compiler will typically optimize away the first write to *p (it is logically redundant) and write only 2 to *p. For the assignment to *q, the compiler reasons that *q should receive the last value of *p, so it skips reloading *p and writes the value of *p still held in a register directly to *q (PowerPC assembly):


li  r5, 2
stw r5, 0(r3)   // write 2 to *p
stw r5, 0(r4)   // write r5 to *q

However, if the pointer p is declared with the volatile keyword, the situation is different:






*p = 1;   /* p is now a pointer to volatile */
*p = 2;
*q = *p;

In this case, when the compiler sees that *p is volatile, it will:

    1. Not reorder the operations on *p (usually; see the explanation below);

    2. Reload *p from memory on every read, even if the value of *p was placed in a register only shortly before;

    3. Not merge the write operations on *p (again, only usually; see the explanation below).

So the generated assembly becomes (PowerPC assembly):


li  r5, 1
stw r5, 0(r3)   // write 1 to *p
li  r5, 2
stw r5, 0(r3)   // write 2 to *p
lwz r5, 0(r3)   // reload *p from memory
stw r5, 0(r4)   // write r5 to *q

In this way the compiler guarantees ordering at the assembly-code level and does not optimize away the memory accesses. The volatile keyword is usually enough to solve the reordering problem on the compiler side, but the resulting instructions can still be reordered by the processor when they are executed. Avoiding processor reordering requires the set of memory barrier functions.

Important

The vast majority of compilers do not optimize accesses to volatile objects and generally preserve the order of reads and writes to the same volatile object (but there is no guarantee of ordering between accesses to different volatile objects).

However, this is not absolute. The ANSI C99 standard does not require the compiler to give any absolute guarantee against reordering or merging (combining) accesses to volatile objects! It merely encourages the compiler not to optimize such accesses; the only mandatory requirement is that accesses to volatile objects must not be optimized across a sequence point (a sequence point is one of certain key points in the program, such as an external function call or a conditional or loop jump; they are defined in detail in the C99 standard).

That is, a compiler that optimizes a volatile variable between two sequence points exactly as it would a normal variable is still fully compliant with the C99 standard! For example:

volatile int a;
if (...) { ... }   /* sequence point */
a = 1;
a = 2;
a = 3;
printk("...");     /* sequence point */

Between those two sequence points, a compiler that merges the assignments to a (writing only 3) or reorders them (swapping the writes of 1 and 2) is fully compliant with the C99 standard. So when we use volatile, we must not expect it to always produce ordered, complete assembly code; that is, do not rely on volatile to guarantee ordering. In essence, the most important effect of volatile is to ensure that every use of the value is read from memory; it does not guarantee that the compiler performs no other optimization (after all, volatile literally means "changeable", not "ordered"). The compiler only guarantees that volatile objects are updated promptly; it does not guarantee that the accesses are ordered or unmerged.

From another point of view, even if the assembly code generated by the compiler is ordered, the processor does not necessarily execute it in that order. So even with an order-preserving compiler, the programmer still has to add memory barriers to the code to guarantee the ordering of accesses. At that point the compiler question can simply be forgotten: a memory barrier is itself a sequence point, so once it is in place the compiler is guaranteed to keep the accesses ordered as well.

Therefore, for code that genuinely requires ordered accesses, even if the compiler currently in use happens to emit ordered target code, we must still place memory barriers to guarantee the ordering; anything else is not rigorous and leaves a hidden danger.

The barrier functions

A barrier function sets a barrier in the code that blocks the compiler's optimizations or the processor's reordering.

For the compiler, setting any barrier guarantees that:

    1. The compiler's reordering optimizations do not cross the barrier, i.e. the code before and after the barrier is not reordered across it;

    2. All operations on variables or addresses after the barrier re-read their values from memory (equivalent to discarding the copies of those variables held in registers).

For the processor, the effect depends on the kind of barrier (the following are just the 3 simplest kinds):

    1. Read barrier rmb()
      The processor's load instructions before and after the read barrier are guaranteed to be ordered with respect to each other, but the ordering of other arithmetic or store instructions is not necessarily guaranteed. Nor is the completion time of the loads guaranteed: the barrier does not ensure that loads before it complete by any particular moment, only that loads before the barrier complete before loads after it.

    2. Write barrier wmb()
      The processor's store instructions before and after the write barrier are guaranteed to be ordered with respect to each other, but the ordering of other arithmetic or load instructions is not necessarily guaranteed. Nor is the completion time of the stores guaranteed: the barrier only ensures that stores before it complete before stores after it.

    3. General memory barrier mb()
      The processor guarantees that memory accesses (both loads and stores) after the barrier begin only once all memory accesses before the barrier have completed. That is, ordering is guaranteed between loads and stores (though again, instruction completion times are not). This barrier has a larger negative impact on the throughput of the processor's execution units than a plain read or write barrier. On PowerPC, for example, this general barrier is usually implemented with the sync instruction, which causes the processor to discard all prefetched instructions and flush the pipeline. Frequent use of general memory barriers therefore reduces the efficiency of the processor's execution units.

For driver developers, certain operations on device registers must be performed in order; in most cases these are write operations. For ordered writes, a write barrier wmb() must be set:

Example: using a write barrier in a driver



IM_INTCTL->IC_SIMRH = 0x00000000;
IM_INTCTL->IC_SIMRL = 0x00000000;
wmb();   /* ensure the mask registers are written before the ack registers */
IM_INTCTL->IC_SIPNRH = 0xFFFFFFFF;
IM_INTCTL->IC_SIPNRL = 0xFFFFFFFF;

This is an example of operating an interrupt controller. The two writes that set the mask registers need not be ordered with respect to each other, so no barrier is needed between them. However, the ack registers must be set only after the mask registers have been written, so a write barrier wmb() is placed in between to ensure the two groups of registers are written in order.

Similarly, for a series of read operations that must be ordered, rmb() can be used in the same way.


Attention

Any rmb() or wmb() can be replaced by mb(). However, because of the efficiency cost of mb() mentioned above, it is recommended to use mb() only when read ordering and write ordering are needed at the same time; otherwise, choose the appropriate barrier for the situation. Of course, during device initialization even mb() has no real performance impact, since a device is typically initialized only once. But for very frequent device operations (such as sending and receiving on a network port), the effect of mb() on performance must be taken into account.

If a driver needs to guarantee ordering not only among read instructions or among write instructions, but also between reads and writes, an mb() barrier is required. An example is shown below:

Example: using the mb() barrier to ensure read/write ordering

Assume a device that requires three registers REG1~REG3 to be written in order (this writes a read command to the device), after which REG4 and REG5 are read in turn to obtain the information returned by the device.


REG1 = a;
wmb();   /* ensure REG1 is written before REG2 */
REG2 = b;
wmb();   /* ensure REG2 is written before REG3 */
REG3 = c;
mb();    /* ensure the configuration writes complete before the device is read */
*d = REG4;
rmb();   /* ensure REG4 is read before REG5 */
*e = REG5;
mb();    /* ensure the device access completes before returning */
return;
    • For the writes to REG1~REG3, write barriers are enough to guarantee their order;

    • Before reading REG4 and REG5, the preceding register writes must have taken effect for valid data to be read back, so a memory barrier mb() is needed to ensure the earlier writes complete first, guaranteeing ordering between the writes and the reads;

    • Between the two reads themselves, a read barrier is enough to guarantee order;

    • Finally, we generally want the device operation to be complete before the device-operation function returns, so that the next operation on the device can be sure the previous one has finished, avoiding reordering problems between successive calls to the function. So a memory barrier mb() is placed at the end to order this access against future accesses to the device.

Further reading

To learn more about memory barriers, especially memory barriers on multiprocessor systems, see:

Linux Kernel Memory Barriers, by David Howells

