One series II


According to the plan in the previous article, this article should analyze, from a practical standpoint, how to watch for the impact of out-of-order execution in lock-free code. But a number of things have come up in the meantime, including the bug in the .NET memory model implementation, which has drawn more attention to this area. So in this article I will write a comprehensive postscript for the previous few articles, explaining and correcting the misunderstandings they contained, or the points in them that are easily misunderstood.

 

First, attach the links to the previous articles:

 

One series: One

"Loads are not reordered with other loads" is a fact!!

"Loads are not reordered with other loads" is a fact!! Continued: Do not count on volatile

"Loads are not reordered with other loads" is a fact!! Continued: .NET MM is broken

Using a virtual machine environment to experiment with multi-threaded programming is a good choice.

 

We introduced the memory model from the very beginning. In fact, the memory model matters most to people who write code (specifically, it is critically important to programmers writing lock-free code!); for comrades who use high-level synchronization constructs its effects are not so visible. We have always stressed one point about the memory model: your code does not necessarily execute the way you think it does. In fact, that is not the whole story. I neglected to mention that the memory model is determined level by level.

 

I. Memory Consistency Model

 

To correctly write parallel programs on a shared-memory platform, you need to know how read/write operations on memory are executed across multiple processors. This description of the read/write behavior of memory in a multiprocessor is called the memory consistency model, that is, the so-called "memory model" we have been concerned with. (See Hans-J. Boehm, "Threads Cannot Be Implemented as a Library", Internet Systems and Storage Laboratory, HP Laboratories Palo Alto, November 12, 2004)

 

The memory consistency model provides a formal description of how the memory system in a shared-memory multiprocessor behaves as seen by program developers, resolving the gap between developers' expectations and actual system behavior. To be useful, a memory consistency model places restrictions on the values that reads of shared memory may return. Intuitively, for example, a "read" should return the data written by the "last" corresponding "write". In a uniprocessor system, "last" is defined by program order, that is, the order in which memory operations appear in the program. In a multiprocessor system this concept no longer applies directly: a read and a write of the same data may not be ordered by program order at all, because they execute on two different processors. However, it is intuitive to apply the uniprocessor model directly to a multiprocessor system, and the resulting model is called the sequential consistency model. Stated informally, sequential consistency means that memory accesses execute one at a time, and the accesses of each processor are performed in the order specified by its program. In this way we can guarantee that each read returns the value of the most recent write in that single global order. (See Sarita Adve, "Designing Memory Consistency Models for Shared-Memory Multiprocessors", PhD thesis, University of Wisconsin-Madison, Technical Report #1198, December 1993)
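The classic two-thread litmus test makes the sequential-consistency guarantee concrete. Below is a minimal C++11 sketch (my own illustration, not from the cited papers) using std::atomic, whose default ordering is sequentially consistent:

```cpp
#include <atomic>
#include <thread>

// Classic store/load litmus test. Under sequential consistency (the default
// memory_order_seq_cst of std::atomic) the outcome r1 == 0 && r2 == 0 is
// impossible: in the single global order of operations, at least one of the
// two stores must precede both loads.
bool sc_litmus_holds(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x(0), y(0);
        int r1 = -1, r2 = -1;
        std::thread t1([&] { x.store(1); r1 = y.load(); });
        std::thread t2([&] { y.store(1); r2 = x.load(); });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) return false;  // forbidden under SC
    }
    return true;
}
```

If both threads could read 0, it would mean that neither store preceded either load in any single global order, which sequential consistency forbids.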

 

The sequential consistency model provides a simple and intuitive programming model. However, it effectively prevents the hardware and the compiler from performing most optimizations on program code. For this reason, many relaxed memory models have been proposed, giving the processor latitude to reorder memory operations. Examples include the Alpha processor, the PowerPC processor, and the x86 and x64 families we currently use. These memory models differ from each other, which creates many obstacles for cross-platform applications.

 

The memory consistency model is the interface between program developers and the system; it affects the development of shared-memory applications (developers must rely on its guarantees to ensure the correctness of their programs). Note that the memory consistency model also affects system performance to some extent, because it partially limits hardware and software optimizations. In addition, the different memory consistency models of different systems make porting programs between them difficult.

 

Note: as described above, the complexity of memory consistency models makes it harder to develop parallel programs across platforms. Therefore, to achieve cross-platform consistency, the memory model must be unified to some extent. Of course, this comes at some cost in performance.

 

Note: here is the issue we overlooked in the previous articles: the memory consistency model is determined not by the CPU alone, but by the combination of all levels of the system.

 

The memory consistency model defines the behavior of memory accesses at every level of the program and the system. At the machine-code level, its definition affects hardware designers and machine-code developers. At the high-level-language level, its definition affects high-level-language developers, compiler developers, and hardware designers. Therefore programmability, performance, and portability must all be considered at every level.

 

Note: the compiler's memory model is therefore language-level. For example, C++ currently has only a weak description of its memory consistency model, so when programming we depend heavily on the particular compiler's memory consistency model (with the result, of course, that the correctness of the program depends on the compiler).

 

Therefore, the memory consistency model affects not only parallel program developers but every part of a parallel system (processor, memory system, interconnect, compiler, and programming language). If we can guarantee a certain relaxed memory consistency model across different real platforms, we can ensure the generality and correctness of a multi-threaded library while maintaining performance.

[End: Memory Consistency Model]

 

It follows that for the so-called unified memory model of the .NET Framework, the memory model is determined not by the CLI implementers alone, but jointly by the CLI implementers, compilers, memory systems, and processors. This is what we mentioned in the first article. In other words: one day you write a piece of code --> you find that it compiles successfully (but your code sequence may have been changed during compilation; this is a change in execution order) --> you run the target code (the order of the code in the target file is the program order) --> your code begins to execute (but the CPU executes memory operations out of order at run time, which is another change in execution order).

 

With the memory consistency model, we now have a general understanding of the rules governing memory operations. Is there any way to constrain the order of memory accesses? Yes! The memory barrier! We repeat this part here because the explanation in the previous articles was easy to misunderstand, and in places simply wrong.

 

II. Memory Barriers

 

If the shared-memory model of the runtime platform is fixed, a multi-threaded library written against that model can run correctly on any underlying platform that supports it. Sometimes, however, an algorithm (especially a lock-free algorithm) needs specific memory-ordering semantics of its own to guarantee correctness. In that case we must explicitly use certain instructions to control the ordering of memory operations and their visibility. Such instructions are called memory barriers.

 

As we just mentioned, the memory consistency model defines memory-access behavior at every level of the program and the system. At the machine-code level its definition affects hardware designers and machine-code developers; at the high-level-language level it affects high-level-language developers, compiler developers, and hardware designers. That is, memory-operation reordering exists at every level. Here we can distinguish three orderings of a program:

 

(1) Program order: the order of the memory operations as they appear in the compiled binary image of the code running on a given CPU. The compiler does not necessarily lay out binary code strictly in source order: during optimization it may, within established rules, rearrange the instruction sequence, and the result is the program order referred to here. The compiler may also optimize based on the specific behavior of the program, and this optimization can even change the form and complexity of the algorithm being executed (for example, converting a switch into a table-driven sequence).

(2) Execution order: the order in which the memory-related instructions actually execute on the CPU. The execution order may differ from the program order; this is the result of CPU optimization. At run time, the CPU may, according to its own memory model (independent of the compiler), break up the compiled instruction sequence to optimize the program and maximize resource utilization.

(3) Perceived order: the order in which a CPU perceives its own and other CPUs' operations on memory. The perceived order may differ from the execution order; this is caused by cache optimizations and the memory system.

(See "Memory Ordering in Modern Microprocessors")
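Item (1) above mentions that compiler optimization can even change the form of the algorithm, for example converting a switch into a table-driven sequence. Below is a hand-written sketch of that transformation (the month/day values are my own illustrative example, not from the original article):

```cpp
// Hand-written illustration of the "switch -> table" transformation a
// compiler may perform: both functions compute the same result, but the
// table-driven form replaces a chain of branches with one indexed load.
int days_switch(int month) {  // month: 1..12, non-leap year
    switch (month) {
        case 2: return 28;
        case 4: case 6: case 9: case 11: return 30;
        default: return 31;
    }
}

static const int kDays[13] = {0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

int days_table(int month) {   // month: 1..12, non-leap year
    return kDays[month];      // single memory load instead of branching
}
```

The two forms are semantically equivalent in a single thread, but they touch memory differently, which is exactly why the compiler's freedom to choose between them matters to lock-free code.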

The final shape of the shared-memory model is determined by these three orderings together. From the source code to final execution, the code order is adjusted at no fewer than three levels, by the compiler and by the CPU respectively. As mentioned above, although these changes in execution order have no visible side effects in a single-threaded program, in multi-threaded programs their effect cannot be ignored and may even produce completely wrong results. Therefore, in multi-threaded programs we sometimes need to restrict the order of memory operations by hand, and this restriction is applied at different levels.

Note: what we did not mention before is that memory barriers also come at different levels. In fact, if you use Visual Studio 2005/2008 you will find _ReadBarrier, _WriteBarrier, _ReadWriteBarrier, and so on in the API. MSDN will tell you these are memory barriers, but at what level do these barriers operate??

 

Strictly speaking, memory barriers apply only to the hardware level, not to software; that is, a memory fence is not aimed at the compiler. However, because the compiler also reorders code, the concept of a compiler memory barrier was introduced. The definitions are as follows:

    • A compiler memory barrier guarantees that the compiler will not move code from either side across the barrier, but it cannot prevent the CPU from reordering execution.
    • A hardware memory barrier is an instruction (or sequence of instructions) that forces the CPU to execute the memory operations on both sides of it in an order consistent with the specified rules.

Take the Visual C++ 8.0/9.0 compilers as an example. The compiler rules give the Visual C++ compiler the right to reorder operations on variables declared volatile in order to optimize. Therefore, the Platform SDK introduces the compiler memory barriers _ReadBarrier(), _WriteBarrier(), and _ReadWriteBarrier(). Using these functions properly ensures that the execution order of multi-threaded code is not changed by compiler optimization; otherwise the optimized program's behavior may change. Note that this only constrains the compiler level; the execution (CPU) level is not guaranteed.
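_ReadWriteBarrier() is specific to the Microsoft compiler. On GCC or Clang, the conventional compiler-only barrier is an empty inline-asm statement with a "memory" clobber. The sketch below is my own illustration of the idiom, not code from the original article:

```cpp
// GCC/Clang analog of MSVC's _ReadWriteBarrier(): an empty inline-asm
// statement with a "memory" clobber. It forbids the compiler from moving
// memory accesses across it, but emits no CPU fence instruction.
static inline void compiler_barrier() {
    __asm__ __volatile__("" ::: "memory");
}

int g_data = 0;
int g_flag = 0;

void publish(int value) {
    g_data = value;      // the compiler may not sink this store below...
    compiler_barrier();  // ...this point, nor hoist the g_flag store above it
    g_flag = 1;          // note: the CPU itself may still reorder the stores
}
```

As the comment says, this only pins down the program order; on a weakly ordered CPU a hardware barrier would still be required between the two stores.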

 

Therefore, in "Loads are not reordered with other loads" is a fact!! Continued: .NET MM is broken, we had this code:

 

static long InterlockedExchange(volatile long* ptr, long value)
{
    long result;
    volatile long* p = ptr;
    __asm
    {
        __asm mov edx, p
        __asm mov eax, value
        __asm lock xchg [edx], eax
        __asm mov result, eax
    }
    load_with_acquire(*ptr);
    return result;
}

template <typename T>
static T load_with_acquire(const T& ref)
{
    T to_return = ref;
#if (_MSC_VER >= 1300)
    _ReadWriteBarrier();
#endif
    return to_return;
}

 

Why does InterlockedExchange still need load_with_acquire? The comments in the original source do not explain it; the real reason is to prevent the compiler from reordering around volatile. If the compiler guarantees that operations on volatile variables are not reordered, as the Intel compiler does, then load_with_acquire is unnecessary. The reader may ask: since _ReadWriteBarrier only works at the compiler level, could the execution order still change? Well, not for this load: because the code runs on the IA-32, the CPU already guarantees acquire semantics for loads, so as long as the compiler does not reorder it, we are fine.
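For comparison, in portable C++11 (which postdates this article) the same load-with-acquire idiom is expressed through std::atomic, where the acquire ordering constrains both the compiler and the CPU. A minimal sketch, assuming a C++11 compiler:

```cpp
#include <atomic>

// Portable C++11 version of the article's load_with_acquire idiom.
// memory_order_acquire constrains both the compiler and the CPU: no later
// memory access may be moved before this load. On IA-32/x86-64, ordinary
// loads already have acquire semantics, so this typically compiles to a
// plain load plus a compiler-level ordering restriction.
template <typename T>
T load_with_acquire(const std::atomic<T>& ref) {
    return ref.load(std::memory_order_acquire);
}
```

This matches the article's point: on IA-32 the hardware part of the guarantee is free, and only the compiler constraint does any work.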

 

Real memory barriers, however, are hardware-level, using specific instructions provided by the CPU. For example, in the Microsoft Platform SDK, the MemoryBarrier macro is this kind of hardware barrier. It is defined as follows (on the x86, x64, and IA64 platforms):

 

#ifdef _AMD64_
#define MemoryBarrier __faststorefence
#endif

#ifdef _IA64_
#define MemoryBarrier __mf
#endif

// x86

FORCEINLINE
VOID
MemoryBarrier(VOID)
{
    LONG Barrier;
    __asm {
        xchg Barrier, eax
    }
}

 

The above shows that to get a correct memory-operation order, you must use compiler and hardware memory barriers appropriately in your program. Overusing memory barriers causes serious performance degradation (because the CPU's memory-operation reordering and cache optimizations can no longer do their work); using them incorrectly causes very subtle, hard-to-debug errors.

 

[End: Memory Barriers]

 

III. Cache Consistency

 

In article one there is a sentence: "CPU 1 operates on memory unit 1 and then on memory unit 2, but another CPU first sees the change to memory unit 2 and then the change to memory unit 1." It should be noted that this does not happen on most processors: most processors guarantee that the visible order of stores is consistent with the order in which they were performed. (For Intel processors, see Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Section 10.4)

 

Most articles about multithreading mention visibility problems caused by memory optimization. What needs to be clarified is that in most cases the difference in visibility is not caused by the cache, but by other memory optimizations. Cache coherence protocols were designed precisely to avoid inconsistent caches: Intel uses the four-state MESI protocol for its caches, which is a bus-snooping protocol.

 

The "Loads are not reordered with other loads" is a fact!! series gave an example showing that the .NET MM is broken. But if the cache is not the problem, what causes it? The answer: the store buffer!! Let's look at how the cache works:

 

If the CPU finds that an operation targets cacheable system memory (not all memory regions are cacheable), it reads the entire cache line into the appropriate cache (L1, L2, or L3, if present). This operation is called a cache line fill. If the CPU finds that the required operand is already in the cache, it reads the operand directly from the cache instead of from memory. This situation is called a cache hit.

When the CPU tries to write an operand to cacheable system memory, it first checks whether the corresponding line is already in the cache. If a valid cache line exists, the CPU (depending on the current write policy) can write the operand into the cache rather than into system memory; this operation is called a write hit. If the cache does not contain the line for this operand, the CPU performs a cache line fill, writes the operand into the cache, and at the same time (depending on the write policy) may write the operand back to system memory. If the operand is to be written back to memory, it is first written to the store buffer, and then written from the store buffer to memory once the bus is idle. (See Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Section 10.2)

 

Now we know what the store buffer is. Without a store buffer, we would have to wait for the bus to become idle before writing the operand back to memory. With it, the cache write-back can return immediately even while the bus is busy! However, while a store sits in the store buffer, it satisfies loads from its own processor but not loads from other processors. That is, during this window, if another CPU loads from this address it will not see the new value! This is not cache inconsistency, but it can look like cache inconsistency! (See Intel 64 Architecture Memory Ordering White Paper, Section 2.4)

 

Therefore, to correct this error we need to flush the store buffer, not flush the cache! How do we flush the store buffer? The store buffer is drained in the following cases:

 

When a CPU exception or interrupt occurs;
When a serializing instruction is executed;
When an I/O instruction is executed;
When a locked operation is executed;
When a BINIT operation is executed;
When SFENCE is used to order store operations;
When MFENCE is used to order all memory operations;

(See Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Section 10.10)

 

And we know that System.Threading.Thread.MemoryBarrier() is actually an xchg (on the IA-32), which is a locked operation, so it flushes the store buffer! Therefore, we can say that System.Threading.Thread.MemoryBarrier() does fix this error, but the reason is not the cache; it is the store buffer!
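The effect can be sketched in portable C++11 (an illustration of the same idea, not the .NET implementation): relaxed stores and loads reproduce the store-buffer litmus test, and a full fence between each store and the following load (std::atomic_thread_fence with seq_cst, which on x86 compiles to MFENCE or a locked instruction) rules out the r1 == 0 && r2 == 0 outcome:

```cpp
#include <atomic>
#include <thread>

// Store-buffer litmus test with explicit full fences, a portable sketch of
// what placing System.Threading.Thread.MemoryBarrier() between the store and
// the load achieves. The seq_cst fence orders each thread's store before its
// subsequent load, so "r1 == 0 && r2 == 0" cannot occur. Without the fences,
// the store buffer makes that outcome possible.
bool forbidden_outcome_seen(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x(0), y(0);
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) return true;
    }
    return false;
}
```

Removing (or weakening) the two fences makes the 0/0 outcome possible on x86, which is exactly the store-buffer effect described above.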

 

With this, the explanation and correction of the earlier articles can come to a close. Hope you learned something along with me :-)
