In the previous article, "loads are not reordered with other loads" was established as fact. Continuing from there: do not rely on volatile as described above — the .NET memory model's implementation of volatile loads is incorrect. That is now the final conclusion of the semi-official forum; for the details of that discussion, see "a bit more formalism as to why CLR's mm is broken on x86/x64".
The trouble with the memory model (MM) is that the topic is dry, and its jargon is every bit as dense as any other field's. So let's take this opportunity to lay out the terms.
By "memory model" we really mean the memory consistency model for shared-memory multiprocessors. There is also the software memory model specified by a standard: for example, the .NET virtual machine runs managed code and manages memory as if it were a real physical machine, so the behavior of its memory reads and writes must be specified somewhere — that specification is the (software) memory model.
The memory model affects multithreading in two main ways. The first problem is ordering, that is, memory reordering. The second problem is visibility. The two are related, but they must be kept apart — we will see in a moment that they are not the same thing.
Why reorder at all instead of simply executing in program order? Because reordering squeezes the most performance out of the processor and its resources. Of course, the same feature makes it easy to introduce bugs into multithreaded programs (as the old programmer joke goes: if programmers built cars, fuel consumption per kilometer would drop to a tenth and performance would improve tenfold — but every car would explode once a year, with no exceptions).
Memory reordering is very common on modern processors. For Intel x86 processors, for example, the following rules hold (see the Intel memory ordering white paper):
(1) Loads are not reordered with other loads.
(2) Stores are not reordered with other stores.
(3) Stores are not reordered with earlier loads.
(4) Loads may be reordered with earlier stores, but only when they target different locations. If they access the same memory location, they are not reordered.
(5) In a multiprocessor system, memory ordering obeys causality, i.e. it respects transitive visibility. (Note that this constrains only the order, not the final result of that order; for that case see section 2.4 of the Intel white paper, or the previous article.)
(6) In a multiprocessor system, stores to the same location are seen in a consistent order by all processors.
(7) In a multiprocessor system, locked instructions (those carrying the lock prefix) have a total order.
(8) Loads and stores are not reordered with locked instructions.
So much for memory reordering. Next comes the visibility problem. The maddening part is that the completion of a memory instruction does not imply visibility. Why? That is a consequence of how the CPU's memory hierarchy is optimized — a topic that deserves an article of its own.
Consequently, from the point of view of another (logical) processor, the effect of an instruction that has already completed may not yet be the truth it sees — it is only guaranteed to become visible when the operation carries acquire or release semantics (lock-prefixed operations aside, of course). Unfortunately, we were not aware of this problem before.
Hold on — what exactly do acquire and release mean? Here is a strict definition:
Acquire semantics: if an operation has acquire semantics, the other (logical) processors in the system will see its result before the results of any instructions that follow it.
Release semantics: if an operation has release semantics, the other (logical) processors in the system will see its result only after the results of all instructions that precede it.
As you can see, acquire and release constrain the direction in which instructions may be reordered. If an instruction has acquire semantics, earlier instructions may be moved to execute after it, but later instructions can never be moved to execute before it. If an instruction has release semantics, later instructions may be moved to execute before it, but earlier instructions can never be moved to execute after it. Crucially, acquire and release also define visibility — and that is the crux of it.
Okay. Now, let's look at our original example:
// Processor 0    // Processor 1
x = 1;            y = 1;
r0 = x;           r2 = y;
r1 = y;           r3 = x;
Assume x and y are both volatile. If .NET's memory model were implemented correctly, every volatile load would have acquire semantics, and every store — volatile or not — would have release semantics! By data dependence, x = 1 cannot be exchanged with r0 = x, and by acquire semantics r0 = x cannot be exchanged with r1 = y. Given the visibility guarantees, no execution should ever finish with r1 == 0 and r3 == 0. Yet that is exactly the result that shows up!
Okay. So how do we solve the visibility problem? The answer is the appropriate memory fence. Quoting the <Intel 64 and IA-32 Architectures Software Developer's Manual>:
lfence: forces load operations into order; every subsequent load waits until all earlier loads are globally visible.
sfence: forces store operations into order; every subsequent store waits until all earlier stores are globally visible.
mfence: forces both load and store operations into order; every subsequent load or store waits until all earlier loads and stores are globally visible.
Wait — what about lock? Well, the manual actually only guarantees that no two processors will update the location at the same time (whether by locking the bus or locking the cache); visibility is not mentioned. In fact, neither lock nor the implicit lock of xchg makes such a promise. Let's look at an implementation of InterlockedExchange:
template <typename T>
static long load_with_acquire(const T& ref)
{
    long to_return = ref;
    _ReadWriteBarrier();   // compiler barrier: no reordering across this point
    return to_return;
}

static long InterlockedExchange(volatile long* ptr, long value)
{
    long result;
    volatile long* p = ptr;
    __asm
    {
        mov edx, p
        mov eax, value
        lock xchg [edx], eax
        mov result, eax
    }
    load_with_acquire(*ptr);   // make sure the exchange is visible
    return result;
}
Now the story is nearly over. One final reminder: if your program has only one thread and shares no resources across processes, you can simply ignore all of these issues — for a single thread, although the processor does plenty of "things you don't know about" under the hood, the result "looks" exactly as if everything had executed in order. For multithreaded developers, whenever you use a synchronization object or one of the InterlockedXxx functions, the appropriate memory fence is included implicitly, so you need not worry either. Only if you work at the lowest level do you have to untangle these complications yourself.