Lock-free programming is both fascinating and maddening. The blogger has written several articles about it before, for example on lockless programming and on concurrent data structures (Charming Atom). For a deeper understanding and hands-on practice of lock-free programming, see CLR 2.0 Memory Model and Concurrent Data Structure: Stack. This article does not intend to keep explaining how to use lock-free techniques; instead it discusses their negative impact, so that you get a more complete picture of lock-free programming.
When people talk about lock-free programming, the CAS primitive comes up almost immediately. CAS is short for Compare and Swap. On the Windows and .NET platforms it appears, for historical reasons, as the Interlocked API. On x86 CPUs the atomic instructions include XCHG, CMPXCHG, INC, and so on; of course, they must carry the LOCK prefix (for more information, see Concurrent Data Structure: Charming Atom).
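To make the primitive concrete, here is a minimal sketch (the AtomicAdd helper is my own name, not from any of the referenced articles) of building an atomic add on top of Interlocked.CompareExchange, which is the call that ends up as LOCK CMPXCHG on x86:

// requires: using System.Threading;
static class CasSketch
{
    // A minimal sketch: add 'delta' to 'target' atomically using a CAS loop.
    public static int AtomicAdd(ref int target, int delta)
    {
        int oldValue, newValue;
        do
        {
            oldValue = target;            // optimistic read of the current value
            newValue = oldValue + delta;  // compute the desired value
            // Publish only if nobody changed 'target' in the meantime; otherwise retry.
        } while (Interlocked.CompareExchange(ref target, newValue, oldValue) != oldValue);
        return newValue;
    }
}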
CAS primitives can indeed improve program performance dramatically under low to moderate contention. But everything has its trade-offs: the CAS primitive can be extremely harmful to the scalability of a program (for other drawbacks, see the lockless programming article). You may think that sounds extreme, but it is true. Let us look at the details:
- The atomicity of CAS depends entirely on the hardware implementation. Most Intel and AMD CPUs manage their caches with a MESI- or MOESI-style cache coherence protocol. Under this architecture, a CAS on a line already held in the processor's cache is relatively cheap. Once the location is contended, however, cache lines are invalidated and the bus gets busy: the more cache invalidations, the more bus traffic, and the longer each CAS is delayed. Cache contention is the killer of program scalability. This is true of ordinary memory operations too, of course, but with CAS the situation is even worse.
- A CAS operation consumes many more CPU cycles than an ordinary memory operation. This comes from the extra load on the cache hierarchy, the need to flush write buffers, the memory-barrier requirements, and the fact that the compiler can do very little to optimize around a CAS instruction.
- CAS is typically used for optimistic concurrency. That means a failed CAS forces some instructions to be retried (a classic roll-back-and-retry), so work is thrown away, and extra work is done even when there is no contention; both the successful and the failed attempts add to the contention. A sketch of this retry pattern follows this list.
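To illustrate the retry pattern from the last point, here is a rough sketch of a lock-free stack push built on CompareExchange (my own illustration, not the code from the stack article linked above):

// requires: using System.Threading;
class LockFreeStack<T>
{
    class Node { public T Value; public Node Next; }

    Node head;

    public void Push(T value)
    {
        var node = new Node { Value = value };
        Node oldHead;
        do
        {
            oldHead = head;        // optimistic read
            node.Next = oldHead;   // work that is wasted if the CAS below fails
            // Retry whenever another thread changed 'head' in the meantime.
        } while (Interlocked.CompareExchange(ref head, node, oldHead) != oldHead);
    }
}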
Most CAS operations happen when a lock is entered and exited. Although a lock can be built with a single CAS operation, the .NET CLR Monitor uses two (one in the Enter method and one in the Exit method). Lock-free algorithms often replace the lock with CAS primitives, but because of memory reordering such algorithms usually need explicit memory barriers even though they use CAS instructions. Locks have a bad reputation, yet most competent developers know to hold a lock for as short a time as possible. So although locks are annoying and cost some performance, compared with a flood of CAS operations they do not hurt the scalability of the program nearly as much.
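As an aside, a lock built from a single CAS on entry might look roughly like the following toy sketch (the names are mine, and this is emphatically not how the CLR Monitor is implemented):

// requires: using System.Threading;
class SpinLockSketch
{
    int state; // 0 = free, 1 = held

    public void Enter()
    {
        // Spin until we win the CAS that flips state from 0 to 1.
        while (Interlocked.CompareExchange(ref state, 1, 0) != 0)
        {
            Thread.SpinWait(1);
        }
    }

    public void Exit()
    {
        // Interlocked.Exchange also acts as a full fence, publishing our writes.
        Interlocked.Exchange(ref state, 0);
    }
}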
Take an example: increment a counter 100,000,000 times. There are several ways to do this. If the code only ever runs on a single core, we can use an ordinary memory operation:
const int Count = 100000000;   // 100,000,000 increments, shared by all three counter methods below
static volatile int counter = 0;

static void BaselineCounter()
{
    for (int i = 0; i < Count; i++)
    {
        counter++;
    }
}
Obviously, the code above is not thread-safe, but it gives us a good timing baseline for the counter. Next we use LOCK INC (via Interlocked.Increment) as the first thread-safe version:
static volatile int counter = 0;

static void LockIncCounter()
{
    for (int i = 0; i < Count; i++)
    {
        Interlocked.Increment(ref counter);
    }
}
Now the sample code is thread-safe. There is another way to make it thread-safe, and it is the one we usually reach for when some validation is needed (overflow protection, for example): use CMPXCHG (CAS):
static volatile int counter = 0;

static void CASCounter()
{
    for (int i = 0; i < Count; i++)
    {
        int oldValue;
        do
        {
            oldValue = counter;
        } while (Interlocked.CompareExchange(ref counter, oldValue + 1, oldValue) != oldValue);
    }
}
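To see why the CAS form is the one chosen when validation is needed, consider this variant, which refuses to increment past int.MaxValue (the overflow check is only an illustrative example, not part of the benchmark):

// requires: using System.Threading;
static bool TryIncrementChecked(ref int target)
{
    int oldValue;
    do
    {
        oldValue = target;
        if (oldValue == int.MaxValue)
            return false;              // validation failed: refuse to overflow
        // Publish oldValue + 1 only if nobody else changed the value; otherwise retry.
    } while (Interlocked.CompareExchange(ref target, oldValue + 1, oldValue) != oldValue);
    return true;
}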
Now for an interesting question: which method is slowest under cache contention? The results may surprise you.
The test results on an Intel 4-core processor are as follows:
As the figure shows, with two cores the BaselineCounter method takes 2.11 times as long as on a single core, and the other cases are similar. Comparing the results, it is clear that more concurrency gives worse results here, largely because of memory contention.
When a CAS operation fails, the performance of the CASCounter method on a multi-core processor can be improved with a spin wait (for details, see blogger "summer is a good season"'s hands-on implementation of a lightweight semaphore, parts (1) and (2)). This greatly reduces livelock and the blocking time associated with it.
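A sketch of what such a spin wait might look like, assuming the same counter field and Count constant as in the snippets above (my illustration, not the code from the referenced semaphore articles):

// requires: using System.Threading;
static void CASCounterWithBackoff()
{
    for (int i = 0; i < Count; i++)
    {
        int spins = 1;
        while (true)
        {
            int oldValue = counter;
            if (Interlocked.CompareExchange(ref counter, oldValue + 1, oldValue) == oldValue)
                break;                      // CAS succeeded, move to the next increment
            Thread.SpinWait(spins);         // busy-wait a little before retrying
            if (spins < 1024) spins *= 2;   // crude exponential backoff
        }
    }
}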
Of course, this example is an extreme case: it modifies the same memory address as fast as it can. Inserting some function calls between the shared-memory accesses, to space them out, relieves the pressure considerably.
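Roughly, the padded benchmark might look like the sketch below; the Delay method and the Padding constant are hypothetical names I introduced, not the actual measured code:

// requires: using System.Threading; using System.Runtime.CompilerServices;
const int Padding = 2; // 2, 64, 128 ... calls between increments

[MethodImpl(MethodImplOptions.NoInlining)]
static void Delay() { }   // deliberately not inlined, so the call really happens

static void PaddedCASCounter()
{
    for (int i = 0; i < Count; i++)
    {
        int oldValue;
        do
        {
            oldValue = counter;
        } while (Interlocked.CompareExchange(ref counter, oldValue + 1, oldValue) != oldValue);

        for (int j = 0; j < Padding; j++)
            Delay();                        // unrelated work between accesses to the shared counter
    }
}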
For example, if two function calls are inserted, the following data is obtained:
After 64 function calls are inserted, the data becomes as follows:
Now we can see that the multi-core runs take less time than the single-core run; this is the speedup that parallelism buys us. Since going from 2 to 64 inserted calls kept improving the results, you might wonder whether even more calls would help. In fact, after 128 function calls are inserted the speedup has already reached its limit. The results are as follows:
For details on how the speedup ratio is calculated, see Parallel Thinking [II].
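(For reference, the speedup ratio is the usual one: single-core time divided by multi-core time. For example, if I read the first figure correctly, BaselineCounter taking 2.11 times as long on two cores corresponds to a "speedup" of 1 / 2.11 ≈ 0.47, i.e. an actual slowdown.)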
There is no free lunch, and CAS is no exception. We should place lock-free CAS code in our programs with caution, and be aware of how often the threads will execute it. One sentence sums it up: sharing is the root of the problem. It fundamentally limits the scalability of an application and is best avoided whenever possible. Shared memory requires concurrency control, concurrency control requires CAS, and CAS is expensive, so shared memory is expensive too. Many people reach for lock-free techniques, transactional memory, or reader/writer locks to improve scalability; unfortunately the improvement rarely materializes, and CAS often ends up worse than a properly implemented lock. The main culprits are the shared memory itself, the failed optimistic attempts, and the cache invalidations.
Update, April 8, 2009
Overred raised a good point in the comments: when the Interlocked API is used, the shared variable does not need the volatile modifier.
To illustrate the problem better, here is a simple code example:
using System;

namespace Lucifer.CSharp.Sample
{
    class Program
    {
        static volatile int x;

        static void Main(string[] args)
        {
            Foo(ref x);
        }

        static void Foo(ref int y)
        {
            while (y == 0)
                ;
        }
    }
}
When we compile this code in Visual Studio, the compiler emits warning CS0420 (a reference to a volatile field will not be treated as volatile):
In general, this warning deserves attention. In the example above, the JIT compiler is allowed to assume that y never changes, which turns the loop into an endless loop. On the IA64 platform the reference would be treated as an ordinary memory access instead of a special load-acquire access, and CPU instruction reordering could then introduce subtle bugs. The exceptions are the Interlocked API, the Thread.VolatileRead/VolatileWrite methods, and locks: these APIs impose the required memory barriers and hardware atomic instructions themselves, whether or not the shared variable passed to them is declared volatile. The test methods used in this article are therefore still safe.
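For instance, a version of Foo that reads through Thread.VolatileRead (one of the APIs just mentioned; it needs using System.Threading) no longer depends on the parameter being volatile:

// A sketch: reading y through Thread.VolatileRead forces a fresh read on each
// iteration, so the loop exits when another thread sets the variable, even
// though the ref parameter itself is not treated as volatile.
static void Foo(ref int y)
{
    while (Thread.VolatileRead(ref y) == 0)
        ;
}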
If you find this warning annoying, you can suppress it with the #pragma directive, as shown below:
static volatile int x;

static void Foo()
{
#pragma warning disable 0420
    Interlocked.Exchange(ref x, 1);
#pragma warning restore 0420
}
Of course, in that case the volatile modifier is not needed at all; the CLR memory model guarantees this.
For details on how to use volatile correctly, see Concurrent Data Structure: The volatile Variable.