An Introduction to Lock-Free Programming


Address: http://preshing.com/20120612/an-introduction-to-lock-free-programming

Lock-free programming is a challenge, not only because of the complexity of the topic itself, but also because of how difficult it is to get started exploring it.

I consider myself lucky: my first introduction to lock-free, also known as lockless, programming was Bruce Dawson's excellent and comprehensive white paper, "Lockless Programming Considerations". Like many, I had the opportunity to put Bruce's advice into practice writing and debugging lock-free code, for example during development on the Xbox 360 platform.

Since then, a lot of good material has been written, ranging from abstract theory and proofs of correctness down to practical examples and hardware details; some references appear in the footnotes. At times, information from one source may appear orthogonal to another. For instance, some material assumes sequential consistency, which sidesteps the memory-ordering problems that plague real-world C/C++ lockless code. Meanwhile, the new C++11 atomics library standard provides new tools that challenge the way many existing lockless algorithms are expressed.

In this article, I'd like to re-introduce lock-free programming: first by defining it, then by distilling a few important concepts out of the mass of available information. I'll use flowcharts to show how those concepts relate to one another, and then we'll dip our toes into some details. Any programmer taking up lock-free programming should already understand how to write correct multithreaded code using mutexes, and should be familiar with other high-level synchronization mechanisms such as semaphores and events.

What is lock-free programming?

Lock-free programming is often described as programming without mutexes (a locking mechanism). That is true as far as it goes, but it is only part of the story. The generally accepted definition, based on academic literature, is broader: in essence, lock-free describes a property of the code, without saying too much about how that code is actually written.

Basically, if some part of your code satisfies the following conditions, that part can rightfully be considered lock-free; conversely, if some part of your code does not satisfy them, that part is not lock-free.

In this sense, the "lock" in lock-free does not refer directly to mutexes, but rather to any possible way the entire application could be "locked up", whether by deadlock or livelock, or even by hypothetical thread scheduling decisions made by your worst enemy. That last point sounds funny, but it is crucial. Shared mutexes are ruled out immediately, because once a thread acquires a mutex, your worst enemy could simply never schedule that thread again. Of course, real operating systems don't work that way; we are merely defining terms.

The following simple example uses no mutex, but it is still not lock-free. Initially, X = 0. As an exercise for the reader: how could two threads be scheduled so that neither thread ever exits the loop?

while (X == 0)
{
    X = 1 - X;
}

Nobody expects a large application to be entirely lock-free. Typically, we identify a specific set of lock-free operations out of the whole codebase. For example, in a lock-free queue, there might be a handful of lock-free operations such as push, pop, and perhaps isEmpty.

Herlihy and Shavit, authors of "The Art of Multiprocessor Programming", tend to express such operations as class methods, and offer a succinct definition of lock-free (see slide 150): "In an infinite execution, infinitely often some method call finishes." In other words, as long as the program is able to keep calling those lock-free operations, the number of completed calls keeps increasing, no matter what. It is algorithmically impossible for the system to lock up during those operations. One important consequence of lock-free programming is that if you suspend a single thread, it will never prevent other threads from making progress, as a group, through their own lock-free operations. This hints at the value of lock-free programming for interrupt handlers and real-time systems, where certain operations must complete within a certain time limit, regardless of the state of the rest of the program.

One last precision: operations that are designed to block do not disqualify an algorithm. For example, a queue's pop operation may intentionally block while the queue is empty. The remaining code paths can still be considered lock-free.

Lock-free programming techniques

It turns out that when you attempt to satisfy the non-blocking condition of lock-free programming, a whole family of techniques comes up: atomic operations, memory barriers, and ABA prevention, just to name a few. This is where things quickly become diabolical.

So how do these techniques relate to one another? The flowchart below illustrates it, and the sections that follow elaborate on each one.

Atomic read-modify-write operations

An atomic operation is one that manipulates memory in a way that appears indivisible: no thread can observe the operation half-complete. On modern processors, lots of operations are already atomic. For example, aligned reads and writes of simple integer types are usually atomic.
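As a minimal sketch of this point: C++ only guarantees indivisible reads and writes when you go through `std::atomic`, so that is the portable way to rely on this hardware property. The variable name `g_sharedValue` below is my own, chosen for illustration.

```cpp
#include <atomic>
#include <cstdint>

// An aligned 32-bit value. Plain int writes are usually atomic on modern
// hardware, but std::atomic is the only portable way to state that intent.
std::atomic<uint32_t> g_sharedValue(0);

void writer()
{
    // Another thread will observe either 0 or 0xDEADBEEF,
    // never a torn, half-written value.
    g_sharedValue.store(0xDEADBEEF);
}

uint32_t reader()
{
    return g_sharedValue.load();
}
```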

Read-modify-write (RMW) operations go a step further, allowing you to perform more complex transactions atomically. They are especially useful when a lock-free algorithm must support multiple writers: when multiple threads attempt an RMW on the same address, they effectively line up and perform the operations one at a time. I've already touched on RMW operations on this blog, for example in the lightweight mutex, recursive mutex, and lightweight logging posts.

Examples of RMW operations include _InterlockedIncrement on Win32, OSAtomicAdd32 on iOS, and std::atomic<int>::fetch_add in C++11. Note that the C++11 atomic standard does not guarantee that the implementation is lock-free on every platform, so it is best to know the capabilities of your platform and toolchain. You can call std::atomic<>::is_lock_free to find out.
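A small sketch of both points, using only standard C++11 calls (the names `g_counter`, `atomic_increment`, and `counter_is_lock_free` are my own for illustration):

```cpp
#include <atomic>

std::atomic<int> g_counter(0);

// fetch_add is an atomic read-modify-write: the increment happens as one
// indivisible step, even with many threads calling it concurrently.
// It returns the value held immediately before the addition.
int atomic_increment()
{
    return g_counter.fetch_add(1);
}

// The standard does not promise a lock-free implementation on every
// platform; is_lock_free lets you query this at runtime.
bool counter_is_lock_free()
{
    return g_counter.is_lock_free();
}
```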

Different CPUs support RMW in different ways. For example, PowerPC and ARM expose load-link/store-conditional instructions, which effectively allow you to implement your own custom RMW operation at a low level, though the common RMW operations are usually sufficient.
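C++ does not expose load-link/store-conditional directly, but `compare_exchange_weak` maps naturally onto it: like a store-conditional, it is allowed to fail spuriously. A hedged sketch of building a custom RMW operation on top of it, here an atomic "store maximum", which has no dedicated instruction on most CPUs (the function name `atomicStoreMax` is mine):

```cpp
#include <atomic>

// A custom RMW operation, atomic "store maximum", built from a
// compare_exchange_weak loop. On PowerPC/ARM the compiler can lower this
// to a load-link/store-conditional loop.
void atomicStoreMax(std::atomic<int>& target, int value)
{
    int observed = target.load();
    while (observed < value &&
           !target.compare_exchange_weak(observed, value))
    {
        // On failure (real or spurious), 'observed' is refreshed with the
        // current value; retry until we succeed or the stored value is
        // already at least as large.
    }
}
```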

As the flowchart above shows, atomic RMW operations are a necessary part of lock-free programming even on single-processor systems. Without atomicity, a thread's transaction could be interrupted halfway through, leaving the data in an inconsistent state.

Compare-and-swap loops

Perhaps the most often-discussed RMW operation is compare-and-swap (CAS). On Win32, CAS is provided via intrinsics such as _InterlockedCompareExchange. Often, programmers perform compare-and-swap in a loop to repeatedly attempt a transaction. This pattern typically involves copying a shared variable to a local variable, performing some speculative work on it, and then attempting to publish the changes using CAS.

void LockFreeQueue::push(Node* newHead)
{
    for (;;)
    {
        // Copy a shared variable (m_Head) to a local.
        Node* oldHead = m_Head;

        // Do some speculative work, not yet visible to other threads.
        newHead->next = oldHead;

        // Next, attempt to publish our changes to the shared variable.
        // If the shared variable hasn't changed, the CAS succeeds and we return.
        // Otherwise, repeat.
        if (_InterlockedCompareExchange(&m_Head, newHead, oldHead) == oldHead)
            return;
    }
}
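The same pattern can be expressed in portable C++11, with `compare_exchange_weak` standing in for _InterlockedCompareExchange. This is my own sketch, not code from the original post; the class and member names (`LockFreeStack`, `m_head`) are assumptions chosen for illustration.

```cpp
#include <atomic>

struct Node
{
    int   value;
    Node* next;
};

class LockFreeStack   // a singly linked list, pushed at the head
{
public:
    void push(Node* newHead)
    {
        // Copy the shared variable (m_head) to a local.
        Node* oldHead = m_head.load();
        do
        {
            // Speculative work, not yet visible to other threads.
            newHead->next = oldHead;
            // Try to publish. On failure, oldHead is refreshed with the
            // current head and we repeat.
        } while (!m_head.compare_exchange_weak(oldHead, newHead));
    }

    Node* head() const { return m_head.load(); }

private:
    std::atomic<Node*> m_head{nullptr};
};
```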

Such loops still qualify as lock-free, because if the CAS fails for one thread, it implies that it succeeded for another, although some architectures offer a weaker variant of CAS where that is not necessarily true. Any time you implement a CAS loop, you must take special care to avoid the ABA problem.

Sequential consistency

Sequential consistency means that all threads agree on the order in which memory operations occurred, and that order is consistent with the order of operations in the program's source code. Under sequential consistency, memory-reordering mischief like the example I demonstrated earlier is impossible.

A simple (but obviously impractical) way to achieve sequential consistency is to disable compiler optimizations and force all your threads to run on a single processor; a processor never sees its own memory effects out of order. Some programming languages offer sequential consistency even for optimized code running in a multiprocessor environment. In C++11, you can declare shared variables as atomic types with default memory ordering constraints; in Java, you can mark shared variables as volatile. Here's the earlier example rewritten in C++11 style:

std::atomic<int> X(0), Y(0);
int r1, r2;

void thread1()
{
    X.store(1);
    r1 = Y.load();
}

void thread2()
{
    Y.store(1);
    r2 = X.load();
}

Because C++11 atomic types guarantee sequential consistency by default, the outcome r1 = r2 = 0 is impossible here. To achieve this, the compiler emits additional instructions behind the scenes, typically memory fences and/or RMW operations. Compared with letting the programmer deal with memory ordering directly, those additional instructions may make the implementation less efficient.
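A self-contained harness for the example above can make this concrete: under sequentially consistent atomics, at least one thread must observe the other's store, so r1 == r2 == 0 can never occur in any interleaving. The `trial` wrapper is my own addition for running the experiment.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> X(0), Y(0);
int r1, r2;

void thread1() { X.store(1); r1 = Y.load(); }
void thread2() { Y.store(1); r2 = X.load(); }

// Run one trial. With the default (sequentially consistent) ordering,
// whichever store happens last in the agreed-upon order must be visible
// to the load that follows it, so r1 and r2 cannot both be 0.
bool trial()
{
    X.store(0);
    Y.store(0);
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    return !(r1 == 0 && r2 == 0);   // always true under seq_cst
}
```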

Memory Ordering

As the flowchart above suggests, any time you do lock-free programming for multicore (or for any symmetric multiprocessor) and your environment does not guarantee sequential consistency, you must consider how to prevent memory reordering.

On today's architectures, the tools to enforce correct memory ordering generally fall into three categories, which prevent both compiler reordering and processor reordering:
1. A lightweight sync or fence instruction, detailed later;
2. A full memory fence instruction, as described earlier;
3. Memory operations which provide acquire or release semantics.
Acquire semantics prevent operations which follow it in program order from being reordered before it, and release semantics prevent operations which precede it from being reordered after it. These semantics are particularly well suited to producer/consumer relationships, where one thread publishes some information and another thread reads it. This, too, is discussed in more detail later.
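A minimal sketch of that producer/consumer pattern using C++11's explicit memory orderings (the names `g_payload` and `g_ready` are my own assumptions):

```cpp
#include <atomic>
#include <thread>

int g_payload = 0;                 // plain data, written before publishing
std::atomic<bool> g_ready(false);  // the flag that publishes it

void producer()
{
    g_payload = 42;                // (1) prepare the data
    // (2) publish with release semantics: the payload write above
    // cannot be reordered past this store.
    g_ready.store(true, std::memory_order_release);
}

int consumer()
{
    // (3) wait with acquire semantics: the payload read below cannot
    // be reordered before this load.
    while (!g_ready.load(std::memory_order_acquire))
    { /* spin */ }
    return g_payload;              // (4) guaranteed to observe 42
}
```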

Different processors have different memory models

Different CPU families have different habits when it comes to memory reordering. The rules are documented by each CPU vendor, and the hardware follows them strictly. For instance, PowerPC and ARM processors can change the order of memory stores relative to the instructions themselves, whereas x86/64 processors from Intel and AMD normally do not. We say the former processors have a more relaxed memory model.

There is a temptation to abstract away such platform-specific details, and C++11 in particular now offers us a standard way to write portable lock-free code. But for now, I think most lock-free programmers should have at least some appreciation of platform differences. If there is just one key difference to remember, it is that at the x86/64 instruction level, every load from memory comes with acquire semantics and every store to memory provides release semantics, at least for non-SSE instructions and non-write-combined memory. As a result, it has been common in the past to write lock-free code that works on x86/64 but fails on other processors.

If you are interested in the hardware details of how and why processors reorder memory, I recommend Appendix C of "Is Parallel Programming Hard". In any case, keep in mind that memory reordering can also result from compiler reordering of instructions.

In this post, I have not said much about the practical side of lock-free programming: for example, when do we do it? How much do we really need it? Nor have I mentioned the importance of validating your lock-free algorithms. Nonetheless, I hope this introduction has given some readers a basic familiarity with lock-free concepts, so that you can proceed into further reading without feeling too bewildered. As usual, if you spot any inaccuracies, let me know.
