"Go" A Fast General Purpose Lock-free Queue for C + +

From: http://moodycamel.com/blog/2014/a-fast-general-purpose-lock-free-queue-for-c++

So I've been bitten by the lock-free bug! After finishing my single-producer, single-consumer lock-free queue, I decided to design and implement a more general multi-producer, multi-consumer queue and see how it stacked up against existing lock-free C++ queues. After over a year of spare-time development and testing, it's finally time for a public release. TL;DR: You can grab the C++11 implementation from GitHub (or jump to the benchmarks).

The way the queue works is interesting, I think, so that's what this blog post is about. A much more detailed and complete (but also more dry) description is available in a sister blog post, by the way.

Sharing data: Oh, the woes

At first glance, a general purpose lock-free queue seems fairly easy to implement. It isn't. The root of the problem is that the same variables necessarily need to be shared with several threads. For example, take a common linked-list based approach: at a minimum, the head and tail of the list need to be shared, because consumers all need to be able to read and update the head, and the producers all need to be able to update the tail.

This doesn't sound too bad so far, but the real problems arise when a thread needs to update more than one variable to keep the queue in a consistent state: atomicity is only ensured for single variables, and atomicity for compound variables (structs) is almost certainly going to result in a sort of lock (on most platforms, depending on the size of the variable). For example, what if a consumer read the last item from the queue and updated only the head? The tail should not still point to it, because the object will soon be freed! But the consumer could be interrupted by the OS and suspended for a few milliseconds before it updates the tail, and during that time the tail could be updated by another thread, and then it becomes too late for the first thread to set it to null.
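To make the hazard concrete, here is a minimal sketch (my illustration, not code from the post) of a naive linked-list queue where removing the last element requires updating two shared variables. The type and function names are hypothetical; the point is that the two stores are not one atomic step:

```cpp
#include <atomic>

struct Node {
    int value;
    std::atomic<Node*> next;
};

struct NaiveQueue {
    std::atomic<Node*> head;
    std::atomic<Node*> tail;

    // BROKEN: when the queue holds one element, removing it requires
    // updating both head and tail. Even though each store is atomic,
    // the pair of them is not.
    Node* brokenDequeueLast() {
        Node* h = head.load();
        head.store(nullptr);   // step 1: head updated...
        // <-- the OS may suspend this thread here; tail still points at h,
        //     so a racing producer can link a new node onto a node that
        //     is about to be freed
        tail.store(nullptr);   // step 2: ...tail updated too late
        return h;              // caller frees h while it may still be in use
    }
};
```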

Solutions to this fundamental problem of shared data are the crux of lock-free programming. Often the best way is to conceive of an algorithm that doesn't need to update multiple variables to maintain consistency in the first place, or one where incremental updates still leave the data structure in a consistent state. Various tricks can be used, such as never freeing memory once allocated (this helps with reads from threads that aren't up to date), storing extra state in the last two bits of a pointer (this works with 4-byte aligned pointers), and reference counting pointers. But tricks like these only go so far; the real effort goes into developing the algorithms themselves.
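As an aside, the pointer-tagging trick mentioned above is simple to demonstrate. Here is a minimal sketch (my own, under the assumption of 4-byte-aligned nodes): the two low bits of an aligned pointer are always zero, so they can carry extra state that travels atomically with the pointer in a single compare-and-swap, for example as a version tag to defend against ABA:

```cpp
#include <atomic>
#include <cstdint>

struct Node { int value; };  // alignof(Node) >= 4 on typical platforms

constexpr std::uintptr_t TAG_MASK = 0x3;  // the two spare low bits

inline std::uintptr_t pack(Node* p, unsigned tag) {
    return reinterpret_cast<std::uintptr_t>(p) | (tag & TAG_MASK);
}
inline Node* ptrOf(std::uintptr_t v) {
    return reinterpret_cast<Node*>(v & ~TAG_MASK);
}
inline unsigned tagOf(std::uintptr_t v) {
    return static_cast<unsigned>(v & TAG_MASK);
}

std::atomic<std::uintptr_t> top;

// Bump the tag on every successful swap so that a stale thread's CAS
// fails even if the same node address has been recycled in between.
bool tryReplace(Node* oldNode, unsigned oldTag, Node* newNode) {
    std::uintptr_t expected = pack(oldNode, oldTag);
    std::uintptr_t desired  = pack(newNode, (oldTag + 1) & TAG_MASK);
    return top.compare_exchange_strong(expected, desired);
}
```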

My queue

The less threads fight over the same data, the better. So, instead of using a single data structure that linearizes all operations, a set of sub-queues is used instead, one for each producer thread. This means that different threads can enqueue items completely in parallel, independently of each other.

Of course, this also makes dequeueing slightly more complicated: now we have to check every sub-queue for items when dequeuing. Interestingly, it turns out that the order that elements are pulled from the sub-queues really doesn't matter. All elements from a given producer thread will necessarily still be seen in that same order relative to each other when dequeued (since the sub-queue preserves that order), albeit with elements from other sub-queues possibly interleaved. Interleaving elements is OK because even in a traditional single-queue model, the order that elements get put in from different producer threads is non-deterministic anyway (because there's a race condition between the different producers). [Edit: This is only true if the producers are independent, which isn't necessarily the case. See the comments.]

The only downside to this approach is that if the queue is empty, every single sub-queue has to be checked in order to determine this (also, by the time one sub-queue is checked, a previously empty one could have become non-empty; but in practice this doesn't cause problems). However, in the non-empty case, there is much less contention overall because sub-queues can be "paired up" with consumers. This reduces data sharing to the near-optimal level (where every consumer is matched with exactly one producer), without losing the ability to handle the general case. This pairing is done with a heuristic that takes into account the last sub-queue a consumer successfully pulled from (essentially, it gives consumers an affinity). Of course, in order to do this pairing, some state has to be maintained between calls to dequeue; this is done using consumer-specific "tokens" that the user is in charge of allocating. Note that tokens are completely optional: without one, the queue merely reverts to searching every sub-queue for an element, which is correct, just slightly slower when many threads are involved.
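In code, the tokens look roughly like this. This usage sketch follows the API of the queue as published on GitHub (header name and calls may differ in later versions):

```cpp
#include "concurrentqueue.h"

#include <thread>

int main() {
    moodycamel::ConcurrentQueue<int> q;

    std::thread producer([&] {
        // A producer token pins this thread to its own sub-queue
        // explicitly, instead of looking it up on every call.
        moodycamel::ProducerToken ptok(q);
        for (int i = 0; i != 100; ++i)
            q.enqueue(ptok, i);
    });

    std::thread consumer([&] {
        // A consumer token holds the affinity state between dequeues,
        // remembering which sub-queue last yielded an element.
        moodycamel::ConsumerToken ctok(q);
        int item;
        for (int received = 0; received != 100; )
            if (q.try_dequeue(ctok, item))
                ++received;
    });

    producer.join();
    consumer.join();
}
```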

So, that's the high-level design. What about the core algorithm used within each sub-queue? Well, instead of being based on a linked list of nodes (which implies constantly allocating and freeing or re-using elements, and typically relies on a compare-and-swap loop which can be slow under heavy contention), I based my queue on an array model. Instead of linking individual elements, I have a "block" of several elements. The logical head and tail indices of the queue are represented using atomically-incremented integers. Between these logical indices and the blocks lies a scheme for mapping each index to its block and sub-index within that block. An enqueue operation simply increments the tail (remember, there's only one producer thread for each sub-queue). A dequeue operation increments the head if it sees that the head is less than the tail, and then it checks to see if it accidentally incremented the head past the tail (this can happen under contention: there are multiple consumer threads per sub-queue). If it did over-increment the head, a correction counter is incremented (making the queue eventually consistent), and if not, it goes ahead and increments another integer which gives it the actual final logical index. The increment of this final index always yields a valid index in the actual queue, regardless of what other threads are doing or have done; this works because the final index is only ever incremented when there's guaranteed to be at least one element to dequeue (which was checked when the first index was incremented).
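The following is a simplified sketch of that index dance, my reconstruction rather than the actual implementation: element storage, blocks, and memory-ordering arguments are omitted, and all names are hypothetical. It only shows the counter logic described above.

```cpp
#include <atomic>
#include <cstdint>

struct SubQueueIndices {
    std::atomic<std::uint64_t> tail{0};              // bumped by the single producer
    std::atomic<std::uint64_t> dequeueOptimistic{0}; // first, optimistic counter
    std::atomic<std::uint64_t> dequeueOvercommit{0}; // correction counter
    std::atomic<std::uint64_t> head{0};              // final, always-valid index

    // Producer side: a single atomic increment claims a slot.
    std::uint64_t claimEnqueueIndex() {
        return tail.fetch_add(1);
    }

    // Consumer side: returns true and sets `index` if an element was
    // claimed, false if the sub-queue appeared empty.
    bool claimDequeueIndex(std::uint64_t& index) {
        // Optimistically take a ticket, discounting past over-increments.
        std::uint64_t overcommit = dequeueOvercommit.load();
        std::uint64_t myTicket = dequeueOptimistic.fetch_add(1);
        if (myTicket - overcommit < tail.load()) {
            // Guaranteed at least one element remains, so this second
            // increment always yields a valid logical index.
            index = head.fetch_add(1);
            return true;
        }
        // Over-incremented past the tail: record the correction so the
        // counters stay eventually consistent.
        dequeueOvercommit.fetch_add(1);
        return false;
    }
};
```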

So there you have it. An enqueue operation is done with a single atomic increment, and a dequeue is done with two atomic increments in the fast path, and one extra otherwise. (Of course, this is discounting all the block allocation/re-use/reference-counting/block-mapping goop, which, while important, is not very interesting; in any case, most of those costs are amortized over an entire block's worth of elements.) The really interesting part of this design is that it allows extremely efficient bulk operations: in terms of atomic instructions (which tend to be a bottleneck), enqueueing X items in a block has exactly the same amount of overhead as enqueueing a single item (ditto for dequeueing), provided they're in the same block. That's where the real performance gains come in :-)
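A short usage sketch of those bulk operations, following the enqueue_bulk / try_dequeue_bulk calls in the published header (assumed from the GitHub version; check the header you actually drop in):

```cpp
#include "concurrentqueue.h"

#include <cstddef>

int main() {
    moodycamel::ConcurrentQueue<int> q;

    // One logical enqueue of 100 items: within a block, this costs the
    // same number of atomic instructions as enqueueing a single item.
    int items[100];
    for (int i = 0; i != 100; ++i) items[i] = i;
    q.enqueue_bulk(items, 100);

    // Likewise, dequeue up to 100 items in one operation.
    int out[100];
    std::size_t count = q.try_dequeue_bulk(out, 100);
    (void)count;  // number of items actually dequeued (may be < 100)
}
```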

I heard there was code

Since I thought there was rather a lack of high-quality lock-free queues for C++, I wrote one using this design I came up with. (While there are others, notably the ones in Boost and Intel's TBB, mine has more features, such as having no restrictions on the element type, and is faster to boot.) You can find it on GitHub. It's all contained in a single header, and available under the simplified BSD license. Just drop it in your project and enjoy!
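For the simplest case, usage looks like this (assuming the single header from the GitHub repository is on your include path):

```cpp
#include "concurrentqueue.h"

#include <cassert>

int main() {
    moodycamel::ConcurrentQueue<int> q;  // works with any element type
    q.enqueue(25);

    int item = 0;
    bool found = q.try_dequeue(item);  // non-blocking; false if empty
    assert(found && item == 25);
}
```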

Benchmarks, yay!

So, the fun part of creating data structures is writing synthetic benchmarks and seeing how fast yours is versus other existing ones. For comparison, I used the Boost 1.55 lock-free queue, Intel's TBB 4.3 concurrent_queue, another linked-list based lock-free queue of my own (a naïve design, for reference), a lock-based queue using std::mutex, and a normal std::queue (for reference against a regular data structure that's accessed purely from one thread). Note that the graphs below only show a subset of the results, and omit both the naïve lock-free and single-threaded std::queue implementations.

Here are the results! Detailed raw data follows the pretty graphs (note that I had to use a logarithmic scale due to the enormous differences in absolute throughput).

"Go" A Fast General Purpose Lock-free Queue for C + +

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.