Performance Comparison and Analysis of Locked and Lock-Free Programming

A network server I maintain recently ran into performance problems, which forced major changes to the original program framework. Most of the changes concerned the threads' working mode and the way data is passed between them; the net effect was to change how locks are used. After these improvements, the gigabit NIC can basically run at full speed. Once performance was up to standard, I wondered whether a more lightweight lock could be used, or the locks removed altogether. I searched the published research and ran some experiments to verify the results, which led to this article. I hope colleagues doing similar work can learn something from it; if anyone has relevant experience, please feel free to contact me.

 

1. Introduction to Lock-Free Programming

This section mainly summarizes document [1] and introduces some basic background.

So-called locked programming means that access to shared data must be serialized: every operation that changes shared data must exhibit atomic semantics, so even an operation as small as ++k requires a lock (see the sketch below). Lock-based programming brings its own problems, such as occasional efficiency loss, deadlock, and priority inversion, which the designer must carefully optimize around and solve. This article does not discuss those issues further.
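To see why, consider that ++k compiles into a separate load, add, and store; two threads can interleave between those steps and lose an update. A minimal sketch of the guarded version, using pthread_mutex as in the tests later in this article:

#include <pthread.h>

long k = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

// ++k is really three machine steps:
//   load k into a register; add 1; store the register back.
// Two threads interleaving between the load and the store both
// write the same value, losing one increment. Hence even ++k
// must be guarded:
void increment() {
    pthread_mutex_lock(&m);
    ++k;
    pthread_mutex_unlock(&m);
}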

Lock-free programming does not mean that every operation is atomic; only a very limited set of operations is, which is what makes lock-free programming so difficult. Does such a limited set of operations actually exist? If so, which atomic operations does it contain? Maurice Herlihy's 1991 paper "Wait-free synchronization" (winner of the 2003 Dijkstra Prize) [3] answered this question. Its conclusion: primitives such as test-and-set, swap, fetch-and-add, and even atomic queues cannot implement lock-free constructions for an arbitrary number of threads, whereas the simplest primitive of all, CAS (compare-and-swap), can implement every lockless function; LL/SC (load-linked/store-conditional) is equally powerful. Pseudocode for CAS is as follows:

template <class T>
bool CAS(T* addr, T expected, T value)
{
    if (*addr == expected)
    {
        *addr = value;
        return true;
    }
    return false;
}


 

CAS compares expected with the contents of a memory address; if they match, it replaces the memory contents with value and returns true. The pseudocode only describes the semantics: most machines now execute the whole sequence atomically at the hardware level, and on Intel processors the instruction is cmpxchg. CAS is therefore the most basic atomic operation.
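For reference, modern C++ exposes the same primitive as std::atomic<T>::compare_exchange_strong, which compiles down to cmpxchg on x86. A minimal sketch, assuming a C++11 compiler:

#include <atomic>

// Equivalent of the CAS pseudocode above, built on std::atomic.
// Note one difference from the pseudocode: on failure, the
// standard API writes the value it observed back into 'expected'.
template <class T>
bool cas(std::atomic<T>& addr, T expected, T value) {
    return addr.compare_exchange_strong(expected, value);
}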

Comparison between wait-free, lock-free, and locked approaches:

A wait-free procedure completes in a finite number of its own steps, regardless of the speed of other threads.

A lock-free procedure guarantees that at least one thread is always executing; individual threads may be delayed, but the system as a whole keeps moving forward.

With locks, while one thread holds a lock, other threads that need it cannot proceed. In general, lock-based designs must also take care to avoid deadlock and livelock.
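To make the distinction concrete, here is a hedged C++11 sketch. On x86, fetch_add compiles to a single lock xadd instruction, so every thread finishes in a bounded number of steps (wait-free); a CAS retry loop only guarantees that some thread succeeds on each round (lock-free). The retry loop is also the usual way an atomic_add1() of the kind tested in Section 3 is built:

#include <atomic>

std::atomic<long> counter{0};

// Wait-free on x86: one 'lock xadd', bounded steps per thread.
void increment_wait_free() {
    counter.fetch_add(1);
}

// Lock-free: the system always makes progress (some CAS succeeds),
// but an unlucky thread can keep losing the race and retrying.
void increment_lock_free() {
    long old = counter.load();
    while (!counter.compare_exchange_weak(old, old + 1)) {
        // 'old' was refreshed with the current value; retry.
    }
}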

2. Research and Progress of Lock-Free Programming

This section describes the lock-free algorithms and data structures that have already been implemented, following [2].

Over the past two decades, researchers have done a great deal of work on lock-free and wait-free algorithms and data structures. Wait-free and lock-free implementations of FIFO queues and LIFO stacks exist, and lock-free algorithms for more complex structures, such as priority queues, hash tables, and red-black trees, are gradually becoming known.

Implementations of lock-free algorithms rely on memory barriers and are therefore platform-dependent. The more mature implementations of atomic operations and algorithmic data structures are described below.

  • Midishare: source code is available under the GPL license. MidiShare includes implementations of lock-free FIFO queues and LIFO stacks.
  • Appcore: an SMP- and hyperthread-friendly library that uses lock-free techniques to implement stacks, queues, linked lists, and other useful data structures. Appcore currently appears to target x86 machines running Windows; its licensing terms are extremely unclear.
  • Noble: a library of non-blocking synchronization protocols. It implements lock-free stacks, queues, singly linked lists, snapshots, and registers. Noble is distributed under a license that only permits non-commercial academic use.
  • Lock-free-lib: published under the GPL license. It provides implementations of software transactional memory, multi-word CAS primitives, skip lists, binary search trees, and red-black trees, for Alpha, MIPS, IA64, x86, PPC, and SPARC.
  • Nonblocking multiprocessor/multithread algorithms in C++ (for MSVC/x86), posted by Joshua Scholar to musicdsp.org and presumably in the public domain. Included are a queue, a stack, reference-counted garbage collection, memory allocation, and templates for atomic algorithms and types. This code is largely untested.
  • Qprof: includes the atomic_ops library of atomic operations and data structures under an MIT-style license. It is only available for Linux at the moment, though there are plans to support other platforms.
  • Amino Concurrent Building Blocks: provides lock-free data structures and STM for C++ and Java under the Apache Software License 2.0.

Among them, Noble has since been commercialized, and its license is not cheap.
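To give a flavor of what these libraries implement internally, here is a minimal C++11 sketch of a Treiber-style lock-free LIFO stack, the classic structure behind the stacks listed above. This is illustrative only: it deliberately never frees popped nodes, because safe memory reclamation and the ABA problem are exactly what production implementations must handle (with hazard pointers, tagged pointers, or epochs):

#include <atomic>

template <class T>
class LockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head{nullptr};
public:
    void push(T v) {
        Node* n = new Node{v, head.load()};
        // On failure, compare_exchange_weak reloads n->next with the
        // current head, so the loop body can stay empty.
        while (!head.compare_exchange_weak(n->next, n)) {}
    }
    bool pop(T& out) {
        Node* n = head.load();
        // Retry until we swing head from n to n->next.
        while (n && !head.compare_exchange_weak(n, n->next)) {}
        if (!n) return false;
        out = n->value;
        // Deliberately leak n: freeing it safely requires hazard
        // pointers or similar, and naive reuse invites the ABA problem.
        return true;
    }
};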

3. Performance Analysis

This section compares the pthread mutex, the Windows atomic increment, and the CAS atomic operation; compares the performance of the lock-free FIFO queue implemented in MidiShare against a locked queue built on an STL list; and summarizes the optimization approaches.

3.1 Performance Test of Atomic Increments

The test machine's CPU is an Intel E5300 at 2.60 GHz.

The first test measures simple increment operations in a single thread: a ++ operation with no synchronization at all, a ++ protected by pthread_mutex, an atomic_add1() implemented with CAS semantics, and InterlockedIncrement() on Windows.

Operation                                    Throughput (ops/s)
i++                                          320 million
lock(p_mutex); i++; unlock(p_mutex);         10 million
cas_atomic_add1(i)                           10 million
InterlockedIncrement(&i)                     10 million

First, without any synchronization, the CPU can execute 320 million ++ operations per second, close to its clock rate. When pthread_mutex_lock() and pthread_mutex_unlock() are executed around each ++, the CPU manages only about 10 million per second; that is, roughly 10 million lock/unlock pairs per second, with locking and unlocking each costing about 15 times as much as the add instruction itself. CAS does slightly better, at about the same speed as InterlockedIncrement() on Windows.

Judging from the test results, the atomic increment on Windows performs essentially the same as the CAS-based increment; presumably Windows implements atomic addition on top of the CAS instruction cmpxchg. Of course, pthread_mutex, as a mutex, is also quite efficient: in the uncontended case, the cost of taking the lock is roughly equivalent to one CAS.

However, compared with the unsynchronized ++, even hardware-level synchronization causes at least an eight-fold performance drop.
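A benchmark loop along the following lines reproduces the single-thread increment test. This is a sketch under stated assumptions: POSIX threads on Linux with GCC or Clang, and the __sync_add_and_fetch builtin standing in for the article's atomic_add1(), whose source is not shown:

#include <pthread.h>
#include <cstdio>
#include <chrono>

static const long N = 10000000;          // 10 million increments per test
static volatile long counter = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

// Time one variant and report operations per second.
template <class F>
void run(const char* name, F body) {
    counter = 0;
    auto t0 = std::chrono::steady_clock::now();
    body();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    std::printf("%-10s %.0f ops/s\n", name, N / dt.count());
}

int main() {
    // 1. unsynchronized ++
    run("plain++", []{ for (long i = 0; i < N; i++) counter++; });
    // 2. mutex-protected ++
    run("mutex++", []{
        for (long i = 0; i < N; i++) {
            pthread_mutex_lock(&m);
            counter++;
            pthread_mutex_unlock(&m);
        }
    });
    // 3. CAS-style atomic add (GCC builtin, stand-in for atomic_add1)
    run("atomic++", []{ for (long i = 0; i < N; i++) __sync_add_and_fetch(&counter, 1); });
    return 0;
}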

Next, the pthread_mutex program is optimized by batching: executing ++ 8, 20, and 100 times per lock/unlock pair.

Operation                                            Throughput of ++ (ops/s)
lock(); for (k = 0; k < 8; i++, k++); unlock()       120 million
lock(); for (k = 0; k < 20; i++, k++); unlock()      200 million
lock(); for (k = 0; k < 100; i++, k++); unlock()     340 million

As a result, the CPU executes 120/200/340 million ++ operations per second respectively, which matches expectations, since the number of lock and unlock calls per second drops to 1/8, 1/20, and 1/100 of the original. At 100 increments per lock/unlock pair, performance essentially reaches the unsynchronized level. Of course, the atomic InterlockedIncrement() and the CAS-based atomic_add1() cannot take advantage of batching, so their best throughput is fixed.

For single-thread and multi-thread performance tests of atomic operations on Windows, refer to [4]; only the conclusions are listed here. The test machine in this subsection has an Intel 2.66 GHz dual-core processor.

A single thread executing 2 million atomic increment operations:

Operation                      Time
InterlockedIncrement           78 ms
Windows CriticalSection        172 ms
OpenMP lock operation          250 ms

Two threads executing 2 million atomic increment operations on a shared variable:

Operation                      Time
InterlockedIncrement           156 ms
Windows CriticalSection        3156 ms
OpenMP lock operation          1063 ms
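The two-thread measurement could be reconstructed roughly as follows; the code from [4] is not reproduced in the article, so the thread setup and timing here are assumptions, using only standard Win32 calls:

#include <windows.h>
#include <cstdio>

static volatile LONG counter = 0;
static CRITICAL_SECTION cs;
static const long N = 1000000;             // 1 million increments per thread

DWORD WINAPI atomic_worker(LPVOID) {
    for (long i = 0; i < N; i++)
        InterlockedIncrement(&counter);    // hardware-level atomic add
    return 0;
}

DWORD WINAPI cs_worker(LPVOID) {
    for (long i = 0; i < N; i++) {
        EnterCriticalSection(&cs);         // spins briefly, then kernel wait
        counter++;
        LeaveCriticalSection(&cs);
    }
    return 0;
}

int main() {
    InitializeCriticalSection(&cs);
    LPTHREAD_START_ROUTINE variants[2] = { atomic_worker, cs_worker };
    const char* names[2] = { "InterlockedIncrement", "CriticalSection" };
    for (int v = 0; v < 2; v++) {
        counter = 0;
        HANDLE h[2];
        DWORD t0 = GetTickCount();
        for (int i = 0; i < 2; i++)
            h[i] = CreateThread(NULL, 0, variants[v], NULL, 0, NULL);
        WaitForMultipleObjects(2, h, TRUE, INFINITE);
        std::printf("%s: %lu ms\n", names[v], GetTickCount() - t0);
        CloseHandle(h[0]); CloseHandle(h[1]);
    }
    DeleteCriticalSection(&cs);
    return 0;
}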

3.2 Performance Tests of Lock-Free and Locked Queues

The lock-free queue tested here is the MidiShare implementation, while the locked queue is built with pthread_mutex and a C++ STL list. Only the test results are listed here.

With the same data flowing through, the main thread enqueues and a worker thread dequeues, and the number of enqueue/dequeue operations per second is measured; the two rates are, of course, essentially equal.

The lock-free queue achieves between 1.5 and 2 million operations per second. This cannot be improved further, because every enqueue and dequeue performs a hardware-level mutual exclusion. For the locked queue, the results depend on the batch size X, the number of queue operations performed per lock/unlock pair:

Batch size X (queue operations per lock/unlock pair)    Result (ops/s)
X = 1                                                   0.4 million
X = 10                                                  1.9 million
X = 128                                                 3.5 million
X = 1000                                                4 million
X = 10000                                               3.96 million

This means that batching data between lock and unlock can greatly improve system performance, whereas atomic operations cannot be improved by batching at all.
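The batching effect in the table is the classic two-list pattern: the producer appends to a private list and splices a whole batch into the shared queue under one lock, while the consumer swaps the entire shared list out under one lock and drains it afterwards without holding it. A hedged sketch using pthread_mutex and std::list follows; the article does not show its test code, so this is a reconstruction of the idea, not the measured program:

#include <pthread.h>
#include <list>

template <class T>
class BatchedQueue {
    std::list<T> shared_;      // guarded by m_
    pthread_mutex_t m_;
public:
    BatchedQueue()  { pthread_mutex_init(&m_, 0); }
    ~BatchedQueue() { pthread_mutex_destroy(&m_); }

    // Producer side: one lock/unlock pair per batch of X items,
    // not per item; splice() moves list nodes in O(1) without copying.
    void enqueue_batch(std::list<T>& batch) {
        pthread_mutex_lock(&m_);
        shared_.splice(shared_.end(), batch);
        pthread_mutex_unlock(&m_);
    }

    // Consumer side: take everything in one lock/unlock pair and
    // process the items afterwards with no lock held.
    void dequeue_all(std::list<T>& out) {
        pthread_mutex_lock(&m_);
        out.splice(out.end(), shared_);
        pthread_mutex_unlock(&m_);
    }
};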

4. Conclusion

From the locked and lock-free performance tests above, we can draw the following conclusion: for the hardware-level mutual exclusion implemented by CAS, a single operation is cheaper than an application-level lock under the same conditions; but when multiple threads contend, the cost introduced by hardware-level mutual exclusion is of the same order as application-level lock contention. It is therefore not possible to significantly improve application performance simply by adopting CAS-based lock-free algorithms and data structures: hardware-level atomic operations still slow down the application layer and cannot be amortized further. By contrast, a well-designed locked multi-threaded program can avoid losing performance while achieving a high degree of concurrency.

However, we should also recognize the advantages of lock-free code at the application layer: for example, programmers need not worry about thorny issues such as deadlock and priority inversion. So when the application is not very complex and performance requirements are somewhat higher, multithreading with locks is a good fit; when the program is more complex, an application-level lock-free algorithm can be used, provided it meets the performance requirements.

For details on how to better schedule multi-threaded work, refer to [5], which introduces a good model of cooperation between threads, on the premise that the machine has enough processors to support several groups of threads working in parallel. If processors are scarce, scheduling among many threads increases the context-switching overhead on each core and lowers overall system performance.

 

References

[1] Lock-Free Data Structures, http://www.drdobbs.com/184401865

[2] Some Notes on Lock-Free and Wait-Free Algorithms, http://www.rossbencina.com/code/lockfree

[3] Wait-free Synchronization, http://www.podc.org/dijkstra/2003.html

[4] Performance comparison of locks and atomic operations with OpenMP threads, http://blog.163.com/kangtao-520/blog/static/772561452009510751068/

[5] Thread-group competition mode in multi-core programming, http://kangtao-520.blog.163.com/blog/static/77256145200951074121305/
