A network server I maintain recently ran into performance problems, so I made major changes to its original framework, mostly to the threading model and the data-transfer path, and ultimately to the way locks are used. After these improvements, the server can basically drive a gigabit network adapter at full speed. Once performance was up to standard, I began to wonder whether a more lightweight lock could be used, or the locks removed entirely. I looked up some research results and ran experiments to verify them, which is how this article came about. I hope colleagues doing similar work can learn from it; if anyone has relevant experience, please feel free to contact me.
1. Overview of Lock-Free Programming
This section gives a general introduction to the basics, drawing on [1].
So-called lock-based programming means that access to shared data must be serialized: every operation that modifies shared data must exhibit atomic semantics, so even an operation as simple as ++k requires a lock. Lock-based programming brings problems such as reduced efficiency, deadlock, and priority inversion, which the designer must carefully work around; this article does not discuss those three issues.
Lock-free programming does not mean that every operation is atomic; only a very limited set of operations is, which is exactly what makes lock-free programming so difficult. Does such a limited set exist, and if so, which atomic operations does it contain? Maurice Herlihy's paper "Wait-free synchronization" [3], which won the 2003 Dijkstra Prize, answered this question. Its conclusions are: primitives such as test-and-set, swap, and fetch-and-add, and even atomic queues, are not sufficient to build lock-free structures shared by an arbitrary number of threads; the single, very simple CAS (compare-and-swap) primitive, however, can implement all lock-free functionality, as can LL/SC (load-linked/store-conditional).
The pseudocode for CAS is as follows:

template <class T>
bool CAS(T* addr, T expected, T value)
{
    if (*addr == expected)
    {
        *addr = value;
        return true;
    }
    return false;
}
CAS compares expected with the contents of a memory address; if they match, it replaces those contents with the new value. Most machines today implement this operation at the hardware level; on Intel processors the instruction is CMPXCHG. CAS is therefore the most basic atomic operation.
Wait-free, lock-free, and lock-based approaches compare as follows:
- Wait-free: every thread completes its operation in a bounded number of steps, regardless of the speed of the other threads.
- Lock-free: at any moment at least one thread is guaranteed to make progress; other threads may be delayed, but the system as a whole keeps moving forward.
- Lock-based: while one thread holds a lock, the other threads cannot proceed, and care is generally needed to avoid deadlock and livelock.
2. Research Progress in Lock-Free Programming
This section introduces the lock-free algorithms and data structures that have been implemented, drawing on [2].
Over the past two decades, researchers have proposed many lock-free and wait-free algorithms and data structures, and have implemented some of them: wait-free and lock-free FIFO queues and LIFO stacks, as well as more complex priority queues and hash tables. Lock-free algorithms have gradually become widely known.
Implementations of lock-free algorithms rely on memory barriers and are therefore platform-dependent. The more mature implementations of atomic operations and lock-free data structures are listed below.
- Midishare source code is available under the GPL license. MidiShare primarily provides implementations of lock-free FIFO queues and LIFO stacks.
- Appcore is an SMP- and hyperthread-friendly library which uses lock-free techniques to implement stacks, queues, linked lists and other useful data structures. Appcore currently appears to be for x86 computers running Windows. The licensing terms of Appcore are extremely unclear.
- Noble is a library of non-blocking synchronisation protocols. It implements lock-free stacks, queues, singly linked lists, snapshots and registers. Noble is distributed under a license which only permits non-commercial academic use.
- Lock-free-lib is published under the GPL license. It includes implementations of software transactional memory, multi-word CAS primitives, skip lists, binary search trees, and red-black trees, for the Alpha, MIPS, IA64, x86, PPC, and other architecture types.
- Nonblocking multiprocessor/multithread algorithms in C++ (for MSVC/x86), posted by Joshua Scholar to musicdsp.org, and presumably in the public domain. Included are queues, stacks, reference-counted garbage collection, memory allocation, and templates for atomic algorithms and types. This code is largely untested.
- Qprof includes the atomic_ops library of atomic operations and data structures under an MIT-style license. It is only available for Linux at the moment, although there are plans to support other platforms.
- Amino Concurrent Building Blocks provides lock-free data structures and STM for C++ and Java under an Apache Software License (2.0).
Of these, Noble has been commercialized, and its license is not cheap.
3. Performance Analysis
This section compares the performance of pthread mutexes, the Windows atomic increment, and CAS-based atomic operations; it also compares and analyzes the lock-free FIFO queue implemented in MidiShare against a locked queue implemented with an STL list, and summarizes the optimization methods.
3.1 Performance Test of Atomic Increment
The test machine's CPU is an Intel E5300 at 2.60 GHz. The first test measures, in a single thread: a simple ++ operation with no synchronization at all; a ++ operation protected by pthread_mutex; atomic_add1(), implemented with CAS semantics; and InterlockedIncrement() on Windows.
Operation | Rate (ops/s)
i++ | 320 million
lock(p_mutex); i++; unlock(p_mutex); | 20 million
cas_atomic_add1(i) | 40 million
InterlockedIncrement(&i) | 40 million
First, without any synchronization, the CPU can execute about 320 million ++ operations per second, approaching the CPU's raw execution speed. When each ++ is wrapped in pthread_mutex_lock() and unlock() calls, the CPU manages only about 20 million per second; in other words, the CPU performs about 40 million lock or unlock operations per second, so the cost of one lock-plus-unlock pair is roughly 15 times that of the addition instruction itself. CAS fares slightly better, at about 40 million increments per second, which is very close to the speed of InterlockedIncrement() on Windows.
From these results, the Windows atomic increment and the CAS-based increment cost essentially the same; presumably Windows also implements its atomic increment on top of the CMPXCHG-based CAS instruction at the lowest level. pthread_mutex is also quite efficient as a mutex: in the uncontended case its locking overhead is on a par with CAS. Still, compared with an unsynchronized ++, even hardware-level synchronization costs at least an 8-fold drop in performance.
Next, the pthread_mutex version is optimized logically, performing 8, 20, or 100 ++ operations per lock/unlock pair, with each case tested separately.

Operation | Rate (++ ops/s)
lock(); for (k = 0; k < 8; i++, k++); unlock() | 120 million
lock(); for (k = 0; k < 20; i++, k++); unlock() | 200 million
lock(); for (k = 0; k < 100; i++, k++); unlock() | 340 million
As a result, the CPU executes 120 million, 200 million, and 340 million ++ operations per second respectively, which matches expectations, since the number of lock and unlock calls per second drops to 1/8, 1/20, and 1/100 of the original. With 100 ++ operations per lock/unlock pair, performance essentially reaches that of the unsynchronized case. The atomic InterlockedIncrement() and the CAS-based atomic_add1(), by contrast, cannot benefit from batching at all; their best performance is fixed.
For performance tests of the Windows atomic operations in single-threaded and multi-threaded scenarios, see reference [4]; only the conclusions are listed here. That test machine's CPU is a 2.66 GHz Intel dual-core processor.
A single thread executing 2 million atomic increment operations:

InterlockedIncrement | 78 ms
Windows CriticalSection | 172 ms
OpenMP lock operation | 250 ms

Two threads executing 2 million atomic increment operations on a shared variable:

InterlockedIncrement | 156 ms
Windows CriticalSection | 3156 ms
OpenMP lock operation | 1063 ms
3.2 Performance Test of Lock-Free and Locked Queues
The lock-free queue tested here is the one implemented in MidiShare; the locked queue is implemented with pthread_mutex and the C++ STL list. Only the test results are listed here.
With the main thread enqueuing data and a child thread dequeuing it, the number of enqueue/dequeue operations per second is measured (the two rates are, of course, basically the same). The lock-free queue achieves 1.5 to 2 million operations per second, and this cannot be improved further, because each enqueue operation is mutually exclusive at the hardware level. For the locked queue, the results depend on how many enqueues are performed per lock/unlock pair:
lock(); for (k = 0; k < X; i++, k++); unlock() | Result (ops/s)
X = 1 | 400 thousand
X = 10 | 1.9 million
X = 128 | 3.5 million
X = 1000 | 4 million
X = 10000 | 3.96 million
This shows that batching work between a single lock and unlock can greatly improve system performance, whereas atomic operations gain nothing from batching.
4. Conclusion
From the lock-based and lock-free performance tests above, we can draw the following conclusions. CAS is mutual exclusion implemented at the hardware level: a single CAS operation is more efficient than an application-level lock under the same conditions, but under multi-threaded contention the cost of hardware-level mutual exclusion is of the same order as application-level lock contention. Therefore, CAS-based lock-free algorithms and data structures cannot by themselves greatly improve program performance; the hardware-level atomic operation still slows the application down and cannot be optimized further. Conversely, a well-designed lock-based multi-threaded program need not lose performance and can still achieve a high degree of concurrency.
However, we should also recognize the application-level advantages of lock-free code: for example, programmers need not wrestle with difficult issues such as deadlock and priority inversion. So when the application is not very complex and the performance requirements are only moderately high, lock-based multi-threading is sufficient; when the program is complex and the performance requirements can still be met, application-level lock-free algorithms can be used.
As for how to organize the working mode of multiple threads, see reference [5], which introduces a good pattern of inter-thread cooperation. The premise, of course, is that the machine has enough processors to let several groups of threads work in parallel; if threads greatly outnumber processors, scheduling among them increases the context-switching overhead on each core and lowers overall system performance.
References
[1] Lock-Free Data Structures. http://www.drdobbs.com/184401865
[2] Some Notes on Lock-Free and Wait-Free Algorithms. http://www.rossbencina.com/code/lockfree
[3] Wait-Free Synchronization. http://www.podc.org/dijkstra/2003.html
[4] Performance comparison of locks and atomic operations with OpenMP threads. http://blog.163.com/kangtao-520/blog/static/772561452009510751068/
[5] Thread-group competition mode in multi-core programming. http://kangtao-520.blog.163.com/blog/static/77256145200951074121305/