A network server I maintain recently ran into performance problems, so I made major changes to its original framework, mostly to the threading model and the data-transfer path, and ultimately to the way locks are used. After these improvements, the server can basically drive a gigabit network adapter at full speed. Once performance was up to standard, I began to wonder whether a more lightweight lock could be used, or the locks removed entirely. I looked up some research results and ran experiments to verify them, which is how this article came about. I hope colleagues doing similar work can learn from it; if anyone has relevant experience, please feel free to contact me.
1. Overview of Lock-Free Programming
This section gives a general introduction to the basics, drawing on [1].
So-called lock-based programming means that access to shared data must be serialized: every operation that modifies shared data must exhibit atomic semantics, so even an operation as simple as ++k requires a lock. Lock-based programming brings problems such as reduced efficiency, deadlock, and priority inversion, which the designer must carefully work around; this article does not discuss those three issues.
Lock-free programming does not mean that every operation is atomic; only a very limited set of operations is, which is exactly what makes lock-free programming so difficult. Does such a limited set exist, and if so, which atomic operations does it contain? Maurice Herlihy's paper "Wait-free synchronization" [3], which won the 2003 Dijkstra Prize, answered this question. Its conclusions are: primitives such as test-and-set, swap, and fetch-and-add, and even atomic queues, are not sufficient to build lock-free structures shared by an arbitrary number of threads; the single, very simple CAS (compare-and-swap) primitive, however, can implement all lock-free functionality, as can LL/SC (load-linked/store-conditional).
The pseudocode for CAS is as follows:

template <class T>
bool CAS(T* addr, T expected, T value)
{
    if (*addr == expected)
    {
        *addr = value;
        return true;
    }
    return false;
}
CAS compares expected with the contents of a memory address; if they match, it replaces those contents with the new value. Most machines today implement this operation at the hardware level; on Intel processors the instruction is CMPXCHG. CAS is therefore the most basic atomic operation.
Wait-free, lock-free, and lock-based approaches compare as follows:
- Wait-free: every thread completes its operation in a bounded number of steps, regardless of the speed of the other threads.
- Lock-free: at any moment at least one thread is guaranteed to make progress; other threads may be delayed, but the system as a whole keeps moving forward.
- Lock-based: while one thread holds a lock, the other threads cannot proceed, and care is generally needed to avoid deadlock and livelock.
2. Research Progress in Lock-Free Programming
This section introduces the lock-free algorithms and data structures that have been implemented, drawing on [2].
Over the past two decades, researchers have proposed many lock-free and wait-free algorithms and data structures, and have implemented some of them: wait-free and lock-free FIFO queues and LIFO stacks, as well as more complex priority queues and hash tables. Lock-free algorithms have gradually become widely known.
Implementations of lock-free algorithms rely on memory barriers and are therefore platform-dependent. The more mature implementations of atomic operations and lock-free data structures are listed below.
- Midishare source code is available under the GPL license. MidiShare primarily provides implementations of lock-free FIFO queues and LIFO stacks.
- Appcore is an SMP- and hyperthread-friendly library which uses lock-free techniques to implement stacks, queues, linked lists and other useful data structures. Appcore currently appears to be for x86 computers running Windows. The licensing terms of Appcore are extremely unclear.
- Noble is a library of non-blocking synchronisation protocols. It implements lock-free stacks, queues, singly linked lists, snapshots and registers. Noble is distributed under a license which only permits non-commercial academic use.
- Lock-free-lib is published under the GPL license. It includes implementations of software transactional memory, multi-word CAS primitives, skip lists, binary search trees, and red-black trees, for the Alpha, MIPS, IA64, x86, PPC, and other architecture types.
- Nonblocking multiprocessor/multithread algorithms in C++ (for MSVC/x86), posted by Joshua Scholar to musicdsp.org, and presumably in the public domain. Included are queues, stacks, reference-counted garbage collection, memory allocation, and templates for atomic algorithms and types. This code is largely untested.
- Qprof includes the atomic_ops library of atomic operations and data structures under an MIT-style license. It is only available for Linux at the moment, although there are plans to support other platforms.
- Amino Concurrent Building Blocks provides lock-free data structures and STM for C++ and Java under an Apache Software License (2.0).
Of these, Noble has been commercialized, and its license is not cheap.
3. Performance Analysis
This section compares the performance of pthread mutexes, the Windows atomic increment, and CAS-based atomic operations; it also compares and analyzes the lock-free FIFO queue implemented in MidiShare against a locked queue implemented with an STL list, and summarizes the optimization methods.
3.1 Performance Test of Atomic Increment
The test machine's CPU is an Intel E5300 at 2.60 GHz. The first test measures, in a single thread: a simple ++ operation with no synchronization at all; a ++ operation protected by pthread_mutex; atomic_add1(), implemented with CAS semantics; and InterlockedIncrement() on Windows.
Operation | Rate (ops/s)
i++ | 320 million
lock(p_mutex); i++; unlock(p_mutex); | 20 million
cas_atomic_add1(i) | 40 million
InterlockedIncrement(&i) | 40 million
First, without any synchronization, the CPU can execute about 320 million ++ operations per second, approaching the CPU's raw execution speed. When each ++ is wrapped in pthread_mutex_lock() and unlock() calls, the CPU manages only about 20 million per second; in other words, the CPU performs about 40 million lock or unlock operations per second, so the cost of one lock-plus-unlock pair is roughly 15 times that of the addition instruction itself. CAS fares slightly better, at about 40 million increments per second, which is very close to the speed of InterlockedIncrement() on Windows.
From these results, the Windows atomic increment and the CAS-based increment cost essentially the same; presumably Windows also implements its atomic increment on top of the CMPXCHG-based CAS instruction at the lowest level. pthread_mutex is also quite efficient as a mutex: in the uncontended case its locking overhead is on a par with CAS. Still, compared with an unsynchronized ++, even hardware-level synchronization costs at least an 8-fold drop in performance.
Next, the pthread_mutex version is optimized logically, performing 8, 20, or 100 ++ operations per lock/unlock pair, with each case tested separately.

Operation | Rate (++ ops/s)
lock(); for (k = 0; k < 8; i++, k++); unlock() | 120 million
lock(); for (k = 0; k < 20; i++, k++); unlock() | 200 million
lock(); for (k = 0; k < 100; i++, k++); unlock() | 340 million
As a result, the CPU executes 120 million, 200 million, and 340 million ++ operations per second respectively, which matches expectations, since the number of lock and unlock calls per second drops to 1/8, 1/20, and 1/100 of the original. With 100 ++ operations per lock/unlock pair, performance essentially reaches that of the unsynchronized case. The atomic InterlockedIncrement() and the CAS-based atomic_add1(), by contrast, cannot benefit from batching at all; their best performance is fixed.
For performance tests of the Windows atomic operations in single-threaded and multi-threaded scenarios, see reference [4]; only the conclusions are listed here. That test machine's CPU is a 2.66 GHz Intel dual-core processor.
A single thread executing 2 million atomic increment operations:

InterlockedIncrement | 78 ms
Windows CriticalSection | 172 ms
OpenMP lock operation | 250 ms

Two threads executing 2 million atomic increment operations on a shared variable:

InterlockedIncrement | 156 ms
Windows CriticalSection | 3156 ms
OpenMP lock operation | 1063 ms
3.2 Performance Test of Lock-Free and Locked Queues
The lock-free queue tested here is the one implemented in MidiShare; the locked queue is implemented with pthread_mutex and the C++ STL list. Only the test results are listed here.
With the main thread enqueuing data and a child thread dequeuing it, the number of enqueue/dequeue operations per second is measured (the two rates are, of course, basically the same). The lock-free queue achieves 1.5 to 2 million operations per second, and this cannot be improved further, because each enqueue operation is mutually exclusive at the hardware level. For the locked queue, the results depend on how many enqueues are performed per lock/unlock pair:
lock(); for (k = 0; k < X; i++, k++); unlock() | Result (ops/s)
X = 1 | 400 thousand
X = 10 | 1.9 million
X = 128 | 3.5 million
X = 1000 | 4 million
X = 10000 | 3.96 million
This shows that batching work between a single lock and unlock can greatly improve system performance, whereas atomic operations gain nothing from batching.
4. Conclusion
From the lock-based and lock-free performance tests above, we can draw the following conclusions. CAS is mutual exclusion implemented at the hardware level: a single CAS operation is more efficient than an application-level lock under the same conditions, but under multi-threaded contention the cost of hardware-level mutual exclusion is of the same order as application-level lock contention. Therefore, CAS-based lock-free algorithms and data structures cannot by themselves greatly improve program performance; the hardware-level atomic operation still slows the application down and cannot be optimized further. Conversely, a well-designed lock-based multi-threaded program need not lose performance and can still achieve a high degree of concurrency.
However, we should also recognize the application-level advantages of lock-free code: for example, programmers need not wrestle with difficult issues such as deadlock and priority inversion. So when the application is not very complex and the performance requirements are only moderately high, lock-based multi-threading is sufficient; when the program is complex and the performance requirements can still be met, application-level lock-free algorithms can be used.
As for how to organize the working mode of multiple threads, see reference [5], which introduces a good pattern of inter-thread cooperation. The premise, of course, is that the machine has enough processors to let several groups of threads work in parallel; if threads greatly outnumber processors, scheduling among them increases the context-switching overhead on each core and lowers overall system performance.
References
[1] Lock-Free Data Structures. http://www.drdobbs.com/184401865
[2] Some Notes on Lock-Free and Wait-Free Algorithms. http://www.rossbencina.com/code/lockfree
[3] Wait-Free Synchronization. http://www.podc.org/dijkstra/2003.html
[4] Performance comparison of locks and atomic operations with OpenMP threads. http://blog.163.com/kangtao-520/blog/static/772561452009510751068/
[5] Thread-group competition mode in multi-core programming. http://kangtao-520.blog.163.com/blog/static/77256145200951074121305/