Performance Comparison and Analysis of Lock-free and Lock-based Programming

Source: Internet
Author: User
Tags: lock, queue

 

A network server I maintain recently ran into performance problems, so I made major changes to the original program framework, mostly to the threads' working mode and the data transmission mode; the net effect was to change how locks are used. After these improvements, the gigabit network adapter can basically run at full speed.

Once performance was up to standard, I began to wonder whether a more lightweight lock could be used, or the lock removed entirely. To that end I searched for published research and ran some experiments to verify those results, which led to this article. I hope colleagues doing similar work can learn something from it; if anyone has relevant experience, please feel free to contact me.

 

1. Lock-free Programming Overview

This section draws mainly on reference [1] to give a general introduction to some basic background.

So-called lock-based programming means that shared data must be accessed in order: every operation that changes shared data must exhibit atomic semantics, so even an operation as small as ++k requires a lock. The transient efficiency loss, deadlock, priority inversion, and other problems of lock-based programming must be carefully optimized around or solved by the designer; this article does not discuss those issues.

Lock-free programming does not mean that all operations are atomic; only a very limited set of operations is atomic, which is what makes lock-free programming so difficult. Does such a limited set of operations exist? If so, which atomic operations does it contain?

Maurice Herlihy's paper "Wait-free synchronization" [3], which won the 2003 Dijkstra Prize, answered this question. Its conclusion: test-and-set, swap, and fetch-and-add, and even atomic queues, cannot implement lock-free structures shared by an arbitrary number of threads, while the simplest primitive, CAS (compare-and-swap), suffices to implement all lock-free functionality; so does LL/SC (load-linked/store-conditional).

The pseudo-code for CAS is as follows (note that the comparison is ==, not assignment):

template <class T>
bool CAS(T *addr, T expected, T value)
{
    if (*addr == expected)
    {
        *addr = value;
        return true;
    }
    return false;
}

 

CAS compares expected with the contents of a memory address; if they match, it replaces the memory contents with the new value. Most machines now implement this operation at the hardware level; on Intel processors the instruction is cmpxchg. CAS is thus the most basic atomic operation.

Comparison of wait-free, lock-free, and lock-based approaches:

Wait-free: every thread completes its operation in a bounded number of steps, regardless of the speed of other threads.

Lock-free: at least one thread is guaranteed to be making progress at any time; other threads may be delayed, but the system as a whole keeps moving forward.

Lock-based: while one thread holds a lock, other threads cannot enter the protected section; care is generally required to avoid deadlock and livelock.

 

2. Research and Progress of Lock-free Programming

 

This section draws on reference [2] to introduce the lock-free algorithms and data structures that have already been implemented.

Over the past two decades, researchers have studied lock-free and wait-free algorithms and data structures extensively. Wait-free and lock-free FIFO queues and LIFO stacks have been implemented, and more complex priority queues, hash tables, and other lock-free algorithms are gradually becoming known.

Lock-free implementations rely on memory barriers, so they are platform-dependent. Some of the more mature implementations of atomic operations and algorithmic data structures are listed below.

  • MidiShare source code

    Available under the GPL license. MidiShare provides
    implementations of lock-free FIFO queues and LIFO stacks.

  • Appcore

    An SMP- and hyperthread-friendly library which uses
    lock-free techniques to implement stacks, queues, linked lists, and
    other useful data structures. Appcore currently appears to target x86
    machines running Windows. The licensing terms of Appcore are extremely
    unclear.

  • Noble

    A library of non-blocking synchronisation protocols. Implements
    lock-free stacks, queues, singly linked lists, snapshots, and
    registers. Noble is distributed under a license which only permits
    non-commercial academic use.

  • Lock-free-lib

    Published under the GPL license. Includes implementations of software
    transactional memory, multi-word CAS primitives, skip lists, binary
    search trees, and red-black trees. For Alpha, MIPS, IA64, x86, PPC,
    and SPARC.

  • Non-blocking multiprocessor/multithread algorithms in C++

    (For MSVC/x86.) Posted by Joshua Scholar to musicdsp.org, and
    presumably in the public domain. Included are queues, stacks,
    reference-counted garbage collection, memory allocation, and templated
    atomic algorithms and types. This code is largely untested.

  • Qprof

    Includes the atomic_ops library of atomic operations and data
    structures under an MIT-style license. Only available for Linux at the
    moment, though there are plans to support other platforms.

  • Amino Concurrent Building Blocks

    Provides lock-free data structures and STM for C++ and Java under the
    Apache Software License (2.0).

Of these, Noble has been commercialized, and its license is not cheap.

3. Performance Analysis

This section compares the performance of the pthread mutex, the Windows atomic increment, and the CAS atomic operation, then compares the lock-free FIFO queue implemented in MidiShare against a locked queue built on the STL list, and summarizes the optimization approaches.

3.1 Performance Test of Atomic Increments

The test machine's CPU is an Intel E5300 at 2.60 GHz. The first step is a single-thread quantitative test of four variants: a plain ++ operation with no synchronization, a ++ operation protected by pthread_mutex, atomic_add1() implemented with CAS semantics, and InterlockedIncrement() under Windows.

 

Operation                                    Executions per second
i++                                          320 million
lock(p_mutex); i++; unlock(p_mutex);         20 million
cas_atomic_add1(i)                           40 million
InterlockedIncrement(&i)                     40 million

First, without any synchronization, the CPU executes about 320 million ++ operations per second, close to its clock frequency. When every ++ is wrapped in pthread_mutex_lock()/unlock(), the CPU completes only 20 million per second; in other words, it can perform roughly 40 million lock or unlock operations per second, so the cost of one lock/unlock pair is about 15 times that of the addition instruction itself. CAS fares slightly better at 40 million per second, a speed nearly identical to InterlockedIncrement() under Windows.

From these results, the Windows atomic increment and the CAS-based increment cost essentially the same; presumably Windows also implements its atomic increment on top of the cmpxchg CAS instruction. pthread_mutex is likewise very efficient as a mutex: in the uncontended case, its overhead is on the same order as CAS.

Even so, compared with a plain ++ operation, even hardware-level synchronization causes at least an 8x performance drop.

Next, the pthread_mutex program was optimized logically and retested, executing the ++ operation 8, 20, and 100 times per lock/unlock pair:

Operation                                           ++ executions per second
lock(); for (k = 0; k < 8; i++, k++); unlock()      120 million
lock(); for (k = 0; k < 20; i++, k++); unlock()     200 million
lock(); for (k = 0; k < 100; i++, k++); unlock()    340 million

The resulting ++ rates of 120, 200, and 340 million per second match expectations, since the lock/unlock now occurs only 1/8, 1/20, and 1/100 as often as before. With 100 ++ operations per lock/unlock pair, performance essentially reaches the unsynchronized level. Of course, the atomic InterlockedIncrement() and the CAS-based atomic_add1() cannot benefit from this kind of batching; whatever the workload, their best throughput is fixed.

For more detailed performance tests of Windows atomic operations in single-threaded and multi-threaded scenarios, see reference [4]; only the conclusions are listed here. The test machine there has a 2.66 GHz Intel dual-core CPU.

A single thread executing 2 million atomic increment operations:

InterlockedIncrement          78 ms
Windows CriticalSection       172 ms
OpenMP lock operation         250 ms

Two threads executing 2 million atomic increment operations on a shared variable:

InterlockedIncrement          156 ms
Windows CriticalSection       3156 ms
OpenMP lock operation         1063 ms

3.2 Performance Test of Lock-free and Locked Queues

The lock-free queue tested here comes from MidiShare; the locked queue is implemented with pthread_mutex and the C++ STL list. Only the test results are listed.

With the main thread enqueuing data and a child thread dequeuing the same data, the per-second enqueue and dequeue rates are, of course, basically the same.

The lock-free queue sustains roughly 1.5-2 million enqueue operations per second, and this cannot be improved further, because each enqueue operation is mutually exclusive at the hardware level. For the locked queue, the results below vary with the number of enqueues performed per lock/unlock pair:

Operation: lock(); for (k = 0; k < X; i++, k++); unlock()

X              Result (times/s)
X = 1          400 thousand
X = 10         1.9 million
X = 128        3.5 million
X = 1000       4 million
X = 10000      3.96 million

This shows that batching work between a lock and its unlock can greatly improve system throughput, whereas atomic operations cannot be batched.

4. Conclusion

From the lock-free and lock-based performance tests above, we can draw the following conclusion: CAS implements mutual exclusion at the hardware level, and a single CAS operation is more efficient than an application-level lock under the same conditions. However, when multiple threads contend, the cost introduced by hardware-level mutual exclusion is comparable to application-level lock contention. Therefore, adopting CAS-based lock-free algorithms and data structures will not, by itself, greatly improve program performance: hardware-level atomic operations slow the application down and leave no room for further optimization. Conversely, a well-designed multi-threaded program using locks can avoid losing performance while still achieving a high degree of concurrency.

That said, application-level lock-free code has real advantages: for example, programmers need not worry about hard problems such as deadlock and priority inversion. So for applications that are not very complex and have only moderately high performance requirements, lock-based multi-threading is the practical choice; application-level lock-free algorithms are worth using when the program is complex and its performance requirements justify them.

 

For how to arrange the multi-threaded working mode, see reference [5], which introduces a good pattern of inter-thread cooperation. The premise, of course, is that the machine has enough processors to support several groups of threads working in parallel. If the number of processors is insufficient, scheduling among many threads increases the context-switching overhead on each core and reduces overall system performance.

 

References

[1] Lock-free data structures. http://www.drdobbs.com/184401865

[2] Some notes on lock-free and wait-free algorithms. http://www.rossbencina.com/code/lockfree

[3] Wait-free synchronization. http://www.podc.org/dijkstra/2003.html

[4] Comparison of lock and atomic operation performance in OpenMP thread creation. http://blog.163.com/kangtao-520/blog/static/772561452009510751068/

[5] Thread-grouping competition mode in multi-core programming. http://kangtao-520.blog.163.com/blog/static/77256145200951074121305/

 
