Performance Comparison and Analysis of Lock-Free Programming and Lock-Based Programming

Source: Internet
Author: User
Tags: lock, queue

A network server I maintain recently ran into performance problems, so major changes were made to the original program framework. Most of the changes were to the threads' working mode and the way data is passed between them, and ultimately to the way locks are used. After these improvements, the gigabit NIC can basically run at full speed. Once performance was up to standard, I started wondering whether a more lightweight lock could be used, or whether the lock could be removed altogether. I therefore looked up some research results and ran a few experiments to verify them, which is how this article came about. I hope colleagues doing similar work can use it as a reference; if anyone has relevant experience, please feel free to contact me.

 

1. Introduction to Lock-Free Programming

This section mainly summarizes [1] and introduces some basic background.

So-called lock-based programming means that when data is shared, access to it must be serialized. Every operation that modifies the shared data must appear atomic; even an operation as simple as ++k requires a lock. Lock-based programming also brings problems such as reduced efficiency, deadlock, and priority inversion, which the designer must carefully optimize around and resolve; this article does not discuss those issues.

Lock-free programming does not mean that every operation is atomic; only a very limited set of operations is, which is what makes lock-free programming so difficult. Does such a set of operations exist, and if so, which atomic primitives does it contain? Maurice Herlihy's paper "Wait-Free Synchronization" [3] (which received the 2003 Dijkstra Prize) answered this question. Its conclusion, roughly: primitives such as test-and-set, swap, and fetch-and-add, and even atomic queues, are not sufficient to build lock-free implementations of arbitrary objects shared by arbitrarily many threads, whereas the simple compare-and-swap (CAS) primitive is sufficient to implement all lock-free functionality, as are equivalents such as LL/SC (load-linked/store-conditional). The CAS pseudocode is as follows:

template <class T>
bool CAS(T* addr, T expected, T value)
{
    if (*addr == expected) {
        *addr = value;
        return true;
    }
    return false;
}


 

CAS compares the contents of a memory address with expected; if they match, it replaces the memory contents with value. Most machines implement this operation at the hardware level; on Intel processors it is the CMPXCHG instruction. CAS is therefore the most basic atomic operation.
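
To make the "CAS can implement everything else" claim concrete, here is a minimal sketch of an atomic increment built as a CAS retry loop, in the spirit of the atomic_add1()/CAS_atomic_add1() used in the tests below. This is a reconstruction under assumptions, not the original code; it uses GCC's __sync_bool_compare_and_swap builtin as the hardware CAS.

// Hypothetical reconstruction of a CAS-based atomic increment (not the article's code).
// Assumes GCC's __sync_bool_compare_and_swap builtin, which maps to CMPXCHG on x86.
static inline void cas_atomic_add1(volatile long* addr)
{
    for (;;) {
        long old = *addr;                                    // snapshot the current value
        if (__sync_bool_compare_and_swap(addr, old, old + 1))
            return;                                          // nobody changed it in between: done
        // another thread modified *addr between the read and the CAS; retry
    }
}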

Comparison of wait-free, lock-free, and locked approaches

A wait-free procedure completes in a bounded number of steps, regardless of the speed of other threads.

A lock-free procedure guarantees that at least one thread is making progress at any time. Individual threads may be delayed, but the system as a whole keeps moving forward.

With locks, if the thread holding a lock is delayed or suspended, other threads that need the lock cannot make progress. In addition, lock-based designs must take care to avoid deadlock and livelock.
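
To make the three progress guarantees concrete, here is a small illustration of the same counter increment under each guarantee (my own sketch, assuming x86 and the GCC __sync builtins, not code from the article):

#include <pthread.h>

// Wait-free: completes in a bounded number of steps no matter what other threads do.
// On x86, __sync_fetch_and_add compiles to a single LOCK XADD instruction.
void increment_wait_free(volatile long* v) { __sync_fetch_and_add(v, 1); }

// Lock-free: the CAS retry loop shown earlier (cas_atomic_add1). One thread may lose
// the race and retry repeatedly, but whenever it loses, some other thread has
// succeeded, so the system as a whole always makes progress.

// Locked: if the thread that holds the mutex is preempted or delayed, every other
// thread that needs the counter is blocked behind it until the lock is released.
void increment_locked(long* v, pthread_mutex_t* m)
{
    pthread_mutex_lock(m);
    ++*v;
    pthread_mutex_unlock(m);
}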

2. Research and Progress of Lock-Free Programming

This section describes the lock-free algorithms and data structures that have already been implemented, as surveyed in [2].

Over the past two decades, researchers have done a great deal of work on lock-free and wait-free algorithms and data structures. Wait-free and lock-free implementations of FIFO queues and LIFO stacks exist, and lock-free versions of more complex structures such as priority queues, hash tables, and red-black trees are gradually becoming known.

Implementations of lock-free algorithms rely on atomic operations and memory barriers, and are therefore platform-dependent. The more mature implementations of atomic operations and lock-free data structures are listed below.

  • MidiShare Source Code is available under the GPL license. MidiShare includes implementations of lock-free FIFO queues and LIFO stacks.
  • Appcore is an SMP- and HyperThreading-friendly library that uses lock-free techniques to implement stacks, queues, linked lists, and other useful data structures. Appcore currently appears to target x86 machines running Windows. The licensing terms of Appcore are extremely unclear.
  • Noble is a library of non-blocking synchronisation protocols. It implements lock-free stacks, queues, singly linked lists, snapshots, and registers. Noble is distributed under a license that only permits non-commercial academic use.
  • Lock-free-lib is published under the GPL license. It includes implementations of software transactional memory, multi-word CAS primitives, skip lists, binary search trees, and red-black trees, for Alpha, MIPS, IA64, x86, PPC, and SPARC.
  • Nonblocking multiprocessor/multithread algorithms in C++ (for MSVC/x86), posted by Joshua Scholar to musicdsp.org and presumably in the public domain. Included are a queue, a stack, reference-counted garbage collection, memory allocation, and templates for atomic algorithms and types. This code is largely untested.
  • Qprof includes the Atomic_ops library of atomic operations and data structures under an MIT-style license. It is only available for Linux at the moment, but there are plans to support other platforms.
  • Amino Concurrent Building Blocks provides lock-free data structures and STM for C++ and Java under the Apache Software License (2.0).

Among these, Noble has been commercialized, and its license is not cheap.
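
For a flavor of what these libraries implement internally, the sketch below shows a Treiber-style lock-free LIFO stack built directly on CAS. This is my own illustration, not code from any of the libraries above; a production implementation must additionally handle the ABA problem and safe memory reclamation, which the libraries solve with version tags, hazard pointers, or garbage collection.

#include <cstddef>

// Minimal Treiber-style lock-free stack sketch (illustration only).
// Assumes GCC's __sync_bool_compare_and_swap as the CAS primitive; the __sync
// builtins also act as full memory barriers.
struct Node {
    void* data;
    Node* next;
};

static Node* g_top = NULL;   // the only shared word; all synchronization is CAS on it

void lf_push(Node* n)
{
    for (;;) {
        Node* old_top = g_top;
        n->next = old_top;                                   // link the new node on top
        if (__sync_bool_compare_and_swap(&g_top, old_top, n))
            return;                                          // n is now the top of the stack
        // another thread changed the top concurrently; retry
    }
}

Node* lf_pop()
{
    for (;;) {
        Node* old_top = g_top;
        if (old_top == NULL)
            return NULL;                                     // stack is empty
        Node* next = old_top->next;                          // unsafe without ABA/reclamation handling
        if (__sync_bool_compare_and_swap(&g_top, old_top, next))
            return old_top;                                  // we own the popped node
        // lost the race; retry
    }
}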

3. Performance Analysis

This section compares the performance of the pthread mutex, the Windows atomic increment, and the CAS atomic operation, and then compares the lock-free FIFO queue implemented in MidiShare with a locked queue built on an STL list, summarizing the optimization methods at the end.

3.1 Performance Test of Atomic Increment

The CPU of the test machine is an Intel E5300 at 2.60 GHz.

The first test covers simple increment operations, all measured quantitatively in a single thread: a plain ++ with no synchronization, a ++ protected by pthread_mutex, an atomic_add1() implemented with CAS semantics, and InterlockedIncrement() on Windows.

Operation                                  Throughput (per second)
i++                                        320 million
lock(p_mutex); i++; unlock(p_mutex);       10 million
CAS_atomic_add1(i)                         10 million
InterlockedIncrement(&i)                   10 million

First, without any synchronization, the CPU can execute the ++ operation 320 million times per second, close to its clock speed. When pthread_mutex_lock() and pthread_mutex_unlock() are executed around every ++, the CPU only gets through about 10 million iterations per second; in other words, it performs roughly 10 million lock/unlock pairs per second, so locking or unlocking each costs roughly 15 times as much as the add instruction itself. CAS fares slightly better, and its speed is similar to that of InterlockedIncrement() on Windows.

From the test results, the Windows atomic increment performs essentially the same as the CAS-based increment; presumably Windows implements atomic addition underneath with the CAS instruction CMPXCHG. pthread_mutex, as a mutex, is also quite efficient: in the uncontended case, the cost of taking the lock is comparable to the cost of one CAS.

Even so, compared with the unsynchronized ++ operation, hardware-level synchronization still causes at least an eight-fold drop in performance.
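
For reference, below is a minimal sketch of the kind of single-thread micro-benchmark behind the table above. It is my reconstruction under assumptions, not the original test program; it times the plain ++, the mutex-protected ++, and a hardware atomic increment (via GCC's __sync_fetch_and_add), with the Windows InterlockedIncrement() variant omitted since the sketch targets pthreads.

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

static volatile long counter = 0;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static double now_sec()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main()
{
    const long N = 100 * 1000 * 1000;      // iterations per variant (arbitrary choice)

    double t0 = now_sec();
    for (long i = 0; i < N; ++i)
        ++counter;                          // plain ++, no synchronization
    printf("plain ++  : %.0f ops/s\n", N / (now_sec() - t0));

    t0 = now_sec();
    for (long i = 0; i < N; ++i) {
        pthread_mutex_lock(&mtx);           // lock/unlock around every single increment
        ++counter;
        pthread_mutex_unlock(&mtx);
    }
    printf("mutex ++  : %.0f ops/s\n", N / (now_sec() - t0));

    t0 = now_sec();
    for (long i = 0; i < N; ++i)
        __sync_fetch_and_add(&counter, 1);  // hardware atomic increment
    printf("atomic ++ : %.0f ops/s\n", N / (now_sec() - t0));

    return 0;
}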

Next, the pthread_mutex version was optimized logically, executing 8, 20, and 100 ++ operations per lock/unlock pair:

Operation                                             Increments per second
lock(); for (k = 0; k < 8; i++, k++); unlock();       120 million
lock(); for (k = 0; k < 20; i++, k++); unlock();      200 million
lock(); for (k = 0; k < 100; i++, k++); unlock();     340 million

The result: the CPU can now execute 120 / 200 / 340 million ++ operations per second, which matches expectations, because the number of lock and unlock calls per second falls to 1/8, 1/20, and 1/100 of the original. With 100 increments per lock/unlock pair, performance essentially reaches the unsynchronized level. Of course, InterlockedIncrement() and the CAS-based atomic_add1() cannot take advantage of batching like this: their per-operation cost is fixed, so their best-case throughput is bounded.

For single-thread and multi-thread performance tests of the Windows atomic operations, see [4]; only the conclusions are listed here. The test machine used there has an Intel 2.66 GHz dual-core processor.

A single thread executes 2 million atomic increment operations:

Method                        Time
InterlockedIncrement          78 ms
Windows CRITICAL_SECTION      172 ms
OpenMP lock                   250 ms

Two threads perform 2 million atomic increment operations on a shared variable:

Method                        Time
InterlockedIncrement          156 ms
Windows CRITICAL_SECTION      3156 ms
OpenMP lock                   1063 ms
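
Below is a hedged sketch of the kind of Windows test that could produce numbers like these (my reconstruction of the experiment described in [4], not its actual code). It runs two threads over one shared variable, once with InterlockedIncrement and once with a CRITICAL_SECTION, and reports the elapsed time.

#include <windows.h>
#include <stdio.h>

static volatile LONG g_value = 0;
static CRITICAL_SECTION g_cs;

static const int kIterations = 2 * 1000 * 1000;    // 2 million increments per thread

DWORD WINAPI AtomicWorker(LPVOID)
{
    for (int i = 0; i < kIterations; ++i)
        InterlockedIncrement(&g_value);             // lock-prefixed hardware increment
    return 0;
}

DWORD WINAPI CriticalSectionWorker(LPVOID)
{
    for (int i = 0; i < kIterations; ++i) {
        EnterCriticalSection(&g_cs);                // user-mode lock; kernel wait on contention
        ++g_value;
        LeaveCriticalSection(&g_cs);
    }
    return 0;
}

static void RunPair(LPTHREAD_START_ROUTINE fn, const char* name)
{
    HANDLE threads[2];
    DWORD start = GetTickCount();
    for (int i = 0; i < 2; ++i)
        threads[i] = CreateThread(NULL, 0, fn, NULL, 0, NULL);
    WaitForMultipleObjects(2, threads, TRUE, INFINITE);
    printf("%s: %lu ms\n", name, (unsigned long)(GetTickCount() - start));
    for (int i = 0; i < 2; ++i)
        CloseHandle(threads[i]);
}

int main()
{
    InitializeCriticalSection(&g_cs);
    RunPair(AtomicWorker, "InterlockedIncrement");
    RunPair(CriticalSectionWorker, "CRITICAL_SECTION");
    DeleteCriticalSection(&g_cs);
    return 0;
}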

3.2 Performance Test of Lock-Free and Locked Queues

The lock-free queue tested here is the MidiShare implementation, while the locked queue is implemented with pthread_mutex and a C++ STL list. Only the test results are listed here.

With the same data being transferred, the test counts how many enqueue/dequeue operations per second the main thread (enqueuing) and a worker thread (dequeuing) achieve; the two rates are, of course, essentially the same.

The lock-free queue achieves roughly 1.5 to 2 million operations per second. This cannot be pushed much higher, because every enqueue/dequeue operation performs a hardware-level mutual exclusion. For the locked queue, the results below depend on how many queue operations are batched per lock/unlock pair:

lock(); X queue operations; unlock()    Result (operations per second)
X = 1                                   0.4 million
X = 10                                  1.9 million
X = 128                                 3.5 million
X = 1000                                4 million
X = 10000                               3.96 million

This shows that batching data operations under a single lock acquisition can greatly improve system performance, whereas atomic operations cannot be batched in this way.
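
This is the same kind of batching that helped the server mentioned in the introduction: instead of paying for the lock once per element, the consumer pays for it once per batch. A minimal sketch follows, assuming a pthread_mutex-protected std::list as in the tests above (an illustration of the pattern, not the original server code).

#include <pthread.h>
#include <list>

struct LockedQueue {
    std::list<int>  items;
    pthread_mutex_t mtx;

    LockedQueue()  { pthread_mutex_init(&mtx, NULL); }
    ~LockedQueue() { pthread_mutex_destroy(&mtx); }

    // Producer side: one lock/unlock per element (X = 1 in the table above).
    void enqueue(int v)
    {
        pthread_mutex_lock(&mtx);
        items.push_back(v);
        pthread_mutex_unlock(&mtx);
    }

    // Consumer side: one lock/unlock drains everything currently queued, so the
    // lock cost is amortized over the whole batch (large X in the table above).
    void dequeue_batch(std::list<int>& out)
    {
        pthread_mutex_lock(&mtx);
        out.splice(out.end(), items);   // O(1): moves all queued elements into 'out'
        pthread_mutex_unlock(&mtx);
        // 'out' can now be processed outside the lock
    }
};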

4. Conclusion

From the lock-free and lock-based performance tests above, we can draw the following conclusions. For the hardware-level mutual exclusion implemented by CAS, a single operation is more efficient than an application-level lock under the same conditions; however, when multiple threads contend, the cost introduced by hardware-level mutual exclusion is of the same order as lock contention at the application level. Therefore, CAS-based lock-free algorithms and data structures cannot be expected to significantly improve application performance by themselves: hardware-level atomic operations impose a fixed cost on every operation and cannot be optimized further. By contrast, a well-designed lock-based multithreaded program can avoid losing performance while still achieving a high degree of concurrency.

That said, application-level lock-free programming does have its advantages: for example, programmers do not have to worry about hard problems such as deadlock and priority inversion. So when the application is not very complex and the performance requirement is only somewhat higher, multithreading with locks can be used; when the program is complex and the achievable performance still meets the requirements, application-level lock-free algorithms are worth considering.

For how to better schedule the working modes of multiple threads, see [5], which introduces a good pattern for cooperation between groups of threads. The premise, of course, is that the machine has enough processors to support several groups of threads running in parallel. If the number of processors is small, scheduling many threads only increases the context-switching overhead on each core and reduces overall system performance.

 

References

[1] "Lock-Free Data Structures", http://www.drdobbs.com/184401865

[2] "Some Notes on Lock-Free and Wait-Free Algorithms", http://www.rossbencina.com/code/lockfree

[3] "Wait-Free Synchronization", http://www.podc.org/dijkstra/2003.html

[4] "Performance comparison of locks and atomic operations with OpenMP-created threads", http://blog.163.com/kangtao-520/blog/static/772561452009510751068/

[5] "Thread-group competition mode in multi-core programming", http://kangtao-520.blog.163.com/blog/static/77256145200951074121305/


