Lockless programming vs. distributed programming: which is more suitable for multi-core CPUs?

Source: Internet
Author: User

In the previous article, we analyzed the speedup ratios of three typical forms of lock competition in multi-core systems. In particular, the speedup ratio of distributed lock competition is proportional to the number of CPU cores, so it scales well. In recent years, lock-free programming has become a hot topic in the academic field. So can lock-free programming achieve even better speedup? In other words, is lock-free programming better suited to multi-core CPU systems?

Lock-free programming uses atomic operations instead of locks to protect access to shared resources. For example, to add 1 to an integer variable under lock protection, the code looks like this:

    int a = 0;
    Lock();
    a += 1;
    Unlock();

If this code is disassembled, the statement a += 1 turns out to be translated into the following three assembly instructions:

    mov  eax, dword ptr [a]
    add  eax, 1
    mov  dword ptr [a], eax

In a single-core system, a task switch can occur after any of these three instructions. For example, if a task switch occurs after the first instruction and another task then operates on a, the original task will continue operating on a stale value of a when it is switched back in, producing unpredictable results. The three instructions must therefore be protected by a lock so that no other task can operate on a in the meantime. Note that in a multi-core system, the CPU cores run in physical parallel and may write simultaneously, so you must ensure that while one CPU core writes a shared memory location, no other core can write it. Here lies a difference between multi-core and single-core systems: even a single instruction requires lock protection.

If an atomic operation is used for the same add-1 operation, for example InterlockedIncrement in VC, a single statement suffices: InterlockedIncrement(&a);. The actual add-1 in this statement is translated into an assembly instruction with the lock prefix:

    lock xadd dword ptr [ecx], eax

With an atomic operation, the lock prefix prevents other tasks from writing this memory location during the write, avoiding data races. An atomic operation is faster than a lock, generally by a factor of two or more. An instruction with the lock prefix in effect engages the system's memory barrier: while the atomic operation is in progress, no other task can operate on that memory, which affects the execution of other tasks. This kind of atomic operation is therefore really a highly contended lock; but because it executes quickly, it can be seen as a very fine-grained lock.
In lock-free programming, CAS (Compare And Swap) is the main atomic operation; in VC the corresponding operations are InterlockedCompareExchange and InterlockedCompareExchangeAcquire, and for 64-bit operands InterlockedCompareExchange64 and InterlockedCompareExchangeAcquire64. One of the biggest advantages of replacing a lock with this atomic operation is that it is non-blocking. According to Microsoft's MSDN documentation, InterlockedCompareExchange implies a full memory barrier. With a full memory barrier, even atomic operations that do not access the same memory variable compete with each other. In terms of competition form, this behaves as fixed lock competition or random lock competition; the pattern of distributed lock competition cannot be achieved, and the contention is fiercer than with ordinary locks, so the final speedup is worse than the fixed-lock competition described in the previous article. Atomic operations such as InterlockedCompareExchangeAcquire do not use a full memory barrier, so in theory their performance is much better than that of atomic operations with a full memory barrier (this type of atomic operation is currently supported only on certain machines, its performance has not been measured here, and MSDN gives no performance figures either). If fixed lock competition is adopted, however, the speedup ratio is still given by the fixed-lock formula of the previous article.

Since an atomic operation is faster than a lock, compared with an ordinary lock operation the lock/unlock time is reduced by roughly a factor of 2 to 3; we may take it as 2, so the corresponding relative task granularity increases by about the same factor. In addition, because the locked computation in an atomic operation is usually only one or two simple instructions, the lock granularity is very small and can be approximated as 0. Substituting these values into the fixed-lock formula of the previous article gives the speedup ratio.

Therefore, in the case of fixed lock competition, the limiting value of the speedup ratio is approximately twice the task granularity obtained with an ordinary lock, i.e. about twice the ordinary-lock speedup; the speedup still does not grow linearly with the number of CPU cores. For random lock competition, if the ordinary lock operation is replaced by an atomic operation and the lock granularity is taken as 0, then when the task granularity is large, the competition probability P barely increases; when the task granularity is very small, P can at most roughly double, so the speedup improves somewhat over the ordinary lock. For the worst case of random competition, after switching to an atomic operation the speedup ratio is only slightly higher than with ordinary locks, and it still cannot grow with the number of CPU cores. Note that the algorithmic overhead of lock-free programming is not considered above: in lock-free programming a CAS operation must be completed in a loop, and a single write may take many iterations, so actual performance falls short of the calculations above. Therefore, even with lock-free programming, if the lock competition is still of the fixed or random form, the speedup ratio is still not optimistic and remains far from that of distributed lock competition, whose speedup can approach the number of CPU cores even in the worst case. Of course, one may ask: since the speedup of distributed lock competition is so good, can we not replace the ordinary locks in distributed competition with atomic operations to achieve even better speedup?
Theoretically, if atomic operations without a full memory barrier are used in place of ordinary locks in distributed competition, a better speedup ratio than ordinary-lock distributed competition can be achieved: after switching to atomic operations, the relative task granularity increases by a factor of 2 to 3. When the task granularity is very small, for example below 0.5 (which rarely occurs in practice), the speedup roughly doubles compared with ordinary locks; when the task granularity is large, the improvement is not obvious. The task granularity depends largely on how the programmer divides the tasks; as long as the programmer does not make the granularity too small, its impact on the speedup ratio is limited, and distributed lock competition with ordinary locks already performs close to a single-core multi-task program. Lock-free programming, by contrast, is very difficult and its program complexity very high; it is hard for non-specialists to grasp, and almost impossible for ordinary programmers to implement. The difficulty of distributed programming is similar to that of data-structure and algorithm programming in the single-core multi-task era, which ordinary programmers can master. Therefore, in practice, as long as the task granularity is not too small and performance is not pursued to extremes, distributed lock competition with ordinary locks is sufficient. Judging from the development of lock-free programming, the lock-free algorithms implemented so far are limited in number and in functionality. Moreover, lock-free programming is independent of earlier single-core programming, so it can reuse almost none of the earlier results, whereas distributed programming is developed on the basis of the original single-core multi-task programming.
It can inherit the results of the single-core era; for example, a queue pool can inherit existing queue algorithms. Distributed programming can therefore greatly reduce the workload of porting an existing single-core program to a multi-core system: you only need to refactor the existing program to fully support multi-core CPU systems. In conclusion, we can draw the comparisons shown in the following table.
    # | Compared item             | Lockless programming                                                                                            | Distributed programming
    1 | Acceleration performance  | Depends on the competition mode; unless distributed competition is also adopted, worse than distributed lock competition. | Speedup proportional to the number of CPU cores; close to single-core multi-task performance.
    2 | Functions implemented     | Limited                                                                                                         | Unrestricted
    3 | Difficulty for programmers| Too difficult and complex for ordinary programmers; currently mastered by only a few people worldwide.          | Similar to single-core-era data-structure algorithms; ordinary programmers can master it.
    4 | Porting existing software | Existing algorithms must be discarded and cannot be reused.                                                     | Existing algorithms can be inherited; refactor based on existing programs.
Comparing the four aspects in the table above, the practical value of lockless programming is far inferior to that of distributed programming. Therefore, distributed programming is more suitable for multi-core CPU systems than lockless programming.
