[Reprinted] Selecting, Analyzing, and Measuring Windows Multi-Thread Synchronization Mechanisms Based on Efficiency Considerations


Link: http://waterwood.blog.163.com/blog/static/43596554200793033955/

Statement: the following content is reprinted from NewsFan (NNTP: news.newsfan.net, newsgroup: computer.software.programming.visualstudio).
Author: igostan; the commentary was added by shuimu. For reprinting, please indicate the source.

Recently I could not decide whether to use a critical section or the Interlocked* functions in a piece of code, and I also wanted a quantitative understanding of the problem, so I ran some tests and summarized them in this article. I hope it is helpful.

First of all, it should be clear that comparing the relative efficiency of synchronization primitives only makes sense under the premise that the threads rarely actually conflict. That is, a thread acquires a resource exclusively and releases it soon afterwards, and during that interval it must be rare for another thread to try to access the same resource and be forced to wait. A good multi-threaded design has to ensure this. If several threads are constantly waiting on one another, that indicates a serious design problem (serious enough to be treated as a bug): if only one of them can run at a time, what is the point of multithreading? A single thread would do better, saving at least the synchronization overhead.

Now let's look at the critical section. Although it uses an event internally, it does not use it on every call: the first thread that successfully enters the critical section does not need to touch the kernel object at all. I cannot see the source code, but I have considered how I would implement it myself, and I believe Windows does essentially the same thing (with more details to take care of). The important points are listed below; if you are interested, implementing a critical section yourself is a good exercise, and a toy sketch follows the list:
1. Use InterlockedExchange (or a related function) to store the current thread's ID in the CRITICAL_SECTION structure.
2. If the previous value was zero, or was already the current thread's ID (do not forget that a critical section allows the same thread to enter repeatedly), the call succeeds immediately.
3. If entering fails, the caller's thread ID has already been recorded, and the thread can simply call WaitForSingleObject on the contained event (so the kernel-level wait function is needed only when the thread actually has to wait; in practice the recorded thread IDs must form a linked list so that several threads can wait at once).
4. When the thread that entered the critical section leaves, it uses InterlockedExchange again to clear the stored thread ID. If the stored value shows that another thread is waiting, it calls SetEvent once (note that SetEvent is also a much slower function; fortunately it is needed only when there is a waiting thread).
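To make the four points above concrete, here is a toy implementation along those lines. It is only a sketch of the speculated scheme, not the real Windows CRITICAL_SECTION (which also spins, keeps a waiter count so SetEvent is called only when someone is actually waiting, and maintains a proper wait list); all names here are invented.

```cpp
#include <windows.h>

// Toy critical section along the lines speculated above. Illustrative only.
struct ToyCriticalSection {
    volatile LONG ownerTid;   // 0 = free, otherwise the owning thread's ID
    LONG recursionCount;      // lets the same thread enter repeatedly
    HANDLE waitEvent;         // auto-reset event, touched only under contention
};

void ToyInitialize(ToyCriticalSection* cs) {
    cs->ownerTid = 0;
    cs->recursionCount = 0;
    cs->waitEvent = CreateEvent(NULL, FALSE, FALSE, NULL);  // auto-reset
}

void ToyEnter(ToyCriticalSection* cs) {
    LONG tid = (LONG)GetCurrentThreadId();
    for (;;) {
        // Points 1 and 2: atomically claim ownership if the lock is free.
        LONG prev = InterlockedCompareExchange(&cs->ownerTid, tid, 0);
        if (prev == 0)   { cs->recursionCount = 1; return; }  // acquired
        if (prev == tid) { cs->recursionCount++;  return; }   // re-entry
        // Point 3: contended, so fall back to a kernel-level wait.
        WaitForSingleObject(cs->waitEvent, INFINITE);
    }
}

void ToyLeave(ToyCriticalSection* cs) {
    if (--cs->recursionCount == 0) {
        // Point 4: release ownership, then wake one waiter. The toy version
        // signals unconditionally; the real one signals only when needed.
        InterlockedExchange(&cs->ownerTid, 0);
        SetEvent(cs->waitEvent);
    }
}
```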

Some of the details disclosed in MSDN's description of InitializeCriticalSectionAndSpinCount, a function added in Windows 2000, can serve as evidence for the speculation above.
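For reference, that function attaches a spin count to a critical section, so EnterCriticalSection busy-waits briefly before falling back to the kernel event. A minimal usage sketch:

```cpp
#include <windows.h>

CRITICAL_SECTION g_cs;

void InitSharedLock() {
    // The spin count is worthwhile on multiprocessor machines when the lock
    // is normally held for only a few instructions: spinning briefly is
    // cheaper than a kernel-mode wait. (4000 is the value MSDN mentions for
    // the NT heap manager's critical sections; tune it for your own case.)
    InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
}
```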

Finally, to settle this question completely you would need a good understanding of 80386 assembly language, combined with the scattered information in MSDN and in Windows Core Programming. My estimates were as follows:

◎ The fastest are, of course, the plain ++/-- operations in C; they map directly to simple machine instructions.
◎ The Interlocked* functions come second. The instructions they correspond to are actually very simple (at least on the x86 architecture, usually only 1 to 3 instructions), but they force the CPU to give up its usual instruction-level optimizations, and they may have to notify other CPUs over the system bus (which can involve the L1 and L2 caches inside the CPU and even main memory). My guess was that this slows things down by a factor of 5 to 10.
◎ A critical section should cost about as much as an Interlocked* call, just a little slower, because it actually calls the former.
◎ The others, such as mutexes and semaphores, are the slowest; my guess was dozens or even a hundred times slower than a critical section, because they must switch into kernel mode and back, which executes a large number of instructions (on x86 the instruction count does not look large, but each such instruction corresponds, on a RISC CPU such as the Alpha, to a whole subroutine costing dozens of clock cycles). The speed differences among the various kernel objects themselves are therefore insignificant and not worth considering, because the kernel-mode switch is the bottleneck.

The following are my actual results:

System performance counter frequency: 3579545. Repeat count for each test: 1,000,000. All times below are in seconds.

Run   T1 (++ only)   T2 (-- only)   Both         T1+T2-Both      Net time cost   Interlocked(In/De)crement   Critical section   Mutex
1     0.00371197     0.00372959     0.00754256   -0.000101004    0.00754256      0.115052                    0.140891           1.47912
2     0.00376848     0.0037205      0.00756789   -7.89067e-005   0.00756789      0.114687                    0.141372           1.49058
3     0.00371739     0.00380439     0.00749592    2.58552e-005   0.00747007      0.114999                    0.14113            1.47861
4     0.00379342     0.00371602     0.00754006   -3.06324e-005   0.00754006      0.114092                    0.141929           1.48663
5     0.00380472     0.00377649     0.00752712    5.40851e-005   0.00747303      0.113746                    0.140481           1.48115

The test program is a single-threaded command-line program, compiled with VC2005 in Release mode, without debug information and optimized for speed. At run time a single thread repeatedly locks and releases the resource, with no other threads involved, and only the execution speed is measured. The test machine is a Pentium 4 at 2.4 GHz (an older machine, three or four years old). Each tested call is repeated 1,000,000 times, and the whole suite is run five times in rotation. The performance counter frequency printed on the first line is not itself interesting; it is a hardware-dependent value that indicates the timing resolution of the platform. In each subsequent test, the time taken to run the loop is measured in seconds. The ++ and -- operations are the plain C operators, but the operand is of type volatile long (without volatile, compiler optimization would eliminate the whole loop, which would say nothing about a multi-threaded environment).

First we run loops performing only ++ and only --, then a loop that performs both ++ and -- in each iteration. The difference between the sum of the first two times and the third should be the overhead of an empty loop, so in all subsequent results the empty-loop time is deducted. Unfortunately this difference is so small that it sometimes comes out negative (in which case I simply ignore it). Every other loop performs exactly one call per iteration: one increment, one decrement, or one lock/unlock pair.
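The author's actual test program was not published; the following is a minimal single-threaded sketch of the kind of harness described above (Win32 APIs only; the loop counts are taken from the article, the structure and output format are assumptions):

```cpp
#include <windows.h>
#include <stdio.h>

const int REPEATS = 1000000;
volatile LONG g_counter = 0;   // volatile so the loops are not optimized away

CRITICAL_SECTION g_cs;
HANDLE g_mutex;

// Time one test function with the high-resolution performance counter.
double Time(void (*fn)(), LARGE_INTEGER freq) {
    LARGE_INTEGER t0, t1;
    QueryPerformanceCounter(&t0);
    fn();
    QueryPerformanceCounter(&t1);
    return (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}

void PlainInc()  { for (int i = 0; i < REPEATS; i++) g_counter++; }
void AtomicInc() { for (int i = 0; i < REPEATS; i++) InterlockedIncrement(&g_counter); }
void CsLock()    { for (int i = 0; i < REPEATS; i++) { EnterCriticalSection(&g_cs); LeaveCriticalSection(&g_cs); } }
void MutexLock() { for (int i = 0; i < REPEATS; i++) { WaitForSingleObject(g_mutex, INFINITE); ReleaseMutex(g_mutex); } }

int main() {
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    InitializeCriticalSection(&g_cs);
    g_mutex = CreateMutex(NULL, FALSE, NULL);

    printf("++ loop:          %g s\n", Time(PlainInc, freq));
    printf("Interlocked:      %g s\n", Time(AtomicInc, freq));
    printf("Critical section: %g s\n", Time(CsLock, freq));
    printf("Mutex:            %g s\n", Time(MutexLock, freq));

    CloseHandle(g_mutex);
    DeleteCriticalSection(&g_cs);
    return 0;
}
```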

Conclusion: my estimates were broadly right, but the specific ratios were off. In reality InterlockedIncrement and InterlockedDecrement run at about 1/16 the speed of plain ++ and -- (perhaps because I forgot to account for the call overhead of stack setup and the jump into the function). A critical section is only about 25-30% slower than the Interlocked call, a difference that is almost negligible. This is the result closest to my estimate, and it also supports my guess about the implementation: a critical section is technically built on the Interlocked* functions, achieving more complex synchronization at almost the same cost. But do not forget that in real code, where a single Interlocked* call can do the job on its own, a critical section needs two calls (enter and leave), so the practical gap may double. On the other hand, when the state to synchronize is anything more complex than a single value, Interlocked* is not worth considering (and almost certainly cannot be applied at all). Finally, the mutex turned out much faster than I estimated, only about 10 times slower than the critical section (which fully proves that my knowledge of assembly programming is purely theoretical). Of course, an order-of-magnitude gap is still far from negligible in programming, so the kernel synchronization objects should be used only when they are really needed, such as for cross-process synchronization.

Comments:

Multi-threaded and multi-process programming, especially multi-threaded programming, plays an important role in practice, notably in network communication, graphics and image processing, industrial control and automation, and B/S (browser/server) development. Proper use of threads and thread synchronization can serve as one measure of a programmer's basic competence. I will write a separate article about multithreading; here I will only comment on the author's points:

1. In multi-threaded and multi-process applications, execution efficiency matters a great deal, especially for services with high data throughput or a large number of client terminals.

2. In thread and process synchronization code, a basic principle is to keep the synchronized region as small as possible while still guaranteeing correct logic; otherwise the advantages of multithreading are lost. By the traditional view, thread procedure code should be compact, loop-oriented, and concise, making full use of the CPU's branch-prediction capability; the biggest payoff is that service performance can then scale almost linearly with the number of CPUs. A sketch of the narrow-scope principle follows:
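A minimal illustration of this principle, with hypothetical names: the expensive work happens outside the lock, and the critical section protects only the shared update.

```cpp
#include <windows.h>

CRITICAL_SECTION g_cs;   // assumed already initialized; protects g_total only
long g_total = 0;

// Placeholder for work that touches no shared state.
long ExpensiveComputation(long input) { return input * input; }

void ProcessItem(long input) {
    long result = ExpensiveComputation(input);  // done OUTSIDE the lock

    EnterCriticalSection(&g_cs);                // hold the lock only for the
    g_total += result;                          // shared update itself
    LeaveCriticalSection(&g_cs);
}
```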

3. For synchronizing operations on a single value, the simplest approach is the atomic data-modification functions such as InterlockedIncrement, InterlockedDecrement, and InterlockedExchange, which are the most efficient. For example:
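A minimal sketch of these calls (the variable names are made up):

```cpp
#include <windows.h>

volatile LONG g_refCount = 0;
volatile LONG g_initialized = 0;

void Examples() {
    InterlockedIncrement(&g_refCount);   // atomic ++ on a shared counter
    InterlockedDecrement(&g_refCount);   // atomic --

    // InterlockedExchange atomically stores a new value and returns the old
    // one, which is enough to build a run-once flag:
    if (InterlockedExchange(&g_initialized, 1) == 0) {
        // first caller wins; do one-time initialization here
    }
}
```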

4. Critical-section synchronization is suitable for protecting a relatively large region of code and is still highly efficient. However, it can only implement mutual exclusion, that is, logic similar to a single-object wait; it cannot express wait-all or wait-any semantics across multiple objects.

5. If more complex synchronization is required, system-level synchronization objects such as events and mutexes should be used together with APIs such as WaitForSingleObject(Ex) and WaitForMultipleObjects(Ex) to implement wait-single, wait-all, wait-any, and similar logic. The program then runs an order of magnitude slower than with the first two mechanisms, so choosing an appropriate synchronization region is important. A sketch:
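A minimal sketch of such wait-all/wait-any logic, with hypothetical handles:

```cpp
#include <windows.h>

// Hypothetical handles; in real code they come from CreateMutex/CreateEvent
// (the named variants allow the same objects to be opened across processes).
HANDLE g_mutex;       // guards a shared resource
HANDLE g_dataReady;   // event signaled by a producer

void Consumer() {
    HANDLE handles[2] = { g_dataReady, g_mutex };

    // Wait-all logic: return only when the data is ready AND we own the mutex.
    DWORD r = WaitForMultipleObjects(2, handles, TRUE /*bWaitAll*/, INFINITE);
    if (r == WAIT_OBJECT_0) {
        // ... consume the shared data ...
        ReleaseMutex(g_mutex);
    }

    // Wait-any logic would pass FALSE instead; the return value minus
    // WAIT_OBJECT_0 is then the index of the handle that was signaled.
}
```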

6. Besides calling these system APIs directly from C++, the corresponding synchronization mechanisms have similar encapsulations in Java and .NET (the .NET versions are in fact wrappers around the system APIs).
