On the data-input side, we later used a dedicated data forwarder to distribute incoming data across separate queues, so that worker threads do not compete for a single queue. The forwarder is, in effect, the load balancer we put in front of web sites: it can choose a queue based on the data itself, or schedule dynamically based on CPU and memory utilization. This improved design did meet the performance requirements.
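The forwarder idea above can be sketched as follows. This is a minimal illustration, not the author's actual code: it assumes hash-based routing over plain integers, whereas the text notes the routing policy could also consider CPU and memory load.

```csharp
using System;
using System.Collections.Concurrent;

// Sketch of the "data forwarder": one dispatcher routes incoming items
// to per-worker queues, so worker threads never contend on one shared
// queue. Hash routing is an assumption for this example; the policy
// could instead consider CPU/memory utilization as described above.
class Forwarder
{
    private readonly ConcurrentQueue<int>[] queues;

    public Forwarder(int workerCount)
    {
        queues = new ConcurrentQueue<int>[workerCount];
        for (int i = 0; i < workerCount; i++)
            queues[i] = new ConcurrentQueue<int>();
    }

    // Each item lands in exactly one queue; only one worker reads it,
    // so consumers never compete with each other.
    public void Dispatch(int item)
    {
        int target = (item & int.MaxValue) % queues.Length;
        queues[target].Enqueue(item);
    }

    public ConcurrentQueue<int> QueueFor(int worker) => queues[worker];
}
```

With one consumer thread per queue, dequeues are contention-free even though enqueues still go through the single forwarder.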
So in multi-threaded development, try to avoid locks; where a lock is unavoidable, choose an appropriate one. My principle for choosing the type of lock is: as long as the performance requirements are met, do not deliberately pursue fine-grained locking. Coarse-grained locks perform worse but are easy to use and understand; fine-grained locks perform better but are hard to use and understand. For the locks provided by the operating system, see the thread-synchronization chapters of Windows Core Programming; on the .NET platform, see the threading chapters of CLR via C#.
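To make the coarse- vs fine-grained trade-off concrete, here is a sketch (my own illustration, not from the original text) of the same counter protected first by one lock, then by lock striping. The striped version reduces contention but is clearly harder to get right, which is exactly the principle above.

```csharp
using System;
using System.Collections.Generic;

// Coarse-grained: one lock guards the whole structure. Simple, obviously correct.
class CoarseCounter
{
    private readonly object gate = new object();
    private readonly Dictionary<string, long> counts = new Dictionary<string, long>();

    public void Increment(string key)
    {
        lock (gate)
        {
            counts.TryGetValue(key, out long v);
            counts[key] = v + 1;
        }
    }

    public long Get(string key)
    {
        lock (gate) { return counts.TryGetValue(key, out long v) ? v : 0; }
    }
}

// Fine-grained: lock striping. Threads touching different shards do not
// contend, at the cost of more code and more ways to make mistakes.
class StripedCounter
{
    private readonly object[] gates;
    private readonly Dictionary<string, long>[] shards;

    public StripedCounter(int stripes)
    {
        gates = new object[stripes];
        shards = new Dictionary<string, long>[stripes];
        for (int i = 0; i < stripes; i++)
        {
            gates[i] = new object();
            shards[i] = new Dictionary<string, long>();
        }
    }

    private int ShardOf(string key) => (key.GetHashCode() & int.MaxValue) % shards.Length;

    public void Increment(string key)
    {
        int s = ShardOf(key);
        lock (gates[s])
        {
            shards[s].TryGetValue(key, out long v);
            shards[s][key] = v + 1;
        }
    }

    public long Get(string key)
    {
        int s = ShardOf(key);
        lock (gates[s]) { return shards[s].TryGetValue(key, out long v) ? v : 0; }
    }
}
```

Start with the coarse lock; move to striping only if profiling shows the single lock is the bottleneck.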
Second, watch out for CPU cache invalidation, and avoid frequent context switches.
Caching is often the key to performance when developing multi-core programs; the cache here means the CPU caches (L1, L2, L3). Used properly, the cache can often improve performance by a factor of two or more.
1. Common causes of CPU cache invalidation:
(1) Frequent modification of in-memory data.
(2) Synchronization mechanisms such as atomic operations and locks.
(3) Thread context switches.
(4) False sharing, which forces cache lines to be flushed repeatedly.
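Cause (4), false sharing, deserves a concrete sketch (my own illustration). When two counters updated by two threads sit in adjacent memory, they share a cache line, and every increment invalidates the other core's copy of that line. Padding each counter onto its own line avoids this; the 64-byte line size is an assumption that matches most current x86 CPUs.

```csharp
using System.Runtime.InteropServices;
using System.Threading;

// Force each counter onto its own (assumed 64-byte) cache line so that
// per-thread increments do not invalidate each other's cached lines.
[StructLayout(LayoutKind.Explicit, Size = 64)]
struct PaddedCounter
{
    [FieldOffset(0)] public long Value;
}

class Counters
{
    public PaddedCounter[] PerThread;

    public Counters(int threads) { PerThread = new PaddedCounter[threads]; }

    // Each thread increments only its own padded slot.
    public void Add(int thread) => Interlocked.Increment(ref PerThread[thread].Value);

    public long Total()
    {
        long sum = 0;
        for (int i = 0; i < PerThread.Length; i++) sum += PerThread[i].Value;
        return sum;
    }
}
```

Without the `Size = 64` padding, an array of bare `long`s would pack eight counters into one cache line and the threads would fight over it even though they never touch the same variable.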
2. Common causes of frequent context switching:
(1) The program's own threads compete for the CPU: with more runnable threads than cores, the scheduler must keep preempting threads so the others can run.
(2) Many threads wait to acquire the same mutex, but only one can hold it at a time, so the others constantly cycle between waking and sleeping.
3. Ways to address the problems above (for reference only; the right choice depends on the project):
(1) Avoid locks of any kind where possible.
(2) While still meeting the performance requirements, use the fewest threads doing the least work.
(3) Redesign to avoid modifying data in place. For example, in a real-time computing program I developed earlier, all data was read-only; to "modify" something, a new piece of data was created to replace the old one, much like programming in Erlang.
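Approach (3), replace-instead-of-modify, can be sketched like this (my own illustration, assuming a single writer). Readers always see a complete, immutable snapshot without taking a lock; the writer builds a new snapshot privately and publishes it by swapping one reference.

```csharp
using System.Collections.Generic;
using System.Threading;

// Copy-on-write snapshot store: data is never modified in place, only
// replaced, echoing the Erlang-style design described above.
class SnapshotStore
{
    private Dictionary<string, int> current = new Dictionary<string, int>();

    // Lock-free read: callers get a consistent snapshot by convention
    // (the published dictionary is never mutated after publication).
    public IReadOnlyDictionary<string, int> Read() => Volatile.Read(ref current);

    // Assumes a single writer; concurrent writers would need a lock here.
    public void Replace(string key, int value)
    {
        var next = new Dictionary<string, int>(Volatile.Read(ref current));
        next[key] = value;
        Volatile.Write(ref current, next);  // atomically publish the new snapshot
    }
}
```

No reader ever observes a half-updated state, and because published snapshots never change, cached copies of them stay valid.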
The basics of caching are covered in Chapter 6 of Computer Systems: A Programmer's Perspective (published in Chinese as "In-Depth Understanding of Computer Systems").
Third, when to use the thread pool, and what to watch for:
(1) Tasks with short execution times should go to the thread pool rather than a newly created thread; tasks that need lengthy processing should get a dedicated thread.
(2) Do not hand file read/write tasks to the thread pool: thread-pool threads are background threads, so if the application exits unexpectedly, their pending work is abandoned and data is lost.
(3) Never develop a thread pool on your own. Getting a thread pool to product quality takes months of work and a great deal of testing; otherwise you will only discover the problems when it is too late.
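Point (1) maps directly onto the TPL (a sketch of my own, not from the original): short-lived work goes to the shared pool via `Task.Run`, while a long-lived loop is started with `TaskCreationOptions.LongRunning`, which hints the scheduler to create a dedicated thread instead of borrowing a pool thread.

```csharp
using System;
using System.Threading.Tasks;

class Scheduling
{
    // Milliseconds of CPU work: the shared thread pool is ideal.
    public static Task<int> ShortWork(int n) =>
        Task.Run(() => n * n);

    // A long-lived loop gets its own thread so it does not starve the pool.
    public static Task LongWork(Action body) =>
        Task.Factory.StartNew(body, TaskCreationOptions.LongRunning);
}
```

Per point (2), note that both kinds of task run on background threads by default; work that must survive to completion (such as flushing files) needs explicit shutdown handling.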
Fourth, optimization on NUMA-architecture machines.
(1) Modern servers are mostly NUMA machines. On them it is best to enable the .NET server garbage-collection mode, so that objects are allocated in memory local to the CPU we most recently used.
(2) Do not bind threads to specific cores or raise thread priorities in .NET programs: if the bound or boosted thread ends up competing with the garbage-collection threads, performance actually gets worse.
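For reference, server GC is enabled in a .NET Framework application's App.config like this (a standard configuration fragment; verify the exact element against the runtime version you target):

```xml
<configuration>
  <runtime>
    <!-- Use the server garbage collector on multi-core/NUMA machines. -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```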
Fifth, choosing the right programming model.
Parallel programs are built from a few fixed programming models; essentially every other design is a free combination of them. The common models are:
(1) Data parallelism
(2) Task parallelism
(3) Pipeline parallelism
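The three models can each be sketched in a few lines of TPL code (my own minimal illustrations, with arbitrary example workloads):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Models
{
    // 1. Data parallelism: the same operation over partitions of one data set.
    public static long SumOfSquares(int[] data)
    {
        long total = 0;
        Parallel.For(0, data.Length,
            () => 0L,                                          // per-thread partial sum
            (i, _, partial) => partial + (long)data[i] * data[i],
            partial => Interlocked.Add(ref total, partial));   // combine once per thread
        return total;
    }

    // 2. Task parallelism: different operations running concurrently.
    public static int[] RunBoth(Func<int> a, Func<int> b)
    {
        var ta = Task.Run(a);
        var tb = Task.Run(b);
        Task.WaitAll(ta, tb);
        return new[] { ta.Result, tb.Result };
    }

    // 3. Pipeline parallelism: stages connected by a bounded queue.
    public static int[] Pipeline(int[] input)
    {
        var stage1Out = new BlockingCollection<int>(boundedCapacity: 16);
        var producer = Task.Run(() =>
        {
            foreach (var x in input) stage1Out.Add(x * 2);     // stage 1
            stage1Out.CompleteAdding();
        });
        var results = stage1Out.GetConsumingEnumerable()
                               .Select(x => x + 1)             // stage 2
                               .ToArray();
        producer.Wait();
        return results;
    }
}
```

Note how the data-parallel version accumulates into thread-local partial sums and merges them once per thread, rather than taking a lock on every element.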
Sixth, should threads pull data from the queue, or should the queue push data to them?
In general, a multi-threaded program puts incoming data into a queue first and returns immediately, leaving other threads to process it, so that the client gets a fast response. This raises the question of whether the queue should send a signal to notify a processing thread (push) or the thread should periodically fetch from the queue (pull). With pure push, a lost signal can leave some data unprocessed and stranded in the queue; with pure pull, the sleep interval is hard to tune: sleep too long and processing lags, sleep too little and CPU is wasted. So I generally combine the two approaches; see the fourth article in this series, ".NET Parallel Programming - 4. Implementing a High-Performance Asynchronous Queue".
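One simple way to combine push and pull (a sketch of my own, not the queue from article 4) is a blocking wait with a timeout: the consumer sleeps until an item is pushed, but wakes periodically anyway, so even a lost wakeup cannot strand data in the queue.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

class HybridConsumer
{
    private readonly BlockingCollection<int> queue = new BlockingCollection<int>();

    // "Push" side: Add wakes any thread blocked in TryTake.
    public void Produce(int item) => queue.Add(item);

    // Drains until CompleteAdding; the 100 ms timeout is an arbitrary choice.
    public List<int> ConsumeAll()
    {
        var seen = new List<int>();
        while (!queue.IsCompleted)
        {
            if (queue.TryTake(out int item, millisecondsTimeout: 100))
                seen.Add(item);
            // On timeout we simply loop: the periodic "pull" re-check that
            // guarantees nothing stays in the queue forever.
        }
        return seen;
    }

    public void Finish() => queue.CompleteAdding();
}
```

The consumer burns no CPU while idle (it blocks inside `TryTake`), yet never depends on a signal arriving to make progress.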
Seventh, asynchronous IO or synchronous IO?
Asynchronous IO solves the thread-blocking problem of synchronous IO (IO comes in two kinds: disk and network). Essentially all web servers use asynchronous network IO, but for disks it is usually better to avoid asynchronous IO, except on machines with SSDs, because asynchronous reads and writes can fragment the data on disk: an operation that could have been a sequential write may end up as random writes.
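For completeness, here is what asynchronous file IO looks like in .NET (a sketch of my own). Passing `useAsync: true` requests overlapped IO so no thread blocks during the read; whether async disk IO is worthwhile at all is exactly the trade-off discussed above.

```csharp
using System.IO;
using System.Text;
using System.Threading.Tasks;

class AsyncIo
{
    // Reads a whole text file without blocking a thread on the disk.
    public static async Task<string> ReadAllTextAsync(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, bufferSize: 4096,
                                       useAsync: true))
        using (var reader = new StreamReader(fs, Encoding.UTF8))
        {
            return await reader.ReadToEndAsync();
        }
    }
}
```

Without `useAsync: true`, the "async" calls on a `FileStream` may silently fall back to blocking a thread-pool thread.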
Conclusion:
The designs in this article cover only a small part of parallel-program optimization; the rest we can only accumulate through practice.
I had intended to write specifically about .NET parallelism, but in the end little of it turned out to be .NET-specific; of course, these fundamentals have little to do with any particular language.
The content of this article is only a suggestion. We should not be dogmatic when writing programs; whatever is reasonable is best. Some of these principles do not fit every situation, and our job is to keep exploring and adapting in order to improve program performance.
.NET Parallel Programming - 6. Common Optimization Strategies