Program optimization using multi-core processors and multithreading


Original article: http://www.ibm.com/developerworks/cn/linux/l-cn-optimization/index.html

Level: Intermediate

Yang Xiaohua (normalnotebook@126.com), software engineer

November 17, 2008

You may remember that in March 2005 the C++ guru Herb Sutter published an article titled "The Free Lunch Is Over" in Dr. Dobb's Journal. The article pointed out that programmers had been able to ignore performance concerns such as efficiency, scalability, and throughput because ever-faster CPUs kept solving those problems for them. In the near future, however, CPU speed will fall away from Moore's Law and hit a limit, so more and more applications will have to confront performance problems directly, and the way to solve them is to adopt concurrent programming techniques.

Sample program

Program function: accumulate the sum of the values from 1 to APPLE_MAX_VALUE (100000000) and assign it to the members a and b of the apple structure, then compute the sum of a[i] + b[i] over the orange data structure, looping ORANGE_MAX_VALUE (1000000) times.

Note:

  1. Because the sample program is a model abstracted from a real application, this article does not apply optimizations such as rewriting the loop body as test.a = test.b = test.b + sum, introducing intermediate variables (lookup tables), or other similar transformations.
  2. All of the following program fragments are only part of the code. For the complete code, see the attachment at the bottom of the original article.

Listing 1. Sample program

#define ORANGE_MAX_VALUE      1000000
#define APPLE_MAX_VALUE     100000000
#define MSECOND               1000000

struct apple
{
    unsigned long long a;
    unsigned long long b;
};

struct orange
{
    int a[ORANGE_MAX_VALUE];
    int b[ORANGE_MAX_VALUE];
};

int main (int argc, const char * argv[]) {
    // insert code here...
    struct apple test = {0, 0};
    struct orange test1 = {{0}, {0}};
    unsigned long long sum = 0;
    int index;

    /* accumulate the running sum into both members of apple */
    for(sum = 0; sum < APPLE_MAX_VALUE; sum++)
    {
        test.a += sum;
        test.b += sum;
    }

    /* sum the two orange arrays element by element */
    sum = 0;
    for(index = 0; index < ORANGE_MAX_VALUE; index++)
    {
        sum += test1.a[index] + test1.b[index];
    }

    return 0;
}



K-Best measurement method

The K-Best measurement method proposed by Randal E. Bryant and David R. O'Hallaron is used to measure the running time of a complex program. The idea is to execute the program repeatedly and record the K fastest times; if the measurement error ε is small, the fastest value observed represents the true execution time of the program. This is called the "K-Best" method, and it requires three parameters to be set:

K: the number of measurements kept that lie close to the fastest value.

ε: how close those measurements must be to one another; with the K fastest values listed in ascending order V1 ≤ V2 ≤ ... ≤ VK, the convergence criterion (1 + ε)V1 ≥ VK must hold.

M: the maximum number of measurements taken before the test gives up.

An array of the K fastest times is maintained in ascending order. For each new measurement, if it is faster than the current Kth (slowest) element of the array, that element is replaced with the new value and the array is re-sorted in ascending order. The process continues until the error criterion is met, at which point the measurements are said to have converged. If the criterion cannot be met within M measurements, the process is said not to converge.

All subsequent experiments use K = 10, ε = 2%, and M = 200 to obtain the program running time. The K-Best method is also modified slightly: instead of taking the minimum as the program's execution time, the average of the K measurements is used to represent the true running time. Because the error tolerance ε is fairly large, the timing of every test program converges, and the results still illustrate the point.

For portability, gettimeofday() is used to obtain the wall-clock time, which is accurate to microseconds.
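As an illustration only (the article's actual measurement harness ships with the attachment), a minimal C sketch of such a K-Best loop built around gettimeofday() might look like the following. The function run_once() is a placeholder for the code being timed, and, as described above, the average of the K fastest samples is reported rather than the minimum.

#include <stdio.h>
#include <sys/time.h>

#define K 10          /* number of fastest samples kept                   */
#define M 200         /* give up after this many measurements             */
#define EPSILON 0.02  /* convergence: (1 + EPSILON) * V1 >= VK must hold  */

/* placeholder for the code under test; replace with the real work */
static void run_once(void)
{
    volatile unsigned long long s = 0;
    unsigned long long i;
    for (i = 0; i < 1000000ULL; i++)
        s += i;
}

/* wall-clock time of one run, in microseconds */
static double time_once(void)
{
    struct timeval start, end;
    gettimeofday(&start, NULL);
    run_once();
    gettimeofday(&end, NULL);
    return (end.tv_sec - start.tv_sec) * 1000000.0
         + (end.tv_usec - start.tv_usec);
}

/* keep the array of the K fastest samples sorted in ascending order */
static void insert_sample(double best[], int *count, double v)
{
    int i;
    if (*count < K)
        best[(*count)++] = v;
    else if (v < best[K - 1])
        best[K - 1] = v;
    else
        return;
    for (i = *count - 1; i > 0 && best[i] < best[i - 1]; i--) {
        double t = best[i];
        best[i] = best[i - 1];
        best[i - 1] = t;
    }
}

/* measure until the K fastest samples agree to within EPSILON, or M runs pass;
   report the average of the K fastest samples, not the minimum */
static double k_best_average(void)
{
    double best[K], sum = 0.0;
    int count = 0, runs, i;

    for (runs = 0; runs < M; runs++) {
        insert_sample(best, &count, time_once());
        if (count == K && (1.0 + EPSILON) * best[0] >= best[K - 1])
            break;
    }
    for (i = 0; i < count; i++)
        sum += best[i];
    return sum / count;
}

int main(void)
{
    printf("K-Best average: %.0f microseconds\n", k_best_average());
    return 0;
}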




Test Environment

Hardware: Lenovo dual-core machine, clock speed 2.4 GHz, memory 2 GB

Software: SUSE Linux Enterprise 10, kernel version: linux-2.6.16




Three Layers of software optimization

A doctor must first identify the cause of an illness and only then prescribe the right medicine; treating at random cures nothing. Everyone understands this, yet during software optimization we often make exactly that mistake: diving straight in and changing code, with results that are usually unsatisfactory.

Software optimization can be divided into three levels: system, application, and microarchitecture. The first step is to look at the problem from a macro perspective, that is, at the system level, collecting all program-related information to determine the cause of the performance problem. Only once the cause is known should optimization at the application and microarchitecture levels begin.

  1. System-level optimization: insufficient memory, slow CPU, too many processes in the system, and so on
  2. Application-level optimization: algorithm optimization, parallel design, and so on
  3. Microarchitecture-level optimization: branch prediction, data structure optimization, instruction optimization, and so on

Software optimization can be performed at any stage of application development; of course, the earlier it starts, the less trouble there will be later.

In practice, application-level optimization is used the most, and microarchitecture-level optimization also has its place. Weighing the benefit of an optimization against its maintenance cost usually decides the matter: branch-prediction and instruction-level optimizations, for example, are rarely applied in large applications because their maintenance cost is too high.

This article optimizes the sample program at the application level and at the microarchitecture level. Multithreading and CPU affinity are used for application-level optimization, and cache optimization is used at the microarchitecture level.




Parallel Design

To design an application with a parallel programming model, you must pull your thinking away from the linear model, review the entire processing flow from beginning to end, and identify the parts that can be executed in parallel.

An application can be seen as a collection of interdependent tasks. Dividing the application into multiple independent tasks and determining the dependencies between them is called decomposition. There are three main ways to decompose a problem: task decomposition, data decomposition, and data-flow decomposition. For more information, see reference 1.

Careful analysis of the sample program with the task decomposition method shows that computing the apple values and computing the orange value are completely unrelated operations, so they can run in parallel.

The transformed two-thread program:

Listing 2. Two-thread program

void* add(void* x)
{
    unsigned long long sum;

    /* accumulate into the apple structure passed in by the main thread */
    for(sum = 0; sum < APPLE_MAX_VALUE; sum++)
    {
        ((struct apple *)x)->a += sum;
        ((struct apple *)x)->b += sum;
    }

    return NULL;
}

int main (int argc, const char * argv[]) {
    // insert code here...
    struct apple test = {0, 0};
    struct orange test1 = {{0}, {0}};
    unsigned long long sum = 0;
    int index;
    pthread_t ThreadA;

    /* the apple work runs in ThreadA while the main thread sums the oranges */
    pthread_create(&ThreadA, NULL, add, &test);

    for(index = 0; index < ORANGE_MAX_VALUE; index++)
    {
        sum += test1.a[index] + test1.b[index];
    }

    pthread_join(ThreadA, NULL);

    return 0;
}

In addition, data decomposition shows that the apple computation can itself be split across two threads: one thread computes the a value and the other computes the b value (again, this arrangement is abstracted from a real application). However, the two threads may access the apple structure at the same time, so access to the data structure needs to be protected with a lock.

The transformed three-thread program is as follows:

Listing 3. Three-thread Program

struct apple
{
    unsigned long long a;
    unsigned long long b;
    pthread_rwlock_t rwLock;   /* protects the whole structure */
};

void* addx(void* x)
{
    unsigned long long sum;

    pthread_rwlock_wrlock(&((struct apple *)x)->rwLock);
    for(sum = 0; sum < APPLE_MAX_VALUE; sum++)
    {
        ((struct apple *)x)->a += sum;
    }
    pthread_rwlock_unlock(&((struct apple *)x)->rwLock);

    return NULL;
}

void* addy(void* y)
{
    unsigned long long sum;

    pthread_rwlock_wrlock(&((struct apple *)y)->rwLock);
    for(sum = 0; sum < APPLE_MAX_VALUE; sum++)
    {
        ((struct apple *)y)->b += sum;
    }
    pthread_rwlock_unlock(&((struct apple *)y)->rwLock);

    return NULL;
}

int main (int argc, const char * argv[]) {
    // insert code here...
    struct apple test = {0, 0};
    struct orange test1 = {{0}, {0}};
    unsigned long long sum = 0;
    int index;
    pthread_t ThreadA, ThreadB;

    /* the read/write lock must be initialized before the threads start */
    pthread_rwlock_init(&test.rwLock, NULL);

    pthread_create(&ThreadA, NULL, addx, &test);
    pthread_create(&ThreadB, NULL, addy, &test);

    for(index = 0; index < ORANGE_MAX_VALUE; index++)
    {
        sum += test1.a[index] + test1.b[index];
    }

    pthread_join(ThreadA, NULL);
    pthread_join(ThreadB, NULL);

    pthread_rwlock_destroy(&test.rwLock);

    return 0;
}

After this transformation, do we really get what we want? The K-Best measurement results are disappointing, as Figure 1 shows:

Figure 1. Time consumption comparison between a single thread and multiple threads

Why does the multithreaded version take more time than the single-threaded one? The reason is that thread creation and termination, as well as thread context switches, introduce extra overhead, so the total time exceeds that of a single thread.

Why is the locked three-thread version even slower than the two-thread version? The reason is also simple: the read/write lock is the culprit. The Thread Viewer confirms it; the threads do not actually execute in parallel but serially, as Figure 2 shows:

Figure 2. Viewing the three threads with the Thread Viewer

The bottom thread is the main thread; of the other two, one is the addx thread and the other is the addy thread, and these two threads execute serially.

When dividing the work among threads by data decomposition, there is another option: one thread computes the sum from 1 to APPLE_MAX_VALUE/2, and the other from APPLE_MAX_VALUE/2+1 to APPLE_MAX_VALUE. This article sets that model aside, but if you are interested you can try it; a sketch is given below.
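For readers who do want to try it, a minimal sketch of that range-splitting decomposition is shown below. The struct range helper and the add_range() function are introduced here purely for illustration and are not part of the original sample code; each thread sums a private sub-range, so the threads share no data and need no lock.

#include <pthread.h>
#include <stdio.h>

#define APPLE_MAX_VALUE 100000000ULL

/* helper structure introduced for this sketch only */
struct range
{
    unsigned long long from;     /* first value to add (inclusive) */
    unsigned long long to;       /* last value to add (inclusive)  */
    unsigned long long partial;  /* this thread's private result   */
};

/* each thread accumulates its own sub-range into a private variable */
static void* add_range(void* arg)
{
    struct range *r = (struct range *)arg;
    unsigned long long i, partial = 0;

    for (i = r->from; i <= r->to; i++)
        partial += i;

    r->partial = partial;
    return NULL;
}

int main(void)
{
    struct range lo = {1, APPLE_MAX_VALUE / 2, 0};
    struct range hi = {APPLE_MAX_VALUE / 2 + 1, APPLE_MAX_VALUE, 0};
    unsigned long long total;
    pthread_t t1, t2;

    pthread_create(&t1, NULL, add_range, &lo);
    pthread_create(&t2, NULL, add_range, &hi);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* combine the two partial sums once; in the sample program this
       value would be assigned to test.a and test.b */
    total = lo.partial + hi.partial;
    printf("sum 1..%llu = %llu\n", APPLE_MAX_VALUE, total);

    return 0;
}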

When a program is designed with multiple threads, parallelism is not worthwhile if the extra overhead exceeds the useful work each thread does. More threads are not automatically better: the number of software threads should match the number of hardware threads as closely as possible, and the optimum thread count is best found by continued tuning against actual needs.
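As a rough illustration of matching software threads to hardware threads, the number of processors available on Linux can be queried at run time with sysconf(). _SC_NPROCESSORS_ONLN reports the processors currently online, while _SC_NPROCESSORS_CONF (the value used later in Listing 5) reports those configured in the system:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long online     = sysconf(_SC_NPROCESSORS_ONLN);  /* processors currently online */
    long configured = sysconf(_SC_NPROCESSORS_CONF);  /* processors configured       */

    printf("online: %ld, configured: %ld\n", online, configured);

    /* a common starting point is one software thread per online processor */
    return 0;
}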




Locking and lock-free

In the locked three-thread solution, the two threads access different members of apple, so no lock is needed at all. Modifying apple's data structure (deleting the read/write lock code) should therefore improve performance by removing the lock.

The test results are as follows:

Figure 3. Time comparison between the locked and lock-free versions

The result may be puzzling: why is efficiency lower without the lock? The specific cause is analyzed in the Cache Optimization section below.

In actual testing, the lock-free three-thread version is also very unstable; its run time sometimes varies by more than a factor of four.

To improve the performance of a parallel program, we must find a compromise between too little synchronization and too much. Too little synchronization leads to incorrect results; too much leads to poor efficiency. Use private locks wherever possible and keep lock granularity small. Lock-free design has both advantages and disadvantages: it can improve efficiency, but it makes the design more complex and harder to maintain, and other mechanisms are then needed to guarantee the program's correctness.
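As one sketch of what finer lock granularity can look like (illustrative code, not part of the article's solutions): instead of the single read/write lock of Listing 3, which serializes both threads, each member gets its own lock, so a thread that only updates a never blocks a thread that only updates b. In the sample program itself no lock is needed at all, since the two threads touch disjoint members.

#include <pthread.h>

struct apple
{
    unsigned long long a;
    pthread_mutex_t lock_a;   /* taken only by code that updates a */
    unsigned long long b;
    pthread_mutex_t lock_b;   /* taken only by code that updates b */
};

/* initialize the members and both locks before any thread is created */
static void apple_init(struct apple *p)
{
    p->a = 0;
    p->b = 0;
    pthread_mutex_init(&p->lock_a, NULL);
    pthread_mutex_init(&p->lock_b, NULL);
}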




Cache Optimization

In serial program design, the direct way to save bandwidth or storage space is to make data structures more compact and to reduce data movement, thereby improving performance. In multi-core, multithreaded programs, however, this approach is often counterproductive.

Data moves not only between an execution core and memory, but also between execution cores. Based on data dependence, two read/write patterns involve data movement between cores: write followed by read (read-after-write) and write followed by write (write-after-write). Because both patterns cause contention for the data, execution that looks parallel on the surface can in fact proceed only serially, which hurts performance.

The smallest unit of data exchanged between processors is the cache line (cache block). In a multi-core system, when two independent caches hold the same cache line and that line is written in one cache while it is read in the other, data must move between the two caches even though the read and write addresses do not overlap. This is called false sharing of the cache: the two execution cores keep passing the cache line back and forth over the memory bus, a phenomenon known as the "ping-pong effect".

Similarly, when two threads write different parts of the same cache line, they also compete for that line; this is the write-after-write case. As mentioned above, the lock-free solution turned out slower than the locked one, and this cache-line contention is the reason.

On x86 machines, the cache line of many processors is 64 bytes; for details, refer to Intel's reference manuals.
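On Linux with glibc, the level-1 data cache line size can often be queried at run time instead of being hard-coded; a minimal sketch is shown below (the 64-byte fallback is an assumption for systems where the value is not reported):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

    if (line <= 0)
        line = 64;   /* common value on x86, used here only as a fallback */

    printf("L1 data cache line size: %ld bytes\n", line);
    return 0;
}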

Since the bottleneck of the lock-free three-thread solution lies in the cache, will efficiency improve if apple's two members a and b are placed in different cache lines?

The modified code snippet is as follows:

Listing 4. Cache Optimization

struct apple
{
    unsigned long long a;
    char c[128];   /* padding of 32, 64, or 128 bytes keeps a and b on different cache lines */
    unsigned long long b;
};

The measurement results are shown in Figure 4:

Figure 4. Time comparison after cache optimization

A single small line of code brings such a large benefit; clearly we are trading space for time. Readers can also use a simpler method, __attribute__((__aligned__(L1_CACHE_BYTES))), to align the members to the cache-line size and achieve the same effect.
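A sketch of that alternative is shown below. The value of L1_CACHE_BYTES is assumed here to be 64 bytes; in practice it should be set to the cache-line size of the target processor:

#define L1_CACHE_BYTES 64   /* assumed cache-line size; adjust for the target CPU */

struct apple
{
    unsigned long long a;
    /* align b to a cache-line boundary so that a and b never share a line */
    unsigned long long b __attribute__((__aligned__(L1_CACHE_BYTES)));
};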

If a line of code with a similar effect is added to the apple data structure of the locked three-thread solution, will efficiency improve? It will not: the poor efficiency of the locked three-thread solution is caused not by cache misses but by the lock.

In multi-core, multithreaded programming, the memory access needs of all threads must be considered together rather than one thread at a time. When choosing a parallel decomposition method, weigh memory bandwidth and contention comprehensively, place data used by different processors and threads on different cache lines, and separate read-only data from writable data.




CPU affinity

CPU affinity can be divided into two categories: soft affinity and hard affinity.

The Linux kernel process scheduler natively provides soft affinity, which means that processes are usually not migrated between processors frequently. This is exactly the state we want, because a low migration frequency means low overhead, but it does not mean that small-scale migration never happens.

CPU hard affinity means binding a process to a particular processor rather than letting it migrate frequently between processors. This can improve not only program performance but also program reliability.

From the above it is not hard to see that hard affinity has certain advantages over soft affinity. However, thanks to the continued efforts of kernel developers, the shortcomings of soft affinity in the 2.6 kernel are greatly improved compared with the 2.4 kernel.
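Before binding threads by hand, it can be useful to inspect the set of CPUs the scheduler currently allows a process to run on. A minimal sketch using sched_getaffinity() and the glibc CPU_* macros follows (passing 0 as the PID means the calling thread):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    int cpu;

    CPU_ZERO(&mask);
    if (-1 == sched_getaffinity(0, sizeof(mask), &mask)) {
        perror("sched_getaffinity");
        return 1;
    }

    /* print every CPU this process may be scheduled on */
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf("allowed to run on CPU %d\n", cpu);

    return 0;
}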

On a dual-core machine, if the thread computing apple is bound to one CPU and the thread computing orange is bound to the other, will efficiency improve?

The procedure is as follows:

Listing 5. CPU affinity

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>

struct apple
{
    unsigned long long a;
    unsigned long long b;
};

struct orange
{
    int a[ORANGE_MAX_VALUE];
    int b[ORANGE_MAX_VALUE];
};

static cpu_set_t mask;
static long cpu_nums;

/* the C library on the test system does not declare gettid(),
   so obtain the thread ID through the raw system call */
static pid_t gettid(void)
{
    return syscall(SYS_gettid);
}

/* bind the calling thread to CPU i; returns -1 on failure */
static inline int set_cpu(int i)
{
    CPU_ZERO(&mask);

    if(2 <= cpu_nums)
    {
        CPU_SET(i, &mask);

        /* note: sizeof(mask), not sizeof(&mask) */
        if(-1 == sched_setaffinity(gettid(), sizeof(mask), &mask))
        {
            return -1;
        }
    }
    return 0;
}

void* add(void* x)
{
    unsigned long long sum;

    /* run the apple computation on CPU 1 */
    if(-1 == set_cpu(1))
    {
        return NULL;
    }

    for(sum = 0; sum < APPLE_MAX_VALUE; sum++)
    {
        ((struct apple *)x)->a += sum;
        ((struct apple *)x)->b += sum;
    }

    return NULL;
}

int main (int argc, const char * argv[]) {
    // insert code here...
    struct apple test = {0, 0};
    struct orange test1 = {{0}, {0}};
    unsigned long long sum = 0;
    int index;
    pthread_t ThreadA;

    cpu_nums = sysconf(_SC_NPROCESSORS_CONF);

    /* keep the main thread (the orange computation) on CPU 0 */
    if(-1 == set_cpu(0))
    {
        return -1;
    }

    pthread_create(&ThreadA, NULL, add, &test);

    for(index = 0; index < ORANGE_MAX_VALUE; index++)
    {
        sum += test1.a[index] + test1.b[index];
    }

    pthread_join(ThreadA, NULL);

    return 0;
}

The measurement results are shown in Figure 5:

Figure 5. Comparison of hard affinity time (two threads)

The measurement result is much as expected: it still takes more time than the single-threaded version, and the reason is similar to the analysis above.

Further analysis shows that the sample program spends most of its time computing the apple values. If the computation of a and b is distributed across different CPUs, and the cache effects discussed above are also taken into account, will efficiency improve?

Figure 6. Comparison of hard affinity time (three threads)

In terms of time, the version that sets affinity is slightly slower than the three-thread solution with cache optimization. Since cache effects are already accounted for, eliminating the level-1 cache bottleneck, the extra time is mainly spent in system calls and the kernel, which can be verified with the time command:

#time ./unlockcachemultiprocess
real 0m0.834s user 0m1.644s sys 0m0.004s
#time ./affinityunlockcacheprocess
real 0m0.875s user 0m1.716s sys 0m0.008s

Setting CPU affinity to exploit multi-core features is a shortcut to better application performance, but it is also a double-edged sword: if factors such as load balancing and data contention are ignored, efficiency drops sharply, and the result may be twice the effort for half the gain.

In a concrete design, good data structures and algorithms need to be chosen so that they suit the application's data-movement patterns and the processor's performance characteristics.




Summary

Based on the above analysis and experiments, Figure 7 gives a comprehensive comparison of the measured times of all the improved versions:

Figure 7. Comparison of the time of each solution

The single-threaded original program averages 1.049046 s; the slowest variant, the lock-free three-thread solution, averages 2.217413 s; and the fastest, the three-thread version with 128-byte cache padding, averages 0.826674 s, an efficiency improvement of about 26%. Of course, further optimization could raise the efficiency even more.

The conclusion is not hard to draw: multi-core, multithreaded parallel design can effectively improve performance. But if the design does not consider everything, ignoring factors such as bandwidth, data contention, and improper data synchronization, efficiency drops and the program may even run slower and slower.

If the restrictions stated at the beginning of this article are set aside, using the other data decomposition model mentioned above together with hard affinity to optimize the sample program brings the measured time down to 0.54 s, an efficiency improvement of 92%.

Software optimization is a continuous process that runs through the whole software development cycle, from initial design to final completion. Before optimizing, you need to identify the bottlenecks and hot spots. As the great C master Rob Pike said:

You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second-guess and put in a speed hack until you have proven that's where the bottleneck is.

This sentence is shared here with everyone who works on optimization.
