Performance analysis of multi-core multithreading using Oprofile

Last Update:2018-07-26 Source: Internet

Author: User

Tags benchmark


Using Oprofile to analyze the performance of multi-core, multithreaded programs

Yang Xiaohua

To do good work, one must first sharpen one's tools: a brief introduction to performance analysis tools

When continuously tuning an application, you need not only a complete test benchmark but also a handy key tool: a performance analysis tool.

Depending on the complexity of the tool and the functionality provided, you can divide the performance tools into two levels:

Basic timing tools

In everyday life, the stopwatch is the simplest timing tool. Following the same idea, you can place a timer function anywhere in the code and call it multiple times to measure the elapsed time of the whole application or of one part of it. This method is not very precise, and its measurement error is large.
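The stopwatch approach described above can be sketched in a few lines of Python. This is only an illustration: `busy_work` and `time_call` are hypothetical names introduced here, and the workload is a stand-in for real application code.

```python
import time

def busy_work(n):
    """Stand-in workload: sum of the squares of the first n integers."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def time_call(func, *args, repeats=5):
    """Call func(*args) `repeats` times.

    Returns the last result and the list of elapsed times in seconds.
    """
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func(*args)
        timings.append(time.perf_counter() - start)
    return result, timings

result, timings = time_call(busy_work, 100_000)
print(f"min {min(timings):.6f}s  max {max(timings):.6f}s")
```

The spread between the minimum and maximum timings across identical runs gives a rough feel for the measurement error this method carries.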

Software profiling tools

At present, there are two kinds of software profiling tools: sampling and instrumentation.

• Sampling-based profilers

These tools record performance information such as processor instruction pointers, thread IDs, processor IDs, and event counters, mainly by means of periodic interrupts. This method has low overhead and high precision. On Linux, Oprofile and the Intel VTune Performance Analyzer are the common choices.

• Instrumentation-based profilers

These tools insert analysis code into the application, either by modifying the binary directly or with the help of the compiler. The approach is similar to adding timer functions to your application: it adds considerable overhead but provides more information, such as call trees, call counts, and per-function cost. On Linux, gprof and the Intel VTune Performance Analyzer are the common choices.
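As a rough illustration of the idea, the sketch below inserts probe code around functions with a Python decorator and collects call counts and inclusive time. It is not how gprof or VTune work internally; the names `instrument`, `stats`, `leaf`, and `parent` are hypothetical and exist only for this example.

```python
import functools
import time
from collections import defaultdict

# Per-function statistics collected by the inserted probe code.
stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

def instrument(func):
    """Wrap func so that every call is counted and timed."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = stats[func.__name__]
            entry["calls"] += 1
            # Inclusive time: also covers time spent in callees.
            entry["seconds"] += time.perf_counter() - start
    return wrapper

@instrument
def leaf(x):
    return x * 2

@instrument
def parent(n):
    return sum(leaf(i) for i in range(n))

parent(1000)
for name, entry in stats.items():
    print(name, entry["calls"], f"{entry['seconds']:.6f}s")
```

Every call now pays for two clock reads and a dictionary update, which is the kind of overhead that sampling avoids.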

This article uses the sampling tool Oprofile to analyze the performance of multi-core, multithreaded programs, in the hope that it proves useful.

Ways to measure performance gains

As technology advances, computer architectures are moving toward multi-core, which pushes concurrent programming into the spotlight. But how should the performance benefit of a concurrent program design be measured?

This brings to mind the outstanding contribution of Gene Amdahl: in 1967 he proposed Amdahl's law, which gives the theoretical maximum performance improvement of a parallel program relative to the best serial algorithm.

Amdahl's law

\[ \text{Speedup} = \frac{1}{S + \dfrac{1 - S}{n} + H(n)} \]

Here S is the proportion of the program's execution that is serial, n is the number of processor cores, and H(n) is the system overhead.
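To get a feel for the formula, here is a minimal sketch that evaluates it for a few core counts. The function name `amdahl_speedup` is hypothetical, and treating the overhead term H(n) as a plain number `h` (zero by default) is a simplifying assumption made for this example.

```python
def amdahl_speedup(s, n, h=0.0):
    """Speedup predicted by Amdahl's law.

    s: serial fraction of the program (0 <= s <= 1)
    n: number of processor cores (n >= 1)
    h: system overhead term H(n), passed as a plain number (assumption)
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (s + (1.0 - s) / n + h)

# With 10% serial code and no overhead, the speedup can never exceed 1/s = 10.
for cores in (1, 2, 4, 8, 16):
    print(cores, round(amdahl_speedup(0.1, cores), 2))
```

Even with no overhead, the serial fraction caps the achievable speedup, which explains the pessimism described in the next paragraph.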

Amdahl's law rests on several assumptions that do not necessarily hold in the real world. Even so, it discouraged the computer industry for many years: according to the law, the performance gains from ever greater parallelism might be negligible. This only changed with the advent of Gustafson's law.

Building on work at Sandia National Laboratories, E. Barsis proposed what is known as Gustafson's law:

\[ \text{Scaled speedup} = n + (1 - n)\,S \]

Here S is the proportion of the program's execution that is serial, and n is the number of processor cores.
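The scaled speedup can be evaluated the same way. This is a small sketch with a hypothetical function name, `gustafson_speedup`. In Gustafson's formulation the serial fraction is usually measured on the parallel run, with the problem size growing along with the number of cores.

```python
def gustafson_speedup(s, n):
    """Scaled speedup predicted by Gustafson's law: n + (1 - n) * s.

    s: serial fraction of the execution (0 <= s <= 1)
    n: number of processor cores (n >= 1)
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return n + (1 - n) * s

# The scaled speedup grows linearly with the number of cores.
for cores in (1, 2, 4, 8, 16):
    print(cores, round(gustafson_speedup(0.1, cores), 2))
```

Unlike the fixed-size case, this estimate has no upper bound as n grows, because the parallel part of the workload is assumed to grow with the machine.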

Fortunately, Shi showed in 1996 that Gustafson's law is equivalent to Amdahl's law.

A brief introduction to how Oprofile works

Depending on the CPU architecture, Oprofile supports two types of sampling: event-based sampling and time-based sampling.

If the CPU has performance counter registers, Oprofile uses event-based sampling: it counts occurrences of a particular event, such as a branch prediction event, and takes a sample each time the count reaches a preset value. Otherwise it uses time-based sampling, which relies on the operating system's clock interrupt mechanism and takes a sample on every clock interrupt. It is easy to see that time-based sampling requires that the program under test not mask interrupts, and that it is less accurate than event-based sampling.
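The "sample once the preset count is reached" mechanism can be modeled with a toy simulation. This is not Oprofile code: the function `sample_on_overflow`, the trace format, and the event names `BRANCH_MISS` and `CACHE_MISS` are all made up for illustration.

```python
from collections import Counter

def sample_on_overflow(trace, event, count):
    """Toy model of event-based sampling.

    trace: iterable of (function_name, event_name) pairs, one per event
    event: the event being counted
    count: number of occurrences between two samples
    Returns a Counter mapping function names to the samples taken in them.
    """
    samples = Counter()
    remaining = count  # models a counter preset to overflow after `count` events
    for function_name, event_name in trace:
        if event_name != event:
            continue
        remaining -= 1
        if remaining == 0:
            samples[function_name] += 1  # record where execution was
            remaining = count            # re-arm the counter
    return samples

# Hypothetical trace: most branch misses happen in hot_loop.
trace = (
    [("hot_loop", "BRANCH_MISS")] * 900
    + [("setup", "BRANCH_MISS")] * 100
    + [("setup", "CACHE_MISS")] * 50
)
samples = sample_on_overflow(trace, "BRANCH_MISS", count=100)
print(samples.most_common())
```

The samples end up distributed in proportion to where the counted event occurs, so a smaller `count` gives finer resolution at the price of more interruptions.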

On the x86 architecture, the sampling method also differs between CPU models; the details are shown in the following table:

