Perf Event: a system performance tuning tool under Linux
2011-05-27 10:35 | Liu Ming, IBM developerWorks
Perf Event is a performance diagnostic tool distributed and maintained with the Linux kernel source code, developed by the kernel community. Perf can be used not only for performance analysis of applications, but also for performance statistics and analysis of the kernel code itself. Thanks to its sound architectural design, more and more new features have been added, making perf a versatile set of performance statistics tools. This article introduces the use of perf in application development.
Perf Introduction
Perf is a tool used for software performance analysis.
With it, applications can make use of the PMU, tracepoints, and special counters in the kernel for performance statistics. It can not only analyze the performance of a specified application (per thread) or performance problems in the kernel, but also analyze application code and the kernel at the same time, giving a full picture of an application's performance bottlenecks.
Initially it was called Performance Counters and first appeared in kernel 2.6.31. Since then it has become one of the most active areas of kernel development. In 2.6.32 it was formally renamed Performance Events, because perf was no longer merely an abstraction over the PMU but could handle all performance-related events.
With perf, you can analyze hardware events that occur while a program runs, such as instructions retired and processor clock cycles, as well as software events such as page faults and process switches.
This gives perf numerous performance analysis capabilities. For example, perf can compute the number of instructions executed per clock cycle, known as IPC; a low IPC indicates that the code is not using the CPU well. Perf can sample a program at the function level to find its performance bottlenecks. Perf can even replace strace, add dynamic kernel probe points, and run benchmarks to gauge the quality of the scheduler...
People may call it the "Swiss Army knife" of performance analysis, but I don't like that analogy; I prefer to think of perf as the legendary Yitian sword.
Many characters in Jin Yong's novels are fond of famous blades, even ones they could never deserve to wield. I am afraid I am one of them: walking into a tavern, I would excitedly tell the story of the Yitian sword to anyone I met, familiar or not.
Background knowledge
Analyzing performance issues requires some background knowledge, such as hardware caches and the operating system kernel. An application's behavior is often entangled with these things, and these low-level details affect performance in unexpected ways: some programs cannot take full advantage of the cache, so performance degrades; others make unnecessary or excessive system calls, causing frequent kernel/user switches; and so on. This section merely lays some groundwork for the rest of the article — there is far more to tuning than I know.
Performance-related processor hardware features and the PMU
When the algorithm has been optimized and the code streamlined as far as it will go, the final adjustments demand attention to things people usually ignore: the cache, the pipeline, and the like must now be treated with care.
Hardware feature: the cache
Memory reads and writes are fast, but still cannot match the speed at which the processor executes instructions. To fetch instructions and data from memory, the processor must wait — a long wait, measured in processor time. A cache is SRAM that can be read and written very quickly, fast enough to match the processor. Frequently used data is therefore kept in the cache, so the processor need not wait, which improves performance. Caches are generally small, and making the most of the cache is a very important part of software performance tuning.
Hardware features: pipelining, superscalar architecture, out-of-order execution
One of the most effective ways to improve performance is parallelism, and processor hardware is designed to be as parallel as possible, with techniques such as pipelining, superscalar architecture, and out-of-order execution.
Processing a single instruction takes several steps, such as fetching the instruction, performing the operation, and finally writing the result to the bus. Inside the processor this can be viewed as a three-stage pipeline, as shown in Figure 1:
Figure 1. Processor pipeline
Instructions enter the processor from the left. With three pipeline stages, three instructions can be processed in the same clock cycle, each handled by a different part of the pipeline.
Superscalar describes a pipelined machine architecture that issues multiple instructions in one clock cycle; Intel's Pentium processor, for example, has two execution units and can execute two instructions within a single clock cycle.
In addition, different instructions require different processing steps and numbers of clock cycles. If instructions were executed strictly in program order, the pipeline could not be fully utilized, so instructions may be executed out of order.
These three parallel techniques impose a basic requirement on the instructions being executed: adjacent instructions must not depend on each other. If an instruction needs the result of its predecessor, the pipeline loses its benefit, because the second instruction must wait for the first to complete. Good software should try to avoid generating such code.
Hardware feature: branch prediction
Branch instructions have a significant impact on software performance, especially on pipelined processors. Suppose the pipeline has three stages and the first instruction currently in it is a branch. If the processor fetches instructions sequentially and the branch turns out to jump elsewhere, the two instructions already prefetched into the pipeline must be discarded, hurting performance. Many processors therefore provide branch prediction: based on the historical execution record of the same instruction, they fetch the most likely next instruction rather than fetching sequentially.
Branch prediction places some requirements on program structure: for repetitive branch sequences the prediction hardware does well, while for structures such as switch statements it often cannot achieve ideal results.
The processor features described above all significantly affect software performance, but a profiler that relies only on periodic clock sampling cannot reveal how a program uses them. To address this, processor vendors added a PMU (Performance Monitoring Unit) to the hardware.
The PMU allows software to set up a counter for a given hardware event; the processor then counts occurrences of that event, and when the count exceeds the value programmed into the counter, an interrupt is raised. For example, when the number of cache misses reaches a certain value, the PMU can generate a corresponding interrupt.
By capturing these interrupts, you can investigate how efficiently a program uses these hardware features.
Tracepoints
Tracepoints are hooks scattered through the kernel source code; once enabled, they fire when the code around them runs, and this feature can be used by various trace/debug tools. Perf is one of its users.
If you want to know how the kernel memory-management module behaves while an application runs, you can take advantage of the tracepoints placed in the slab allocator: when the kernel reaches those tracepoints, perf is notified.
Perf records the events generated by tracepoints and produces reports; by analyzing these reports, a performance engineer can understand what the kernel was doing while the program ran and make a more accurate diagnosis of the performance symptoms.
Basic use of Perf
The best way to illustrate a tool is with an example.
Consider the following example program, in which the function longa() is a long, time-wasting loop. The functions foo1() and foo2() call longa() 100 times and 10 times respectively.
Listing 1. Test program t1
test.c

void longa()
{
    int i, j;
    for (i = 0; i < 1000000; i++)
        j = i;  // am I silly or crazy? I feel boring and desperate.
}

void foo2()
{
    int i;
    for (i = 0; i < 10; i++)
        longa();
}

void foo1()
{
    int i;
    for (i = 0; i < 100; i++)
        longa();
}

int main(void)
{
    foo1();
    foo2();
}
Finding this program's performance bottleneck requires no tools at all — the naked eye will do. longa() is the key: make it faster, and the efficiency of the whole program improves greatly.
But precisely because it is so simple, it is suitable for demonstrating the basic use of perf. If perf told you the bottleneck of this program were elsewhere, you could stop wasting your valuable time on this article.
Preparing to use perf
Installing perf is very simple: as long as you have the source of kernel 2.6.31 or later, enter the tools/perf directory and type the following two commands:
make
make install
The basic principle of performance tools such as perf and OProfile is to sample the monitored object. In the simplest case, sampling happens in the tick interrupt: a sampling point fires inside the tick handler, and the program context at that point is recorded. If a program spends 90% of its time in function foo(), then roughly 90% of the sample points should fall inside foo(). Luck is unpredictable, but as long as the sampling frequency is high enough and the sampling period long enough, this inference is quite reliable. So with tick-triggered sampling, we can see where a program spends the most time and focus our analysis there.
Extending the idea a little, we find that changing the trigger condition of the sampling yields different statistics:
Sampling on a point in time (such as the tick) reveals the distribution of the program's run time.
Sampling on cache-miss events reveals the distribution of cache misses, that is, where in the program code cache misses frequently occur. And so on.
So let us see which events in perf can trigger sampling.
perf list: perf events
Use the perf list command to list all events that can trigger perf sample points. For example:
$ perf list
List of pre-defined events (to be used in -e):

  cpu-cycles OR cycles                   [Hardware event]
  instructions                           [Hardware event]
  ...
  cpu-clock                              [Software event]
  task-clock                             [Software event]
  context-switches OR cs                 [Software event]
  ...
  ext4:ext4_allocate_inode               [Tracepoint event]
  kmem:kmalloc                           [Tracepoint event]
  module:module_load                     [Tracepoint event]
  workqueue:workqueue_execution          [Tracepoint event]
  sched:sched_{wakeup,switch}            [Tracepoint event]
  syscalls:sys_{enter,exit}_epoll_wait   [Tracepoint event]
  ...
Different systems list different results; on a 2.6.35 kernel the list is quite long, but however many entries there are, they fall into three classes:
Hardware events are generated by the PMU, such as cache hits; sample these when you need to understand how an application uses the hardware features.
Software events are generated by kernel software, such as process switches and tick counts.
Tracepoint events are triggered by the static tracepoints in the kernel; they reveal details of the kernel's behavior while the program runs, such as the number of allocations made by the slab allocator.
Any of these events can be sampled to produce statistics, but to date there is no document explaining in detail what each event means. I hope we can work together toward documenting every event...
Perf Stat
It is best to do everything in an orderly way. A veteran works calmly and step by step, while a novice darts about at a loss.
Faced with a problem program, it is best to adopt a top-down strategy: first look at the overall statistics for the run, then drill down into details in a particular direction. Do not plunge straight into trivia, or you will lose sight of the whole.
Some programs are slow because they compute too much, spending most of their time on the CPU; these are called CPU-bound. Others are slow because of excessive I/O, during which CPU utilization is low; these are I/O-bound. Tuning a CPU-bound program differs from tuning an I/O-bound one.
If you agree with all this, perf stat should be the first tool you reach for. It provides a concise overview and summary statistics for the debugged program's run.
Remember the example program we prepared earlier? Now compile it into the executable t1:
gcc -o t1 -g test.c
The following shows the output of perf stat for program T1:
$ perf stat ./t1

 Performance counter stats for './t1':

     262.738415  task-clock-msecs     #   0.991 CPUs
              2  context-switches     #   0.000 M/sec
              1  CPU-migrations       #   0.000 M/sec
             81  page-faults          #   0.000 M/sec
        9478851  cycles               #  36.077 M/sec   (scaled from 98.24%)
           6771  instructions         #   0.001 IPC     (scaled from 98.99%)
      111114049  branches             # 422.908 M/sec   (scaled from 99.37%)
           8495  branch-misses        #   0.008 %       (scaled from 95.91%)
       12152161  cache-references     #  46.252 M/sec   (scaled from 96.16%)
        7245338  cache-misses         #  27.576 M/sec   (scaled from 95.49%)

    0.265238069  seconds time elapsed
The output above tells us that t1 is a CPU-bound program, because the CPU ratio reported next to task-clock-msecs is close to 1.
Tuning t1 therefore means finding its hotspots (the most time-consuming code) and seeing whether their efficiency can be improved.
By default, besides task-clock-msecs, perf stat reports several other commonly used statistics:
task-clock-msecs: CPU utilization; a high value indicates that the program spends its time on CPU computation rather than I/O.
context-switches: the number of process switches during the run; frequent switching should be avoided.
cache-misses: the program's overall cache utilization during the run; if this value is too high, the program uses the cache poorly.
CPU-migrations: how many times process t1 was migrated during the run, that is, moved by the scheduler from one CPU to another.
cycles: processor clock cycles; one machine instruction may require several cycles.
instructions: the number of machine instructions executed.
IPC: the ratio instructions/cycles; the larger it is, the better the program exploits the processor.
cache-references: the number of cache hits.
cache-misses: the number of cache misses.
By specifying the -e option you can change the default events of perf stat (events were explained in the previous section and can be listed with perf list). If you already have plenty of tuning experience, you may use -e to watch the particular events you are interested in.
Perf Top
By the time you run perf stat, you often already have a tuning target, like the boring t1 I just wrote.
Sometimes, though, you only notice that system performance has degraded, and it is unclear which process has become the greedy hog.
Then you need a top-like command that lists all suspicious processes, from which you find the one needing further scrutiny — much as police officers review surveillance footage to pick out the people behaving oddly, rather than seizing everyone in the street for interrogation.
perf top displays the performance statistics of the current system in real time. It is mainly used to observe the state of the system as a whole, for example to see which kernel functions or user processes are currently consuming the most time.
Let's design an example to illustrate it.
I don't know about you, but I find it hard to write something good and very easy to write something bad. I quickly came up with the program shown in Listing 2:
Listing 2. A dead loop
while (1) i++;
I call it t2. Start t2 and observe it with perf top.
The following is possible output of perf top:
   PerfTop:     705 irqs/sec  kernel:60.4%  [1000Hz cycles]
--------------------------------------------------
   samples  pcnt   function                 DSO
   1503.00  49.2%  t2
     72.00   2.2%  pthread_mutex_lock       /lib/libpthread-2.12.so
     68.00   2.1%  delay_tsc                [kernel.kallsyms]
     55.00   1.7%  aes_dec_blk              [aes_i586]
     55.00   1.7%  drm_clflush_pages        [drm]
     52.00   1.6%  system_call              [kernel.kallsyms]
     49.00   1.5%  __memcpy_ssse3           /lib/libc-2.12.so
     48.00   1.4%  __strstr_ia32            /lib/libc-2.12.so
     46.00   1.4%  unix_poll                [kernel.kallsyms]
     42.00   1.3%  __ieee754_pow            /lib/libm-2.12.so
     41.00   1.2%  do_select                [kernel.kallsyms]
     40.00   1.2%  pixman_rasterize_edges   libpixman-1.so.0.18.0
     37.00   1.1%  _raw_spin_lock_irqsave   [kernel.kallsyms]
     36.00   1.1%  _int_malloc              /lib/libc-2.12.so
^C
It is easy to spot t2 as a suspicious program that deserves attention. Its modus operandi, though, is too simple: recklessly wasting CPU. We need nothing further to locate the problem here; in real life, performance-hogging programs are rarely this stupid, and we often need the other perf tools for deeper analysis.
By adding the -e option, you can list the top processes/functions for other events — for example, -e cache-misses shows who causes the most cache misses.
Using perf record and interpreting the report
After using top and stat, you probably have a rough idea of what is going on. For further analysis you need finer-grained information. Suppose, for example, you have determined that the target program is computation-heavy, perhaps because some of its code was written carelessly. Faced with a long source file, which lines should be improved? That calls for function-level statistics, recorded with perf record and displayed with perf report.
Your tuning should focus on hot code that accounts for a large share of the run time. If a piece of code takes 0.1% of the program's run time, then even optimizing it down to a single machine instruction improves overall performance by only about 0.1%. As the saying goes, good steel should be used on the blade's edge — I need say no more.
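The arithmetic behind this rule of thumb is Amdahl's law (my framing, not the author's): if the code you optimize accounts for a fraction p of total run time and you speed it up by a factor s, the whole program speeds up by

```latex
S = \frac{1}{(1 - p) + p/s},
\qquad\text{so with } p = 0.001 \text{ and even } s \to \infty:\quad
S \to \frac{1}{0.999} \approx 1.001
```

that is, at most about a 0.1% improvement, matching the estimate above.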
Still take T1 as an example.
perf record -e cpu-clock ./t1
perf report
The results are as follows:
Figure 2. Perf Report Example
As expected, the hot spot is the longa() function.
But suppose the code were complex and hard to judge by eye: foo1() in t1 is also a potential tuning target — why call that boring longa() 100 times? Yet in Figure 2 we cannot see foo1 and foo2 at all, nor tell how they differ.
I once found that nearly half of a program I wrote was spending its time in a few methods of the string class. string is part of the C++ standard, and I will never write better code than the STL, so I had to find the places in my own program where string was overused. In other words, I needed the statistics displayed according to call relationships.
Use perf's -g option to obtain the information:
perf record -e cpu-clock -g ./t1
perf report
The results are as follows:
Figure 3. perf -g report example
From the call-graph analysis it is easy to see that 91% of the time is spent in the foo1() function, because it calls longa() 100 times. If longa() is a function that cannot be optimized, the programmer should consider optimizing foo1() to reduce the number of calls to longa().
Examples of using PMU
The examples t1 and t2 are rather simple. As the saying goes, as vice climbs a foot, virtue climbs ten. To show perf's more powerful side, I needed a performance puzzle I could not solve by eye and had to borrow from others. The example t3 below comes from the article "Branch and Loop Reorganization to Prevent Mispredicts" [6].
This example examines the program's use of branch prediction on a Pentium processor. As mentioned earlier, successful branch prediction can significantly improve processor performance, while failed prediction degrades it significantly. First, an example that suffers BTB (branch target buffer) misses:
Listing 3. A program with BTB misses
test.c

void foo()
{
    int i, j;
    for (i = 0; i < 20; i++)
        j++;
}

int main(void)
{
    int i;
    for (i = 0; i < 100000000; i++)
        foo();
    return 0;
}

Compile with gcc to generate the test program t3:
gcc -o t3 -O0 test.c
Use perf stat to investigate the branch prediction behavior:
$ perf stat ./t3

 Performance counter stats for './t3':

    6240.758394  task-clock-msecs     #   0.995 CPUs
            126  context-switches     #   0.000 M/sec
             12  CPU-migrations       #   0.000 M/sec
             80  page-faults          #   0.000 M/sec
       17683221  cycles               #   2.834 M/sec   (scaled from 99.78%)
       10218147  instructions         #   0.578 IPC     (scaled from 99.83%)
     2491317951  branches             # 399.201 M/sec   (scaled from 99.88%)
      636140932  branch-misses        #  25.534 %       (scaled from 99.63%)
      126383570  cache-references     #  20.251 M/sec   (scaled from 99.68%)
      942937348  cache-misses         # 151.093 M/sec   (scaled from 99.58%)

    6.271917679  seconds time elapsed
You can see that branch-misses is serious: about 25%. The processor in my test machine is a Pentium 4, whose BTB has 16 entries, while the loop in test.c iterates 20 times. The BTB overflows, so the processor's branch prediction becomes inaccurate.
I will explain this briefly below; for the details of the BTB, please read reference [6].
The for loop compiles to IA32 assembly as follows:
Listing 4. Assembly of the loop
C code:

    for (i = 0; i < 20; i++)
    {
        ...
    }

Assembly code:

        mov  esi, data
        mov  ecx, 0
forloop:
        cmp  ecx, 20
        jge  endforloop
        ...
        add  ecx, 1
        jmp  forloop
endforloop:
As you can see, each iteration of the loop contains the branch instruction jge, so 20 branch decisions are made per run of the loop. Each decision is written into the BTB, but the BTB is a ring buffer with 16 slots: once it is full, it begins overwriting. If the iteration count were exactly 16, or fewer, the whole loop would be written into the BTB; with 4 iterations, for example, the BTB would look as shown in Figure 4:
Figure 4. BTB Buffer
This buffer describes the branching of the entire loop completely and accurately, so the next time the same loop runs, the processor can predict perfectly. If the iteration count is 20, however, the BTB cannot fully describe the loop's branching, and the processor makes wrong predictions.
Let us modify the test program slightly: reduce the iteration count from 20 to 10 and, to keep the logic unchanged, turn j++ into j += 2:
Listing 5. Code without BTB misses

void foo()
{
    int i, j;
    for (i = 0; i < 10; i++)
        j += 2;
}

int main(void)
{
    int i;
    for (i = 0; i < 100000000; i++)
        foo();
    return 0;
}
At this point, the following results are obtained with perf stat sampling:
$ perf stat ./t3

 Performance counter stats for './t3':

    2784.004851  task-clock-msecs     #   0.927 CPUs
             90  context-switches     #   0.000 M/sec
              8  CPU-migrations       #   0.000 M/sec
             81  page-faults          #   0.000 M/sec
       33632545  cycles               #  12.081 M/sec   (scaled from 99.63%)
          42996  instructions         #   0.001 IPC     (scaled from 99.71%)
     1474321780  branches             # 529.569 M/sec   (scaled from 99.78%)
          49733  branch-misses        #   0.003 %       (scaled from 99.35%)
        7073107  cache-references     #   2.541 M/sec   (scaled from 99.42%)
       47958540  cache-misses         #  17.226 M/sec   (scaled from 99.33%)

    3.002673524  seconds time elapsed
The branch-misses figure has dropped dramatically, to about 0.003%.
This example only demonstrates perf's use of the PMU; it is not meaningful in itself. For tuning that makes full use of the processor, refer to Intel's optimization manuals; other processors may require different methods — readers, please judge accordingly.
Summary
The perf usages above focus on performance analysis of applications; the second part of this article will describe some special uses of perf, focusing on performance statistics for the kernel itself.
Tuning is work that requires comprehensive knowledge and constant cultivation. Perf may well be a fine sword, but a sword suits a hero: only a highly skilled warrior can wield it at will. With my modest skill, I can only retell hearsay about famous blades. Still, if this article has aroused your interest in them, it has served some purpose.