Starting with kernel 2.6.31, Linux ships with the performance analysis tool perf, which can be used for function-level and instruction-level hotspot analysis.
Perf
Performance analysis tools for Linux.
Performance counters for Linux are a kernel-based subsystem that provides a framework for all things performance analysis. It covers hardware-level features (CPU/PMU, Performance Monitoring Unit) as well as software features (software counters, tracepoints).
Perf is a performance profiling tool built into the Linux kernel source tree.
Based on the principle of event sampling, it supports performance analysis of both processor-related and operating-system-related performance metrics.
It is often used to find performance bottlenecks and locate hotspot code.
CPU cycles (cpu-cycles) is the default performance event. A CPU cycle is the smallest unit of time the CPU can recognize, usually a few hundred-millionths of a second.
It is the time the CPU needs to execute its simplest instruction, such as reading the contents of a register, and is also known as a clock tick.
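For a rough sense of scale (simple arithmetic, not from the original text): a CPU clocked at 1 GHz completes 10^9 cycles per second, so one cycle lasts 1 ns; at 2.5 GHz, one cycle lasts 0.4 ns.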
Perf is a tool set containing 22 sub-tools. The following are the five most commonly used:
Perf-list
Perf-stat
Perf-top
Perf-record
Perf-report
Perf-list
Perf-list is used to view performance events supported by perf, including software and hardware.
List all symbolic event types.
perf list [hw | sw | cache | tracepoint | event_glob]
(1) Breakdown of performance events
HW: hardware events, 9 in total.
SW: software events, 9 in total.
Cache: hardware cache events, 26 in total.
Tracepoint: tracepoint events, 775 in total.
SW events are actually kernel counters and have nothing to do with the hardware.
HW and cache events are tied to the CPU architecture and depend on specific hardware.
Tracepoint events are based on the kernel's ftrace framework and are supported by mainline kernels from 2.6.3x onward.
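For example (a hedged illustration; the exact events and counts vary by kernel and hardware), the listing can be restricted to one category or filtered with a glob:
# perf list sw            // list only software events
# perf list 'sched:*'     // list only the sched subsystem's tracepoints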
(2) Specifying a performance event (with modifiers)
-e <event>:u // count in user space only
-e <event>:k // count in kernel space only
-e <event>:h // count in the hypervisor only
-e <event>:G // guest counting (in KVM guests)
-e <event>:H // host counting (not in KVM guests)
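As an illustration of the modifiers (a hedged sketch; any counter that supports the u/k modifiers would do), user-space and kernel cycles can be counted separately for one command:
# perf stat -e cycles:u -e cycles:k ls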
(3) Examples
Display the functions that consume the most CPU cycles in the kernel and in modules:
# perf top -e cycles:k
Display the functions that allocate the most slab cache objects:
# perf top -e kmem:kmem_cache_alloc
Perf-top
For a specified performance event (CPU cycles by default), display the functions or instructions that consume it the most.
System profiling tool.
Generates and displays a performance counter profile in real time.
perf top [-e <event> | --event=event] [<options>]
Perf top is mainly used to analyze, in real time, how hot each function is for a given performance event, and can quickly locate hotspot functions, including application functions,
module functions, and kernel functions; it can even locate hotspot instructions. The default performance event is CPU cycles.
(1) Output Format
# perf top
Samples: 1M of event 'cycles', Event count (approx.): 73891391490
  5.44%  perf           [.] 0x0000000000023256
  4.86%  [kernel]       [k] _spin_lock
  2.43%  [kernel]       [k] _spin_lock_bh
  2.29%  [kernel]       [k] _spin_lock_irqsave
  1.77%  [kernel]       [k] __d_lookup
  1.55%  libc-2.12.so   [.] __strcmp_sse42
  1.43%  nginx          [.] ngx_vslprintf
  1.37%  [kernel]       [k] tcp_poll
Column 1: percentage of the performance event attributed to the symbol; by default, the percentage of CPU cycles consumed.
Column 2: the DSO (Dynamic Shared Object) containing the symbol; it can be an application, the kernel, a dynamic link library, or a module.
Column 3: DSO type. [.] means the symbol comes from a user-space ELF file (an executable or a dynamic link library); [k] means the symbol belongs to the kernel or a module.
Column 4: symbol name. Some symbols cannot be resolved to function names and are shown only as addresses.
(2) Common interactive commands
h: show help.
UP/DOWN/PGUP/PGDN/SPACE: move up and down, or turn pages.
a: annotate the current symbol; shows the annotated assembly together with the sampling rate of each instruction.
d: filter out all symbols that do not belong to the current DSO; very convenient for viewing symbols of the same category.
P: save the current information to perf.hist.N.
(3) Common command-line parameters
-e <event>: specify the performance event to analyze.
-p <pid>: profile events on an existing process ID (comma-separated list); only the target processes and the threads they create are analyzed.
-k <path>: path to vmlinux; required for the annotation functionality (symbols are resolved against the kernel image's symbol table).
-K: do not display symbols that belong to the kernel or to modules.
-U: do not display symbols of user-space programs.
-d <n>: UI refresh period; defaults to 2 s, because perf top reads data from the mmap memory area once every 2 s.
-g: obtain the function call graph.
perf top -g fractal: path probabilities are relative, adding up to 100%; the call order is bottom-up.
perf top -g graph: path probabilities are absolute, adding up to the heat of the function.
(4) Examples
# perf top                        // default configuration
# perf top -g                     // obtain the call graph
# perf top -e cycles              // specify the performance event
# perf top -p 23015,32476         // view the CPU cycles usage of these two processes
# perf top -s comm,pid,symbol     // display the process name and PID calling each symbol
# perf top --comms nginx,top      // only display symbols of the specified processes
# perf top --symbols kfree        // only display the specified symbol
Perf-stat
Analyzes the performance of a specified program.
Run a command and gather performance counter statistics.
perf stat [-e <event> | --event=event] [-a] <command>
perf stat [-e <event> | --event=event] [-a] -- <command> [<options>]
(1) Output Format
# perf stat ls
Performance counter stats for 'ls':

          0.653782 task-clock                #    0.691 CPUs utilized
                 0 context-switches          #    0.000 K/sec
                 0 CPU-migrations             #    0.000 K/sec
               247 page-faults               #    0.378 M/sec
         1,625,426 cycles                    #    2.486 GHz
         1,050,293 stalled-cycles-frontend   #   64.62% frontend cycles idle
           838,781 stalled-cycles-backend    #   51.60% backend cycles idle
         1,055,735 instructions              #    0.65  insns per cycle
                                             #    0.99  stalled cycles per insn
           210,587 branches                  #  322.106 M/sec
            10,809 branch-misses             #    5.13% of all branches

       0.000945883 seconds time elapsed
The output includes the execution time of ls and 10 performance events.
Task-clock: the processor time actually consumed by the task, in ms. CPUs utilized = task-clock / time elapsed, i.e., the CPU usage (worked out just after this list).
Context-switches: the number of context switches.
CPU-migrations: the number of processor migrations; to keep the load balanced across processors, Linux may migrate a process from one CPU to another.
Page-faults: the number of page faults. A page fault is triggered when the page requested by the application has not been created yet, is not in memory, or when the mapping between its physical and virtual addresses has not been established. A TLB miss or a page-access permission mismatch also triggers page faults.
Cycles: the number of processor cycles consumed. Treating the CPU cycles used by ls against its task-clock, the effective clock rate is 2.486 GHz, which can be computed as cycles / task-clock.
Stalled-cycles-frontend: skipped here.
Stalled-cycles-backend: skipped here.
Instructions: the number of instructions executed. IPC is the average number of instructions executed per CPU cycle.
Branches: the number of branch instructions encountered; branch-misses is the number of branch instructions whose prediction was wrong.
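As a sanity check, the derived figures follow from the raw counts in the sample output above (simple arithmetic, added here; not part of the original article):
CPUs utilized = 0.653782 ms / 0.945883 ms ≈ 0.691
Clock rate = 1,625,426 cycles / 0.653782 ms ≈ 2.486 GHz
IPC = 1,055,735 instructions / 1,625,426 cycles ≈ 0.65 insns per cycle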
(2) Common Parameters
-p: stat events on an existing process ID (comma-separated list); only the target processes and the threads they create are analyzed.
-a: system-wide collection from all CPUs; collect performance data from all CPUs.
-r: repeat the command and print average + stddev (max: 100); run the command repeatedly and compute the averages.
-C: count only on the provided list of CPUs (comma-separated list); collect performance data from the specified CPUs.
-v: be more verbose (show counter open errors, etc.); show more performance data.
-n: null run, do not start any counters; only the task's execution time is displayed.
-x SEP: specify the separator between output columns.
-o file: specify the output file; --append specifies append mode.
--pre <cmd>: a command executed before the target program runs.
--post <cmd>: a command executed after the target program runs.
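A hedged sketch of --pre/--post (the sync and echo commands are only illustrative choices of setup and teardown steps):
# perf stat --pre 'sync' --post 'echo done' ls > /dev/null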
(3) Examples
Run the command 10 times and print the average plus the ratio of standard deviation to mean:
# perf stat -r 10 ls > /dev/null
Show more details:
# perf stat -v ls > /dev/null
Only display the task execution time, without the performance counters:
# perf stat -n ls > /dev/null
Report the information for each CPU separately:
# perf stat -a -A ls > /dev/null
Count how many system calls the ls command performed:
# perf stat -e raw_syscalls:sys_enter ls
Perf-record
Collect sampling information and record it in the data file.
The data file can then be analyzed with other tools (perf report); the results are similar to those of perf top.
Run a command and record its profile into perf.data.
This command runs a command and gathers a performance counter profile from it, into perf.data,
without displaying anything. This file can then be inspected later on, using perf report.
(1) Common Parameters
-e: select the PMU event.
-a: system-wide collection from all CPUs.
-p: record events on an existing process ID (comma-separated list).
-A: append to the output file for incremental profiling.
-f: overwrite the existing data file.
-o: output file name.
-g: do call-graph (stack chain/backtrace) recording.
-C: collect samples only on the provided list of CPUs.
(2) Examples
Record the performance data of the nginx processes:
# perf record -p $(pgrep -d ',' nginx)
Record performance data, with call graphs, while ls executes:
# perf record -g ls
Record the system calls made while ls executes, to learn which calls are the most frequent:
# perf record -e raw_syscalls:sys_enter ls
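A typical record-then-report workflow might look like this (a hedged sketch; the system-wide 30-second window is illustrative):
# perf record -g -a sleep 30    // sample all CPUs with call graphs for 30 seconds
# perf report                   // then browse the recorded profile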
Perf-report
Read the data file created by perf record and provide the hotspot analysis result.
Read perf.data (created by perf record) and display the profile.
This command displays the performance counter profile information recorded via perf record.
(1) Common Parameters
-i: input file name (default: perf.data).
(2) Example
# perf report -i perf.data.2
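The report can also be regrouped at read time, for instance by process and DSO (a hedged illustration using perf report's --sort option):
# perf report -i perf.data.2 --sort comm,dso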
More
In addition to the five commonly used tools above, there are tools suited to special scenarios, such as analysis of kernel locks, the slab allocator, and the scheduler;
custom probe points are also supported.
Perf-lock
Kernel lock performance analysis.
Analyze lock events.
perf lock {record | report | script | info}
Requires the kernel compile options CONFIG_LOCKDEP and CONFIG_LOCK_STAT.
CONFIG_LOCKDEP defines the acquired and release events.
CONFIG_LOCK_STAT defines the contended and acquired lock events.
(1) Common options
-i <file>: input file.
-k <value>: sorting key; defaults to acquired; can also sort by contended, wait_total, wait_max, or wait_min.
(2) Example
# perf lock record ls    // record
# perf lock report       // report
(3) Output Format
                Name   acquired  contended  total wait (ns)  max wait (ns)  min wait (ns)
 &mm->page_table_...        382          0                0              0              0
 &mm->page_table_...         72          0                0              0              0
           &fs->lock         64          0                0              0              0
         dcache_lock         62          0                0              0              0
       vfsmount_lock         43          0                0              0              0
 &newf->file_lock...         41          0                0              0              0
Name: name of the kernel lock.
Acquired: the number of times the lock was acquired directly; no other kernel path held the lock, so no waiting was needed.
Contended: the number of times the lock was acquired after waiting; it was held by another kernel path and had to be waited for.
Total wait: Total wait time to obtain the lock.
Max wait: maximum wait time to obtain the lock.
Min wait: minimum wait time to obtain the lock.
There is also a summary:
=== output for debug ===
bad: 10, total: 246
bad rate: 4.065041 %
histogram of events caused bad sequence
    acquire: 0
   acquired: 0
  contended: 0
    release: 10
Perf-kmem
Performance analysis of the slab allocator.
Tool to trace/measure kernel memory (slab) properties.
perf kmem {record | stat} [<options>]
(1) Common options
-i <file>: input file.
--caller: show per-callsite statistics, i.e., where in the kernel kmalloc and kfree are called.
--alloc: show per-allocation statistics; displays the allocated memory addresses.
-l <num>: print num lines only.
-s <key[,key2...]>: sort the output (default: frag,hit,bytes).
(2) Example
# perf kmem record ls                      // record
# perf kmem stat --caller --alloc -l 20    // report
(3) Output Format
---------------------------------------------------------------------------------------------
            Callsite | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag
---------------------------------------------------------------------------------------------
  perf_event_mmap+ec |     311296/8192 |   155952/4104 |  38 |         0 | 49.902%
    proc_reg_open+41 |           64/64 |         40/40 |   1 |         0 | 37.500%
  __kmalloc_node+4d  |       1024/1024 |       664/664 |   1 |         0 | 35.156%
   ext3_readdir+5bd  |           64/64 |         48/48 |   1 |         0 | 25.000%
 load_elf_binary+8ec |         512/512 |       392/392 |   1 |         0 | 23.438%
Callsite: the location in the kernel code where kmalloc/kfree is called.
Total_alloc/Per: total memory allocated, and the average allocated per call.
Total_req/Per: total memory requested, and the average requested per call.
Hit: the number of calls.
Ping-pong: the number of times kmalloc and kfree were not executed on the same CPU, which degrades cache efficiency.
Frag: the fragmentation percentage. Fragmentation = allocated memory - requested memory, which is wasted.
If the --alloc option is used, Alloc Ptr, the address of the allocated memory, is also displayed.
There is also a summary:
SUMMARY
=======
Total bytes requested: 290544
Total bytes allocated: 447016
Total bytes wasted on internal fragmentation: 156472
Internal fragmentation: 35.003669%
Cross CPU allocations: 2/509
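These summary figures are internally consistent (simple arithmetic on the values shown, added here as a check):
447016 - 290544 = 156472 bytes wasted
156472 / 447016 ≈ 35.0037% internal fragmentation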
Perf-sched
Analysis of the scheduling module.
Trace/measure scheduler properties (latencies).
perf sched {record | latency | map | replay | script}
(1) Example
# perf sched record sleep 10       // perf sched record <command>
# perf sched latency --sort max    // report, sorted by maximum delay
(2) Output Format
---------------------------------------------------------------------------------------------------------------
 Task             | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at      |
---------------------------------------------------------------------------------------------------------------
 events/10:61     |   0.655 ms |       10 | avg: 0.045 ms    | max: 0.161 ms    | max at: 9804.958730 s
 sleep:11156      |   2.263 ms |        4 | avg: 0.052 ms    | max: 0.118 ms    | max at: 9804.865552 s
 edac-poller:1125 |   0.598 ms |       10 | avg: 0.042 ms    | max: 0.113 ms    | max at: 9804.958698 s
 events/2:53      |   0.676 ms |       10 | avg: 0.037 ms    | max: 0.102 ms    | max at: 9814.751605 s
 perf:11155       |   2.109 ms |        1 | avg: 0.068 ms    | max: 0.068 ms    | max at: 9814.867918 s
Task: process name and PID.
Runtime: the actual running time.
Switches: the number of process switches.
Average delay: Average scheduling latency.
Maximum delay: Maximum scheduling latency.
Maximum delay at: the time when the maximum scheduling latency occurs.
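The other subcommands can be exercised on the same recording (a hedged sketch; their output depends on the captured workload):
# perf sched map       // per-CPU view of which task ran where over time
# perf sched replay    // re-simulate the recorded scheduling workload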
Perf-probe
Lets you define custom probe points.
Define new dynamic tracepoints.
Example
(1) Display which lines in schedule() can be probed:
# perf probe --line schedule
Lines shown with a line number can be probed; lines without one cannot.
(2) Add a probe at line 12 of the schedule() function:
# perf probe -a schedule:12
This adds a probe point at line 12 of schedule().
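Once defined, the probe behaves like any other event (a hedged sketch; the exact event name perf assigns can be checked with perf probe -l):
# perf probe -l                              // list the currently defined probe points
# perf record -e probe:schedule -a sleep 1   // sample the new event system-wide
# perf probe -d probe:schedule               // delete the probe when done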
Reference
[1] Linux system-level performance profiling tool series, by Chenggang:
http://www.ibm.com/developerworks/cn/linux/l-cn-perf1/
http://www.ibm.com/developerworks/cn/linux/l-cn-perf2/
[2] Perf tutorial: https://perf.wiki.kernel.org/index.php/Tutorial