How to analyze CPU bottlenecks and related operations

Source: Internet
Author: User

The following is compiled from reprinted material together with the author's own initial experience.

Vmstat

[root@master ~]# vmstat -n 3
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free   buff   cache   si   so    bi    bo   in   cs us sy  id wa st
 0  0 115516 6043024 430340 8691840    0    0     2    22    1   -1  1  0  99  0  0
 0  0 115516 6043024 430340 8691840    0    0     0     0 1124 -751  0  0 100  0  0
 0  0 115516 6043148 430344 8691840    0    0     0    25 1070 -762  0  0 100  0  0

Procs

r: the number of processes in the run queue. If r stays above the number of CPUs in the system, the system is running slowly, with most processes waiting for a CPU.

If r is more than 4 times the number of available CPUs, the system is short of CPU capacity or the CPUs are too slow: most processes are waiting for a CPU, so everything on the system runs too slowly.

System

in: the number of interrupts per second

cs: the number of context switches per second

The larger these two values are, the more CPU time the kernel consumes.

CPU

us: percentage of CPU time consumed by user processes. A high us value means user processes are using a lot of CPU time; if it stays above 50% over the long term, consider optimizing the program or its algorithms.

sy: percentage of CPU time consumed by the kernel. A high sy value means the kernel itself is using a lot of CPU, which is not healthy, and the cause should be investigated.

wa: percentage of CPU time spent waiting for I/O. A high wa value means I/O wait is serious, which may be caused by heavy random disk access or by a disk bottleneck (for example, on block operations).

id: percentage of CPU time spent idle. If id stays at 0 and system time (sy) is twice the user time (us), the system is short of CPU resources.
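The rules of thumb above can be sketched as a small check on one vmstat data row. This is only an illustration, not a monitoring tool: the names `parse_vmstat_row` and `cpu_warnings` are made up for this example, the column list assumes the 17-column vmstat layout shown earlier, and the sample row in the usage section is invented. The rules in the text refer to sustained values, whereas this sketch looks at a single sample.

```python
# Column names of a vmstat data row, in the order of the vmstat header.
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()


def parse_vmstat_row(line):
    """Turn one vmstat data row into a dict keyed by column name."""
    values = [int(token) for token in line.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError(
            "expected %d columns, got %d" % (len(VMSTAT_COLUMNS), len(values))
        )
    return dict(zip(VMSTAT_COLUMNS, values))


def cpu_warnings(sample, ncpu):
    """Apply the article's rules of thumb to one sample (hypothetical helper)."""
    warnings = []
    if sample["r"] > 4 * ncpu:
        warnings.append("run queue is more than 4x the CPU count: CPU shortage")
    elif sample["r"] > ncpu:
        warnings.append("run queue exceeds the CPU count: processes are waiting")
    if sample["us"] > 50:
        warnings.append("user time above 50%: consider optimizing the program")
    if sample["id"] == 0 and sample["sy"] >= 2 * sample["us"]:
        warnings.append("no idle time and system time >= 2x user time: CPU shortage")
    return warnings


if __name__ == "__main__":
    # Invented row for a 4-CPU machine, chosen to trigger two of the rules.
    busy = parse_vmstat_row("20 0 0 1000 100 1000 0 0 0 0 500 900 30 70 0 0 0")
    for message in cpu_warnings(busy, ncpu=4):
        print(message)
```

In practice the check would be run over many consecutive samples, and a warning would only matter if it kept appearing.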

Workaround:

When these problems occur, tune the application's CPU usage so that it uses the CPU more efficiently, and consider adding more CPUs. To see how the CPU is actually being used and which processes take up most of the CPU time, combine vmstat with commands such as mpstat, ps aux, top, and mpstat -A. In general, the application itself is the most likely source of the problem.

Sar

Usage: sar [ options... ] [ <interval> [ <count> ] ]
Options are:
[ -A ] [ -b ] [ -B ] [ -c ] [ -d ] [ -i <interval> ] [ -p ] [ -q ]
[ -r ] [ -R ] [ -t ] [ -u ] [ -v ] [ -V ] [ -w ] [ -W ] [ -y ]
[ -I { <irq> | SUM | ALL | XALL } ] [ -P { <cpu> | ALL } ]
[ -n { DEV | EDEV | NFS | NFSD | SOCK | ALL } ]
[ -x { <pid> | SELF | ALL } ] [ -X { <pid> | SELF | ALL } ]
[ -o [ <filename> ] | -f [ <filename> ] ]
[-S [

On the command line, interval and count together define the sampling: interval (t) is the sampling interval and is required, while count (n) is the number of samples and is optional, with a default of 1. -o file stores the result of the command in the file in binary format. options are the optional flags for the command:

-A: Sum of all reports.
-u: CPU utilization.
-v: State of the process, inode, file, and lock tables.
-d: Hard disk usage report.
-r: Usage statistics for memory and swap space.
-g: Serial I/O activity.
-b: Buffer usage.
-a: File read/write activity.
-c: System call activity.
-q: Run queue length and system load average.
-R: Process activity.
-y: Terminal device activity.
-w: System swapping activity.
-x {pid | SELF | ALL}: Reports statistics for the specified process ID; the SELF keyword gives statistics for the sar process itself, and the ALL keyword gives statistics for all system processes.

Analyzing CPU utilization with sar

[root@master ~]# sar -u 2 10
Linux 2.6.18-194.el5 (master)   12/13/2012

06:50:01 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
06:50:03 PM       all      1.50      0.08      0.58      7.24      0.00     90.60
06:50:05 PM       all      3.25      0.17      0.58      6.74      0.00     89.26
06:50:07 PM       all      1.33      0.08      0.67      8.01      0.00     89.91
06:50:09 PM       all      1.25      0.00      0.67      7.35      0.00     90.73
06:50:11 PM       all      1.08      0.25      0.42      7.75      0.00     90.50
06:50:13 PM       all      1.33      0.08      0.58      8.00      0.00     90.00
06:50:15 PM       all      1.42      0.08      0.42      7.18      0.00     90.90
06:50:17 PM       all      1.25      0.08      0.42      8.01      0.00     90.24
06:50:19 PM       all      1.33      0.08      0.50      8.17      0.00     89.92
06:50:21 PM       all      1.25      0.25      0.42      7.17      0.00     90.92
Average:          all      1.50      0.12      0.53      7.56      0.00     90.30

The output contains the following fields:

%user: percentage of CPU time spent in user mode

%nice: percentage of CPU time spent in user mode by processes with a nice value

%system: percentage of CPU time spent in system mode

%iowait: percentage of CPU time spent waiting for I/O to complete

%steal: percentage of time the virtual CPU spent in involuntary wait while the hypervisor was servicing another virtual processor

%idle: percentage of time the CPU was idle

Of all these fields, the ones to watch are %iowait and %idle.

A %iowait value that is too high indicates a disk I/O bottleneck; a high %idle value indicates that the CPU is mostly idle.

If %idle is high but the system still responds slowly, the CPU may be waiting for memory to be allocated, in which case memory capacity should be increased. Conversely, if %idle stays below 10, the system's CPU capacity is relatively low, and the CPU is the resource most in need of attention.
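As a rough illustration of reading these two columns, the sketch below parses `sar -u` data rows and applies the "idle stays below 10" rule from the text. The function names and the `idle_floor` parameter are assumptions made for this example; the three rows in the usage section are copied from the sample output above.

```python
def parse_sar_u_row(line):
    """Extract the six percentage columns from one 'sar -u' data row."""
    names = ["user", "nice", "system", "iowait", "steal", "idle"]
    fields = line.split()
    # The percentages are the last six fields; the timestamp, AM/PM marker
    # and CPU label come before them.
    return dict(zip(names, (float(value) for value in fields[-6:])))


def summarize(rows, idle_floor=10.0):
    """Mean %iowait and %idle, plus whether %idle stayed below the floor."""
    mean_iowait = sum(row["iowait"] for row in rows) / len(rows)
    mean_idle = sum(row["idle"] for row in rows) / len(rows)
    # "Stays below" is modelled as: every sample is under the floor.
    cpu_bound = all(row["idle"] < idle_floor for row in rows)
    return {"mean_iowait": mean_iowait, "mean_idle": mean_idle, "cpu_bound": cpu_bound}


if __name__ == "__main__":
    sample = [
        "06:50:03 PM all 1.50 0.08 0.58 7.24 0.00 90.60",
        "06:50:05 PM all 3.25 0.17 0.58 6.74 0.00 89.26",
        "06:50:07 PM all 1.33 0.08 0.67 8.01 0.00 89.91",
    ]
    print(summarize([parse_sar_u_row(line) for line in sample]))
```

The text does not give a numeric threshold for "too high" %iowait, so the sketch only reports the mean and leaves that judgment to the reader.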

Analyzing run queue length with sar:

[root@master ~]# sar -q 2 10
Linux 2.6.18-194.el5 (master)   12/13/2012

06:57:55 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
06:57:57 PM         0      1196      0.63      0.48      0.30
06:57:59 PM         0      1196      0.63      0.48      0.30
06:58:01 PM         0      1196      0.58      0.47      0.30
06:58:03 PM         0      1198      0.58      0.47      0.30
06:58:05 PM         0      1198      0.61      0.48      0.30

runq-sz: length of the run queue (number of processes waiting to run)

plist-sz: number of processes and threads in the process list

ldavg-1: system load average for the last minute

ldavg-5: system load average for the last 5 minutes

ldavg-15: system load average for the last 15 minutes

A side note on the meaning of load average

Load average can be understood as the average number of processes that are running on the CPU or waiting to run.

On a Linux system, many commands print the system's load average, so what exactly is the system load?

Definition: the average number of tasks in the run queue over a specific time interval. A process is in the run queue if all of the following are true:

1. It is not waiting for the result of an I/O operation

2. It has not voluntarily entered a waiting state (that is, it has not called wait)

3. It has not been stopped

For example:

[root@master ~]# uptime
 09:34:05 up  4:00,  1 user,  load average: 0.08, 0.02, 0.01

The last part of the output gives the average number of processes in the run queue over the past 1, 5, and 15 minutes.

As a rule of thumb, as long as the number of active processes per CPU is no greater than 3, system performance is good; if the number of tasks per CPU is greater than 5, the machine has a serious performance problem.

For the example above, assuming the system has two CPUs, the number of tasks per CPU is 0.08/2 = 0.04, which means the system's performance is fine.

One question to think about: when the CPU supports Hyper-Threading, should the load be divided by the number of physical CPUs or by the number of logical CPUs?
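The per-CPU arithmetic above can be sketched in a few lines. The function names and the "degraded" label are assumptions for this example (the text gives no verdict for values between 3 and 5). Note that `os.cpu_count()` reports logical CPUs, so the live example at the bottom divides by the logical count; that is a choice made for this sketch, not an answer given by the article.

```python
import os


def load_per_cpu(load_average, ncpu):
    """Average number of run-queue tasks per CPU."""
    return load_average / ncpu


def rate_load(per_cpu):
    """Classify per-CPU load using the thresholds quoted in the text."""
    if per_cpu <= 3:
        return "good"
    if per_cpu > 5:
        return "serious"
    # The text gives no verdict between 3 and 5; this label is an assumption.
    return "degraded"


if __name__ == "__main__":
    one_minute, five_minutes, fifteen_minutes = os.getloadavg()
    ncpu = os.cpu_count() or 1  # logical CPUs; cpu_count() may return None
    per_cpu = load_per_cpu(one_minute, ncpu)
    print("1-minute load per CPU: %.2f (%s)" % (per_cpu, rate_load(per_cpu)))
```

With the uptime figures from the example, `load_per_cpu(0.08, 2)` gives 0.04, which `rate_load` classifies as "good".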

Iostat

[root@master ~]# iostat -c 2 10
Linux 2.6.18-194.el5 (master)   12/14/2012

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.08    0.15    0.14    0.05    0.00   98.58

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.42    0.25    0.00    0.00    0.00   99.33

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.08    0.08    0.00    0.00   99.83

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

Mpstat

mpstat is short for Multiprocessor Statistics and is a real-time system monitoring tool. It reports CPU-related statistics, which are read from the /proc/stat file.

On a multi-CPU system, it can show not only the average across all CPUs but also the figures for a particular CPU.

The mpstat syntax is as follows:
Usage: mpstat [ options... ] [ <interval> [ <count> ] ]
Options are:
[ -P { <cpu> | ALL } ] [ -V ]

The parameters have the following meaning:

-P {<cpu> | ALL}: which CPU to monitor; <cpu> takes a value in [0, number of CPUs - 1]

interval: the time between two adjacent samples

count: the number of samples; count can only be used together with interval

With no parameters, mpstat displays averages since system startup. (All of the data is taken from /proc/stat.)
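To make the /proc/stat connection concrete, here is a minimal sketch of how per-state CPU percentages can be derived from two snapshots of a `cpu` line. It is not how mpstat itself is implemented; the helper names are invented for this example, and the counter order (user, nice, system, idle, iowait, irq, softirq, steal) is the one documented for Linux /proc/stat.

```python
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse one 'cpu' or 'cpuN' line from /proc/stat into (label, counters)."""
    parts = line.split()
    # Newer kernels append guest columns; keep only the first eight counters.
    counters = [int(value) for value in parts[1:1 + len(CPU_FIELDS)]]
    return parts[0], dict(zip(CPU_FIELDS, counters))


def cpu_percentages(before, after):
    """Percentage of time spent in each state between two snapshots."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu_snapshot(label="cpu", path="/proc/stat"):
    """Read the counters for one CPU label ('cpu' is the all-CPU total)."""
    with open(path) as stat_file:
        for line in stat_file:
            if not line.startswith("cpu"):
                continue
            name, counters = parse_cpu_line(line)
            if name == label:
                return counters
    raise KeyError(label)
```

To sample a live system, call `read_cpu_snapshot("cpu1")` twice with a pause in between (for example `time.sleep(2)`) and pass both results to `cpu_percentages`. A single snapshot compared against all zeros corresponds to the "average since startup" view.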

[root@master ~]# mpstat -P 1 2 3
Linux 2.6.18-194.el5 (master)   12/14/2012

09:56:35 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
09:56:37 AM    1    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50      0.00
09:56:39 AM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
09:56:41 AM    1    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50      0.00
Average:       1    0.00    0.33    0.00    0.00    0.00    0.00    0.00   99.67      0.00

A supplementary note on the meaning of the nice value

[root@master ~]# ps -l
F S   UID   PID  PPID  C PRI  NI ADDR    SZ WCHAN  TTY          TIME CMD
4 S     0 29493 29491  0  75   0 -    16559 wait   pts/1    00:00:00 bash
4 R     0 29871 29493  0  77   0 -    15887 -      pts/1    00:00:00 ps

UID: the identity of the user running the process
PID: the ID of this process
PPID: the ID of the process this one was spawned from, that is, its parent process
PRI: the priority at which the process is scheduled; the smaller the value, the sooner it runs
NI: the nice value of the process
The first three fields are easy to understand; the less obvious ones are PRI and NI. PRI is the more straightforward of the two: it is the priority of the process, or, put simply, the order in which the CPU runs programs. The smaller the value, the higher the priority. And NI? That is the nice value discussed here, which is a correction applied to the priority at which the process runs. As noted, a smaller PRI value means the process runs sooner; once the nice value is applied, PRI becomes: PRI(new) = PRI(old) + nice. So when the nice value is negative, the process gets a lower priority value, which means a higher priority, and it runs sooner.

Processes are not created equal: they are given different priority values. For example, programs that are vital to the operation of the computer itself must have a higher priority (and therefore a smaller priority value) than less important programs. As noted above, the nice value is a correction applied to a process's priority value, so every process is assigned a nice value when it is scheduled. This lets the system adjust process priorities according to available system resources and the resource consumption of individual processes. Users can also intervene manually, given the appropriate permissions.
On UNIX and Linux systems, the nice value is a number ranging from -20 to +19 (this is the case on Linux and AIX; on HP-UX the range is 0 to 39), and a child process generally inherits the nice value of its parent. The highest-priority programs have the lowest nice values, so on UNIX and Linux a value of -20 marks a task as very important (0 on HP-UX). Conversely, a task with a nice value of +19 (39 on HP-UX) is a noble, selfless task that lets all other tasks enjoy a larger share of the valuable CPU time than it takes for itself, which is the idea behind the name nice.

[root@master ~]# nice
0

Here root runs ls with the nice value raised by 3:

[root@master ~]# nice -n 3 ls
123.txt Desktop install.log jrockit-jdk1.6.0_29-r28.1.5-4.0.1 mysql-python-1.2.3.tar.gz Python-2.7.3.tgz setuptools-0.6c8.tar.gz vmtouch.c
anaconda-ks.cfg file1 install.log.syslog jrockit-jdk1.6.0_29.tar.gz part-00000.bz2 Root Soft Yum_r.txt
cmake-2.8.7.tar.gz file2 integer.sh mysql-python-1.2.3 Python-2.7.3 setuptools-0.6c8 Vmtouch

The root user can also give a child process a smaller nice value, as follows:

[root@dbbak root]# nice
0
[root@dbbak root]# nice -n -3 ls
123.txt Desktop install.log jrockit-jdk1.6.0_29-r28.1.5-4.0.1 MySQL-python-1.2.3.tar.gz python-2.7.3.tgz setuptools-0.6c8.tar.gz vmtouch.c
anaconda-ks.cfg file1 install.log.syslog jrockit-jdk1.6.0_29.tar.gz part-00000.bz2 Root Soft Yum_r.txt
cmake-2.8.7.tar.gz file2 integer.sh mysql-python-1.2.3 Python-2.7.3 setuptools-0.6c8 Vmtouch

For a background process, 4 is added to the nice value. When the command "nice command &" executes, the program runs with NICE=36 (on HP-UX). Note that if the user sets a nice value beyond the allowed range (-20 to 19 on Linux and AIX, 0 to 39 on HP-UX), the system uses the boundary value as the process's nice value.
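The boundary behaviour and the simplified PRI formula from the text can be illustrated with two small pure functions. Both names are invented for this example, and `adjusted_priority` models only the article's simplified rule PRI(new) = PRI(old) + nice, not the actual kernel scheduler.

```python
# Range quoted in the text for Linux and AIX; HP-UX would use 0 and 39.
NICE_MIN, NICE_MAX = -20, 19


def clamp_nice(requested, low=NICE_MIN, high=NICE_MAX):
    """Return the nice value that would actually apply after clamping."""
    return max(low, min(high, requested))


def adjusted_priority(old_pri, nice_value):
    """Simplified model from the text: PRI(new) = PRI(old) + nice."""
    return old_pri + clamp_nice(nice_value)


if __name__ == "__main__":
    print(clamp_nice(25))             # out of range, clamped to the upper bound
    print(clamp_nice(45, 0, 39))      # same idea with the HP-UX range
    print(adjusted_priority(75, -3))  # a negative nice lowers the PRI value
```

A request for 25 is treated as 19 and a request for -30 as -20, which matches the boundary rule described above.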

There are two commands related to a process's nice value: nice and renice.
The nice command runs a command with a given nice adjustment. Its form is nice -n adjustment command command_option, which executes command with its nice value changed by adjustment. If adjustment is not specified, the default is 10.
The renice command sets the nice value of a process that is already running. For example, if a running process has a nice value of 0, then after renice 3 its nice value is 3. Running renice requires the appropriate permissions. It can set nice values by user, process ID, or process group.
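The same two operations are also available from a program. The sketch below uses Python's standard `os` and `subprocess` modules on a POSIX system; the function names are made up for this example. `child_nice_after_increment` plays the role of `nice -n`, and `renice_self` plays the role of `renice` applied to the current process. Lowering a nice value still requires the appropriate privileges, exactly as with the commands.

```python
import os
import subprocess
import sys


def child_nice_after_increment(increment):
    """Start a child with its nice value raised by `increment`; return what it reports."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.nice(0))"],
        # Runs in the child just before exec (POSIX only), like 'nice -n increment'.
        preexec_fn=lambda: os.nice(increment),
        stdout=subprocess.PIPE,
        check=True,
    )
    return int(result.stdout.decode().strip())


def renice_self(new_value):
    """Set this process's own nice value and return the value now in effect."""
    os.setpriority(os.PRIO_PROCESS, 0, new_value)  # 0 means the calling process
    return os.getpriority(os.PRIO_PROCESS, 0)


if __name__ == "__main__":
    print("current nice value:", os.nice(0))
    print("child started with +3:", child_nice_after_increment(3))
```

`os.nice(0)` adds zero to the current nice value and returns the result, so it is a convenient way to read the value without changing it.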

As an analogy for the nice value: suppose that in one round of CPU scheduling there are two runnable processes, A and B. If both have a nice value of 0 (20 on HP-UX), the kernel gives each of them a CPU time slice of 1k. But if A has a nice value of 0 and B has a nice value of -10, the CPU might allocate time slices of 1k and 1.5k to A and B respectively. In other words, the nice value affects how much CPU time the kernel allocates to a process: the more time slices a process receives, the higher its priority and the lower its priority value.

The nice value shown by top, ps, and similar commands is the nice value of an individual process. The %nice shown by iostat is the percentage of CPU time used in user space by processes whose priority has been changed; in the example above, that is 0.5k/2.5k = 1/5 = 20%.

It is worth emphasizing that a process's nice value is not its priority. They are different concepts, but the nice value influences how the priority changes.
