Troubleshooting high CPU load problems

Last Update:2018-07-26 Source: Internet

Author: User


The concept of load average

The load average line in the output of the top command shows the system's average load over the last 1, 5, and 15 minutes.
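On Linux, the same three values are also exposed in /proc/loadavg, followed by a runnable/total count of scheduling entities and the most recently assigned PID. The following is a minimal sketch of reading that format; the names LoadAvg and parse_loadavg and the sample values are assumptions made up for illustration.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    one_min: float      # load average over the last 1 minute
    five_min: float     # load average over the last 5 minutes
    fifteen_min: float  # load average over the last 15 minutes
    runnable: int       # scheduling entities currently runnable
    total: int          # scheduling entities that currently exist
    last_pid: int       # PID most recently assigned by the kernel


def parse_loadavg(line: str) -> LoadAvg:
    """Parse one line in the /proc/loadavg format."""
    one, five, fifteen, entities, last_pid = line.split()
    runnable, total = entities.split("/")
    return LoadAvg(float(one), float(five), float(fifteen),
                   int(runnable), int(total), int(last_pid))


if __name__ == "__main__":
    # Made-up sample line; on a Linux host, read /proc/loadavg instead.
    sample = "0.52 0.58 0.59 2/613 15047"
    print(parse_loadavg(sample))
```

On a Linux host, parse_loadavg(open("/proc/loadavg").read()) reads the live values; os.getloadavg() returns just the three averages.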

The system load average is defined as the average number of processes in the run queue (running on a CPU or waiting to run) during a specific time interval. A process is in the run queue if it satisfies all of the following conditions: it is not waiting for the result of an I/O operation; it has not voluntarily entered a wait state (that is, it has not called wait); and it has not been stopped (for example, while waiting to terminate).

For this purpose, Linux processes can be grouped into three states: blocked processes, runnable processes (ready to run), and running processes.

When a process is runnable, it sits in the run queue, competing with other runnable processes for CPU time. System load refers to the total number of processes that are running or ready to run. For example, if the system currently has 2 running processes and 3 runnable processes, the system load is 5. The load average is that load averaged over a given period of time.

As a rule of thumb, system performance is good as long as the number of active processes per CPU is no greater than 3. If the number of tasks per CPU is greater than 5, the machine has a serious performance problem.
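The rule of thumb above can be written as a small check. This is only a sketch: the function name classify_load and the label "elevated" for the middle band (more than 3 but at most 5 per CPU, which the text does not name) are assumptions for illustration.

```python
def classify_load(load: float, cpu_count: int) -> str:
    """Classify system load using the per-CPU thresholds of 3 and 5."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    per_cpu = load / cpu_count
    if per_cpu <= 3:
        return "good"
    if per_cpu > 5:
        return "serious"
    # The source text does not name this band; "elevated" is an assumption.
    return "elevated"


if __name__ == "__main__":
    # 2 running + 3 runnable processes on a 2-CPU machine: 2.5 per CPU.
    print(classify_load(2 + 3, cpu_count=2))
```

In practice, load could come from os.getloadavg() and cpu_count from os.cpu_count().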

High CPU usage does not always mean that the CPU is busy doing work; it may be waiting on other subsystems. When profiling, consider all subsystems as a whole, because problems can cascade between them. The metric for CPU load is load: a measure of how much work the system is being asked to carry, or simply the length of the process queue. As a simple example, imagine a canteen with five serving windows. When five or fewer students are queuing for food, all of them can be served at once, but when there are more than five, some students inevitably have to wait. Whenever requests exceed the current processing capacity, waiting occurs and the load rises.

Load average is the average load over a period of time (1, 5, and 15 minutes). The optimal value for the load average is 1 per CPU core, which means the CPU is fully used but no process has to wait.

Troubleshooting ideas for high CPU load

1. First find out which processes have high CPU usage, using the command: ps ux

2. View the CPU usage of each thread of the Java process, using the command: ps -Lp 15047 cu (15047 is the example PID of the Java process)

3. Look inside the threads to see why the load is so high, using the command: jstack 15047

Or dump the thread stacks to a file: jstack $(pidof java) > stack.out
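Matching the two outputs takes one extra step: ps reports thread (LWP) IDs in decimal, while the jstack output of the JDK versions of this article's era prints the native thread ID as a hexadecimal nid=0x... field. The sketch below converts an LWP ID and pulls the matching stack out of a dump. The helper names lwp_to_nid and find_thread are assumptions for illustration, and the hexadecimal nid format should be verified against your own jstack output.

```python
import re
from typing import List


def lwp_to_nid(lwp: int) -> str:
    """Convert a decimal thread (LWP) ID from ps into jstack's hex nid form."""
    return hex(lwp)


def find_thread(dump: str, lwp: int) -> List[str]:
    """Return the lines of the stack block whose header carries the matching nid.

    Assumes thread blocks in the dump are separated by blank lines and that
    the header line contains nid=0x... in lowercase hexadecimal.
    """
    nid = lwp_to_nid(lwp)
    for block in re.split(r"\n\s*\n", dump):
        lines = block.strip().splitlines()
        if lines and re.search(rf"\bnid={nid}\b", lines[0]):
            return lines
    return []


if __name__ == "__main__":
    # Thread ID 15048 is a made-up example value.
    print(lwp_to_nid(15048))
```

With a saved dump, find_thread(open("stack.out").read(), 15048) returns the stack of that thread, or an empty list if no header matches.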

Find the corresponding thread ID in the output, then trace it back to the code.

General experience

A surge in CPU load may be related to an increase in the number of full GCs, or it may be caused by an infinite loop.

Common causes of high load on a database system

1 The business concurrently runs SQL statements that perform full table scans or ORDER BY sorts.
2 A SQL statement lacks an appropriate index, uses a wrong execution plan, or is an UPDATE/DELETE whose WHERE clause scans the entire table, blocking other SQL statements that access the same table.
3 The business includes flash-sale style events, such as a 10 o'clock promotion or a Double 11 flash sale, where a massive burst of requests hits the database at once.
4 The database is running a logical backup (which requires full table scans) or compressed backups of multiple instances (compression needs a lot of CPU and can make the server load soar).
5 The disk write policy changes, for example from write-back to write-through.
RAID cards have a write cache (battery-backed write cache), which improves I/O performance significantly. Because cached data would be lost on power failure, the cache must be backed by a battery.
The battery is charged and discharged periodically, typically every 90 days. When the charge is found to be below a certain threshold, the write cache policy switches from write-back to write-through, which effectively disables the write cache. If the system is doing a large amount of I/O at that point, I/O response times may slow noticeably, processes pile up in the CPU queue, and the system load rises.

Identifying and handling high load problems

Load is generally judged against the number of CPUs: the load average should be lower than the number of CPUs. The normal load value differs greatly between systems. On a single-core workstation, 1 or 2 is acceptable. On a multi-core server (for example, 24 cores), load may reach 20 or even higher.

(a) Database level
1 Run top -u mysql -c to check which processes currently consume the most CPU. The -c option shows the full command line of each process, which makes it easy to see what operation is driving the load up.
2 Get the PID or the MySQL port number, depending on the situation.
3 If the MySQL database service is causing the load to soar, you can use the following commands:
SHOW PROCESSLIST;
SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE COMMAND <> 'Sleep' AND TIME > 100;
Or:
The orzdba tool checks the logical read and active thread values. Usage: orzdba --help
The orztop tool checks for slow SQL that is currently executing. Usage: orztop -p $port
4 Once you have the problematic SQL, the rest is easier to solve. Based on the causes listed in the first part:
a Select an appropriate index.
b Adjust the SQL statement, for example by rewriting ORDER BY paging queries to use a deferred join.
c Add a cache at the business level to reduce direct access to the database, and so on.

(b) OS level

Check system I/O.

Use the iostat command to view r/s (read requests), w/s (write requests), avgrq-sz (average request size), await (I/O wait time), and svctm (I/O service time).

r/s and w/s are the numbers of read and write requests per second.

%util is the utilization of the device. A value close to 100% usually means the device is approaching saturation (though not always, for example when the device has a write cache). Values above 100% sometimes appear, mostly because of rounding in the calculation.
svctm is the average service time per request, in milliseconds. The formula is: (r/s + w/s) * (svctm / 1000) = util. For example, if util reaches 100%, then svctm = 1000 / (r/s + w/s). Assuming 1000 IOPS, svctm is about 1 millisecond; a value longer than that indicates a problem with the system.
await is the average wait time per request. It includes both queue time and service time, so under normal conditions await is greater than svctm. The smaller the difference between them, the shorter the queue time; a large difference means a long queue time and indicates a problem with the system.
avgqu-sz is the average length of the request queue. The shorter the queue, the better.
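The arithmetic behind these metrics can be checked with a few lines of code. This is a sketch of the formulas quoted above only; the function names and the sample numbers are assumptions chosen for illustration, not values from a real iostat run.

```python
def utilization(r_per_s: float, w_per_s: float, svctm_ms: float) -> float:
    """util as a fraction (1.0 == 100%): (r/s + w/s) * (svctm / 1000)."""
    return (r_per_s + w_per_s) * (svctm_ms / 1000.0)


def svctm_at_full_utilization(r_per_s: float, w_per_s: float) -> float:
    """svctm in milliseconds when util is 100%: 1000 / (r/s + w/s)."""
    return 1000.0 / (r_per_s + w_per_s)


def queue_time_ms(await_ms: float, svctm_ms: float) -> float:
    """Time spent queuing: await covers queue time plus service time."""
    return await_ms - svctm_ms


if __name__ == "__main__":
    # Made-up sample: 600 reads/s + 400 writes/s = 1000 IOPS.
    print(utilization(600, 400, svctm_ms=1.0))
    print(svctm_at_full_utilization(600, 400))
    print(queue_time_ms(await_ms=5.0, svctm_ms=1.0))
```

A large queue_time_ms relative to svctm is the signal described above: requests spend most of their time waiting rather than being serviced.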

