A software project generally moves through design, coding, tuning, and going live. During tuning you often find that system performance falls short. That is normal: if everyone's first draft of the code performed superbly, there would be little need for all the so-called best-practice write-ups.
I recently read "DevOps Troubleshooting". It is a thin book, and much of its content is probably covered in other books, but it gives a good general summary of how to troubleshoot a system after a failure. For me it is most helpful in the tuning stage, when analyzing system bottlenecks, so I am writing down my study notes.
First, recall that the main resources of a server include CPU, RAM, disk IO, and network IO.
What do you do when the system runs into trouble? A restart may well fix it, but then you lose the chance to become an expert. If you can still log on to the system, there should be tools that show who is actually causing the trouble. (I say "should" because I did not know them before, but you will know them shortly.)
1 System load
Usually the first command is uptime:
03:11:10 up days, 6:26, user, load average: 2.03, 20.17, 15.09
- The three numbers after "load average" (2.03, 20.17, 15.09) are the average load of the machine over the last 1, 5, and 15 minutes, respectively. The load average is the average number of processes in a running or uninterruptible state. A running process is either using the CPU or waiting for the CPU; an uninterruptible process is waiting for an IO response.
- On a single-CPU system, a load average of 1 means the CPU is constantly busy. If the load average on a single-CPU system is 4, the system has four times the load it can handle, and three quarters of the processes are waiting for resources. A load of 1 on a single-CPU system therefore corresponds to a load of 4 on a four-core system.
- The 1-, 5-, and 15-minute averages show how the load has changed over time. In the example above, the load averaged 2 over the last minute, 20 over the last 5 minutes, and 15 over the last 15 minutes. This shows that the machine has been under high load for the last 15 minutes, that the load spiked about 5 minutes ago, and that it has now subsided.
Here is another example:
03:11:10 up days, 6:26, user, load average: 17.29, 0.12, 0.01
- In this example the 5-minute and 15-minute averages are very low but the 1-minute average is very high, so the load spike happened very recently. You can then use the top command to watch whether the load is still rising or already falling.
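The reasoning applied to the two uptime examples can be written down as a small sketch. The function names and the "more than 1 per core means sustained load" rule are assumptions made for illustration, not something taken from the book.

```python
import re

def parse_load_averages(uptime_line):
    """Extract the 1-, 5- and 15-minute load averages from an uptime line."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())

def describe_load(one, five, fifteen, cpus=1):
    """Return (1-minute load per core, short-term trend, sustained flag)."""
    per_core = one / cpus
    if one > five:
        trend = "rising"      # the last minute is busier than the last five
    elif one < five:
        trend = "falling"     # the spike is already subsiding
    else:
        trend = "steady"
    # Assumption: a 15-minute average above 1 per core counts as sustained load.
    sustained = fifteen / cpus > 1.0
    return per_core, trend, sustained

first = parse_load_averages("load average: 2.03, 20.17, 15.09")
second = parse_load_averages("load average: 17.29, 0.12, 0.01")
print(describe_load(*first))    # first example: falling, sustained
print(describe_load(*second))   # second example: rising, not sustained
```

Passing cpus=4 divides the 1-minute load by four, which reflects the point that a load of 4 on a four-core machine is comparable to a load of 1 on a single CPU.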
How high a load average is too high?
That depends on what is causing the load. It is important to establish whether the load is CPU-bound (processes waiting for CPU time), RAM-bound (especially when the memory in use has spilled into swap), or IO-bound (processes competing for disk or network IO).
- CPU-bound systems usually stay more responsive than IO-bound ones, so troubleshooting tools run on them still respond reasonably quickly.
- On IO-bound systems under high IO load, even logging in usually takes a while, because disk IO may be saturated.
- Systems that have run out of RAM often behave like IO-bound systems: once the system starts using swap on disk it consumes disk IO, and processes slow down until they grind to a halt.
2 Troubleshooting load issues with the top command
To troubleshoot a high-load problem, the first tool is top.
Output of the top command
- The first line of output matches the output of uptime. Here it shows that the load on this machine is very low.
top - 04:35:45 up days, 7:50, users, load average: 0.00, 0.00, 0.00
- The Cpu(s) line shows how CPU time is being spent:
Cpu(s): 2.0%us, 0.2%sy, 0.0%ni, 97.6%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
- us: percentage of CPU time spent running user processes that have not been niced
- sy: percentage of CPU time spent running the kernel and kernel processes
- ni: percentage of CPU time spent running niced user processes
- id: percentage of time the CPU is idle. If the system is slow but this value is high, the slowness is not caused by CPU load.
- wa: percentage of time the CPU spends waiting for IO. This is a very valuable indicator when diagnosing a slow system: if the value is low, you can rule out disk or network IO as the cause.
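As a rough sketch of how these fields guide the diagnosis, the Cpu(s) line can be parsed and turned into a first guess. The thresholds (10% wait, 20% idle) and the labels are illustrative assumptions; pick values that suit your own systems.

```python
import re

def parse_cpu_line(line):
    """Turn a top 'Cpu(s):' line into a dict such as {'us': 2.0, 'sy': 0.2, ...}."""
    pairs = re.findall(r"([\d.]+)\s*%?\s*([a-z]{2})\b", line)
    return {name: float(value) for value, name in pairs}

def classify(cpu, wa_threshold=10.0, idle_threshold=20.0):
    """First guess at the bottleneck; both thresholds are assumptions."""
    if cpu["wa"] >= wa_threshold:
        return "io-bound"            # CPU is mostly waiting on disk or network
    if cpu["id"] <= idle_threshold:
        return "cpu-bound"           # little idle time left, little IO wait
    return "cpu not saturated"       # look elsewhere, for example at RAM

sample = "Cpu(s):  2.0%us,  0.2%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st"
cpu = parse_cpu_line(sample)
print(cpu["id"], classify(cpu))
```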
3 Troubleshooting high CPU load issues
Symptom: %us is high and %wa is low. You need to determine which process is consuming so much CPU.
In general, a high CPU load is caused by one or a few processes using most of the CPU time.
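Outside of interactive top, the same search can be scripted. The sketch below ranks processes from the text output of `ps -eo pid,pcpu,pmem,comm`; the function name and the sample process list are made up for illustration.

```python
def top_consumers(ps_text, column="%CPU", count=3):
    """Return the `count` rows with the highest value in `column`."""
    lines = [line for line in ps_text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # Split into as many fields as the header has, so a command
        # name containing spaces stays in the last field.
        fields = line.split(None, len(header) - 1)
        rows.append(dict(zip(header, fields)))
    rows.sort(key=lambda row: float(row[column]), reverse=True)
    return rows[:count]

# Hypothetical output of: ps -eo pid,pcpu,pmem,comm
sample_ps = """
  PID %CPU %MEM COMMAND
    1  0.0  0.1 init
 2301 87.5  3.2 java
 1178 11.0 40.6 mysqld
"""
for row in top_consumers(sample_ps, "%CPU", count=2):
    print(row["PID"], row["%CPU"], row["COMMAND"])
```

Passing "%MEM" as the column ranks the same list by memory use instead.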
4 Solving the problem of insufficient RAM
The following two lines in the top output show RAM usage, for example:
Mem:  3849548k total, 3819152k used,   30396k free,   15144k buffers
Swap: 2097144k total, 1604548k used,  492596k free,   75248k cached
- The first line shows total physical memory, how much is used, how much is free, and how much is used for buffers.
- The second line shows the same figures for swap, plus how much RAM is used by the Linux file cache.
As you can see, this system really has run out of memory: only 30396 KB is free, and the file cache holds just 75248 KB (cache memory can be reclaimed for processes, but there is too little of it here to help). About 1.6 GB of swap is already in use, so the system clearly does not have enough memory.
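The arithmetic behind that conclusion can be checked with a short sketch. The helper name is an assumption; the numbers are the ones from the top output above.

```python
import re

def parse_kb_fields(line):
    """Parse '3849548k total, 3819152k used, ...' into {'total': 3849548, ...}."""
    return {name: int(value) for value, name in re.findall(r"(\d+)k\s+(\w+)", line)}

mem = parse_kb_fields("Mem:  3849548k total, 3819152k used, 30396k free, 15144k buffers")
swap = parse_kb_fields("Swap: 2097144k total, 1604548k used, 492596k free, 75248k cached")

# Memory that processes could still obtain: free pages plus reclaimable
# buffers and file cache (the 'cached' figure is printed on the Swap line).
reclaimable_kb = mem["free"] + mem["buffers"] + swap["cached"]
swap_used_pct = 100.0 * swap["used"] / swap["total"]

print("reclaimable: %d KB (%.1f%% of RAM)" % (reclaimable_kb, 100.0 * reclaimable_kb / mem["total"]))
print("swap in use: %.1f%%" % swap_used_pct)
```

Roughly 118 MB of obtainable memory out of about 3.7 GB, with three quarters of swap in use, is what makes this a memory problem rather than a CPU one.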
Once memory is confirmed to be the problem, the next step is to determine which processes are consuming RAM. top sorts processes by CPU usage by default, so it needs to be re-sorted by RAM usage: keep top open and press M (uppercase), and all processes are sorted by RAM usage.
- Note the %MEM column: processes are now listed in order of memory use, so you can find the process that consumes the most memory and then analyze why it uses so much. (Ha, this was the process list of one of our production machines. Fortunately traffic is still very low, otherwise the consequences would have been unimaginable. I have since fixed this memory leak, which gave me a sense of accomplishment.)
5 Resolving high IO wait issues
When %wa is very high, the first thing to check is whether the machine is using a lot of swap. Disk is far slower than RAM, so when memory runs out and the system starts swapping, performance suffers badly. The first step is therefore to see whether memory is exhausted, and if so, to solve that problem first. If there is plenty of RAM, you then need to find out which process is doing most of the IO.
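One way to check for active swapping, sketched here as an assumption rather than the book's method, is to compare two snapshots of the pswpin and pswpout counters from /proc/vmstat. The snapshot values below are invented for illustration.

```python
def swap_counters(vmstat_text):
    """Pick the cumulative swap-in and swap-out page counters out of /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters

def swap_rate(before, after, seconds):
    """Pages swapped in and out per second between two snapshots."""
    return {name: (after[name] - before[name]) / seconds for name in before}

# Hypothetical snapshots taken 10 seconds apart
snapshot_1 = "pgfault 912345\npswpin 1200\npswpout 5300\n"
snapshot_2 = "pgfault 918000\npswpin 1700\npswpout 9300\n"
rates = swap_rate(swap_counters(snapshot_1), swap_counters(snapshot_2), 10)
print(rates)  # sustained non-zero rates mean the machine is actively swapping
```

On a live machine the two snapshots would come from reading /proc/vmstat twice with a pause in between.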
It is hard to see which process is using most of the IO, so more specialized commands come in:
- iostat (available in the sysstat package)
Linux 2.6. +-431. el6.x86_64 (Nj-figo-cui)  08/09/ -  _x86_64_  (4 CPU)
avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
          0. -    0.xx   0. One   0. on    0.xx     99.63
Device:   tps     Blk_read/s  Blk_wrtn/s  Blk_read   Blk_wrtn
sda       0. -    12.25       16.48       9632378    12956192
dm-0      2.24    12.24       16.45       9622250    12933488
dm-1      0.xx    0.xx        0.xx        2640       0
- avg-cpu: CPU information
- tps: the number of transfers per second for the device. A "transfer" is another way of saying an IO request sent to the device.
- Blk_read/s: the number of blocks read from the device per second.
- Blk_wrtn/s: the number of blocks written to the device per second.
- Blk_read: the total number of blocks read from the device.
- Blk_wrtn: the total number of blocks written to the device.
When the system is under high IO load, you can see which partition is busiest and so narrow the search. Once you know a partition has high IO, look at the processes whose data lives on that partition. Only a few processes handle large volumes of data, so this usually leads you to the culprit.
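Narrowing down to the busiest device can be sketched as follows. The sample reuses the readable read and write rates from the iostat output above; the tps values and the zero rates for dm-1 are filled in only so the example runs, and the function names are assumptions.

```python
def parse_iostat_devices(text):
    """Parse the device table of iostat output into a list of dicts."""
    header, rows = None, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].rstrip(":") == "Device":
            header = fields[1:]          # column names after 'Device:'
            continue
        if header is None or len(fields) != len(header) + 1:
            continue                     # not a device row
        try:
            values = [float(value) for value in fields[1:]]
        except ValueError:
            continue                     # skip rows with unreadable numbers
        row = dict(zip(header, values))
        row["device"] = fields[0]
        rows.append(row)
    return rows

def busiest_device(rows, read_col="Blk_read/s", write_col="Blk_wrtn/s"):
    """Name of the device with the highest combined read and write rate."""
    return max(rows, key=lambda row: row[read_col] + row[write_col])["device"]

# tps values and the dm-1 rates are illustrative placeholders
sample_iostat = """
Device:   tps   Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sda       2.30  12.25       16.48       9632378   12956192
dm-0      2.24  12.24       16.45       9622250   12933488
dm-1      0.00  0.00        0.00        2640      0
"""
print(busiest_device(parse_iostat_devices(sample_iostat)))
```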
- iotop
This command is similar to top, but its output is organized around the IO activity of each process. Here is an example; it is not covered in detail here.
Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
TID  PRIO  USER  DISK READ  DISK WRITE  SWAPIN  IO>     COMMAND
1    be/4  root  0.00 B/s   0.00 B/s    0.00 %  0.00 %  init
2    be/4  root  0.00 B/s   0.00 B/s    0.00 %  0.00 %  [kthreadd]
3    rt/4  root  0.00 B/s   0.00 B/s    0.00 %  0.00 %  [migration/0]
4    be/4  root  0.00 B/s   0.00 B/s    0.00 %  0.00 %  [ksoftirqd/0]
5    rt/4  root  0.00 B/s   0.00 B/s    0.00 %  0.00 %  [migration/0]
6    rt/4  root  0.00 B/s   0.00 B/s    0.00 %  0.00 %  [watchdog/0]
7    rt/4  root  0.00 B/s   0.00 B/s    0.00 %  0.00 %  [migration/1]
6 Investigating high load after the fact
It is quite possible that when the load is very high you cannot even log in. It is therefore worth recording performance data throughout the day with a tool, so that if someone complains the system was slow at noon, you can go through the logs and see what caused the problem.
There are two tools to use, atop and sysstat.
# Run system activity accounting tool every 10 minutes
*/10 * * * * root /usr/lib64/sa/sa1 1 1
# 0 * * * * root /usr/lib64/sa/sa1 600 6 &
# Generate a daily summary of process accounting at 23:53
53 23 * * * root /usr/lib64/sa/sa2 -A
The first job runs every 10 minutes. The last job runs once a day at 23:53 and generates the daily summary of process accounting.
The collected data includes the following (see sar -h for usage):
- Load average
- CPU load
- RAM
- Disk IO
- Network IO, etc.
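Once the data is being recorded, answering "what happened at noon" means finding the peak in the history. The sketch below scans text in the layout printed by `sar -q` and returns the sample with the highest 1-minute load. The sample report is invented, and the column layout can differ between sysstat versions, so treat the parsing as an assumption to check against your own output.

```python
def peak_load(sar_q_text, column="ldavg-1"):
    """Return (timestamp, value) of the highest `column` sample in sar -q text."""
    index, time_width, peak = None, 1, None
    for line in sar_q_text.splitlines():
        fields = line.split()
        if column in fields:                      # header line (may repeat)
            index = fields.index(column)
            # 12-hour clocks add an AM/PM token after the time
            time_width = 2 if fields[1] in ("AM", "PM") else 1
            continue
        if index is None or len(fields) <= index:
            continue
        if fields[0].startswith("Average"):       # summary row, not a sample
            continue
        try:
            value = float(fields[index])
        except ValueError:
            continue
        if peak is None or value > peak[1]:
            peak = (" ".join(fields[:time_width]), value)
    return peak

# Hypothetical report in the layout of `sar -q`
sample_sar = """
00:00:01      runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
11:50:01            1       182      0.40      0.31      0.20
12:00:01            9       240     17.29      4.10      1.35
12:10:01            2       185      1.10      5.02      2.77
Average:            4       202      6.26      3.14      1.44
"""
print(peak_load(sample_sar))
```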
That was a lot of detail, but the point is simply to help developers locate the source of a bottleneck when the system hits one. Real expertise still has to be built up through personal experience. If this helps you, I will feel a sense of accomplishment!
Reference:
- Kyle Rankin, "DevOps Troubleshooting: Linux Server Best Practices"
Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission.