Note: This content is excerpted from the book "DevOps Troubleshooting: Linux Server Best Practices".
The system load average is probably the first basic metric to check when you troubleshoot a slow system. When a system is running slowly, the first command I usually run is uptime:
[Figure: uptime output, showing load average: 2.03, 20.17, 15.09]
The three numbers after "load average" (2.03, 20.17, and 15.09) are the machine's load averages over the last 1, 5, and 15 minutes, respectively. A system's load average is the average number of processes in a runnable or uninterruptible state. A runnable process is either using the CPU or waiting for the CPU, and an uninterruptible process is waiting for I/O.
A load average of 1 on a single-CPU system means that the CPU is under constant load. If the load average on that single-CPU system is 4, the system has 4 times the load it can handle, so three out of four processes are waiting for resources. The reported load average is not adjusted for the number of CPUs, so a load average of 1 on a two-CPU system means that one of the two CPUs is busy at all times, that is, the system is 50% loaded. In terms of the share of available resources in use, then, a load of 1 on a single-CPU system is the same as a load of 4 on a four-CPU system.
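The arithmetic in the paragraph above can be written down as a small Python sketch. The function name load_per_cpu is made up for this illustration; it only divides the reported load average by the CPU count, as the text describes.

```python
def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Return the load average relative to the number of CPUs.

    A result of 1.0 means every CPU is busy on average; a result above
    1.0 means processes are waiting for resources.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


# The cases discussed in the text.
print(load_per_cpu(1.0, 1))  # single CPU under constant load
print(load_per_cpu(4.0, 1))  # single CPU with 4 times the load it can handle
print(load_per_cpu(1.0, 2))  # two CPUs, 50% loaded
print(load_per_cpu(4.0, 4))  # four CPUs, same relative load as the first case
```

On a live Linux system, os.getloadavg() and os.cpu_count() from the Python standard library can supply the two inputs.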
The 1-, 5-, and 15-minute load averages describe the average load over each of those periods, and they are useful when you want to determine the current state of a system. The 1-minute load average gives you a good sense of what is happening right now, so in the previous example you can see that the server had a load of 2 over the last minute, but the load averaged 20 over the last 5 minutes and 15 over the last 15 minutes. This tells me the machine has been under high load for at least 15 minutes, that the load climbed further about 5 minutes ago, and that it has now subsided. Let's compare that with a completely different load average:
[Figure: uptime output with a high 1-minute load average and low 5-minute and 15-minute load averages]
In this example, the 5-minute and 15-minute load averages are very low, but the 1-minute load average is very high, so I know that the load spike is relatively recent. In this case, I usually run uptime several times in a row (or use top) to see whether the load is going up or down.
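As a rough sketch of the comparison described above, the following Python code extracts the three load averages from a line of uptime output and compares the 1-minute figure with the 5-minute figure. The function names are invented for this example, and in the sample line only the three load values come from the text; the time, uptime, and user count are placeholders.

```python
import re
from typing import Tuple


def parse_load_averages(uptime_line: str) -> Tuple[float, float, float]:
    """Extract the 1-, 5-, and 15-minute load averages from uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    one, five, fifteen = (float(value) for value in match.groups())
    return one, five, fifteen


def describe_trend(one: float, five: float) -> str:
    """Compare the most recent minute with the 5-minute average."""
    if one > five:
        return "rising"
    if one < five:
        return "falling"
    return "steady"


# Placeholder line: only the load values are taken from the text.
sample = " 13:35:03 up 103 days, 8 min, 5 users, load average: 2.03, 20.17, 15.09"
one, five, fifteen = parse_load_averages(sample)
print(one, five, fifteen, describe_trend(one, five))
```

A single comparison like this is only a hint; running uptime repeatedly, as suggested above, shows the direction more reliably.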
2. What counts as a high load average?
A fair question is: how high is a high load average? The short answer is "it depends on what is causing the load." Because the load average describes the average number of active processes that are using resources, a spike in load can mean several things. What matters is determining whether the load is CPU-bound (processes waiting for CPU resources), RAM-bound (specifically, heavy RAM use that has spilled over into swap), or I/O-bound (processes competing for disk or network I/O).
For example, if you run an application that spawns a large number of simultaneous threads at different points in time, and all of those threads start at once, you may see the load climb to 20, 40, or higher while they compete for system resources. As the threads finish, the load drops again.
In general, CPU-bound systems are more responsive than I/O-bound systems. I have seen CPU-bound systems with loads in the hundreds on which I could still run troubleshooting tools with good response times. I have also seen I/O-bound systems with relatively low loads on which just logging in took a while, because the disk I/O was completely saturated. Systems that run out of RAM usually behave like I/O-bound systems, because once the system starts using swap on disk, it consumes disk resources and processes slow to a crawl.
3. Use the top command to troubleshoot load issues
When it comes to diagnosing high load, the first tool I reach for is top. Type top at the command line and press Enter, and you immediately see a large amount of system information (see Figure 2-1). The data updates continuously, so you get a real-time view of the system: how long it has been up, the load average, the total number of processes, how much memory the system has, how much is in use and how much is free, and finally a list of processes with the resources each one consumes. top may not show every process running on the system, because they do not all fit on the screen. By default, top sorts processes by CPU usage, so you can see at a glance which processes are consuming the CPU.
[Figure 2-1: standard top output]
So what do you do if you find a process that is consuming all of the CPU and you want to kill it? The first column of the top output is the PID (process ID), the unique number the system assigns to every process. To kill a process, press the K key, enter the PID of the process, and then press Enter when top prompts for the signal to send (signal 15 by default).
By default, top runs in interactive mode, which is fine unless you need to see information that does not fit on the screen. If you want the complete output of top, or want to redirect it to a file, you can run it in batch mode. The -b option turns on batch mode, and the -n option controls how many times top refreshes its output before it exits. For example, to see the complete output from a single run of top, enter the following command:
top -b -n 1
If you want to store this output in a file named top_output, enter the following command:
top -b -n 1 > top_output
If you want to see the output of top on the screen and write it to a file at the same time, you can use the handy command-line tool tee:
top -b -n 1 | tee top_output
When you use top to troubleshoot a load problem, the basic approach is to check the top output to determine which resource is running out (CPU, RAM, or disk I/O). Once you know that, you can work out exactly which processes are consuming that resource. First, look at the standard top output from a system:
[Figure: standard top output from the example system]
The first line of the top output is the same as the output of the uptime command you saw earlier. As you can see in this example, the machine is not heavily loaded for a system with 4 CPUs:
[Figure: first line of the top output, with a load average of 1.70]
Beyond the standard load average, though, top gives you additional metrics. For example, the Cpu(s) line tells you what the CPUs are currently doing:
[Figure: the Cpu(s) line from the top output]
These abbreviations mean nothing unless you know what they stand for, so here is each one in turn:
us: user CPU time
The percentage of CPU time spent running user processes that have not been niced. (Nicing a process lets you change its priority relative to other processes.)
sy: system CPU time
The percentage of CPU time spent running the kernel and kernel processes.
ni: nice CPU time
If you have niced some processes, this metric tells you the percentage of CPU time they use.
id: CPU idle time
This is one of the metrics you want to see high. It is the percentage of time the CPU spends idle. If the system is slow but this value is high, you know the cause is not high CPU load.
wa: I/O wait
The percentage of CPU time spent waiting for I/O. This is a very valuable metric when you are tracking down the cause of a slow system, because if the value is low, you can fairly safely rule out disk or network I/O as the cause.
hi: hardware interrupts
The percentage of CPU time spent servicing hardware interrupts.
si: software interrupts
The percentage of CPU time spent servicing software interrupts.
st: steal time
If you are running virtual machines, this metric tells you the percentage of CPU time that was taken away from your virtual machine for other tasks.
In the previous example, you can see that the system is more than 50% idle, which matches a load of 1.70 on a machine with 4 CPUs. When you are diagnosing a slow system, one of the first values to look at is I/O wait, so that you can rule out disk I/O. If I/O wait is low, look at the idle percentage next. If I/O wait is high, the next step is to find out what is causing it, which I cover shortly. If both I/O wait and idle time are low, you will probably see a high user time percentage, so you need to find out which processes are responsible. If I/O wait is low and the idle percentage is high, you know the slowness is not caused by a shortage of CPU resources, and you have to look elsewhere: network problems, the web server itself, slow MySQL queries, and so on.
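The decision sequence above can be sketched in Python as follows. The text does not say what counts as "high" or "low", so the two thresholds here are illustrative assumptions, and the function name next_step is made up for this example.

```python
def next_step(
    iowait: float,
    idle: float,
    iowait_threshold: float = 10.0,  # assumed cut-off, not from the text
    idle_threshold: float = 50.0,    # assumed cut-off, not from the text
) -> str:
    """Suggest where to look next, given the wa and id percentages from top."""
    if iowait >= iowait_threshold:
        # High I/O wait: check swap usage first, then per-device I/O.
        return "investigate I/O wait"
    if idle < idle_threshold:
        # Low I/O wait and low idle: user time is probably high.
        return "find the processes consuming CPU"
    # Low I/O wait and high idle: the CPU is not the bottleneck.
    return "look elsewhere (network, web server, slow queries)"


# Hypothetical readings, for illustration only.
print(next_step(iowait=35.0, idle=20.0))
print(next_step(iowait=0.5, idle=3.0))
print(next_step(iowait=0.5, idle=90.0))
```

I/O wait is checked first because, as the text notes, it is the quickest way to rule disk I/O in or out.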
4. Solve the problem of high user time
A common and relatively simple problem to diagnose is high load caused by a high percentage of user CPU time. It is common because the services on your server are likely to account for most of the system load, and they are user processes. If user time is high but I/O wait is low, you need to identify which processes on the system are consuming the CPU. By default, top sorts processes by CPU usage, highest first:
[Figure: top process list sorted by CPU usage]
In this example, the mysqld process is consuming 53% of the CPU, and the nagios2db_status process is consuming 12%. Note that each number is a percentage of a single CPU, so on a machine with 4 CPUs you may see more than one process at 99% at the same time.
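Because each figure is a percentage of one CPU, it can help to convert it into a share of the whole machine. The following Python sketch does that; the function name share_of_machine is invented for this example, and it assumes the 4-CPU machine from the earlier example.

```python
def share_of_machine(cpu_percent: float, cpu_count: int) -> float:
    """Convert a per-CPU percentage from top into a share of total capacity."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return cpu_percent / cpu_count


# mysqld at 53% of one CPU, on a machine assumed to have 4 CPUs.
print(share_of_machine(53.0, 4))
# Four processes at 99% each would together use nearly the whole machine.
print(4 * share_of_machine(99.0, 4))
```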
The high-CPU situations you will see most often are the CPU being consumed by one or two processes, or by a large number of processes. The two cases are easy to tell apart. In the first case, the top one or two processes in the output show a very high CPU percentage and the rest are relatively low. The fix then is to kill the processes that are using the CPU (press the K key, then enter the PID of the process).
In the case of many processes, you may simply be asking the system to do too much. For example, a web server may have a large number of Apache processes running, along with some log-parsing scripts started from cron. These processes may each consume roughly the same amount of CPU. The long-term solution to this problem is more complicated. In the web server example, you do need all of those Apache processes, and you may need the log-parsing scripts as well. In the short term, you can kill (or postpone) some processes until the load comes down, but in the long term you may need to add resources to the system or split the functions across multiple servers.
5. Troubleshoot out-of-memory issues
The next two lines of the top output provide very valuable information about RAM usage. It is important to rule out memory problems before you diagnose specific system problems.
[Figure: the memory and swap lines from the top output]
The first line tells us how much physical RAM the system has, how much is in use, how much is free, and how much is used for buffers. The second line gives similar information about swap usage, along with how much RAM the Linux file cache is using. At first glance, the system appears to be almost out of memory, because it shows only 768KB free. Many people are misled by the used and free figures in this output because of the Linux file cache. Once Linux loads a file into RAM, it does not necessarily remove the file from RAM when the program is finished with it. If RAM is available, Linux keeps the file cached in RAM so that a program that accesses the file again gets it much faster. If the system does need RAM for active processes, it simply caches fewer files. Because of the file cache, a server that has been running for a long time usually shows only a small amount of free RAM, with the rest taken up by the cache.
To find out how much RAM your processes are really using, you must subtract the file cache from the used RAM. In the example, of the 997,408KB of RAM in use, 286,040KB is occupied by the file cache, which means that only 711,368KB is actually used by processes. In this example, the system still has plenty of memory available and uses virtually no swap. Even if you do see some swap in use, that alone is not a sign of a problem. If a process becomes idle, Linux often pages its memory out to swap, freeing that RAM for other processes. A good way to tell whether you are running out of RAM is to look at the file cache. If the used memory minus the file cache is high, and swap usage is also high, you probably have a memory problem.
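The subtraction above, written out as a minimal Python sketch using the figures from the example. The function name is made up for this illustration.

```python
def ram_used_by_processes(used_kb: int, file_cache_kb: int) -> int:
    """Return the RAM in use once the Linux file cache is subtracted."""
    if file_cache_kb > used_kb:
        raise ValueError("file cache cannot exceed used memory")
    return used_kb - file_cache_kb


# Figures from the example: 997,408KB used, 286,040KB of it file cache.
print(ram_used_by_processes(997_408, 286_040))  # prints 711368
```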
If you do find a memory problem, the next step is to determine which processes are consuming the RAM. By default, top sorts by CPU usage, so you need to change it to sort by RAM usage. Keep top open and press the M key. All processes are then sorted by their RAM usage.
[Figure: top output sorted by RAM usage]
Look at the %MEM column and you will see which processes are using the most RAM. If you find processes that are using a lot of RAM, you can either kill them or, depending on the program, use more specific troubleshooting to find out why they are using so much.
In fact, top can sort its output by any column. To change the sort column, press the F key to open the sort-field selection screen. Press the key that corresponds to the column you want (for example, K for the CPU column), then press Enter to return to the top output.
The Linux kernel also has an out-of-memory (OOM) killer, which steps in when the system runs dangerously low on memory. When the system is about to run out of memory, the OOM killer starts killing processes. In some cases it kills the process that is consuming all the RAM, but that is not guaranteed. Sometimes it kills programs such as sshd or other processes instead of the real culprit. In many cases the system is unstable after the OOM killer has killed some processes, so you have to reboot the machine to make sure all the system processes are running properly. If the OOM killer has stepped in, you will see lines like the following in /var/log/syslog:
[Figure: OOM killer entries in /var/log/syslog]
6. Resolve high I/O wait time issues
When you see a high percentage of CPU time spent in I/O wait, the first thing to check is whether the machine is using a lot of swap. Because disks are much slower than RAM, performance suffers badly when the system runs out of memory and starts using swap, and anything else that needs the disk has to compete with swap for disk I/O. So the first step is to find out whether the system is out of memory and, if it is, to fix that problem first. If you still have plenty of RAM available, you need to identify which process is generating most of the I/O.
Sometimes it is hard to work out which process is using the I/O, but if the system has multiple partitions, you can narrow the search by finding out which partition is seeing most of the I/O. To do this, you need the iostat program, which is provided by the sysstat package on both Red Hat-based and Debian-based systems. If it is not installed on your machine, you can install it with your package manager.
Ideally, you will have installed the program before you need it to diagnose a problem. Once it is installed, you can run iostat without any arguments to get an overall view of the system.
[Figure: iostat output]
The first thing you see is CPU information similar to what top shows, followed by I/O statistics for every disk device on the system and its partitions. The columns have the following meanings:
tps
The number of transfers per second to the device. "Transfer" is another way of saying an I/O request sent to the device.
Read rate (Blk_read/s or kB_read/s, depending on the sysstat version)
The amount of data read from the device per second.
Write rate (Blk_wrtn/s or kB_wrtn/s, depending on the sysstat version)
The amount of data written to the device per second.
Total read (Blk_read or kB_read)
The total amount of data read from the device.
Total written (Blk_wrtn or kB_wrtn)
The total amount of data written to the device.
When a system is under high I/O load, the first step is to look at each partition and see which one has the highest I/O load. Say you have a database server and the database itself is stored on the /dev/sda3 partition. If you see that most of the I/O is on that partition, that is a good clue: the database is probably what is consuming the I/O.
Once you know that, the next step is to determine whether the I/O is mostly reads or mostly writes. Suppose you suspect that a backup job is responsible for the increase in I/O. A backup job mainly reads files from the file system and sends them over the network to the backup server, so if most of the I/O turns out to be writes rather than reads, you can probably rule out the backup job.
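The two checks described above (which partition is busiest, and whether its I/O is mostly reads or writes) can be sketched in Python once the per-device rates have been copied out of the iostat output. The function names and the sample figures are hypothetical; only the /dev/sda3 database partition comes from the text.

```python
from typing import Dict, Tuple

# Maps a device name to (read rate, write rate) as reported by iostat.
DeviceRates = Dict[str, Tuple[float, float]]


def busiest_device(rates: DeviceRates) -> str:
    """Return the device with the highest combined read and write rate."""
    if not rates:
        raise ValueError("no devices given")
    return max(rates, key=lambda device: rates[device][0] + rates[device][1])


def dominant_direction(read_rate: float, write_rate: float) -> str:
    """Say whether a device's I/O is mostly reads or mostly writes."""
    if read_rate > write_rate:
        return "read"
    if write_rate > read_rate:
        return "write"
    return "balanced"


# Hypothetical rates for illustration; not taken from the lost figure.
sample = {"sda1": (1.2, 3.4), "sda2": (0.0, 0.5), "sda3": (850.0, 40.5)}
device = busiest_device(sample)
print(device, dominant_direction(*sample[device]))
```

With these sample numbers the busiest partition is read-dominated, which would be consistent with a backup job rather than ruling it out.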
You may need to run iostat more than once to get an accurate picture of the current I/O on the system. If you give iostat a number as an argument, it runs continuously and refreshes its output at that interval in seconds. For example, to see the iostat output every 2 seconds, you can enter sudo iostat 2. If you have any NFS shares, another very useful iostat option is -n, which makes iostat report I/O statistics for all NFS shares.
Besides iostat, recent distributions offer a much simpler tool: iotop. In effect it is a cross between top and iostat: it shows all the running processes on the system, sorted by their I/O statistics. It relies on some newer features of the Linux kernel, so it requires kernel 2.6.20 or later. If the program is not installed by default, you can find it in the iotop package. Debian-based distributions include this package, but on Red Hat-based distributions you will need to find an RPM online or install it from a third-party repository. Once the package is installed, run iotop as root and you will see output like the following:
[Figure: iotop output]
In this example, you can see that the rsync process is performing a large number of read operations.
7. Troubleshoot high load after the fact
So far, this chapter has discussed how to find the cause of high load while the system is overloaded. top and iostat are great tools, but we are not always lucky enough to be on the system while the problem is happening. I cannot count the number of times I have been told a machine was running slowly, only for the load to drop before I logged in. With just a little extra work, you can install tools on the server that record performance data throughout the day.
We have already discussed how to use the iostat tool from the sysstat package to diagnose high I/O, but sysstat also includes tools that report CPU and RAM usage. You can get that information from top as well, but sysstat is more powerful because it provides a simple mechanism for recording system statistics such as CPU load, RAM, and I/O. With these statistics, when someone complains that the system was slow at noon yesterday, you can check the logs to see what caused the problem.
Why is the server so slow? Depleted CPU, RAM, and disk I/O resources