Why is the server so slow? Exhausted CPU, RAM, and disk I/O resources
1. System Load
A machine runs slowly because something is consuming too much of a particular system resource. The main system resources are CPU, RAM, disk I/O, and network. Excessive use of any of them causes trouble for the system. However, as long as you can log on to the system, you can use a wide range of tools to determine the cause of the problem.
When you troubleshoot a slow system, the system load average is probably the first basic measurement to check.
The most common command is uptime:
The three numbers after "load average" (2.03, 30.17, and 15.09) represent the machine's load average over the last 1 minute, 5 minutes, and 15 minutes, respectively. The load average of a system equals the average number of processes in the runnable or uninterruptible state.
On a single-CPU system, a load average of 1 means that the CPU is under constant load. If the load average of a single-CPU system is 4, the system has 4 times the load it can handle, so roughly 3/4 of the processes are waiting for resources. Load is not normalized by CPU count: a single-CPU system with a load of 1 is as busy as a four-CPU system with a load of 4.
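Dividing the load average by the number of CPUs makes the comparison above concrete. The following is a minimal sketch; load_per_cpu is a hypothetical helper name, and both values are passed in by hand rather than read from the system.

```shell
# Divide a load average by the number of CPUs.
# load_per_cpu is a hypothetical helper; both arguments are supplied by the caller.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

load_per_cpu 4 1    # single-CPU system with a load of 4: prints 4.00
load_per_cpu 4 4    # four-CPU system with the same load: prints 1.00
```

On a Linux system with coreutils installed, the live values could be passed in as load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)".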
In this example, the 5-minute and 15-minute load averages are high while the 1-minute load average is much lower, which tells us that the load spiked within the last several minutes and is already subsiding. Generally, we run the uptime command several times (or use the top command) to check whether the load is continuing to rise or fall.
What counts as a high load average:
This depends on what is causing the load. Because the load describes the average number of active processes that are using resources, a load spike can mean several different things. It is very important to determine whether the load is CPU-bound (processes waiting for CPU resources), RAM-bound (especially when frequently used RAM is moved to swap), or I/O-bound (processes competing for disk or network I/O resources).
Generally, CPU-bound systems are more responsive than I/O-bound systems. I have seen CPU-bound systems under very heavy load on which troubleshooting tools still ran with good response times. On the other hand, I have seen I/O-bound systems with relatively low load on which just logging on took a while, because their disk I/O was completely saturated. A system that has run out of RAM usually behaves like an I/O-bound system: once it starts using swap storage on the disk, it consumes disk resources and processes slow down until they stall.
When you need to troubleshoot high load, the first tool that comes to mind is the top command. It shows real-time information about the system, including how long the system has been up, the load average, the total number of processes, the total, used, and free memory, and a list of processes with the resources each one is using. By default, top sorts processes by CPU usage in descending order, so you can see at a glance which processes are consuming CPU resources.
The first column of the top output is the PID. To terminate a process, press the k key and enter the PID of the process to terminate. top then prompts for the signal to send, defaulting to signal 15; press Enter to accept it.
By default, the top command runs in interactive mode. If you want to see the complete output of the top command or redirect it to a file, the -b option enables batch mode, and the -n option controls how many times the output is refreshed before top exits.
To view the complete output, run the top command only once:
top -b -n 1
Store the information to the top_output file:
top -b -n 1 > top_output
If you want to view the output of the top command and write the output to a file, you can use the tee tool:
top -b -n 1 | tee top_output
2.1 understand the output of the top command
The first line of the top output matches the output of the uptime command shown earlier.
The top command also provides additional metrics. For example, the Cpu(s) line provides information about what the CPUs are currently doing.
us: user CPU time
The percentage of CPU time spent running user processes that have not been niced (nicing a process changes its priority relative to other processes).
sy: system CPU time
The percentage of CPU time spent running the kernel and kernel processes.
ni: nice CPU time
If you have changed the priority of some processes by nicing them, this metric tells you the percentage of CPU time they use.
id: CPU idle time
This is one of the metrics that you want to be high. It represents the percentage of time the CPU is idle. If the system is slow but this value is high, you know that the cause is not high CPU load.
wa: I/O wait
The percentage of time the CPU spends waiting for I/O. This is a very valuable metric when you troubleshoot a slow system. If this value is very low, you can easily rule out disk or network I/O problems.
hi: hardware interrupts
The percentage of time the CPU spends servicing hardware interrupts.
si: software interrupts
The percentage of time the CPU spends servicing software interrupts.
st: steal time
If you are running a virtual machine, this metric indicates the percentage of CPU time that was taken away from it for other tasks, such as running other virtual machines.
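When these values need to be checked from a script, a single field can be pulled out of the Cpu(s) line. The following is a minimal sketch; cpu_field is a hypothetical helper name, and it assumes the two layouts commonly printed by top (values written as 0.7%wa in older versions and as 0.5 wa in newer ones).

```shell
# Print the value of one field (us, sy, ni, id, wa, hi, si, or st)
# from a Cpu(s) summary line read on standard input.
# cpu_field is a hypothetical helper, not part of top.
cpu_field() {
    sed -n "s/.*[ ,:]\([0-9.][0-9.]*\)[% ]*$1.*/\1/p"
}

# Possible usage on a live system (assumes top supports batch mode):
#   top -b -n 1 | grep 'Cpu(s)' | cpu_field wa
#   top -b -n 1 | grep 'Cpu(s)' | cpu_field id
```

The helper only extracts text; deciding what counts as a high or low value is still up to whoever reads it.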
In the example above, the system is more than 50% idle, which is consistent with a machine that has four CPUs and a load of 1.70. When you troubleshoot a slow system, I/O wait is one of the first metrics to look at, because it lets you rule out disk I/O problems. If I/O wait is low, look at the CPU idle percentage next. If I/O wait is high, the next step is to find out what is causing it. If both I/O wait and idle time are low, you will probably see a high percentage of user time, and you must find out what is causing that. If I/O wait is very low and idle time is very high, the slowness is not caused by CPU resources and you should look for the cause elsewhere. This may mean checking for network problems, web server problems, or slow MySQL queries.
2.2 solve high user time problems
A common and relatively simple problem to troubleshoot is high load caused by a high percentage of user CPU time. If user time is high but I/O wait is low, you need to determine which processes on the system are consuming the CPU. By default, top sorts processes by CPU usage from high to low.
In this example, the mysqld process consumes 53% of the CPU time and the nagios2db_status process consumes 12%. These numbers are percentages of a single CPU. On a machine with 4 CPUs, you may see several processes each consuming 99% of the CPU time.
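The same ranking can be produced outside of top when it needs to be scripted. The following is a minimal sketch; top_cpu_procs is a hypothetical helper name, and it assumes input lines of the form "PID %CPU COMMAND" with the header already removed.

```shell
# Print the N lines with the highest value in the second column (%CPU).
# top_cpu_procs is a hypothetical helper; input format: "PID %CPU COMMAND".
top_cpu_procs() {
    sort -k2,2nr | head -n "$1"
}

# Possible usage on a live system (tail drops the ps header line):
#   ps -eo pid,pcpu,comm | tail -n +2 | top_cpu_procs 5
```

Note that ps reports CPU usage averaged over the lifetime of each process, so its ranking can differ from the per-refresh figures that top shows.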
Often it is easy to see that the first one or two processes in the top output have a very high CPU percentage while the rest are relatively low. In this case, the solution is to terminate the process that is using a large amount of CPU (press the k key and then enter the PID of the process).
The harder case is when the system is simply doing too many things across many processes. For example, a web server may have a large number of Apache processes along with some log-parsing scripts running from cron, all consuming about the same amount of CPU. The long-term solution to this problem is more complicated. On the web server, you do need all of the Apache processes, and you may need the log-parsing tool as well. In the short term, you can terminate (or postpone) some processes until the load drops, but in the long term you may need to add system resources or split these functions across multiple servers.
2.3 solve the problem of insufficient memory
The next two lines of the top output provide valuable information about RAM usage. It is important to rule out memory problems before troubleshooting other system issues.
The first of these lines shows how much physical memory there is in total, how much is used, how much is free, and how much is cached. The second line provides similar information for swap storage, along with how much RAM is used by the Linux file cache. Note that only 26768 KB of memory is shown as free.
To find out how much RAM processes are actually using, you must subtract the file cache from the RAM in use. Of the 997408 KB of RAM in use, 286040 KB is occupied by the file cache, so only 711368 KB is actually used by processes. In this example, the system still has plenty of available memory, and almost no swap storage is used. Even if some swap storage is in use, that alone is not a symptom of a problem: if a process becomes idle, Linux often moves its memory to swap to free RAM for other processes. A good way to tell whether RAM is exhausted is to look at the file cache. If the memory in use minus the file cache is large, and swap usage is also high, there is probably a memory problem.
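The subtraction described above can be written out as a one-line calculation. The following is a minimal sketch; ram_used_by_processes is a hypothetical helper name, and the two values are the used and cached figures quoted in the text, in KB.

```shell
# Subtract the file cache from the RAM in use (both in KB).
# ram_used_by_processes is a hypothetical helper name.
ram_used_by_processes() {
    echo $(( $1 - $2 ))
}

ram_used_by_processes 997408 286040    # prints 711368
```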
If you find a memory problem, the next step is to determine which processes are consuming RAM. By default, top sorts by CPU usage, so you need to sort by RAM usage instead: keep top open and press the M key. This sorts all processes by RAM usage.
Look at the %MEM column to see whether the first few processes are using a large amount of RAM. If you find processes that use a lot of RAM, you can terminate them, or troubleshoot them specifically to find out why they use so much.
In fact, the output of the top command can be sorted by any column. To change how the top output is sorted, press the F key to open the screen for selecting the sort column. After pressing the key that corresponds to a specific column (for example, K corresponds to the CPU column), press Enter to return to the top output.
The Linux kernel also has an out-of-memory (OOM) killer, which steps in when low memory puts the system at risk. When the system is about to run out of memory, the OOM killer starts terminating processes. In some cases it terminates the process that is using a large amount of RAM, but there is no guarantee that it will not terminate other processes instead. Sometimes it terminates programs such as sshd rather than the real culprit. In many cases the system is unstable after the OOM killer has terminated some processes, and you have to restart the machine to make sure all system processes are running properly. If the OOM killer has stepped in, its activity is recorded in /var/log/syslog.
2.4 solve the high I/O wait time problem
When I/O wait accounts for a high proportion of CPU time, first check whether the machine is using a lot of swap space. Because disks are much slower than RAM, system performance suffers badly when the system runs out of RAM and starts using swap. Anything that needs to access the disk then has to compete with swap for disk I/O. Therefore, the first step is to check whether memory is exhausted. If it is, solve that problem first. If there is plenty of RAM available, you need to determine which process is responsible for most of the I/O.
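One way to check swap usage from a script is to read the SwapTotal and SwapFree fields that Linux exposes in /proc/meminfo. The following is a minimal sketch; swap_used_kb is a hypothetical helper name, and it reads meminfo-formatted text from standard input.

```shell
# Print the amount of swap in use, in KB, from /proc/meminfo-style input.
# swap_used_kb is a hypothetical helper name.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END           { print total - free }'
}

# Possible usage on a Linux system:
#   swap_used_kb < /proc/meminfo
```

A small value here, together with plenty of free or cached RAM, suggests that swap is not the cause of the I/O wait.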
Sometimes it is hard to figure out which process is using a lot of I/O, but if the system has multiple partitions, you can narrow things down by finding the partition that is handling most of the I/O. To do this, use the iostat program, which is provided by the sysstat package on both Red Hat-based and Debian-based systems. If it is not installed on the machine, you can install it with the package management tool.
The output starts with CPU information similar to that of the top command, followed by I/O statistics for every disk device and partition on the system. The columns have the following meanings (depending on the sysstat version and options, the Blk_ columns may instead be reported in kilobytes as kB_read/s, kB_wrtn/s, kB_read, and kB_wrtn):
tps: the number of transfers issued to the device per second. A transfer is another way of saying an I/O request to the device.
Blk_read/s: the amount of data read from the device per second.
Blk_wrtn/s: the amount of data written to the device per second.
Blk_read: the total amount of data read from the device.
Blk_wrtn: the total amount of data written to the device.
When the system is under high I/O load, first look at each partition to see which one has the highest I/O load. For example, suppose a database server stores its data on the /dev/sda3 partition. If most of the I/O comes from there, that is a strong clue that the database is consuming a lot of I/O resources.
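Picking the busiest device out of a long iostat listing can also be scripted. The following is a minimal sketch; busiest_device is a hypothetical helper name, and it assumes device-only output in which the first column is the device name and the second column is tps.

```shell
# Print the name of the device with the highest tps value.
# busiest_device is a hypothetical helper; it skips banner and header lines
# by requiring a non-numeric first column and a plain decimal second column.
busiest_device() {
    awk 'BEGIN { max = -1 }
         NF >= 3 && $1 !~ /^[0-9.]+$/ && $2 ~ /^[0-9]+(\.[0-9]+)?$/ && $2 + 0 > max {
             max = $2 + 0
             dev = $1
         }
         END { print dev }'
}

# Possible usage on a live system (-d limits iostat to the device report):
#   iostat -d | busiest_device
```

Because a whole disk always shows at least as many transfers as any of its partitions, comparing partitions with each other may require filtering the input to partition rows first.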
Once you know this, the next step is to determine whether most of the I/O comes from reads or from writes. Suppose you suspect that a backup job is causing the increase in I/O. Because backups mostly read files from the file system and send them over the network to the backup server, you can probably rule out the backup if most of the I/O comes from writes rather than reads.
You may need to run iostat several times to get an accurate picture of the system's current I/O status. If you pass a number on the command line, iostat keeps running and refreshes its output at that interval in seconds. For example, to see iostat output every 2 seconds, enter iostat 2. If you have any NFS shares, another very useful iostat option is -n, which makes iostat report I/O statistics for all NFS shares.
In addition to iostat, there is a simpler tool called iotop. It is a cross between top and iostat: it displays all running processes on the system and sorts them by I/O statistics. It is not installed by default; you can find it in the iotop package.
In this example, the rsync process is performing a large number of I/O reads.
3. Troubleshoot high load after the problem has occurred
With just a little extra effort, you can install tools on the server that record performance data throughout the day, so that you can investigate after a problem has occurred.
The iostat tool in the sysstat package helps with high I/O problems, and sysstat also includes tools that report CPU and RAM usage. Although top can do that too, sysstat is more powerful because it provides a simple mechanism for recording system statistics such as CPU load, RAM, and I/O over time. With these statistics, when someone complains that the system was slow around noon yesterday, you can look at the logs to see what caused the problem.
3.1 configure sysstat
Nowadays, monitoring with Zabbix is also an easy option.
On Red Hat-based systems, you can edit the /etc/sysconfig/sysstat file and change the HISTORY option to keep statistics for more than 7 days. Statistics are captured every 10 minutes and summarized once a day. The default value is 28 days.
Once sysstat is enabled, it collects the system status every 10 minutes and stores it under /var/log/sysstat or /var/log/sa. It also rotates the statistics file every night just before midnight. All of this is done by the /etc/cron.d/sysstat script. If you want to change how often sysstat collects information, you can modify this file.
As sysstat collects statistics, it stores them in files whose names start with sa followed by the current day of the month (such as sa03). This means that you can look back at up to a month of statistics. You can use the sar tool to view them. By default, sar prints the CPU statistics for the current day:
The output shows that many of the CPU statistics are the same as those in the top output. In the last row, sar also provides an average for each value.
3.3 view RAM statistics
The sysstat scheduled tasks collect much more than CPU load information. You can use the -r option to view RAM statistics:
Here we can see how much memory is used and how much is free, along with information about swap space and the file cache. This is similar to the output of the top or free command, except that you can go back in time and view earlier values.
3.4 view disk statistics
To view disk statistics, use the -b option, which provides basic information about disk I/O.
The total number of transfers per second (tps) is the sum of the read transfers and write transfers per second (rtps and wtps, respectively). The bread/s column tells you the average amount of data read per second, and bwrtn/s tells you the average amount of data written per second.
The sar program accepts many options for printing specific sets of data. If you want to see all of the data, use the -A option. It displays statistics including load average, CPU load, RAM, disk I/O, network I/O, and other interesting values. The sar manual page (run man sar) lists the flags for viewing specific statistics.
Sometimes you only need the information for part of a day. To get information for a specific time range, use the -s and -e options to specify the start time and end time you are interested in. For example, to query the CPU data for a period ending at 8:30 pm, you need to input:
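sar expects the -s and -e values in 24-hour hh:mm:ss form. The following is a minimal sketch of that conversion; to_sar_time is a hypothetical helper name, and the 8:00 pm start time in the usage comment is an assumed value chosen only for illustration.

```shell
# Convert a 12-hour clock time such as "8:30pm" to the hh:mm:ss form
# accepted by sar -s and sar -e. to_sar_time is a hypothetical helper.
to_sar_time() {
    echo "$1" | awk '{
        t = tolower($0)
        pm = (t ~ /pm/)             # remember whether the time is pm
        gsub(/[^0-9:]/, "", t)      # keep only digits and the colon
        n = split(t, part, ":")
        hour = part[1] % 12         # 12am becomes 0, 12pm becomes 0 + 12
        if (pm) hour += 12
        minute = (n > 1) ? part[2] : 0
        printf "%02d:%02d:00\n", hour, minute
    }'
}

# Possible usage (the start time is an assumption, not taken from the text):
#   sar -s "$(to_sar_time 8:00pm)" -e "$(to_sar_time 8:30pm)"
```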
If you want data from a day other than today, use the -f option with the full path to a statistics file stored in the /var/log/sysstat or /var/log/sa directory. For example, to get the statistics for the sixth day of this month, enter:
sar -f /var/log/sysstat/sa06
You can also use any other sar options to obtain statistics of specific types.
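Because the statistics files are named after the day of the month, the path for a given day can be built rather than typed. The following is a minimal sketch; sa_file_for_day is a hypothetical helper name, and the default directory is the /var/log/sysstat location used in the example above.

```shell
# Print the path of the sysstat file for a given day of the month.
# sa_file_for_day is a hypothetical helper; the optional second argument
# overrides the directory (for example /var/log/sa).
sa_file_for_day() {
    day=${1#0}                      # drop one leading zero so "08" is not read as octal
    dir=${2:-/var/log/sysstat}
    printf '%s/sa%02d\n' "$dir" "$day"
}

sa_file_for_day 6                   # prints /var/log/sysstat/sa06

# Possible usage with sar:
#   sar -r -f "$(sa_file_for_day 6)"
```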