Linux backend diagnosis and debugging techniques

Last Update:2014-12-28 Source: Internet

Author: User



This document is not a tutorial on Linux commands and does not cover every aspect of their usage. Internet companies run many projects, and the dependencies between service programs are complicated. Many factors can affect the normal running of an online service, and this is especially true for basic service components. When something goes wrong, how do we diagnose which part of the Linux system has a problem? From the perspective of the Linux system, we need to use various commands and techniques to quickly identify the symptoms and the critical point of the failing program.

For example, an online service has run normally for a long time, and one day a program you are responsible for suddenly stops serving requests or hangs (the process is still running but does not respond to external requests). The application logs show no obvious errors or hints (the cause is deeply hidden), and under high pressure and a tight schedule even an experienced development engineer may not find the cause quickly.

I try to cover the key points as much as possible and share the solutions and ideas from problems I have met as a reference for readers. My knowledge is limited, so omissions and shortcomings are inevitable. Comments and corrections are welcome.

The following problems are real cases that engineers have encountered in daily development.

Diagnosis and debugging techniques explained
Starting from the six aspects above, I analyze from multiple dimensions where the symptoms of an application or service problem lie. Because of the nature of Internet companies, engineers change jobs relatively often, and a successor (often a newcomer) may be unable to handle online incidents in time. Lack of experience, or of knowledge about the online environment, adds further uncertainty. A simple, general, and practical set of detection and measurement rules greatly reduces these risks. I hope this reference helps you find clues in the large amount of online metric data and narrow down the scope of troubleshooting.

Because program problems need to be diagnosed and debugged, we first have to analyze and determine which factors cause Linux performance to fluctuate.
Sun Tzu
Know the enemy and know yourself, and in a hundred battles you will never be in peril;
know yourself but not the enemy, and you will win one and lose one;
know neither the enemy nor yourself, and you will be in peril in every battle.

Factors Affecting Linux Performance

Operating System
CPU
Memory
Disk I/O and Performance
Network I/O status and Performance

Linux System Performance Evaluation Criteria

Note: Swap and paging

Linux System Performance Analysis commands or tools

Common commands:

dstat, vmstat, sar, iostat, netstat, free, ps, top, etc.

Performance analysis and diagnosis tools

Use vmstat, sar, and iostat to detect CPU, disk, and memory bottlenecks.
Use free and vmstat to detect memory and I/O bottlenecks.
Use iostat to check for a disk I/O bottleneck.
Use dstat for a combined analysis of network bandwidth, disk I/O, memory, and load bottlenecks.
Use mpstat to check whether the CPU load is unbalanced across cores. You can also use top instead.
Use pidstat to check the CPU consumption of a specific process.
Use netstat to check for unsent or unprocessed data in the socket buffers, and so judge whether the processing capacity of the program has dropped or something else is wrong.
Use lsof to detect an excessive number of open file descriptors (network sockets, disk files, pipes, etc.) that leads to resource exhaustion.
Use df and du together to check whether a disk or directory uses too much space or whether inodes are exhausted.
Use iftop to check for a traffic bottleneck between two hosts.

Parameters and terms used in the commands below:

Buffer: stores data that is about to be written to disk (a block device).
Cache: stores data that has been read from disk. Both the buffer and the cache exist to improve I/O performance and are managed by the OS.
Swap: the Linux kernel reads and writes virtual memory in units of pages. Memory is moved out to the swap space on disk, and read back from the swap space into memory, page by page.
Paging: this exchange between memory and swap is called paging.

Linux-top

Command: top

Shift + h: show CPU consumption by thread.
Press 1: show the consumption of each core.
us: if this value is too high, the application itself consumes most of the CPU.
sy: if this value is too high, the kernel consumes much CPU time, for example because threads are switched frequently.
wa: percentage of time spent waiting for I/O.
hi: hardware interrupts (for example, the NIC frequently receiving data).
top -p pid: show the multi-column information for the specified process only.

Linux-free

Command: free -m

The -/+ buffers/cache row
Memory used by applications = Mem.used - Mem.buffers - Mem.cached
Memory still available to applications = Mem.free + Mem.buffers + Mem.cached
The output of free -m has 4 rows. For convenience, number the columns as well and treat the output as a two-dimensional array rowcols: row 1 is the header, row 2 is Mem, row 3 is -/+ buffers/cache, and row 4 is Swap; column 1 is total, 2 is used, 3 is free, 4 is shared, 5 is buffers, and 6 is cached. The two equations above then read:
rowcols[3][2] = rowcols[2][2] - rowcols[2][5] - rowcols[2][6]
rowcols[3][3] = rowcols[2][3] + rowcols[2][5] + rowcols[2][6]
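The two equations can be checked with a small script. This is only a sketch: the sample text is a made-up capture in the classic free -m layout described above (newer versions of free print different columns), and the function name app_memory is hypothetical.

```python
# Hypothetical capture in the classic `free -m` layout (values are made up).
SAMPLE_FREE = """\
             total       used       free     shared    buffers     cached
Mem:          7872       7665        207          0        372       4963
-/+ buffers/cache:       2330       5542
Swap:         8191          0       8191
"""


def app_memory(free_output):
    """Return (used_by_apps, available_to_apps) in MB from `free -m` text."""
    for line in free_output.splitlines():
        if line.startswith("Mem:"):
            # Columns after the label: total used free shared buffers cached
            total, used, free, shared, buffers, cached = map(int, line.split()[1:7])
            return used - buffers - cached, free + buffers + cached
    raise ValueError("no 'Mem:' line found")


used_by_apps, available = app_memory(SAMPLE_FREE)
print("used by applications (MB):", used_by_apps)
print("available to applications (MB):", available)
```

The two results should match the second and third columns of the -/+ buffers/cache row in the same capture.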

Note: the memory values shown by free correspond to the fourth and fifth lines of the top header; the third line of top has no equivalent in free.

Linux-netstat/ss

Command: netstat -an | grep port

Recv-Q is the receive queue and is normally 0. If it is not 0, the data has already been received locally but the application has not processed it yet, possibly because the processing performance of the application has degraded.
Send-Q is the send queue and is normally 0. If it is not 0, the application may be sending data too fast, or the peer may be receiving it too slowly.
netstat -an | grep port (check whether the application port is listening normally)
We once hit this problem with the distributed cache memcached during peak access: heavy contention among threads for the memory pool lock at peak hours caused server performance to drop sharply.
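The Recv-Q and Send-Q check can be automated on captured output. The following is a minimal sketch, assuming the usual netstat -an column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the addresses and the function name backlogged are made up for illustration.

```python
# Hypothetical capture of `netstat -an | grep 9092` plus the header line.
SAMPLE_NETSTAT = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:9092            0.0.0.0:*               LISTEN
tcp     4096      0 10.0.0.5:9092           10.0.0.21:51234         ESTABLISHED
tcp        0  20480 10.0.0.5:9092           10.0.0.22:40112         ESTABLISHED
tcp        0      0 10.0.0.5:9092           10.0.0.23:33456         ESTABLISHED
"""


def backlogged(netstat_output):
    """Return (recv_q, send_q, local, foreign) for sockets with queued data."""
    result = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a tcp/udp socket line.
        if len(fields) < 6 or not fields[0].startswith(("tcp", "udp")):
            continue
        recv_q, send_q = int(fields[1]), int(fields[2])
        if recv_q or send_q:
            result.append((recv_q, send_q, fields[3], fields[4]))
    return result


for recv_q, send_q, local, foreign in backlogged(SAMPLE_NETSTAT):
    print(local, "<->", foreign, "Recv-Q:", recv_q, "Send-Q:", send_q)
```

A single nonzero sample is not conclusive. Queues that stay nonzero across repeated samples are the signal described above.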

Linux-vmstat

Command: vmstat 1 10

vmstat reports CPU usage, memory usage, virtual memory swapping, and I/O reads and writes.
The general format is vmstat interval count. With an interval of 1 and a count of 10, it prints one sample per second, 10 times in a row.
Example: vmstat 1 10

If the sum of the CPU us and sy percentages is close to 100%, or the number of processes waiting in the run queue (r) is persistently nonzero and usually greater than 4 while id is often below 40, the system is CPU bound. If the si and so values are persistently nonzero, the system is memory bound; persistently high bi and bo values point to heavy disk I/O instead. A swpd value that is too high generally means that physical memory is insufficient.

Memory

free: amount of idle physical memory (in KB).
buff: amount of memory used as buffers, generally needed for reads and writes to block devices.
cache: amount of memory used as page cache, generally for file system caching. Frequently accessed files are cached. If the cache value is large, many files are cached; if bi in the io group is relatively small at the same time, the file system is working efficiently.

Swap

si: amount of memory swapped in from disk per second. so: amount of memory swapped out to disk per second. Normally si and so are 0. If they stay nonzero for a long time, system memory is insufficient and you need to add memory.

in: number of interrupts per second.
cs: number of context switches per second. If there are far more context switches than normal, too many threads may have been created.
us: percentage of CPU time spent in user mode.
sy: percentage of CPU time spent in kernel (system) mode.
id: percentage of idle CPU time.
wa: percentage of CPU time spent waiting for I/O.
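The vmstat rules of thumb can be applied mechanically to a captured run. This is a sketch under stated assumptions: the samples are made up, the thresholds (r above 4, id below 40) are the ones quoted in this section, and "persistently nonzero" is interpreted here as "nonzero in every sample", which is only one possible reading. The names parse_vmstat and diagnose are hypothetical.

```python
# Two hypothetical `vmstat 1 3` style captures (values are made up).
CPU_BOUND = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 6  0      0 120000  30000 500000    0    0     2    10 1500 9000 70 25  5  0
 7  0      0 119000  30000 500000    0    0     0    12 1600 9500 72 24  4  0
 5  0      0 118000  30000 500000    0    0     0     8 1550 9100 71 23  6  0
"""

MEMORY_BOUND = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  1  20480  40000   1000  20000  300  450   900   800  400  600 10  5 60 25
 0  2  24576  38000   1000  19000  280  500   950   820  420  640  8  6 58 28
"""


def parse_vmstat(text):
    """Turn vmstat text into a list of {column name: int} dictionaries."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header = next(fields for fields in lines if fields[0] == "r")
    return [dict(zip(header, map(int, fields)))
            for fields in lines if fields[0].isdigit()]


def diagnose(rows):
    """Apply the rules of thumb from this section to parsed samples."""
    count = len(rows)
    findings = []
    avg_r = sum(row["r"] for row in rows) / count
    avg_id = sum(row["id"] for row in rows) / count
    if avg_r > 4 and avg_id < 40:
        findings.append("cpu")
    # Assumption: swapping in every sample counts as "persistent".
    if all(row["si"] > 0 or row["so"] > 0 for row in rows):
        findings.append("memory")
    return findings


print(diagnose(parse_vmstat(CPU_BOUND)))
print(diagnose(parse_vmstat(MEMORY_BOUND)))
```

Parsing by header name rather than by position keeps the sketch usable when a vmstat version adds a column such as st at the end.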

Linux-sar

Command: sar 1 10

The format for disk statistics is: sar -d interval count

Pay attention to the following fields:

await: average wait time (in milliseconds) of each I/O operation on the device.
svctm: average service time (in milliseconds) of each I/O operation on the device.
%util: percentage of each second that the device spends on I/O operations.

The following criteria are generally used to evaluate disk I/O performance with sar:

Under normal circumstances svctm should be smaller than await. The svctm value depends on disk performance, but CPU and memory load also affect it, and too many requests indirectly increase it. The await value generally depends on svctm, the length of the I/O queue, and the I/O request pattern. If svctm is close to await, there is almost no I/O wait and disk performance is good. If await is much higher than svctm, requests wait too long in the I/O queue and the applications running on the system slow down. In that case, replacing the disk with a faster one can solve the problem.
The %util value is also an important indicator of disk I/O. If %util is close to 100%, the disk is receiving too many I/O requests, the I/O system is working at full capacity, and the disk may be a bottleneck. Over time this will inevitably affect system performance. You can solve it by optimizing the program or by replacing the disk with a larger or faster one.
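The criteria above can be written down as a small check. This is only a sketch: the text says "much higher" and "close to 100%" without giving numbers, so the default thresholds below (await at least 5 times svctm, %util at or above 90) are assumptions chosen for illustration, and assess_disk is a hypothetical name.

```python
def assess_disk(await_ms, svctm_ms, util_pct, wait_ratio=5.0, util_limit=90.0):
    """Classify one `sar -d` sample using the criteria in this section.

    wait_ratio and util_limit are illustrative assumptions, not values
    taken from sar or from the article.
    """
    findings = []
    if svctm_ms > 0 and await_ms / svctm_ms >= wait_ratio:
        findings.append("long I/O queue wait")
    if util_pct >= util_limit:
        findings.append("device near saturation")
    return findings or ["ok"]


print(assess_disk(1.2, 1.0, 20.0))
print(assess_disk(50.0, 2.0, 99.0))
```

Tune both thresholds to the workload; a backup job and a latency-sensitive database tolerate very different await values.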

Linux-iostat

Command: iostat 1 10
Format: iostat interval count

The output fields are explained as follows:
Average percentage = sum of the percentages of all CPUs / number of CPU cores

%user: average percentage of CPU time spent running user processes.
%nice: average percentage of CPU time spent running processes whose priority was lowered with the nice command. Simply put, nice changes the scheduling priority of a process.
%system: average percentage of CPU time spent in the kernel.
%iowait: average percentage of CPU time spent waiting for I/O.
%irq: percentage of CPU time spent handling interrupts.
%steal: percentage of time a virtual CPU waits involuntarily while the hypervisor services another virtual processor.
%idle: percentage of time the CPU is idle.
Blk_read/s: number of blocks read per second.
Blk_wrtn/s: number of blocks written per second.
Blk_read: total number of blocks read.
Blk_wrtn: total number of blocks written.
The values of Blk_read/s and Blk_wrtn/s give a basic picture of the disk's read and write load. If Blk_wrtn/s is large, the disk is written to frequently, and you can consider optimizing the disk or the program. If Blk_read/s is large, the disk serves many direct reads, and you can consider keeping the data being read in memory.
There is no fixed threshold for these two values; they vary with the applications running on the system. One rule still holds: sustained, very large reads or writes are not normal and will affect system performance.

Linux-mpstat

Command: mpstat -P ALL 1 10

Format: mpstat -P ALL interval count

mpstat displays CPU statistics for the system. The -P ALL option shows the statistics of all CPUs.

%user: percentage of CPU time spent running user processes.
%nice: percentage of CPU time spent running processes whose priority was lowered with the nice command. In short, nice changes the scheduling priority of a process.
%system: percentage of CPU time spent in the kernel.
%iowait: percentage of CPU time spent waiting for I/O.
%irq: percentage of CPU time spent handling interrupts.
%steal: percentage of time a virtual CPU waits involuntarily while the hypervisor services another virtual processor.
%idle: percentage of time the CPU is idle.
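One way to use per-core figures from mpstat is to compare each core with the average over all cores. The sketch below assumes you have already computed a busy percentage per core (for example 100 minus %idle); the sample numbers, the 30-point gap, and the name unbalanced_cores are illustrative assumptions.

```python
def unbalanced_cores(busy_by_core, gap=30.0):
    """Return the cores whose busy % exceeds the all-core average by `gap`."""
    # Average percentage = sum of the percentages of all CPUs / number of cores
    mean = sum(busy_by_core.values()) / len(busy_by_core)
    return sorted(core for core, busy in busy_by_core.items() if busy - mean > gap)


# Hypothetical samples: core id -> busy percentage (100 - %idle).
skewed = {0: 98.0, 1: 5.0, 2: 4.0, 3: 6.0}
even = {0: 40.0, 1: 42.0, 2: 39.0, 3: 41.0}

print(unbalanced_cores(skewed))
print(unbalanced_cores(even))
```

A single hot core with the others idle often points to one busy thread or to interrupts pinned to one CPU, which is why the per-thread view in top is a natural next step.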

Linux-dstat

Command: dstat -tcdnmlpygs 1 10

Format: dstat -tcdnmlpygs interval count

dstat is a multi-function system resource statistics tool.

The individual dstat fields are not described here, because they have already been explained for the other commands above.
The information it provides is similar to the combined output of top, free, iostat, vmstat, and other tools.

CPU status: CPU usage.
Disk statistics: the total amount of disk reads and writes.
Network statistics: the total amount of data received and sent by network devices.
Paging statistics: system paging activity.
System statistics: interrupts (int) and context switches (csw).
Memory statistics: memory used by applications, buffers, cache, and free memory.

Linux-iftop

Command: iftop

iftop displays a list of the network bandwidth used between this system and each source or destination host. The list is refreshed periodically.

iftop interactive keys

Host display
n: toggle host name resolution, that is, show host names or IP addresses;
s/d: toggle display of the source host or the destination host;
t: cycle through the display modes for sent and received traffic: two lines per host, one line per host, sent traffic only, received traffic only;
Port display
N: toggle between service names and port numbers, such as ssh or 22;
S/D: toggle display of the source port or the destination port;
p: toggle display of port numbers;
Sort display
1/2/3: sort by the average traffic over the last 2 seconds, 10 seconds, or 40 seconds;
<: sort by source host address;
>: sort by destination host address;
o: freeze the current order, which helps when watching specific connections because they no longer move after a refresh;
Meanings
TX: sent traffic
RX: received traffic
TOTAL: total traffic
cum: cumulative traffic since iftop was started
peak: peak traffic
rates: average traffic over the last 2 s, 10 s, and 40 s respectively

Linux-df & du

The du command shows the disk space used by files and directories.

Show the space used by each subdirectory of the current directory: du -h --max-depth=1
Sort subdirectories by size (in descending order): du -s * | sort -rn | cut -f2- | xargs -d "\n" du -sh
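The same ranking can be sketched in Python for cases where the result needs further processing. This is an approximation, not a reimplementation of du: it sums apparent file sizes instead of allocated disk blocks, and the function names are hypothetical.

```python
import os


def dir_size(path):
    """Sum the apparent sizes (in bytes) of regular files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):  # do not follow symbolic links
                total += os.path.getsize(full)
    return total


def rank_subdirs(parent):
    """Return [(size_in_bytes, name), ...] for subdirectories, largest first."""
    sizes = [(dir_size(entry.path), entry.name)
             for entry in os.scandir(parent)
             if entry.is_dir(follow_symlinks=False)]
    return sorted(sizes, reverse=True)
```

Calling rank_subdirs(".") in an application's data directory gives roughly the same ordering as the du pipeline above.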

Summary of Linux diagnostic and debugging commands

// Count the client connections to a local listening port, grouped by client and sorted in descending order
lsof -i :9092 | awk '{print $9}' | cut -d ">" -f2 | awk '{cidx = index($1, ":"); print substr($1, 1, cidx-1)}' | sort | uniq -c | sort -rn
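To make the pipeline easier to follow, here is the same grouping written out step by step on captured lsof -i output. It is a sketch under stated assumptions: the ninth column is taken to be NAME in the form local:port->remote:port, and the host names, PIDs, and the function name clients_by_connections are made up.

```python
from collections import Counter

# Hypothetical capture of `lsof -i :9092` (hosts and numbers are made up).
SAMPLE_LSOF = """\
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
java    31476 app    45u  IPv4 100001      0t0  TCP *:9092 (LISTEN)
java    31476 app    46u  IPv4 100002      0t0  TCP server01:9092->client-a:51234 (ESTABLISHED)
java    31476 app    47u  IPv4 100003      0t0  TCP server01:9092->client-a:51235 (ESTABLISHED)
java    31476 app    48u  IPv4 100004      0t0  TCP server01:9092->client-b:40112 (ESTABLISHED)
java    31476 app    49u  IPv4 100005      0t0  TCP server01:9092->client-a:51236 (ESTABLISHED)
"""


def clients_by_connections(lsof_output):
    """Return [(client_host, connection_count), ...], busiest client first."""
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Column 9 is NAME; only established connections contain "->".
        if len(fields) < 9 or "->" not in fields[8]:
            continue
        remote = fields[8].split("->", 1)[1]   # e.g. "client-a:51234"
        host = remote.rsplit(":", 1)[0]        # drop the port
        counts[host] += 1
    return counts.most_common()


for host, count in clients_by_connections(SAMPLE_LSOF):
    print(count, host)
```

Unlike the shell pipeline, this version skips the header and the LISTEN line instead of counting them as an empty client name.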

Case Analysis

Case 1: creating a large number of network connections causes the service process to crash or hang

One or more machines are deployed as nodes of a basic service component or middleware (written in Java, Python, Go, C/C++, or any other language), and a large number of clients depend on this service and connect to the cluster. I ran into the following real situation: after the cluster had been running normally for a long time, one day a node in the cluster suddenly became unavailable. The application logs contained no error messages, yet the service program could not respond to external requests.
Following the rules above, I checked CPU, memory, disk I/O, network I/O, network traffic, and disk capacity.
The number of connections to the listening port turned out to be abnormally large: one client host held a large number of network connections that had not been released. Usually the maximum number of file descriptors on a Linux system is set to a large value, so in most cases other resources run out first.
The following commands reproduce that environment.
[Lizhitao@host] netstat -an | grep 9094

[Lizhitao@host] lsof -i :9094 | awk '{print $9}' | cut -d ">" -f2 | awk '{cidx = index($1, ":"); print substr($1, 1, cidx-1)}' | sort | uniq -c | sort -rn

Obviously, mafka.sankuai.com creates a large number of connections to this host, which exhausts its resources.

Note: netstat generally displays IP addresses, while lsof generally displays host names. When there are many client connections (in TCP, each socket occupies 500 bytes), it is hard to see at a glance which client host holds most of them, so combining the two commands is more convenient.

Additional reading:
The case above is a network connection (network file descriptor) problem. Similar problems can be caused by a large number of open disk files (descriptors) when a program bug fails to release them.
This typically happens in distributed file systems that store images or small files.

Recursively list all open files under a directory:
[Lizhitao@host] lsof +D /DIR/
In the lsof output, the TYPE values REG and DIR indicate an open disk file and an open directory respectively.
[Lizhitao@host] lsof -p [pid] | grep "REG" // list all open disk files of the process
By counting and analyzing the results, you can determine whether a large number of disk files are not being released, and then locate the problematic code.

Case 2: an online Java service process suddenly uses a lot of CPU

Sort the threads of the Java process by CPU consumption:
[Lizhitao@host] ps p 31476 -L -o pcpu,pid,tid,time,tname,stat,psr | gawk '{printf("%s %s %s %s %s %s %x\n", $1, $2, $3, $4, $5, $6, $3)}' | sort -n -k1 -r

If a thread in the resulting list shows high CPU consumption, for example about 80%, take its thread ID (here 31510) and convert it to hexadecimal (0x7b16). This is the nid value that jstack prints for the thread.
[Lizhitao@host] jstack 31476 | grep "7b16"

[Lizhitao@host] jstack 31476
This narrows down the troubleshooting scope: find the corresponding call stack and code, and then analyze the relevant logic.
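The conversion from the decimal thread ID reported by ps to the hexadecimal nid used by jstack is easy to get wrong by hand. The sketch below shows the conversion and a lookup in a captured thread dump; the dump text, thread names, and function names are made up for illustration.

```python
# Hypothetical excerpt of a jstack thread dump (names and addresses are made up).
SAMPLE_DUMP = '''\
"worker-1" #23 prio=5 os_prio=0 tid=0x00007f3c4c0a1000 nid=0x7b16 runnable [0x00007f3c2b7f6000]
   java.lang.Thread.State: RUNNABLE
"worker-2" #24 prio=5 os_prio=0 tid=0x00007f3c4c0a3000 nid=0x7b17 waiting on condition [0x00007f3c2b6f5000]
   java.lang.Thread.State: WAITING (parking)
'''


def to_nid(tid):
    """Convert the decimal thread ID from ps into jstack's hexadecimal form."""
    return format(tid, "#x")


def find_thread(dump, tid):
    """Return the thread header lines whose nid matches the given thread ID."""
    token = "nid=" + to_nid(tid)
    # Compare whole tokens so that 0x7b16 does not also match 0x7b160.
    return [line for line in dump.splitlines() if token in line.split()]


print(to_nid(31510))
for line in find_thread(SAMPLE_DUMP, 31510):
    print(line)
```

For thread ID 31510 the conversion yields 0x7b16, which is the string to search for in the jstack output.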

Case 3: files cannot be created when a process starts

If you encounter this error, you can first check the hard disk status (this cause is generally unlikely, except on database or storage servers).
For hard disk checks with smartctl, see: Check the hard disk health status in Linux
Linux hard disk detection
If the disk is healthy, the likely cause is that the disk is full.
Run df -h to view disk space usage.
Then run the following command in the deployment directory or data directory of the application:
du -s * | sort -rn | cut -f2- | xargs -d "\n" du -sh
This shows which directories occupy the most space, so that you can handle them accordingly.
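File creation can also fail when inodes run out even though space is left, which is why both figures are worth checking. The following sketch uses os.statvfs, which is available on Unix-like systems; the function name fs_usage is hypothetical, and the path "/" is only an example.

```python
import os


def fs_usage(path):
    """Return (space_used_pct, inodes_used_pct) for the file system at `path`."""
    st = os.statvfs(path)
    # f_blocks/f_bfree count blocks; f_files/f_ffree count inodes.
    space = 100.0 * (st.f_blocks - st.f_bfree) / st.f_blocks if st.f_blocks else 0.0
    # Some file systems report 0 total inodes; treat that as "not applicable".
    inodes = 100.0 * (st.f_files - st.f_ffree) / st.f_files if st.f_files else 0.0
    return space, inodes


space_pct, inode_pct = fs_usage("/")
print("space used: %.1f%%, inodes used: %.1f%%" % (space_pct, inode_pct))
```

On the command line, df -h and df -i report the same two kinds of usage.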

Summary

The above covers the problems I encountered during backend diagnosis and debugging on Linux, what I learned from them, and how I solved them. My summary is as follows:
Backend diagnosis and debugging on Linux is complex and challenging. The operator or engineer has to combine various methods to narrow down the scope step by step until the problem is located at a specific point, and different applications behave differently. For example, a common Java (Tomcat) service and the memcached distributed cache may run into different problems. First, at the system level, draw a line to decide whether the problem comes from the operating system, the local service program, or access to other hosts. Then analyze the bottleneck from six aspects: network I/O, disk I/O, memory, CPU, open files, and threads. No single command or method cures every problem. Engineers need to choose the right methods for each application environment and combine multiple commands.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
