Linux back-end diagnostics and debugging technology

Last Update:2015-07-07 Source: Internet

Author: User

Tags memcached memory usage cpu usage


This article is not a tutorial on Linux commands, nor does it try to cover their usage exhaustively. Internet companies run many projects, and the services call and depend on one another in complex ways, so many factors can affect the normal operation of an online service, especially a basic service component. When problems appear, how do we diagnose them at the Linux level? We need to use the available commands and techniques to locate the symptoms and the root cause in the online program quickly.

For example, an online service has been running normally for a long time, and one day Zhang San's monitoring suddenly reports that a program he is responsible for has stopped serving or appears dead (the process is still there but does not respond to external requests). The application log shows no obvious error or useful hint (the cause is well hidden), and under heavy pressure and a tight schedule even an experienced development engineer may not find the cause right away. I want to share the key points with readers as a useful reference, together with some of my own approaches to solving such problems. My knowledge is limited, so omissions and shortcomings are inevitable; suggestions and corrections are welcome.

The problems described below are a true portrayal of an engineer's daily development work.

This article approaches diagnosis and debugging from six aspects and expands them into several dimensions for examining the symptoms of an application or service. Because of the nature of Internet companies, engineers move on relatively frequently; after job changes and handovers, a successor (often a newcomer) may not be able to respond in time to a situation online, whether through lack of experience or lack of familiarity with the production environment, and this adds a lot of uncertainty. A simple, universal, and measurable set of rules can reduce that risk significantly. I hope that, with this reference, readers can analyze the large amount of metric data available online and narrow down where the problematic program lies. Before a faulty program can be diagnosed and debugged, we need to determine which factors cause Linux performance to fluctuate.
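Sun Tzu, The Art of War, "Attack by Stratagem":
Know the enemy and know yourself, and in a hundred battles you will never be in peril;
know yourself but not the enemy, and you will win one and lose one;
know neither the enemy nor yourself, and every battle will be perilous.

Factors affecting Linux performance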

Operating system
CPU
Memory
Disk I/O and performance
Network I/O status and performance

Performance evaluation criteria for Linux systems

Note: The terms swap and paging are explained later in this article.

Linux system performance analysis commands or tools

Common commands:

dstat, vmstat, sar, iostat, netstat, free, ps, top, etc.

Performance analysis and diagnostic tools

Use vmstat, sar, and iostat to check whether the bottleneck is CPU, disk, or memory.
Use free and vmstat to check for memory or I/O bottlenecks.
Use iostat to check for disk I/O bottlenecks.
Use dstat for a combined view of network bandwidth, disk I/O, memory, load, and other bottlenecks.
Use mpstat to check whether CPU load is unbalanced across cores; top can be used instead.
Use pidstat to check the CPU consumption of a specific process.
Use netstat to check whether data is sitting unsent or unprocessed in socket buffers, which suggests that the program's processing capacity has degraded or that it has a problem.
Use lsof to check whether too many file descriptors are open (network sockets, disk files, pipes, and so on), leaving insufficient resources.
Use df and du together to check whether a mounted disk or directory is taking up too much space or the inodes are exhausted.
Use iftop to check whether there is a traffic bottleneck between two hosts.

Terms used in the commands below:

buffer: memory that holds data waiting to be written out to disk (a block device).
cache: memory that holds data read from disk. Both buffer and cache exist to improve I/O performance and are managed by the OS.
swap: the Linux kernel reads and writes virtual memory in units of pages, moving pages out of memory to the swap space on disk and reading them back from swap space into memory.
paging: this exchange of pages between memory and swap is called paging.

Linux-top

top

Shift + H: view CPU consumption by thread
1: view the consumption of each core
us: a high value indicates that applications are consuming most of the CPU
sy: a high value indicates that the system is switching threads frequently
wa: percentage of time spent waiting for I/O
hi: hardware interrupts (for example, how often the network card receives data)
top -p PID: displays only the information for the specified process

Linux-free

free -m

The -/+ buffers/cache row:
used: total memory used by applications = Mem.used - Mem.buffers - Mem.cached
free: memory not yet used by applications = Mem.free + Mem.buffers + Mem.cached
The output of free -m has four lines in total. To make it easier to explain, I numbered the columns, so that the output of free can be treated as a two-dimensional array rowcols and the two relationships can be written as two equations:
rowcols[3][2] = rowcols[2][2] - rowcols[2][5] - rowcols[2][6]
rowcols[3][3] = rowcols[2][3] + rowcols[2][5] + rowcols[2][6]

Note: The memory values shown by free correspond one-to-one with the fourth and fifth lines of top; top simply lacks the values in the third row of free.
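The two relationships can be checked with a few lines of shell. This is only a sketch under stated assumptions: it expects the older free -m layout in which the Mem: row has the columns total, used, free, shared, buffers, cached (newer versions of free print a different set of columns), and the function name app_memory is made up for this illustration.

```shell
# Hypothetical helper: read old-style `free -m` output on stdin and print
# the memory used by applications and the memory still available to them.
# Assumes the row layout "Mem: total used free shared buffers cached".
app_memory() {
    awk '$1 == "Mem:" {
        used  = $3 - $6 - $7    # Mem.used - Mem.buffers - Mem.cached
        avail = $4 + $6 + $7    # Mem.free + Mem.buffers + Mem.cached
        printf "app_used=%d app_free=%d\n", used, avail
    }'
}

# Usage, on a system whose free prints the old layout:
#   free -m | app_memory
```

If the numbers printed differ from the -/+ buffers/cache row, the installed free most likely uses the newer column layout and the field positions need adjusting.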

Linux-netstat/ss

netstat -an | grep PORT

Recv-Q: the network receive queue is normally 0. If it stays non-zero, data has arrived in the local receive buffer but the application has not yet processed it, which may mean the application's processing performance has degraded.
Send-Q: the send queue is normally 0. If it stays non-zero, the application may be sending packets too fast, or the peer may not be receiving them fast enough.
netstat -an | grep PORT also shows whether the application port is listening properly.
I previously ran into this behavior with memcached, a distributed cache, during peak access: in extreme cases, contention among threads for the lock on the memory pool caused server performance to drop sharply.
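As a rough illustration of the Recv-Q and Send-Q check, the sketch below filters netstat output for TCP sockets with a non-empty queue. It assumes the usual Linux column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function name backlogged_sockets is hypothetical.

```shell
# Hypothetical helper: read `netstat -an` output on stdin and print the TCP
# sockets whose receive or send queue is not empty.
# Assumes columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State
backlogged_sockets() {
    awk '$1 ~ /^tcp/ && ($2 + 0 > 0 || $3 + 0 > 0) {
        printf "%s recvq=%s sendq=%s peer=%s\n", $4, $2, $3, $5
    }'
}

# Usage:
#   netstat -an | backlogged_sockets
```

A single sample only shows the queues at one instant. Running it several times a few seconds apart shows whether the same sockets stay non-zero, which is the condition that matters here.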

Linux-vmstat

vmstat 1 10

vmstat reports CPU usage, memory usage, virtual memory swapping, and I/O reads and writes.
The general format is: vmstat interval count
Usage example: vmstat 1 10 (output once per second, 10 times in a row)

If the sum of us and sy is close to 100%, or the number of processes waiting in the run queue (r) is persistently non-zero and often greater than 4 while id is often below 40, the system is CPU-bound. A persistently high swpd value generally means that physical memory is insufficient.

Memory

free: the amount of physical memory currently idle (in KB).
buff: the amount of memory used as buffers, generally for reads and writes to block devices.
cache: the amount of memory used as page cache, generally for file system caching. Frequently accessed files are cached; a large cache value means many files are cached, and if bi is relatively small at the same time, the file system is working efficiently.

Swap

si: the amount of memory swapped in from the swap area on disk into memory.
so: the amount of memory swapped out from memory to the swap area on disk.
Normally si and so are both 0. If they stay non-zero for a long time, the system is short of memory and more memory needs to be added.

System

in: interrupts per second
cs: context switches per second. If this is much higher than normal, too many threads may have been created.

CPU

us: user CPU time %
sy: system CPU time %
id: idle CPU time %
wa: I/O wait %
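The vmstat rules of thumb (r often above 4, id often below 40, si and so not staying at 0) can be counted over a run of samples. The following is a minimal sketch, not a definitive check: the function name vmstat_summary is hypothetical, and it locates the columns from the header row rather than assuming fixed positions.

```shell
# Hypothetical helper: read `vmstat interval count` output on stdin and count
# how many samples break the rules of thumb described in this section.
vmstat_summary() {
    awk '
    $1 == "r" {                          # header row: map names to positions
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    ("r" in col) && $1 ~ /^[0-9]+$/ {    # data rows
        n++
        if ($(col["r"]) > 4)   busy_queue++
        if ($(col["id"]) < 40) low_idle++
        if ($(col["si"]) > 0 || $(col["so"]) > 0) swapping++
    }
    END {
        printf "samples=%d r_gt_4=%d id_lt_40=%d swapping=%d\n",
               n, busy_queue, low_idle, swapping
    }'
}

# Usage:
#   vmstat 1 10 | vmstat_summary
```

The first data row printed by vmstat is an average since boot, so a summary over only a few samples should be read with that in mind.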

Linux-sar

sar 1 10

The format is as follows: sar -d interval count

Several fields deserve attention:

await: the average wait time (in milliseconds) of each I/O operation on the device.
svctm: the average service time (in milliseconds) of each I/O operation on the device.
%util: the percentage of each second that the device spends on I/O operations.

When using sar to evaluate disk I/O performance, the following criteria generally apply:

Under normal circumstances, svctm should be less than await. svctm depends on disk performance, and CPU and memory load also affect it; too many requests indirectly raise it as well. await generally depends on svctm, the I/O queue length, and the I/O request pattern. If svctm is close to await, there is almost no I/O waiting and disk performance is good. If await is much higher than svctm, requests are waiting too long in the I/O queue and applications on the system will slow down; replacing the disk with a faster one can resolve this.
%util is also an important metric for disk I/O. If %util is close to 100%, the disk is receiving too many I/O requests, the I/O system is fully loaded, and the disk may be a bottleneck. Over time this is bound to hurt system performance; the fix is either to optimize the program or to replace the disk with a faster one.
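The criteria above can be written down as a small decision helper. This is a sketch only: the function name disk_hint is hypothetical, and the thresholds (90 for %util, a factor of 5 between await and svctm) are illustrative assumptions standing in for "close to 100%" and "much higher", not values taken from sar. Note also that the sysstat documentation warns that svctm is no longer a trustworthy field.

```shell
# Hypothetical helper: disk_hint AWAIT SVCTM UTIL
# Applies the rules of thumb from this section to one device sample.
# The cut-off values 90 and 5 are assumptions chosen for illustration.
disk_hint() {
    awk -v await="$1" -v svctm="$2" -v util="$3" 'BEGIN {
        if (util >= 90)
            print "%util close to 100%: device may be saturated"
        else if (svctm > 0 && await / svctm >= 5)
            print "await much higher than svctm: I/O requests are queuing"
        else
            print "no obvious disk bottleneck"
    }'
}

# Usage, with values read off a `sar -d 1 10` line:
#   disk_hint 80 4 60
```

Passing the three numbers as arguments avoids depending on column positions, which differ between sysstat versions.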

Linux-iostat

iostat 1 10

The format is as follows: iostat interval count

The output fields are interpreted as follows:
Average percent = total CPU% / number of CPU cores

%user: the average percentage of CPU used by user processes.
%nice: the average percentage of CPU used by user processes whose priority was changed with the nice command. In simple terms, the nice command changes the priority of a process.
%system: the average percentage of CPU used by the kernel.
%iowait: the average percentage of CPU time spent waiting for I/O.
%irq: the percentage of CPU used to handle system interrupts.
%steal: the percentage of time a virtual CPU waits involuntarily while the hypervisor services another virtual processor.
%idle: the percentage of time the CPU is idle.
Blk_read/s: the number of blocks read per second.
Blk_wrtn/s: the number of blocks written per second.
Blk_read: the total number of blocks read.
Blk_wrtn: the total number of blocks written.
The values of Blk_read/s and Blk_wrtn/s give a basic picture of disk read and write performance. If Blk_wrtn/s is large, the disk is being written to very frequently, and you can consider optimizing the disk or the program. If Blk_read/s is large, there are many direct disk reads, which can be reduced by keeping the data being read in memory.
There is no fixed normal value for these two fields; they vary with the application. One rule still holds: sustained, very large volumes of reads and writes are not normal and will affect system performance.

Linux-mpstat

mpstat -P ALL 1 10

The format is as follows: mpstat -P ALL interval count

It shows various statistics about the CPUs in the system. The -P ALL option makes the command display statistics for every CPU.

%user: the percentage of CPU used by user processes.
%nice: the percentage of CPU used by user processes whose priority was changed with the nice command. In simple terms, the nice command changes the priority of a process.
%system: the percentage of CPU used by the kernel.
%iowait: the percentage of CPU time spent waiting for I/O.
%irq: the percentage of CPU used to handle system interrupts.
%steal: the percentage of time a virtual CPU waits involuntarily while the hypervisor services another virtual processor.
%idle: the percentage of time the CPU is idle.
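To make "unbalanced" concrete, the sketch below averages the busy percentage (100 minus %idle) of each CPU over the samples and reports the gap between the busiest and the idlest core. The function name cpu_imbalance is hypothetical; the CPU and %idle columns are located from the header row, and the Average: rows at the end of the output are ignored.

```shell
# Hypothetical helper: read `mpstat -P ALL interval count` output on stdin
# and report the spread of per-CPU busy percentages.
cpu_imbalance() {
    awk '
    /%idle/ {                            # header row: locate the columns
        for (i = 1; i <= NF; i++) {
            if ($i == "CPU")   cpu_col = i
            if ($i == "%idle") idle_col = i
        }
        next
    }
    cpu_col && $1 != "Average:" && $cpu_col ~ /^[0-9]+$/ {
        sum[$cpu_col] += 100 - $idle_col # busy % for this CPU in this sample
        cnt[$cpu_col]++
    }
    END {
        for (k in sum) {
            avg = sum[k] / cnt[k]
            if (!seen || avg > max) max = avg
            if (!seen || avg < min) min = avg
            seen = 1
        }
        if (seen)
            printf "max_busy=%.1f min_busy=%.1f spread=%.1f\n", max, min, max - min
    }'
}

# Usage:
#   mpstat -P ALL 1 10 | cpu_imbalance
```

A large spread with one core near 100% busy usually points to a single hot thread or to interrupts bound to one core, which is worth following up with top and Shift + H.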

Linux-dstat

dstat -tcdnmlpygs 1 10

Format: dstat -tcdnmlpygs interval count

dstat is a multi-function system resource statistics tool.

The individual dstat parameters are not listed here, because the fields have already been described under the other commands above.
The information it provides is similar to a combination of tools such as top, free, iostat, and vmstat.

CPU status: CPU usage
Disk statistics: read and write operations on disk; this column shows the total amount of disk reads and writes.
Network statistics: data sent and received by network devices; this column shows the total amount of data received and sent.
Paging statistics: the paging activity of the system
System statistics: interrupts (int) and context switches (csw)
Memory statistics: used physical memory, buffers, cache, and free memory

Linux-iftop

iftop

iftop displays a list of network bandwidth usage for all source and destination hosts on the system, and the list is updated periodically.

Explanation of some iftop keys

Host display
n: toggle host name resolution, to show either domain names or IP addresses;
s/d: toggle display of the source host or the destination host;
t: cycle through display modes for received and sent traffic: two lines, one line, sent traffic only, received traffic only;
Port display
N: toggle between service name and port number, for example ssh or 22;
S/D: toggle display of the source port or the destination port;
p: toggle whether ports are displayed;
Sorting
1/2/3: sort by average traffic over the last 2 seconds, 10 seconds, or 40 seconds;
<: sort by source host address;
>: sort by destination host address;
o: freeze the current order, so the traffic of current connections can be observed without the order changing on refresh;
The bottom section shows:
TX: sent traffic
RX: received traffic
TOTAL: total traffic
cum: cumulative traffic from the start of iftop to the present
peak: peak traffic
rates: average traffic over the past 2s, 10s, and 40s

Linux-df & du: view disk space usage, mounts, and inode status

du: view the disk space used by files and directories

Show the space used by each subdirectory of the current directory: du -h --max-depth=1
Sort subdirectories by size (descending): du -s * | sort -rn | cut -f2- | xargs -d "\n" du -sh
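For the df side of this section, a small filter can point out mounts that are nearly full. The sketch below is an illustration with assumptions: the function name full_mounts and the default limit of 90 are made up, and it expects the usage percentage in the fifth column and the mount point in the sixth, as printed by df -P. On systems where df -i prints the inode usage percentage in the same position, the same filter should also work for inode exhaustion.

```shell
# Hypothetical helper: full_mounts [LIMIT]
# Reads df output on stdin and prints mount points whose usage percentage
# (fifth column) is at or above LIMIT (default 90, an arbitrary choice).
full_mounts() {
    awk -v limit="${1:-90}" 'NR > 1 {
        pct = $5
        sub(/%$/, "", pct)               # "96%" -> "96"
        if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0)
            print $6, pct "%"
    }'
}

# Usage:
#   df -P | full_mounts 90     # space
#   df -i | full_mounts 90     # inodes, if the column layout matches
```

Once a full mount is identified, the du commands above narrow the search down to the directory that is taking the space.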

Summary of some command usage for Linux diagnostics and debugging

Count the client connections to a local listening port, grouped by client and sorted in descending order
lsof -i:9092 | awk '{print $9}' | cut -d ">" -f2 | awk '{cidx=index($1,":"); print substr($1,1,cidx-1)}' | sort | uniq -c | sort -rn

Case 1: A large number of network connections causes the service process to hang or appear dead

When one or several machines are deployed as a basic service component or middleware node (not necessarily written in Java; it could be Python, Go, C/C++, and so on), a large number of clients that depend on the service connect to the cluster. In the real case I encountered, the cluster had been running for a long time when one day a node was suddenly found to be unusable. Its application log contained no error messages, yet the service was not responding properly to external requests.
I went through the aspects listed earlier with the relevant commands (CPU, memory, disk I/O, network I/O, network traffic). While checking CPU, I/O, network traffic, and disk capacity, I found that the number of connections on the listening port was huge, which pointed to the problem. Further analysis showed that one client host had created a large number of network connections and never released them. In general, the maximum number of file descriptors on a Linux system can be very large, over one million, so most other resources are exhausted first.
The commands below were run in an environment simulating the situation at the time.
netstat -an | grep 9094

lsof -i:9094 | awk '{print $9}' | cut -d ">" -f2 | awk '{cidx=index($1,":"); print substr($1,1,cidx-1)}' | sort | uniq -c | sort -rn

Clearly, the host mafka.sankuai.com has created a large number of connections, exhausting resources.

Note: netstat generally shows IP addresses, while lsof generally shows host names. With many client connections (each TCP socket appears to occupy about 500 bytes), it is hard to see at a glance which client host has the most, so combining the two commands is more convenient.
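The same per-client count can also be taken from the netstat side, which groups by IP address rather than host name. The following is a minimal sketch under assumptions: the function name clients_by_port is hypothetical, and it expects the usual Linux netstat -an column layout with IPv4 style address:port pairs.

```shell
# Hypothetical helper: clients_by_port PORT
# Reads `netstat -an` output on stdin and counts ESTABLISHED connections to
# the given local port per remote IP, busiest client first.
clients_by_port() {
    awk -v port="$1" '
    $1 ~ /^tcp/ && $6 == "ESTABLISHED" {
        n = split($4, loc, ":")          # local address; port is the last piece
        if (loc[n] != port) next
        peer = $5
        sub(/:[0-9]+$/, "", peer)        # strip the remote port
        count[peer]++
    }
    END { for (p in count) print count[p], p }' | sort -rn
}

# Usage:
#   netstat -an | clients_by_port 9094
```

Comparing its output with the lsof pipeline makes it easy to map the busiest IP address back to a host name.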

Extended reading:
The case above is a problem with network connections (network file descriptors). In other cases, a program bug may leave a huge number of disk files (descriptors) open without releasing them.
A typical example is a distributed file system that stores pictures or small files.

Recursively find all open files under a directory
lsof +D /dir/
In the lsof output, the TYPE values REG and DIR denote open disk files and directories respectively
lsof -p [PID] | grep "REG" (lists all disk files the process has open)
Then, by counting and analyzing the results, you can tell whether a large number of disk files are not being released, and go on to locate the problematic block of code.

Case 2: An online Java service process suddenly shows very high CPU usage

Sort the threads of the Java process by CPU consumption
ps p 31476 -L -o pcpu,pid,tid,time,tname,stat,psr | gawk '{printf("%s %d %d %s %s %s %x\n", $1,$2,$3,$4,$5,$6,$3)}' | sort -n -k1 -r

If a thread in the resulting list shows very high CPU usage, for example about 80%, the command above gives its thread ID, here 31510, which converts to the hexadecimal nid 0x7b16.
jstack 31476 | grep "7b16"

jstack 31476
This narrows down the scope of the code: trace the corresponding call stack and source, then analyze the relevant logic.
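The conversion from the decimal thread ID to the hexadecimal nid can be done in the shell with printf. The helper name tid_to_nid below is hypothetical; the PID 31476 and thread ID 31510 are the values from this case.

```shell
# Hypothetical helper: tid_to_nid TID
# Prints the thread ID in the 0x-prefixed hexadecimal form that appears
# after "nid=" in jstack output.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Usage, with the values from this case:
#   jstack 31476 | grep -A 20 "nid=$(tid_to_nid 31510)"
```

Using grep -A prints a number of lines after the match, so the stack frames of the busy thread are shown together with its header line.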

Case 3: A process cannot create files when it starts

If you encounter this error, you can first check the health of the hard disk (a disk fault is usually not very likely, unless the machine is a database or storage server).
For hard disk checks with smartctl, see: "Detecting the health status of hard disks on Linux" and
"Testing of Linux hard disks".
If the disk is healthy, the likely cause is that the mounted disk has filled up.
Run df -h to view disk space usage.
Then go to the deployment directory or data directory of the application and run:
du -s * | sort -rn | cut -f2- | xargs -d "\n" du -sh
This shows which directories take up the most space, and you can then deal with them as appropriate.

Summary

The above covers the problems I have encountered in back-end diagnosis and debugging on Linux, what I learned from them, and how I solved them. My summary is as follows:
Back-end diagnosis and debugging on Linux is complex and challenging work. It requires operators or engineers to combine various techniques to find the problem, narrowing the scope step by step until it is pinned down to a single point. Different applications behave differently; for example, the problems of an ordinary Java (Tomcat) service differ from those of a memcached distributed cache. First, at the system level, we can distinguish whether the cause lies in the operating system, in the local service program, or in access from other hosts. Then we analyze where the bottleneck lies through six aspects: network I/O, disk I/O, memory, CPU, open files, and threads. There is no single command or method that cures everything; engineers need to choose wisely among different techniques for each environment and combine a variety of commands to get the job done.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
