[Linux] Frequently used commands for fault diagnosis
1. last command
Function Description: lists information related to users who have logged on to the system in the past.
Syntax: last [-adRx] [-f <record file>] [-n <number of lines>] [account name...] [terminal number...]
Note: When run without arguments, last reads the wtmp file in the /var/log directory and displays every login recorded in that file.
Parameters:
-a displays the host name or IP address from which the user logged on in the last column.
-d converts the IP address to the host name.
-f <record file> specifies the record file.
-n <number of lines> or -<number of lines> sets the number of lines displayed in the list.
-R does not display the host name or IP address used to log on to the system.
-x displays information such as system shutdown, reboot, and run level changes.
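As a rough illustration of working with this output, the sketch below counts logins per user. The sample text is hypothetical and only imitates the usual layout of last; real output varies between systems, and the function name is an assumption made for this example.

```python
from collections import Counter

# Hypothetical sample imitating `last` output; not taken from a real system.
SAMPLE_LAST = """\
root     pts/0        192.168.1.10     Mon Aug  6 10:01   still logged in
oracle   pts/1        192.168.1.11     Mon Aug  6 09:40 - 09:55  (00:15)
root     pts/0        192.168.1.10     Sun Aug  5 22:13 - 23:02  (00:48)
reboot   system boot  2.6.18-238.el5   Sun Aug  5 22:10          (11:51)

wtmp begins Wed Aug  1 08:00:12 2007
"""


def logins_per_user(text):
    """Count login records per account name in `last`-style output."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the trailing "wtmp begins" line, and pseudo-users.
        if not fields or fields[0] in ("wtmp", "reboot", "shutdown"):
            continue
        counts[fields[0]] += 1
    return counts


print(dict(logins_per_user(SAMPLE_LAST)))
```

On a live system the text could come from running last itself, for example through subprocess.run(["last"], capture_output=True, text=True).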
2. /var/log/messages system error log: almost all system startup errors are recorded here
The messages log is the core system log file. It contains the boot messages written when the system starts and other status messages written while the system is running. I/O errors, network errors, and other system errors are recorded in this file. Other events, such as a user switching identity to root, are also listed here. If a service such as a DHCP server is running, you can observe its activity in the messages file. In general, /var/log/messages is the first file to check when troubleshooting.
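A first pass over the messages file often just picks out lines that look like errors. The sketch below does this on a hypothetical syslog-style sample; the keyword list and the sample lines are assumptions for illustration and should be adapted to what your system actually logs.

```python
import re

# Hypothetical syslog-style lines; real /var/log/messages content differs.
SAMPLE_MESSAGES = """\
Aug  6 10:01:23 jp kernel: end_request: I/O error, dev sda, sector 123456
Aug  6 10:01:24 jp dhcpd: DHCPACK on 192.168.1.50 to 00:11:22:33:44:55 via eth0
Aug  6 10:02:00 jp su: pam_unix(su:session): session opened for user root by oracle(uid=500)
Aug  6 10:03:10 jp kernel: eth0: link down
"""

# Keywords chosen for this example only.
SUSPICIOUS = re.compile(r"error|fail|link down", re.IGNORECASE)


def suspicious_lines(text):
    """Return the lines that match any of the example keywords."""
    return [line for line in text.splitlines() if SUSPICIOUS.search(line)]


for line in suspicious_lines(SAMPLE_MESSAGES):
    print(line)
```

To scan the real log, read /var/log/messages (root permission is usually required) and pass its contents to the same function.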
3. lsb_release
-v, --version
Show version information
-i, --id
Display the distributor ID
-d, --description
Display the description of the release
-r, --release
Display the release version of the current system
-c, --codename
Display the release codename
-a, --all
Show all of the above information
-h, --help
Show help information
[root@jp ~]# lsb_release -a
LSB Version: :core-4.0-ia32:core-4.0-noarch:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: EnterpriseEnterpriseServer
Description: Enterprise Linux Server release 5.6 (Carthage)
Release: 5.6
Codename: Carthage
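If a script needs these values, the key/value layout is easy to split. The following is a minimal sketch that parses the fields shown above; the function name is an assumption made for this example.

```python
# Fields copied from the lsb_release output shown above.
SAMPLE_LSB = """\
Distributor ID: EnterpriseEnterpriseServer
Description: Enterprise Linux Server release 5.6 (Carthage)
Release: 5.6
Codename: Carthage
"""


def parse_lsb_release(text):
    """Split each 'Key: value' line at the first colon."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


info = parse_lsb_release(SAMPLE_LSB)
print(info["Description"])
```

Splitting at the first colon only keeps values that themselves contain colons, such as the LSB Version line, intact.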
4. df -h
View the usage of each mounted disk
[root@jp lsb-release.d]# df -h
Filesystem  Size   Used  Avail  Use%  Mounted on
/dev/sda3   92G    20G   67G    23%   /
/dev/sda1   1.9G   42M   1.8G   3%    /boot
tmpfs       1014M  0     1014M  0%    /dev/shm
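For routine checks it can help to flag only the filesystems above a usage threshold. The sketch below does that on the df output shown above; the function name and the threshold values are assumptions chosen for illustration.

```python
# The df -h output shown above, with columns separated by whitespace.
SAMPLE_DF = """\
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 92G 20G 67G 23% /
/dev/sda1 1.9G 42M 1.8G 3% /boot
tmpfs 1014M 0 1014M 0% /dev/shm
"""


def filesystems_over(text, threshold_percent):
    """Return (mount point, use%) for filesystems at or above the threshold."""
    over = []
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore wrapped or malformed lines
        use = int(fields[4].rstrip("%"))
        if use >= threshold_percent:
            over.append((fields[5], use))
    return over


print(filesystems_over(SAMPLE_DF, 20))
```

Note that df wraps long device names onto a second line on some systems; this sketch simply skips such lines rather than rejoining them.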
5. top command
The basic view of top (top view 01) is used here to explain the meaning of each value.
Line 1:
10:01:23 - current system time
126 days, 14:29 - the system has been running for 126 days, 14 hours, and 29 minutes (no restart during this period)
2 users - two users are currently logged on to the system
load average: 1.15, 1.42, 1.44 - the three numbers are the load averages over the last 1 minute, 5 minutes, and 15 minutes respectively.
The load average is computed by sampling the number of active processes every five seconds and applying a specific algorithm. Divide this number by the number of logical CPUs; the system is overloaded when the result is higher than 5.
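The per-CPU rule of thumb above can be written down in a few lines. This is a minimal sketch: the limit of 5 comes from the text, while the count of 4 logical CPUs is an assumption, since the source does not say how many CPUs the server has.

```python
def load_per_cpu(load_averages, logical_cpus):
    """Divide each load average by the number of logical CPUs."""
    return [load / logical_cpus for load in load_averages]


def is_overloaded(load_averages, logical_cpus, limit=5.0):
    """Apply the rule of thumb from the text: overloaded above `limit` per CPU."""
    return any(value > limit for value in load_per_cpu(load_averages, logical_cpus))


# Load averages from the top output above; 4 logical CPUs is an assumption.
print(load_per_cpu((1.15, 1.42, 1.44), 4))
print(is_overloaded((1.15, 1.42, 1.44), 4))
```

On a live Linux system, os.getloadavg() and os.cpu_count() from the standard library can supply the two inputs.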
Line 2:
Tasks - tasks (processes). The system has a total of 183 processes, of which 1 is running, 182 are sleeping, 0 are stopped, and 0 are zombies.
Line 3: CPU status
6.7% us - percentage of CPU used by user space
0.4% sy - percentage of CPU used by kernel space
0.0% ni - percentage of CPU used by processes whose priority has been changed
92.9% id - percentage of idle CPU
0.0% wa - percentage of CPU spent waiting for I/O
0.0% hi - percentage of CPU used by hardware interrupts (hardware IRQ)
0.0% si - percentage of CPU used by software interrupts
CPU usage here is reported differently from Windows. If user space and kernel space are unfamiliar concepts, it is worth reading up on them first.
Line 4: memory status
8306544k total - total physical memory (8 GB)
7775876k used - total memory in use (7.7 GB)
530668k free - total free memory (530 MB)
79236k buffers - memory used as buffers (79 MB)
Line 5: swap partition
2031608k total - total swap space (2 GB)
2556k used - swap space in use (2.5 MB)
2029052k free - free swap space (2 GB)
4231276k cached - memory used as cache (4 GB)
Note that these figures cannot be read with the Windows concept of memory. By Windows standards this server would be "in danger": out of 8 GB of memory only about 530 MB is free. Linux memory management has its own peculiarities, and its complexity would take a book to explain. This is only a brief introduction to how it differs from the traditional (Windows) view.
In line 4, used is the amount of memory currently under the control of the kernel, and free is the amount of memory the kernel has not yet brought under its control. Memory managed by the kernel is not necessarily all in use; it includes memory that was used in the past and can be reused, but the kernel does not return such reusable memory to free. As a result, free memory keeps shrinking on Linux, and this is nothing to worry about.
If you want to estimate available memory out of habit, there is an approximate formula: free in line 4 + buffers in line 4 + cached in line 5. By this formula, the available memory of this server is 530668 + 79236 + 4231276 = 4841180 k, roughly 4.7 GB.
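The approximate formula is simple enough to check by hand, but a short sketch makes the arithmetic explicit. The function name is an assumption for this example; the three inputs are the values from the top output above.

```python
def approx_available_kb(free_kb, buffers_kb, cached_kb):
    """Approximate available memory as free + buffers + cached (all in kB)."""
    return free_kb + buffers_kb + cached_kb


# Values taken from lines 4 and 5 of the top output above.
total_kb = approx_available_kb(530668, 79236, 4231276)
print(total_kb, "kB, about", total_kb // 1024, "MB")
```

This is only the rough estimate described in the text, not an exact figure for how much memory a new program could obtain.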
For memory monitoring, watch the used value of the swap partition in line 5 of top. If this value keeps changing, the kernel is constantly exchanging data between memory and swap, and that is when memory is genuinely insufficient.
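One way to act on this is to compare the swap used value across several samples. The sketch below is a hypothetical helper, and the sample lists other than the 2556 k reading are made up for illustration.

```python
def swap_is_churning(used_samples_kb):
    """Return True if swap 'used' changed between any two consecutive samples."""
    pairs = zip(used_samples_kb, used_samples_kb[1:])
    return any(before != after for before, after in pairs)


# A steady reading like the 2556 k above is fine; the second list is invented.
print(swap_is_churning([2556, 2556, 2556]))
print(swap_is_churning([2556, 3010, 2800]))
```

A nonzero but steady value only means something was swapped out earlier; it is the continuing change that signals memory pressure.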
Line 6 is empty
Line 7: status of each process (task)
PID - process ID
USER - process owner
PR - process priority
NI - nice value. A negative value means a higher priority, and a positive value means a lower priority.
VIRT - total virtual memory used by the process, in kB. VIRT = SWAP + RES
RES - physical memory used by the process that has not been swapped out, in kB. RES = CODE + DATA
SHR - shared memory size, in kB
S - process status. D = uninterruptible sleep, R = running, S = sleeping, T = traced/stopped, Z = zombie
%CPU - percentage of CPU time used since the last update
%MEM - percentage of physical memory used by the process
TIME+ - total CPU time used by the process, in units of 1/100 second
COMMAND - process name (command name/command line)
6. iostat
iostat outputs statistics related to CPU and disk I/O.
Command format:
iostat [-c | -d] [-k | -m] [-t] [-V] [-x] [device [...] | ALL] [-p [device | ALL]]
[interval [count]]
1) simple use of iostat
iostat displays load information for the CPU and the I/O system, as well as partition status information.
Running iostat with no arguments displays the following:
# iostat
Linux 2.6.9-8.11.EVAL (ts3-150.ts.cn.tlan) 08/08/2007
avg-cpu:  %user  %nice  %sys  %iowait  %idle
          12.01  0.00   2.15  2.30     83.54
Device:  tps   Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
hda      7.13  200.12      34.73       640119    111076
The output fields have the following meanings:
avg-cpu section:
%user: percentage of CPU used while running at the user level.
%nice: percentage of CPU used while running at the user level with nice priority.
%sys: percentage of CPU used while running at the system (kernel) level.
%iowait: percentage of time the CPU was idle while waiting for hardware I/O.
%idle: percentage of time the CPU was idle.
Device section:
tps: number of I/O requests issued to the device per second.
Blk_read/s: number of blocks read per second.
Blk_wrtn/s: number of blocks written per second.
Blk_read: total number of blocks read.
Blk_wrtn: total number of blocks written.
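To use these device figures in a script, the Device section can be turned into a dictionary keyed by device name. This is a minimal sketch based on the column layout shown above; newer sysstat versions print different columns, and the function name is an assumption.

```python
# Device section from the iostat output shown above.
SAMPLE_IOSTAT = """\
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
hda 7.13 200.12 34.73 640119 111076
"""


def parse_device_rows(text):
    """Map each device name to a dict of its column values (as floats)."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header = lines[0][1:]  # drop the leading "Device:" label
    rows = {}
    for fields in lines[1:]:
        rows[fields[0]] = dict(zip(header, map(float, fields[1:])))
    return rows


stats = parse_device_rows(SAMPLE_IOSTAT)
print(stats["hda"]["Blk_read/s"])
```

Because the header is read from the output itself, the same function works if columns are added, as long as the first column is the device name.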
2) iostat parameter description
Parameters of iostat:
-c displays only CPU statistics. It is mutually exclusive with the -d option.
-d displays only disk statistics. It is mutually exclusive with the -c option.
-k displays the per-second disk statistics in kilobytes. The default unit is blocks.
-p device | ALL
This parameter is mutually exclusive with the -x option. It displays statistics for block devices and their partitions. You can specify a device name after -p, for example:
# iostat -p hda
or display all devices:
# iostat -p ALL
-t prints the collection time with each report.
-V prints the version number and help information.
-x outputs extended statistics.
3) description of iostat output fields
Blk_read
Total number of blocks read.
Blk_wrtn
Total number of blocks written.
kB_read/s
Amount of data read from the drive per second, in kilobytes.
kB_wrtn/s
Amount of data written to the drive per second, in kilobytes.
kB_read
Total amount of data read, in kilobytes.
kB_wrtn
Total amount of data written, in kilobytes.
rrqm/s
Number of read requests merged per second that were queued to the device.
wrqm/s
Number of write requests merged per second that were queued to the device.
r/s
Number of read requests issued to the device per second.
w/s
Number of write requests issued to the device per second.
rsec/s
Number of sectors read from the device per second.
wsec/s
Number of sectors written to the device per second.
rkB/s
Amount of data read from the device per second, in kilobytes.
wkB/s
Amount of data written to the device per second, in kilobytes.
avgrq-sz
Average size of the requests issued to the device, in sectors.
avgqu-sz
Average queue length of the requests issued to the device.
await
Average time for I/O requests to be served, including the time spent waiting in the queue and the time spent servicing them, in milliseconds.
svctm
Average service time of the I/O requests issued to the device, in milliseconds.
%util
Percentage of CPU time during which I/O requests were issued to the device. It shows the bandwidth utilization of the device.
When the value is close to 100%, the device is saturated.
4) iostat examples
# iostat
Displays one statistics report covering all CPUs and devices.
# iostat -d 2
Displays device statistics every 2 seconds.
# iostat -d 2 6
Displays device statistics every 2 seconds, 6 reports in total.
# iostat -x hda hdb 2 6
Displays extended statistics for the hda and hdb devices every 2 seconds, 6 reports in total.
# iostat -p sda 2 6
Displays statistics for sda and all of its partitions every 2 seconds, 6 reports in total.
7. vmstat
vmstat is one of the most common Linux/Unix monitoring tools. It displays server status values at a given interval, including CPU usage, memory usage, virtual memory swapping, and I/O reads and writes. It is my favorite Linux/Unix command for two reasons: it is supported on both Linux and Unix, and, compared with top, it shows the CPU, memory, and I/O usage of the whole machine rather than the CPU and memory usage of each process (the use cases differ).
vmstat is generally run with two numeric parameters. The first is the sampling interval in seconds and the second is the number of samples, for example:
root@ubuntu:~# vmstat 2 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 3498472 315836 3819540    0    0     0     1    2    0  0  0 100  0
Here 2 means that the server status is sampled every two seconds, and 1 means that it is sampled only once.
In practice we usually monitor for a period of time rather than stopping vmstat right away, for example:
root@ubuntu:~# vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 3499840 315836 3819660    0    0     0     1    2    0  0  0 100  0
 0  0      0 3499584 315836 3819660    0    0     0     0   88  158  0  0 100  0
 0  0      0 3499708 315836 3819660    0    0     0     2   86  162  0  0 100  0
 0  0      0 3499708 315836 3819660    0    0     0    10   81  151  0  0 100  0
 1  0      0 3499732 315836 3819660    0    0     0     2   83  154  0  0 100  0
This means that vmstat samples every 2 seconds and keeps going until I end the program. Here I ended it after five samples.
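When vmstat runs for a while, it is convenient to turn each sample into a dictionary keyed by column name. The sketch below assumes the standard 16-column layout of vmstat on Linux; the data row is illustrative and the function name is an assumption for this example.

```python
# Illustrative sample in vmstat's usual layout (banner, column names, data rows).
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 3499732 315836 3819660    0    0     0     2   83  154  0  0 100  0
"""


def parse_vmstat(text):
    """Return one dict per data row, keyed by the names in the second line."""
    lines = text.strip().splitlines()
    names = lines[1].split()
    return [dict(zip(names, map(int, line.split()))) for line in lines[2:]]


rows = parse_vmstat(SAMPLE_VMSTAT)
print(rows[0]["free"], rows[0]["id"])
```

Long runs of vmstat repeat the two header lines periodically, so a real parser would need to skip those repeats; this sketch does not.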
That completes the introduction to the command. The meaning of each field is explained below.
r: the run queue (the number of processes actually assigned to a CPU). The server I tested is fairly idle, so almost nothing is running. When this value exceeds the number of CPUs, a CPU bottleneck may occur. It is also related to the load shown by top: generally a load above 3 is relatively high, above 5 is high, and above 10 is abnormal and the server is in a dangerous state. The top load is roughly the run queue averaged per second. If the run queue is too large, the CPU is very busy, which usually results in high CPU usage.
b: the number of blocked processes. This needs no further explanation.
swpd: the amount of virtual memory (swap) in use. If it is greater than 0, the machine is short of physical memory. Unless a memory leak in a program is the cause, you should add memory or migrate memory-hungry tasks to other machines.
free: the amount of free physical memory. My machine has 8 GB of memory in total, with 3415 MB free.
buff: memory that the Linux/Unix system uses to buffer things such as directory contents and permissions. On my machine it takes a little over 300 MB.
cache: memory used to cache the files we open. On my machine it takes about 3.6 GB (3819660 k in the output above). This is where Linux/Unix is clever: it uses part of the idle physical memory to cache files and directories, which improves program performance. When a program needs memory, the buffer and cache memory is quickly reclaimed for it.
si: the amount of virtual memory read in from disk per second. If this value is greater than 0, physical memory is insufficient or memory is leaking; find the memory-hungry process and deal with it. My machine has plenty of memory, so everything is normal.
so: the amount of virtual memory written out to disk per second. If this value is greater than 0, the same applies as above.
bi: the number of blocks received from block devices per second (reads). Block devices here means all disks and other block devices in the system. The default block size is 1024 bytes. There is no I/O on this machine, so the value stays at 0, but on a machine processing a large amount of data (2-3 TB) I have seen it reach 140000/s, a disk throughput of roughly 140 MB per second.
bo: the number of blocks sent to block devices per second (writes). For example, when we write files, bo is greater than 0. bi and bo should generally be close to 0; otherwise I/O is too frequent and needs tuning.
in: the number of CPU interrupts per second, including clock interrupts.
cs: the number of context switches per second. For example, calling a system function requires a context switch, as do thread switches and process switches. The smaller this value the better; if it is too large, consider reducing the number of threads or processes. For example, with web servers such as apache and nginx, performance tests usually run thousands or even tens of thousands of concurrent requests; the number of web server processes or threads can be lowered step by step from its peak until cs falls to a relatively small value, and that process or thread count is then a suitable setting. The same goes for system calls: every call to a system function takes our code into kernel space and causes a context switch, which is expensive, so frequent calls to system functions should be avoided. Too many context switches mean that much of the CPU is wasted on switching, leaving less time for real work, so the CPU is not used effectively.
us: user CPU time. On a server that performed encryption and decryption very frequently, I saw us close to 100 and the r run queue reach 80 (the machine was under a stress test and performing poorly).
sy: system CPU time. If it is too high, system calls are taking a long time, for example because of frequent I/O operations.
id: idle CPU time. Generally, id + us + sy = 100. I think of id as the idle CPU percentage, us as the user CPU percentage, and sy as the system CPU percentage.
wa: CPU time spent waiting for I/O.
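Two of the rules of thumb above (run queue larger than the CPU count, and nonzero si or so) can be checked mechanically. The sketch below is a hypothetical helper that takes one vmstat sample as a dictionary; the CPU counts and the second sample are made up for illustration.

```python
def vmstat_warnings(row, cpu_count):
    """Return warnings for one vmstat sample, using rules of thumb from the text."""
    warnings = []
    if row["r"] > cpu_count:
        warnings.append("run queue exceeds CPU count")
    if row["si"] > 0 or row["so"] > 0:
        warnings.append("swapping in progress")
    return warnings


# An idle sample like the ones above, and an invented busy sample.
idle = {"r": 1, "si": 0, "so": 0}
busy = {"r": 80, "si": 12, "so": 0}
print(vmstat_warnings(idle, cpu_count=4))
print(vmstat_warnings(busy, cpu_count=8))
```

A single sample can be misleading, so in practice it makes sense to look for warnings that persist across several consecutive samples.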
The grep command can be used to count how many times an error string occurs in a log:
grep "CheckResource error for ora.zhdb1.vip" crsd.log | wc -l
17864
grep WARNING ocssd.log
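The same count as grep piped to wc -l can be sketched in a few lines when the check has to live inside a script. The sample lines below are hypothetical; only the search string is taken from the command above.

```python
# Hypothetical log lines; only the search string comes from the grep command above.
SAMPLE_LOG_LINES = [
    "2007-08-08 10:01:23: CheckResource error for ora.zhdb1.vip",
    "2007-08-08 10:01:24: resource check completed",
    "2007-08-08 10:02:23: CheckResource error for ora.zhdb1.vip",
]


def count_matching_lines(lines, needle):
    """Count the lines that contain `needle`, like `grep needle | wc -l`."""
    return sum(1 for line in lines if needle in line)


print(count_matching_lines(SAMPLE_LOG_LINES, "CheckResource error for ora.zhdb1.vip"))
```

An open file object can be passed as `lines` directly, so a large log is read line by line rather than loaded into memory at once.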