[Linux] When a tricky problem needs to be located, how to help developers narrow down the search

Source: Internet
Author: User

Preface: some time ago a friend told me about an interview she went through at a well-known company; by the end she felt as if she had lost a layer of skin. The interviewer's questions were not trick questions. They built on each other step by step, somewhat like the Byzantine Generals Problem, except that there every assumption is unreliable, whereas here nothing was known at all: the fault had no pattern, it was serious, and nobody knew what triggered it. How do you assist developers in locating and analyzing such a tricky problem? Thinking it over, the approach is to first rule out the server as a cause, then peel away the layers one by one to narrow the range of suspects, and help R&D find the culprit as far as possible. Of course, there may be more than one culprit.

The problem is this: one day a problem is found that crashes the program, but it does not occur every time. The product is in the critical period just before launch, so there is no time to reproduce it slowly. R&D cannot see the cause, yet the problem keeps appearing and the user experience is very poor, so it must be located quickly. As a tester, how do you assist development in locating the problem and finding its source? (That was roughly the question; I forget the details. In short: the problem is intermittent, serious, and irregular, and it must be reproduced.) In my view, specific problems need specific treatment. When a problem appears and the situation is confusing, do not panic. We may not be able to see the problem directly, but we can narrow it down by elimination, and the first 10 minutes are the most important and critical.

I. Describe the causes and consequences of the problem as thoroughly as possible

Do not rush in when a problem appears. As the saying goes, sharpening the axe does not delay the woodcutting. First, establish what is already known about the server's current state and what information is available to support the analysis.

There are a few things to be clear about:

    • When was the fault discovered?
    • Can the fault be reproduced?
    • How does the fault manifest itself (no response, error, downtime, etc.)?
    • Is there any regularity in when the fault occurs (time of day, resource usage, trigger)?
    • What was the last update to the platform (code, server, container, PCK, etc.)?
    • Which users are affected by the fault (a specific group, or random users)?
    • What programs are running when the fault occurs?
    • Is a monitoring platform available (any monitoring platform will do)?
    • Is there data to check (server logs, container logs, application logs)?
    • What is the server's resource usage (memory, swap, CPU, network I/O, disk I/O, processes, load, etc.)?
    • Is there any spare memory? Is the server swapping between memory and disk?
    • Where does the server's peak load come from? What is the average load?
    • How many cores does the server have? Are some CPU cores overloaded?
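
The load and swap questions above can be answered with a couple of small helpers. The following is only a minimal sketch, assuming a Linux host with awk available; the function names `load_per_core` and `swap_used_kb` are hypothetical and not part of any standard tool.

```shell
#!/bin/sh
# Hypothetical helpers for a first look at load and swap.

# load_per_core LOAD CORES: print the load average divided by the core count.
# A value well above 1.00 suggests the CPUs are oversubscribed.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# swap_used_kb: read /proc/meminfo-style text on stdin and print
# SwapTotal - SwapFree in kB (0 means no swap is in use).
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END { print total - free }'
}

# Example usage on a live host (not run here):
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
#   swap_used_kb < /proc/meminfo
```

Both helpers take their input as arguments or on stdin, so they can also be run against values copied from a monitoring platform rather than the live server.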

II. Who was on the server when the problem occurred

# who ---- See who is logged in

# last -x ---- View user login history

# lastlog ---- A quick look at the last login time of every user

It is important to see who is on the server and what they are doing. A user running a debugger may be causing the problem, or several users may be updating the program at the same time.

III. What has been done before

# history ---- View the commands that have been executed on the server

# cat /home/username/.bash_history ---- View the commands that a particular user has run on the server

If you need to find out who ran a command and when, you can add a timestamp and the user name to the history output:

echo 'export HISTTIMEFORMAT="%F %T `whoami` "' >> /etc/profile
Then use history to view the commands.
If you need to find files that have been modified or accessed, you can use:
find /xx -mtime -2    files modified within the last two days
find /xx -atime -1    files accessed within the last day
find /xx -mmin +60    files modified more than 60 minutes ago
find /xx -amin +30    files last accessed more than 30 minutes ago
xx stands for the directory you want to search. You can also search from / directly, but the query is large and will consume server resources.
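
To keep such a search from touching more of the disk than necessary, the find options can be wrapped in a small function. This is a minimal sketch; `files_modified_within` is a hypothetical helper name, and the directory and time window are whatever fits the incident.

```shell
#!/bin/sh
# files_modified_within DIR MINUTES: list regular files under DIR that were
# modified less than MINUTES minutes ago.
# -xdev keeps find on a single file system, which limits the cost of the scan;
# -type f skips directories, whose mtime changes whenever an entry is added.
files_modified_within() {
    find "$1" -xdev -type f -mmin "-$2"
}

# Example usage (not run here):
#   files_modified_within /home 120
```

A narrow starting directory and a short window keep the output small enough to compare against the time the fault was first seen.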

Check what was done before and after the problem appeared and whether there were any high-risk operations; this can quickly rule out some unnecessary trouble.

IV. What processes are currently running

# ps aux ---- Show all running processes

# pstree -a ---- Show running processes as a tree

# top ---- View processes and related resource information

By looking at what is running and which users are running it, you can see which programs are having problems, and break the problem down into modules based on the relationships in the business logic.
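
When the process list is long, it helps to pull out the busiest entries first. The sketch below assumes the input comes from `ps -eo pcpu,pid,user,comm`; `top_cpu` is a hypothetical helper name.

```shell
#!/bin/sh
# top_cpu N: read `ps -eo pcpu,pid,user,comm` output on stdin, drop the
# header line, and print the N processes with the highest CPU percentage.
top_cpu() {
    awk 'NR > 1' | sort -k1,1 -rn | head -n "$1"
}

# Example usage (not run here):
#   ps -eo pcpu,pid,user,comm | top_cpu 5
```

Because the user name is part of each line, the result also shows who owns the heaviest processes.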

V. Hardware

# lspci ---- Show all PCI buses in the system and the devices connected to them

# dmidecode ---- Output all hardware information of the server

# ethtool ---- Check the network card: is it configured? Is it running in half duplex? Is the speed 10 Mbps? Are there any TX/RX errors?

Ruling out hardware is very important: network cards, RAID cards, backup batteries, memory modules, and hard disks can all cause failures.

VI. Mount points and file systems

# du -sh ---- Disk space used by the files in a directory

# df -h ---- Disk space usage per file system (compare with the previous command: if a file has been deleted but a process still holds its handle, the space is not released; it is counted by this command but not by the previous one)

# mount ---- See how many file systems are mounted (file system type, permissions, mount mode)

# pvs ---- View physical volume information

# vgs ---- View volume group information

# lvs ---- View logical volume information

# cat /etc/fstab ---- View file system mount configuration

How many file systems are mounted? Is there a dedicated file system for particular services? Is disk space running out? Does the file system type support resizing online?
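
The "running out of disk space" question can be checked mechanically. The sketch below assumes the POSIX output format of `df -P` and mount points without spaces; `full_filesystems` is a hypothetical helper name.

```shell
#!/bin/sh
# full_filesystems THRESHOLD: read `df -P` output on stdin and print the
# mount point and use% of every file system at or above THRESHOLD percent.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)          # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Example usage (not run here):
#   df -P | full_filesystems 90
```

If df reports a nearly full file system but du cannot account for the space, deleted files that are still held open are a likely cause; where lsof is installed, `lsof +L1` lists open files whose link count is zero.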

VII. Network, kernel, and interrupts

# sysctl -a | grep ...
# cat /proc/interr
# cat /proc/net/ip_conntrack    (may take some time on busy servers)
# netstat
# ss -s
Are interrupt requests distributed evenly across the CPUs, or is one core overloaded by a large number of network or RAID interrupts?
What is the swappiness setting? A value that suits a workstation is bad for a server: it is best never to let a server swap, otherwise processes stall while their pages are read from and written to disk.
Is conntrack_max large enough to handle your server's traffic?
What are the TCP timeout settings for the different connection states (TIME_WAIT, ...)?
netstat can be slow when listing all existing connections, so look at the overall picture with ss first.
You can also read up on Linux TCP tuning to understand some of the key points of network performance tuning.
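
For the connection-state and conntrack questions, a summary is usually more useful than the raw listing. This is a minimal sketch: `count_tcp_states` and `conntrack_pct` are hypothetical helper names, and the `/proc/sys/net/netfilter/` paths in the usage comment are an assumption that holds on newer kernels with the nf_conntrack module loaded (older kernels expose the ip_conntrack names instead).

```shell
#!/bin/sh
# count_tcp_states: read `ss -tan` output on stdin and print how many
# sockets are in each state, most frequent first.
count_tcp_states() {
    awk 'NR > 1 { count[$1]++ }
         END { for (s in count) print count[s], s }' | sort -rn
}

# conntrack_pct COUNT MAX: print how full the connection tracking table is,
# as a whole-number percentage.
conntrack_pct() {
    awk -v c="$1" -v m="$2" 'BEGIN { printf "%d\n", c * 100 / m }'
}

# Example usage (not run here):
#   ss -tan | count_tcp_states
#   conntrack_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                 "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

A very large TIME-WAIT count or a conntrack table close to 100% are both signs that the network settings, rather than the application, deserve a closer look.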

VIII. System logs and kernel messages

# dmesg ---- Print or control the kernel ring buffer (boot and kernel messages)

# less /var/log/messages ---- Most errors that occur while the system boots and runs are recorded here

# less /var/log/secure ---- Records logins to the system and access to data

# less /var/log/auth.log ---- Records user authentication

# less /var/log/cron ---- Records the actions of child processes spawned by the crontab daemon crond

# less /var/log/syslog ---- Alert information produced while the environment is running

# less /var/log/xferlog ---- Records FTP sessions: the files users uploaded to or downloaded from the FTP server

Look for error messages of every kind: hardware errors, file system errors, connection errors, and so on. All of them can be matched against the time of the failure for analysis.
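
Matching log entries to the time of the failure is easier once the log is cut down to the relevant window. The sketch below assumes traditional syslog lines of the form "Mon DD HH:MM:SS host ..." that all come from the same day; `log_window` is a hypothetical helper name.

```shell
#!/bin/sh
# log_window START END: read syslog-style lines on stdin and print those
# whose time field (third column, HH:MM:SS) lies between START and END.
# Times are compared as strings, which works for zero-padded HH:MM:SS.
log_window() {
    awk -v start="$1" -v end="$2" '$3 >= start && $3 <= end'
}

# Example usage (not run here):
#   log_window 10:00:00 10:30:00 < /var/log/messages
```

On hosts that log to the systemd journal, `journalctl --since` and `--until` provide the same kind of time filter without parsing.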

IX. Scheduled tasks

# ps -ax | grep cron ---- See whether the cron daemon is running

# for u in `cat /etc/passwd | cut -d":" -f1`; do crontab -l -u $u; done ---- View the scheduled tasks of every user in the system

# crontab -l ---- List the current user's scheduled tasks

The main things to check are whether any scheduled task runs too frequently, whether some users have submitted hidden scheduled tasks, and whether a scheduled task was executing at the time of the failure.

X. Application logs

What to check here depends on the environment. For example, ours is Linux + Apache (Tomcat, WildFly) + Nginx + MySQL + PHP.

    • Apache (Tomcat, WildFly) ---- Container access and error logs
    • MySQL ---- Look for error messages directly in mysql.log
    • PHP-FPM ---- View its logs directly
    • Nginx ---- Check whether the configuration is incomplete in a way that lets users attack the server directly, and check the Nginx logs for useful error messages
    • HAProxy ---- View the health of the load-balanced backends, queue sizes, etc.

By examining the deployment environment and the containers, divide the problem into modules and judge each one in turn.
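
As one illustration of checking a web server log, server-side errors can be counted per status code. This sketch assumes the access log uses the common "combined" format, in which the status code is the ninth whitespace-separated field; `count_5xx` is a hypothetical helper name, and the log path in the usage comment is only an example.

```shell
#!/bin/sh
# count_5xx: read access-log lines on stdin and print each 5xx status code
# together with the number of requests that returned it.
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { count[$9]++ }
         END { for (code in count) print code, count[code] }' | sort
}

# Example usage (not run here; adjust the path to your deployment):
#   count_5xx < /var/log/nginx/access.log
```

A burst of 502 or 504 responses points at the backend behind the proxy, while 500 responses point at the application itself, which helps decide which module's log to read next.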

Summary

After 10 minutes the overall situation should be clear: where the problem occurs and what it is related to. Is it caused by resources? Or by an improperly indexed database, too many processes, memory overflow, stack errors ...? From there the analysis can be targeted. Once environmental causes are ruled out, the root cause may come into view.

Of course, while gathering data we can also use our own professional perspective to make educated guesses and retry: go through the business logic and abnormal operations, list the likely trigger points, and help development locate and reproduce the problem as soon as possible. Working on both fronts at once is always more efficient.
