This article introduces fault location techniques on Red Hat Linux, with examples. Fault location falls into two categories, online and offline, and each is described in detail below.
1. Classification of fault location (debugging) scenarios
For the purposes of this discussion, software fault location on Linux is divided into two cases.
(1) Online fault location
Online fault location (online debugging) applies when the operating system environment in which the fault occurred is still accessible after the failure. The person handling the fault can log in to the operating system through the console, SSH, and so on, and run commands or test programs in the shell to observe, analyze, and test the failed environment in order to locate the cause of the failure.
(2) Offline fault location
Offline fault location (offline debugging) applies when the operating system environment can no longer be accessed normally after the failure, but all or part of the system state at the time of the fault has been collected, either by a mechanism built into the system or by one configured in advance. The person handling the fault locates the cause by analyzing the collected state information.
2. Application process failure scenarios and how to handle them
Application process failures generally do not affect the normal operation of the operating system environment (if a bug in application code can crash or hang the kernel, the kernel itself has a defect), so they can be analyzed flexibly with online fault location. Application failures fall into the following scenarios:
(1) Abnormal process termination
Many users believe that an abnormally terminated process cannot be analyzed, but abnormal termination always leaves traces. Every abnormal termination is carried out by the kernel delivering a signal to a particular process or process group. The signals can be grouped into several types:
- SIGKILL. SIGKILL is the most special case, because the signal cannot be caught and it does not cause the terminated process to produce a core file. However, if the SIGKILL was actually sent by the kernel, the kernel records it in the kernel log (dmesg). The kernel uses SIGKILL in only a few places, such as oom_kill_process(), so the cause is not hard to find by reading the dmesg log and checking the kernel code that sends SIGKILL.
- SIGQUIT, SIGILL, SIGABRT, SIGBUS, SIGFPE, SIGSEGV. By default these signals terminate the process and generate a core file, so the user can use the stack trace in the core to locate the code that triggered the signal. SIGQUIT and SIGABRT are generally raised by user code itself, and well-written code will usually log when it does so. SIGILL, SIGBUS, SIGFPE, and SIGSEGV are generated by the kernel, and searching the kernel source makes it easy to list where each one is used. SIGILL indicates an illegal instruction, possibly because generated floating-point code is corrupted or because the physical memory holding the text segment is corrupted. SIGBUS can be caused by a hardware error reported through MCE. SIGSEGV is mostly caused by corrupted pointer variables in the application code. When an application's heap or stack memory is being corrupted, run it under Valgrind, which can often point directly at the code responsible.
- SIGINT, SIGPIPE, SIGALRM, SIGTERM. By default these signals terminate the process but do not produce a core file. For these signals, users are advised to install a handler that records the context in which the problem occurred. SIGPIPE is easy to overlook: many programs use select() or poll() to watch only the read and write descriptors and not the exception descriptors, so after the peer closes the TCP connection they keep writing to the socket, which raises SIGPIPE.
- Termination caused by poorly behaved cooperating processes. For example, among a group of cooperating processes, A sends SIGKILL to B without logging it, or B checks some condition and calls exit() directly, again without logging. When the application code base is large, this can be hard to locate by reading code. SystemTap offers a good solution: write user-level probes that trace each process's use of signal-related system calls, exit(), and similar calls.
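The signal groups above can be checked from a parent process that launches the failing program. The following is a minimal sketch, assuming Linux default signal dispositions; the names CORE_SIGNALS, describe_exit, and run_and_describe are illustrative and not part of any tool mentioned in this article. It relies on the documented behavior of Python's subprocess module, where a negative return code -N means the child was terminated by signal N.

```python
import signal
import subprocess
import sys

# Signals from the list above whose default action is terminate plus core dump.
# Whether a core file is really written also depends on limits such as ulimit -c.
CORE_SIGNALS = {
    signal.SIGQUIT, signal.SIGILL, signal.SIGABRT,
    signal.SIGBUS, signal.SIGFPE, signal.SIGSEGV,
}


def describe_exit(returncode):
    """Turn a subprocess return code into a readable termination summary."""
    if returncode >= 0:
        return f"exited normally with status {returncode}"
    try:
        sig = signal.Signals(-returncode)
    except ValueError:
        # For example a real-time signal with no symbolic name in this module.
        return f"killed by signal {-returncode}"
    kind = "core-dumping" if sig in CORE_SIGNALS else "non-core"
    return f"killed by {sig.name} ({kind} signal)"


def run_and_describe(argv):
    """Run a command, wait for it, and describe how it ended."""
    completed = subprocess.run(argv)
    return describe_exit(completed.returncode)


if __name__ == "__main__":
    # Hypothetical child that terminates itself with SIGTERM.
    child = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGTERM)"]
    print(run_and_describe(child))
```

A supervisor that logs this summary for every child gives a first hint about which of the groups above applies, before turning to dmesg, the core file, or SystemTap.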
(2) The process is blocked and the application does not make progress
Blocking is a normal state for a single process, but it is abnormal for a multi-process application as a whole to stop making progress. When the application cannot move forward, something that one process depends on to advance has gone wrong, and the other processes that depend on that process are left waiting. Analyzing this situation requires understanding the dependencies between processes or events and the flow of data through them. First, attach to each process with gdb -p and use the backtrace command to find the execution path on which it is blocked, which shows where each process sits in its state machine.
In general, if only the state of each process is considered, the processes may appear to form a ring of mutual dependency, for example (P1 request => P2 processing => P2 response => P1 requests again => P2 processing => P2 responds again). An application, however, usually handles its workload as transactions or sessions, and each transaction has a start point and an end point. Use strace, tcpdump, and similar tools together with the application's own execution log to observe where the transaction currently being processed has stalled, and from that work out why all the state machines are blocked. There are several possible reasons: a problem at the remote end the application communicates with, a problem with the backend database or directory, or a process or thread in the application that is blocked in an abnormal location or has terminated outright and no longer does its work.
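The backtrace step can be scripted so that every process of the application is captured in one pass. This is a minimal sketch, assuming gdb is installed and the caller is allowed to attach to the target processes (typically the same user or root). The helper names gdb_backtrace_command and collect_backtraces are hypothetical; the gdb options used (-batch, -p, -ex) and the gdb command thread apply all bt are standard.

```python
import shutil
import subprocess


def gdb_backtrace_command(pid):
    """Build a non-interactive gdb command that prints every thread's stack."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    # -batch: exit after running the commands; -p: attach to the process;
    # -ex: run one gdb command after attaching.
    return ["gdb", "-batch", "-p", str(pid), "-ex", "thread apply all bt"]


def collect_backtraces(pids):
    """Return {pid: gdb output} for each process ID in pids."""
    if shutil.which("gdb") is None:
        raise RuntimeError("gdb is not installed or not on PATH")
    results = {}
    for pid in pids:
        completed = subprocess.run(
            gdb_backtrace_command(pid), capture_output=True, text=True
        )
        results[pid] = completed.stdout
    return results
```

Calling collect_backtraces on the process IDs of the stalled application yields one text dump per process, which can then be compared side by side to see which process every other one is waiting on.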
(3) The user processes form a deadlock
When user processes deadlock and there is no memory fault, the problem lies entirely in the application's own logic. The deadlocked processes or threads form a cycle because each holds a lock that another is waiting for. When this happens, the backtrace from gdb -p shows directly that the deadlocked processes are all blocked in futex() or another lock-related system call. The paths that lead to futex() may come from lock primitives such as a mutex, semaphore, or condition variable. By reading the code along each call trace, you can determine every lock the process may have held when it reached that point. Modifying the program to break the deadlock cycle then resolves the problem.
Note that a memory fault can also cause a false deadlock. For example, a physical memory failure can set a lock variable directly to -1, so that every process using the lock blocks. If the memory corruption is caused by a bug in the code, the Valgrind checkers can find it. If the corruption is caused by faulty physical memory, hardware support is needed: on high-end machines that support MCE, a physical memory fault raises an exception or a report directly, but on a low-end PC server there is no option other than running the memtest tool to detect it.
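Once the call traces show which lock each thread holds and which lock it is blocked on, finding the cycle is mechanical. The following is a minimal sketch under the simplifying assumption that each lock has a single holder and each thread waits on at most one lock; the function name and the two dictionaries are illustrative, filled in by hand from the gdb output rather than produced by any tool.

```python
def find_deadlock_cycle(holds, waits_for):
    """Find a cycle in a wait-for graph.

    holds:     dict mapping lock name -> thread that currently holds it
    waits_for: dict mapping thread -> lock name it is blocked on
    Returns the list of threads forming a cycle, or None if there is none.
    """
    for start in waits_for:
        path = []
        seen = set()
        current = start
        # Follow thread -> lock it waits on -> holder of that lock.
        while current is not None and current not in seen:
            seen.add(current)
            path.append(current)
            lock = waits_for.get(current)
            current = holds.get(lock) if lock is not None else None
        if current is not None:
            # We returned to a thread already on the path: that closes a cycle.
            return path[path.index(current):]
    return None


if __name__ == "__main__":
    # Hypothetical state read from two backtraces:
    # T1 holds lock A and waits on B, T2 holds lock B and waits on A.
    holds = {"A": "T1", "B": "T2"}
    waits_for = {"T1": "B", "T2": "A"}
    print(find_deadlock_cycle(holds, waits_for))
```

The threads returned are the ones whose lock acquisition order needs to change in order to break the cycle.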
(4) The process stays in the 'D' (uninterruptible) state for a long time and cannot exit
This is mostly caused by a fault in the kernel. On many execution paths the kernel puts the process into the 'D' state so that a critical path cannot be interrupted by an external signal, which would leave kernel data structures in an inconsistent state. Normally a process does not stay in 'D' for long, because the condition that ends the state (a timer firing, an I/O operation completing, and so on) soon wakes it up. When a process does stay in 'D' for a long time, the key is to find the code location where it is blocked. The SysRq t function prints the kernel stack of every sleeping process in the system, including those in the 'D' state; it is triggered with echo t > /proc/sysrq-trigger. Once the code location is known, it is usually possible to work out directly why the 'D' state cannot be left, for example an I/O read that cannot complete because of a hardware or NFS failure.
The cause of a stuck 'D' state can also be more complex. For example, leaving 'D' may depend on the value of a variable, and that value may have been permanently corrupted for some reason.
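Before dumping kernel stacks with SysRq, it helps to know which processes are actually in the 'D' state. This is a minimal sketch that scans the Linux /proc filesystem, assuming the usual /proc/<pid>/stat layout of "pid (comm) state ..."; the function names parse_stat and find_uninterruptible are illustrative.

```python
import os


def parse_stat(line):
    """Extract (pid, command name, state) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' in the line.
    left, right = line.rsplit(")", 1)
    pid_text, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_text), comm, state


def find_uninterruptible(proc_root="/proc"):
    """Return [(pid, comm)] for every process currently in the 'D' state."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except OSError:
            continue  # the process exited or is not readable; skip it
        if state == "D":
            results.append((pid, comm))
    return results


if __name__ == "__main__":
    for pid, comm in find_uninterruptible():
        print(pid, comm)
```

Running the scan several times a few seconds apart separates processes that pass through 'D' briefly from those that are genuinely stuck and worth looking up in the SysRq output.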
Red Hat Linux fault location technology: details and examples (1)