Fault Locating Technology on Red Hat Linux

Tags: systemtap, valgrind, dmesg

1. Debugging scenario classification


For ease of description, software fault locating (debugging) scenarios on Linux fall into two categories:

A) Online Fault Locating

Online debugging means that when a fault occurs, the operating system environment where the fault resides can still be accessed. Fault handlers can log on to the operating system through the console, ssh, or other means and run commands or test programs in a shell to observe, analyze, and test the faulty environment in order to locate the cause of the fault.

B) Offline Fault Locating

Offline debugging means that when a fault occurs, the operating system environment where the fault resides can no longer be accessed normally. However, at the time of the fault, all or part of the system state has already been collected by the system itself, either through a built-in mechanism or through prior configuration. Fault handlers can then analyze the collected state information to locate the cause of the fault.


2. Application process fault scenarios and handling

Application process faults generally do not affect the normal use of the operating system environment (if an application code bug causes the kernel to crash or hang, that is a kernel vulnerability), so the online fault locating approach can be used for flexible analysis. Application code faults typically fall into the following situations:


A) Abnormal process termination

Many users think that abnormal process termination cannot be analyzed, but it is in fact traceable. All abnormal process terminations are carried out by the kernel sending a signal to the specific process or process group. They can be divided into several types:

- SIGKILL. SIGKILL is the most special case, because this signal cannot be caught and a process killed by SIGKILL does not generate a core file. However, if the SIGKILL was actually issued by the kernel, the kernel will certainly record the event in dmesg. In addition, there are only a few places in the kernel that send SIGKILL, such as oom_kill_process(), so by combining the dmesg record with an analysis of the kernel code that sends SIGKILL, it is usually not difficult to find the cause (example commands are sketched after this list).

- SIGQUIT, SIGILL, SIGABRT, SIGBUS, SIGFPE, SIGSEGV. With the default disposition of these signals, the process is terminated and a core file is generated, and the stack trace in the core lets the user locate the code that raised the termination signal directly. In addition, SIGQUIT and SIGABRT are generally raised by the user code itself, and well-written code logs them. SIGILL, SIGBUS, SIGFPE, and SIGSEGV are all generated inside the kernel, and it is not difficult to search the kernel source code and list the locations where they are raised. For example, SIGILL means an illegal instruction: the generated code (for example, code produced for floating-point operations) may have been corrupted, or the text area may have been damaged by physical memory corruption. SIGBUS is often caused by MCE faults. SIGSEGV is mostly caused by corrupted pointer variables in application code. If the heap or stack memory of an application has been corrupted, profiling the application with the valgrind tool can generally find the code that causes the corruption directly.

- SIGINT, SIGPIPE, SIGALRM, SIGTERM. Under the default disposition, these signals terminate the process but do not generate a core file. It is recommended to define a handler for them that records the context of the problem. SIGPIPE is the one most easily overlooked: many user programs only monitor read/write descriptors with select() or poll() and not exception conditions, so when the TCP peer closes the connection and the program keeps writing to the socket, SIGPIPE is raised.

- Termination caused by a cooperating process. For example, among a group of cooperating processes, A sends SIGKILL to B without logging it, or B decides on some condition to call exit() directly, again without logging. When the amount of application code is large, locating this kind of behavior by reading the code can be difficult. SystemTap provides a better way to solve this problem: write user-level probes to track how the processes use system calls such as signal() and exit().
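
The commands below are a minimal sketch of the techniques mentioned in the list above. The application path, core file name, and pid are placeholders, and the SystemTap one-liner assumes the standard signal tapset (probe signal.send) shipped with RHEL5/RHEL6:

#> dmesg | grep -i 'killed process'                          # look for SIGKILLs issued by the OOM killer
#> ulimit -c unlimited                                       # allow core files before reproducing the fault
#> gdb /path/to/app core.12345                               # then run "bt" to see the terminating stack
#> valgrind --tool=memcheck --leak-check=full /path/to/app   # look for heap/stack corruption in user code
#> stap -e 'probe signal.send {
       if (sig_name == "SIGKILL")
           printf("%s(%d) sent SIGKILL to %s(%d)\n",
                  execname(), pid(), pid_name, sig_pid)
   }'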


B) The process is blocked and the application cannot make normal progress

This situation is normal for an individual blocked process, but abnormal for an application made up of multiple processes. If the application cannot make progress, one of its processes has hit a problem and the other processes that depend on it are forced to wait. In this case you need to analyze the dependencies between processes (or events) and the data processing flow. First, use the backtrace function of gdb -p on each blocked process to find where it is blocked and determine the position of each process in its state machine. Looking only at the static state of each process, the processes usually appear to form a dependency loop, for example (P1 sends a request => P2 processes it => P2 sends a response => P1 sends the next request => P2 processes it => P2 responds again). However, the workload an application handles is generally organized as transactions or sessions, and each transaction has a start point and an end point. We therefore need to use tools such as strace and tcpdump, together with the application's execution logs, to observe the transaction currently being processed, find where it is blocked, and thus find out why all the state machines have stopped. There can be several reasons for the state machines to stop: a problem at the remote end the application communicates with, a problem in a back-end database or directory service, or a process or thread of the application being blocked at an abnormal place or terminated outright so that it no longer works.
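
A minimal sketch of the observation commands mentioned above; the pid (4321), the peer address, and the output file names are placeholders:

#> gdb -p 4321
(gdb) thread apply all bt
(gdb) detach
#> strace -f -tt -p 4321 -o strace.out             # record which system call, if any, the process is stuck in
#> tcpdump -i any -w blocked.pcap host 10.0.0.2    # capture the traffic toward the suspect peer

The gdb back traces show where each thread is blocked; strace and tcpdump then show whether the transaction is waiting on a local system call or on the remote end.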


C) User processes form deadlocks

When user processes form a deadlock and no memory corruption is involved, it is purely a logic problem in the application itself: the deadlocked processes or threads form a cycle because each holds a lock that another is waiting for. In this case, the backtrace function of gdb -p shows directly that the processes are all blocked in futex() and other lock-related system calls; the paths that lead to futex() may be mutexes, semaphores, condition variables, and other locking primitives. By analyzing the code along the call trace, you can determine exactly which locks each process may already hold when it reaches that location, and then modify the program so that the lock cycle can no longer form, which solves the problem.
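
A hedged sketch of what such an investigation typically looks like under gdb; the pid, thread and frame numbers, and the mutex variable name are placeholders, and the __data.__owner field assumes the glibc/NPTL layout of pthread_mutex_t (it holds the TID of the current owner):

#> gdb -p 4321
(gdb) thread apply all bt
(gdb) thread 2
(gdb) frame 3
(gdb) print my_mutex.__data.__owner
(gdb) detach

The back traces show every deadlocked thread sitting in futex()/__lll_lock_wait(); selecting the frame that called pthread_mutex_lock() and printing the owner of the mutex it waits on tells you which thread holds that lock, so the wait cycle can be reconstructed.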

Note that memory faults can also lead to a false deadlock. For example, a physical memory fault can directly change a lock variable's value to -1, so every process that takes the lock will block. If the corruption is caused by a code bug, the valgrind tool can be used to check the program and find it. If it is caused by a physical memory fault, however, hardware support is required: high-end PC servers with MCE can raise an exception or a report directly when physical memory goes bad, while on low-end PC servers there is no way to detect it other than running a tool such as memtest.


D) The process stays in the 'D' (uninterruptible) state for a long time and cannot exit it

This is mostly caused by a kernel fault. On many execution paths the kernel puts a process into the 'D' state to ensure that a critical path is not interrupted by external signals, which would otherwise leave kernel data structures in an inconsistent state. Normally a process does not stay in the 'D' state for long, because the condition that ends the state (a timer firing, an I/O operation completing, and so on) soon wakes it up. When a process stays in 'D' for a long time, the key is to find the code location where it is blocked. The 't' function of sysrq can directly print the kernel stacks of all sleeping processes in the system, for example echo 't' > /proc/sysrq-trigger, and this output includes the kernel-mode stacks of processes in the 'D' state. Once the code location is found, you can analyze directly why the 'D' state cannot be exited, for example an I/O read that cannot complete because of a disk or NFS failure.
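
A small sketch of the procedure, assuming the standard procfs interfaces (the exact ps column widths are only illustrative):

#> ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'    # list 'D'-state processes and the kernel symbol they sleep in
#> echo t > /proc/sysrq-trigger                     # dump the kernel stacks of all sleeping tasks
#> dmesg | less                                     # search the output for the tasks reported in state D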

The reasons behind the 'D' state can be complicated. For example, exiting the 'D' state may depend on the value of a variable, and that value may have been permanently corrupted for some reason.


3. Kernel fault scenarios and handling


A) Kernel panic

Panic is the kernel's most direct fault report: when it panics, the kernel has decided that the fault has made it impossible for the operating system to keep running normally. When a panic occurs, Linux disables interrupts and process scheduling on all CPUs, so the system stops responding; if a graphical interface is running, no panic information is visible on the screen. When a machine stops responding and cannot even be pinged, it is usually a panic. At panic time the kernel prints the call stack of the code location that triggered the panic directly on the console. Traditionally, users connected a serial console to the machine to collect this output, which is obviously inconvenient. Modern Linux distributions such as RHEL5 and RHEL6 use kdump to collect panic information: when kdump is configured, the system uses kexec to load and switch to a new kernel (placed in a pre-allocated memory region) at panic time and saves all or part of the system's memory to disk or across the network.

After kdump is used to collect panic data, you can use the crash tool to directly view the code path of the panic.

Panics are generally quite direct: the panic stack usually reflects the cause of the bug directly, such as an MCE fault, an NMI fault, or a data structure allocation failure. Sometimes, however, the panic happens because the kernel actively detected an inconsistency in a key data structure, and it is not clear which code caused the inconsistency or when; multiple test runs and data capture with tools such as SystemTap may then be needed.


B) Deadlocks between kernel execution paths in a multi-processor environment

A kernel deadlock is different from a panic: when a deadlock occurs, the kernel does not actively halt itself. However, when a kernel deadlock occurs, the execution paths of two or more CPUs can no longer make progress in kernel mode; they block each other, each CPU spins at 100% utilization (on spin locks), and processes on those CPUs can no longer be scheduled, directly or indirectly. There are two kinds of kernel deadlock:

- Deadlocks involving the interrupt context. In this case, interrupts are blocked on at least one CPU, and the system may not even respond to ping. Because that CPU can no longer service interrupts, the periodic local APIC timer interrupt on it cannot work either. The NMI watchdog mechanism can detect this (it checks a counter variable maintained by the local APIC timer handler); the NMI watchdog can call panic() in its handler, so the user can collect memory data with kdump, analyze the call stacks of the deadlocked paths, and investigate the logical cause of the deadlock.

- Deadlocks not involving the interrupt context. In this case the deadlock occurs while interrupts still work on every CPU, and the system can respond to ping, so the NMI watchdog cannot be triggered in this way. In kernels before 2.6.16 there was no good way to deal with this situation. In the RHEL5 and RHEL6 kernels, a watchdog kernel thread is created for each CPU; when this kind of deadlock happens, the watchdog thread on the deadlocked CPU cannot be scheduled (even though it is a real-time process with the highest priority) and therefore cannot update its counter variable. The NMI watchdog interrupt on each CPU periodically checks the counter of its CPU and, if it finds no update, calls panic(); the user can then use kdump to collect memory data and analyze the call stacks on the deadlocked CPUs to find the logical cause of the deadlock. Whether the NMI watchdog is armed can be checked as shown below.
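
A quick way to confirm on a running system that the NMI watchdog is enabled and firing; the paths are those used by RHEL5/RHEL6 kernels and may differ on other versions:

#> cat /proc/sys/kernel/nmi_watchdog     # 1 means the NMI watchdog is enabled
#> grep NMI /proc/interrupts             # per-CPU NMI counts; they should keep increasing while the watchdog is armed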

C) Kernel oops or warning

Like panic, oops and warnings are exceptions actively reported by the kernel because it has detected an inconsistency. However, the problems behind an oops or a warning are far less serious than those behind a panic, so the kernel does not need to halt the system to handle them. When it generates an oops or a warning, the kernel usually records a considerable amount of information in dmesg; an oops in particular prints at least the call trace of the location where the fault occurred. An oops can also be turned into a panic/kdump for offline debugging simply by setting the panic_on_oops variable under /proc/sys/kernel to 1, as shown below.
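
Two equivalent ways to enable this, assuming a standard procfs/sysctl setup:

#> echo 1 > /proc/sys/kernel/panic_on_oops
#> sysctl -w kernel.panic_on_oops=1          # or add kernel.panic_on_oops = 1 to /etc/sysctl.conf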

There are many direct causes of oops and warnings, such as a segmentation fault in kernel mode or an incorrect counter value found in some kernel data structure. The segmentation fault or bad counter value in turn has a deeper cause, which usually cannot be seen from the kernel's dmesg output alone. SystemTap probing is the way to dig further: if a counter value is found to be wrong, use SystemTap to probe every piece of code that accesses that counter, record the accesses, and then analyze them.

Locating the root cause of an oops or a warning is much harder than locating an application's memory access faults, because in the kernel the allocation and use of data structures cannot be tracked the way valgrind can trace an application.


4. Other (hardware-related) faults

An automatic machine reboot is a common fault and is generally caused by hardware such as physical memory; software faults only lead to deadlocks or panics, and there is almost no code in the kernel that reboots the machine when it detects a problem. There is a "panic" parameter in the /proc/sys/kernel directory; if its value is not 0, the kernel reboots the machine that many seconds after a panic occurs. High-end PC servers today try to handle physical memory faults in software: for example, the "HWPoison" method of the MCA (Machine Check Architecture) isolates the faulty physical page and kills the process the faulty page belongs to; RHEL6 supports HWPoison. Machines without MCA capability do not generate an MCE exception when physical memory fails; the hardware mechanism simply reboots the machine.
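
The panic reboot behavior can be inspected and changed at run time; a small sketch assuming the standard procfs path:

#> cat /proc/sys/kernel/panic        # 0 means the machine stays halted after a panic
#> echo 10 > /proc/sys/kernel/panic  # reboot 10 seconds after a panic instead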


5. Debugging technologies on RHEL6


A) Kdump fault collection and crash analysis

Kdump is used to collect system memory information at kernel panic time; it can also be triggered online with the 'c' key of sysrq. Kdump uses an uncontaminated kernel to perform the dump, so it is more reliable than the older diskdump and lkcd approaches. With kdump, data can be dumped to a local disk or over the network, and the memory information to be collected can be filtered through the makedumpfile parameters, reducing the downtime kdump requires.
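
Triggering kdump by hand to verify the setup, assuming sysrq is enabled (this crashes the running kernel, so only do it on a test machine):

#> echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
#> echo c > /proc/sysrq-trigger       # force a crash; a correctly configured kdump then saves the vmcore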

Crash is the tool used to analyze the information collected by kdump; it is in fact a wrapper around gdb. When using crash, it is best to install the kernel-debuginfo package so that the symbol information in the kernel data collected by kdump can be resolved. How effectively crash can be used to locate a problem depends entirely on your understanding of, and ability to analyze, the kernel code.

Refer to "#> man kdump. conf", "#> man crash", "#> man makedumpfile" to learn how to use kdump and crash. Access the http://ftp.RedHat.com to download the debuginfo File


B) Using SystemTap to locate bugs

SystemTap is a probe-based locating tool that can probe specified locations in kernel or user code. When execution reaches a specified location, or specified data is accessed, the user-defined probe function runs automatically and can print the call stack, parameter values, variable values, and other information at that location. This flexibility in where probes can be placed is SystemTap's great strength. SystemTap probes cover the following:

- All system calls, and all function entry and exit points in the kernel and its modules

- Custom timer probe points

- Any specified code or data access location in the kernel

- Any code or data access location in a specific user process

- A number of probe points preconfigured for each functional subsystem, such as tcp, udp, nfs, and signal

SystemTap scripts are written in the stap script language; the script code calls the APIs provided by stap for statistics and data printing. For the APIs provided by stap, refer to "#> man stapfuncs". For more information on SystemTap's functions and usage, see "#> man stap" and "#> man stapprobes".
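
A minimal example script, assuming the standard syscall tapset shipped with RHEL5/RHEL6 (the probe name and its filename variable come from that tapset; the file name open_trace.stp is arbitrary):

#> cat open_trace.stp
probe syscall.open {
    # print who is opening which file
    printf("%s(%d) open %s\n", execname(), pid(), filename)
}
probe timer.s(10) { exit() }    # stop after 10 seconds
#> stap open_trace.stp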


C) ftrace

Ftrace is an event tracing mechanism built on the tracepoints infrastructure in the Linux kernel. Its purpose is to show clearly what the system or a process was doing over a period of time, such as the function call path or process switches. Ftrace can be used to observe the latencies of various parts of the system when optimizing real-time applications, and it can also help locate faults by recording kernel activity over a period of time. The following commands trace the function calls of a given process (replace xxx with its pid) for a period of time:

#> Echo "function">/sys/kernel/debug/tracing/current_tracer

#> Echo "xxx">/sys/kernel/debug/tracing/set_ftrace_pid

#> Echo 1>/sys/kernel/debug/tracing/tracing_enabled

Besides tracing function calls, ftrace can also trace process switches, wakeups, block device accesses, kernel data structure allocations, and other activities. Note that tracing is different from profiling: tracing records all activity within a period of time rather than statistics, and the buffer size can be set through buffer_size_kb under /sys/kernel/debug/tracing so that a longer period can be recorded.
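
A few commands that go with the steps above, assuming debugfs is mounted at /sys/kernel/debug as on RHEL6:

#> cat /sys/kernel/debug/tracing/available_tracers        # list the tracers compiled into this kernel
#> echo 8192 > /sys/kernel/debug/tracing/buffer_size_kb   # enlarge the per-CPU ring buffer to 8 MB
#> cat /sys/kernel/debug/tracing/trace | head -50         # read the recorded trace
#> echo 0 > /sys/kernel/debug/tracing/tracing_enabled     # stop tracing when done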

For details on using ftrace, refer to Documentation/trace in the kernel source code.


D) oprofile and perf

Both oprofile and perf are tools for profiling (sampling and statistics) the system; they are mainly used to solve performance problems of the system and of applications. Perf is more powerful and comprehensive, and its user-space tool is maintained and released together with the kernel source code, so users can take advantage of new perf kernel features in a timely manner. Perf is available only in RHEL6; RHEL5 does not have it. Both oprofile and perf use the hardware counters in modern CPUs for their statistics, but perf can also use the "software counters" and "tracepoints" defined in the kernel, so it can do much more. Oprofile samples using the CPU's NMI interrupt, while perf can use both the NMI interrupt and the periodic interrupts provided by the hardware counters. With perf it is easy to see how the execution time of a process or of the whole system is distributed, for example:

#> perf top -f 1000 -p <pid>

The "software counters" defined by the system and the "tracepoints" of each subsystem can also be used to analyze a subsystem, for example:

#> perf stat -a -e kmem:mm_page_alloc -e kmem:mm_page_free_direct -e kmem:mm_pagevec_free sleep 6

This measures the activity of the kmem subsystem for 6 seconds (it is actually implemented using the tracepoints provided by ftrace).
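
A couple of additional perf invocations that are often useful, sketched with a placeholder pid:

#> perf list                              # enumerate the hardware events, software counters, and tracepoints available
#> perf record -g -p <pid> sleep 10       # sample one process with call graphs for 10 seconds
#> perf report                            # browse the recorded profile interactively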

I think with perf, users do not need to use oprofile.

 

6. An example of locating a kernel fault with kdump


A) Deploy Kdump

To collect fault information, deploy kdump as follows:

1) Set the relevant kernel boot parameters

Add the following to the kernel line in /boot/grub/menu.lst:

crashkernel=128M@16M nmi_watchdog=1

The crashkernel parameter reserves memory for the kdump kernel; nmi_watchdog=1 activates the NMI watchdog, which is needed to make sure a panic is triggered when an interrupt-context deadlock occurs. Reboot the system so that the settings take effect.
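
For illustration only, a hypothetical menu.lst kernel line with these parameters appended; the kernel version and root device are assumptions, so adapt them to the existing entry:

kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00 crashkernel=128M@16M nmi_watchdog=1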

2) Set the related sysctl kernel parameters

Add the following line at the end of /etc/sysctl.conf:

kernel.softlockup_panic = 1

This setting ensures that when a soft lockup occurs, panic() is called, which in turn triggers the kdump action.

Run "#> sysctl -p" to make the setting take effect.

3) Configure /etc/kdump.conf

Add the following lines to /etc/kdump.conf:

ext3 /dev/sdb1

core_collector makedumpfile -c --message-level 7 -d 31 -i /mnt/vmcoreinfo

path /var/crash

default reboot

/dev/sdb1 is the file system used to hold the dump file; the dump file is placed under /var/crash on it, and the /var/crash directory must be created on /dev/sdb1 in advance. "-d 31" specifies the dump filtering level; this parameter matters when the dump partition cannot hold the full memory contents or when you do not want the dump to interrupt service for too long. The vmcoreinfo file is stored in the root directory of the /dev/sdb1 partition and is generated with the following command:

#> makedumpfile -g /vmcoreinfo -x /usr/lib/debug/lib/modules/2.6.18-128.el5.x86_64/vmlinux

The "vmlinux" file is provided by the kernel-debuginfo package. Before running the makedumpfile file, install the kernel-debuginfo and kernel-debuginfo-common packages of the corresponding kernel, the two packages need to be downloaded from http://ftp.redhat.com. "default reboot" is used to tell kdump. After collecting the dump information, restart the system.

4) Activate kdump

Run "#> service kdump start"; on success you will see that an initrd-2.6.18-128.el5.x86_64kdump.img file has been generated in the /boot directory. This file is the initrd of the kernel loaded by kdump; the dump information is collected in the environment booted from this initrd. Looking at the /etc/init.d/kdump script, you can see that the mkdumprd command is called to create the initrd used for dumping.
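
Two follow-up commands commonly used on RHEL5/RHEL6 to make the service persistent and to check it:

#> chkconfig kdump on       # start kdump automatically at boot
#> service kdump status     # confirm that the kdump service is operational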

 

 

B) Testing the effectiveness of the kdump deployment

To test the effectiveness of the kdump deployment, I wrote the following kernel module. Loading it with insmod creates a kernel thread that starts occupying 100% of a CPU after about 10 seconds, and kdump is triggered after about another 20 seconds. After the system restarts, check the /var/crash directory on the dump partition to confirm that the vmcore file has been generated.

zqfthread.c

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/sched.h>

MODULE_AUTHOR("frzhang@redhat.com");
MODULE_DESCRIPTION("A module to test ....");
MODULE_LICENSE("GPL");

static struct task_struct *zqf_thread;
static int zqfd_thread(void *data);

/* Sleep for roughly the first 10 seconds, then spin and occupy 100% of a CPU. */
static int zqfd_thread(void *data)
{
    int i = 0;

    while (!kthread_should_stop()) {
        i++;
        if (i < 10) {
            msleep_interruptible(1000);
            printk("%d seconds\n", i);
        }
        if (i == 1000)    /* running in the kernel; keep i bounded so the loop spins without sleeping */
            i = 11;
    }
    return 0;
}

static int __init zqfinit(void)
{
    struct task_struct *p;

    p = kthread_create(zqfd_thread, NULL, "%s", "zqfd");
    if (p) {
        zqf_thread = p;
        wake_up_process(zqf_thread);    /* actually start it up */
        return 0;
    }
    return -1;
}

static void __exit zqffini(void)
{
    kthread_stop(zqf_thread);
}

module_init(zqfinit);
module_exit(zqffini);


 
Makefile

obj-m += zqfthread.o

Building

#> make -C /usr/src/kernels/2.6.32-71.el6.x86_64 M=`pwd` modules
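
Loading and observing the module, assuming the file and thread names used in the listing above:

#> insmod zqfthread.ko
#> top        # after roughly 10 seconds the zqfd kernel thread spins at 100% CPU, and kdump fires shortly after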

 

 

C) Using the crash tool to analyze the vmcore file

The command line format for analyzing the vmcore with crash is shown below. After opening the vmcore with crash, we use the dmesg and bt commands to print the call trace of the problematic execution path, use dis to disassemble the code, and finally map the call trace back to the corresponding location in the C source code for logical analysis.

#> crash /usr/lib/debug/lib/modules/2.6.18-128.el5.x86_64/vmlinux /boot/System.map-2.6.18-128.el5.x86_64 ./vmcore
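
A few crash subcommands that are typically used once the session is open; the address argument is a placeholder:

crash> log                 # kernel ring buffer collected in the vmcore (same content as dmesg)
crash> bt                  # back trace of the task that was running when the panic was triggered
crash> ps                  # list of processes at the time of the crash
crash> dis -l <address>    # disassemble around a faulting address, with source lines if debuginfo is installed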
