Kernel module programming (10): viewing errors

Source: Internet
Author: User
Tags: dmesg

This article is also part five of my reading notes on Chapter 4, "Debugging Techniques", of "Linux Device Drivers", 3rd edition (LDD3), although it is not limited to the book's content.

During development, system faults while running the driver are unavoidable, but a fault does not necessarily mean a panic; Linux is quite robust. A fault in driver code usually just kills the process that was using the driver, and when a process dies the kernel calls the close operation for every device it had open, which releases the device. Even so, after an oops the system may remain in an abnormal state even if we unload the module, and a reboot is usually needed to recover. When an oops occurs, information is printed on the console that can be used to track down the bug. If our terminal is not the system console, the printk output is not displayed there; it can be viewed with dmesg instead.

For example, dereferencing an invalid pointer generates an oops message. We can add the following line to scull_write:

    *(int *)0 = 0; /* generates a NULL pointer error */

After loading the module, run a user program that performs a write (mine is called test). dmesg then reports the following error:

BUG: unable to handle kernel NULL pointer dereference at 00000000
IP: [] scull_write+0x24/0x260 [scull]
*pde = 2b2d3067 *pte = 00000000
Oops: 0002 [#1] SMP
Modules linked in: scull vfat usb_storage fuse sco bridge stp bnep l2cap bluetooth sunrpc ip6t_REJECT ip6table_filter ipv6 dm_multipath kvm_intel kvm uinput snd_seq fglrx(P) snd_pcm snd_timer snd_hwdep snd ppdev e1000e i2c_core soundcore serio_raw pcspkr dcdbas parport_pc iTCO_wdt ata_generic parport pata_acpi [last unloaded: microcode]

Pid: 17913, comm: test Tainted: P (2.6.27.5-117.fc10.i686 #1)
EIP: 0060:[] EFLAGS: 00010296 CPU: 0
EIP is at scull_write+0x24/0x260 [scull]
EAX: 00000040 EBX: 00000400 ECX: 000005dc EDX: b7f03000
ESI: eb270540 EDI: f8a3f376 EBP: f272ef74 ESP: f272ef18
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process test (pid: 17913, ti=f272e000 task=f25ac010 task.ti=f272e000)
Stack: f25ac348 00000000 f272ef38 c0429051 00000001 c0879c00 00000001 00000002
f8a41a20 f272ef54 000005dc b7f03000 00000000 eb270540 000005dc f272ef5c
00000040 f272ef74 c049041e 00000001 000005dc eb270540 f8a3f376 f272ef90
Call Trace:
[<c0429051>] ? finish_task_switch+0x2f/0xb0
[<c049041e>] ? rw_verify_area+0x76/0x97
[<f8a3f376>] ? scull_write+0x0/0x260 [scull]
[<c0490a70>] ? vfs_write+0x84/0xdf
[<c0490b64>] ? sys_write+0x3b/0x60
[<c0403c76>] ? syscall_call+0x7/0xb
[<c06a007b>] ? init_intel_cacheinfo+0x0/0x421
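The user program that triggers the write can be very small. The following is only a minimal sketch, not the author's actual test program: the function name write_to_device is hypothetical, and the device path is passed in by the caller because the article does not say which device node was used.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Open `path` write-only and write `len` bytes from `buf`.
 * Returns the number of bytes written, or -1 on error.
 * write_to_device is a hypothetical helper name, not part of scull.
 */
ssize_t write_to_device(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* With a working driver this enters the driver's write method. */
    ssize_t written = write(fd, buf, len);
    if (written < 0)
        perror("write");

    close(fd);
    return written;
}
```

A call such as write_to_device("/dev/scull0", "hello", 5) would exercise the driver, assuming the device node is named /dev/scull0 as in the LDD3 examples. With the faulty line in place, the process is killed inside write(), so the function never returns and the oops shows up in dmesg.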

We want to fix the bug. The key line in the example above is "EIP is at scull_write+0x24/0x260 [scull]". It says that the faulting instruction is in the scull_write function of the scull module. The two numbers help estimate where in the function the error occurred: 0x24 is the offset of the faulting instruction from the start of the function and 0x260 is the total length of the function, both measured in bytes of compiled code rather than in source lines. Here the offset is small, so the bug is near the beginning of the function. Not every bug can be located this precisely. The Call Trace shows the chain of calls that led to the fault and can be used to estimate the location, and the Stack dump lists the raw stack contents around the problem. Reading them takes some experience. For example, a value of 0xa5a5a5a5 usually indicates memory that was allocated but never initialized. On x86 with the default configuration, user-space addresses are below 0xc0000000, so in the example above f25ac348 must be a kernel-space address. I find it hard to analyze the stack this way; it is probably only a last resort, and there is no need to spend effort on it here.
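The reading of the "symbol+offset/length [module]" notation can be sketched in a few lines of user-space C. This is an illustrative helper, not a kernel or LDD3 API: the names oops_location, parse_oops_location and is_kernel_address_x86_32 are made up for this example, and the address check assumes 32-bit x86 with the default 3G/1G split mentioned above.

```c
#include <stdio.h>
#include <string.h>

/* Parsed form of a location such as "scull_write+0x24/0x260 [scull]". */
struct oops_location {
    char symbol[64];      /* function name */
    unsigned long offset; /* bytes from the start of the function */
    unsigned long size;   /* total length of the function in bytes */
    char module[64];      /* module name, empty for built-in code */
};

/*
 * Parse only the location part of the line (without "EIP is at").
 * Returns 0 on success, -1 if the text is not symbol+0xOFF/0xSIZE.
 */
int parse_oops_location(const char *text, struct oops_location *loc)
{
    memset(loc, 0, sizeof(*loc));

    /* %63[^+] reads the symbol up to '+'; %lx accepts the 0x prefix. */
    int n = sscanf(text, " %63[^+]+%lx/%lx [%63[^]]]",
                   loc->symbol, &loc->offset, &loc->size, loc->module);

    /* The module part is optional, so three fields are enough. */
    if (n < 3 || loc->size == 0 || loc->offset > loc->size)
        return -1;
    return 0;
}

/* Assumption: 32-bit x86, kernel space starts at 0xc0000000. */
int is_kernel_address_x86_32(unsigned long addr)
{
    return addr >= 0xc0000000UL;
}
```

Applied to the oops above, the parser yields symbol scull_write, offset 0x24, size 0x260 and module scull. The address check classifies f25ac348 from the stack dump as a kernel address, and b7f03000 (the value in EDX) as a user-space address.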

If the entire system hangs, no oops message is displayed. This happens, for example, when an endless loop in the driver stops the kernel from scheduling: on a multi-CPU system the other processors may still be able to schedule, but on a single-CPU system, where kernel preemption is disabled by default, scheduling stops completely. On a past project the development machine was in the lab and I accessed it remotely from my office, which was not even on the same floor. The system hung several times, so I had to keep walking over to restart the machine. The lab environment was harsh and unbearable, and I eventually had the machine moved. Still, there are ways to deal with hangs: either prevent them from occurring, or debug them afterwards.

To keep the whole system from hanging, you can add calls to schedule(). For example, calling schedule() inside an endless loop triggers scheduling and lets other processes get CPU time, which makes it possible to kill the looping process. If schedule() is also left in production code, keep in mind that several processes may then be using the driver at the same time, so shared data needs lock protection, and schedule() must never be called while holding a spinlock. The most direct way to find where a hang occurs is to add printk calls to the code to narrow down the location.

Sometimes the system only appears to be hung and can still schedule. In that case you can use the magic SysRq key (Alt-SysRq-X, where X is one of the keys below). For more information, see Documentation/sysrq.txt. The following introduction to these keys is from http://www.deansys.com/doc/ldd3/ch04s05.html#SystemHange.sect:

r
Turns off keyboard raw mode. Useful when a crashed application (such as the X server) has left your keyboard in a strange state.
k
Invokes the "secure attention key" (SAK) function. SAK kills all processes running on the current console and leaves you with a clean terminal.
s
Performs an emergency synchronization of all disks.
u
Umount. Attempts to remount all disks in read-only mode. This operation is usually invoked immediately after s and can save a lot of filesystem checking time when the system is in serious trouble.
b
Boot. Immediately reboots the system. Be sure to synchronize and remount the disks first.
p
Prints processor register information.
t
Prints the current task list.
m
Prints memory information.

Whether the SysRq key is enabled can be set through /proc/sys/kernel/sysrq. There is also a write-only file, /proc/sysrq-trigger, which is especially suitable for remote use. For example, running echo t > /proc/sysrq-trigger as root is equivalent to pressing the SysRq combination with X = t, and it works regardless of whether /proc/sys/kernel/sysrq is enabled. The resulting information appears in dmesg or on the console. SysRq p may point directly at the problem, because the registers it prints show where the processor is currently executing. We can also use the kernel's profiling function, but LDD3 covers it only briefly; for details, see Documentation/basic_profiling.txt. LDD3 also suggests a way to protect our disks: mount them read-only, or use NFS, so that the data on the disks cannot be damaged.
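The same two files can also be accessed from a small C helper, which may be convenient in a remote test harness. This is a sketch under stated assumptions: the function names sysrq_trigger and sysrq_setting are hypothetical, and the file paths are passed in as parameters so that the code can be tried on ordinary files before pointing it at /proc.

```c
#include <stdio.h>

/*
 * Write one SysRq command character to the trigger file.
 * On a real system trigger_path would be "/proc/sysrq-trigger",
 * and the caller must be root. Returns 0 on success, -1 on failure.
 */
int sysrq_trigger(const char *trigger_path, char key)
{
    FILE *f = fopen(trigger_path, "w");
    if (f == NULL)
        return -1;

    int ok = (fputc(key, f) != EOF);

    /* fclose flushes the buffer, so a write error can surface here. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Read the numeric SysRq setting, normally from
 * "/proc/sys/kernel/sysrq". A value of 0 means the keyboard
 * SysRq key is disabled. Returns -1 on failure.
 */
int sysrq_setting(const char *setting_path)
{
    FILE *f = fopen(setting_path, "r");
    int value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

For example, sysrq_trigger("/proc/sysrq-trigger", 't') run as root has the same effect as the echo command above and dumps the task list to the kernel log. Be careful with keys such as b, which reboot the machine immediately.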

 

