Linux device driver fault location guidelines
Linux device drivers touch a wide range of knowledge, so writing a general guide to fault location is difficult. The author's experience is limited and omissions are unavoidable; additions in the comments are welcome.
Linux device drivers involve both hardware and software, and the causes of failure vary widely. In the author's years of maintenance experience, however, hardware-related problems account for the larger share of driver failures, and most hardware problems, especially environmental ones, are relatively easy to troubleshoot. The Linux device driver fault location mind map (large image) divides its checkpoints into two categories, hardware and software. The hardware category is subdivided into environment, chip, and upstream bus/bridge; the software category is subdivided into bootloader and kernel driver. Each checkpoint lists what to check and how to check it.
MTD device driver fault location example
Fault description
A board carries a NOR flash chip, model S29GL01GT11TFIV10, on which a UBIFS file system stores the 3.10 kernel image. During testing, roughly 10% of power-ups produced the following kernel exception, after which every flash command failed and the flash could no longer be accessed normally:
MTD get_chip(): Chip not ready after erase suspend
UBI error: ubi_io_write: error -5 while writing bytes to PEB 932:36992, written 0 bytes
Notice (4295011711): cpu0 max interrupt interval is 112812200ns
Call Trace: [jiffies:0x10000ad9b]
[<ffffffffc0be76cc>] dump_stack+0x8/0x34
[<ffffffffc0a0b624>] ubi_io_write+0x52c/0x670
[<ffffffffc0a079e8>] ubi_eba_write_leb+0xd8/0x758
[<ffffffffc0897470>] ubifs_leb_write+0xd0/0x178
[<ffffffffc0898cd0>] ubifs_wbuf_write_nolock+0x430/0x798
[<ffffffffc088b16c>] ubifs_jnl_write_data+0x1e4/0x348
[<ffffffffc088e5a8>] do_writepage+0xc8/0x258
[<ffffffffc0714d70>] __writepage+0x18/0x78
[<ffffffffc0715ab8>] write_cache_pages+0x1e0/0x4c8
[<ffffffffc0715de0>] generic_writepages+0x40/0x78
[<ffffffffc0784620>] __writeback_single_inode+0x58/0x370
[<ffffffffc0785b84>] writeback_sb_inodes+0x2e4/0x498
[<ffffffffc0785df8>] __writeback_inodes_wb+0xc0/0x118
[<ffffffffc07862fc>] wb_writeback+0x234/0x3c0
[<ffffffffc0786918>] wb_do_writeback+0x230/0x2b0
[<ffffffffc0786a1c>] bdi_writeback_workfn+0x84/0x268
[<ffffffffc0670300>] process_one_work+0x180/0x4d0
[<ffffffffc0671848>] worker_thread+0x158/0x420
[<ffffffffc06786c0>] kthread+0xa8/0xb0
[<ffffffffc06204c8>] ret_from_kernel_thread+0x10/0x18
Fault analysis and location steps
The corresponding code fragment is shown below (location: drivers/mtd/chips/cfi_cmdset_0002.c, function get_chip). If the flash is currently in the erasing state, the code issues the erase suspend command (CMD(0xB0)) to pause the block erase so that the flash returns to the ready state.
Erase suspend is needed because UBI has a background thread, ubi_bgt0d, which performs flash block garbage collection, wear leveling, torture checks, and so on. When a user accesses the flash, the background erase must be paused immediately so that the request is served promptly, instead of waiting while the background thread occupies the flash for a long time.
After the erase suspend command is issued, if the flash has not entered the ready state within the timeo period (1 s here), the flash is considered faulty and all subsequent flash commands begin to fail. The readiness check is implemented by chip_ready, which reads the same address twice and treats the flash as ready if the two values are equal. After the problem occurred, the bytes read from the failing flash address kept jumping between 0x28 and 0x6c and never stabilized.
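The double-read readiness check described above can be sketched in userspace C. This is a minimal illustration, not the kernel's chip_ready: the callback type flash_read_fn, the helpers chip_ready_sketch and wait_ready_sketch, and the simulated chip fake_flash are all hypothetical names introduced here, and the timeout is modelled as a poll count rather than jiffies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read callback standing in for the kernel's flash read. */
typedef uint16_t (*flash_read_fn)(void *ctx, uint32_t addr);

/* Ready when two back-to-back reads of the same address agree. */
static bool chip_ready_sketch(flash_read_fn read_word, void *ctx, uint32_t addr)
{
    assert(read_word != NULL);
    uint16_t first = read_word(ctx, addr);
    uint16_t second = read_word(ctx, addr);
    return first == second;
}

/* Poll until ready, giving up after max_polls attempts.
 * Returns 0 when the chip became ready, -1 on timeout. */
static int wait_ready_sketch(flash_read_fn read_word, void *ctx,
                             uint32_t addr, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (chip_ready_sketch(read_word, ctx, addr))
            return 0;
    }
    return -1;
}

/* Simulated chip: alternates between 0x28 and 0x6c for the first
 * busy_reads reads, then returns stable data. */
struct fake_flash {
    unsigned busy_reads;
    unsigned reads_done;
    uint16_t data;
};

static uint16_t fake_read(void *ctx, uint32_t addr)
{
    struct fake_flash *f = ctx;
    (void)addr; /* the simulation ignores the address */
    if (f->reads_done < f->busy_reads)
        return (f->reads_done++ % 2) ? 0x6c : 0x28;
    f->reads_done++;
    return f->data;
}
```

A simulated chip that stops toggling after a few reads passes the check, while one that keeps alternating between 0x28 and 0x6c, as in the fault described here, exhausts the polls and reports a timeout.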
    case FL_ERASING:
        if (!cfip || !(cfip->EraseSuspend & (0x1|0x2)) ||
            !(mode == FL_READY || mode == FL_POINT ||
            (mode == FL_WRITING && (cfip->EraseSuspend & 0x2))))
            goto sleep;

        /* We could check to see if we're trying to access the sector
         * that is currently being erased. However, no user will try
         * anything like that so we just wait for the timeout. */

        /* Erase suspend */
        /* It's harmless to issue the Erase-Suspend and Erase-Resume
         * commands when the erase algorithm isn't in progress. */
        map_write(map, CMD(0xB0), chip->in_progress_block_addr);
        chip->oldstate = FL_ERASING;
        chip->state = FL_ERASE_SUSPENDING;
        chip->erase_suspended = 1;
        for (;;) {
            if (chip_ready(map, adr))
                break;

            if (time_after(jiffies, timeo)) {
                /* Should have suspended the erase by now.
                 * Send an Erase-Resume command as either
                 * there was an error (so leave the erase
                 * routine to recover from it) or we trying to
                 * use the erase-in-progress sector. */
                put_chip(map, chip, adr);
                printk(KERN_ERR "MTD %s(): Chip not ready after erase suspend\n", __func__);
                return -EIO;
            }

            mutex_unlock(&chip->mutex);
            cfi_udelay(1);
            mutex_lock(&chip->mutex);
            /* Nobody will touch it while it's in state FL_ERASE_SUSPENDING.
               So we can just loop here. */
        }
        chip->state = FL_READY;
        return 0;
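The timeout test in the loop compares jiffies against timeo with time_after, a comparison that stays correct when the tick counter wraps around. A minimal userspace sketch of that idea follows. The names time_after_sketch and deadline_sketch are hypothetical stand-ins, not the kernel macro, and the sketch assumes the usual two's-complement conversion from unsigned long to long.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* True when tick value a is later than tick value b, even if the
 * counter wrapped between them. The unsigned subtraction wraps, and
 * the signed reinterpretation turns "a is ahead of b" into a negative
 * difference. */
static bool time_after_sketch(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Compute a deadline `ticks` after `now`. The sum may wrap, which is
 * fine as long as the interval fits in a signed long. */
static unsigned long deadline_sketch(unsigned long now, unsigned long ticks)
{
    assert(ticks <= (unsigned long)LONG_MAX);
    return now + ticks;
}
```

A plain comparison such as now > deadline would misfire near the wrap point, which is why the subtraction-based form is used for tick counters.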
Working from the fault location mind map above and excluding the checkpoints that do not apply to this fault, our troubleshooting sequence and results were as follows:
Fix and patch submission
The fix is simple: for the S29GL01GT/S29GL512T, delay 500 μs after the erase resume command is issued.
The specific implementation can be found at the patch submission link.
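The shape of such a fix can be sketched in userspace C. This is only an illustration of the idea, not the submitted patch: the function names, the callback types, the prefix-based model match, and the recorder used for demonstration are all hypothetical, and 0x30 is taken to be the erase resume command of the AMD/Fujitsu command set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CMD_ERASE_RESUME      0x30u /* assumed: AMD/Fujitsu erase resume */
#define ERASE_RESUME_DELAY_US 500u  /* delay stated in the text */

/* Hypothetical callbacks standing in for the driver's write and delay. */
typedef void (*write_cmd_fn)(void *ctx, unsigned cmd);
typedef void (*delay_us_fn)(void *ctx, unsigned usec);

/* Match the two affected parts by model-name prefix (an assumption;
 * a real driver would identify the chip by its ID codes). */
static bool needs_resume_delay(const char *model)
{
    assert(model != NULL);
    return strncmp(model, "S29GL01GT", 9) == 0 ||
           strncmp(model, "S29GL512T", 9) == 0;
}

/* Issue erase resume, then wait on the affected parts only. */
static void erase_resume_sketch(const char *model, write_cmd_fn write_cmd,
                                delay_us_fn delay_us, void *ctx)
{
    write_cmd(ctx, CMD_ERASE_RESUME);
    if (needs_resume_delay(model))
        delay_us(ctx, ERASE_RESUME_DELAY_US);
}

/* Demonstration-only recorder that notes what was issued. */
struct recorder {
    unsigned last_cmd;
    unsigned delayed_us;
};

static void record_cmd(void *ctx, unsigned cmd)
{
    ((struct recorder *)ctx)->last_cmd = cmd;
}

static void record_delay(void *ctx, unsigned usec)
{
    ((struct recorder *)ctx)->delayed_us += usec;
}
```

Passing the delay as a callback keeps the sketch testable without real hardware; in a driver the delay would be a direct call to the platform's microsecond delay routine.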