Linux device driver fault location guidelines and examples

Source: Internet
Author: User

Linux device driver fault location guidelines

Linux device drivers touch a wide range of knowledge, so writing a general guide to fault location is difficult. The author's experience is limited and omissions are inevitable; readers are welcome to leave a comment with additions.

Linux device drivers involve both hardware and software, and failures have many causes. In the author's years of maintenance experience, however, hardware-related problems account for the larger share of device driver failures, and most hardware problems, especially environmental ones, are relatively easy to troubleshoot. The Linux device driver fault location mind map (a large figure in the original article) divides checkpoints into hardware and software. The hardware category is subdivided into environment, chip, and upstream bus/bridge; the software category is subdivided into bootloader and kernel driver. Each checkpoint lists what to check and how to check it.

MTD device driver fault location example

Fault description

A board carries an S29GL01GT11TFIV10 NOR flash with a UBIFS file system that stores the 3.10 kernel image. During testing, roughly 10% of board power-ups produced the following kernel exception. After it appeared, every flash command failed and the flash could no longer be accessed normally:

MTD get_chip(): chip not ready after erase suspend
UBI error: ubi_io_write: error -5 while writing bytes to PEB 932:36992, written 0 bytes
Notice (4295011711): cpu0 max interrupt interval is 112812200ns
Scall Trace: [jiffies:0x10000ad9b]
[<ffffffffc0be76cc>] dump_stack+0x8/0x34
[<ffffffffc0a0b624>] ubi_io_write+0x52c/0x670
[<ffffffffc0a079e8>] ubi_eba_write_leb+0xd8/0x758
[<ffffffffc0897470>] ubifs_leb_write+0xd0/0x178
[<ffffffffc0898cd0>] ubifs_wbuf_write_nolock+0x430/0x798
[<ffffffffc088b16c>] ubifs_jnl_write_data+0x1e4/0x348
[<ffffffffc088e5a8>] do_writepage+0xc8/0x258
[<ffffffffc0714d70>] __writepage+0x18/0x78
[<ffffffffc0715ab8>] write_cache_pages+0x1e0/0x4c8
[<ffffffffc0715de0>] generic_writepages+0x40/0x78
[<ffffffffc0784620>] __writeback_single_inode+0x58/0x370
[<ffffffffc0785b84>] writeback_sb_inodes+0x2e4/0x498
[<ffffffffc0785df8>] __writeback_inodes_wb+0xc0/0x118
[<ffffffffc07862fc>] wb_writeback+0x234/0x3c0
[<ffffffffc0786918>] wb_do_writeback+0x230/0x2b0
[<ffffffffc0786a1c>] bdi_writeback_workfn+0x84/0x268
[<ffffffffc0670300>] process_one_work+0x180/0x4d0
[<ffffffffc0671848>] worker_thread+0x158/0x420
[<ffffffffc06786c0>] kthread+0xa8/0xb0
[<ffffffffc06204c8>] ret_from_kernel_thread+0x10/0x18

Fault analysis and location steps

The corresponding code fragment is shown below (location: drivers/mtd/chips/cfi_cmdset_0002.c, function get_chip). If the flash is currently erasing, the code issues the erase suspend command (CMD(0xb0)) to pause the block erase and bring the flash back to the ready state.

Erase suspend is needed because UBI runs a background thread, ubi_bgt0d, which handles flash block garbage collection, wear leveling, torture checks, and so on. When a user accesses the flash, the background erase must be paused immediately; otherwise the background thread could occupy the flash for a long time and the user's request would not get a timely response.

After the erase suspend command is issued, if the flash has not entered the ready state within the timeout timeo (1 s here), the flash is considered faulty and all subsequent flash command operations fail. Readiness is checked by chip_ready, which reads the same address twice and treats the flash as ready if the two values match. After the problem occurred, the byte read from the failing flash address kept jumping between 0x28 and 0x6c and never stabilized.
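The readiness check and timeout described above can be imitated in user space. The following is only a minimal sketch, not the kernel code: struct fake_flash, read_stable, read_toggling, chip_ready_sketch and wait_ready_sketch are hypothetical names, and a poll budget (max_polls) stands in for the jiffies-based timeo. As a side observation, 0x28 and 0x6c differ exactly in bits 6 and 2, which in the AMD/Spansion command set are the toggle bits DQ6 and DQ2, so a chip still reporting erase status would plausibly produce this pattern.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the memory-mapped flash (not a kernel type). */
struct fake_flash {
    uint8_t (*read)(struct fake_flash *f, unsigned long adr);
    unsigned reads;                 /* number of reads performed so far */
};

/* Healthy chip: array data is stable, every read returns the same byte. */
static uint8_t read_stable(struct fake_flash *f, unsigned long adr)
{
    (void)adr;
    f->reads++;
    return 0x28;
}

/* Faulty behaviour from the article: reads alternate between 0x28 and 0x6c. */
static uint8_t read_toggling(struct fake_flash *f, unsigned long adr)
{
    (void)adr;
    return (f->reads++ & 1) ? 0x6c : 0x28;
}

/* Ready if two back-to-back reads of the same address agree. */
static bool chip_ready_sketch(struct fake_flash *f, unsigned long adr)
{
    uint8_t first = f->read(f, adr);
    uint8_t second = f->read(f, adr);
    return first == second;
}

/* Poll until ready or until the budget is used up.
 * Assumption: max_polls replaces the 1 s jiffies timeout of the driver. */
static int wait_ready_sketch(struct fake_flash *f, unsigned long adr,
                             unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (chip_ready_sketch(f, adr))
            return 0;
    }
    return -EIO;                    /* same error code the log reports as -5 */
}
```

Calling wait_ready_sketch with a fake_flash whose read callback is read_stable returns 0 on the first poll, while one using read_toggling exhausts the budget and returns -EIO, mirroring the "chip not ready after erase suspend" path.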

    case FL_ERASING:
        if (!cfip || !(cfip->EraseSuspend & (0x1|0x2)) ||
            !(mode == FL_READY || mode == FL_POINT ||
            (mode == FL_WRITING && (cfip->EraseSuspend & 0x2))))
            goto sleep;

        /* We could check to see if we're trying to access the sector
         * that is currently being erased. However, no user will try
         * anything like that so we just wait for the timeout. */

        /* Erase suspend */
        /* It's harmless to issue the Erase-Suspend and Erase-Resume
         * commands when the erase algorithm isn't in progress. */
        map_write(map, CMD(0xB0), chip->in_progress_block_addr);
        chip->oldstate = FL_ERASING;
        chip->state = FL_ERASE_SUSPENDING;
        chip->erase_suspended = 1;
        for (;;) {
            if (chip_ready(map, adr))
                break;

            if (time_after(jiffies, timeo)) {
                /* Should have suspended the erase by now.
                 * Send an Erase-Resume command as either
                 * there was an error (so leave the erase
                 * routine to recover from it) or we trying to
                 * use the erase-in-progress sector. */
                put_chip(map, chip, adr);
                printk(KERN_ERR "MTD %s(): chip not ready after erase suspend\n", __func__);
                return -EIO;
            }

            mutex_unlock(&chip->mutex);
            cfi_udelay(1);
            mutex_lock(&chip->mutex);
            /* Nobody will touch it while it's in state FL_ERASE_SUSPENDING.
               So we can just loop here. */
        }
        chip->state = FL_READY;
        return 0;

Following the fault location mind map above and skipping the checkpoints that do not apply to this failure, our troubleshooting sequence and results were as follows:

 

Fix and patch submission

The fix is simple: for the S29GL01GT/S29GL512T, add a 500 μs delay after the erase resume command is issued.

See the patch submission link for the implementation details.
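To make the shape of such a fix concrete, here is a small user-space sketch. It is not the submitted patch: struct chip_info, struct fake_bus, apply_resume_quirk and erase_resume_sketch are hypothetical names, the chip is matched by model string purely for illustration, and 0x30 is assumed to be the erase resume command of the AMD/Spansion command set. Only the two model names and the 500 μs value come from the article.

```c
#include <stdbool.h>
#include <string.h>

/* Delay from the article: 500 us after erase resume. */
#define RESUME_DELAY_US  500u
/* Assumption: erase resume opcode in the AMD/Spansion command set. */
#define CMD_ERASE_RESUME 0x30u

/* Hypothetical chip description used only by this sketch. */
struct chip_info {
    const char *model;
    unsigned resume_delay_us;   /* filled in by apply_resume_quirk() */
};

/* Hypothetical recorder that replaces real bus writes and delays. */
struct fake_bus {
    unsigned last_cmd;          /* last command "written" to the chip */
    unsigned delayed_us;        /* total delay "spent" so far */
};

/* Enable the post-resume delay only for the two affected models. */
static void apply_resume_quirk(struct chip_info *c)
{
    bool affected = strncmp(c->model, "S29GL01GT", 9) == 0 ||
                    strncmp(c->model, "S29GL512T", 9) == 0;
    c->resume_delay_us = affected ? RESUME_DELAY_US : 0;
}

/* Issue erase resume, then wait if the chip needs it. */
static void erase_resume_sketch(struct fake_bus *bus, const struct chip_info *c)
{
    bus->last_cmd = CMD_ERASE_RESUME;        /* stands in for a map_write() */
    bus->delayed_us += c->resume_delay_us;   /* stands in for a udelay-style wait */
}
```

Keeping the delay in a per-chip field means unaffected chips pay no extra latency on resume; whether the real patch is structured this way should be checked against the patch itself.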


  
