Linux Block Device Driver (1)

1. Background

Sampleblk is a Linux block device driver project written for learning purposes. Its Day 1 source code implements a minimalist block device driver in a little over 200 lines. This article walks through that code and uses it to discuss the basics of Linux block device driver development.

Developing a Linux driver requires some preparation of the development environment. The sampleblk driver was developed and debugged on Linux 4.6.0. Because the block layer APIs differ considerably between kernel versions, the driver may fail to compile on other kernel versions. To develop, compile, and debug kernel modules, you first need to set up a kernel development environment and build the kernel source. This material is widely available online and is not repeated here.

In addition, the classic book on Linux device driver development is Linux Device Drivers, Third Edition, abbreviated LDD3. The book is free and can be downloaded and redistributed in accordance with its stated license.

2. Module Initialization and Exit

Linux driver modules are built on the basic framework and APIs that the kernel provides to module developers. The Hello World module in LDD3 is an example of a minimalist kernel module. Like that module, the sampleblk block driver provides the initialization and exit functions that every Linux kernel module needs.

module_init(sampleblk_init);
module_exit(sampleblk_exit);

Unlike the Hello World module, the sampleblk initialization and exit functions implement the basic functionality that a block device driver requires. This section describes them in detail.

2.1 sampleblk_init

In summary, sampleblk_init initializes the block device driver by doing the following things.

2.1.1 Block device registration

Call register_blkdev to allocate and register a major number. The function prototype is as follows,

int register_blkdev(unsigned int major, const char *name);

The Linux kernel maintains a global hash table, major_names, for block device drivers. The buckets of this hash table form an array of BLKDEV_MAJOR_HASH_SIZE (255) blk_major_name structure pointers, indexed by an integer in [0..254].

static struct blk_major_name {
    struct blk_major_name *next;
    int major;
    char name[16];
} *major_names[BLKDEV_MAJOR_HASH_SIZE];

When the major parameter of register_blkdev is not 0, the implementation hashes the major to a bucket and walks that bucket's chain. If an entry with the same major already exists, the major is in conflict and an error is returned. Otherwise a new blk_major_name is allocated, initialized with the given major and name, and linked into the bucket.

When the major parameter is 0, the kernel picks an unused bucket index from the integer range [1..254], searching from the top down, and returns it to the caller as the major number. So although a Linux major number is 12 bits wide, a major that is not specified is still allocated from [1..254].
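The lookup and allocation rules above can be tried out in user space. The following is only a sketch under stated assumptions: toy_register_blkdev, major_table, and the negative error values are made-up names and values for illustration, and the real kernel function also takes a mutex and returns kernel error codes.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAJOR_HASH_SIZE 255            /* bucket count assumed for this model */

struct major_name {
    struct major_name *next;
    int major;
    char name[16];
};

static struct major_name *major_table[MAJOR_HASH_SIZE];

/* Map a major number to its bucket index. */
static int major_to_bucket(unsigned int major)
{
    return (int)(major % MAJOR_HASH_SIZE);
}

/*
 * Toy version of the registration logic described in the text.
 * Returns 0 for a caller-chosen major, the allocated major when
 * major == 0, or a negative value on failure.
 */
static int toy_register_blkdev(unsigned int major, const char *name)
{
    struct major_name **slot, *entry;
    int index, ret = 0;

    if (major == 0) {
        /* Dynamic allocation: look for an empty bucket, highest index first. */
        for (index = MAJOR_HASH_SIZE - 1; index > 0; index--)
            if (major_table[index] == NULL)
                break;
        if (index == 0)
            return -1;                 /* no free major left */
        major = (unsigned int)index;
        ret = index;
    }

    entry = calloc(1, sizeof(*entry));
    if (entry == NULL)
        return -2;                     /* out of memory */
    entry->major = (int)major;
    snprintf(entry->name, sizeof(entry->name), "%s", name);

    /* Walk the bucket's chain; an entry with the same major is a conflict. */
    for (slot = &major_table[major_to_bucket(major)]; *slot; slot = &(*slot)->next)
        if ((*slot)->major == (int)major)
            break;
    if (*slot) {
        free(entry);
        return -3;                     /* major already registered */
    }
    *slot = entry;                     /* append at the end of the chain */
    return ret;
}
```

In this model, the first call with major 0 on an empty table returns 254, and registering the same explicit major twice fails the second time. A major such as 263 lands in the same bucket as major 8 without conflicting with it, because entries in a chain are compared by major, not by bucket.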

The sampleblk driver passes major 0 so that the kernel allocates and registers an unused major number, with the following code,

sampleblk_major = register_blkdev(0, "sampleblk");
if (sampleblk_major < 0)
    return sampleblk_major;

2.1.2 Allocation and initialization of the driver state data structure

Typically, every Linux kernel driver declares a data structure that holds the state the driver needs to access frequently. We declare one for the sampleblk driver as well,

struct sampleblk_dev {
    int minor;
    spinlock_t lock;
    struct request_queue *queue;
    struct gendisk *disk;
    ssize_t size;
    void *data;
};

To simplify the implementation and make debugging easier, the sampleblk driver supports only one minor device number for now, which can be accessed through the following global variable,

struct sampleblk_dev *sampleblk_dev = NULL;

The following code allocates the sampleblk_dev structure and initializes its members.

sampleblk_dev = kzalloc(sizeof(struct sampleblk_dev), GFP_KERNEL);
if (!sampleblk_dev) {
    rv = -ENOMEM;
    goto fail;
}

sampleblk_dev->size = sampleblk_sect_size * sampleblk_nsects;
sampleblk_dev->data = vmalloc(sampleblk_dev->size);
if (!sampleblk_dev->data) {
    rv = -ENOMEM;
    goto fail_dev;
}
sampleblk_dev->minor = minor;
2.1.3 Request queue initialization

Initializing the request queue with blk_init_queue requires declaring a so-called policy (strategy) callback and initializing the spin lock that protects the request queue. The function pointer of the policy callback and the pointer to the spin lock are then passed to blk_init_queue as arguments.

In the sampleblk driver these are the sampleblk_request function and sampleblk_dev->lock,

spin_lock_init(&sampleblk_dev->lock);
sampleblk_dev->queue = blk_init_queue(sampleblk_request,
    &sampleblk_dev->lock);
if (!sampleblk_dev->queue) {
    rv = -ENOMEM;
    goto fail_data;
}

The policy function sampleblk_request performs the read and write IO operations of the block device. Its main input parameter is the request queue structure, struct request_queue. The implementation of the policy function is described later.

When blk_init_queue executes, its internal implementation does the following:

Allocates a struct request_queue from memory.
Initializes the struct request_queue. For callers, the following parts of the initialization are particularly important:
    The policy function pointer passed to blk_init_queue is assigned to the request_fn member of struct request_queue.
    The spin lock pointer passed to blk_init_queue is assigned to the queue_lock member of struct request_queue.
    The IO scheduler associated with this request_queue is initialized.

The Linux kernel provides several ways to allocate and initialize a request queue:

blk_mq_init_queue is mainly for block device drivers that use multi-queue technology.
blk_alloc_queue and blk_queue_make_request are mainly for drivers that bypass the merging and sorting done by the kernel's IO schedulers and use a custom implementation instead.
blk_init_queue uses the IO schedulers supported by the kernel, so the driver only has to implement the policy function.

The sampleblk driver belongs to the third case. To repeat: if a block device driver needs the standard IO scheduler to merge or sort IO requests, it must use blk_init_queue to allocate and initialize the request queue.

2.1.4 Block device operation function table initialization

The Linux block device operation function table, block_device_operations, is defined in include/linux/blkdev.h. A block device driver can customize the standard block device operations by defining this function table. If the driver does not implement a method defined in the table, the Linux block device layer falls back to the default behavior of the common block layer code.

The sampleblk driver declares its own open, release, and ioctl methods, but none of these driver functions does any real work. The actual behavior of these block device operations is therefore implemented by the common block layer,

static const struct block_device_operations sampleblk_fops = {
    .owner = THIS_MODULE,
    .open = sampleblk_open,
    .release = sampleblk_release,
    .ioctl = sampleblk_ioctl,
};
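The idea of an operation table whose missing entries fall back to common-layer behavior can be pictured with a small user-space model. This is only a sketch: struct toy_block_ops, toy_common_open, and toy_common_ioctl are made-up names, and the defaults of the real block layer are more involved than what is shown here.

```c
#include <errno.h>
#include <stddef.h>

struct toy_disk;

/* Hypothetical, much reduced stand-in for block_device_operations. */
struct toy_block_ops {
    int (*open)(struct toy_disk *disk);
    int (*ioctl)(struct toy_disk *disk, unsigned int cmd);
};

struct toy_disk {
    const struct toy_block_ops *fops;
    int open_count;
};

/* Common-layer open: call the driver hook if there is one, then do the bookkeeping. */
static int toy_common_open(struct toy_disk *disk)
{
    int rv = 0;

    if (disk->fops && disk->fops->open)
        rv = disk->fops->open(disk);
    if (rv == 0)
        disk->open_count++;
    return rv;
}

/* Common-layer ioctl: without a driver hook, fall back to a default answer. */
static int toy_common_ioctl(struct toy_disk *disk, unsigned int cmd)
{
    if (disk->fops && disk->fops->ioctl)
        return disk->fops->ioctl(disk, cmd);
    return -ENOTTY;
}

/* A driver hook that, like the sampleblk methods, does no real work. */
static int noop_open(struct toy_disk *disk)
{
    (void)disk;
    return 0;
}

/* Only open is filled in; ioctl is left NULL on purpose. */
static const struct toy_block_ops noop_ops = { .open = noop_open };
```

A disk that uses noop_ops still opens successfully and still gets an answer for every ioctl, because the common functions decide what happens when a hook is absent or does nothing.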
2.1.5 Disk creation and initialization

The Linux kernel uses struct gendisk to abstract and represent a disk. In other words, a block device driver must allocate and initialize a struct gendisk to support normal block device operations.

First, use alloc_disk to allocate a struct gendisk,

disk = alloc_disk(minor);
if (!disk) {
    rv = -ENOMEM;
    goto fail_queue;
}
sampleblk_dev->disk = disk;

Then initialize the important members of struct gendisk, in particular the block device operation function table, the request queue, and the capacity. Finally, call add_disk to make the disk visible in the system, which triggers the disk hotplug uevent.

disk->major = sampleblk_major;
disk->first_minor = minor;
disk->fops = &sampleblk_fops;
disk->private_data = sampleblk_dev;
disk->queue = sampleblk_dev->queue;
sprintf(disk->disk_name, "sampleblk%d", minor);
set_capacity(disk, sampleblk_nsects);
add_disk(disk);
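The code fragments of sampleblk_init all jump to labels such as fail, fail_dev, fail_data, and fail_queue on error. The article does not show what follows those labels, so the sketch below is an assumption about the usual shape of this pattern: resources are acquired in order and, on failure, only those already acquired are released, in reverse order. It uses malloc and free in user space; toy_dev, toy_alloc, and the fail_at parameter are made-up names for illustration.

```c
#include <stdlib.h>

struct toy_dev {
    void *data;    /* stands in for the vmalloc'ed data area */
    void *queue;   /* stands in for the request queue */
    void *disk;    /* stands in for the gendisk */
};

/* fail_at lets a caller force step N (1..4) to fail; 0 means no forced failure. */
static void *toy_alloc(size_t n, int step, int fail_at)
{
    return step == fail_at ? NULL : malloc(n);
}

/*
 * Acquire resources in order. On failure, release what was already
 * acquired in reverse order. Returns 0 or the negated failing step.
 */
static int toy_init(struct toy_dev **out, int fail_at)
{
    struct toy_dev *dev;
    int rv;

    dev = toy_alloc(sizeof(*dev), 1, fail_at);
    if (!dev) { rv = -1; goto fail; }

    dev->data = toy_alloc(512 * 1024, 2, fail_at);
    if (!dev->data) { rv = -2; goto fail_dev; }

    dev->queue = toy_alloc(64, 3, fail_at);
    if (!dev->queue) { rv = -3; goto fail_data; }

    dev->disk = toy_alloc(64, 4, fail_at);
    if (!dev->disk) { rv = -4; goto fail_queue; }

    *out = dev;
    return 0;

fail_queue:
    free(dev->queue);
fail_data:
    free(dev->data);
fail_dev:
    free(dev);
fail:
    *out = NULL;
    return rv;
}

/* Teardown mirrors the success path of toy_init in reverse. */
static void toy_exit(struct toy_dev *dev)
{
    free(dev->disk);
    free(dev->queue);
    free(dev->data);
    free(dev);
}
```

Each label undoes exactly one earlier step and then falls through to the next label, which is why the labels appear in the opposite order of the allocations.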
2.2 sampleblk_exit

This is the reverse of sampleblk_init.

Delete the disk

del_gendisk is the inverse of add_disk. It makes the disk no longer visible in the system and triggers the hotplug uevent.

del_gendisk(sampleblk_dev->disk);

Stop and release the block device IO request queue

blk_cleanup_queue is the inverse of blk_init_queue, but it has to deal with the IO requests still waiting to be processed before it releases the struct request_queue.

blk_cleanup_queue(sampleblk_dev->queue);

Once blk_cleanup_queue has drained all IO requests, it immediately marks the queue as about to be released, which prevents blk_run_queue from calling the block driver's policy function again to execute further IO requests. Before Linux 3.8, the kernel had a serious bug when blk_run_queue and blk_cleanup_queue executed concurrently. The bug was discovered only recently, in a stress test that performed surprise removal under disk IO (and, to be honest, it is surprising that a bug that had been around for so long went unnoticed).

Free the disk

put_disk is the inverse of alloc_disk. Here it drops the reference count of the gendisk's kobject to 0, which releases the gendisk completely.

put_disk(sampleblk_dev->disk);
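The "last put frees the object" behavior can be sketched in a few lines of user-space C. This is a simplified model, not the kobject implementation: toy_obj, toy_obj_get, and toy_obj_put are made-up names, and the counter is a plain int where the kernel uses atomic operations.

```c
#include <stdlib.h>

struct toy_obj {
    int refcount;
    int *released_flag;   /* set to 1 when the object is freed (for demonstration) */
};

/* Allocate an object; the caller holds the first reference. */
static struct toy_obj *toy_obj_alloc(int *released_flag)
{
    struct toy_obj *obj = malloc(sizeof(*obj));

    if (!obj)
        return NULL;
    obj->refcount = 1;
    obj->released_flag = released_flag;
    *released_flag = 0;
    return obj;
}

/* Take one more reference. */
static void toy_obj_get(struct toy_obj *obj)
{
    obj->refcount++;
}

/* Drop one reference; free the object when the last one goes away. Returns 1 if freed. */
static int toy_obj_put(struct toy_obj *obj)
{
    if (--obj->refcount > 0)
        return 0;
    *obj->released_flag = 1;
    free(obj);
    return 1;
}
```

If some other party still holds a reference when the driver calls its put, the object stays alive until that party drops its reference too.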

Free the data area

vfree is the inverse of vmalloc.

vfree(sampleblk_dev->data);

Free the driver's global data structure

kfree is the inverse of kzalloc.

kfree(sampleblk_dev);

Unregister the block device

unregister_blkdev is the inverse of register_blkdev.

unregister_blkdev(sampleblk_major, "sampleblk");

3. Policy function implementation

Understanding how a block device driver's policy function is implemented first requires understanding the key data structures of the Linux IO stack.

3.1 struct request_queue

This structure is the queue of IO requests waiting to be processed by the block device driver. If the queue is allocated and initialized with blk_init_queue, the IO requests in the queue (struct request) are processed (sorted or merged) by the IO scheduler, which is triggered from blk_queue_bio.

When the block device driver's policy function is invoked, each request is linked into the queue_head list of struct request_queue through its queuelist member. A single IO request queue can hold many request structures.

3.2 struct bio

A bio logically represents an IO request that an upper-level task issues to the generic block layer. IO requests from different applications, different contexts, and different threads are encapsulated in different bio structures in the block layer.

The data of a single bio covers physically contiguous sectors on the block device, beginning at the bio's starting sector. The concept of a segment exists only because sectors that are contiguous on the block device are not guaranteed to be physically contiguous in memory. Within a segment, the data is contiguous both on the block device and in physical memory, but physical memory contiguity is not guaranteed between segments. A segment is no longer than a memory page and is always an integer multiple of the sector size.

[Figure: layout of sectors, blocks, and segments within a memory page and the relationships between them. The figure is taken from Understanding the Linux Kernel, Third Edition; copyright belongs to the original author.]

Therefore, a segment can be uniquely identified by [page, offset, len]. A bio can contain multiple segments, and it represents this one-to-many relationship through a pointer to an array of segment descriptors.

In struct bio, the member bi_io_vec is the base address of that array, and each array element is a struct bio_vec.

struct bio {

    [... snipped ...]

    struct bio_vec      *bi_io_vec;     /* the actual vec list */

    [... snipped ...]
};

struct bio_vec, in turn, is the data structure that describes a segment,

struct bio_vec {
    struct page     *bv_page;       /* struct page pointer of the physical page that contains the segment */
    unsigned int    bv_len;         /* segment length, an integer multiple of the sector size */
    unsigned int    bv_offset;      /* offset of the segment from the start of the physical page */
};

Another member of struct bio, bi_vcnt, records how many segments the bio contains, that is, the number of elements in the bi_io_vec array. The maximum number of segments/pages in a bio is determined by the following kernel macro definition.

#define BIO_MAX_PAGES       256
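To make the [page, offset, len] description concrete, here is a small user-space model of a bio and its segment array. It is a sketch under assumptions: the seg_ names, the 512-byte sector, and the 4096-byte page are chosen for illustration, and the validity check simply encodes the segment constraints stated in this section.

```c
#include <stddef.h>

#define SEG_SECTOR_SIZE 512u
#define SEG_PAGE_SIZE   4096u     /* assumed page size */

/* One segment: [page, offset, len]. */
struct seg_vec {
    void *page;            /* stands in for struct page * */
    unsigned int len;      /* bytes in this segment */
    unsigned int offset;   /* byte offset of the segment within the page */
};

struct seg_bio {
    unsigned long long sector;   /* starting sector on the device */
    unsigned short vcnt;         /* number of elements in io_vec */
    struct seg_vec *io_vec;      /* array of segments */
};

/* A segment must be a whole number of sectors and must fit inside its page. */
static int seg_vec_ok(const struct seg_vec *bv)
{
    return bv->len > 0 &&
           bv->len % SEG_SECTOR_SIZE == 0 &&
           bv->offset + bv->len <= SEG_PAGE_SIZE;
}

/* Total bytes covered by a bio, or 0 if any segment is malformed. */
static size_t seg_bio_bytes(const struct seg_bio *bio)
{
    size_t total = 0;

    for (unsigned short i = 0; i < bio->vcnt; i++) {
        if (!seg_vec_ok(&bio->io_vec[i]))
            return 0;
        total += bio->io_vec[i].len;
    }
    return total;
}
```

A bio with one 1024-byte segment and one 4096-byte segment covers 5120 bytes, that is, ten consecutive 512-byte sectors on the device, even though the two pages may be far apart in memory.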

Multiple bio structures can be linked into a list through the member bi_next. The bio list can be the one maintained by the bio_list member of the task_struct of the task doing IO. It can also be the list that belongs to a struct request (see the next section).

[Figure: a list of bio structures linked through bi_next, each with a one-to-many relationship to its segments/pages. The figure is taken from Professional Linux Kernel Architecture; copyright belongs to the original author.]

3.3 struct request

A request logically represents an IO request received by the block device driver layer. The data of the request covers physically contiguous sectors on the block device, beginning at the request's starting sector.

A struct request can contain many struct bio, linked into a list through the bi_next member of the bio structure. The first bio in this list is pointed to by the bio member of struct request, and the tail of the list is pointed to by the biotail member.
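The head and tail pointers make it cheap to append a bio that continues exactly where the request ends. The following user-space sketch models only that case; chained_bio, toy_request, and toy_back_merge are made-up names, and the kernel's real merge rules check much more than sector contiguity.

```c
#include <stddef.h>

struct chained_bio {
    struct chained_bio *next;      /* plays the role of bi_next */
    unsigned long long sector;     /* starting sector */
    unsigned int nr_sectors;       /* length in sectors */
};

struct toy_request {
    struct chained_bio *bio;       /* head of the chain */
    struct chained_bio *biotail;   /* tail of the chain */
    unsigned long long sector;     /* starting sector of the whole request */
    unsigned int nr_sectors;       /* total length in sectors */
};

/* Start a new request from a single bio. */
static void toy_request_init(struct toy_request *rq, struct chained_bio *bio)
{
    bio->next = NULL;
    rq->bio = rq->biotail = bio;
    rq->sector = bio->sector;
    rq->nr_sectors = bio->nr_sectors;
}

/*
 * Append bio if it starts exactly where the request ends.
 * Only the links and the request totals change.
 * Returns 1 if merged, 0 if the caller needs a new request.
 */
static int toy_back_merge(struct toy_request *rq, struct chained_bio *bio)
{
    if (bio->sector != rq->sector + rq->nr_sectors)
        return 0;
    bio->next = NULL;
    rq->biotail->next = bio;
    rq->biotail = bio;
    rq->nr_sectors += bio->nr_sectors;
    return 1;
}
```

A bio for sectors 108..115 can be appended to a request that covers sectors 100..107, while a bio starting at sector 200 cannot and would need a request of its own.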

When the generic block layer receives bios from different threads, it usually takes one of the following two paths,

Merge the bio into an existing request

blk_queue_bio calls into the IO scheduler to perform the IO merge. Several bios may therefore be merged into the same request, forming a list of bio structures inside it. Because each bio comes from a different task, merging can only be done at the request level by inserting into the list; the original bio structures are not modified internally.

Allocate a new request

If the bio cannot be merged into an existing request, the generic block layer constructs a new request for the bio and inserts it into the queue inside the IO scheduler. When the upper-level task triggers blk_run_queue through blk_finish_plug, the driver's policy function request_fn triggers the IO scheduler's sort operation, which inserts the request, in sorted order, into the block device driver's IO request queue.

In either case, the generic block layer code calls the request_fn callback that the block driver registered in the request_queue, which finally hands the merged or sorted requests to the driver's low-level functions for the IO operations.

3.4 Policy function request_fn

As mentioned earlier, when a block device driver uses blk_init_queue to allocate and initialize a request_queue, it must also pass in its own policy function request_fn and the required spin lock queue_lock. A driver that implements its own request_fn needs to be aware of the following characteristics.

When the generic block layer calls request_fn, the kernel already holds the queue_lock of the request_queue, so the context at that point is atomic. Until the driver's policy function releases queue_lock, it must obey the kernel's constraints on atomic context.

While the driver's policy function is running, generic block layer code may access the request_queue at the same time. To reduce contention on queue_lock, the policy function should release queue_lock as early as possible and reacquire it before it returns.

The policy function executes asynchronously and does not run in the kernel context of the user process that issued the IO. The implementation therefore cannot assume that it is running in the kernel context of a user process.
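The locking rule above (entered with the lock held, lock dropped around the per-request work, lock held again on return) can be shown with a single-threaded toy model. This is only a sketch: toy_lock is a flag that records whether the lock is held and provides no mutual exclusion, and all the toy_ names are made up for illustration.

```c
#include <stddef.h>

struct toy_lock { int held; };

static void toy_lock_acquire(struct toy_lock *l) { l->held = 1; }
static void toy_lock_release(struct toy_lock *l) { l->held = 0; }

struct toy_queue {
    struct toy_lock lock;
    int pending[8];        /* ids of requests waiting in the queue */
    int head, tail;
    int handled_unlocked;  /* requests handled while the lock was dropped */
};

/* Fetching touches shared queue state, so it requires the lock. */
static int toy_fetch(struct toy_queue *q, int *id)
{
    if (!q->lock.held || q->head == q->tail)
        return 0;
    *id = q->pending[q->head++];
    return 1;
}

/* The slow part of the work; meant to run without the queue lock. */
static void toy_handle(struct toy_queue *q, int id)
{
    (void)id;
    if (!q->lock.held)
        q->handled_unlocked++;
}

/*
 * Shape of a policy function: entered with the lock held, drops it
 * around the per-request work, and holds it again on return.
 */
static void toy_request_fn(struct toy_queue *q)
{
    int id;

    while (toy_fetch(q, &id)) {
        toy_lock_release(&q->lock);
        toy_handle(q, id);
        toy_lock_acquire(&q->lock);
    }
}
```

Note that the final failing toy_fetch, the one that ends the loop, also runs with the lock held, so the caller gets the queue back in the same locking state in which it handed it over.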

The sampleblk policy function is sampleblk_request, which is registered through blk_init_queue in the request_fn member of the request_queue.

static void sampleblk_request(struct request_queue *q)
{
    struct request *rq = NULL;
    int rv = 0;
    uint64_t pos = 0;
    ssize_t size = 0;
    struct bio_vec bvec;
    struct req_iterator iter;
    void *kaddr = NULL;

    /* Entered with q->queue_lock held. */
    while ((rq = blk_fetch_request(q)) != NULL) {
        /* Drop the queue lock while this request is being served. */
        spin_unlock_irq(q->queue_lock);

        if (rq->cmd_type != REQ_TYPE_FS) {
            rv = -EIO;
            goto skip;
        }

        BUG_ON(sampleblk_dev != rq->rq_disk->private_data);

        pos = blk_rq_pos(rq) * sampleblk_sect_size;
        size = blk_rq_bytes(rq);
        if ((pos + size > sampleblk_dev->size)) {
            pr_crit("sampleblk: Beyond-end write (%llu %zx)\n", pos, size);
            rv = -EIO;
            goto skip;
        }

        rq_for_each_segment(bvec, rq, iter) {
            kaddr = kmap(bvec.bv_page);

            rv = sampleblk_handle_io(sampleblk_dev,
                pos, bvec.bv_len, kaddr + bvec.bv_offset, rq_data_dir(rq));
            kunmap(bvec.bv_page);   /* unmap before any early exit */
            if (rv < 0)
                goto skip;

            pos += bvec.bv_len;
        }
skip:
        blk_end_request_all(rq, rv);

        /* Retake the lock before fetching the next request or returning. */
        spin_lock_irq(q->queue_lock);
    }
}

The implementation logic of the policy function sampleblk_request is as follows:

Use blk_fetch_request in a loop to get each pending request from the queue. The kernel function blk_fetch_request returns a pointer to the first request on the queue_head list of struct request_queue and then calls blk_dequeue_request to remove that request from the queue.
For each request fetched, release queue_lock immediately; after the request has been processed, acquire queue_lock again.
REQ_TYPE_FS is used to check whether the request comes from the file system. This driver does not support non-file-system requests.
blk_rq_pos returns the starting sector number of the request, and blk_rq_bytes returns the number of bytes in the whole request, which should be an integer multiple of the sector size.
The rq_for_each_segment macro iterates over every segment in a request, that is, every struct bio_vec. Note that the segments (bio_vec) of a request start at the sector given by blk_rq_pos and are contiguous on the device; only the physical memory of different segments is not guaranteed to be contiguous.
For each struct bio_vec, kmap returns the virtual address of the page that holds the segment. bv_offset and bv_len then give the exact offset within the page and the length of the segment.
rq_data_dir tells whether the request is a read or a write.
After the request has been processed, blk_end_request_all must be called so that the common block layer code can do the post-processing.

The driver function sampleblk_handle_io performs the driver-level IO operation for each segment of a request. Before it is called, its parameters are prepared: the starting byte offset pos, the length bv_len, the virtual memory address of the data kaddr + bvec.bv_offset, and the read/write direction. Because sampleblk is only a ramdisk driver, the IO operation for each segment is implemented with memcpy,

/*
 * Do an I/O operation for each segment
 */
static int sampleblk_handle_io(struct sampleblk_dev *sampleblk_dev,
        uint64_t pos, ssize_t size, void *buffer, int write)
{
    if (write)
        memcpy(sampleblk_dev->data + pos, buffer, size);
    else
        memcpy(buffer, sampleblk_dev->data + pos, size);

    return 0;
}
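The whole data path (sector to byte offset, bounds check, one memcpy per segment, offset advanced by the segment length) can be exercised in user space. The sketch below is a simplified model, not the driver: toy_disk_data, toy_segment, and toy_do_request are made-up names, the 512-byte sector and 1024-sector capacity are assumed to match the size printed in the test section, and buffers are plain pointers instead of kmap'ed pages.

```c
#include <stdint.h>
#include <string.h>

#define TOY_SECT_SIZE 512u
#define TOY_NSECTS    1024u

static unsigned char toy_disk_data[TOY_SECT_SIZE * TOY_NSECTS];

struct toy_segment {
    void *buf;            /* already-mapped address of the segment data */
    unsigned int len;     /* bytes in the segment */
};

/*
 * Serve one request. The segments are contiguous on the disk, starting
 * at start_sector, but their buffers may be anywhere in memory.
 * Returns 0 on success, -1 if the request runs past the end of the disk.
 */
static int toy_do_request(uint64_t start_sector, const struct toy_segment *segs,
                          unsigned int nsegs, int write)
{
    uint64_t pos = start_sector * TOY_SECT_SIZE;
    uint64_t size = 0;

    for (unsigned int i = 0; i < nsegs; i++)
        size += segs[i].len;
    if (pos + size > sizeof(toy_disk_data))
        return -1;

    for (unsigned int i = 0; i < nsegs; i++) {
        if (write)
            memcpy(toy_disk_data + pos, segs[i].buf, segs[i].len);
        else
            memcpy(segs[i].buf, toy_disk_data + pos, segs[i].len);
        pos += segs[i].len;   /* advance the disk offset by the segment length */
    }
    return 0;
}
```

Writing a 512-byte and a 1024-byte segment at sector 4 and reading them back as a single 1536-byte segment returns the same bytes in the same order, which is exactly the property the per-segment pos update has to preserve.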
4. Testing

4.1 Compiling and loading

First, download the kernel source code, compile and install the kernel, and boot into the new kernel.

Because this driver was developed and debugged on Linux 4.6.0, and the kernel functions used by block device drivers vary considerably between kernel versions, it is best to download the Linux mainline source and git checkout version 4.6.0 before compiling the kernel. The steps for compiling and installing a kernel are covered widely online and are left to the reader.

After the kernel is compiled, build the driver module from the kernel source directory.

$ make M=/ws/lktm/drivers/block/sampleblk/day1

Once the driver has compiled successfully, load the kernel module.

$ sudo insmod /ws/lktm/drivers/block/sampleblk/day1/sampleblk.ko

After the driver has loaded successfully, the crash tool can be used to view the contents of struct sampleblk_dev.

crash7> mod -s sampleblk /home/yango/ws/lktm/drivers/block/sampleblk/day1/sampleblk.ko
     MODULE       NAME      SIZE  OBJECT FILE
ffffffffa03bb580  sampleblk 2681  /home/yango/ws/lktm/drivers/block/sampleblk/day1/sampleblk.ko

crash7> p *sampleblk_dev
$ = {
  minor = 1,
  lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  queue = 0xffff880034ef9200,
  disk = 0xffff880000887000,
  size = 524288,
  data = 0xffffc90001a5c000
}

Note: for the use of Linux crash, please refer to the extended reading.

4.2 Resolving the module reference problem

Problem: remove the implementation of the driver's sampleblk_request function, then recompile and load the kernel module. Unloading the module with rmmod now fails, and the kernel reports that the module is in use.

Using strace, you can observe that /sys/module/sampleblk/refcnt is non-zero, that is, the module is in use.

$ strace rmmod sampleblk
execve("/usr/sbin/rmmod", ["rmmod", "sampleblk"], [/* vars */]) = 0

................[snipped]..........................

openat(AT_FDCWD, "/sys/module/sampleblk/holders", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3, /* 2 entries */, 32768)     =
getdents(3, /* 0 entries */, 32768)     = 0
close(3)                                = 0
open("/sys/module/sampleblk/refcnt", O_RDONLY|O_CLOEXEC) = 3    /* shows a reference count of 3 */
read(3, "1\n")                          = 2
read(3, "",)                            = 0
close(3)                                = 0
write(2, "rmmod: ERROR: Module sampleblk i"..., 41rmmod: ERROR: Module sampleblk is in use
) =
exit_group(1)                           = ?
+++ exited with 1 +++

Viewed with the lsmod command, the module's reference count is indeed 3, but the names of the referrers are not displayed. In general, only references between kernel modules show the name of the referring module, so a reference without a name comes from a user-space process.

So who on earth is using the sampleblk driver that was only just loaded? The module:module_get tracepoint gives the answer. Reboot the kernel and run the tpoint command before loading the module. Then run insmod to load the module.

$ sudo ./tpoint module:module_get
Tracing module:module_get. Ctrl-C to end.

   systemd-udevd-2986  [m] ....   196.382796: module_get: sampleblk call_site=get_disk refcnt=2
   systemd-udevd-2986  [m] ....   196.383071: module_get: sampleblk call_site=get_disk refcnt=3

As you can see, it is systemd's udevd process that is using the sampleblk device. Anyone familiar with udevd will quickly see why: udevd listens for hotplug events from all devices in the system and performs a series of operations on each new device according to predefined rules. When the sampleblk driver calls add_disk, the kobject layer code sends the hotplug uevent to udevd in user space, so udevd opens the block device and performs the relevant operations.
With the crash command, it is easy to find out which process has the sampleblk device open,

crash> foreach files -R /dev/sampleblk
PID: 4084   TASK: ffff88000684d700   COMMAND: "systemd-udevd"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
  8 ffff88000691ad00 ffff88001ffc0600 ffff8800391ada08 BLK  /dev/sampleblk1
  9 ffff880006918e00 ffff88001ffc0600 ffff8800391ada08 BLK  /dev/sampleblk1

Because the implementation of sampleblk_request had been removed, the IO that udevd issued could not be completed by the sampleblk device driver, so udevd fell into a long blocking wait until a timeout returned an error and the device was released. This analysis can be verified from the system's message log,

messages:Apr 03:11:51 localhost systemd-udevd: worker [2466] /devices/virtual/block/sampleblk1 is taking a long time
messages:Apr 03:12:02 localhost systemd-udevd: worker [2466] /devices/virtual/block/sampleblk1 timeout; kill it
messages:Apr 03:12:02 localhost systemd-udevd: seq 4313 '/devices/virtual/block/sampleblk1' killed

Note: tpoint is an open-source Bash script tool based on ftrace that can be downloaded and run directly. It is part of Brendan Gregg's open-source project on GitHub; a previous article gave the project link.

Once the sampleblk_request source code is added back, the problem goes away, because udevd can then finish its access to the sampleblk device quickly.

4.3 Creating a file system

Although the sampleblk block driver has only about 200 lines of source code, it already works as a ramdisk on which a file system can be created.

$ sudo mkfs.ext4 /dev/sampleblk1

After the file system has been created successfully, mount it and create an empty file a. Everything works as expected.

$ sudo mount /dev/sampleblk1 /mnt
$ touch a

At this point, the most basic functionality of sampleblk as a ramdisk has been tested.

5. Extended reading

Linux Crash - background
Linux Crash - page cache debug
Ftrace: The hidden light switch
Linux Device Drivers, Third Edition
