Linux Kernel Analysis Notes: The Block I/O Layer


If you have a good memory, you will recall that my earlier post on Linux device driver examples dealt mostly with character device drivers, so today's topic, the block I/O layer, covers their counterpart: block devices. The most fundamental difference between the two is whether they can be accessed randomly, that is, whether the driver can jump from one location to another while accessing the device. Devices that allow this are block devices; devices that must be accessed sequentially are character devices.

 

The smallest addressable unit of a block device is the sector. Sector sizes are powers of 2, the most common being 512 bytes. The sector size is a physical property of the device, and the sector is the fundamental unit of all block devices: a block device can neither address nor operate on anything smaller than a sector, although many block devices can transfer multiple sectors at once. From the software side, the smallest logically addressable unit is the block. The block is an abstraction of the file system: the file system can be accessed only in units of a block. Although physical disks are addressed at the sector level, all disk operations the kernel performs are in terms of blocks. Since the sector is the device's smallest addressable unit, a block can be no smaller than a sector and must be a multiple of the sector size. The kernel further requires the block size to be a power of 2 and no larger than a page. The net requirement is that the block size be a power-of-2 multiple of the sector size and not exceed the page size. Common block sizes are 512 bytes, 1 KB, and 4 KB.
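To make these constraints concrete, here is a minimal user-space C sketch (not kernel code) that checks whether a proposed block size satisfies the rules above; the 512-byte sector size and 4 KB page size are assumptions for illustration:

#include <stdbool.h>
#include <stdio.h>

#define SECTOR_SIZE 512   /* assumed physical sector size */
#define PAGE_SIZE   4096  /* assumed page size */

/* A block size is valid if it is a power of 2, a multiple of
 * the sector size, and no larger than a page. */
static bool block_size_valid(unsigned int blksize)
{
    bool power_of_2 = blksize && ((blksize & (blksize - 1)) == 0);
    return power_of_2 &&
           blksize % SECTOR_SIZE == 0 &&
           blksize <= PAGE_SIZE;
}

int main(void)
{
    unsigned int sizes[] = { 256, 512, 1024, 3072, 4096, 8192 };
    for (int i = 0; i < 6; i++)
        printf("%u: %s\n", sizes[i],
               block_size_valid(sizes[i]) ? "valid" : "invalid");
    return 0;  /* expect 512, 1024, and 4096 to be valid */
}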

 

When a block is read into memory, it is stored in a buffer. Each buffer is associated with exactly one block and serves as the in-memory representation of that disk block. Because the kernel needs some control information when handling the data, each buffer has a descriptor called buffer_head, also known as the buffer head. It is defined in <linux/buffer_head.h> and contains all the information the kernel needs to manipulate the buffer:

 

 

struct buffer_head {
    unsigned long        b_state;          /* buffer state flags */
    atomic_t             b_count;          /* buffer usage counter */
    struct buffer_head  *b_this_page;      /* buffers using this page */
    struct page         *b_page;           /* page storing this buffer */
    sector_t             b_blocknr;        /* logical block number */
    u32                  b_size;           /* block size (in bytes) */
    char                *b_data;           /* buffer in the page */
    struct block_device *b_bdev;           /* device where block resides */
    bh_end_io_t         *b_end_io;         /* I/O completion method */
    void                *b_private;        /* data for completion method */
    struct list_head     b_assoc_buffers;  /* list of associated mappings */
};

 

The b_state field holds the state of the buffer, a combination of one or more of the flags below. The legal flags are defined in the bh_state_bits enumeration in <linux/buffer_head.h>; the main ones are:

BH_Uptodate    - the buffer contains valid data
BH_Dirty       - the buffer is dirty (its contents are newer than the block on disk and must be written back)
BH_Lock        - the buffer is locked while it is undergoing I/O
BH_Req         - the buffer is involved in an I/O request
BH_Mapped      - the buffer maps a valid block on disk
BH_New         - the buffer is newly mapped via get_block() and not yet accessed
BH_Async_Read  - the buffer is undergoing asynchronous read I/O
BH_Async_Write - the buffer is undergoing asynchronous write I/O
BH_Delay       - the buffer does not yet have an associated block on disk (delayed allocation)
BH_Boundary    - the buffer is the last of a set of contiguous blocks; the next block is discontiguous

The bh_state_bits enumeration also contains a special flag, BH_PrivateStart, which is not a valid state flag itself; it marks the first bit that other code may use for its own purposes. The block I/O layer never uses BH_PrivateStart or any higher bit, so a driver may safely store its own information in those bits of the b_state field, defining its own state flags there, as long as its custom flags do not collide with the bits dedicated to the block I/O layer.

The b_count field is the buffer's usage count. It is incremented and decremented through two inline functions, both defined in <linux/buffer_head.h>:

 

 

static inline void get_bh(struct buffer_head *bh)
{
    atomic_inc(&bh->b_count);
}

static inline void put_bh(struct buffer_head *bh)
{
    atomic_dec(&bh->b_count);
}

 

Before manipulating a buffer head, you should increment its reference count with get_bh() to guarantee the buffer head is not deallocated out from under you; when you have finished with it, decrement the count with put_bh().

The physical disk block corresponding to the buffer is indexed by the b_blocknr field, which holds the logical block number within the block device given by b_bdev. The physical memory page backing the buffer is given by b_page, while b_data points directly to the block's data (somewhere within the page referenced by b_page) and b_size gives the block size. The block therefore starts at address b_data in memory and ends at (b_data + b_size).

The purpose of the buffer head is to describe the mapping between a disk block and its physical in-memory buffer (a sequence of bytes on a specific page). The structure acts as a descriptor in the kernel, representing the buffer-to-block mapping. Using the buffer head as the container for I/O operations has drawbacks, though: it is a large, heavyweight descriptor that describes only a single buffer. I will not dwell on the details here; it is enough to know that current kernels describe block I/O with a new, flexible, lightweight container: the bio structure.
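As a concrete illustration, here is a minimal, hypothetical kernel-style sketch of how a driver might pin a buffer head while inspecting it, and how it could claim a private state bit at BH_PrivateStart; the flag name MYDRV_BH_Seen and the function are assumptions for illustration, not kernel API:

#include <linux/buffer_head.h>

/* Hypothetical driver-private flag: bits at or above BH_PrivateStart
 * are free for driver use, so this does not clash with the block layer. */
#define MYDRV_BH_Seen  BH_PrivateStart

static void mydrv_inspect_bh(struct buffer_head *bh)
{
    get_bh(bh);                 /* pin: bump b_count so bh stays allocated */

    /* The block's data spans [b_data, b_data + b_size) inside b_page. */
    char *start = bh->b_data;
    char *end   = bh->b_data + bh->b_size;
    (void)start; (void)end;

    set_bit(MYDRV_BH_Seen, &bh->b_state);  /* record our private state */

    put_bh(bh);                 /* unpin when done */
}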

 

The bio structure is defined in <linux/bio.h>. It represents block I/O operations that are in flight (active) as a list of segments. A segment is a chunk of a buffer that is contiguous in memory; individual buffers therefore need not be contiguous as a whole. By describing buffers in segments, the bio structure can carry out I/O operations even when a single buffer is scattered across multiple locations in memory. The bio structure and its fields are shown below:

 

 

struct bio {
    sector_t             bi_sector;         /* associated sector on disk */
    struct bio          *bi_next;           /* list of requests */
    struct block_device *bi_bdev;           /* associated block device */
    unsigned long        bi_flags;          /* status and command flags */
    unsigned long        bi_rw;             /* read or write? */
    unsigned short       bi_vcnt;           /* number of bio_vecs */
    unsigned short       bi_idx;            /* current index in bi_io_vec */
    unsigned short       bi_phys_segments;  /* number of segments after coalescing */
    unsigned short       bi_hw_segments;    /* number of segments after remapping */
    unsigned int         bi_size;           /* I/O count */
    unsigned int         bi_hw_front_size;  /* size of the first mergeable segment */
    unsigned int         bi_hw_back_size;   /* size of the last mergeable segment */
    unsigned int         bi_max_vecs;       /* maximum bio_vecs possible */
    struct bio_vec      *bi_io_vec;         /* bio_vec list */
    bio_end_io_t        *bi_end_io;         /* I/O completion method */
    atomic_t             bi_cnt;            /* usage counter */
    void                *bi_private;        /* owner-private data */
    bio_destructor_t    *bi_destructor;     /* destructor method */
};

 

The purpose of the bio structure is to represent block I/O operations that are in flight; the primary fields in the structure all serve to manage the related information. The most important fields are bi_io_vec, bi_vcnt, and bi_idx; the figure that normally accompanies this discussion (omitted here) shows how they relate.

 

 

Having given the layout of struct bio above, here is the layout of struct bio_vec:

 

 

struct bio_vec {
    struct page  *bv_page;    /* pointer to the physical page on which this buffer resides */
    unsigned int  bv_len;     /* the length in bytes of this buffer */
    unsigned int  bv_offset;  /* the byte offset within the page where the buffer resides */
};

 

Now to tie the pieces together. Each block I/O request is represented by a bio structure, and each request is made up of one or more segments, stored in an array of bio_vec structures. Each bio_vec describes a segment's actual location within a physical page; organized together, they act like a vector. The first segment of the I/O operation is pointed to by bi_io_vec, the remaining segments follow in order, and there are bi_vcnt segments in total. As the block I/O layer executes the request and works through the segments, bi_idx is updated continually so that it always points to the current segment.

So bi_idx points at the current bio_vec in the array, and the block I/O layer uses it to track the progress of the block I/O operation. More importantly, this field allows a bio structure to be split, which matters to drivers such as RAID that spread a single bio across multiple underlying disks.

The bi_cnt field holds the usage count of the bio structure; when it reaches 0, the structure should be destroyed and the memory it occupies freed. The count is managed with the following two functions:

 

 

void bio_get(struct bio *bio);
void bio_put(struct bio *bio);

 

The last field is bi_private, a private field for the owner of the structure: whoever creates the bio structure may read and write it.
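To see these fields in use, here is a minimal, hypothetical kernel-style sketch of building a one-segment bio and submitting it for a read, roughly following the 2.6-era interfaces (bio_alloc(), submit_bio(), bio_put()); the device, page, and completion handler are assumptions for illustration:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>

/* Completion handler; this signature matches early 2.6 kernels and
 * changed in later releases. */
static int mydrv_bio_end_io(struct bio *bio, unsigned int bytes_done, int err)
{
    if (bio->bi_size)           /* partial completion: not done yet */
        return 1;
    /* Whole bio complete: bi_private carries whatever we stored there. */
    complete((struct completion *)bio->bi_private);
    bio_put(bio);               /* drop our reference (frees at zero) */
    return 0;
}

static void mydrv_read_one_page(struct block_device *bdev,
                                struct page *page, sector_t sector,
                                struct completion *done)
{
    struct bio *bio = bio_alloc(GFP_NOIO, 1);   /* room for one bio_vec */

    bio->bi_bdev   = bdev;
    bio->bi_sector = sector;                    /* starting sector on disk */

    /* Fill the single segment: a whole page, starting at offset 0. */
    bio->bi_io_vec[0].bv_page   = page;
    bio->bi_io_vec[0].bv_len    = PAGE_SIZE;
    bio->bi_io_vec[0].bv_offset = 0;
    bio->bi_vcnt = 1;
    bio->bi_idx  = 0;
    bio->bi_size = PAGE_SIZE;                   /* total I/O count in bytes */

    bio->bi_end_io  = mydrv_bio_end_io;
    bio->bi_private = done;                     /* owner-private field */

    submit_bio(READ, bio);                      /* hand off to the block layer */
}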

 

Block devices keep their pending block I/O requests in a request queue, represented by struct request_queue, which is defined in <linux/blkdev.h> and contains a doubly linked list of requests plus the associated control information. Requests are added to the queue by higher-level code in the kernel, such as file systems. As long as the request queue is not empty, the block device driver that owns the queue grabs the request at the head of the queue and submits it to the corresponding block device. Each item in the queue's request list is a single request, represented by struct request, also defined in <linux/blkdev.h>. Because a single request may operate on multiple consecutive disk blocks, each request can be composed of multiple bio structures. Note that although the blocks on disk must be adjacent, the corresponding blocks in memory need not be: each bio structure can describe multiple segments, and each request can contain multiple bio structures.
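As a sketch of how a driver consumes its queue, here is a hypothetical request function in the style of the 2.6-era API (blk_init_queue(), elv_next_request(), end_request()); the driver name and transfer helper are assumptions for illustration:

#include <linux/blkdev.h>

static struct request_queue *mydrv_queue;
static spinlock_t mydrv_lock;   /* initialize with spin_lock_init() at probe time */

/* Hypothetical helper: moves req->current_nr_sectors sectors of data
 * starting at req->sector; returns nonzero on success. */
static int mydrv_transfer(struct request *req);

/* Called by the block layer whenever there are requests to process. */
static void mydrv_request(struct request_queue *q)
{
    struct request *req;

    /* Pull requests off the head of the queue one at a time. */
    while ((req = elv_next_request(q)) != NULL) {
        int ok = mydrv_transfer(req);
        end_request(req, ok);   /* complete (or fail) the request */
    }
}

/* At init time the driver creates its queue and registers the
 * request function:
 *     mydrv_queue = blk_init_queue(mydrv_request, &mydrv_lock);
 */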

 

Now that we understand block I/O requests, we can move on to I/O scheduling. Every disk access requires a seek, positioning the disk head at a specific location on the device, and seeks are expensive. To optimize them, the kernel neither submits requests to the disk in the order it receives them nor submits them immediately; instead, it performs two operations called merging and sorting before submission, which greatly improve the overall performance of the system.

 

The I/O scheduler reduces disk seek time in two ways: merging and sorting. Merging means coalescing two or more requests into one new request. Sorting, of which the best-known example is elevator scheduling, means keeping the entire request queue ordered in the direction of increasing sector number, so that all requests are arranged in the order of their sectors on disk. This not only shortens the seek for an individual request; the more important optimization is that, by keeping the disk head moving along a line, it shortens the seek time of all requests. Many operating-system textbooks cover the elevator algorithm; the Linux elevator works roughly as follows (a sketch of the insertion policy follows the list):

 

1. First, if a request that operates on an adjacent disk sector already sits in the queue, the new request is merged with the existing request into a single request.

2. If a sufficiently old request exists in the queue, the new request is inserted at the tail of the queue, to keep the older requests from starving.

3. If a suitable position exists in the queue, sector-wise, the new request is inserted there, keeping the queue sorted by the physical location of the requests on disk.

4. If no suitable insertion position exists in the queue, the request is inserted at the tail of the queue.
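Here is a minimal user-space C sketch of that insertion policy, assuming a simple singly linked queue sorted by sector; the aging threshold, request layout, and back-merge-only simplification (front merges omitted) are assumptions for illustration, not kernel code:

#include <stdlib.h>
#include <time.h>

#define AGE_LIMIT 2  /* assumed aging threshold, in seconds */

struct req {
    long sector;        /* first sector the request touches */
    long nr_sectors;    /* number of sectors */
    time_t stamp;       /* when the request entered the queue */
    struct req *next;
};

static struct req *queue;   /* head of the sorted queue */

static void elevator_add(struct req *new)
{
    struct req *r, **link;

    new->stamp = time(NULL);

    for (r = queue; r; r = r->next) {
        /* Rule 1: merge with an adjacent request (back merge shown). */
        if (r->sector + r->nr_sectors == new->sector) {
            r->nr_sectors += new->nr_sectors;
            free(new);
            return;
        }
        /* Rule 2: an old request exists, so append at the tail
         * rather than letting the sweep starve it further. */
        if (time(NULL) - r->stamp > AGE_LIMIT)
            goto tail;
    }

    /* Rule 3: insert at the sector-sorted position; if the scan
     * falls off the end, this is rule 4 (insert at the tail). */
    for (link = &queue; *link; link = &(*link)->next)
        if ((*link)->sector > new->sector)
            break;
    new->next = *link;
    *link = new;
    return;

tail:
    for (link = &queue; *link; link = &(*link)->next)
        ;
    new->next = NULL;
    *link = new;
}

int main(void)
{
    struct req *a = calloc(1, sizeof(*a));
    a->sector = 100;
    a->nr_sectors = 8;
    elevator_add(a);
    return 0;
}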

 

I mentioned elevator scheduling above, but one problem was never raised: the elevator algorithm's weakness, starvation. Minimizing seek time means that a heavy stream of operations against one area of the disk can deny service to operations elsewhere on the disk. In fact, a steady stream of requests to the same region can leave requests for distant regions unserved indefinitely. This starvation is unfair.

Worse, general request starvation gives rise to a specific problem: writes starving reads. Write operations can usually be committed whenever the kernel gets around to them, but read operations block the application until the read request is satisfied, so read latency matters a great deal to system performance. Moreover, read requests tend to depend on one another: when reading a large number of files, each read fills a small buffer, and the application issues the next read only after the previous one has returned from disk. So if every read request suffers starvation, the delays compound, and applications reading files face unacceptably long total waits. Reducing request starvation, however, necessarily costs some global throughput.

To address this problem, the deadline I/O scheduler was proposed: it tries to keep global throughput as high as possible while still treating requests fairly. In the deadline I/O scheduler, every request is given an expiration time; by default, a read request expires after 500 milliseconds and a write request after 5 seconds. Like the Linux elevator, the deadline I/O scheduler maintains a queue of requests sorted by physical location on disk, called the sorted queue, and when a new request is submitted it merges and inserts it just as the Linux elevator does. In addition, the deadline I/O scheduler inserts the request, by type, into one of two extra queues: read requests are appended in order to a read FIFO queue, and write requests to a write FIFO queue. While the sorted queue is ordered by disk sector, these queues are kept in FIFO order, with new requests always added at the tail.

In normal operation, the deadline I/O scheduler takes requests from the head of the sorted queue and moves them to the dispatch queue, which then feeds them to the disk drive; this keeps seeks to a minimum. But if the request at the head of either the write FIFO queue or the read FIFO queue expires, the scheduler stops drawing from the sorted queue and begins servicing requests from that FIFO queue instead. In this way, the deadline I/O scheduler tries to ensure that no request goes unserviced for long after its deadline has passed, as the usual diagram (omitted here) illustrates.
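A compact user-space C sketch of that dispatch decision follows, assuming three simple queues and the default 500 ms / 5 s expirations; the data structures are illustrative only (in the real scheduler each request sits on both the sorted queue and one FIFO, and is unlinked from both when dispatched):

#include <stddef.h>
#include <time.h>

#define READ_EXPIRE_MS   500     /* default read deadline */
#define WRITE_EXPIRE_MS  5000    /* default write deadline */

struct dreq {
    long sector;
    long deadline_ms;      /* absolute expiry, in ms since start */
    struct dreq *next;
};

/* Head pointers for the three queues the text describes. */
static struct dreq *sorted_q, *read_fifo, *write_fifo;

static long now_ms(void)
{
    return (long)time(NULL) * 1000;   /* coarse clock, fine for a sketch */
}

static int expired(struct dreq *r)
{
    return r && r->deadline_ms <= now_ms();
}

/* Pick the next request to push to the dispatch queue. */
static struct dreq *deadline_dispatch(void)
{
    /* An expired FIFO head preempts the sorted queue; otherwise
     * serve the sorted queue to keep seeks minimal. */
    if (expired(read_fifo))
        return read_fifo;
    if (expired(write_fifo))
        return write_fifo;
    return sorted_q;
}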

 

 

The deadline I/O scheduler is implemented in the file drivers/block/deadline-iosched.c.

 

Although the deadline I/O scheduler does a great deal to keep read latency down, it does so at the expense of throughput. Consider a system in the middle of heavy write activity: every time a read request is submitted, the I/O scheduler rushes to handle it, so the disk seeks over to the read's location, performs the read, then seeks back to resume the writes, and it repeats this round trip for every read. This behavior significantly hurts the system's global throughput.

This is where the anticipatory I/O scheduler comes in. It is built on the deadline I/O scheduler, and its major improvement is the addition of an anticipation heuristic. The difference is that after servicing a read, it does not immediately return to handling other requests; instead, it deliberately idles for a few milliseconds (6 ms by default). Those few milliseconds are a good opportunity for the application to submit another read: any request for an adjacent area of the disk is handled at once. After the waiting period ends, the anticipatory I/O scheduler seeks back to where it left off and continues with the remaining requests.

The point is this: if waiting can eliminate even some of the back-and-forth seeks that read requests cause, then spending a little time anticipating more requests is worth it. If an adjacent I/O request does arrive within the waiting period, the scheduler saves a pair of seeks, and the more read requests aimed at the same region, the more seeks a single wait avoids. Of course, if no I/O request arrives during the wait, the anticipatory I/O scheduler loses the bet and wastes a few milliseconds. The scheduler's payoff therefore hinges on correctly anticipating the behavior of applications and file systems, which it does through a set of heuristics and statistics.

The anticipatory I/O scheduler is implemented in the file drivers/block/as-iosched.c. A block device can use any of the available I/O schedulers; the anticipatory I/O scheduler is the default.
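For completeness: the scheduler in use can be selected per device at run time through sysfs (or at boot with the elevator= kernel parameter). A small user-space C sketch, assuming the device is sda:

#include <stdio.h>

int main(void)
{
    /* Reading this file lists the available schedulers, with the
     * active one in brackets; writing a name switches schedulers. */
    const char *path = "/sys/block/sda/queue/scheduler";

    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fputs("deadline", f);   /* switch sda to the deadline scheduler */
    fclose(f);
    return 0;
}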
