Read Path Analysis for Block Devices

Source: Internet
Author: User

We will not cover generic VFS reading here. This article takes the following function as its root and analyzes downward from it:

 

do_generic_mapping_read(*ppos,*mapping,*desc)

 

The purpose of this function is to read data from disk into user space.

It first reads pages, starting at the page containing *ppos and continuing until *ppos + desc->count is covered,

then copies desc->count bytes from those pages to user space.

That is, reading from disk into the page cache is done page by page, while copying from the page cache to user space is done byte by byte.
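As a rough sketch of these two granularities (a user-space model, not kernel code; `disk_read_page`, `generic_read`, and the fake disk contents are invented for illustration):

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Fake "disk": every byte of page N reads back as N (mod 256).
 * Illustrative stand-in for the readpage path, not a kernel API. */
static void disk_read_page(char *dst, unsigned long page_index)
{
    memset(dst, (int)(page_index & 0xff), PAGE_SIZE);
}

/* Copy `count` bytes starting at byte offset `ppos` into `user_buf`. */
static void generic_read(char *user_buf, unsigned long ppos, unsigned long count)
{
    char page[PAGE_SIZE];

    while (count > 0) {
        unsigned long index  = ppos / PAGE_SIZE;   /* which page          */
        unsigned long offset = ppos % PAGE_SIZE;   /* offset in that page */
        unsigned long n = PAGE_SIZE - offset;      /* bytes left in page  */

        if (n > count)
            n = count;

        disk_read_page(page, index);         /* disk -> page cache: by page */
        memcpy(user_buf, page + offset, n);  /* page cache -> user: by byte */

        user_buf += n;
        ppos += n;
        count -= n;
    }
}
```

A read of 100 bytes starting at offset 4090 touches two pages: 6 bytes come from page 0 and the remaining 94 from page 1, yet the model still pulls each page from "disk" whole.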

 

The core of the function is the call

 

mapping->a_ops->readpage(filp, page);

 

which reads the disk data into the specified page.

This readpage callback is filesystem-specific.

Taking ext2 as an example, it is ext2_readpage, which is actually

 

mpage_readpage(page, ext2_get_block);

 

Now let's dig into do_mpage_readpage:

The main task of mpage_readpage() is to determine whether the blocks cached in the page are contiguous on disk.
If they are, a single bio request can be submitted and the function returns.
If not, block_read_full_page is called to submit one bio request per block of the page.

Note: in a bio's bio_vec array, each member represents one contiguous data buffer with its disk address.

That is, a segment.

 

Since we want to read this page's data from disk, we need to find the disk location corresponding to the page.
A page corresponds to a fixed position in the file, and the file's inode knows where the data is stored.
Therefore, the inode must first be retrieved from the page.

What is this inode used for? Chiefly to retrieve the device (dev) on which the file resides; without it, we could not locate the data on disk (the inode is mainly used by the get_block function).

 

struct inode *inode = page->mapping->host;

 

In addition, the block size supported by the inode is obtained, for the offset calculations that follow.

 

Consider the common case: suppose the page is divided into four buffer blocks (with a block size of 1 KB and a page size of 4 KB, each page holds four blocks).

It is very important to determine whether the data these four buffers correspond to

is contiguous on disk. Why does contiguity matter? We know that for block devices,

addressing nearby physical locations on the disk is far more efficient. So here we find some foreshadowing:

whenever possible, submit requests with nearby disk addresses; this is also the I/O scheduling discussed later.

Whether the blocks buffered in the page are contiguous on disk is recorded via page->private.

When is page->private set?
Roughly speaking, it is set when the kernel finds that the nth block is not consecutive on disk with the (n-1)th block.


That is a lot of theory, and it makes the head spin. Let's analyze two scenarios;
before looking at them, let's walk through an operation common to both.


First, find the file-relative index of the first block on this page; after all, the disk is operated on in blocks.
Start by computing the byte offset of the page within the file:
page->index << PAGE_CACHE_SHIFT
Here page->index is the index of the page within the mapping; shifting it left by PAGE_CACHE_SHIFT gives the page's byte offset within the mapping.
Dividing that value by the block size gives the block's index within the file:


(page->index << PAGE_CACHE_SHIFT) >> blkbits

That is

block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
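As a sanity check of this arithmetic (assuming PAGE_CACHE_SHIFT = 12 for 4 KB pages and blkbits = 10 for 1 KB blocks; the helper name is invented for illustration):

```c
#include <assert.h>

#define PAGE_CACHE_SHIFT 12   /* 4 KB pages */

/* File-relative index of the first block of a page.
 * Equivalent to (page_index << PAGE_CACHE_SHIFT) >> blkbits. */
static unsigned long first_block_in_file(unsigned long page_index,
                                         unsigned int blkbits)
{
    return page_index << (PAGE_CACHE_SHIFT - blkbits);
}
```

For example, page 3 starts at byte 12288 of the file, which with 1 KB blocks is block 12.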


Okay. Now let's look at two scenarios respectively.


Scenario 1) accessing a file for the first time.

 

Obviously, a fresh page is allocated in the mapping's address_space.
The page's private field is 0, so the code assumes the blocks buffered in the page are contiguous on disk.
It therefore processes all the blocks of the page in order (generally four blocks per page),
passing each block's file-relative number to the filesystem-specific get_block function, here ext2_get_block,
to compute each block's number on disk. Suppose the nth block's disk number is S(n);
adjacent blocks are compared to see whether the numbers are consecutive, i.e. whether S(n) equals S(n-1) + 1.
If not, the blocks buffered in the page are not contiguous on disk and need separate handling.
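The adjacency check can be sketched as follows (get_block is simulated with a precomputed array of disk block numbers; the function name is illustrative):

```c
#include <assert.h>

/* Return 1 if disk block numbers s[0..n-1] are consecutive,
 * i.e. s[i] == s[i-1] + 1 for every i; 0 otherwise. */
static int blocks_contiguous(const unsigned long *s, int n)
{
    int i;

    for (i = 1; i < n; i++)
        if (s[i] != s[i - 1] + 1)
            return 0;
    return 1;
}
```

A page whose four blocks map to disk blocks {100, 101, 102, 103} passes the test; {100, 101, 200, 201} fails it and falls into the per-block path.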


1.1) If the blocks are stored contiguously on disk

The block numbers are saved in order into a local blocks array, and a new bio is allocated.

The key step is setting bio->bi_sector to the first sector of these blocks (possible thanks to the disk contiguity).
A bio_vec is allocated; its page is set to this page, its offset to 0 (the offset within the page),
and its length to PAGE_SIZE (ignoring file holes).
Finally, submit_bio hands it to the block device layer.


As you can see, in the contiguous case no buffer heads are allocated for the page, and page->private
is not set.
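The single-bio setup for the contiguous case can be modeled like this (a user-space sketch; the `struct bio`/`struct bio_vec` here are simplified stand-ins for the kernel structures, with one bio_vec instead of an array):

```c
#include <assert.h>

#define PAGE_SIZE    4096
#define SECTOR_SHIFT 9            /* 512-byte sectors */

struct bio_vec { void *bv_page; unsigned bv_len; unsigned bv_offset; };
struct bio     { unsigned long bi_sector; struct bio_vec vec; };

/* One bio for the whole page: possible only because the blocks are
 * contiguous on disk, so a single starting sector covers them all. */
static void setup_page_bio(struct bio *bio, void *page,
                           unsigned long first_blocknr, unsigned blkbits)
{
    bio->bi_sector   = first_blocknr << (blkbits - SECTOR_SHIFT);
    bio->vec.bv_page   = page;
    bio->vec.bv_offset = 0;           /* offset within the page      */
    bio->vec.bv_len    = PAGE_SIZE;   /* the entire page in one shot */
}
```

With 1 KB blocks (blkbits = 10), a page starting at disk block 100 yields bi_sector 200, since each block spans two 512-byte sectors.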


1.2) If the blocks are not stored contiguously on disk

A bio must be submitted separately for each block of the page.

This is done by block_read_full_page.
It first checks the page's private flag; if it is not set, new buffer heads must be allocated to describe the page.
That is done via create_empty_buffers.

Similarly, the index of the page's first block is computed from the page's index in the mapping, and then
for the four blocks from index to index + 3, ext2_get_block is called to compute each one's disk number b_blocknr, producing the all-important
bh tuple (dev, b_blocknr). A bio is then submitted for each of the four bhs, i.e. submit_bh(READ, bh);
submit_bh generates one bio.
The bio's memory buffer (the destination of the file read):
bio_vec[0].bv_page is set to the page, bio_vec[0].bv_len to the block size,
and bio_vec[0].bv_offset (the offset within the page) to the corresponding block's offset within the page.


The bio's disk address (the source of the file read):
bio->bi_sector is computed from the earlier get_block result, bh->b_blocknr.
bio->bi_bdev is set to dev, the block device holding the file; with dev and the device-relative logical block number, the sector location on the block device's disk can be determined.
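The per-block sector and in-page offset arithmetic can be sketched as follows (illustrative helpers, not the kernel's submit_bh; 512-byte sectors assumed):

```c
#include <assert.h>

#define SECTOR_SHIFT 9   /* 512-byte sectors */

/* Starting sector on disk of the block numbered b_blocknr. */
static unsigned long block_to_sector(unsigned long b_blocknr, unsigned blkbits)
{
    return b_blocknr << (blkbits - SECTOR_SHIFT);
}

/* Offset within the page of the i-th block of that page. */
static unsigned block_offset_in_page(int i, unsigned blkbits)
{
    return (unsigned)i << blkbits;
}
```

With 1 KB blocks, disk block 7 starts at sector 14, and the third block of a page (i = 2) sits at in-page offset 2048; those are the values that end up in bi_sector and bv_offset of each of the four bios.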



Scenario 2) a previously accessed file


From scenario 1 we know that if the block data in the file's page is stored discontiguously on disk, the page is attached to
a linked list of buffer heads; if it is contiguous, the page's private field is empty.


For contiguous storage, every pass through do_mpage_readpage executes ext2_get_block for all four blocks
to check whether adjacent blocks are contiguous on disk. In other words, the code is not optimized for the contiguous case:
each block still has to descend into the filesystem's get_block path to find its corresponding disk sector.


For discontiguous storage, the page's private field holds the buffer heads set up during the previous access
(that is, bhs carrying the BH_Mapped flag, indicating that the buffer head's b_bdev and b_blocknr are valid). Each block's
sector position on disk can therefore be found from the previous result, the saved bh list.


If you wanted to optimize the contiguous-storage scenario, presumably a flag could be added to page->flags recording whether the corresponding
blocks are contiguous on disk. The disk could then be accessed directly using the block-to-sector information obtained on the first access,

removing the need to descend into the get_block path to look up disk sector numbers, which might improve performance.

 

Next, a brief look at submit_bio.
It can be thought of as encapsulating the submitted bio into a request and then inserting it, according to certain rules, into the request queue associated with the block device.
The block device's request queue is allocated by the block device driver, and each block device has exactly one request queue.
Therefore, insertions into the queue for the same block device must be mutually exclusive. However, if the system has N disks,
even if those disks are served by the same block device driver, N request queues are needed, so that operations on each disk are independent of one another,
improving performance.



The function that inserts into the request queue is q->make_request_fn. It is expected to take the bios submitted by the upper layer,
sort and merge them, encapsulate them into request structures, and insert them into queue q. This sorting and merging algorithm is
the legendary I/O scheduling.


For a request, as for a bio, it is very important that its interval on the disk be contiguous.


For a block device there are two queues: the queue q provided by the driver, and the internal queue of each I/O scheduling algorithm.
After receiving a bio, the I/O scheduler merges it into a request, inserts the request in order into its internal queue, and finally
transfers suitable requests to the driver queue q, from which the driver extracts them.


Let's see how merging and sorting are defined.
Merging means combining requests with adjacent disk ranges into one request.
For example, suppose the disk range of a request already in the queue is

(1024, 1024+1024).

If a bio's request range is
(2048, 2048+512),

then the bio can be back-merged, giving

request (1024, 1024+1536).


If instead the bio's request interval is

(512, 512+512),

it can be front-merged, giving

request (512, 512+1536).
If the bio's request interval is

(3072, 3072+512),

this bio cannot be merged, and a new request is required.
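The back/front merge tests on disk ranges can be sketched like this (ranges are [start, start+len); the names are illustrative, not the elevator's real API):

```c
#include <assert.h>

/* A request's disk range: [start, start + len). */
struct req { unsigned long start, len; };

/* Back merge: the bio begins exactly where the request ends. */
static int try_back_merge(struct req *r, unsigned long bio_start,
                          unsigned long bio_len)
{
    if (bio_start != r->start + r->len)
        return 0;
    r->len += bio_len;
    return 1;
}

/* Front merge: the bio ends exactly where the request begins. */
static int try_front_merge(struct req *r, unsigned long bio_start,
                           unsigned long bio_len)
{
    if (bio_start + bio_len != r->start)
        return 0;
    r->start = bio_start;
    r->len += bio_len;
    return 1;
}
```

Starting from a request (1024, 1024+1024), a bio at (2048, 2048+512) back-merges, a bio at (512, 512+512) front-merges, and a bio at (3072, 3072+512) matches neither end, so a new request must be allocated for it.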


If it cannot be merged, the newly generated request must be inserted at the appropriate position in the I/O scheduling algorithm's queue.


Back to q->make_request_fn. During driver initialization, a value for q->make_request_fn can be specified; if none is,
the default is __make_request.
__make_request performs the I/O scheduling, that is, the merging and sorting.
The merge code is in elv_merge(q, &req, bio);
if the bio cannot be merged, get_request allocates a new request, the bio is attached to it, and then
add_request performs a sorted insert:

__elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);


For the deadline elevator algorithm, __elv_add_request inserts into a red-black tree keyed by request->sector,
and deadline_move_request later moves an entry from the rb-tree to the driver's queue q.
As for how that request is chosen, that is the core of the algorithm; since it is already midnight for the author,
the analysis stops here and will continue another time.


Finally, the device driver must provide a request function, which traverses the driver's queue q,
takes requests off one by one, iterates over each segment of a request, and hands each segment's data to the SCSI layer;

at which point data transfer is complete.

 

When is the driver activated to process these requests? The answer: on a timer.

When the timer expires, a kblockd thread is woken up; kblockd executes blk_unplug_work, which finally invokes the driver's request function.

The specific process code is as follows:

 

static void blk_unplug_timeout(unsigned long data)
{
        request_queue_t *q = (request_queue_t *)data;

        blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
                              q->rq.count[READ] + q->rq.count[WRITE]);
        kblockd_schedule_work(&q->unplug_work);
}

INIT_WORK(&q->unplug_work, blk_unplug_work);

static void blk_unplug_work(struct work_struct *work)
{
        request_queue_t *q = container_of(work, request_queue_t, unplug_work);

        blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
                              q->rq.count[READ] + q->rq.count[WRITE]);
        q->unplug_fn(q);
}

q->unplug_fn = generic_unplug_device;

void __generic_unplug_device(request_queue_t *q)
{
        if (unlikely(blk_queue_stopped(q)))
                return;

        if (!blk_remove_plug(q))
                return;

        q->request_fn(q);       /* the driver's request function */
}

 

I/O scheduling can control the pace of plugging and unplugging so as to accumulate as many requests with contiguous disk addresses as possible, improving disk access efficiency.
