Analysis of read/write processes in the MD Module-2

Source: Internet
Author: User
This section describes the read/write process in the raid5 module. The process is complex, and its most critical function is handle_stripe, which may be called several times to complete a single read or write operation. handle_stripe is also the core function of the raid5 module as a whole: it is responsible for synchronization, reconstruction, and expansion as well. Before the analysis, some background is needed:

1. Stripe: RAID5 uses the stripe as its basic unit for accessing data, as shown in the figure.

[Figure: RAID5 data layout]

RAID5 also has other data layouts, which are not listed here. Block0, block1, block2, and the other data blocks in the figure are logically contiguous.

It is worth noting that in MD, the smallest unit in which raid5 processes data is a small stripe made up of 4 kB per device, that is, one page, rather than a stripe of full-size blocks. From here on, "stripe" refers to this small 4 kB stripe. Its data structure is as follows:

struct stripe_head {
	struct hlist_node hash;
	struct list_head lru;		/* inactive_list or handle_list */
	struct raid5_private_data *raid_conf;
	sector_t sector;		/* sector of this row */
	int pd_idx;			/* parity disk index */
	unsigned long state;		/* state flags */
	atomic_t count;			/* nr of active thread/requests */
	spinlock_t lock;
	int bm_seq;			/* sequence number for bitmap flushes */
	int disks;			/* disks in stripe */
	struct r5dev {
		struct bio req;
		struct bio_vec vec;
		struct page *page;
		struct bio *toread, *towrite, *written;
		sector_t sector;	/* sector of this page */
		unsigned long flags;
	} dev[1];	/* allocated with extra space depending of RAID geometry */
};

 

 

Meanings of the main fields:

hash: the entry that links the stripe into the hash table of the stripe cache.

lru: the list the stripe is currently on (inactive_list or handle_list).

sector: the sector number of the stripe, expressed as an offset from the start of a single member disk.

state: the state flags of the stripe.

r5dev: describes the buffer of each device in the stripe. This struct is the smallest unit of I/O processing. After a request is split up, each piece is added to a linked list in the corresponding r5dev of the corresponding stripe, that is, the toread or towrite list in the struct (linked through bio->bi_next). Within this struct, the req field is the request bio and vec is the segment of that bio. The sector field is the logical sector number of this r5dev within the array. The flags field holds the state of the device buffer; the possible values are defined in raid5.h.
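The following is a minimal sketch of how requests might be chained onto the toread or towrite list of an r5dev through bi_next. The types bio_sketch and r5dev_sketch and the function queue_bio are hypothetical stand-ins, not the kernel definitions. Keeping the list sorted by start sector and rejecting overlapping requests are assumptions of this sketch, not something the text above states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Hypothetical stand-in for struct bio: only the fields used here. */
struct bio_sketch {
    sector_t bi_sector;          /* first 512-byte sector of the request */
    unsigned bi_size;            /* request length in bytes */
    struct bio_sketch *bi_next;  /* next request on the same list */
};

/* Hypothetical stand-in for struct r5dev: only the two request lists. */
struct r5dev_sketch {
    struct bio_sketch *toread;
    struct bio_sketch *towrite;
};

/*
 * Insert bi into *list, keeping the list sorted by start sector.
 * Returns 1 on success, 0 if bi overlaps a request already queued.
 */
static int queue_bio(struct bio_sketch **list, struct bio_sketch *bi)
{
    assert(list != NULL && bi != NULL);

    struct bio_sketch **bip = list;
    sector_t end = bi->bi_sector + (bi->bi_size >> 9);

    /* Walk past every request that starts before bi. */
    while (*bip && (*bip)->bi_sector < bi->bi_sector) {
        if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
            return 0;            /* earlier request runs into bi */
        bip = &(*bip)->bi_next;
    }
    /* The next request must not start before bi ends. */
    if (*bip && (*bip)->bi_sector < end)
        return 0;

    bi->bi_next = *bip;
    *bip = bi;
    return 1;
}
```

A caller would pass &dev.toread for a read and &dev.towrite for a write, where dev is the r5dev_sketch of the target device.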

2. State of the device buffers in a stripe

Of the flags in the r5dev flags field mentioned above, two are especially important:

#define R5_UPTODATE 0 /* page contains current data */

#define R5_LOCKED 1 /* IO has been submitted on "req" */

Together, these two flags represent four buffer states:

Empty (!R5_LOCKED, !R5_UPTODATE): the buffer is empty.

Want (R5_LOCKED, !R5_UPTODATE): data has been requested for the buffer.

Clean (!R5_LOCKED, R5_UPTODATE): the data in the buffer is consistent with the data on disk.

Dirty (R5_LOCKED, R5_UPTODATE): the buffer holds new data that must be written to disk.

The buffer state changes as data is read and written, as the later analysis shows.
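The four states can be sketched as a small classification function. The enum buf_state and the function buffer_state are hypothetical names introduced for illustration; only the two flag definitions come from the text above, and the sketch treats them as bit numbers within flags.

```c
#define R5_UPTODATE 0 /* page contains current data */
#define R5_LOCKED   1 /* IO has been submitted on "req" */

/* Hypothetical names for the four states described in the text. */
enum buf_state { BUF_EMPTY, BUF_WANT, BUF_CLEAN, BUF_DIRTY };

/* Classify a device buffer from the two bits in its flags word. */
static enum buf_state buffer_state(unsigned long flags)
{
    int uptodate = (int)((flags >> R5_UPTODATE) & 1UL);
    int locked   = (int)((flags >> R5_LOCKED) & 1UL);

    if (locked)
        return uptodate ? BUF_DIRTY : BUF_WANT;
    return uptodate ? BUF_CLEAN : BUF_EMPTY;
}
```

For example, a buffer whose flags have only the R5_LOCKED bit set is classified as Want, and one with both bits set as Dirty.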

 

Next, let's look at the make_request function. Its job is to distribute the request: it decides which stripe, and which device within that stripe, each part of the bio is added to (on that device's read or write list). The process is as follows:

(For now, skip the following code segment; it is covered later:

if (rw == READ &&
    mddev->reshape_position == MaxSector &&
    chunk_aligned_read(q, bi))
	return 0;

)

 

A. Call the md_write_start function to determine whether the metadata needs to be updated. Some RAID levels, such as raid1 and raid5, are redundant. Before data is written, the metadata is updated so that, if a write fails and leaves the data inconsistent, a synchronization is started the next time the array is brought up. The metadata is updated again after the write completes. MD tracks this with the in_sync field: in_sync = 1 means the array is synchronized. This raises a question: must the metadata be updated twice for every write request? To avoid this, md uses a timer, so the metadata is updated only when no further write requests arrive within a set number of milliseconds. This is done by resetting the safemode value.

B. Calculate the bio's starting logical sector, logical_sector, and its end, last_sector. logical_sector is computed as logical_sector = bi->bi_sector & ~((sector_t)STRIPE_SECTORS-1); that is, the bio's starting sector number is aligned down to a stripe boundary. For example, if bi_sector = 6, then logical_sector = 0.

C. Enter the loop. For each logical_sector, first check whether the array is being resized (conf->expand_progress != MaxSector). This supports online resizing, that is, the array remains accessible while it is being resized. If it is, check whether logical_sector falls within the range currently being expanded. If so, the request sleeps; otherwise it is processed as appropriate. I will not go into detail here; this is covered in the discussion of resizing.

D. Use the raid5_compute_sector function to calculate new_sector, the offset of logical_sector from the start of the disk it resides on. The same function also determines the device number dd_index and the parity disk number pd_index for logical_sector.

E. Obtain the stripe for the new_sector calculated in step D. The get_active_stripe function first checks whether the stripe is already in the stripe cache. If so, it is returned directly. Otherwise, the function tries to take an inactive stripe from inactive_list; if it finds one, it calls init_stripe to initialize it, and if not, it sleeps. The third parameter of this function indicates whether to obtain the active stripe in non-blocking mode. For reads and writes, the value passed is (bi->bi_rw & RWA_MASK), which is non-zero only for read-ahead requests, so ordinary reads and writes use the blocking mode. This guarantees that a stripe is always obtained for a read or write. Once the stripe has been obtained, the array checks whether the request must be retried because of expansion: obtaining the stripe may have slept, so a condition such as logical_sector > expand_progress that held before the sleep may no longer hold afterwards.

F. Use add_stripe_bio to insert the bio into the stripe. Note that one bio may be inserted into several stripes, and the bio is complete only when every one of those stripes has been processed. The bi_phys_segments field serves as the counter: it is incremented when the bio is inserted into a stripe and decremented when a stripe finishes processing it. add_stripe_bio also determines whether the inserted bio overwrites the entire r5dev, that is, whether it is a full write. This is useful when processing write requests (in rcw mode).

G. At this point the bio has been added to the stripe, and the stripe must now be processed. This is done by the handle_stripe function. Because this function is complex, I will analyze it separately in the next section.

H. After handle_stripe returns, the release_stripe function is called. It adds the stripe to one of several lists according to the stripe's state so that it can be used later.
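The address arithmetic in steps B, D, and F can be sketched in user-space C. This is an illustration under stated assumptions, not the kernel code: struct geometry, stripe_align, compute_sector, and count_stripes are hypothetical names, the left-symmetric layout is assumed, and last_sector is assumed to be the bio's start sector plus its length in 512-byte sectors.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

#define STRIPE_SIZE    4096u               /* one 4 kB page per device */
#define STRIPE_SECTORS (STRIPE_SIZE >> 9)  /* 8 sectors of 512 bytes */

/* Hypothetical, simplified array geometry (left-symmetric layout assumed). */
struct geometry {
    unsigned raid_disks;         /* total member disks, including parity */
    unsigned sectors_per_chunk;  /* chunk size in 512-byte sectors */
};

/* Step B: align a sector down to the start of its 4 kB stripe. */
static sector_t stripe_align(sector_t sector)
{
    return sector & ~((sector_t)STRIPE_SECTORS - 1);
}

/*
 * Step D: map an array-wide logical sector to an offset on one member
 * disk, and report the data disk (dd_idx) and parity disk (pd_idx).
 */
static sector_t compute_sector(const struct geometry *g, sector_t logical,
                               unsigned *dd_idx, unsigned *pd_idx)
{
    assert(g->raid_disks >= 2 && g->sectors_per_chunk > 0);

    unsigned data_disks = g->raid_disks - 1;
    sector_t chunk_offset = logical % g->sectors_per_chunk;
    sector_t chunk_number = logical / g->sectors_per_chunk;
    sector_t row = chunk_number / data_disks;      /* row of chunks */
    unsigned d = (unsigned)(chunk_number % data_disks);

    /* Left-symmetric: parity rotates backwards, data starts after it. */
    *pd_idx = data_disks - (unsigned)(row % g->raid_disks);
    *dd_idx = (*pd_idx + 1 + d) % g->raid_disks;

    return row * g->sectors_per_chunk + chunk_offset;
}

/*
 * Step F, reduced to counting: the number of 4 kB stripes a bio touches,
 * which is the value the per-bio counter would reach after insertion.
 */
static unsigned count_stripes(sector_t bi_sector, unsigned bi_size)
{
    sector_t last_sector = bi_sector + (bi_size >> 9);
    unsigned n = 0;

    for (sector_t s = stripe_align(bi_sector); s < last_sector;
         s += STRIPE_SECTORS)
        n++;
    return n;
}
```

With a 4-disk array and 128-sector chunks, compute_sector maps logical sector 130 to offset 2 on disk 1 with parity on disk 3. A 4096-byte bio starting at sector 6 touches two stripes, while the same bio starting at sector 8 touches only one.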

 

In the next section, I will analyze the handle_stripe function using a simple read/write operation. Reading is relatively simple; writing is more involved because it includes delayed writes.
