NetEase Video Cloud is a cloud-based PaaS service built by NetEase on a distributed multimedia processing cluster and professional audio and video technology. It provides stable, smooth, low-latency, high-concurrency live streaming, recording, storage, transcoding, and VOD, so that enterprise users in online education, telemedicine, entertainment shows, online finance, and other industries can build an online audio and video platform with only simple development. In this article, technical experts from NetEase Video Cloud share an analysis of the Linux software RAID bitmap.
Disk arrays such as RAID1 and RAID5 place very high demands on data reliability. RAID5 must compute and write a checksum (parity) on every write, and RAID1 must write both the source and the mirror to keep the data consistent. A write can be interrupted by unstable factors such as disk damage or a system failure, so after the system comes back the RAID also has to be recovered. The traditional way to recover is to scan the whole array and recompute the parity, or to do a full synchronization. If the disks are large, this recovery takes a long time, other failures may occur in the meantime, and the business is affected more heavily.
Take RAID1 as an example. After a failure, most of the data on the two disks is in fact already consistent and only a small part may differ, so a full scan is unnecessary. But the system does not know which parts of the two disks are consistent, so it needs a place to record what has been synchronized. This is why the bitmap was introduced. In short, the bitmap records which data in the RAID is consistent and which is not, so that recovery can be an incremental synchronization instead of a full one, which shortens the recovery time.
1. Using the bitmap
Using the bitmap is fairly simple, and the mdadm help documentation describes it in detail. There are two kinds of bitmap: internal and external.
An internal bitmap is stored near the superblock of each member disk of the RAID device (either before or after it), while an external bitmap is kept in a separate file.
Here is a brief introduction to using the bitmap.
# mdadm --create /dev/md/test_md1 --run --force --metadata=0.9 --assume-clean --bitmap=/mnt/test/bitmap_md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
mdadm: /dev/sdb appears to be part of a raid array:
    level=raid1 devices=2 ctime=Tue Dec 17 21:00:58 2013
mdadm: array /dev/md/test_md1 started.
Check the status of the MD device:
# cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 sdc[1] sdb[0]
      2097216 blocks [2/2] [UU]
      bitmap: 1/257 pages [4KB], 4KB chunk, file: /mnt/test/bitmap_md1
The last line is the bitmap information for md126. The bitmap chunk size here is the default 4KB; it can be set with --bitmap-chunk. The bitmap chunk means that 1 bit in the bitmap corresponds to one chunk (of the bitmap chunk size) of the MD device.
The bitmap information shown by cat /proc/mdstat is explained below.
"4KB chunk" means that the bitmap chunk size is 4KB.
"1/257 pages" refers to the in-memory bitmap that backs the bitmap (it serves as a cache of the on-disk bitmap and makes bitmap operations more efficient). 257 is the total number of pages of the in-memory bitmap and 1 is the number of pages allocated so far; the pages are allocated dynamically and can be reclaimed once they are no longer needed. The in-memory bitmap uses 16 bits to describe one chunk, of which 14 bits count the write IOs in progress on that chunk (described in more detail later).
"[4KB]" is the total size of the allocated in-memory bitmap pages.
Total number of chunks = MD device size / bitmap chunk size
Chunks represented by one in-memory bitmap page = PAGE_SIZE * 8 / 16
Total number of in-memory bitmap pages = total number of chunks / chunks represented by one in-memory bitmap page
In the example above, the total number of chunks is 2097216KB / 4KB = 524304.
Taking a page size of 4096 bytes, one page can represent 4096 * 8 / 16 = 2048 chunks.
So 524304 / 2048 ≈ 256.01 pages are needed. The value actually shown is 257 because the division is not exact: the result is rounded up, and the last page is not fully used.
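The page-count arithmetic above can be checked with a few lines of C. This is only a sketch of the calculation, not kernel code: the function names are made up for illustration, and rounding the chunk count up when the device size is not a multiple of the chunk size is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Bits the in-memory bitmap spends on one chunk (from the text above). */
#define COUNTER_BITS_PER_CHUNK 16u

/* Integer division that rounds up instead of down. */
static uint64_t div_round_up(uint64_t n, uint64_t d)
{
    assert(d != 0);
    return (n + d - 1) / d;
}

/* Chunks on the MD device: device size / bitmap chunk size (both in KB). */
static uint64_t total_chunks(uint64_t dev_kb, uint64_t chunk_kb)
{
    return div_round_up(dev_kb, chunk_kb);
}

/* Chunks that one in-memory bitmap page can describe. */
static uint64_t chunks_per_page(uint64_t page_size_bytes)
{
    return page_size_bytes * 8u / COUNTER_BITS_PER_CHUNK;
}

/* Pages needed for the whole in-memory bitmap; the last one may be partial. */
static uint64_t memory_bitmap_pages(uint64_t dev_kb, uint64_t chunk_kb,
                                    uint64_t page_size_bytes)
{
    return div_round_up(total_chunks(dev_kb, chunk_kb),
                        chunks_per_page(page_size_bytes));
}
```

Calling memory_bitmap_pages(2097216, 4, 4096) reproduces the 257 pages reported by /proc/mdstat in the example.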
2. In-memory structure of the bitmap
The bitmap structure has many fields. Only a few important ones are of interest here; they are explained below to make the later analysis easier.
struct bitmap {
    struct bitmap_page *bp;        /* points to the in-memory bitmap pages */
    ......
    unsigned long chunks;          /* total number of chunks in the array */
    ......
    struct file *file;             /* the bitmap file */
    ......
    struct page **filemap;         /* cache pages of the bitmap file */
    unsigned long *filemap_attr;   /* attributes of the bitmap file cache pages */
    ......
};
The struct bitmap_page structure is as follows:
struct bitmap_page {
    char *map;   /* points to the memory page actually allocated */
    /*
     * In emergencies (when map cannot be alloced), hijack the map
     * pointer and use it as counters itself
     */
    unsigned int hijacked:1;
    /*
     * Count of dirty bits on the page
     */
    unsigned int count:31;   /* number of dirty chunks on the page; every 16 bits represent one chunk */
};
In each dynamically allocated memory page, every 16 bits correspond to one bit of the bitmap file and represent one chunk of the MD device.
The 16 bits are used as follows:
    15       14     13                       0
+--------+--------+---------------------------+
| needed | resync |          counter          |
+--------+--------+---------------------------+
The highest bit indicates whether synchronization is needed, the next bit indicates whether synchronization is in progress, and the low 14 bits are a counter of the write IOs in progress on the chunk.
For convenience, this 14-bit counter is called bmc below. The bmc values 0, 1, and 2 are special: 0 means the chunk has not been written and its bit in the bitmap is not set; 1 means the bit in the bitmap has been set; 2 means all writes have just finished. The actual count of write IOs starts from 2.
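The 16-bit layout can be expressed with a handful of masks. The sketch below is a user-space illustration with names chosen for this article rather than a copy of the kernel header. The writes_in_flight helper reads "the count starts from 2" as meaning that a counter value above 2 stands for (value - 2) writes in progress, which is an interpretation of the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One 16-bit entry of the in-memory bitmap describes one chunk. */
typedef uint16_t chunk_counter_t;

#define NEEDED_MASK ((chunk_counter_t)(1u << 15))          /* bit 15: sync needed */
#define RESYNC_MASK ((chunk_counter_t)(1u << 14))          /* bit 14: sync running */
#define COUNTER_MAX ((chunk_counter_t)(RESYNC_MASK - 1))   /* bits 13..0 */

static bool chunk_needs_sync(chunk_counter_t c)
{
    return (c & NEEDED_MASK) != 0;
}

static bool chunk_in_resync(chunk_counter_t c)
{
    return (c & RESYNC_MASK) != 0;
}

/* The 14-bit counter, called bmc in the text. */
static unsigned chunk_counter(chunk_counter_t c)
{
    return (unsigned)(c & COUNTER_MAX);
}

/* Assumed reading of the 0/1/2 convention: values above 2 count writes. */
static unsigned writes_in_flight(chunk_counter_t c)
{
    unsigned bmc = chunk_counter(c);

    assert(bmc <= COUNTER_MAX);
    return bmc > 2 ? bmc - 2 : 0;
}
```

With 14 counter bits, COUNTER_MAX is 16383, which is the upper limit on the counter for one chunk.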
The filemap field in the bitmap structure is the cache of the bitmap file. The filemap cache is as large as the bitmap file and is allocated at initialization.
filemap_attr holds the attributes of the bitmap file cache pages, using 4 bits per cached page.
Bit 0 is BITMAP_PAGE_DIRTY. When set, a bit has been marked dirty in the in-memory bitmap but the corresponding bit in the bitmap file is not yet dirty, so a page with this flag must be flushed to disk synchronously (write_page is actually called asynchronously, but the caller waits until the write has completed).
Bit 1 is BITMAP_PAGE_PENDING. When set, a dirty bit in the in-memory bitmap has been cleared but the corresponding dirty bit in the on-disk bitmap file has not, and still needs to be cleared. This is a transitional state that moves on to BITMAP_PAGE_NEEDWRITE.
Bit 2 is BITMAP_PAGE_NEEDWRITE. It means the data in the bitmap cache must be synchronized to the on-disk bitmap file. Pages with this flag only need to be written asynchronously, because even if the write fails the worst outcome is some extra synchronization, never damage to the data.
Bit 3 does not appear in the code and is presumably reserved.
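A minimal user-space sketch of such a packed attribute array, with 4 bits per cache page, might look like the following. The enum values follow the bit numbers given above; the helper names and the array handling are assumptions made for illustration and are not the kernel's own functions.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Attribute bit numbers as listed in the text; bit 3 is unused. */
enum page_attr {
    PAGE_ATTR_DIRTY = 0,
    PAGE_ATTR_PENDING = 1,
    PAGE_ATTR_NEEDWRITE = 2
};

#define ATTR_BITS_PER_PAGE 4u
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned long words needed to hold attributes for `pages` pages. */
static size_t attr_words(size_t pages)
{
    return (pages * ATTR_BITS_PER_PAGE + BITS_PER_WORD - 1) / BITS_PER_WORD;
}

/* Position of one attribute of one page inside the packed bit array. */
static size_t attr_bit(size_t page_index, enum page_attr attr)
{
    assert(attr <= PAGE_ATTR_NEEDWRITE);
    return page_index * ATTR_BITS_PER_PAGE + (size_t)attr;
}

static void attr_set(unsigned long *attrs, size_t page_index, enum page_attr attr)
{
    size_t bit = attr_bit(page_index, attr);

    attrs[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

static void attr_clear(unsigned long *attrs, size_t page_index, enum page_attr attr)
{
    size_t bit = attr_bit(page_index, attr);

    attrs[bit / BITS_PER_WORD] &= ~(1UL << (bit % BITS_PER_WORD));
}

static bool attr_test(const unsigned long *attrs, size_t page_index, enum page_attr attr)
{
    size_t bit = attr_bit(page_index, attr);

    return (attrs[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL;
}
```

The caller is expected to allocate attr_words(pages) zeroed words before using the set, clear, and test helpers.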
The relationship between the in-memory bitmap pages in bp and the filemap pages of the bitmap file is as follows:
One bitmap file cache page of 4096 bytes can represent 4096 * 8 = 32768 chunks, one bit per chunk, while covering the same chunks in the in-memory bitmap takes 16 pages, since each of those pages represents 2048 chunks. The in-memory bitmap pages in the bp array actually control when bitmap bits are set and cleared. They also ensure that the number of IOs on one chunk does not exceed the maximum (the largest integer that 14 bits can hold); when the maximum is reached, the writer is scheduled out and has to wait.
3. Reliable flush mechanism of the bitmap
On a write, the corresponding position in the bitmap is first marked dirty, then the data is written, and after the write completes the mark is cleared. So how is it guaranteed that the data in the in-memory bitmap is reliably flushed to the bitmap file on disk for every write?
The general logic is: before a write IO is issued to the MD device, the chunk is marked dirty in the bitmap (and the mark is successfully flushed to disk); then the write IO is performed; and after the IO completes, the dirty mark is cleared. This requires the bitmap flush to happen before the normal data write. So how does the bitmap achieve this?
Take RAID1 as an example. MD calls bitmap_startwrite in make_request, but that function does not call write_page to flush the data to disk directly. Instead it calls bitmap_file_set_bit, which sets the bit and marks the page BITMAP_PAGE_DIRTY. The reason write_page is not called in bitmap_startwrite is that IO on a block device goes through queues: there is no guarantee that each IO completes promptly, and the IO scheduler may reorder requests. If write_page were called directly there, the bitmap flush and the normal data write could end up in the wrong order.
BITMAP_PAGE_DIRTY is actually handled in bitmap_unplug. For RAID1, bitmap_unplug is called from flush_pending_writes in raid1.c, and flush_pending_writes is called by the RAID1 daemon raid1d. flush_pending_writes first calls bitmap_unplug to flush the bitmap to disk, then walks conf->pending_bio_list and takes out each bio to process the pending write IO. (make_request in RAID1 adds the mbio to conf->pending_bio_list.)
To sum up, when RAID1 receives a write IO request, it first marks the in-memory bitmap dirty and adds the write IO to the pending list. The raid1d daemon then flushes the in-memory bitmap pages marked dirty to the on-disk bitmap file, and only after that takes the pending write IOs off the pending list and processes them.
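The ordering guarantee can be illustrated with a toy user-space model. Everything in it is hypothetical (the struct, the field names, and the two functions only borrow their names from the kernel functions discussed above); it shows nothing more than the rule that the request path only queues work, and that the daemon path flushes the bitmap before any queued data write is issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one chunk; none of this is a kernel API. */
struct raid1_model {
    bool page_dirty;        /* bit set in memory, page not yet flushed */
    bool disk_bit;          /* bit as stored in the on-disk bitmap file */
    size_t pending_writes;  /* write IOs waiting on the pending list */
    size_t issued_writes;   /* data writes actually sent to the disks */
};

/* Request path: mark the bitmap dirty and queue the write; no disk IO yet. */
static void model_make_request(struct raid1_model *m)
{
    assert(m);
    m->page_dirty = true;
    m->pending_writes++;
}

/* Daemon path: flush the bitmap first, then issue the queued data writes. */
static size_t model_flush_pending_writes(struct raid1_model *m)
{
    size_t issued = 0;

    assert(m);
    if (m->page_dirty) {           /* stands in for bitmap_unplug() */
        m->disk_bit = true;
        m->page_dirty = false;
    }
    while (m->pending_writes > 0) {
        assert(m->disk_bit);       /* data never goes out before its bit */
        m->pending_writes--;
        issued++;
    }
    m->issued_writes += issued;
    return issued;
}
```

Because both steps run in the same daemon function, one after the other, the order cannot be swapped by the block layer queue, which is the point of deferring the flush.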
When dirty pages are flushed, the data in the bitmap file cache pages has to be written to the bitmap file. Because MD runs in kernel space, it does not write to the file through the usual write function. Instead it uses the bmap mechanism to map the file's data blocks to physical disk blocks through the inode, which lets the bitmap be flushed by calling submit_bh on those blocks.
The reliable flush mechanism described above covers how bitmap bits are set. The following section analyzes how they are cleared.
4. Clearing bitmap bits
The write IOs taken off the pending list above are then processed. When an IO completes, its dirty mark has to be cleared: the page attribute is set to BITMAP_PAGE_PENDING to show that a clear is on the way. A page with the BITMAP_PAGE_PENDING attribute is not flushed to the on-disk bitmap file immediately; the clear happens asynchronously. The actual clearing is implemented in bitmap_daemon_work, which is invoked when the RAID daemon (such as raid1d) runs periodically: the daemon calls md_check_recovery, and md_check_recovery calls bitmap_daemon_work, which clears bits according to the various states.
The implementation of bitmap_daemon_work is fairly complex, with many state checks and transitions that are easy to get lost in. Clearing a bitmap bit (clearing the bit in the in-memory bitmap page and flushing it to the on-disk bitmap file) takes three calls to bitmap_daemon_work. The clearing of a single bit is used as the example below. After the write IO completes, the counter bmc of this bit is set to 2 (provided all write IO on the chunk for this bit has completed) and the page is marked BITMAP_PAGE_PENDING; this is done in bitmap_endwrite.
1) First entry into bitmap_daemon_work: bmc = 2, and the page attribute is BITMAP_PAGE_PENDING.
The code first checks whether the page is BITMAP_PAGE_PENDING. This time it is, so the processing inside the following check is skipped:
if (!test_page_attr(bitmap, page, BITMAP_PAGE_PENDING)) {
    int need_write = test_page_attr(bitmap, page,
                                    BITMAP_PAGE_NEEDWRITE);
    if (need_write)
        clear_page_attr(bitmap, page, BITMAP_PAGE_NEEDWRITE);
    spin_unlock_irqrestore(&bitmap->lock, flags);
    if (need_write)
        write_page(bitmap, page, 0);
    spin_lock_irqsave(&bitmap->lock, flags);
    j |= (PAGE_BITS - 1);
    continue;
}
Execution continues. The next check tests whether the page is BITMAP_PAGE_NEEDWRITE. This time it is not, so the else branch runs and marks the page BITMAP_PAGE_NEEDWRITE:
if (lastpage != NULL) {
    if (test_page_attr(bitmap, lastpage,
                       BITMAP_PAGE_NEEDWRITE)) {
        clear_page_attr(bitmap, lastpage,
                        BITMAP_PAGE_NEEDWRITE);
        spin_unlock_irqrestore(&bitmap->lock, flags);
        write_page(bitmap, lastpage, 0);
    } else {
        set_page_attr(bitmap, lastpage,
                      BITMAP_PAGE_NEEDWRITE);
        bitmap->allclean = 0;
        spin_unlock_irqrestore(&bitmap->lock, flags);
    }
}
Continuing, because bmc is 2, bmc is set to 1 and BITMAP_PAGE_PENDING is set again:
if (*bmc) {
    if (*bmc == 1 && !bitmap->need_sync) {
        /* We can clear the bit */
        *bmc = 0;
        bitmap_count_page(bitmap,
                          (sector_t)j << CHUNK_BLOCK_SHIFT(bitmap),
                          -1);
        /* Clear the bit */
        paddr = kmap_atomic(page, KM_USER0);
        if (bitmap->flags & BITMAP_HOSTENDIAN)
            clear_bit(file_page_offset(bitmap, j),
                      paddr);
        else
            __clear_bit_le(
                file_page_offset(bitmap,
                                 j),
                paddr);
        kunmap_atomic(paddr, KM_USER0);
    } else if (*bmc <= 2) {
        /* This branch is taken on the first call: bmc is set to 1 */
        *bmc = 1; /* Maybe clear the bit next time */
        set_page_attr(bitmap, page, BITMAP_PAGE_PENDING);
        bitmap->allclean = 0;
    }
}
The first call ends here.
2) Second entry into bitmap_daemon_work: bmc = 1, and the page attributes are BITMAP_PAGE_PENDING and BITMAP_PAGE_NEEDWRITE.
This time the following branch runs; the bit is cleared, and BITMAP_PAGE_PENDING is not set again:
if (*bmc == 1 && !bitmap->need_sync) {
    /* We can clear the bit */
    *bmc = 0;
    bitmap_count_page(bitmap,
                      (sector_t)j << CHUNK_BLOCK_SHIFT(bitmap),
                      -1);
    /* Clear the bit */
    /* This is where the bit in the bitmap file cache page is really cleared */
    paddr = kmap_atomic(page, KM_USER0);
    if (bitmap->flags & BITMAP_HOSTENDIAN)
        clear_bit(file_page_offset(bitmap, j),
                  paddr);
    else
        __clear_bit_le(
            file_page_offset(bitmap,
                             j),
            paddr);
    kunmap_atomic(paddr, KM_USER0);
}
3) Third entry into bitmap_daemon_work: bmc = 0, and the page attribute is BITMAP_PAGE_NEEDWRITE.
This time the following branch runs: BITMAP_PAGE_NEEDWRITE is cleared and write_page is called to flush the page to disk, which completes the clear.
In total, three calls to bitmap_daemon_work are needed to clear one bit.
if (!test_page_attr(bitmap, page, BITMAP_PAGE_PENDING)) {
    int need_write = test_page_attr(bitmap, page,
                                    BITMAP_PAGE_NEEDWRITE);
    if (need_write)
        clear_page_attr(bitmap, page, BITMAP_PAGE_NEEDWRITE);
    spin_unlock_irqrestore(&bitmap->lock, flags);
    if (need_write)
        write_page(bitmap, page, 0);
    spin_lock_irqsave(&bitmap->lock, flags);
    j |= (PAGE_BITS - 1);
    continue;
}
The advantage of this asynchronous clearing mechanism is that if a write request arrives for the page while the bit has not yet been cleared, or has been cleared in the in-memory bitmap but not yet flushed to disk, it is enough to increment the bmc counter or set the in-memory bitmap again. Nothing has to be written to the on-disk bitmap file, which saves one bitmap file write IO.
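The three passes can be replayed with a small user-space state machine. This is a simplified model of the walkthrough in this section, not the kernel function: it tracks one bit on one page and leaves out locking, the lastpage bookkeeping, and need_sync. All names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one bitmap bit and the page that holds it. */
struct bit_model {
    unsigned bmc;          /* in-memory counter for the chunk */
    bool pending;          /* page attribute PENDING */
    bool needwrite;        /* page attribute NEEDWRITE */
    bool cache_bit;        /* bit in the bitmap file cache page */
    bool disk_bit;         /* bit in the bitmap file on disk */
    unsigned disk_writes;  /* how often the page was written to disk */
};

/* State right after the last write IO on the chunk has completed. */
static struct bit_model model_after_endwrite(void)
{
    struct bit_model m = { .bmc = 2, .pending = true, .needwrite = false,
                           .cache_bit = true, .disk_bit = true,
                           .disk_writes = 0 };
    return m;
}

/* One simplified pass of the periodic clearing work. */
static void model_daemon_pass(struct bit_model *m)
{
    assert(m);
    if (!m->pending) {
        /* Page not marked for clearing: write it out only if requested. */
        if (m->needwrite) {
            m->needwrite = false;
            m->disk_bit = m->cache_bit;   /* stands in for write_page() */
            m->disk_writes++;
        }
        return;
    }
    m->pending = false;
    m->needwrite = true;       /* the page needs an asynchronous write later */
    if (m->bmc == 1) {
        m->bmc = 0;
        m->cache_bit = false;  /* the bit in the cache page is cleared here */
    } else if (m->bmc == 2) {
        m->bmc = 1;            /* maybe clear the bit next time */
        m->pending = true;
    }
}
```

Starting from model_after_endwrite(), the first pass leaves bmc at 1 with both attributes set, the second clears the cached bit, and only the third performs the single write to disk.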