A legendary life of IO (4)


Buffer cache mechanism for Block devices

 

Before EXT3 file I/O sets off on the next leg of its journey, we need to introduce a companion that will travel the same road. This companion has never passed through the EXT3 file system; its experience is slightly different, because it was created by a write operation on a block device. We can call it block device IO.

In many applications, block devices are operated on directly. The familiar dd command can read and write block devices; for example, dd if=/dev/sda of=./abc bs=512 count=1 reads the first sector of the /dev/sda device into the file abc in the current directory. Reading this, it is worth asking: what is the essential difference between accessing a block device file and accessing an EXT3 file? We can list one or two differences (a minimal C equivalent of the dd command is sketched after the list below):

1) EXT3 is a general-purpose file system: the layout of data on disk is described by metadata, so a read or write of an EXT3 file involves two kinds of data, metadata and file data. Because the two are correlated, EXT3 normally uses a journal to guarantee that updates to both are atomic.

2) A block device is not as complex as the EXT3 file system: the data written is stored directly on disk, and no metadata is involved.
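As a point of reference, the following user-space sketch does roughly what the dd command above does: it reads the first 512-byte sector of /dev/sda into a file named abc in the current directory (the device path and file name are only the example's assumptions).

/* Roughly equivalent to: dd if=/dev/sda of=./abc bs=512 count=1 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char sector[512];
    int in  = open("/dev/sda", O_RDONLY);   /* block device special file */
    int out = open("./abc", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (in < 0 || out < 0) {
        perror("open");
        return 1;
    }
    if (read(in, sector, sizeof(sector)) != sizeof(sector) ||
        write(out, sector, sizeof(sector)) != sizeof(sector)) {
        perror("read/write");
        return 1;
    }
    close(in);
    close(out);
    return 0;
}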

Therefore, block devices and EXT3 file systems differ in how data is read and written. What they have in common is the disk-access performance problem, and both can use memory to hide it: EXT3 uses the page cache to optimize I/O performance, and block devices can use the page cache in exactly the same way. As we saw earlier, every EXT3 file has a radix tree that maintains the page cache for that file's contents; a raw block device can be treated as if it were one big file, so a radix tree can likewise maintain the cache of the data on the device. The two therefore have a great deal in common.
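To make the symmetry concrete, here is a small kernel-style sketch (the helper name lookup_cached_page is ours, not the kernel's): whether the address_space belongs to a regular EXT3 file or to a block device's bdev inode, a cached page is found by the same radix-tree lookup.

#include <linux/fs.h>
#include <linux/pagemap.h>

/* mapping is file->f_mapping for an EXT3 file, or
 * bdev->bd_inode->i_mapping for a block device;
 * either way the page cache is indexed by the same radix tree. */
static struct page *lookup_cached_page(struct address_space *mapping,
                                       pgoff_t index)
{
    return find_get_page(mapping, index);
}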

Because of this, Linux implements block device I/O in much the same way as EXT3. The subsystem for block device access can be called the bdev file system. VFS provides access to all types of files, and the APIs called by applications are identical; for block device access, the functions provided by the bdev file system are invoked through the VFS layer.

When initializing a block device, the init_special_inode function is called to initialize the inode of the block device:

void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
{
    inode->i_mode = mode;
    if (S_ISCHR(mode)) {
        /* initialize the character device operation methods */
        inode->i_fop = &def_chr_fops;
        inode->i_rdev = rdev;
    } else if (S_ISBLK(mode)) {
        /* initialize the block device operation methods */
        inode->i_fop = &def_blk_fops;
        inode->i_rdev = rdev;
    } else if (S_ISFIFO(mode))
        inode->i_fop = &def_fifo_fops;
    else if (S_ISSOCK(mode))
        inode->i_fop = &bad_sock_fops;
    else
        printk(KERN_DEBUG "kernel: bogus i_mode (%o) for "
               "inode %s:%lu\n", mode, inode->i_sb->s_id, inode->i_ino);
}

When a user program calls open on a specified block device, a file object is created and initialized from the inode described above, so the file object's operation table points at the generic block device operation functions. These are defined in Linux as follows:

const struct file_operations def_blk_fops = {
    .open           = blkdev_open,
    .release        = blkdev_close,
    .llseek         = block_llseek,
    .read           = do_sync_read,
    .write          = do_sync_write,
    .aio_read       = generic_file_aio_read,
    .aio_write      = blkdev_aio_write,
    .mmap           = generic_file_mmap,
    .fsync          = blkdev_fsync,
    .unlocked_ioctl = block_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl   = compat_blkdev_ioctl,
#endif
    .splice_read    = generic_file_splice_read,
    .splice_write   = generic_file_splice_write,
};

Therefore, when write is called on a block device, the kernel executes do_sync_write, which in turn calls blkdev_aio_write to perform the write to the block device. The core of blkdev_aio_write is __generic_file_aio_write. At this point we can see that the block device call chain is essentially the same as that of the EXT3 file system. As analyzed earlier, __generic_file_aio_write splits into two cases, Direct_IO and Buffer_IO. The key steps of the Buffer_IO path are (a simplified sketch of this loop follows the list):

1) write_begin

2) copy buffer

3) write_end
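The following is a minimal sketch of that three-step loop, modelled loosely on the kernel's generic_file_buffered_write/generic_perform_write path rather than copied from it (the real code uses atomic user copies, dirty-page throttling, and so on); the helper name buffered_write_sketch is ours. Its point is simply to show how write_begin, the buffer copy, and write_end drive the address_space operations, which for a block device are blkdev_write_begin and blkdev_write_end:

/*
 * Simplified sketch of the Buffer_IO write loop (NOT the kernel's exact
 * code): write_begin -> copy user data into the page cache -> write_end.
 */
#include <linux/fs.h>
#include <linux/highmem.h>
#include <linux/kernel.h>
#include <linux/pagemap.h>
#include <linux/uaccess.h>

static ssize_t buffered_write_sketch(struct file *file,
                                     const char __user *buf,
                                     size_t count, loff_t pos)
{
    struct address_space *mapping = file->f_mapping;
    const struct address_space_operations *a_ops = mapping->a_ops;
    ssize_t written = 0;

    while (count) {
        unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
        unsigned bytes  = min_t(unsigned, PAGE_CACHE_SIZE - offset, count);
        struct page *page;
        void *fsdata;
        char *kaddr;
        int status;

        /* 1) write_begin: find or create the page-cache page
         *    (blkdev_write_begin for the bdev file system). */
        status = a_ops->write_begin(file, mapping, pos, bytes, 0,
                                    &page, &fsdata);
        if (status < 0)
            return written ? written : status;

        /* 2) copy buffer: move user data into the page-cache page.
         *    (The real kernel uses an atomic copy here to avoid
         *    faulting while the page is locked.) */
        kaddr = kmap(page);
        if (copy_from_user(kaddr + offset, buf, bytes)) {
            kunmap(page);
            a_ops->write_end(file, mapping, pos, bytes, 0, page, fsdata);
            return written ? written : -EFAULT;
        }
        kunmap(page);

        /* 3) write_end: mark the buffers/page dirty so the writeback
         *    thread will pick them up later (blkdev_write_end here). */
        status = a_ops->write_end(file, mapping, pos, bytes, bytes,
                                  page, fsdata);
        if (status < 0)
            return written ? written : status;

        pos     += bytes;
        buf     += bytes;
        count   -= bytes;
        written += bytes;
    }
    return written;
}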

The write_begin and write_end functions are methods specific to each file system. For the bdev file system they are defined as follows:

static const struct address_space_operations def_blk_aops = {
    .readpage    = blkdev_readpage,
    .writepage   = blkdev_writepage,
    .write_begin = blkdev_write_begin,
    .write_end   = blkdev_write_end,
    .writepages  = generic_writepages,
    .releasepage = blkdev_releasepage,
    .direct_IO   = blkdev_direct_IO,
};
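For the bdev file system these two methods are thin wrappers around the generic buffer-head helpers. The sketch below is modelled on fs/block_dev.c from kernels of roughly this vintage (the exact signature of block_write_begin varies between kernel versions), so treat it as illustrative rather than authoritative:

static int blkdev_write_begin(struct file *file, struct address_space *mapping,
                              loff_t pos, unsigned len, unsigned flags,
                              struct page **pagep, void **fsdata)
{
    /* No journal to open: just find/allocate the page-cache page and map
     * its buffers to on-disk blocks via blkdev_get_block(). */
    return block_write_begin(mapping, pos, len, flags, pagep,
                             blkdev_get_block);
}

static int blkdev_write_end(struct file *file, struct address_space *mapping,
                            loff_t pos, unsigned len, unsigned copied,
                            struct page *page, void *fsdata)
{
    /* block_write_end() ends up in __block_commit_write(), shown below,
     * which marks the written buffers (and hence the page) dirty. */
    int ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);

    unlock_page(page);
    page_cache_release(page);

    return ret;
}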

For an EXT3 file, write_begin performs journal operations; the block device file system has no journal, so its write_begin only prepares the page. For write_end, the EXT3 file system closes out the journal and notifies the writeback daemon thread to write the data back; for the block device file system, the main task of write_end is to mark the page dirty, after which the writeback thread processes the dirty pages of the block device. The function that marks the pages dirty is shown below:

static int __block_commit_write(struct inode *inode, struct page *page,
        unsigned from, unsigned to)
{
    unsigned block_start, block_end;
    int partial = 0;
    unsigned blocksize;
    struct buffer_head *bh, *head;

    blocksize = 1 << inode->i_blkbits;

    for (bh = head = page_buffers(page), block_start = 0;
         bh != head || !block_start;
         block_start = block_end, bh = bh->b_this_page) {
        block_end = block_start + blocksize;
        if (block_end <= from || block_start >= to) {
            if (!buffer_uptodate(bh))
                partial = 1;
        } else {
            set_buffer_uptodate(bh);
            /* set the page and inode dirty; wait for the
             * writeback scheduling process */
            mark_buffer_dirty(bh);
        }
        clear_buffer_new(bh);
    }

    /*
     * If this is a partial write which happened to make all buffers
     * uptodate then we can optimize away a bogus readpage() for
     * the next read(). Here we 'discover' whether the page went
     * uptodate as a result of this (potentially partial) write.
     */
    if (!partial)
        SetPageUptodate(page);
    return 0;
}

From this analysis, the cache mechanism used for writes to a raw block device is essentially the same as that of EXT3: both use the page cache managed by a radix tree. Unless Direct_IO is used, data is first written into the page cache and later written back to disk by the writeback mechanism; the overall mechanism is identical. The difference lies in the cache block size: for EXT3 the cache block size is the page size, while for block devices it is chosen by a certain policy. For details, refer to the earlier discussion of buffer cache performance issues in Linux.
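As a small illustration of that block-size question, the following user-space sketch asks a block device for its logical sector size (BLKSSZGET) and for the soft block size currently used by the block layer's buffer cache (BLKBSZGET); /dev/sda is only an example path.

/* Query the logical sector size and the soft (buffer-cache) block size. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int sector_size = 0, block_size = 0;
    int fd = open("/dev/sda", O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(fd, BLKSSZGET, &sector_size) == 0)
        printf("logical sector size: %d bytes\n", sector_size);
    if (ioctl(fd, BLKBSZGET, &block_size) == 0)
        printf("soft block size:     %d bytes\n", block_size);
    close(fd);
    return 0;
}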

Although a block device and an EXT3 file look very different, their IO processing mechanisms are similar because the problems the system has to solve are similar. Good! To put it plainly, at this point both the EXT3 file IO and the block device IO have been prepared, and the writeback mechanism has written all of this IO back toward the underlying device. These IOs are about to leave their brief stay in the page cache and set foot on the block device layer, where a fair but difficult scheduling process awaits them.

 

This article is from the "Storage path" blog. For more information, contact the author!
