Storing blocks in the page cache

Source: Internet
Author: User

The VFS (together with the mapping layer) and the various filesystems organize disk data in logical units called blocks. Earlier versions of the Linux kernel had two main disk caches: the page cache and the buffer cache. The former stored the pages of disk data produced when accessing disk files; the latter kept in memory the contents of the blocks accessed by the VFS while managing disk filesystems.

 

Starting with stable version 2.4.10, the buffer cache no longer exists. For efficiency reasons, block buffers are not allocated individually. Instead, they are stored in dedicated pages called "buffer pages", which are kept in the page cache.

 

A buffer page is a page of data associated with additional descriptors called buffer heads, whose main purpose is to quickly determine the disk address of each block in the page. This is needed because the chunks of data stored in a page of the page cache are not necessarily adjacent on disk.

 

1. Block buffers and buffer heads

 

Each block buffer has a buffer head descriptor of type buffer_head. This descriptor contains all the information the kernel needs to handle the block, so the kernel checks the buffer head before operating on any block. The fields of the buffer head are defined in include/linux/buffer_head.h:
struct buffer_head {
    unsigned long b_state;              /* buffer state flags */
    struct buffer_head *b_this_page;    /* next element in the buffer page's list */
    struct page *b_page;                /* descriptor of the buffer page that holds this block */

    sector_t b_blocknr;                 /* logical block number relative to the block device */
    size_t b_size;                      /* block size */
    char *b_data;                       /* position of the block inside the buffer page */

    struct block_device *b_bdev;        /* block device descriptor */
    bh_end_io_t *b_end_io;              /* I/O completion method */
    void *b_private;                    /* data for the I/O completion method */
    struct list_head b_assoc_buffers;   /* list of indirect blocks associated with an inode */
    atomic_t b_count;                   /* block usage counter */
};

 

Two fields of the buffer head encode the disk address of the block. The b_bdev field identifies the block device that contains the block, usually a disk or a partition. The b_blocknr field stores the logical block number, that is, the index of the block inside that disk or partition.
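
As a rough illustration of how these two fields locate data, the byte offset of a block within its device is the logical block number multiplied by the block size. The sketch below is plain user-space C, not kernel code; the names toy_block_addr, block_byte_offset and block_first_sector are hypothetical, and the 512-byte sector size is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the (b_bdev, b_blocknr, b_size) triple. */
struct toy_block_addr {
    int      dev_id;   /* stands in for b_bdev */
    uint64_t blocknr;  /* stands in for b_blocknr */
    uint32_t size;     /* stands in for b_size, in bytes */
};

/* Byte offset of the block from the start of the device or partition. */
static uint64_t block_byte_offset(const struct toy_block_addr *a)
{
    return a->blocknr * (uint64_t)a->size;
}

/* First 512-byte sector covered by the block (sector size is assumed). */
static uint64_t block_first_sector(const struct toy_block_addr *a)
{
    return block_byte_offset(a) / 512;
}
```

For instance, block 10 of a device with 1024-byte blocks starts at byte 10240, which is sector 20 under the assumed sector size.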

 

The b_data field specifies the position of the block buffer inside the buffer page. The value actually stored depends on whether the page is in high memory. If it is, b_data holds the offset of the block buffer from the start of the page; otherwise, b_data holds the linear address of the block buffer.

 

The b_state field can store several flags. Some of them are generic and are listed below; each filesystem can also define its own private buffer head flags.

BH_Uptodate: set if the buffer contains valid data.
BH_Dirty: set if the buffer is dirty, that is, its data must be written back to the block device.
BH_Lock: set if the buffer is locked, which usually happens while the buffer is involved in a disk transfer.
BH_Req: set if a data transfer has already been requested to initialize the buffer.
BH_Mapped: set if the buffer is mapped to disk, that is, if the b_bdev and b_blocknr fields of the buffer head are valid.
BH_New: set if the corresponding block has just been allocated and has never been accessed.
BH_Async_Read: set if the buffer is being read asynchronously.
BH_Async_Write: set if the buffer is being written asynchronously.
BH_Delay: set if the buffer has not yet been allocated on disk.
BH_Boundary: set if the block to be submitted after this one will not be adjacent to it.
BH_Write_EIO: set if an I/O error occurred while writing the block.
BH_Ordered: set if the block must be written strictly after the blocks submitted before it (used by journaling filesystems).
BH_Eopnotsupp: set if the block device driver does not support the requested operation.
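
To make the flag mechanism concrete, here is a minimal user-space sketch of a state word whose bits are set, cleared and tested by bit position. It is only an illustration: the TOY_ names and helper functions are hypothetical, the numeric bit positions are not claimed to match the kernel's, and unlike the kernel's bit operations these helpers are not atomic.

```c
#include <assert.h>

/* Bit positions, in the order the first generic flags are listed above.
 * The numeric values are illustrative only. */
enum toy_bh_state_bits {
    TOY_BH_Uptodate,
    TOY_BH_Dirty,
    TOY_BH_Lock,
    TOY_BH_Req,
    TOY_BH_Mapped
};

/* Set one flag in the state word. */
static void toy_set_flag(unsigned long *state, int bit)
{
    *state |= 1UL << bit;
}

/* Clear one flag in the state word. */
static void toy_clear_flag(unsigned long *state, int bit)
{
    *state &= ~(1UL << bit);
}

/* Return 1 if the flag is set, 0 otherwise. */
static int toy_test_flag(unsigned long state, int bit)
{
    return (int)((state >> bit) & 1UL);
}
```

A buffer that is mapped and dirty would have both bits set in the same word; clearing the dirty bit leaves the mapped bit untouched.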

 

Buffer heads have their own slab allocator cache, whose kmem_cache_s descriptor is stored in the variable bh_cachep. The alloc_buffer_head() and free_buffer_head() functions obtain and release a buffer head, respectively.

struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
{
    /* Take a buffer head from the dedicated slab cache */
    struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
    if (ret) {
        get_cpu_var(bh_accounting).nr++;    /* per-CPU count of buffer heads */
        recalc_bh_state();
        put_cpu_var(bh_accounting);
    }
    return ret;
}

void free_buffer_head(struct buffer_head *bh)
{
    /* The buffer head must no longer be on an inode's indirect block list */
    BUG_ON(!list_empty(&bh->b_assoc_buffers));
    kmem_cache_free(bh_cachep, bh);
    get_cpu_var(bh_accounting).nr--;
    recalc_bh_state();
    put_cpu_var(bh_accounting);
}

 

The b_count field of the buffer head is a usage counter for the corresponding block buffer. The counter is increased before each operation on the block buffer and decreased afterwards. The block buffers kept in the page cache are examined both periodically and when free memory becomes scarce; only block buffers whose usage counter is 0 can be reclaimed.

 

When a kernel control path wants to access a block buffer, it should first increase the usage counter. The function that locates a block inside the page cache, __getblk(), does this automatically, so higher-level functions usually do not increase the block buffer's usage counter themselves.

 

When a kernel control path stops accessing a block buffer, it calls __brelse() or __bforget() to decrease the corresponding usage counter. The difference between the two functions is that __bforget() also removes the block from the list of indirect blocks (the b_assoc_buffers field of the buffer head) and marks the buffer as clean, which forces the kernel to forget any change made to the buffer that has not yet been written to disk.
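
The counting discipline described above can be sketched in user space as follows. This is a toy model, not the kernel's implementation: toy_buffer, toy_get, toy_release, toy_forget and toy_can_reclaim are hypothetical names, the counter is a plain int rather than an atomic_t, and the indirect block list is left out.

```c
#include <assert.h>

struct toy_buffer {
    int count;   /* stands in for b_count */
    int dirty;   /* stands in for the BH_Dirty flag */
};

/* Take a reference before using the buffer. */
static void toy_get(struct toy_buffer *b)
{
    b->count++;
}

/* Loosely modeled on __brelse(): drop one reference. */
static void toy_release(struct toy_buffer *b)
{
    if (b->count > 0)
        b->count--;
}

/* Loosely modeled on __bforget(): mark the buffer clean so that pending
 * changes are dropped, then drop the reference. */
static void toy_forget(struct toy_buffer *b)
{
    b->dirty = 0;
    toy_release(b);
}

/* A buffer may be reclaimed only when nobody holds a reference to it. */
static int toy_can_reclaim(const struct toy_buffer *b)
{
    return b->count == 0;
}
```

A caller that takes a reference, dirties the buffer and then "forgets" it ends up with a clean buffer whose counter is back to 0.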

 

2. Buffer page data structures

 

Whenever the kernel must address a block individually, it refers to the buffer page that holds the block buffer and checks the corresponding buffer head. The kernel creates buffer pages in two common situations:

 

(1) When reading or writing pages of a file that are not stored in contiguous disk blocks. This happens either because the filesystem has allocated non-contiguous blocks to the file or because the file contains "holes".

 

(2) When accessing a single disk block (for example, when reading a superblock or an inode block).

 

In the first case, the descriptor of the buffer page is inserted in the radix tree of a regular file. The buffer heads are preserved because they hold important information: the block device and the logical block number that specify where the data is stored on disk.

 

In the second case, the descriptor of the buffer page is inserted in the radix tree rooted at the address_space object of the inode, in the special bdev filesystem, that is associated with the block device. Buffer pages of this kind must satisfy a strong constraint: all the block buffers in the page must refer to blocks that are adjacent on the underlying block device.

 

As an example of the second case, suppose the Virtual File System needs to read a 1024-byte inode block (the one containing the inode of a given file). Instead of allocating a single buffer, the kernel must allocate a whole page that holds four buffers. These buffers store the data of four adjacent blocks on the block device, including the requested inode block.

Here we focus on the second type of buffer page, the so-called block device buffer page (sometimes simply called a blockdev page), because this is the most common situation when reading and writing disk files.

 

All the block buffers in a buffer page must have the same size. Therefore, on the 80x86 architecture, where a page is 4096 bytes, a buffer page can contain from one to eight buffers depending on the block size (4096 down to 512 bytes).
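
The relation between block size and buffers per page is simple division, as the sketch below shows. It is a user-space illustration only; buffers_per_page and TOY_PAGE_SIZE are hypothetical names, and the 4096-byte page size is the assumption stated for the 80x86 case.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u  /* assumed 80x86 page size */

/* Number of equally sized block buffers that fit in one buffer page.
 * Returns 0 for a size that does not divide the page evenly. */
static unsigned buffers_per_page(unsigned block_size)
{
    if (block_size == 0 || block_size > TOY_PAGE_SIZE ||
        TOY_PAGE_SIZE % block_size != 0)
        return 0;
    return TOY_PAGE_SIZE / block_size;
}
```

With these assumptions, 512-byte blocks give eight buffers per page, 1024-byte blocks give four (the inode block example above), and 4096-byte blocks give one.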

 

If a page acts as a buffer page, all the buffer heads associated with its block buffers are collected in a singly linked circular list. The private field of the buffer page's descriptor points to the buffer head of the first block in the page. Because the private field then contains valid data, the PG_private flag of the page is set; hence, if a page contains disk data and has PG_private set, it is a buffer page. (Note, however, that kernel components unrelated to the block I/O subsystem also use the private field and the PG_private flag for other purposes.) Each buffer head stores in its b_this_page field a pointer to the next buffer head in the list, and stores in its b_page field the address of the buffer page's descriptor.
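
The circular list can be walked with a do/while loop that stops when it returns to the first buffer head. The following user-space sketch shows the pattern; toy_bh, toy_page and toy_count_buffers are hypothetical stand-ins for the kernel structures, reduced to the fields needed here.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bh {
    struct toy_bh *this_page;  /* stands in for b_this_page */
    unsigned long blocknr;     /* stands in for b_blocknr */
};

struct toy_page {
    int has_private;           /* stands in for the PG_private flag */
    struct toy_bh *private_;   /* stands in for the private field */
};

/* Count the buffer heads of a page by walking the circular list once. */
static int toy_count_buffers(const struct toy_page *page)
{
    const struct toy_bh *head, *bh;
    int n = 0;

    if (!page->has_private || page->private_ == NULL)
        return 0;              /* not a buffer page */
    head = bh = page->private_;
    do {
        n++;
        bh = bh->this_page;
    } while (bh != head);      /* back at the first buffer head: done */
    return n;
}
```

The loop body runs at least once, which is safe because a buffer page always has at least one buffer head.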

3. Allocating block device buffer pages

 

When the kernel finds that the page holding a requested block buffer is not in the page cache, it allocates a new block device buffer page. More precisely, the lookup of a block can fail for the following reasons:

(1) The page containing the block is not in the radix tree of the block device. In this case, a new page descriptor must be added to the radix tree.

(2) The page containing the block is in the radix tree of the block device, but it is not a buffer page. In this case, new buffer heads must be allocated and linked to the page, turning it into a block device buffer page.

(3) The buffer page containing the block is in the radix tree of the block device, but the blocks in the page have a size different from the requested one. In this case, the old buffer heads must be released, and new buffer heads must be allocated and linked to the page.
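
The three cases can be read as a small decision procedure. The sketch below encodes it in user-space C purely for illustration; toy_action and toy_classify are hypothetical names, and the inputs are plain integers rather than kernel objects.

```c
#include <assert.h>

enum toy_action {
    TOY_ADD_PAGE,        /* case 1: page missing from the radix tree */
    TOY_ATTACH_BUFFERS,  /* case 2: page present but without buffer heads */
    TOY_REPLACE_BUFFERS, /* case 3: buffer heads have the wrong block size */
    TOY_REUSE_PAGE       /* page already has buffers of the right size */
};

/* Decide what must happen before the page can serve the requested block. */
static enum toy_action toy_classify(int page_in_tree, int has_buffers,
                                    int page_block_size, int wanted_size)
{
    if (!page_in_tree)
        return TOY_ADD_PAGE;
    if (!has_buffers)
        return TOY_ATTACH_BUFFERS;
    if (page_block_size != wanted_size)
        return TOY_REPLACE_BUFFERS;
    return TOY_REUSE_PAGE;
}
```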

 

The kernel calls grow_buffers() to add a block device buffer page to the page cache. The function receives three parameters that identify the block:
- The address bdev of the block_device descriptor.
- The logical block number block (the position of the block inside the block device).
- The block size size.

 

Next we will analyze this important function:
static int
grow_buffers(struct block_device *bdev, sector_t block, int size)
{
    struct page *page;
    pgoff_t index;
    int sizebits;

    /* sizebits = log2(number of blocks per page) */
    sizebits = -1;
    do {
        sizebits++;
    } while ((size << sizebits) < PAGE_SIZE);

    index = block >> sizebits;

    /*
     * Check for a block which wants to lie outside our maximum possible
     * pagecache index. (This comparison is done using sector_t types.)
     */
    if (unlikely(index != block >> sizebits)) {
        char b[BDEVNAME_SIZE];

        printk(KERN_ERR "%s: requested out-of-range block %llu for "
               "device %s\n",
               __FUNCTION__, (unsigned long long)block,
               bdevname(bdev, b));
        return -EIO;
    }
    block = index << sizebits;
    /* Create a page with the proper size buffers.. */
    page = grow_dev_page(bdev, block, index, size);
    if (!page)
        return 0;
    unlock_page(page);
    page_cache_release(page);
    return 1;
}

 

1. Compute index, the offset within the block device of the page that contains the requested block, and then align block to the first block of that page.

For example, take a block size of 512 bytes. Since size << sizebits equals size * 2^sizebits and 512 * 8 = 4096 (PAGE_SIZE), the loop ends with sizebits equal to 3. Then index = block >> sizebits, that is, index = block / 8, gives the page offset within the block device that corresponds to the 512-byte block. Finally, block is aligned to the start of that page: block = index << sizebits, that is, block = index * 8.
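
The arithmetic of this step can be checked with a small user-space program. The sketch reuses the same loop as grow_buffers() but is otherwise hypothetical: toy_sizebits, toy_page_index and toy_first_block are illustrative names, and a 4096-byte page is assumed.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096  /* assumed page size */

/* Smallest shift such that (size << shift) reaches the page size,
 * i.e. log2 of the number of blocks per page. */
static int toy_sizebits(int size)
{
    int sizebits = -1;

    do {
        sizebits++;
    } while ((size << sizebits) < TOY_PAGE_SIZE);
    return sizebits;
}

/* Index of the page that holds the given block. */
static unsigned long toy_page_index(unsigned long block, int size)
{
    return block >> toy_sizebits(size);
}

/* Number of the first block stored in that page. */
static unsigned long toy_first_block(unsigned long block, int size)
{
    int sb = toy_sizebits(size);

    return (block >> sb) << sb;
}
```

With 512-byte blocks, block 21 lives in page 2, whose first block is number 16.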

 

2. If necessary, call grow_dev_page() to create a new block device buffer page.

static struct page *
grow_dev_page(struct block_device *bdev, sector_t block,
              pgoff_t index, int size)
{
    struct inode *inode = bdev->bd_inode;
    struct page *page;
    struct buffer_head *bh;

    page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
    if (!page)
        return NULL;

    BUG_ON(!PageLocked(page));

    if (page_has_buffers(page)) {
        bh = page_buffers(page);
        if (bh->b_size == size) {
            /* Already a buffer page with the right block size */
            init_page_buffers(page, bdev, block, size);
            return page;
        }
        if (!try_to_free_buffers(page))
            goto failed;
    }

    /*
     * Allocate some buffers for this page
     */
    bh = alloc_page_buffers(page, size, 0);
    if (!bh)
        goto failed;

    /*
     * Link the page to the buffers and initialise them. Take the
     * lock to be atomic wrt __find_get_block(), which does not
     * run under the page lock.
     */
    spin_lock(&inode->i_mapping->private_lock);
    link_dev_buffers(page, bh);
    init_page_buffers(page, bdev, block, size);
    spin_unlock(&inode->i_mapping->private_lock);
    return page;

failed:
    BUG();
    unlock_page(page);
    page_cache_release(page);
    return NULL;
}

This function performs the following sub-steps in sequence:

 

A. Call find_or_create_page(), passing it the address_space object of the block device (bdev->bd_inode->i_mapping), the page offset index, and the GFP_NOFS flag. As described in the previous blog post "Page cache handler", find_or_create_page() looks for the requested page in the page cache (in the radix tree) and, if necessary, inserts a new page into the cache.

 

B. At this point the requested page is in the page cache and the function has the address of its descriptor. The function checks the page's PG_private flag; if the flag is clear, the page is not yet a buffer page (it has no associated buffer heads), so the function jumps to step E.

 

C. The page is already a buffer page. Get the address bh of its first buffer head from the private field of the page descriptor, and check whether the block size bh->b_size equals the requested block size. If it does, the page found in the page cache is a valid buffer page, so jump to step G.

 

D. If the block size in the page is wrong, call try_to_free_buffers() to release the old buffer heads of the buffer page; if that call fails, report an error (goto failed).

 

E. Call alloc_page_buffers() to allocate the buffer heads for the page according to the requested block size, and insert them in the singly linked circular list implemented through the b_this_page fields (note the while loop):

struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
                                       int retry)
{
    struct buffer_head *bh, *head;
    long offset;

try_again:
    head = NULL;
    offset = PAGE_SIZE;
    /* Walk the page backwards, one block-sized slot at a time */
    while ((offset -= size) >= 0) {
        bh = alloc_buffer_head(GFP_NOFS);
        if (!bh)
            goto no_grow;

        bh->b_bdev = NULL;
        bh->b_this_page = head;
        bh->b_blocknr = -1;
        head = bh;

        bh->b_state = 0;
        atomic_set(&bh->b_count, 0);
        bh->b_private = NULL;
        bh->b_size = size;

        /* Link the buffer to its page */
        set_bh_page(bh, page, offset);

        init_buffer(bh, NULL, NULL);
    }
    return head;
no_grow:
    ......
}

void set_bh_page(struct buffer_head *bh,
                 struct page *page, unsigned long offset)
{
    bh->b_page = page;
    BUG_ON(offset >= PAGE_SIZE);
    if (PageHighMem(page))
        /*
         * This catches illegal uses and preserves the offset:
         */
        bh->b_data = (char *)(0 + offset);
    else
        bh->b_data = page_address(page) + offset;
}

inline void
init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
{
    bh->b_end_io = handler;
    bh->b_private = private;
}

In addition, alloc_page_buffers() calls set_bh_page() to initialize the b_page field of each buffer head with the address of the page descriptor, and the b_data field with either the linear address or the offset of the block buffer inside the page.
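
One detail of the allocation loop is easy to miss: it visits the page from the last slot to the first, and each new buffer head is pushed at the front of the list, so the head that is returned describes the buffer at offset 0. The user-space sketch below records the offsets in visiting order; toy_offsets is a hypothetical helper and the 4096-byte page size is an assumption.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096L  /* assumed page size */

/* Record the offsets visited by the allocation loop, in order.
 * Returns how many buffers the page would receive (at most max). */
static int toy_offsets(long size, long *out, int max)
{
    long offset = TOY_PAGE_SIZE;
    int n = 0;

    while ((offset -= size) >= 0 && n < max)
        out[n++] = offset;
    return n;
}
```

For 1024-byte blocks the offsets come out as 3072, 2048, 1024 and 0; the last one visited becomes the head of the list.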

 

Return to grow_dev_page:

 

F. Call link_dev_buffers() to link the buffer heads of the page into a circular list, store the address of the first buffer head in the private field of the page descriptor, set the PG_private flag, and increment the page's usage counter (the block buffers in the page count as one page user):

static inline void
link_dev_buffers(struct page *page, struct buffer_head *head)
{
    struct buffer_head *bh, *tail;

    /* Find the last buffer head of the NULL-terminated list */
    bh = head;
    do {
        tail = bh;
        bh = bh->b_this_page;
    } while (bh);
    tail->b_this_page = head;    /* close the list into a ring */
    attach_page_buffers(page, head);
}

static inline void attach_page_buffers(struct page *page,
                                       struct buffer_head *head)
{
    page_cache_get(page);    /* increment the page's usage counter */
    SetPagePrivate(page);
    set_page_private(page, (unsigned long)head);
}

#define SetPagePrivate(page)      set_bit(PG_private, &(page)->flags)
#define set_page_private(page, v) ((page)->private = (v))

 

G. Call init_page_buffers() to initialize the b_bdev, b_blocknr, and b_state fields of the buffer heads linked to the page. Because all the blocks are adjacent on disk, their logical block numbers are consecutive and can easily be derived from block:

static void
init_page_buffers(struct page *page, struct block_device *bdev,
                  sector_t block, int size)
{
    struct buffer_head *head = page_buffers(page);
    struct buffer_head *bh = head;
    int uptodate = PageUptodate(page);

    do {
        if (!buffer_mapped(bh)) {
            init_buffer(bh, NULL, NULL);
            bh->b_bdev = bdev;
            bh->b_blocknr = block;
            if (uptodate)
                set_buffer_uptodate(bh);
            set_buffer_mapped(bh);
        }
        block++;    /* next buffer holds the next block on disk */
        bh = bh->b_this_page;
    } while (bh != head);
}
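
The numbering rule used by this step can be tried out in user space. In the sketch below, which uses hypothetical toy_ names and only the fields that matter, every buffer on the circular list receives the next consecutive block number unless it is already mapped, in which case it is skipped but still counted.

```c
#include <assert.h>

struct toy_bh {
    struct toy_bh *this_page;  /* stands in for b_this_page */
    unsigned long blocknr;     /* stands in for b_blocknr */
    int mapped;                /* stands in for the BH_Mapped flag */
};

/* Give each unmapped buffer on the circular list its block number.
 * Already mapped buffers are left untouched. */
static void toy_init_buffers(struct toy_bh *head, unsigned long block)
{
    struct toy_bh *bh = head;

    do {
        if (!bh->mapped) {
            bh->blocknr = block;
            bh->mapped = 1;
        }
        block++;               /* advance even when the buffer was skipped */
        bh = bh->this_page;
    } while (bh != head);
}
```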

 

H. Return the page descriptor address.

 

After the block device buffer page is allocated, the following data structure relationships are formed:
.............

 

4. Releasing block device buffer pages

 

When the kernel tries to obtain more free memory, it releases block device buffer pages. Clearly, a page that contains dirty or locked buffers cannot be released. The kernel calls try_to_release_page() to release a buffer page. The function receives the address page of the page descriptor and performs the following steps (try_to_release_page() can also be called on the buffer pages of regular files):

 

1. If the PG_writeback flag of the page is set, return 0 (the page is being written back to disk, so it cannot be released).

2. If the address_space object of the block device defines a releasepage method, call it (block devices usually do not define a releasepage method).

3. Call try_to_free_buffers() and return its return value.

 

The try_to_free_buffers() function scans in turn the buffer heads linked to the buffer page and essentially performs the following operations:

1. Check the flags of the buffer heads of all the buffers in the page. If some buffer head has the BH_Dirty or BH_Lock flag set, the function cannot release the buffers, so it terminates and returns 0 (failure).

2. If a buffer head is in a list of indirect buffers, remove it from that list.

3. Clear the PG_private flag of the page descriptor, set the private field to NULL, and decrease the page's usage counter.

4. Clear the PG_dirty flag of the page.

5. Call free_buffer_head() repeatedly to release all the buffer heads of the page.

6. Return 1 (success).
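
The all-or-nothing character of these steps can be sketched in user space as follows. This is a simplified model under stated assumptions: toy_bh, toy_page and toy_try_to_free are hypothetical names, the indirect buffer list and the PG_dirty handling are left out, and the buffer heads are owned by the caller rather than freed.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bh {
    struct toy_bh *this_page;  /* stands in for b_this_page */
    int dirty;                 /* stands in for BH_Dirty */
    int locked;                /* stands in for BH_Lock */
};

struct toy_page {
    struct toy_bh *private_;   /* stands in for the private field */
    int has_private;           /* stands in for PG_private */
    int refcount;              /* stands in for the page usage counter */
};

/* Returns 1 and detaches the buffers if none is dirty or locked, else 0. */
static int toy_try_to_free(struct toy_page *page)
{
    struct toy_bh *head = page->private_, *bh = head;

    if (head == NULL)
        return 0;
    do {                       /* step 1: any busy buffer vetoes the release */
        if (bh->dirty || bh->locked)
            return 0;
        bh = bh->this_page;
    } while (bh != head);

    page->private_ = NULL;     /* step 3: detach the list from the page */
    page->has_private = 0;
    page->refcount--;
    return 1;
}
```

A single dirty buffer is enough to keep the whole page attached; once it is clean, the same call succeeds.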
