"Linux kernel design and implementation" reading notes (16)-page cache and page writeback

Main content:

    • Introduction to Caching
    • Page cache
    • Page Write-back

1. Introduction to Caching

In programming, caching is a very common and effective mechanism for improving program performance.

The Linux kernel is no exception: to improve I/O performance, it introduces a caching mechanism that keeps copies of portions of the data on disk in memory.

1.1 Principle

The ability of caching to improve I/O performance rests on the following two important principles:

    1. The CPU accesses memory far faster than it accesses the disk (the gap is several orders of magnitude)
    2. Once data has been accessed, it is likely to be accessed again in the near future (the principle of temporal locality)

1.2 Strategy

Creating and reading cache entries is straightforward: check whether the data is already cached, and create or read the entry accordingly.

Writing to the cache and reclaiming cache entries, however, need careful thought, because they raise the issue of keeping the "cache contents" and the "disk contents" synchronized.

1.2.1"Write Cache" There are 3 common strategies

    • No-write (nowrite): write operations are not cached at all; data is written directly to disk, and any cached copy of that data is invalidated
    • Write-through: data is written to the disk and to the cache at the same time
    • Write-back (also called copy-write or write-behind): data is written only to the cache, and a separate process (the writeback process) synchronizes it to disk at an appropriate later time

The advantages and disadvantages of the three strategies are as follows:

    Strategy         Complexity   Performance
    No-write         Simple       The cache helps reads only; performance degrades for write-heavy I/O
    Write-through    Simple       Read performance improves, but each write pays an extra cost (it goes to both the disk and the cache)
    Write-back       Complex      Both read and write performance improve (this is the strategy currently used in the kernel)
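
To make the write-back idea concrete, here is a minimal user-space sketch (not kernel code; the types and function names are invented for illustration) of a cache entry with a dirty flag: writes touch only the cached copy, and a later writeback step pushes dirty entries to disk.

#include <stdbool.h>
#include <string.h>

/* Hypothetical cache entry: one cached disk block plus a dirty flag. */
struct cache_entry {
    unsigned long block_nr;     /* which disk block this entry caches */
    char          data[4096];   /* in-memory copy of the block */
    bool          dirty;        /* true if data differs from the on-disk copy */
};

/* Write-back write: update only the cached copy and mark it dirty. */
void cache_write(struct cache_entry *e, const void *buf)
{
    memcpy(e->data, buf, sizeof(e->data));
    e->dirty = true;            /* the on-disk copy is now stale */
}

/* Called later (e.g. by a writeback thread) to synchronize one entry;
 * write_block() stands in for the real disk I/O routine. */
void cache_writeback(struct cache_entry *e,
                     int (*write_block)(unsigned long, const void *))
{
    if (e->dirty && write_block(e->block_nr, e->data) == 0)
        e->dirty = false;       /* cache and disk agree again */
}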

1.2.2"Cache Recycle" policy

    • Least Recently Used (LRU): each cached item carries a timestamp recording when it was last accessed; when the cache is reclaimed, the items with the oldest timestamps are reclaimed first
    • Two-list strategy (LRU/2): an improvement on LRU; see the supplemental notes below for details

Supplemental notes (two-list strategy):

The two-list strategy is an improved version of the LRU (Least Recently Used) algorithm.

It approximates LRU with two linked lists (an active list and an inactive list), which improves the performance of page reclaim.

When page reclaim takes place, pages are reclaimed from the tail of the inactive list.

The key to the two-list strategy is how a page moves between the two lists.

In the two-list strategy, each page has two flag bits:

    • PG_active - marks whether the page is active, that is, whether it should be placed on the active list
    • PG_referenced - marks whether the page has been accessed by a process

The process for moving pages is as follows:

    1. When a page is accessed for the first time, PG_active is set to 1 and the page is added to the active list
    2. When the page is accessed again, PG_referenced is set to 1; if the page is on the inactive list, it is moved to the active list, PG_active is set to 1, and PG_referenced is reset to 0
    3. A daemon in the system periodically scans the active list and clears the PG_referenced bit of each page
    4. On the next periodic check, if a page's PG_referenced is still 0, its PG_active bit is cleared and the page is moved to the inactive list
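
As a rough illustration of these movement rules, the following user-space sketch (not the kernel's actual code; the list-manipulation helpers in comments are placeholders) shows how the two flags might drive promotion and demotion between the lists.

#include <stdbool.h>

struct page_flags {
    bool pg_active;        /* page belongs on the active list */
    bool pg_referenced;    /* page was accessed since the last scan */
};

/* Called whenever the page is accessed (steps 1 and 2 above). */
void mark_page_accessed_sketch(struct page_flags *p)
{
    if (!p->pg_active) {
        /* First access, or re-access while on the inactive list:
         * promote the page to the active list. */
        p->pg_active = true;
        p->pg_referenced = false;
        /* move_to_active_list(p);  -- placeholder */
    } else {
        p->pg_referenced = true;   /* remember the access until the next scan */
    }
}

/* Periodic scan of the active list by the daemon (steps 3 and 4 above). */
void scan_active_page_sketch(struct page_flags *p)
{
    if (p->pg_referenced) {
        p->pg_referenced = false;  /* accessed recently: keep it active */
    } else {
        p->pg_active = false;      /* not accessed since the last scan: demote */
        /* move_to_inactive_list(p); -- reclaim takes pages from its tail */
    }
}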

Reference: page reclaim and reverse mapping in Linux 2.6

2. Page Cache

As its name suggests, the unit of caching in the page cache is the memory page.

The data held in these pages is not limited to filesystem data, however; it can be any page-based object, including various kinds of files and memory mappings.

2.1 Introduction

What the page cache holds are specific physical pages, unlike the virtual memory areas (vm_area_struct) mentioned in the previous section. Suppose a process has created several vm_area_structs that all point to the same file.

These vm_area_structs share a single copy of the file's data in the page cache.

In other words, when a file on disk is cached in memory, it can be mapped at multiple virtual memory addresses, but it occupies only one set of physical memory pages.
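
This behaviour can be observed from user space: mapping the same file twice yields two different virtual addresses, yet both mappings are backed by the same page-cache pages, so a change made through one mapping is immediately visible through the other. A minimal example (it assumes an existing, non-empty file named testfile; error handling is abbreviated):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_RDWR);    /* assumes an existing, non-empty file */
    if (fd < 0)
        return 1;

    /* Two shared mappings of the same file: two vm_area_structs,
     * two virtual addresses, but one set of page-cache pages. */
    char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    a[0] = 'X';                           /* write through the first mapping   */
    printf("a=%p b=%p b[0]=%c\n",
           (void *)a, (void *)b, b[0]);   /* 'X' is visible through the second */

    munmap(a, 4096);
    munmap(b, 4096);
    close(fd);
    return 0;
}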

To effectively improve I/O performance, the page cache needs to meet the following criteria:

    1. It must be possible to quickly determine whether a required page is present in the cache
    2. It must be possible to quickly locate dirty pages (pages that have been written but not yet synced to disk)
    3. The locking overhead of concurrent access to the page cache must be kept as low as possible

The following sections analyze the corresponding structures in the kernel to see how it meets these requirements and improves I/O performance.

2.2 Implementation

The most important structure in the page cache implementation is address_space, defined in <linux/fs.h>:

struct address_space {
    struct inode            *host;              /* the inode that owns this address_space */
    struct radix_tree_root  page_tree;          /* radix tree of all pages */
    spinlock_t              tree_lock;          /* spinlock protecting the radix tree */
    unsigned int            i_mmap_writable;    /* count of VM_SHARED mappings */
    struct prio_tree_root   i_mmap;             /* tree of private mappings */
    struct list_head        i_mmap_nonlinear;   /* list of VM_NONLINEAR mappings */
    spinlock_t              i_mmap_lock;        /* spinlock protecting i_mmap */
    unsigned int            truncate_count;     /* truncate count */
    unsigned long           nrpages;            /* total number of pages */
    pgoff_t                 writeback_index;    /* offset at which writeback starts */
    const struct address_space_operations *a_ops;  /* address_space operations table */
    unsigned long           flags;              /* gfp_mask and error flags */
    struct backing_dev_info *backing_dev_info;  /* read-ahead information */
    spinlock_t              private_lock;       /* private address_space spinlock */
    struct list_head        private_list;       /* private address_space list */
    struct address_space    *assoc_mapping;     /* associated buffers */
    struct mutex            unmap_mutex;        /* mutex protecting unmapping of pages */
} __attribute__((aligned(sizeof(long))));

Additional notes:

    1. host - the inode that owns this address_space; if the address_space is mapped from a file that has no inode, this field is NULL
    2. page_tree - this radix tree is very important: it ensures that data in the page cache can be retrieved quickly and that dirty pages can be located quickly
    3. i_mmap - a priority search tree of the mappings of this file; it lets the kernel quickly find all the vm_area_structs associated with this cached file (as mentioned earlier, address_space and vm_area_struct have a one-to-many relationship)
    4. The remaining fields mainly provide various locks and auxiliary functionality
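
To see how these fields fit together, the buffered read path in 2.6-era kernels roughly follows the pattern sketched below: consult the owner's address_space first, and only issue disk I/O on a miss. This is a simplified sketch based on real 2.6 interfaces such as find_get_page(), add_to_page_cache_lru() and the a_ops->readpage() callback, not a verbatim excerpt from the kernel; locking and error handling are omitted.

/* Simplified sketch of a page-cache read (kernel-style code,
 * not compilable outside a kernel tree). */
struct page *read_cached_page_sketch(struct address_space *mapping,
                                     struct file *file, pgoff_t index)
{
    struct page *page;

    page = find_get_page(mapping, index);        /* radix-tree lookup */
    if (page)
        return page;                             /* cache hit: no disk I/O */

    /* Cache miss: allocate a page, insert it into the page cache,
     * then ask the filesystem to fill it from disk. */
    page = page_cache_alloc_cold(mapping);
    if (!page)
        return NULL;

    if (add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
        page_cache_release(page);
        return NULL;
    }

    mapping->a_ops->readpage(file, page);        /* start the disk read */
    return page;
}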

In addition, a brief description of a data structure that has not appeared before, the radix tree, is given here.

The radix tree locates each node through bit operations on an unsigned long key, which is very efficient, so lookups are fast.

The radix-tree code in Linux lives in include/linux/radix-tree.h and lib/radix-tree.c.

Based on my own understanding, here is a simple explanation of the radix tree's structure and principles.

2.2.1 The definition of a radix tree node

/* source: lib/radix-tree.c */
struct radix_tree_node {
    unsigned int    height;     /* height of the radix tree */
    unsigned int    count;      /* number of child nodes of this node */
    struct rcu_head rcu_head;   /* RCU callback list */
    void            *slots[RADIX_TREE_MAP_SIZE];                     /* slot array */
    unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS]; /* slot tags */
};

Once the meaning of each field in radix_tree_node is understood, the radix tree itself is largely understood:

    • height - the height of the entire radix tree (that is, from the leaf nodes up to the root), not the height from the current node to the root
    • count - easy to understand: the number of child nodes of the current node; for a leaf node, count = 0
    • rcu_head - the list of RCU callbacks, triggered when RCU reclamation occurs
    • slots - each slot points to a child node (at the lowest level, a leaf node)
    • tags - mark whether the pages under each slot are dirty or under writeback

2.2.2 Each leaf node points to the cached page at the corresponding offset within the file

For example, for the offset range 0x000000 to 0x11111111, the height of the tree is 4. (The original post included a figure illustrating this, found on the Internet rather than drawn by the author.)

2.2.3 The leaf nodes of the radix tree correspond to binary integers, not strings, so comparisons are very fast

In practice, the key of a leaf node is the page's index within the address space (usually an unsigned long value).
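
The way a page index selects a path through the tree can be illustrated with plain integer arithmetic: at each level, a fixed group of bits of the index picks one slot. The toy program below assumes 6 index bits per level (64 slots per node, the common RADIX_TREE_MAP_SHIFT setting in 2.6 kernels); it only prints the slot chosen at each level and is not the kernel's actual lookup code.

#include <stdio.h>

#define MAP_SHIFT 6                              /* 6 bits per level -> 64 slots */
#define MAP_MASK  ((1UL << MAP_SHIFT) - 1)

/* Print which slot a page index selects at each level of a tree
 * of the given height. */
static void show_path(unsigned long index, unsigned int height)
{
    for (unsigned int level = 0; level < height; level++) {
        unsigned int shift = (height - 1 - level) * MAP_SHIFT;
        printf("level %u: slot %lu\n", level, (index >> shift) & MAP_MASK);
    }
}

int main(void)
{
    show_path(0x12345, 4);                       /* page index 0x12345, height 4 */
    return 0;
}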

3. Page write-back

Since the current Linux kernel uses the third "write cache" strategy (write-back), the timing of write-back is very important: writing back too often hurts performance, while writing back too rarely risks losing data.

3.1 Introduction

Write-back of the Linux page cache is performed by kernel threads (the flusher threads), and a write-back operation is triggered when any of the following three situations occurs.

1. When free memory falls below a threshold

When there is not enough free memory, part of the cache must be released. Since only clean (non-dirty) pages can be freed, dirty pages are first written back to disk so that they become clean and reclaimable.

2. When dirty pages have resided in memory longer than a threshold

This ensures that dirty pages do not stay in memory indefinitely, reducing the risk of data loss.

3. When a user process calls the sync() or fsync() system call

This gives user space a way to force write-back, for scenarios with strict write-back requirements.
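
From user space the third trigger looks like this: write() normally returns as soon as the data is in the page cache (the pages are merely marked dirty), while fsync() forces writeback of that file's dirty pages and blocks until it completes. A minimal example:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello, page cache\n";
    int fd = open("data.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* write() normally returns once the data is in the page cache;
     * the affected pages are merely marked dirty at this point. */
    if (write(fd, msg, strlen(msg)) < 0)
        perror("write");

    /* fsync() forces writeback of this file's dirty pages and metadata
     * and does not return until the device reports completion. */
    if (fsync(fd) < 0)
        perror("fsync");

    close(fd);
    return 0;
}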

The thresholds involved in page write-back can be found under /proc/sys/vm.

Some of the thresholds related to pdflush (one implementation of the flusher thread) are listed in the following table:

    Threshold                  Description
    dirty_background_ratio     As a percentage of total memory; when dirty pages reach this proportion, the pdflush threads begin writing them back in the background
    dirty_expire_interval      In hundredths of a second; how long data may remain dirty before it is written out by the next periodic pdflush write-back
    dirty_ratio                As a percentage of total memory; when a process generates dirty pages beyond this proportion, it starts writing them back itself
    dirty_writeback_interval   In hundredths of a second; how often a pdflush thread wakes up to perform periodic write-back
    laptop_mode                A Boolean value that controls laptop mode (see below)
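
These thresholds are ordinary files under /proc/sys/vm, so they can be read (and, with sufficient privileges, written) like any other file. A small example that reads the current dirty_background_ratio:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/dirty_background_ratio", "r");
    int ratio;

    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fscanf(f, "%d", &ratio) == 1)
        printf("background writeback starts at %d%% dirty memory\n", ratio);
    fclose(f);
    return 0;
}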

3.2 Implementation

The implementation of the flusher threads has changed as the kernel has evolved. Listed below are a few of the more typical implementations that have appeared during kernel development.

1. Laptop mode

The intent of this mode is to minimize the mechanical activity of the spinning hard drive, keeping the disk spun down for as long as possible to prolong battery life.

This mode is set through the /proc/sys/vm/laptop_mode file (0 = mode off, 1 = mode on).

2. bdflush and kupdated (the flusher implementation before version 2.6)

The bdflush kernel thread runs in the background, and there is only one bdflush thread in the system; it wakes up when free memory drops below a specific threshold.

kupdated runs periodically and writes back dirty pages.

Problems with bdflush:

Because the whole system has only one bdflush thread, when the write-back workload is heavy the bdflush thread may block on the I/O of a single disk, causing the write-back of other disks to be delayed.

3. pdflush (introduced in version 2.6)

The number of pdflush threads is dynamic, depending on the I/O load of the system. The pdflush threads are global: they serve all disks in the system.

Problems with pdflush:

The dynamic number of pdflush threads alleviates the bdflush problem to some extent. But because the pdflush threads serve all disks, it is still possible for all of them to become blocked on one congested disk, again delaying the write-back of other disks.

4. Flusher threads (introduced in version 2.6.32)

The flusher threads improve on the problems above:

First, there is more than one flusher thread, which avoids the bdflush problem.

Second, the flusher threads are not global across all disks; each flusher thread corresponds to one disk, which avoids the pdflush problem.

"Linux kernel design and implementation" reading notes (16)-page cache and page writeback

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.