Main content:
- Introduction to Caching
- Page cache
- Page Write-back
1. Introduction to Caching
In programming, caching is a very common and effective mechanism for improving program performance.
The Linux kernel is no exception: to improve I/O performance, it introduces a caching mechanism that keeps a portion of the on-disk data in memory.
1.1 Principle
The ability to improve I/O performance through caching rests on the following 2 important principles:
- The CPU accesses memory much faster than it accesses the disk (the gap is not small: several orders of magnitude)
- Once data has been accessed, it is likely to be accessed again in the near future (the principle of temporal locality)
1.2 Strategy
There is little to say about creating and reading the cache: check whether the data is already cached, and create or read the entry accordingly.
Writing to the cache and reclaiming the cache, however, need more thought, because they raise the problem of keeping the "cache contents" and the "disk contents" in sync.
1.2.1"Write Cache" There are 3 common strategies
- No-write (nowrite): do not cache write operations at all; when cached data is written, the write goes straight to disk and the cached copy is invalidated
- Write-through (write-through): update the disk and the cache at the same time on every write
- Write-back (also called copy-write or write-behind): write the data only to the cache, and let a separate process (the writeback process) synchronize it to disk at an appropriate time
The advantages and disadvantages of the 3 strategies are compared below (a small sketch contrasting write-through and write-back follows the table):
| Strategy | Complexity | Performance |
| --- | --- | --- |
| No-write | Simple | Caching helps reads only; performance degrades for write-heavy I/O |
| Write-through | Simple | Read performance improves; write performance drops a little (each write updates the cache in addition to the disk) |
| Write-back | Complex | Both read and write performance improve (the strategy the kernel currently uses) |
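To make the difference concrete, here is a minimal user-space sketch of write-through versus write-back. The cache_entry structure and the disk_write callback are hypothetical, purely for illustration; the kernel's real machinery is described in section 3.

```c
#include <string.h>

#define CACHE_SIZE 4096

/* Hypothetical cache entry used only for illustration. */
struct cache_entry {
    char data[CACHE_SIZE];
    int  dirty;                 /* set when the cache is newer than the disk */
};

/* Write-through: update the cache and the disk in the same call.
 * The write is slower (it waits for the disk) but nothing is ever lost. */
void write_through(struct cache_entry *e, const char *buf,
                   void (*disk_write)(const char *))
{
    memcpy(e->data, buf, CACHE_SIZE);
    disk_write(buf);            /* synchronous disk update */
}

/* Write-back: update only the cache and mark it dirty.
 * A separate writeback process calls flush() at a later, better time. */
void write_back(struct cache_entry *e, const char *buf)
{
    memcpy(e->data, buf, CACHE_SIZE);
    e->dirty = 1;               /* disk update deferred */
}

void flush(struct cache_entry *e, void (*disk_write)(const char *))
{
    if (e->dirty) {
        disk_write(e->data);
        e->dirty = 0;           /* cache and disk are in sync again */
    }
}
```

The write-back variant returns immediately, which is why it wins on write performance, at the cost of a window in which the data exists only in memory.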
1.2.2"Cache Recycle" policy
- Least Recently Used (LRU): each cached item carries a timestamp of its most recent access; when the cache is reclaimed, the oldest data is reclaimed first
- Two-list strategy (LRU/2): an improvement on LRU; see the supplemental notes below for details
Supplemental notes (two-list strategy):
The two-list strategy is an improved version of the LRU (Least Recently Used) algorithm.
It approximates LRU with 2 linked lists (an active list and an inactive list), which improves the performance of page reclaim.
When a page-reclaim pass runs, pages are reclaimed from the tail of the inactive list.
The key to the two-list strategy is how a page moves between the 2 lists.
In the two-list strategy, each page has 2 flag bits:
PG_active - marks whether the page is active, that is, whether it belongs on the active list
PG_referenced - marks whether the page has been accessed by a process
Pages move between the lists as follows (a small sketch of these rules follows the list):
- When a page is accessed for the first time, PG_active is set to 1 and the page is added to the active list
- When the page is accessed again, PG_referenced is set to 1; if the page is on the inactive list at that point, it is moved to the active list, PG_active is set to 1, and PG_referenced is cleared to 0
- A daemon in the system periodically scans the active list and clears each page's PG_referenced bit to 0
- When the daemon's periodic check finds a page whose PG_referenced is already 0, it sets the page's PG_active to 0 and moves the page to the inactive list
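Here is a minimal sketch of these movement rules. The names page_desc, page_accessed, and daemon_scan are hypothetical, made up for illustration; this is a simplification of what the kernel does with PG_active/PG_referenced, not the kernel's actual code.

```c
#include <stdbool.h>

/* Hypothetical page descriptor, for illustration only. */
enum which_list { NOT_CACHED, INACTIVE_LIST, ACTIVE_LIST };

struct page_desc {
    enum which_list list;
    bool pg_active;        /* stands in for the kernel's PG_active     */
    bool pg_referenced;    /* stands in for the kernel's PG_referenced */
};

/* Rules 1 and 2: called on every access to the page. */
void page_accessed(struct page_desc *p)
{
    if (p->list == NOT_CACHED) {
        /* first access: the page goes onto the active list */
        p->pg_active = true;
        p->list = ACTIVE_LIST;
    } else if (p->list == INACTIVE_LIST) {
        /* re-accessed while inactive: promote it */
        p->pg_active = true;
        p->pg_referenced = false;
        p->list = ACTIVE_LIST;
    } else {
        /* re-accessed while active: just remember the reference */
        p->pg_referenced = true;
    }
}

/* Rules 3 and 4: the periodic daemon scanning the active list. */
void daemon_scan(struct page_desc *p)
{
    if (p->pg_referenced) {
        p->pg_referenced = false;   /* clear it; the page gets another round */
    } else {
        p->pg_active = false;       /* untouched since the last scan: demote */
        p->list = INACTIVE_LIST;
    }
}
```

A page therefore needs two consecutive daemon scans without any access before it is demoted and becomes a reclaim candidate.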
Reference: page recycling and reverse mapping in Linux 2.6
2. Page Cache
The page cache, as the name suggests, caches data in memory pages: the smallest unit cached in the page cache is a memory page.
The data in such a page is not necessarily file-system data; it can come from any page-based object, including various kinds of files and memory mappings.
2.1 Introduction
What the page cache caches are specific physical pages, unlike the virtual memory areas (vm_area_struct) discussed in the previous section. Suppose a process creates several vm_area_structs that all point to the same file;
those vm_area_structs share a single copy in the page cache.
That is, when a file on disk is cached in memory, it can have multiple virtual memory addresses but only one physical memory address, as the small demonstration below shows.
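A minimal user-space demonstration: mapping the same file twice yields two different virtual addresses backed by the same page-cache copy. Any readable file will do; /etc/hostname here is just an example path.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* any existing readable file works; this path is only an example */
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0)
        return 1;

    /* two mappings of the same file: two vm_area_structs in this process */
    char *a = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    char *b = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED)
        return 1;

    /* two different virtual addresses, one physical page-cache page */
    printf("a=%p b=%p first byte: %c %c\n", (void *)a, (void *)b, a[0], b[0]);

    munmap(a, 4096);
    munmap(b, 4096);
    close(fd);
    return 0;
}
```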
To effectively improve I/O performance, the page cache must meet the following criteria:
- Quickly determine whether a required memory page is present in the cache
- Quickly locate dirty pages (pages whose data has been written but not yet synced to disk)
- Minimize the locking overhead when the page cache is accessed concurrently
The following analysis of the corresponding kernel structures shows how the kernel meets these goals.
2.2 Implementation
The most important structure in the page cache implementation is address_space, defined in <linux/fs.h>:

```c
struct address_space {
    struct inode            *host;              /* the inode that owns this address_space */
    struct radix_tree_root  page_tree;          /* radix tree holding all pages */
    spinlock_t              tree_lock;          /* spinlock protecting page_tree */
    unsigned int            i_mmap_writable;    /* count of VM_SHARED mappings */
    struct prio_tree_root   i_mmap;             /* tree of private mappings */
    struct list_head        i_mmap_nonlinear;   /* list of VM_NONLINEAR mappings */
    spinlock_t              i_mmap_lock;        /* spinlock protecting i_mmap */
    unsigned int            truncate_count;     /* truncate count */
    unsigned long           nrpages;            /* total number of pages */
    pgoff_t                 writeback_index;    /* offset where writeback starts */
    const struct address_space_operations *a_ops;  /* address_space operations table */
    unsigned long           flags;              /* gfp_mask and error flags */
    struct backing_dev_info *backing_dev_info;  /* read-ahead information */
    spinlock_t              private_lock;       /* private address_space spinlock */
    struct list_head        private_list;       /* private address_space list */
    struct address_space    *assoc_mapping;     /* associated buffers */
    struct mutex            unmap_mutex;        /* mutex protecting unmapped pages */
} __attribute__((aligned(sizeof(long))));
```
Additional notes:
- host - this field is NULL if the address_space is associated with a file that does not have an inode
- page_tree - this tree structure is crucial: it ensures that data in the page cache can be retrieved quickly and that dirty pages can be located quickly
- i_mmap - lets the kernel go from a vm_area_struct back to the associated cached file (i.e., the address_space) quickly; as mentioned above, address_space and vm_area_struct have a one-to-many relationship
- The remaining fields mainly provide various locks and auxiliary functionality (a simplified lookup sketch follows)
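As a sketch of how these fields cooperate, here is a simplified page-cache lookup, loosely modeled on the 2.6-era find_get_page(). It is not the kernel's exact code, just the shape of it: the page index within the file is the key into page_tree, and tree_lock guards the tree.

```c
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/radix-tree.h>

/* Minimal sketch: look up the cached page at a given index of a file. */
static struct page *page_cache_lookup(struct address_space *mapping,
                                      pgoff_t index)
{
    struct page *page;

    spin_lock(&mapping->tree_lock);          /* guards page_tree */
    page = radix_tree_lookup(&mapping->page_tree, index);
    if (page)
        page_cache_get(page);                /* pin the page */
    spin_unlock(&mapping->tree_lock);

    return page;                             /* NULL means a cache miss */
}
```

A NULL result is a cache miss: the caller must allocate a page, read the data from disk, and insert the page into the tree.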
In addition, a new data structure deserves a brief description here: the radix tree.
The radix tree locates each node with bit operations on a long integer key, which is very efficient, so lookups are fast.
The radix tree code in Linux lives in include/linux/radix-tree.h and lib/radix-tree.c.
Based on my own understanding, here is a brief explanation of the radix tree's structure and principles.
2.2.1 First, the definition of a radix tree node
```c
/* source: lib/radix-tree.c */
struct radix_tree_node {
    unsigned int    height;     /* height of the radix tree */
    unsigned int    count;      /* number of child nodes of this node */
    struct rcu_head rcu_head;   /* head of the RCU callback list */
    void            *slots[RADIX_TREE_MAP_SIZE];  /* slot array, one per child */
    unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS]; /* slot tags */
};
```
Once the meaning of each field in radix_tree_node is clear, the radix tree itself is mostly understood.
- height is the height of the entire radix tree (that is, from the leaves to the root), not the height from the current node to the root
- count is easy to understand: the number of children of the current node; leaf nodes have count = 0
- rcu_head is the list of callback functions triggered on RCU events
- slots: each slot points to one child node (or, at the bottom level, a leaf)
- tags mark whether a slot's child is dirty or under writeback (a small sketch of tag usage follows)
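As a sketch of how the tags answer the "quickly locate dirty pages" requirement from section 2.1, here is a simplified use of the radix-tree tag API, loosely modeled on what the kernel's dirty-marking path does. The helper names mark_index_dirty and has_dirty_pages are made up for illustration.

```c
#include <linux/fs.h>
#include <linux/radix-tree.h>

/* Tag one cached page's slot as dirty. The tag is propagated up the
 * tree, so a parent node's tag bit summarizes its whole subtree. */
static void mark_index_dirty(struct address_space *mapping, pgoff_t index)
{
    spin_lock(&mapping->tree_lock);
    radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
    spin_unlock(&mapping->tree_lock);
}

/* Thanks to tag propagation, writeback can ask cheaply at the root
 * whether any dirty page exists at all. */
static int has_dirty_pages(struct address_space *mapping)
{
    return radix_tree_tagged(&mapping->page_tree, PAGECACHE_TAG_DIRTY);
}
```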
2.2.2 Each leaf node points to the cached page at the corresponding offset within the file
For example, for the offset range 0x000000 to 0x11111111, the height of the tree is 4 (the accompanying figure was found on the Internet, not drawn by the author).
2.2.3 The keys at the leaves of the radix tree are binary integers, not strings, so comparisons are very fast
In fact, a leaf's key is the offset within the address space (usually a long). A self-contained sketch of the integer-key walk follows.
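The sketch below shows how such an integer key is sliced into per-level slot numbers when walking a tree of a given height, assuming 6 bits per level as in Linux's RADIX_TREE_MAP_SHIFT. The function show_path is illustrative only; the point is that the walk is pure bit arithmetic, with no string comparison anywhere.

```c
#include <stdio.h>

#define MAP_SHIFT 6                        /* 6 bits per level, as in Linux */
#define MAP_MASK  ((1UL << MAP_SHIFT) - 1)

/* Print the slot chosen at each level when looking up `index`
 * in a radix tree of the given height. */
void show_path(unsigned long index, unsigned int height)
{
    unsigned int shift = (height - 1) * MAP_SHIFT;

    printf("index %#lx:", index);
    while (height--) {
        printf(" slot %lu ->", (index >> shift) & MAP_MASK);
        shift -= MAP_SHIFT;
    }
    printf(" page\n");
}

int main(void)
{
    /* a height-4 tree covers 4 * 6 = 24 bits of file-offset index */
    show_path(0x123456, 4);
    return 0;
}
```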
3. Page Write-back
Since the current Linux kernel uses the 3rd strategy (write-back) for the write cache, the timing of writeback matters a great deal: writing back too often hurts performance, while writing back too rarely risks losing data.
3.1 Introduction
Writeback of the Linux page cache is performed by kernel threads (the flusher threads), which trigger a writeback operation in the following 3 situations.
1. When free memory falls below a threshold
When there is not enough free memory, part of the cache must be released. Since only clean pages can be freed, dirty pages are first written back to disk, turning them into clean pages that can then be released.
2. When dirty pages have resided in memory longer than a threshold
This ensures that dirty pages do not stay in memory indefinitely, reducing the risk of data loss.
3. When a user process calls the sync() or fsync() system calls
This gives users a way to force writeback, for scenarios with strict writeback requirements. (A small user-space example follows.)
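A minimal user-space example of scenario 3; the file name example.log is arbitrary.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("example.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    if (write(fd, "important record\n", 17) != 17)
        perror("write");

    /* Force writeback instead of waiting for the flusher threads:
     * fsync() blocks until this file's dirty pages reach the disk. */
    if (fsync(fd) < 0)
        perror("fsync");

    close(fd);

    /* sync() requests writeback of all dirty pages system-wide. */
    sync();
    return 0;
}
```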
The thresholds involved in page writeback can be found under /proc/sys/vm.
Those related to pdflush (one implementation of the flusher threads) are listed in the following table (a small example of reading one follows the table):
| Threshold | Description |
| --- | --- |
| dirty_background_ratio | As a percentage of total memory; when dirty pages reach this ratio, the pdflush threads begin writing them back |
| dirty_expire_interval | In hundredths of a second; how long data may remain dirty before the pdflush threads write it back on their periodic run |
| dirty_ratio | As a percentage of total memory; when a single process's dirty pages reach this ratio, the process begins writing them back itself |
| dirty_writeback_interval | In hundredths of a second; how often the pdflush threads wake up and run |
| laptop_mode | A Boolean value that controls laptop mode |
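A minimal example that reads one of these thresholds. Note that dirty_background_ratio is a real file under /proc/sys/vm, while some of the names in the table above appear as *_centisecs files on newer kernels, so check the file names on your own system.

```c
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/dirty_background_ratio", "r");
    int ratio;

    if (!f || fscanf(f, "%d", &ratio) != 1) {
        perror("dirty_background_ratio");
        return 1;
    }
    printf("background writeback starts at %d%% dirty memory\n", ratio);
    fclose(f);
    return 0;
}
```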
3.2 Implementation
The implementation of the flusher threads has changed as the kernel evolved. Listed below are a few typical implementations that have appeared during kernel development.
1. Laptop mode
The intent of this mode is to minimize mechanical disk activity, letting the disk stay spun down for as long as possible to extend battery life.
The mode is set through the /proc/sys/vm/laptop_mode file (0 - off, 1 - on).
2. bdflush and kupdated (the flusher implementation before version 2.6)
The bdflush kernel thread runs in the background, and there is only one bdflush thread in the system; it wakes up when free memory falls below a specific threshold.
kupdated runs periodically and writes back dirty pages.
Problem with bdflush:
The entire system has only one bdflush thread, so when the system's writeback load is heavy, the bdflush thread can block on I/O to a single disk,
delaying the writeback of the other disks.
3. pdflush (introduced in version 2.6)
The number of pdflush threads is dynamic, depending on the system's I/O load, and the threads serve all disks in the system globally.
Problem with pdflush:
The dynamic thread count relieves the bdflush problem to some extent. But since every pdflush thread serves all disks,
several pdflush threads can all end up blocked on one congested disk, again delaying the writeback of the other disks.
4. Flusher threads (introduced after version 2.6.32)
The flusher threads improve on the problems above:
First, there is more than one flusher thread, which avoids the bdflush problem.
Second, flusher threads are not global across all disks; each flusher thread corresponds to one disk, which avoids the pdflush problem.
"Linux kernel design and implementation" reading notes (16)-page cache and page writeback