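The page cache is explained in detail in section 5.6, "Writing and reading files", of the book Linux Kernel Scenario Analysis (《linux核心情景分析》); the notes below are excerpted from it.
The filesystem layer has three main data structures: the file structure, the dentry structure, and the inode structure.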
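file structure: represents one context on the target file. Different processes can establish different contexts on the same file, and a single process can also establish several contexts by opening the same file more than once. The buffer queue therefore cannot be placed in the file structure, because these file structures are not shared with one another.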
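dentry structure: this is the file name structure. Through hard links, several dentry structures can correspond to one file (a symbolic link, by contrast, is a separate file with its own inode). Dentries and files are therefore not in a one-to-one relationship either, so the buffer queue cannot be placed in this structure.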
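inode structure: that leaves only the inode. An inode and a file are in a one-to-one relationship; one can say that the inode is the file. The inode structure has an i_mapping pointer that points to an address_space data structure, which is normally inode->i_data. The buffer queue lives in that data structure.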
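The many-to-one relationship described above can be observed from user space. The following is a minimal sketch, not kernel code: it opens one file twice and once more through a hard link, then checks with fstat() that all three descriptors resolve to the same inode. The function names and the paths passed in are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if both descriptors refer to the same inode, 0 if not, -1 on error. */
int same_inode(int fd_a, int fd_b)
{
    struct stat a, b;

    if (fstat(fd_a, &a) != 0 || fstat(fd_b, &b) != 0)
        return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/*
 * Creates `path`, opens it a second time, and opens it a third time
 * through the hard link `link_path` (both names are supplied by the caller).
 * Returns 1 if all three open contexts share one inode, 0 if they do not,
 * and -1 if any system call fails. Both names are removed before returning.
 */
int contexts_share_inode(const char *path, const char *link_path)
{
    int fd1, fd2, fd3 = -1;
    int result = -1;

    assert(path != NULL && link_path != NULL);

    fd1 = open(path, O_CREAT | O_RDWR, 0600);   /* first context */
    fd2 = open(path, O_RDONLY);                 /* second context, same name */

    unlink(link_path);                          /* drop a stale link, if any */
    if (fd1 >= 0 && fd2 >= 0 && link(path, link_path) == 0)
        fd3 = open(link_path, O_RDONLY);        /* third context, second name */

    if (fd3 >= 0)
        result = same_inode(fd1, fd2) == 1 && same_inode(fd1, fd3) == 1;

    if (fd1 >= 0) close(fd1);
    if (fd2 >= 0) close(fd2);
    if (fd3 >= 0) close(fd3);
    unlink(link_path);
    unlink(path);
    return result;
}
```

Calling contexts_share_inode() with two paths on the same filesystem is expected to return 1 on a typical Linux filesystem: three open contexts and two names, but a single inode, which is why the cached pages hang off the inode.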
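What hangs on the buffer queue is memory pages, not record blocks. So when a process calls mmap() to map a file into its user space, it only has to set up the corresponding page table entries, and these cached pages are mapped naturally into the process's user space. That is why the pointer is named i_mapping.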
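A small user-space sketch can make this visible: data written with write() is immediately readable through a MAP_SHARED mapping of the same file, since both paths go through the file's cached pages. The function name and the path are assumptions made for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Writes `msg` to `path` with write(), then maps the file with mmap()
 * and compares the mapped bytes with `msg`.
 * Returns 1 if they match, 0 if they differ, -1 on error.
 */
int mmap_sees_written_data(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    int result = -1;
    int fd;

    assert(len > 0);                    /* mmap() rejects a zero length */

    fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, msg, len) == (ssize_t)len) {
        /* MAP_SHARED maps the file's pages themselves, not a private copy. */
        char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            result = memcmp(p, msg, len) == 0;
            munmap(p, len);
        }
    }

    close(fd);
    unlink(path);
    return result;
}
```

No fsync() is needed between the write and the mapping, because the comparison reads the cached pages rather than the disk.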
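The radix tree concept also needs to be understood here. First look at the figure (taken from Professional Linux Kernel Architecture, 《深入linux核心架構》).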
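A radix tree is not a balanced tree. The tree itself is made up of two different data structures: the root and the non-leaf nodes. The root is a simple data structure that holds the height of the tree and a pointer to the first node of the tree. A node is essentially an array: count is the number of pointers in use in that node, and the remaining entries are pointers to nodes on the next level down. The leaves are pointers to page structures.
The nodes also carry search tags, such as a dirty tag and a writeback tag, so that the parts of the tree containing tagged pages can be located quickly.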
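The description above can be sketched as a simplified user-space model. This is not the kernel's radix tree: the names (rroot, rnode, rtree_*) are hypothetical, each node has 64 slots by assumption, only a dirty tag is modelled, and nodes are never freed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAP_SHIFT 6                      /* assumed: 64 slots per node */
#define MAP_SIZE  (1UL << MAP_SHIFT)
#define MAP_MASK  (MAP_SIZE - 1)

struct rnode {
    unsigned int count;                  /* number of slots in use */
    uint64_t dirty;                      /* one dirty tag bit per slot */
    void *slots[MAP_SIZE];               /* child nodes, or pages in the lowest level */
};

struct rroot {
    unsigned int height;                 /* number of levels, 0 for an empty tree */
    struct rnode *node;                  /* first node of the tree */
};

/* Largest index a tree of the given height can address. */
static unsigned long max_index(unsigned int height)
{
    if (height * MAP_SHIFT >= 8 * sizeof(unsigned long))
        return ~0UL;
    return (1UL << (height * MAP_SHIFT)) - 1;
}

/* Stores `page` at `index`. Returns 0 on success, -1 if out of memory. */
int rtree_insert(struct rroot *root, unsigned long index, void *page)
{
    struct rnode *node;
    unsigned long off;

    assert(page != NULL);
    /* Grow upward until the tree is tall enough to address `index`. */
    while (root->height == 0 || index > max_index(root->height)) {
        struct rnode *n = calloc(1, sizeof(*n));
        if (!n)
            return -1;
        if (root->node) {
            n->slots[0] = root->node;    /* the old tree becomes child 0 */
            n->count = 1;
            n->dirty = root->node->dirty ? 1 : 0;
        }
        root->node = n;
        root->height++;
    }
    /* Walk down, creating missing interior nodes on the way. */
    node = root->node;
    for (unsigned int level = root->height; level > 1; level--) {
        off = (index >> ((level - 1) * MAP_SHIFT)) & MAP_MASK;
        if (!node->slots[off]) {
            node->slots[off] = calloc(1, sizeof(struct rnode));
            if (!node->slots[off])
                return -1;
            node->count++;
        }
        node = node->slots[off];
    }
    off = index & MAP_MASK;
    if (!node->slots[off])
        node->count++;
    node->slots[off] = page;
    return 0;
}

/* Returns the page stored at `index`, or NULL if there is none. */
void *rtree_lookup(const struct rroot *root, unsigned long index)
{
    const struct rnode *node = root->node;

    if (root->height == 0 || index > max_index(root->height))
        return NULL;
    for (unsigned int level = root->height; level > 1 && node; level--)
        node = node->slots[(index >> ((level - 1) * MAP_SHIFT)) & MAP_MASK];
    return node ? node->slots[index & MAP_MASK] : NULL;
}

/* Sets the dirty tag on every node along the path to `index`. */
int rtree_tag_dirty(struct rroot *root, unsigned long index)
{
    struct rnode *node = root->node;

    if (!rtree_lookup(root, index))
        return -1;
    for (unsigned int level = root->height; level >= 1; level--) {
        unsigned long off = (index >> ((level - 1) * MAP_SHIFT)) & MAP_MASK;
        node->dirty |= (uint64_t)1 << off;
        if (level > 1)
            node = node->slots[off];
    }
    return 0;
}

/* A single test on the first node tells whether any page is dirty. */
int rtree_any_dirty(const struct rroot *root)
{
    return root->node != NULL && root->node->dirty != 0;
}
```

Because a tag is set on every node along the path, a search for dirty pages can skip any subtree whose tag bit is clear, which is the point of keeping the tags in the nodes rather than only in the pages.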
Buffer cache
Structurally, the buffer cache consists of two parts:
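1. Buffer heads: these hold all the management data related to the state of the buffer, such as the block number, block length, and access counter. The buffer's data is not stored directly after the buffer head; it sits in a separate area of physical memory that a pointer in the buffer head points to.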
2. The useful data is stored in specially allocated pages, which may also reside in the page cache at the same time.
Buffer head:
/*
 * Historically, a buffer_head was used to map a single block
 * within a page, and of course as the unit of I/O through the
 * filesystem and block layers.  Nowadays the basic I/O unit
 * is the bio, and buffer_heads are used for extracting block
 * mappings (via a get_block_t call), for tracking state within
 * a page (via a page_mapping) and for wrapping bio submission
 * for backward compatibility reasons (e.g. submit_bh).
 */
struct buffer_head {
    unsigned long b_state;              /* buffer state bitmap (see above) */ // buffer state flags, listed below
    struct buffer_head *b_this_page;    /* circular list of page's buffers */ // points to the next buffer head
    struct page *b_page;                /* the page this bh is mapped to */ // descriptor of the page that owns this block buffer

    sector_t b_blocknr;                 /* start block number */ // logical block number on the block device
    size_t b_size;                      /* size of mapping */ // block size
    char *b_data;                       /* pointer to data within the page */ // position of the block inside the buffer page

    struct block_device *b_bdev;        // points to the block device descriptor
    bh_end_io_t *b_end_io;              /* I/O completion */ // I/O completion callback
    void *b_private;                    /* reserved for b_end_io */ // data argument for the I/O completion callback
    struct list_head b_assoc_buffers;   /* associated with another mapping */
    struct address_space *b_assoc_map;  /* mapping this buffer is associated with */
    atomic_t b_count;                   /* users using this buffer_head */ // usage counter of the block
};
Common buffer head flags
enum bh_state_bits {
    BH_Uptodate,      /* Contains valid data */ // the buffer contains valid data
    BH_Dirty,         /* Is dirty */ // the buffer is dirty
    BH_Lock,          /* Is locked */ // the buffer is locked
    BH_Req,           /* Has been submitted for I/O */ // a data transfer has been requested for the buffer
    BH_Uptodate_Lock, /* Used by the first bh in a page, to serialise
                       * IO completion of other buffers in the page
                       */
    BH_Mapped,        /* Has a disk mapping */ // b_bdev and b_blocknr are valid
    BH_New,           /* Disk mapping was newly created by get_block */ // just allocated, not accessed yet
    BH_Async_Read,    /* Is under end_buffer_async_read I/O */ // the buffer is being read asynchronously
    BH_Async_Write,   /* Is under end_buffer_async_write I/O */ // the buffer is being written asynchronously
    BH_Delay,         /* Buffer is not yet allocated on disk */ // no space allocated on disk for the buffer yet
    BH_Boundary,      /* Block is followed by a discontiguity */
    BH_Write_EIO,     /* I/O error on write */ // I/O error
    BH_Unwritten,     /* Buffer is allocated on disk but not written */
    BH_Quiet,         /* Buffer Error Prinks to be quiet */
    BH_Meta,          /* Buffer contains metadata */
    BH_Prio,          /* Buffer should be submitted with REQ_PRIO */

    BH_PrivateStart,  /* not a state bit, but the first bit available
                       * for private allocation by other entities
                       */
};
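Each enumerator above is a bit position inside b_state, not a mask. The following user-space sketch shows how such flags are set, cleared, and tested; the names (flag_bh, flag_set, and so on) are hypothetical and only a subset of the flags is reproduced. The kernel itself generates per-flag helpers such as set_buffer_dirty() and buffer_dirty() with its BUFFER_FNS macro and uses atomic bit operations, which this sketch does not.

```c
#include <assert.h>
#include <limits.h>

/* A reduced copy of the flag list; the values are bit numbers. */
enum flag_bits { FLAG_Uptodate, FLAG_Dirty, FLAG_Lock, FLAG_Req };

struct flag_bh {
    unsigned long b_state;               /* one bit per flag */
};

/* Sets flag number `bit` (non-atomic, for illustration only). */
void flag_set(struct flag_bh *bh, int bit)
{
    assert(bit >= 0 && (unsigned)bit < sizeof(bh->b_state) * CHAR_BIT);
    bh->b_state |= 1UL << bit;
}

/* Clears flag number `bit`. */
void flag_clear(struct flag_bh *bh, int bit)
{
    assert(bit >= 0 && (unsigned)bit < sizeof(bh->b_state) * CHAR_BIT);
    bh->b_state &= ~(1UL << bit);
}

/* Returns 1 if flag number `bit` is set, 0 otherwise. */
int flag_test(const struct flag_bh *bh, int bit)
{
    assert(bit >= 0 && (unsigned)bit < sizeof(bh->b_state) * CHAR_BIT);
    return (int)((bh->b_state >> bit) & 1UL);
}
```

Several flags can be set at once, so a buffer may be up to date and dirty at the same time, for example after valid data has been modified in memory but not yet written back.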
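If a page is used as a buffer page, all the buffer heads for its block buffers are collected in a singly linked circular list. The private field of the buffer page's descriptor points to the buffer head of the first block in the page, and each buffer head's b_this_page field points to the next buffer head in the list. Each buffer head's b_page field points to the descriptor of the buffer page it belongs to.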
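As the figure above shows, one buffer page corresponds to 4 buffers, which is how the page cache and the buffer cache are unified. A change made through a buffer or through the buffer page is visible through the other.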
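The layout can be sketched in user space as follows. This is a simplified model with hypothetical names (toy_page, toy_bh, toy_attach_buffers); it assumes a 4096-byte page, and it allocates the buffer heads as one array for brevity, which is a shortcut of this sketch rather than what the kernel does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096               /* assumed page size */

struct toy_page;

struct toy_bh {
    struct toy_bh *b_this_page;          /* next buffer head of the same page */
    struct toy_page *b_page;             /* page this buffer belongs to */
    char *b_data;                        /* where the block sits inside the page */
    size_t b_size;                       /* block size */
};

struct toy_page {
    char data[TOY_PAGE_SIZE];            /* stands in for the page's memory */
    struct toy_bh *priv;                 /* models page->private: first buffer head */
};

/*
 * Splits `page` into blocks of `block_size` bytes and links one buffer
 * head per block into a circular list.
 * Returns the number of buffers, or -1 if out of memory.
 */
int toy_attach_buffers(struct toy_page *page, size_t block_size)
{
    size_t n;
    struct toy_bh *bhs;

    assert(block_size > 0 && TOY_PAGE_SIZE % block_size == 0);
    n = TOY_PAGE_SIZE / block_size;
    bhs = calloc(n, sizeof(*bhs));
    if (!bhs)
        return -1;

    for (size_t i = 0; i < n; i++) {
        bhs[i].b_page = page;
        bhs[i].b_size = block_size;
        bhs[i].b_data = page->data + i * block_size;
        bhs[i].b_this_page = &bhs[(i + 1) % n];   /* last one points back to the first */
    }
    page->priv = &bhs[0];
    return (int)n;
}

/* Walks the circular list once and counts the buffer heads. */
size_t toy_count_buffers(const struct toy_page *page)
{
    const struct toy_bh *bh = page->priv;
    size_t n = 0;

    if (!bh)
        return 0;
    do {
        n++;
        bh = bh->b_this_page;
    } while (bh != page->priv);
    return n;
}
```

With a block size of 1024 this yields the 4 buffers per page mentioned above. Since b_data points into the page itself, a byte written through a buffer head is the same byte seen through the page.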
address_space structure:
struct address_space {
    struct inode *host;                 /* owner: inode, block_device */ // inode of the file that owns this mapping
    struct radix_tree_root page_tree;   /* radix tree of all pages */ // root of the radix tree
    spinlock_t tree_lock;               /* and lock protecting it */ // lock for the radix tree
    unsigned int i_mmap_writable;       /* count VM_SHARED mappings */ // number of VM_SHARED mappings
    struct rb_root i_mmap;              /* tree of private and shared mappings */ // tree of private and shared mappings
    struct list_head i_mmap_nonlinear;  /* list VM_NONLINEAR mappings */ // list of nonlinear mappings
    struct mutex i_mmap_mutex;          /* protect tree, count, list */ // mutex protecting the mapping tree
    /* Protected by tree_lock together with the radix tree */
    unsigned long nrpages;              /* number of total pages */ // total number of pages
    pgoff_t writeback_index;            /* writeback starts here */ // where writeback starts
    const struct address_space_operations *a_ops; /* methods */ // table of function pointers
    unsigned long flags;                /* error bits/gfp mask */ // error bits and gfp mask
    struct backing_dev_info *backing_dev_info; /* device readahead, etc */ // device readahead information
    spinlock_t private_lock;            /* for use by the address_space */
    struct list_head private_list;      /* ditto */
    void *private_data;                 /* ditto */
} __attribute__((aligned(sizeof(long))));
The fields struct inode *host and struct radix_tree_root page_tree are what tie the file to its memory pages.
struct address_space_operations {
    int (*writepage)(struct page *page, struct writeback_control *wbc); // write: from the page to the owner's disk image
    int (*readpage)(struct file *, struct page *); // read: from the owner's disk image into the page

    /* Write back some dirty pages from this mapping. */
    int (*writepages)(struct address_space *, struct writeback_control *); // write a given number of the owner's dirty pages back to disk

    /* Set a page dirty.  Return true if this dirtied it */
    int (*set_page_dirty)(struct page *page); // mark one of the owner's pages dirty

    int (*readpages)(struct file *filp, struct address_space *mapping,
            struct list_head *pages, unsigned nr_pages); // read a list of the owner's pages from disk

    int (*write_begin)(struct file *, struct address_space *mapping,
            loff_t pos, unsigned len, unsigned flags,
            struct page **pagep, void **fsdata);
    int (*write_end)(struct file *, struct address_space *mapping,
            loff_t pos, unsigned len, unsigned copied,
            struct page *page, void *fsdata);

    /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
    sector_t (*bmap)(struct address_space *, sector_t);
    void (*invalidatepage) (struct page *, unsigned long);
    int (*releasepage) (struct page *, gfp_t);
    void (*freepage)(struct page *);
    ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
            loff_t offset, unsigned long nr_segs);
    int (*get_xip_mem)(struct address_space *, pgoff_t, int,
            void **, unsigned long *);
    /*
     * migrate the contents of a page to the specified target. If sync
     * is false, it must not block.
     */
    int (*migratepage) (struct address_space *,
            struct page *, struct page *, enum migrate_mode);
    int (*launder_page) (struct page *);
    int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
            unsigned long);
    int (*error_remove_page)(struct address_space *, struct page *);

    /* swapfile support */
    int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
            sector_t *span);
    void (*swap_deactivate)(struct file *file);
};
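The operations table is a plain set of function pointers that generic code calls without knowing which filesystem is behind the mapping. The sketch below models that dispatch in user space with only two operations; every name in it (model_page, model_mapping, model_aops, ram_readpage, ram_writepage, model_flush) is hypothetical, and a small in-memory array stands in for the disk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16               /* tiny "page", for illustration only */

struct model_page {
    char data[MODEL_PAGE_SIZE];
    int dirty;                           /* 1 if data differs from the backing store */
};

struct model_mapping;

/* Reduced operations table: just the read and write hooks. */
struct model_aops {
    int (*readpage)(struct model_mapping *, struct model_page *);
    int (*writepage)(struct model_mapping *, struct model_page *);
};

struct model_mapping {
    const struct model_aops *a_ops;      /* methods supplied by the "filesystem" */
    char backing[MODEL_PAGE_SIZE];       /* stands in for the disk image */
};

/* One possible implementation: copy from the backing store into the page. */
static int ram_readpage(struct model_mapping *m, struct model_page *page)
{
    memcpy(page->data, m->backing, MODEL_PAGE_SIZE);
    page->dirty = 0;
    return 0;
}

/* Copy the page out to the backing store and mark it clean. */
static int ram_writepage(struct model_mapping *m, struct model_page *page)
{
    memcpy(m->backing, page->data, MODEL_PAGE_SIZE);
    page->dirty = 0;
    return 0;
}

const struct model_aops ram_aops = {
    .readpage  = ram_readpage,
    .writepage = ram_writepage,
};

/*
 * Generic code: writes the page back only if it is dirty, and only
 * through the table, never by calling an implementation directly.
 * Returns 0 on success or if nothing had to be written, -1 if the
 * mapping provides no writepage method.
 */
int model_flush(struct model_mapping *m, struct model_page *page)
{
    assert(m != NULL && page != NULL);
    if (!page->dirty)
        return 0;
    if (!m->a_ops || !m->a_ops->writepage)
        return -1;
    return m->a_ops->writepage(m, page);
}
```

A different backing store would supply its own table with the same signatures, and model_flush() would work unchanged, which is the design the a_ops pointer in address_space makes possible.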