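1. What is bdi?
bdi is short for backing device info. As the name suggests, it holds descriptive information about a backing storage device, and in the kernel source it is represented by the structure backing_dev_info.
A backing storage device is, put simply, a device that can store data and keeps that data even after the computer is powered off. By that definition, floppy drives, optical drives, USB storage devices, and hard disks are all backing storage devices (referred to below as bdi devices), while main memory clearly is not.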
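2. The bdi working model
Compared with memory, bdi devices (the hard disk being the most common example) are very slow to read and write. To improve overall system performance, Linux therefore buffers the data read from and written to bdi devices: the data is held temporarily in memory so that not every operation has to touch the device directly. The buffered data must be synchronized to the bdi device at certain points (for example every 5 seconds, or when dirty data reaches a certain ratio); otherwise data that stays in memory for a long time can be lost, for instance if the machine crashes or reboots unexpectedly. The kernel thread that performed this periodic synchronization used to be called pdflush. Later, in the 2.6.2x/3x kernels, the mechanism was reworked and replaced by several kernel threads: bdi-default, flush-x:y, and so on.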
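The two example triggers mentioned above (a time interval and a dirty-data ratio) can be sketched as a small userspace predicate. This is only an illustration: writeback_due and all of its parameters are hypothetical names, and the kernel's real decision logic is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model, not kernel code.
 * Returns true when buffered data should be written back, either because
 * the periodic interval has elapsed or because the share of dirty pages
 * has reached dirty_ratio_pct percent of total_pages.
 * An interval_ms of 0 disables the periodic trigger in this model.
 */
bool writeback_due(unsigned long ms_since_last_flush, unsigned long interval_ms,
                   unsigned long dirty_pages, unsigned long total_pages,
                   unsigned int dirty_ratio_pct)
{
    assert(dirty_ratio_pct <= 100);

    if (interval_ms != 0 && ms_since_last_flush >= interval_ms)
        return true;

    if (dirty_pages == 0 || total_pages == 0)
        return false;

    /* dirty/total >= pct/100, written without floating point.
     * Assumes the page counts are small enough not to overflow. */
    return dirty_pages * 100UL >= total_pages * (unsigned long)dirty_ratio_pct;
}
```

With a 5000 ms interval and a 10 percent ratio, writeback_due(1000, 5000, 50, 1000, 10) is false, while raising dirty_pages to 100 or letting 5000 ms elapse makes it true.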
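We will not say more about the old pdflush; here we discuss only bdi-default and flush-x:y. These threads (there can in fact be several flush-x:y threads) stand in a parent-child relationship: bdi-default creates or destroys flush-x:y threads according to the current state. Here x is the major number of the block device (its device type) and y is the minor number (the index of the device within that type). With two TF cards, for example, the threads are flush-179:0 and flush-179:1.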
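To make the naming scheme concrete, here is a minimal userspace sketch that builds a flush-x:y name from a major and a minor number. The helper format_flusher_name is a hypothetical name introduced for illustration; in the kernel the name is produced from the device name by the thread creation call.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not a kernel function.
 * Writes "flush-<major>:<minor>" into buf.
 * Returns 0 on success, -1 if buf is too small for the full name.
 */
int format_flusher_name(char *buf, size_t len,
                        unsigned int major_nr, unsigned int minor_nr)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "flush-%u:%u", major_nr, minor_nr);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling format_flusher_name(name, sizeof name, 179, 1) on a 32-byte buffer fills it with flush-179:1, matching the second TF card in the example above.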
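A Linux system usually has many bdi devices attached. When a bdi device is registered (function: bdi_register(…)), it is linked into the global list bdi_list. One bdi device is special: the default bdi device (default_backing_dev_info). Besides being added to bdi_list, it also starts the bdi-default kernel thread, the main subject of this article. The code follows; the key calls to notice are kthread_run and list_add_tail_rcu.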
struct backing_dev_info default_backing_dev_info = {
    .name         = "default",
    .ra_pages     = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
    .state        = 0,
    .capabilities = BDI_CAP_MAP_COPY,
};
EXPORT_SYMBOL_GPL(default_backing_dev_info);
static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi)
{
    return bdi == &default_backing_dev_info;
}

int bdi_register(struct backing_dev_info *bdi, struct device *parent,
                 const char *fmt, ...)
{
    va_list args;
    struct device *dev;

    if (bdi->dev)    /* The driver needs to use separate queues per device */
        return 0;

    va_start(args, fmt);
    dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args);
    va_end(args);
    if (IS_ERR(dev))
        return PTR_ERR(dev);

    bdi->dev = dev;

    /*
     * Just start the forker thread for our default backing_dev_info,
     * and add other bdi's to the list. They will get a thread created
     * on-demand when they need it.
     */
    if (bdi_cap_flush_forker(bdi)) {
        struct bdi_writeback *wb = &bdi->wb;

        wb->task = kthread_run(bdi_forker_thread, wb, "bdi-%s",
                               dev_name(dev));
        if (IS_ERR(wb->task))
            return PTR_ERR(wb->task);
    }

    bdi_debug_register(bdi, dev_name(dev));
    set_bit(BDI_registered, &bdi->state);

    spin_lock_bh(&bdi_lock);
    list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
    spin_unlock_bh(&bdi_lock);

    trace_writeback_bdi_register(bdi);
    return 0;
}
EXPORT_SYMBOL(bdi_register);
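The registration rule above (every bdi joins the list, only the default bdi gets a forker thread) can be modelled in a few lines of userspace C. Everything here is an assumption made for illustration: bdi_model, bdi_registry, and bdi_register_model are hypothetical names, a fixed-size array stands in for bdi_list, and a boolean stands in for actually starting a thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BDI 16

/* Hypothetical stand-in for struct backing_dev_info. */
struct bdi_model {
    const char *name;
    bool registered;   /* models "bdi->dev is already set" */
    bool has_forker;   /* true only for the default bdi */
};

/* Hypothetical stand-in for the global bdi_list. */
struct bdi_registry {
    struct bdi_model *list[MAX_BDI];
    size_t count;
    struct bdi_model *default_bdi;  /* models &default_backing_dev_info */
};

/* Returns 0 on success or if already registered, -1 if the list is full. */
int bdi_register_model(struct bdi_registry *reg, struct bdi_model *bdi)
{
    assert(reg != NULL && bdi != NULL);

    if (bdi->registered)            /* same early return as "if (bdi->dev)" */
        return 0;
    if (reg->count >= MAX_BDI)
        return -1;

    if (bdi == reg->default_bdi)    /* only the default bdi starts a forker */
        bdi->has_forker = true;

    bdi->registered = true;
    reg->list[reg->count++] = bdi;  /* every bdi is appended to the list */
    return 0;
}
```

Registering a default bdi and an ordinary disk leaves both on the list, but only the default one has has_forker set; registering the disk a second time changes nothing.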
Next, follow into bdi_forker_thread, which is the body of the bdi-default kernel thread:
static int bdi_forker_thread(void *ptr)
{
    struct bdi_writeback *me = ptr;

    current->flags |= PF_SWAPWRITE;
    set_freezable();

    /*
     * Our parent may run at a different priority, just set us to normal
     */
    set_user_nice(current, 0);

    for (;;) {
        struct task_struct *task = NULL;
        struct backing_dev_info *bdi;
        enum {
            NO_ACTION,   /* Nothing to do */
            FORK_THREAD, /* Fork bdi thread */
            KILL_THREAD, /* Kill inactive bdi thread */
        } action = NO_ACTION;

        /*
         * Temporary measure, we want to make sure we don't see
         * dirty data on the default backing_dev_info
         */
        if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list)) {
            del_timer(&me->wakeup_timer);
            wb_do_writeback(me, 0);
        }

        spin_lock_bh(&bdi_lock);
        /*
         * In the following loop we are going to check whether we have
         * some work to do without any synchronization with tasks
         * waking us up to do work for them. Set the task state here
         * so that we don't miss wakeups after verifying conditions.
         */
        set_current_state(TASK_INTERRUPTIBLE);

        /* Walk all bdi objects and check each for dirty data; a bdi with
         * dirty data needs a thread forked for it to perform writeback */
        list_for_each_entry(bdi, &bdi_list, bdi_list) {
            bool have_dirty_io;

            if (!bdi_cap_writeback_dirty(bdi) ||
                bdi_cap_flush_forker(bdi))
                continue;

            WARN(!test_bit(BDI_registered, &bdi->state),
                 "bdi %p/%s is not registered!\n", bdi, bdi->name);

            /* Check whether there is dirty data */
            have_dirty_io = !list_empty(&bdi->work_list) ||
                            wb_has_dirty_io(&bdi->wb);

            /*
             * If the bdi has work to do, but the thread does not
             * exist - create it.
             */
            if (!bdi->wb.task && have_dirty_io) {
                /*
                 * Set the pending bit - if someone will try to
                 * unregister this bdi - it'll wait on this bit.
                 */
                /* Dirty data exists but no thread does, so fork one next */
                set_bit(BDI_pending, &bdi->state);
                action = FORK_THREAD;
                break;
            }

            spin_lock(&bdi->wb_lock);

            /*
             * If there is no work to do and the bdi thread was
             * inactive long enough - kill it. The wb_lock is taken
             * to make sure no-one adds more work to this bdi and
             * wakes the bdi thread up.
             */
            /* If a bdi has had no dirty data for a long time, kill the
             * writeback thread that belongs to it */
            if (bdi->wb.task && !have_dirty_io &&
                time_after(jiffies, bdi->wb.last_active +
                                    bdi_longest_inactive())) {
                task = bdi->wb.task;
                bdi->wb.task = NULL;
                spin_unlock(&bdi->wb_lock);
                set_bit(BDI_pending, &bdi->state);
                action = KILL_THREAD;
                break;
            }
            spin_unlock(&bdi->wb_lock);
        }
        spin_unlock_bh(&bdi_lock);

        /* Keep working if default bdi still has things to do */
        if (!list_empty(&me->bdi->work_list))
            __set_current_state(TASK_RUNNING);

        /* Perform the FORK or KILL action */
        switch (action) {
        case FORK_THREAD:
            /* Fork a bdi_writeback_thread named flush-major:minor */
            __set_current_state(TASK_RUNNING);
            task = kthread_create(bdi_writeback_thread, &bdi->wb,
                                  "flush-%s", dev_name(bdi->dev));
            if (IS_ERR(task)) {
                /*
                 * If thread creation fails, force writeout of
                 * the bdi from the thread. Hopefully 1024 is
                 * large enough for efficient IO.
                 */
                writeback_inodes_wb(&bdi->wb, 1024,
                                    WB_REASON_FORKER_THREAD);
            } else {
                /*
                 * The spinlock makes sure we do not lose
                 * wake-ups when racing with 'bdi_queue_work()'.
                 * And as soon as the bdi thread is visible, we
                 * can start it.
                 */
                spin_lock_bh(&bdi->wb_lock);
                bdi->wb.task = task;
                spin_unlock_bh(&bdi->wb_lock);
                wake_up_process(task);
            }
            bdi_clear_pending(bdi);
            break;

        case KILL_THREAD:
            /* Kill a thread */
            __set_current_state(TASK_RUNNING);
            kthread_stop(task);
            bdi_clear_pending(bdi);
            break;

        case NO_ACTION:
            /* Nothing to do, so put this thread to sleep for a while */
            if (!wb_has_dirty_io(me) || !dirty_writeback_interval)
                /*
                 * There are no dirty data. The only thing we
                 * should now care about is checking for
                 * inactive bdi threads and killing them. Thus,
                 * let's sleep for longer time, save energy and
                 * be friendly for battery-driven devices.
                 */
                schedule_timeout(bdi_longest_inactive());
            else
                schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
            try_to_freeze();
            break;
        }
    }

    return 0;
}
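3. bdi-related data structures
The bdi structure embeds a writeback object that describes the writeback kernel thread and holds the queues of inodes to be processed. The bdi structure also has a work_list, a queue of the tasks the writeback kernel thread must handle. If there is no work on that queue, the writeback kernel thread sleeps and waits.
writeback
The writeback object wraps the kernel thread (task) and the queues of inodes to be processed. When the page cache/buffer cache needs to flush an inode on the radix tree, the inode is put on the writeback object's b_dirty queue and the writeback thread is woken. During processing, the inode is moved to the b_io queue. Using several lists reduces the amount of state shared between threads. The writeback structure is defined as follows: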
struct bdi_writeback {
    struct backing_dev_info *bdi;   /* our parent bdi */
    unsigned int nr;

    unsigned long last_old_flush;   /* last old data flush */
    unsigned long last_active;      /* last time bdi thread was active */

    struct task_struct *task;       /* writeback thread */
    struct timer_list wakeup_timer; /* used for delayed bdi thread wakeup */
    struct list_head b_dirty;       /* dirty inodes */
    struct list_head b_io;          /* parked for writeback */
    struct list_head b_more_io;     /* parked for more writeback */
    spinlock_t list_lock;           /* protects the b_* lists */
};
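The movement of inodes from b_dirty to b_io can be illustrated with a small userspace model. This is a deliberate simplification: ino_queue, wb_model, wb_mark_dirty, and wb_queue_io are hypothetical names, fixed-size arrays of inode numbers replace the kernel's linked lists, and every dirty inode is moved at once, whereas the kernel selects which inodes to move.

```c
#include <assert.h>
#include <stddef.h>

#define WB_QUEUE_CAP 64

/* Hypothetical FIFO of inode numbers, standing in for a list_head. */
struct ino_queue {
    unsigned long ino[WB_QUEUE_CAP];
    size_t len;
};

/* Hypothetical reduced form of struct bdi_writeback. */
struct wb_model {
    struct ino_queue b_dirty;  /* inodes marked dirty */
    struct ino_queue b_io;     /* inodes picked up for writeback */
};

static int queue_push(struct ino_queue *q, unsigned long ino)
{
    if (q->len >= WB_QUEUE_CAP)
        return -1;
    q->ino[q->len++] = ino;
    return 0;
}

/* Put an inode on b_dirty. Returns 0 on success, -1 if the queue is full. */
int wb_mark_dirty(struct wb_model *wb, unsigned long ino)
{
    assert(wb != NULL);
    return queue_push(&wb->b_dirty, ino);
}

/* Move inodes from b_dirty to b_io in order; returns how many were moved. */
size_t wb_queue_io(struct wb_model *wb)
{
    size_t moved = 0;
    size_t rest, i;

    assert(wb != NULL);
    while (moved < wb->b_dirty.len &&
           queue_push(&wb->b_io, wb->b_dirty.ino[moved]) == 0)
        moved++;

    /* Keep whatever did not fit at the front of b_dirty. */
    rest = wb->b_dirty.len - moved;
    for (i = 0; i < rest; i++)
        wb->b_dirty.ino[i] = wb->b_dirty.ino[moved + i];
    wb->b_dirty.len = rest;
    return moved;
}
```

After two calls to wb_mark_dirty and one call to wb_queue_io, b_dirty is empty and b_io holds both inode numbers in their original order.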
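writeback work
The wb_writeback_work structure wraps a single writeback task; different tasks can use different flush policies. The writeback thread operates on writeback_work items. If the writeback_work queue is empty, the kernel thread can go to sleep.
The wb_writeback_work structure is defined as follows: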
struct wb_writeback_work {
    long nr_pages;
    struct super_block *sb;           /* superblock object */
    unsigned long *older_than_this;
    enum writeback_sync_modes sync_mode;
    unsigned int tagged_writepages:1;
    unsigned int for_kupdate:1;
    unsigned int range_cyclic:1;
    unsigned int for_background:1;
    enum wb_reason reason;            /* why was writeback initiated? */

    struct list_head list;            /* pending work list, linked into the bdi->work_list queue */
    struct completion *done;          /* set if the caller waits; notifies the caller when the work completes */
};
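The relationship between work items and the work list can be sketched as a simple FIFO in userspace. The names work_model, work_list, work_queue, and work_run_all are hypothetical, a boolean replaces the completion, and "processing" an item only adds up its nr_pages. A return value of 0 from work_run_all on an empty list corresponds to the case where the thread would go to sleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced form of struct wb_writeback_work. */
struct work_model {
    long nr_pages;
    bool done;                /* stands in for signalling work->done */
    struct work_model *next;  /* stands in for the list linkage */
};

/* Hypothetical stand-in for bdi->work_list. */
struct work_list {
    struct work_model *head;
    struct work_model *tail;
};

/* Append a work item to the tail of the list. */
void work_queue(struct work_list *wl, struct work_model *w)
{
    assert(wl != NULL && w != NULL);

    w->next = NULL;
    w->done = false;
    if (wl->tail)
        wl->tail->next = w;
    else
        wl->head = w;
    wl->tail = w;
}

/* Process every queued item in order; returns the total pages handled. */
long work_run_all(struct work_list *wl)
{
    long pages = 0;

    assert(wl != NULL);
    while (wl->head) {
        struct work_model *w = wl->head;

        wl->head = w->next;
        if (!wl->head)
            wl->tail = NULL;
        pages += w->nr_pages;
        w->done = true;       /* the waiting caller would be notified here */
    }
    return pages;
}
```

Queuing two items of 1024 and 16 pages and then calling work_run_all handles 1040 pages, marks both items done, and leaves the list empty.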
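4. Analysis of the main writeback functions
The main functions of the writeback mechanism cover two areas:
1. Managing bdi objects and forking the writeback kernel threads that flush cached data.
2. The writeback kernel thread's handler function, which flushes dirty pages.
writeback thread management
Linux has a kernel daemon thread that manages the system's bdi list and creates writeback threads for block devices. When a bdi has dirty pages but no kernel thread has been assigned to it yet, bdi_forker_thread creates one; when a writeback thread has been idle for a long time, bdi_forker_thread releases it.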