The Linux Cache Writeback Mechanism
Original: http://oenhan.com/linux-cache-writeback
While working on process safety monitoring, we made an off-the-cuff decision: if a process was found in the D state, i.e. TASK_UNINTERRUPTIBLE (uninterruptible sleep), for more than 8 minutes, the system would be panicked. It so happened that when the DB team wrote their logs, they buffered the entire log in memory and flushed it to disk at the end; the process sat in the D state for a long time and the system naturally panicked. This involved some of Linux's mechanisms for flushing cached data back to disk, and the related tuning knobs, so here is a summary.
Under the current mechanism, dirty pages generally need to be flushed back to disk in the following cases:
1. The dirty page cache occupies too much memory and free memory runs low;
2. Dirty pages have been modified for too long, hitting a time threshold, and must be flushed to keep memory and on-disk data consistent;
3. An external command forces dirty pages to be flushed to disk;
4. A write to disk checks the state and triggers a flush.
The kernel uses pdflush threads to flush dirty pages to disk. The number of pdflush threads stays between 2 and 8 and can be read directly from /proc/sys/vm/nr_pdflush_threads; for the exact policy, see the source function __pdflush.

I. Forced flushing by other kernel modules
Let us take the first and third cases first: when memory runs low or a flush is forced externally, dirty pages are flushed by calling wakeup_pdflush; its callers include do_sync, free_more_memory, and try_to_free_pages. wakeup_pdflush does its work through the background_writeout function:
static void background_writeout(unsigned long _min_pages)
{
	long min_pages = _min_pages;
	struct writeback_control wbc = {
		.bdi		= NULL,
		.sync_mode	= WB_SYNC_NONE,
		.older_than_this = NULL,
		.nr_to_write	= 0,
		.nonblocking	= 1,
	};

	for ( ; ; ) {
		struct writeback_state wbs;
		long background_thresh;
		long dirty_thresh;

		get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, NULL);
		if (wbs.nr_dirty + wbs.nr_unstable < background_thresh
				&& min_pages <= 0)
			break;
		wbc.encountered_congestion = 0;
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
		wbc.pages_skipped = 0;
		writeback_inodes(&wbc);
		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
			/* Wrote less than expected */
			blk_congestion_wait(WRITE, HZ/10);
			if (!wbc.encountered_congestion)
				break;
		}
	}
}
background_writeout enters an infinite loop and uses get_dirty_limits to obtain background_thresh, the threshold at which dirty-page flushing starts. It is dirty_background_ratio percent of the total memory pages and can be adjusted through the proc interface /proc/sys/vm/dirty_background_ratio, which generally defaults to 10. Once the dirty pages exceed the threshold, writeback_inodes is called to write out MAX_WRITEBACK_PAGES (1024) pages at a time until the dirty ratio drops below the threshold.

II. Flushing started by the kernel timer
At boot, the kernel initializes the wb_timer timer in page_writeback_init with a timeout of dirty_writeback_centisecs, in units of 0.01 s, adjustable via /proc/sys/vm/dirty_writeback_centisecs. The timer's handler is wb_timer_fn, which ultimately does its work through wb_kupdate.
static void wb_kupdate(unsigned long arg)
{
	sync_supers();

	get_writeback_state(&wbs);
	oldest_jif = jiffies - (dirty_expire_centisecs * HZ) / 100;
	start_jif = jiffies;
	next_jif = start_jif + (dirty_writeback_centisecs * HZ) / 100;
	nr_to_write = wbs.nr_dirty + wbs.nr_unstable +
			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
	while (nr_to_write > 0) {
		wbc.encountered_congestion = 0;
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
		writeback_inodes(&wbc);
		if (wbc.nr_to_write > 0) {
			if (wbc.encountered_congestion)
				blk_congestion_wait(WRITE, HZ/10);
			else
				break;	/* All the old data is written */
		}
		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
	}
	if (time_before(next_jif, jiffies + HZ))
		next_jif = jiffies + HZ;
	if (dirty_writeback_centisecs)
		mod_timer(&wb_timer, next_jif);
}
The wb_kupdate listing is not copied in full. The kernel first flushes superblock information to the filesystem, then computes oldest_jif as a parameter of wbc so that only dirty pages whose modification age exceeds dirty_expire_centisecs are flushed; dirty_expire_centisecs can be adjusted via /proc/sys/vm/dirty_expire_centisecs.

III. Flushing the cache on write()
Writing a file with write() from userspace may also flush dirty pages. After marking the written memory pages dirty, generic_file_buffered_write conditionally flushes to disk to keep the current dirty-page ratio in balance; see balance_dirty_pages_ratelimited:
void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
	static DEFINE_PER_CPU(int, ratelimits) = 0;
	long ratelimit;

	ratelimit = ratelimit_pages;
	if (dirty_exceeded)
		ratelimit = 8;

	/*
	 * Check the rate limiting. Also, we do not want to throttle real-time
	 * tasks in balance_dirty_pages(). Period.
	 */
	if (get_cpu_var(ratelimits)++ >= ratelimit) {
		__get_cpu_var(ratelimits) = 0;
		put_cpu_var(ratelimits);
		balance_dirty_pages(mapping);
		return;
	}
	put_cpu_var(ratelimits);
}
balance_dirty_pages_ratelimited uses ratelimit_pages to throttle how often the flush (a call to balance_dirty_pages) runs: only one in every ratelimit_pages invocations actually flushes (and the limit drops to 8 once dirty_exceeded is set). The flush itself is in balance_dirty_pages:
static void balance_dirty_pages(struct address_space *mapping)
{
	struct writeback_state wbs;
	long nr_reclaimable;
	long background_thresh;
	long dirty_thresh;
	unsigned long pages_written = 0;
	unsigned long write_chunk = sync_writeback_pages();
	struct backing_dev_info *bdi = mapping->backing_dev_info;

	for (;;) {
		struct writeback_control wbc = {
			.bdi		= bdi,
			.sync_mode	= WB_SYNC_NONE,
			.older_than_this = NULL,
			.nr_to_write	= write_chunk,
		};

		get_dirty_limits(&wbs, &background_thresh,
					&dirty_thresh, mapping);
		nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
		if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
			break;

		if (!dirty_exceeded)
			dirty_exceeded = 1;

		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
		 * Unstable writes are a feature of certain networked
		 * filesystems (i.e. NFS) in which data may have been
		 * written to the server's write cache, but has not yet
		 * been flushed to permanent storage.
		 */
		if (nr_reclaimable) {
			writeback_inodes(&wbc);
			get_dirty_limits(&wbs, &background_thresh,
					&dirty_thresh, mapping);
			nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
			if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
				break;
			pages_written += write_chunk - wbc.nr_to_write;
			if (pages_written >= write_chunk)
				break;		/* We've done our duty */
		}
		blk_congestion_wait(WRITE, HZ/10);
	}

	if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh && dirty_exceeded)
		dirty_exceeded = 0;

	if (writeback_in_progress(bdi))
		return;		/* pdflush is already working this queue */

	/*
	 * In laptop mode, we wait until hitting the higher threshold before
	 * starting background writeout, and then write out all the way down
	 * to the lower threshold.  So slow writers cause minimal disk activity.
	 *
	 * In normal mode, we start background writeout at the lower
	 * background_thresh, to keep the amount of dirty memory low.
	 */
	if ((laptop_mode && pages_written) ||
	     (!laptop_mode && (nr_reclaimable > background_thresh)))
		pdflush_operation(background_writeout, 0);
}
The function enters a loop and uses get_dirty_limits to obtain the page counts corresponding to dirty_background_ratio and dirty_ratio. It then checks against dirty_thresh: if the dirty pages exceed dirty_thresh, writeback_inodes is called to start flushing the cache to disk; if one pass does not bring the dirty ratio below dirty_ratio, blk_congestion_wait blocks the writer, and the loop repeats until the ratio falls below dirty_ratio. Once the ratio is below dirty_ratio but still above dirty_background_ratio, pdflush_operation kicks off background_writeout; pdflush_operation is non-blocking and returns immediately after waking pdflush, and background_writeout then runs in a pdflush thread.
From this we know: during a write(), if the cached dirty pages exceed dirty_ratio, the write is blocked and dirty pages are flushed back until the cache falls below dirty_ratio; if the cache is merely above dirty_background_ratio, the write wakes the pdflush threads to flush dirty pages without blocking the write itself.

IV. Summary of the problem
Most D-state hangs come from cases 3 and 4: with heavy write traffic and the cache managed by Linux, once dirty pages accumulate past a certain point, both further writes and an fsync flush will leave the process stuck in the D state.