PostgreSQL 9.6: An In-Depth Analysis of Smoother fsync and write

Last updated: 2017-01-13. Source: Internet

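Background

How smoothly a car shifts gears usually depends on how many gears it has, or on the driver's shifting technique.

The more gears there are, the smoother the shifts feel; with fewer gears, shifting can produce a noticeable jerk.

Databases are the same: they sometimes stall, for example when I/O demand spikes because accumulated I/O arrives all at once.

This article describes how 9.6 makes fsync and write smoother and thereby reduces I/O spikes.

To guarantee data durability while still delivering good read/write performance and read/write consistency, databases have evolved over many years to the point where a REDO log and a shared buffer are essentially standard equipment.

To guarantee durability, the REDO for a dirty page must be flushed to disk before the dirty page itself is flushed; the dirty pages in the shared buffer are then aged out asynchronously by an LRU algorithm.

To get good read/write performance, a shared buffer is normally required: writes land in the shared buffer first instead of synchronously modifying the data pages, because data pages are scattered. The database thereby turns scattered I/O into sequential REDO I/O.

So the question is: when is fsync called, and when is write called?

And why did 9.6 need to make fsync and write smoother?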

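Where the database calls fsync and write

write

write is called in many places. Here are some of the heaviest write scenarios, which are also the ones 9.6 focuses on optimizing.

1. shared buffer

The bgwriter background process writes aged dirty pages from the shared buffer to the operating system (by calling the system's write interface), according to its configured wake-up interval, aging algorithm, and write-scheduling settings.

When a backend process requests a page in the shared buffer and there are not enough free pages, it performs the same operation as the bgwriter itself and writes aged dirty pages to the operating system (by calling the system's write interface).

2. wal buffer

The walwriter background process writes the dirty pages in the wal buffer to the operating system at its configured wake-up interval (by calling the system's write interface).

Note that the walwriter's fsync method is configurable: it can use buffered writes or direct I/O. If direct I/O is configured, the data goes straight to disk and is not written to the operating system as dirty pages.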

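The hidden write problem on large-memory hosts

write actually writes only to the operating system, which then schedules the dirty pages to be written to disk.

How the operating system schedules dirty-page writeback depends on several kernel parameters and also on the file system.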

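vm.dirty_background_ratio = 10 # when dirty pages reach 10% of memory, the kernel's background flush thread starts writing them back
vm.dirty_ratio = 20 # when dirty pages reach 20%, a user process that calls write triggers the flush to disk itself
vm.dirty_writeback_centisecs = 50 # wake-up interval of the background flush thread (unit: hundredths of a second)
vm.dirty_expire_centisecs = 6000 # the background flush thread writes back dirty pages older than this value (similar to LRU) (unit: hundredths of a second)

If the system has a very large amount of memory, the background thread may have a great many dirty pages to flush once it is triggered, which produces an I/O spike.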
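
To get a feel for the numbers, here is a minimal sketch that estimates how much dirty data can build up before a percentage threshold is reached. It assumes, as a simplification, that the ratio applies to total RAM (the kernel really applies it to available memory), and the function name dirty_threshold_bytes is made up for this illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: approximate number of dirty bytes at which a
 * percentage threshold such as vm.dirty_background_ratio is reached.
 * Simplification: the ratio is applied to total RAM.
 */
uint64_t
dirty_threshold_bytes(uint64_t mem_bytes, unsigned ratio_percent)
{
    /* divide first so that very large memory sizes cannot overflow */
    return mem_bytes / 100 * ratio_percent;
}
```

Under this simplification, a 16 GiB host at the 10% setting starts background flushing at roughly 1.6 GiB of dirty data, while a 512 GiB host lets roughly 51 GiB accumulate first.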

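We can therefore adjust the kernel parameters to shave the peak.

vm.dirty_background_bytes = 102400000 # once dirty pages reach about 100 MB, the kernel's background flush thread starts writing them back

This setting alone may not be enough, however, because the database works in parallel (the wal writer, bgwriter, and backend processes may all call write concurrently). At peak, dirty pages can be produced far faster than the OS background thread flushes them.

In that case OS dirty pages can still pile up and erupt into an I/O spike.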
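
The pile-up can be put into a back-of-the-envelope model. The sketch below is only an illustration: the function name dirty_backlog_mb and the idea of constant rates are assumptions, and real workloads are burstier than this.

```c
/*
 * Hypothetical model: if dirty pages are produced at produce_mb_per_s and
 * the OS writes them back at flush_mb_per_s, return the dirty data (in MB)
 * that has piled up after the given number of seconds.
 */
long
dirty_backlog_mb(long produce_mb_per_s, long flush_mb_per_s, long seconds)
{
    long net = produce_mb_per_s - flush_mb_per_s;

    /* nothing accumulates while the flusher keeps up */
    return (net > 0) ? net * seconds : 0;
}
```

With made-up rates of 800 MB/s produced and 300 MB/s flushed, 5000 MB of dirty pages would be waiting after 10 seconds, no matter how low the background threshold is set.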

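fsync

1. create database

Creating a database copies the template database's directory. After each portion of a file is copied, sync_file_range is called, and fsync is called at the end to make the data durable.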

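If the template database has many files, or large files, this can produce a lot of OS dirty pages.

2. checkpoint

2.1 First, mark the dirty pages in the shared buffer

2.2 Call write for each page that was marked dirty

If the shared buffer is large and the workload dirties pages quickly, a checkpoint instantly produces a large number of OS dirty pages.

2.3 Call fsync on the relevant fds to make the data durable

When the checkpointer process calls fsync, if the OS background thread has not yet written out the dirty pages that the checkpoint wrote, that fsync generates a large amount of flush I/O.

3. wal writer

The wal writer flushes the data in the wal buffer to the XLOG files according to the configured fsync system-call method and scheduling interval.

If the wal buffer is configured to be large and the database generates a lot of REDO under high concurrency, the WAL writer also produces a large amount of disk-write I/O.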
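
Checkpoint steps 2.2 and 2.3 above can be sketched as follows. This is a simplified illustration, not PostgreSQL's code: the function write_then_fsync and the constant SKETCH_BLCKSZ are hypothetical, and the page content is a dummy pattern.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192      /* assumed page size for this sketch */

/*
 * Hypothetical sketch: write nblocks pages with write(), which only copies
 * them into the OS page cache, then make them durable with one fsync().
 * Returns 0 on success, -1 on error.
 */
int
write_then_fsync(int fd, int nblocks)
{
    char page[SKETCH_BLCKSZ];

    memset(page, 0xAB, sizeof(page));

    for (int i = 0; i < nblocks; i++)
    {
        /* step 2.2: the page becomes an OS dirty page, not yet on disk */
        if (write(fd, page, sizeof(page)) != (ssize_t) sizeof(page))
            return -1;
    }

    /* step 2.3: every page the OS has not written back yet must go out now */
    return fsync(fd);
}
```

The larger nblocks is, and the less the OS has already written back on its own, the more I/O the final fsync has to issue in one burst.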

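Hidden problems with fsync

1. If the template database has many files, or large files, create database can instantly produce a lot of OS dirty pages, and the fsync issued after the files are written causes a large amount of disk-write I/O.

2. If the shared buffer is large and the workload dirties pages quickly, a checkpoint instantly produces a large number of OS dirty pages.

3. When the checkpointer process calls fsync, if the OS background thread has not yet written out the dirty pages that the checkpoint wrote, that fsync generates a large amount of flush I/O.

4. If the wal buffer is configured to be large and the database generates a lot of REDO under high concurrency, the WAL writer also produces a large amount of disk-write I/O.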

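Analysis of the I/O spike problem

The preceding sections showed that in certain scenarios (usually very write-heavy ones) the database's write and fsync calls can generate a large amount of I/O demand.

This demand arises because the database makes heavy use of buffers to improve performance but does not handle buffer accumulation well, so the disk-write requests arrive in a stampede.

Tuning the scheduling parameters of the OS background flush thread mitigates the problem but cannot cure it, because the thread cannot withstand highly concurrent writes. It is a bit like the three heroes fighting Lü Bu: however strong Lü Bu is, he cannot beat the OS dirty pages produced by highly concurrent writes.

So how does 9.6 improve on this?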

Smoother fsync and write in 9.6

1. A new scheduling policy for asynchronous writeback using sync_file_range

Where feasible, trigger kernel writeback after a configurable number of writes, to prevent accumulation of dirty data in kernel disk buffers (Fabien Coelho, Andres Freund)

PostgreSQL writes data to the kernel's disk cache, from where it will be flushed to physical storage in due time.

Many operating systems are not smart about managing this and allow large amounts of dirty data to accumulate before deciding to flush it all at once, causing long delays for new I/O requests until the flushing finishes.

This change attempts to alleviate this problem by explicitly requesting data flushes after a configurable interval.

On Linux, sync_file_range() is used for this purpose, and the feature is on by default on Linux because that function has few downsides.

This flushing capability is also available on other platforms if they have msync() or posix_fadvise(), but those interfaces have some undesirable side-effects so the feature is disabled by default on non-Linux platforms.

The new configuration parameters backend_flush_after, bgwriter_flush_after, checkpoint_flush_after, and wal_writer_flush_after control this behavior.

The following four parameters control the write behavior of these four kinds of processes:

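backend_flush_after (unit: BLCKSZ)
    When the number of dirty pages written by a backend process exceeds the configured threshold, sync_file_range is called to tell the OS background flush thread to write them back asynchronously.
    This reduces the accumulation of OS dirty pages.

bgwriter_flush_after (unit: BLCKSZ)
    When the number of dirty pages written by the bgwriter process exceeds the configured threshold, sync_file_range is called to tell the OS background flush thread to write them back asynchronously.
    This reduces the accumulation of OS dirty pages.

checkpoint_flush_after (unit: BLCKSZ)
    When the number of dirty pages written by the checkpointer process exceeds the configured threshold, sync_file_range is called to tell the OS background flush thread to write them back asynchronously.
    This reduces the accumulation of OS dirty pages.

wal_writer_flush_after (unit: size)
    When the amount of WAL written by the wal writer process since the last flush exceeds the configured threshold, the WAL is flushed to disk instead of only being written to the OS, as the parameter documentation quoted later in this article describes.
    This also reduces the accumulation of OS dirty pages.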
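
Before 9.6, these processes can be thought of as calling write without any restraint, so at peak the OS background flush thread easily fell behind.

9.6 uses write with restraint: after every so many writes, it prompts the background thread to flush dirty pages.

Because sync_file_range is used through its asynchronous interface, however, the problem may not be solved completely. A further improvement might be to issue a synchronous sync_file_range call once OS dirty pages exceed some threshold, which would throttle the rate at which dirty pages are produced.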
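
The "write with restraint" idea can be sketched like this. It is a simplified illustration under my own assumptions, not PostgreSQL's implementation: write_with_flush_after and SKETCH_BLCKSZ are hypothetical names, and on platforms where sync_file_range() is not available the sketch only counts where a hint would have been issued.

```c
#define _GNU_SOURCE             /* exposes sync_file_range() on Linux */
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192      /* assumed page size for this sketch */

/*
 * Hypothetical sketch: write nblocks pages and, after every flush_after
 * pages, ask the kernel to start writeback of the range just written.
 * flush_after == 0 disables the hints. Returns the number of hints
 * issued, or -1 if a write fails.
 */
int
write_with_flush_after(int fd, int nblocks, int flush_after)
{
    char  page[SKETCH_BLCKSZ];
    off_t pending_start = 0;    /* offset where the unflushed run begins */
    int   pending = 0;          /* pages written since the last hint */
    int   hints = 0;

    memset(page, 0, sizeof(page));

    for (int i = 0; i < nblocks; i++)
    {
        if (pwrite(fd, page, sizeof(page), (off_t) i * SKETCH_BLCKSZ)
            != (ssize_t) sizeof(page))
            return -1;
        pending++;

        if (flush_after > 0 && pending >= flush_after)
        {
#ifdef SYNC_FILE_RANGE_WRITE
            /* asynchronous hint: start writeback, do not wait for it */
            (void) sync_file_range(fd, pending_start,
                                   (off_t) pending * SKETCH_BLCKSZ,
                                   SYNC_FILE_RANGE_WRITE);
#endif
            pending_start += (off_t) pending * SKETCH_BLCKSZ;
            pending = 0;
            hints++;
        }
    }
    return hints;
}
```

Writing 100 pages with flush_after set to 32 issues three hints, so at most 32 pages from this writer are waiting in the page cache without writeback having been requested.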

2. When a checkpoint writes dirty pages that belong to the same fd, it sorts them before writing, which reduces scattered I/O.

Perform checkpoint writes in sorted order (Fabien Coelho, Andres Freund)

Previously, checkpoints wrote out dirty pages in whatever order they happen to appear in shared buffers, which usually is nearly random.

That performs poorly, especially on rotating media.

This change causes checkpoint-driven writes to be done in order by file and block number, and to be balanced across tablespaces.
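
Ordering "by file and block number" can be sketched with qsort. The DirtyPage struct and both function names are hypothetical simplifications for illustration, and the balancing across tablespaces mentioned in the release note is not covered here.

```c
#include <stdlib.h>

/* Hypothetical, simplified tag for a dirty page: which file, which block. */
typedef struct DirtyPage
{
    unsigned file_id;
    unsigned block_no;
} DirtyPage;

/* Compare by file first, then by block number within the file. */
int
dirty_page_cmp(const void *a, const void *b)
{
    const DirtyPage *pa = a;
    const DirtyPage *pb = b;

    if (pa->file_id != pb->file_id)
        return (pa->file_id < pb->file_id) ? -1 : 1;
    if (pa->block_no != pb->block_no)
        return (pa->block_no < pb->block_no) ? -1 : 1;
    return 0;
}

/* Sort the pages to be written so that writes to each file are in block order. */
void
sort_checkpoint_writes(DirtyPage *pages, size_t n)
{
    qsort(pages, n, sizeof(DirtyPage), dirty_page_cmp);
}
```

After sorting, consecutive writes mostly touch neighbouring blocks of the same file, which is what makes the checkpoint friendlier to rotating media.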
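
Code walkthrough reference

"PostgreSQL 9.6 smooth checkpoint optimization (SYNC_FILE_RANGE): a brief analysis and optimization of I/O hangs with multiple instances on a single host"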

src/backend/storage/file/fd.c

/*
 * pg_flush_data --- advise OS that the described dirty data should be flushed
 *
 * offset of 0 with nbytes 0 means that the entire file should be flushed;
 * in this case, this function may have side-effects on the file's
 * seek position!
 */
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
{
...
    /*
     * sync_file_range(SYNC_FILE_RANGE_WRITE), currently linux specific,
     * tells the OS that writeback for the specified blocks should be
     * started, but that we don't want to wait for completion. Note that
     * this call might block if too much dirty data exists in the range.
     * This is the preferable method on OSs supporting it, as it works
     * reliably when available (contrast to msync()) and doesn't flush out
     * clean data (like FADV_DONTNEED).
     */
    rc = sync_file_range(fd, offset, nbytes,
                         SYNC_FILE_RANGE_WRITE);
...
}

NAME
       sync_file_range - sync a file segment with disk
DESCRIPTION
       sync_file_range() permits fine control when synchronising the open file referred to by the file descriptor fd with disk.

offset is the starting byte of the file range to be synchronised. nbytes specifies the length of the range to be synchronised, in bytes; if nbytes is zero, then all bytes from offset through to the end of
file are synchronised. Synchronisation is in units of the system page size: offset is rounded down to a page boundary; (offset+nbytes-1) is rounded up to a page boundary.

The flags bit-mask argument can include any of the following values:

SYNC_FILE_RANGE_WAIT_BEFORE
Wait upon write-out of all pages in the specified range that have already been submitted to the device driver for write-out before performing any write.

SYNC_FILE_RANGE_WRITE
Initiate write-out of all dirty pages in the specified range which are not presently submitted for write-out.

SYNC_FILE_RANGE_WAIT_AFTER
Wait upon write-out of all pages in the range after performing any write.

Specifying flags as 0 is permitted, as a no-op.

NOTES
None of these operations write out the file’s metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there are no guarantees that the data will
be available after a crash.

SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any I/O errors or ENOSPC conditions and will return these to the caller.

Useful combinations of the flags bits are:

SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE
Ensures that all pages in the specified range which were dirty when sync_file_range() was called are placed under write-out. This is a start-write-for-data-integrity operation.

SYNC_FILE_RANGE_WRITE
Start write-out of all dirty pages in the specified range which are not presently under write-out. This is an asynchronous flush-to-disk operation. This is not suitable for data integrity operations.

SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER)
Wait for completion of write-out of all pages in the specified range. This can be used after an earlier SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE operation to wait for completion of that operation, and obtain its result.

SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER
This is a traditional fdatasync(2) operation. It is a write-for-data-integrity operation that will ensure that all pages in the specified range which were dirty when sync_file_range() was called are committed to disk.
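
Two of these flag combinations can be wrapped as below. This is a sketch with hypothetical helper names (flush_range_async, flush_range_wait). Where sync_file_range() is not available, the asynchronous helper does nothing and the waiting helper falls back to a whole-file fsync(), which is my own choice for the sketch.

```c
#define _GNU_SOURCE             /* exposes sync_file_range() on Linux */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: start writeback of the range without waiting.
 * This is only a hint and gives no durability guarantee.
 */
int
flush_range_async(int fd, off_t offset, off_t nbytes)
{
#ifdef SYNC_FILE_RANGE_WRITE
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
#else
    (void) fd;
    (void) offset;
    (void) nbytes;
    return 0;                   /* no equivalent hint in this sketch */
#endif
}

/*
 * Hypothetical helper: write out the range and wait until it completes.
 * Returns 0 on success, -1 on error.
 */
int
flush_range_wait(int fd, off_t offset, off_t nbytes)
{
#ifdef SYNC_FILE_RANGE_WRITE
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
#else
    (void) offset;
    (void) nbytes;
    return fsync(fd);           /* coarser fallback: flushes the whole file */
#endif
}
```

The first helper corresponds to what pg_flush_data does in the excerpt above. The second blocks the caller until the range is on its way to disk, so it slows the writer down, but as the NOTES point out it still does not write file metadata and so is no replacement for fsync.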

Parameter details

1. backend_flush_after (integer)

Whenever more than backend_flush_after bytes have been written by a single backend, attempt to force the OS to issue these writes to the underlying storage.

Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the background.

Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than shared_buffers, but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms.

The valid range is between 0, which disables controlled writeback, and 2MB. The default is 0 (i.e. no flush control).
(Non-default values of BLCKSZ change the maximum.)

2. bgwriter_flush_after (integer)

Whenever more than bgwriter_flush_after bytes have been written by the bgwriter, attempt to force the OS to issue these writes to the underlying storage.

This setting may have no effect on some platforms.

The valid range is between 0, which disables controlled writeback, and 2MB.

The default is 512kB on Linux, 0 elsewhere. (Non-default values of BLCKSZ change the default and maximum.)

This parameter can only be set in the postgresql.conf file or on the server command line.

3. checkpoint_flush_after (integer)

Whenever more than checkpoint_flush_after bytes have been written while performing a checkpoint, attempt to force the OS to issue these writes to the underlying storage.

Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of the checkpoint, or when the OS writes data back in larger batches in the background.

This setting may have no effect on some platforms. The valid range is between 0, which disables controlled writeback, and 2MB.

The default is 256kB on Linux, 0 elsewhere. (Non-default values of BLCKSZ change the default and maximum.)

This parameter can only be set in the postgresql.conf file or on the server command line.

4. wal_writer_flush_after (integer)

Specifies how often the WAL writer flushes WAL.

In case the last flush happened less than wal_writer_delay milliseconds ago and less than wal_writer_flush_after bytes of WAL have been produced since, WAL is only written to the OS, not flushed to disk.

If wal_writer_flush_after is set to 0 WAL is flushed every time the WAL writer has written WAL.

The default is 1MB.

This parameter can only be set in the postgresql.conf file or on the server command line.
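
The documentation gives these limits in kB and MB, while the parameter list earlier in this article notes that the data-page settings are counted in BLCKSZ units. The conversion can be sketched as follows, assuming the usual BLCKSZ of 8192 bytes; both function names are hypothetical.

```c
/* Hypothetical helper: a size in kB expressed as a number of blocks. */
int
flush_after_blocks(int kilobytes, int block_size_bytes)
{
    return (kilobytes * 1024) / block_size_bytes;
}

/* Hypothetical helper: a number of blocks expressed as a size in kB. */
int
flush_after_kb(int blocks, int block_size_bytes)
{
    return (blocks * block_size_bytes) / 1024;
}
```

With 8 kB blocks, 512kB is 64 blocks, 256kB is 32 blocks, and the 2MB maximum is 256 blocks. If the defaults are kept as block counts, which is what the note about non-default BLCKSZ suggests, a build with 32 kB blocks would turn the same 64 blocks into 2048 kB.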
This article was originally written in Chinese and published on aliyun.com.
