PostgreSQL 9.6 smooth fsync/write: an in-depth analysis


Background

How smoothly a car shifts usually depends on the number of gears and on shifting technique.

With more gears, shifting feels smoother; with fewer gears, each shift may come with a noticeable jolt.

Databases are the same: they can occasionally stutter, for example under sharp (bursty) IO demand.

This article introduces the smoothness improvements to fsync and write in PostgreSQL 9.6, which reduce these sharp IO demands.

To guarantee data reliability while still delivering good read/write performance and read/write consistency, databases have, after years of evolution, converged on the redo log and the shared buffer as standard components.

To guarantee reliability, the redo log covering a dirty page must be flushed before the dirty page itself; dirty pages in the buffer are then aged out asynchronously by an LRU algorithm.

To guarantee good read/write performance, writes go to the shared buffer first rather than directly modifying data pages, because data page IO is highly scattered. The database thereby converts scattered IO into sequential redo IO.

So the questions are: when does the database call fsync, and when does it call write?

And why did 9.6 need to smooth out fsync and write?

Where the database calls fsync and write

Write

write is called in many places; here are some very write-heavy scenarios, which are also the focus of the 9.6 optimization.

1. Shared buffer

The bgwriter background process writes aged dirty pages from the shared buffer to the operating system, based on its configured wake-up interval, the aging algorithm, and its write scheduling settings (invoking the system's write interface).

A backend process, when requesting a page in the shared buffer and finding no free page available, triggers the same bgwriter action and writes aged dirty pages to the operating system (invoking the system's write interface).

2. WAL buffer

The walwriter background process writes dirty pages from the WAL buffer to the operating system according to its configured wake-up interval (invoking the system's write interface).

Note that the walwriter's flush method is configurable, with both buffered writes and direct IO available. With direct IO, data goes straight to disk and is never held as operating-system dirty pages.
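For reference, this choice is made with the wal_sync_method parameter in postgresql.conf; a minimal sketch (the available values depend on the platform, and the comments state my understanding of the direct-IO path, not text from this article):

```
# how the WAL writer makes xlog durable; fdatasync is the Linux default
wal_sync_method = fdatasync    # alternatives: fsync, open_datasync, open_sync
# with open_datasync/open_sync, WAL is written with O_DSYNC/O_SYNC and,
# when WAL archiving and streaming are disabled, with O_DIRECT as well,
# bypassing the OS dirty-page path entirely
```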

The hidden write problem on large-memory hosts

Because write only writes into the operating system, the operating system must later schedule the dirty pages for writeback to disk.

How the operating system schedules dirty-page writeback is governed by several kernel parameters, and also depends on the file system.

vm.dirty_background_ratio = 10    # when dirty pages reach 10% of memory, the background flush threads start writing them back
vm.dirty_ratio = 20               # when dirty pages reach 20% of memory, a user process calling write is itself forced to flush
vm.dirty_writeback_centisecs =    # wake-up interval of the background flush threads (in hundredths of a second)
vm.dirty_expire_centisecs = 6000  # dirty pages older than this are written back by the background flush threads (LRU-like; in hundredths of a second)

If the system has a lot of memory, then by the time the background threads are triggered there may be a huge number of dirty pages to write back, producing a sharp IO spike.

We can therefore tune kernel parameters to shave these IO peaks.

vm.dirty_background_bytes = 102400000  # when dirty pages reach about 100MB, the background flush threads start writing them back

This setting alone may not be sufficient, however, because the database writes concurrently (the WAL writer, bgwriter, and backend processes can all call write at once); at peak, dirty pages may be generated much faster than the operating system's background flush threads can drain them.

So in this case OS dirty pages can still accumulate, again producing sharp IO demand.

Fsync

1. Create Database

When you create a database, the template database directory must be copied; after each portion of a file is copied, sync_file_range is invoked, and finally fsync is called for persistence.


If the template database contains many files, or large ones, this can create a large number of OS dirty pages.

2. Checkpoint

2.1 First, mark the dirty pages in the shared buffer.

2.2 For every page marked dirty, call write.

If the shared buffer is large and the workload makes the database produce dirty pages quickly, the checkpoint will instantly generate a large number of OS dirty pages.

2.3 For each related file descriptor, call fsync to persist.

When the checkpoint process calls fsync, any dirty pages that the OS background threads have not yet written back during the checkpoint must be flushed then, so the fsync itself produces a large burst of disk-write IO.

3. WAL writer

The WAL writer flushes the data in the WAL buffer to the xlog files according to the configured fsync system-call method and scheduling interval.

If the WAL buffer is configured large while high database concurrency produces a large volume of redo, the WAL writer will likewise produce a large amount of disk-write IO.

Hidden problems with fsync

1. If the template database has many files, or large ones, create database can instantly generate many OS dirty pages, and the fsync issued after the file writes causes a burst of disk-write IO.

2. If the shared buffer is large and the workload makes the database produce dirty pages quickly, the checkpoint will instantly generate a large number of OS dirty pages.

3. When the checkpoint process calls fsync, any dirty pages the OS background threads have not yet written back must be flushed then, so the fsync produces a large burst of disk-write IO.

4. If the WAL buffer is configured large while high concurrency produces a large volume of redo, the WAL writer will also produce a large amount of disk-write IO.

Analysis of the sharp IO demand problem

The preceding analysis showed that, in certain (usually very write-heavy) scenarios, the database's write and fsync calls can produce a large amount of IO demand.

This demand arises because the database buffers heavily for performance but does not handle buffer accumulation well, so disk-write IO requests arrive in a flood.

Tuning the scheduling parameters of the OS background flush threads can alleviate this but cannot cure it: they cannot keep up with highly concurrent writes, a bit like the Three Heroes battling Lü Bu — however strong Lü Bu is, he cannot hold off the OS dirty pages generated by too many concurrent writers.

So how did 9.6 improve things?

9.6 smooth fsync/write optimizations

1. A new sync_file_range-based asynchronous writeback scheduling strategy

Where feasible, trigger kernel writeback after a configurable number of writes, to prevent accumulation of dirty data in kernel disk buffers (Fabien Coelho, Andres Freund)

PostgreSQL writes data to the kernel's disk cache, from where it will be flushed to physical storage in due time.

Many operating systems are not smart about managing this and allow large amounts of dirty data to accumulate before deciding to flush it all at once,

causing long delays for new I/O requests until the flushing finishes.

This change attempts to alleviate this problem by explicitly requesting data flushes after a configurable interval.

On Linux, sync_file_range() is used for this purpose, and the feature is on by default on Linux because that function has few downsides.

This flushing capability is also available on other platforms if they have posix_fadvise() or msync(),

but those interfaces have some undesirable side-effects, so the feature is disabled by default on non-Linux platforms.

The new configuration parameters backend_flush_after, bgwriter_flush_after, checkpoint_flush_after, and wal_writer_flush_after control this behavior.
These four parameters control the write behavior of the corresponding processes:

backend_flush_after (unit: BLCKSZ)
When the number of dirty pages written by a backend process exceeds this threshold, sync_file_range is called, telling the OS background flush threads to start asynchronous writeback,
thereby shaving the OS dirty-page pile-up.

bgwriter_flush_after (unit: BLCKSZ)
When the number of dirty pages written by the bgwriter process exceeds this threshold, sync_file_range is called, telling the OS background flush threads to start asynchronous writeback,
thereby shaving the OS dirty-page pile-up.

checkpoint_flush_after (unit: BLCKSZ)
When the number of dirty pages written by the checkpointer process exceeds this threshold, sync_file_range is called, telling the OS background flush threads to start asynchronous writeback,
thereby shaving the OS dirty-page pile-up.

wal_writer_flush_after (unit: size)
When the amount of dirty data written by the WAL writer process exceeds this threshold, sync_file_range is called, telling the OS background flush threads to start asynchronous writeback,
thereby shaving the OS dirty-page pile-up.
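Putting the four parameters together, a postgresql.conf sketch might look like this; the values are illustrative, not recommendations:

```
# ask the OS to start writeback after this much write()-output per process
backend_flush_after = 256kB       # default 0 (disabled); max 2MB
bgwriter_flush_after = 512kB      # Linux default
checkpoint_flush_after = 256kB    # Linux default
wal_writer_flush_after = 1MB      # default
```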
In versions before 9.6, these processes called write with no restraint, so spikes were likely whenever the OS background flush threads could not keep pace.

9.6 changes this to a controlled use of write: after every configured amount of writing, it reminds the background threads to flush dirty pages.

However, because the asynchronous sync_file_range interface is used, the problem may not be fully resolved. A further improvement could be to trigger a synchronous sync_file_range call once OS dirty pages exceed some threshold, which would throttle dirty-page production down to roughly the writeback speed.

2. During checkpoint writes, dirty pages belonging to the same file descriptor are sorted before being written, to reduce scattered IO.

Perform checkpoint writes in sorted order (Fabien Coelho, Andres Freund)

Previously, checkpoints wrote out dirty pages in whatever order they happened to appear in shared buffers, which usually is nearly random.

That performs poorly, especially on rotating media.

This change causes checkpoint-driven writes to be done in order by file and block number, and to be balanced across tablespaces.
Code Profiling Reference

See also: "PostgreSQL 9.6 checkpoint flexibility optimization (sync_file_range): analysis and optimization of the IO hang problem with multiple instances on a single machine"

src/backend/storage/file/fd.c

/*
 * pg_flush_data --- advise OS that the described dirty data should be flushed
 *
 * offset of 0 with nbytes 0 means that the entire file should be flushed;
 * in this case, this function may have side-effects on the file's
 * seek position!
 */
void
pg_flush_data(int fd, off_t offset, off_t nbytes)
{
...
	/*
	 * sync_file_range(SYNC_FILE_RANGE_WRITE), currently linux specific,
	 * tells the OS that writeback for the specified blocks should be
	 * started, but that we don't want to wait for completion.  Note that
	 * this call might block if too much dirty data exists in the range.
	 * This is the preferable method on OSs supporting it, as it works
	 * reliably when available (contrast to msync()) and doesn't flush out
	 * clean data (like FADV_DONTNEED).
	 */
	rc = sync_file_range(fd, offset, nbytes,
						 SYNC_FILE_RANGE_WRITE);
From the sync_file_range(2) man page:

NAME
sync_file_range - sync a file segment with disk

DESCRIPTION
sync_file_range() permits fine control when synchronizing the open file referred to by the file descriptor fd with disk.

offset is the starting byte of the file range to be synchronized. nbytes specifies the length of the range to be synchronized, in bytes; if nbytes is zero, then all bytes from offset through to the end of
file are synchronized. Synchronization is in units of the system page size: offset is rounded down to a page boundary; (offset+nbytes-1) is rounded up to a page boundary.

The flags bit-mask argument can include any of the following values:

SYNC_FILE_RANGE_WAIT_BEFORE
Wait upon write-out of all pages in the specified range that have already been submitted to the device driver for write-out before performing any write.

SYNC_FILE_RANGE_WRITE
Initiate write-out of all dirty pages in the specified range which are not presently submitted.

SYNC_FILE_RANGE_WAIT_AFTER
Wait upon write-out of all pages in the range after performing any write.

Specifying flags as 0 is permitted, as a no-op.

NOTES
None of these operations writes out the file's metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there are no guarantees that the data will
be available after a crash.

SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any I/O errors or ENOSPC conditions and will return these to the caller.

Useful combinations of the flags bits are:

SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE
Ensures that all pages in the specified range which were dirty when sync_file_range() was called are placed under write-out. This is a start-write-for-data-integrity operation.

SYNC_FILE_RANGE_WRITE
Start write-out of all dirty pages in the specified range which are not presently under write-out.  This is an asynchronous flush-to-disk operation. This is not suitable for data integrity operations.

SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER)
Wait for completion of write-out of all pages in the specified range. This can be used after an earlier SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE operation to wait for completion of that
operation, and obtain its result.

SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER
This is a traditional fdatasync(2) operation. It is a write-for-data-integrity operation that will ensure that all pages in the specified range which were dirty when sync_file_range() was called are
committed to disk.
Parameter details

1. backend_flush_after (integer)

Whenever more than backend_flush_after bytes have been written by a single backend, attempt to force the OS to issue these writes to the underlying storage.

Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the background.

Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than shared_buffers, but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms.

The valid range is between 0, which disables controlled writeback, and 2MB. The default is 0 (i.e. no flush control).
(Non-default values of BLCKSZ change the maximum.)
2. bgwriter_flush_after (integer)

Whenever more than bgwriter_flush_after bytes have been written by the bgwriter, attempt to force the OS to issue these writes to the underlying storage.

Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the background.

Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than shared_buffers, but smaller than the OS's page cache, where performance might degrade.

This setting may have no effect on some platforms.

The valid range is between 0, which disables controlled writeback, and 2MB.

The default is 512kB on Linux, 0 elsewhere. (Non-default values of BLCKSZ change the default and maximum.)

This parameter can only be set in the postgresql.conf file or on the server command line.
3. checkpoint_flush_after (integer)

Whenever more than checkpoint_flush_after bytes have been written while performing a checkpoint, attempt to force the OS to issue these writes to the underlying storage.

Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of the checkpoint, or when the OS writes data back in larger batches in the background.

Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than shared_buffers, but smaller than the OS's page cache, where performance might degrade.

This setting may have no effect on some platforms. The valid range is between 0, which disables controlled writeback, and 2MB.

The default is 256kB on Linux, 0 elsewhere. (Non-default values of BLCKSZ change the default and maximum.)

This parameter can only be set in the postgresql.conf file or on the server command line.
4. wal_writer_flush_after (integer)

Specifies how often the WAL writer flushes WAL.

In case the last flush happened less than wal_writer_delay milliseconds ago and less than wal_writer_flush_after bytes of WAL have been produced since, WAL is only written to the operating system, not flushed to disk.

If wal_writer_flush_after is set to 0, WAL is flushed every time the WAL writer has written some WAL.

The default is 1MB.

This parameter can only be set in the postgresql.conf file or on the server command line.