Among PG's many parameters, several related to checkpoints are rather mysterious. These parameters and checkpoint scheduling matter a great deal for the stability of the system, so below we analyze them, starting from PG's data synchronization mechanism.
Data synchronization mechanism of PG
It is well known that the data changes made while a database backend process executes a user transaction are written to the buffer pool, which in PG is shared buffers. The buffer pool in PG is generally set to about 1/4 of total memory, and the changes held in it do not need to be written to disk synchronously when the transaction commits. Since the WAL log is written at commit time, and with the WAL log the data can be recovered after a failure, data safety is guaranteed, so writing the data pages themselves to disk at commit time is not that important. PG writes data back to disk only when needed, for example when there are too many dirty pages, or after a certain interval of time.
Dirty page processing is divided into several steps. First, the background writer writes the changed pages (that is, dirty pages) in shared buffers to the operating system page cache by calling write. As can be seen in the function BgBufferSync, the PG background writer process scans shared buffers following the LRU list (actually only a portion on each scan), and if it finds a dirty page, it issues the write system call. The interval between background writer scans can be controlled with the bgwriter_delay parameter. After calling write on a page, the background writer records the file the page belongs to (actually the table's segment; each table may have multiple segments, corresponding to multiple physical files) in a shared-memory array, CheckpointerShmem->requests, in the following order:
BackgroundWriterMain -> BgBufferSync -> SyncOneBuffer -> FlushBuffer -> smgrwrite
                                                                            |
                                                                            V
                   ForwardFsyncRequest <- register_dirty_segment <- mdwrite
These requests are eventually read by the checkpointer process and put into pendingOpsTable. The actual write-back of dirty pages to disk is also done by the checkpointer process. Each time it runs, the checkpointer calls smgrwrite to write all dirty pages in shared buffers (that is, the dirty pages that the background writer has not yet cleaned) to the operating system page cache and records them in pendingOpsTable. This pendingOpsTable therefore covers all dirty pages that have been written with write, including the pages the background writer already handled. The PG checkpointer process then performs the dirty page write-back based on the records in pendingOpsTable (note that each fsync call syncs one data-table file to disk), with the following call order:
CheckPointGuts->CheckPointBuffers->smgrsync->mdsync->pg_fsync->fsync
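To make the split between the two phases concrete, here is a minimal standalone sketch in C (the file name, page contents and error handling are made up for illustration; this is not PG code): write() only pushes a page into the OS page cache, and only the later fsync() actually forces it to disk.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char page[8192];                          /* PG data pages are 8kB by default */
    memset(page, 0, sizeof(page));

    /* "demo_segment" stands in for one segment file of a table */
    int fd = open("demo_segment", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* step 1: write() only moves the page into the OS page cache, which is
     * what the background writer / checkpointer write phase does per page */
    if (write(fd, page, sizeof(page)) != (ssize_t) sizeof(page))
    { perror("write"); return 1; }

    /* ... more pages get written, time passes ... */

    /* step 2: fsync() forces the file's dirty pages down to disk, which is
     * what the checkpointer's fsync phase does once per file */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}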
If the checkpointer writes to disk too frequently, it may write very little data each time. We know that sequential, batched writes are much more efficient than random writes; writing very little data each time produces a large number of random writes, whereas if we slow down the checkpoint frequency, multiple random pages can be combined into one sequential batch write and efficiency improves greatly. In addition, a checkpoint performs fsync operations, and a large number of fsync calls may congest the system's IO and reduce its stability, so checkpoints cannot be too frequent. But the checkpoint interval cannot be enlarged indefinitely either: if the system crashes, recovery starts from the last checkpoint, and if the checkpoint interval is too long, recovery is slow and availability drops. The entire synchronization mechanism is as follows:
Figure 1. Data synchronization mechanism
Checkpoint Scheduling
So how is a checkpoint scheduled, that is, how is the checkpoint interval controlled? PG provides several parameters for this: checkpoint_segments, checkpoint_completion_target and checkpoint_timeout.
There are two dimensions used to decide whether to do a checkpoint:
The amount of data modified in the system.
There are two ways to evaluate the amount of modification: one is to record how many dirty pages there are in the shared buffers and what proportion of the buffers they occupy; the other is to record the amount of data modified by user transactions. Evaluating the amount of change by the number or proportion of dirty pages is less accurate: a user may repeatedly modify the same page, so there are few dirty pages but the actual amount of modification is large, and in that case a checkpoint should also be done to shorten recovery time. By recording how much WAL has been produced, this modification volume can be evaluated, so the checkpoint_segments parameter specifies after how many WAL files a checkpoint is performed. For example, when it is set to 16, a checkpoint is performed after 16 WAL log files have been generated (if each log file is 16MB in size, that is 16*16MB of logs). The call sequence that determines whether to trigger a checkpoint is as follows:
XLogInsert->XLogFlush->XLogWrite->XLogCheckpointNeeded
The time elapsed since the last checkpoint.
That is, how long after the last checkpoint a new checkpoint must be done. PG provides the checkpoint_timeout parameter for this; its default value is 300 seconds, meaning that if no checkpoint has been done within 300 seconds of the last one, a checkpoint is forced.
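As a rough illustration of the two triggers above, here is a small sketch in C; the helper name and its arguments are hypothetical and simplified, not PG's actual XLogCheckpointNeeded() or timeout handling:

#include <stdbool.h>
#include <stdio.h>

/* returns true if either trigger says it is time for a checkpoint */
static bool
checkpoint_needed(int wal_segs_since_last_ckpt,   /* new WAL files since the last checkpoint */
                  int secs_since_last_ckpt,       /* seconds since the last checkpoint */
                  int checkpoint_segments,
                  int checkpoint_timeout)         /* seconds, default 300 */
{
    if (wal_segs_since_last_ckpt >= checkpoint_segments)
        return true;             /* enough WAL volume has been generated */
    if (secs_since_last_ckpt >= checkpoint_timeout)
        return true;             /* too much time has passed */
    return false;
}

int main(void)
{
    /* with checkpoint_segments = 16 and checkpoint_timeout = 300 seconds */
    printf("%d\n", checkpoint_needed(16, 120, 16, 300));  /* 1: WAL trigger fires */
    printf("%d\n", checkpoint_needed(3, 301, 16, 300));   /* 1: timeout trigger fires */
    printf("%d\n", checkpoint_needed(3, 120, 16, 300));   /* 0: neither fires yet */
    return 0;
}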
So what is the remaining parameter, checkpoint_completion_target, for?
The checkpoint_completion_target parameter
This seemingly insignificant parameter actually has a great impact on checkpoint scheduling. How is it used? A checkpoint calls BufferSync, which scans all shared buffers pages once and writes a page to the page cache if it is dirty. Each time a dirty page has been written, the function IsCheckpointOnSchedule() is called. The main logic of this function is to divide the number of newly generated log files by checkpoint_segments and check whether the result is less than checkpoint_completion_target. Note that the number of newly generated log files here counts from the start of the current checkpoint, not from the end of the previous checkpoint. If IsCheckpointOnSchedule() returns true, the checkpointer process sleeps for a certain amount of time before writing the next shared buffers page. The effect is that when all pages have been written, the number of newly generated log files is roughly checkpoint_completion_target times the set value of checkpoint_segments. For example, if checkpoint_segments is 16 and checkpoint_completion_target is 0.9, then when the 16th new log file is generated after the last checkpoint, the process writing the log triggers a checkpoint. The checkpointer process calls CreateCheckPoint to do the checkpoint, and then calls BufferSync, which scans shared buffers and writes the dirty pages. Each time a dirty page is written, it sleeps if the number of newly generated log files is less than 16*0.9, that is, 15 log files. Finally, when the dirty page writes finish, the number of log files newly generated since the last checkpoint is approximately 16+15=31, that is:
checkpoint_segments + checkpoint_segments * checkpoint_completion_target
Thus, checkpoint_completion_target directly controls how fast the checkpoint writes dirty pages, so that when the writes complete, the number of newly generated log files matches the expectation above.
In addition to the number of log files, IsCheckpointOnSchedule() also checks the time elapsed since the checkpoint started as a fraction of checkpoint_timeout, and sleeps only if this fraction is also less than checkpoint_completion_target. With checkpoint_completion_target at 0.9 and checkpoint_timeout at 300 seconds, the dirty page writes finish about 270 seconds after the checkpoint start time. In fact, both constraints, the elapsed time and the number of generated log files, are in effect at the same time.
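A minimal sketch of this scheduling check, following the simplified description above (the function and parameter names are illustrative, not the actual IsCheckpointOnSchedule() source):

#include <stdbool.h>

/* returning true means "on schedule", so the checkpointer may sleep a bit
 * before writing the next dirty page */
static bool
on_schedule(double wal_segs_since_ckpt_start,   /* new WAL files since this checkpoint started */
            double secs_since_ckpt_start,       /* seconds since this checkpoint started */
            int    checkpoint_segments,
            int    checkpoint_timeout,          /* seconds */
            double checkpoint_completion_target)
{
    /* WAL criterion: the fraction of checkpoint_segments generated since the
     * checkpoint started must stay below checkpoint_completion_target */
    if (wal_segs_since_ckpt_start / checkpoint_segments >= checkpoint_completion_target)
        return false;            /* behind schedule: keep writing, no sleep */

    /* time criterion: the fraction of checkpoint_timeout elapsed so far must
     * also stay below checkpoint_completion_target */
    if (secs_since_ckpt_start / checkpoint_timeout >= checkpoint_completion_target)
        return false;

    return true;                 /* on schedule: sleep before the next page */
}

int main(void)
{
    /* 10 new WAL files and 100 seconds into a checkpoint, with segments = 16,
     * timeout = 300s, target = 0.9: 10/16 = 0.625 and 100/300 = 0.33, so sleep */
    return on_schedule(10, 100, 16, 300, 0.9) ? 0 : 1;
}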
When all dirty pages have been written, the real disk operation, fsync, must be performed. At this point there is no sleep between the fsync calls on each file; they are done as quickly as possible. Generally the total time spent on fsync is no more than 10 seconds, so this checkpoint ends before the checkpoint_timeout interval elapses or before checkpoint_segments new log files are generated (both counted from the checkpoint start point).
To sum up, the time spent on each checkpoint can be estimated with the following formula:
min(time to generate checkpoint_segments * checkpoint_completion_target log files, checkpoint_timeout * checkpoint_completion_target) + time to do fsync
For the example above, this is:
min(time to generate 15 log files, 270 seconds) + fsync time
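The formula can also be turned into a small worked calculation. The sketch below is only illustrative: the WAL generation rate is an assumed number, not something the formula or PG gives you.

#include <stdio.h>

/* estimate the length of the dirty-page write phase from the formula above */
static double
write_phase_secs(int checkpoint_segments, int checkpoint_timeout,
                 double completion_target, double wal_segs_per_sec)
{
    double by_wal  = checkpoint_segments * completion_target / wal_segs_per_sec;
    double by_time = checkpoint_timeout * completion_target;
    return by_wal < by_time ? by_wal : by_time;   /* the min(...) in the formula */
}

int main(void)
{
    /* checkpoint_segments = 16, checkpoint_timeout = 300s, target = 0.9,
     * assuming roughly 0.1 WAL segments are generated per second */
    printf("about %.0f seconds + fsync time\n",
           write_phase_secs(16, 300, 0.9, 0.1));  /* min(144, 270) = 144 */
    return 0;
}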
The result of this formula is generally less than the time it takes to generate checkpoint_segments log files, and less than checkpoint_timeout. The combined effect is that a checkpoint is started whenever checkpoint_segments log files have been generated or checkpoint_timeout has elapsed. Between the start times of two consecutive checkpoints, the dirty page writes finish at the point in time given by the checkpoint_completion_target fraction, followed by a quick round of fsync, as shown in:
Figure 2. Checkpoint Process
The above is the checkpoint scheduling mechanism. When adjusting these parameters, take care not to let checkpoints happen too frequently, otherwise frequent fsync operations will make the system unstable. For example, with checkpoint_segments generally set to 16 or more, checkpoint_completion_target set to 0.9 and checkpoint_timeout at 300 seconds, the checkpoint interval can generally reach more than 1 minute.
Reference:
PgSQL · Feature Analysis · On the Scheduling of Checkpoint, http://mysql.taobao.org/monthly/2015/09/06/