PostgreSQL Replication's second chapter understands the transaction log of PostgreSQL (3)

Source: Internet
Author: User
Tags postgresql

2.3 Understanding consistency and data loss

It is not possible to mine PostgreSQL transaction logs without regard to consistency. In the first part of this chapter, we have largely explained the basic idea of the transaction log. You already know that the ability to make data in a good shape without prior log changes is difficult or even impossible.

So far, most of us have been discussing the problem of collapse. It is absolutely not good to lose files because of corrupted entries in the data file. However, crashing is not the only problem you need to be concerned about. The other two important topics are:

• Performance

• Data loss

While this may be an obvious choice for an important subject, we have the feeling that these two themes are poorly understood, are respected, and therefore need to be considered.

In our daily business for PostgreSQL consulting experts and trainers, we usually see those who focus only on performance.

Performance is everything, we want to be fast; Tell us how to is fast ...

Potential data loss, or even a sense of concept to deal with it. For many people, this looks like a new thing. We see this: what is the benefit if data is lost or even faster? The key to doing this is not that performance is not important, and performance is very important. However, we just want to point out that performance is not the only component in the global.

2.3.1-to-disk path

To understand the problem of data loss and consistency, let's see how a block of data is sent to disk. Explains how this works:

When PostgreSQL is going to read or write a block, it usually passes through several levels. When a block is written, it is sent to the operating system. The operating system caches the data and performs write operations on the data. In some cases, the operating system will decide to pass the data to some low levels. This could be a disk controller. The disk controller will cache, reorder, and possibly have a cache layer before the data eventually ends up in the real physical storage device.

In our case, we used four layers. In many enterprise-grade systems, there may even be more layers. Imagine a virtual machine that is stored over a network mount, for example, SAN,NAS,NFS,ATA-OVER_ETHERNET,ISCSI, and so on. Many layers of abstraction have data passing through, and each layer will try to do its own part of the optimization.

From memory to memory

What happens when PostgreSQL passes a 8K block to the operating system? The only correct answer to this question may be: "There is something." When a write to a file is executed, there is absolutely no guarantee that the data will actually be sent to the disk.

In reality, writing a file is nothing more than a copy operation from PostgreSQL memory to some system memory. These two memory areas are in memory, so things may be lost in the event of a crash. In practice, if the entire memory fails to work due to a failure, there is no difference in who loses the data.

The following code snippet illustrates the basic issues we are facing:

test=# \d t_test

Table "Public.t_test"

Column | Type | Modifiers


ID | Integer |

test=# BEGIN;


test=# INSERT into T_test VALUES (1);


test=# COMMIT;


As in the previous chapters, we use only one column of tables. The goal is to run a transaction that inserts a row.

If a crash occurs shortly after submission, no data is at risk because nothing happens. If you crash after an INSERT statement, but before commit, nothing happens. The user has not issued a commit, so the transaction is well known to run, but not completed. If a crash occurs, the application will notice that things are not successful and (hopefully) react accordingly.

However, the situation is completely different, and if the user has issued a commit statement, it has successfully returned. No matter what happens, the user will expect the submitted data to be available.

[The user expects a successful write to be available after an unexpected restart.] The so-called acid standard also requires durability. In Computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties that ensure that database transactions are handled reliably. ]

From memory to disk

To ensure that the kernel passes data from memory to disk, PostgreSQL must take some precautions. In COMMIT, the system call is initiated and it forces the data into the transaction log.

[At this point, PostgreSQL does not have to force data to the data file, because we can always repair bad data files from Xlog. If the data is stored securely in Xlog, the transaction can be considered secure. ]

The necessary mandatory data-to-disk system call is Fsync (). Some lists are copied from the BSD Handbook page. In our opinion, this is one of the best hand albums ever written to deal with this topic:

FSYNC (2) BSD System Calls Manual FSYNC (2)


Fsync--Synchronize a file ' s In-core State with

That on disk


#include <unistd.h>

int Fsync (intfildes);


Fsync () causes all modified the data and attributes of Fildes to being moved to a permanent storage device. This normally results in all In-core modified

Copies of buffers for the associated file to is written to a disk.

Note that while Fsync () would flush all data from the host to the drive (i.e. the "Permanent storage Device") and the drive it Self could not physically

Write the data to the platters for quite some time and it is written in an out-of-order sequence. Specifically, if the drive loses power or the OS

Crashes, the application may find is only some or none of the their data is written. The disk drive is also re-order the data so, later writes

May is present, while earlier writes is not. A theoretical edge case. This scenario was easily reproduced with real world workloads

and drive power failures.

Essentially, the kernel tries to make the image of its in-memory file consistent with the file image on disk. It is implemented through all changes to the storage device. It is also clear that we are not talking about a theoretical scenario here, and that brushing back to disk is a very important topic.

Without a disk refresh in commit, you simply cannot secure your data, which means that in a very cumbersome situation, you will actually lose data.

And, in essence, speed and consistency, they are actually relative work. Brushing back changes to disk is especially expensive because it involves real hardware. Our expenses are not 5%, but a lot more. With the introduction of SSDs, the cost has fallen sharply, but it is still considerable.

A word about the battery

Most production servers will use RAID controllers to manage disks. The important point here is that disk refreshes and performance are usually more closely related to RAID controllers. If the RAID controller does not have a battery, this is usually the case, then it takes a long time to refresh frantically. The RAID controller must wait for the slowest disk to return. However, if the battery is available, the RAID controller can assume a power loss that does not prevent a recognized disk write completion once the power is restored. This allows the controller to cache write and simply disguise the flush. As a result, a simple battery can easily improve performance by up to 10 times times.

[Keep in mind that what we have described in this section is a general question.] However, each hardware is different. We highly recommend that you check and understand your hardware and RAID configuration to see how the refresh is handled. ]

Beyond Fsync ()

Fsync () is not the only system call to flush data to disk. Depending on the operating system you are using, different refresh calls are available. In PostgreSQL, you can change the Wal_sync_method to determine your preferred refresh system. Again, this change can be done by adjusting the postgresql.conf.

The available methods are Open_datasync, Fdatasync, Fsync, Fsync_writethrough, and Open_sync.

[If you want to change these values, we strongly recommend that you review the man page of the operating system you are using to make sure that you have made the right choice.] ]

Consistency level for 2.3.2 PostgreSQL

Ensuring consistency and preventing data loss is expensive, and each disk refresh is expensive, and we should think twice before refreshing to disk. PostgreSQL provides multiple levels of data protection for the user to choose from. These different choices are represented by two important parameters, which can be found in postgresql.conf:

1. Fsync

2. Synchronous_commit
If Fsync is used, the Fsync parameter will control data loss. In the default configuration, PostgreSQL will always flush commits to disk. If the Fsync is turned off, however, there is no guarantee that the commit will survive the crash. Data is lost, and even data corruption may occur. To protect your data, it is necessary to keep the fsync in an open state. If you can afford to lose part or all of your data, you can relax the refresh criteria.

Synchronous_commit and xlog-writing related. Typically, PostgreSQL waits until the data has been completely written to Xlog. In particular, short transactions can withstand quite a lot, thus providing a variety of options:

On:postgresql will wait until XLOG has been fully successful in writing. If you want to store credit card data, you want to make sure that the Department of Financial Affairs is lost. In this case, flushing to disk is essential.

off: There is a time difference between reporting a successful and secure write to the client. In such cases, there can be damage. Let's assume a database that is stored on a Web site who is currently online. Assuming your system crashes, it's back in 20 minutes. Do you really care about your data? After 20 minutes, it is possible for everyone to log back and forth again. This is not worth sacrificing performance to protect data that is obsolete after a few minutes.

Local: In the case of replicating a DB instance, we will only wait for the local instance to flush to disk. The advantage here is that you have a high level of protection as you refresh to a disk; however, we can safely assume that there will not be two servers crashing at the same time, so we can relax a little bit of slave on the standard.

Remote_write:postgresql will wait until the synchronous standby server reports success for the given transaction.

Setting Fsync to off in the protocol, changing synchronous_commit to OFF will not cause a crash. However, in the event of a crash, we may lose some of the transactions that have been successfully submitted. The number of potential data loss is managed by an additional postgresql.conf called Wal_writer_delay. In the case of setting Synchronous_commit to OFF, we will never lose more than what is defined in the Wal_writer_delayconfig variable.

[Changing synchronous_commit may look like a small performance adjustment; However, in reality, changing the sync behavior is one of the dominant factors when running minor write transactions, and the gain may not be just a few percentage points, but if you're lucky, it could be 10 times times or even more ( Depends on hardware, workload, I/O subsystem, etc.). ]

Keep in mind that the configuration database is not just speed. Consistency is at least as important as speed, so you should carefully consider whether you want to exchange speed with potential data loss.

Therefore, it is important to fully understand these consistency-related topics as described in this chapter. When it comes to your cluster architecture, data security will be an important part of being able to determine if a particular architecture is meaningful to your data. After all, the job of a database is to protect data. The awareness of your durability requirements is definitely a big benefit.

Original address:

PostgreSQL Replication's second chapter understands the transaction log of PostgreSQL (3)

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.