PostgreSQL Replication's second chapter understands the transaction log of PostgreSQL (1)

Source: Internet
Author: User
Tags postgresql

In the previous chapters, we have understood the various replication concepts. This is not just a theoretical overview of the things that will enhance your consciousness for the next thing to be introduced, but will also introduce you to a broad topic.

In this chapter, we will be closer to the actual solution and understand how PostgreSQL works internally and what replication means. We'll see what the so-called transaction log (XLOG) does, and how it works. Xlog plays a major role in the PostgreSQL replication mechanism. It is necessary to understand how this part works.

2.1 PostgreSQL How to write data

The replication of PostgreSQL is all about writing data. Therefore, PostgreSQL internally writes a data block that is directly related to replication, replication concepts, and highly correlated. In this chapter, we will delve into the write operation. In this chapter you will learn the following things:

PostgreSQL how to write data

• What memory and storage parameters are involved

• How writes are optimized

• How writes are replicated

• How to ensure data consistency

Once you have finished reading this chapter, you are ready to understand the next chapter, which will teach you how to safely replicate your first database.

Disk layout for 2.1.1 PostgreSQL

The first problem we need to understand in this chapter is the disk layout of PostgreSQL. Understanding the disk layout is very helpful for checking an existing installation, and it is also helpful to design an efficient, high-performance installation.

Compared to other database systems, such as Oracle,postgresql relies on file systems to store data. PostgreSQL does not use the original device. The philosophy behind this is that if a file system developer has done his or her job well, there is no need to re-implement the FileSystem functionality over and over again.

View Data Catalog

To understand the filesystem layer used by PostgreSQL, we can look at what we can see in the Data Catalog (created by INITDB at installation):

[[Email protected]]$ ls-l

Total 92

-RW-------1 HS Staff 4 Feb 18:14 pg_version

DRWX------6 HS Staff 4096 Feb 18:14 base

DRWX------2 HS staff 4096 Feb 18:14 Global

DRWX------2 HS staff 4096 Feb 18:14 Pg_clog

-RW-------1 HS staff 4458 Feb 18:14 pg_hba.conf

-RW-------1 HS staff 1636 Feb 18:14 pg_ident.conf

DRWX------4 HS staff 4096 Feb 18:14 pg_multixact

DRWX------2 HS staff 4096 Feb 18:14 pg_notify

DRWX------2 HS staff 4096 Feb 18:14 pg_serial

DRWX------2 HS staff 4096 Feb 18:14 pg_snapshots

DRWX------2 HS staff 4096 Feb 18:19 pg_stat_tmp

DRWX------2 HS staff 4096 Feb 18:14 Pg_subtrans

DRWX------2 HS staff 4096 Feb 18:14 PG_TBLSPC

DRWX------2 HS staff 4096 Feb 18:14 pg_twophase

DRWX------3 HS staff 4096 Feb 18:14 Pg_xlog

-RW-------1 HS Staff 19630 Feb 18:14 postgresql.conf

-RW-------1 HS Staff 18:14 postmaster.opts

-RW-------1 HS Staff 18:14

You will see a series of files and directories, all of which are required to run a DB instance. Let's take a look at the details.

Version number of the Pg_version-postgresql

At startup, the file tells the file system whether the data directory contains the correct version number. Please note: Only the major version number is in this file. It is possible and easy to replicate between two different minor versions under the same major version.

[Email protected]]$ cat Pg_version


The file is readable plain text.

base-Actual Data Directory

In the data catalog, the base directory is one of the most important directories. He actually contains real data (system tables, indexes, etc.). Inside the base directory, each database has its own subdirectory:

[Email protected] base]$ ls-l

Total 24

DRWX------2 HS Staff 12288 Feb 11 18:14 1

DRWX------2 HS staff 4096 Feb 11 18:14 12865

DRWX------2 HS staff 4096 Feb 11 18:14 12870

DRWX------2 HS staff 4096 Feb 11 18:14 16384

We can easily connect these directories to the database within my system, and it is worth noting that PostgreSQL uses the object ID of the database. This has many advantages over using names, because object IDs never change and provide a good way to abstract various issues, such as making

Problems with different character sets, and so on.

test=# SELECT OID, datname from Pg_database;

OID |datname


1 | Template1

12865 | Template0

12870 | Postgres

16384 | Test

(4 rows)

Now, we can see how the data is stored within these special database directories. In PostgreSQL, each table relates to (at least) one data file. Let me create a table to see what happens:
test=# CREATE TABLE t_test (IDINT4);


We can check the system tables to get the so-called Relfilenode, which represents the storage file name on disk:

test=# SELECT Relfilenode, Relname

From Pg_class

WHERE relname = ' t_test ';

Relfilenode | Relname


16385 | T_test

(1 row)

Once the table is created, PostgreSQL creates an empty file on the disk:

[Email protected] base]$ ls-l 16384/16385*

-RW-------1 HS Staff 0Feb 12 12:06 16384/16385

A growing number of data files

Tables can sometimes be quite large, so it is more or less impossible to put all the relevant data for a table into a single file. To solve this problem, each additional 1GB of data PostgreSQL adds a file.

Therefore, if a file called 16385 is larger than 1GB, there will be a file called 16385.1. Once the file is filled, you will see a file named 16385.2, and so on. In this way, in some rare operating systems or embedded devices, the tables in PostgreSQL can be scaled securely and reliably without worrying about low-level file system limitations.

Performing I/O in blocks

To improve I/O performance, PostgreSQL always performs I/O in blocks of 8k size. Therefore, you will see that your data files are always growing in 8k increments. When it comes to physical replication, you must make sure that both (master and slave) are compiled with the same block size.

[Unless you explicitly compile PostgreSQL according to the different block sizes you use, you can always accept the fact that the data blocks are consistent and accurate 8k in size. ]

Relationship Forks

In addition to the data files discussed in the preceding paragraphs, PostgreSQL will implement the same file number to create additional files. As of now, these files are used to store table (FREESPACEMAP) Internal free space information, so-called Visibilitymap and so on. Later, there will be more types of relationships forks may be added.

global-Global Data

Global contains the world system tables. This directory is small and you should not expect excessive storage consumption.

Working with stand-alone data files

Users often forget one thing: a single PostgreSQL data file has little value. If you have only one data file, it is not possible to reliably store the data; it is not a good guess to easily try to extract data from a single file. Therefore, in order to read the data, you need a complete instance.

pg_clog-Submit Log

The commit log is an important part of a working database instance. It stores the state of transactions made on the system. There are four states of a transaction (transaction_status_in_progress,transaction_status_committed,

transaction_status_aborted, andtransaction_status_sub_committed), if the state of a thing's commit log is not available, PostgreSQL will not know if it should be seen.

If the commit log for a DB instance is compromised for some reason (perhaps because of file system corruption), you can make some interesting time ahead.

[If the commit log is corrupted, we recommend that you take a snapshot of the database instance (file system) and forge the commit log, which sometimes helps to get a reasonable amount of data from the problematic DB instance. ]

pg_hba.conf-Host-based network configuration

The pg_hba.conf file configures the internal firewall of PostgreSQL, which represents one of the two most important configuration files in a PostgreSQL cluster. It allows the user to define any type of authentication based on the source of the request. A database administrator, understanding the pg_hba.conf file is critical because this file determines whether a slave is allowed to connect to master. If you happen to miss something here. You may see an error message in the slave log (for example: no pg_hba.conf license ...)

pg_ident.conf-Identity Verification

The pg_ident.conf file can be used in conjunction with the pg_hba.conf file to configure identity authentication.

pg_multixact-Multi-transaction state data

The multi-object log manager can effectively handle shared row locks. There is no real impact in this directory and on replication-related things.

pg_notify-Monitoring/Notification data

In this directory, the system stores information about the listener/notification (asynchronous back-end interface). and replication has no real impact.

pg_serial-information about committing a serialized transaction

The information for the serialized transaction is stored here. We must store serialized transaction submission information on disk to ensure that long-running transactions do not inflate memory. Internally, a simple SLRU structure is used to track these transactions.

pg_snapshot-Output Snapshot

This is a file consisting of the information required by the PostgreSQL snapshot management.

In some cases, the snapshot must be output to disk to avoid entering memory. After the crash, those snapshots will be automatically cleaned up.

pg_stat_tmp-Provisional statistical data

Temporary statistics are stored in this file. This information is required by most pg_stat_* system views (and, therefore, the raw data is also provided for low-level functions).

pg_subtrans-Child Transaction Data

In this directory, we store information about the child transaction. Pg_subtrans (and Pg_clog) directories permanently (on disk) store transaction-related information. There are a limited number of memory leaves stored in memory, so in many cases, it is not necessary to actually read from disk. However, if you have a long-running transaction or an open transaction that is idle on the backend, it may be necessary to read and write this information from disk. They also allow for permanent storage of information through server restarts.

pg_tblspc-Symbolic links to table spaces

PG_TBLSPC is a very important directory. In PostgreSQL, a table space is an alternative storage location, represented by a directory that holds the data.

The most important thing here: If a database instance is fully replicated, we cannot simply rely on the fact that all servers in the cluster use the same disk layout and the same storage hardware. Here is a scenario where a master needs more I/O capability than a slave, and this slave may just be around to act as a backup or standby. To allow users to handle different disk layouts, PostgreSQL links the symbols to the PG_TBLSPC associated directory. The database will blindly follow those symbolic links to find these tablespaces, no matter where they are.

This provides great strength and flexibility for the end user. In general, controlling storage is necessary for replication and performance. Keep in mind that these symbolic links can only be changed afterwards. should be carefully considered.

(We recommend using the strategy described in this section only when he really needs it.) For most settings, the same file system layout is definitely recommended on master and slave. This can greatly reduce complexity. )

pg_twophase– information about pre-processing statements

PostgreSQL must store information about the two-phase commit. Although two-phase commit is an important feature, the directory itself is less important to a normal system administrator.

Pg_xlog–postgresql transaction log (WAL)

The PostgreSQL events log is the basic directory we have to discuss in this chapter. Pg_xlog contains all the files of the so-called Xlog. If you have used PostgreSQL in the past time, you may be familiar with the term Wal (Writeahead Log). Xlog and Wal are the two names of a thing. The same applies to transaction logs. These three terms are widely used, it is important to know that they are the same meaning.

The Pg_xlog directory is generally like this:

[[Email protected]_xlog]$ ls-l

Total 81924

-RW-------1 HS Staff 16777216 Feb 12 16:29


-RW-------1 HS Staff 16777216 Feb 12 16:29


-RW-------1 HS Staff 16777216 Feb 12 16:29


-RW-------1 HS Staff 16777216 Feb 12 16:29


-RW-------1 HS Staff 16777216 Feb 12 16:29


DRWX------2 HS staff 4096 Feb 18:14 Archive_status

You see a bunch of files that are always 16MB in size (the default setting) the file name of a xlog. x is usually 24 bytes long. The numbering is always hexadecimal. Therefore, the system accounting "9,a,b,c,d,e,f,10" and so on.

One important thing is that the size of the Pg_xlog directory doesn't go crazy over time, it's completely independent of the type of transaction you're running on your system. The size of the xlog is determined by the parameters in the postgresql.conf that will be discussed later in this chapter, in short: Whether you are running small transactions or large transactions, the size of Xlog is the same. You can easily run a transaction as large as 1TB, with only a very small number of xlog files. This may not be too effective, sensible performance, but it is technically completely feasible.

Postgresql.conf–postgresql the core file of the configuration file

Finally, there is a primary PostgreSQL configuration file. All configuration parameters can be changed in postgresql.conf, we will use this file to set up a wide range of replication and tuning of our DB instance to ensure that our replication settings provide us with superior performance.

[If you happen to use precompiled binaries, you may not be able to find postgresql.conf in your data directory.] It may be located in some subdirectories under/etc/(under Linux/unix) or where you choose under Windows. The exact location is highly dependent on the operating system you are using. The typical location of the data catalog is/var/lib/pgsql/data. But postgresql.conf are often under/etc/postgresql/9.x/main/postgresql.conf (as in Ubuntu and similar systems) or directly under the/etc/directory. ]

2.1.2 writing a row of data

Now that we know the disk layout, we'll go further into PostgreSQL and explore what happens when PostgreSQL writes a row of data. Once you have mastered this chapter, you will fully understand the concepts behind Xlog.

It is important to note that in this chapter on writing a row of data, we have simplified a number of processes to ensure that we can emphasize the focus and the ideas behind PostgreSQL Xlog.

a simple INSERT Statement

Let's assume we're doing a simple insert statement like this:

INSERT into foo VALUES (' ABCD '):

As one might imagine, the goal of an insert operation is to add a row of data to an existing table. In the previous section on PostgreSQL disk layout we have seen that each table will be associated with a file on disk.

Let's do a psychological experiment, assuming we're dealing with a table size of 10TB here. PostgreSQL will see the insert operation and look for some spare space within the table (using existing blocks or adding a new block). For the purposes of this example, we simply insert the data into the second block of the table.

As long as the server actually saves the transaction, nothing is wrong with it. What happens if someone pulls the plug after writing ABC instead of the entire data? When the server restarts, we will find ourselves in a situation where we have an incomplete block of records, and more interestingly, we may not even have the slightest idea where the block contains the bad records.

In general, a table that contains rows with incomplete unknown locations is considered a bad table. Of course, the system table corruption is not something, the PostgreSQL community will tolerate, in particular, not due to obvious design mistakes caused by similar problems.

in the wal-writing Period crashes

[We must ensure that PostgreSQL survives any interruption at any given point in time and that no data is lost or corrupted.] It's good to protect your data, and it's absolutely necessary. ]

To solve the problem we have just discussed, PostgreSQL uses the so-called Wal (Write Ahead Log) or simply xlog. Using Wal means writing logs before writing the data. So before we actually write the data to the table, we'll sort the log entries in order, which shows what we're going to do with our table. Shows the things that are working if:

As we can see, once we write the data to the log (1), we can mark the transaction as complete (2). The data is then written to the table (3).

[We have removed the memory portion of the equation-this will be discussed later in this section.] ]

Let's use two examples to illustrate the advantages of this approach:

To ensure that the concepts described in this chapter are rock-solid, we have to make sure that we can crash at any point in time without worrying about our data. Let's assume that we crash when writing Xlog. What will happen in this situation? In this case, the end user will know that the transaction is not successful, then he or she will not depend on the success of the transaction.

As soon as PostgreSQL starts, it can pass Xlog and replay everything to ensure that PostgreSQL is in a consistent state. So, if we don't implement it through wal-writing, the nasty thing has happened and we can't expect a write to be successful.

The Wal entry will always know if it's finished. Each Wal entry has an internal checksum, so once someone tries to replay the bad wal,postgresql can be detected immediately. This is especially important when a crash occurs when we may not be able to rely on the latest data written to disk. Wal will automatically resolve these issues in crash recovery.

[If PostgreSQL is configured correctly, crashes are fairly secure at any point in time.] ]

in the wal-writing after the crash

Now let's assume that we have successfully passed the wal-writing, and then we immediately collapsed, perhaps writing the underlying table. What if we only managed to write AB and not the entire data?

Well, in this case, we'll know what's missing during replay. Again, we go to replay the logs to make sure that all the data in our datasheets is safe to need.

Although it can be difficult to find data in a table after a crash, we always rely on the fact that we can find data in the Wal. Wal is continuous, and if we just keep track of how far the data is written, we can always proceed from there; Xlog will direct us to the data in the table, and it always knows where there is a change or where to make a change. PostgreSQL does not need to search the Wal for data; it simply replays it from the right angle.

[Once a transaction has finished writing the Wal, it will no longer be lost easily.] ]

2.1.3 Read Consistency

Now that we've seen how a simple write is done, we have to see what the impact of writing on reading is. The following image shows the basic architecture of the PostgreSQL database system:

For simplicity, we can think of a database instance as something that consists of three main parts:

1. PostgreSQL Data File

2. Transaction log

3. Shared Cache

In the previous chapters, we have discussed the data files. You also see some basic information about the transaction log itself. Now we're going to expand our model and add another part to this architecture diagram: The memory portion of the game, the so-called shared buffer.

Purpose of shared buffers

The shared buffer is the I/O cache for PostgreSQL. It helps to cache blocks of 8K, which are read from the operating system and helps keep writing back to disk to optimize efficiency (later in this chapter we'll discuss how this works).

[Shared buffers are heavy because they affect performance.] ]

However, when it comes to shared caching, performance is not the only problem we should be concerned about. Let's say we want to send a query. For the sake of simplicity, we also assume that we only need a block to handle this read request. What if we were to make a simple query? Maybe we're looking for something simple, like a phone number, or a username given a specific key. The following list shows what the PostgreSQL will do in a number of simplified ways, in the hypothetical case that the instance has just restarted:

1. PostgreSQL will find the required blocks in the cache (as previously described, this is a shared buffer). In the cache of a newly launched instance, it will not find a block of data.

2. PostgreSQL will request the data block from the operating system.

3. Once the data has been loaded from the operating system, PostgreSQL will put it in the first cache queue.

4. The query was successfully delivered.

Let's assume that an identical block will be used again by the second query. In this case, things will work as follows:

PostgreSQL will find the required block of data and cache hits.

PostgreSQL will find that a cache block has been reused and moved from a low level cache (Q1) to a higher level cache (Q2). The blocks in the second queue will remain in the cache longer because they prove to be more important than the blocks that are only at the Q1 level.

[How large should a shared buffer be?] Under Linux, 8GB is usually recommended. Under Windows, values below 1GB are proven to be useful (such as PostgreSQL9.2). Starting with PostgreSQL9.3, a higher value may be useful and feasible under Windows. Under Linux, the crazy big shared buffers are actually inverse-optimized. Of course, this is only a rule of thumb, and special installations may require different settings. ]

Mixed Read and write

Keep in mind that in this section, all is about understanding writing to ensure that our ultimate goal, comprehensive and in-depth understanding of replication, can be achieved. So we have to look at how reading and writing come together. Let's see how reading and writing come together:

1. A write request arrives.

2. PostgreSQL writes to the transaction log to ensure consistency can be achieved.

3. PostgreSQL acquires a block inside the PostgreSQL shared buffer and makes changes in memory.

4. A read request arrives.

5. PostgreSQL will query the cache and find the required data.

6. A cache hit will land and the query will be delivered.

What is the point of this example? Well, as you may have noticed, we never discuss the actual table below the write. We discussed the write to the cache, to the Xlog, and so on, but there is no write about the real data file.

[In this example, the rows we write in the table are or are not in the table, which is completely irrelevant.] The reason is simple: if we need a block that has just been modified, we will never get it from below the table. ]

It is important to understand that data is not normally written directly to the data file during or after a write operation. It is meaningful to improve efficiency by writing out the data later. The reason for this is that it has a subtle effect on replication. The data file itself is worthless because it is not necessarily complete and not necessarily correct. To run a PostgreSQL instance, you will always need the data files to accompany the transaction log. Otherwise, there is no way to survive the crash.

From a consistency perspective, a shared buffer is a view that completes the user's data here. If there is something not in the table, logically, it should be in memory.

In the event of a crash, memory will be lost, so Xlog is consulted and replayed to convert the data file to a consistent data store again. In any case, the data files are only half the story.

[In PostgreSQL9.2 and in previous versions, the shared buffer was only sysv/posix shared memory or in the Windows emulation sysv. PostgreSQL9.3 (not published at the time of writing) started using memory-mapped files, which was faster under Windows, with no difference in performance compared to under Linux, but slightly slower under BSDs. BSD developers have begun to solve the problem. Using Mmap can make the configuration easier because the mmap is not limited by the operating system, and there is no limit as long as there is enough memory. Sysvshmem are limited, and a large number of SYSVSHMEM are usually only assigned by the operating system to make the appropriate adjustments. From Linux distributions to Linux distributions, shared memory variables are configured by default. SuSE tends to be more of a substitute, while Redhat,ubantu and others tend to be more conservative. ]

Original address:

PostgreSQL Replication's second chapter understands the transaction log of PostgreSQL (1)

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.