PostgreSQL provides developers with a rich set of tools to manage concurrent access to data. Internally, data consistency is maintained by using a multi-version model (multi-version concurrency control, MVCC). This means that each SQL statement sees only a short period of time before the data snapshot (a database version), regardless of the current state of the underlying data. This protects the statement from seeing inconsistent data that could be caused by other concurrent transactions that perform updates on the same data rows, providing transactional isolation for each database session. MVCC avoids the traditional locking method of database system, and minimizes the lock contention to allow reasonable performance in multi-user environment.
The main advantage of using the MVCC concurrency control model instead of locking is that in MVCC, lock requests to query (read) data do not conflict with lock requests to write data, so reads do not block writes, and write never blocks read. PostgreSQL also maintains this assurance even when the most stringent transaction isolation level is provided by using the revolutionary serializable snapshot Isolation (SSI) level.
Today we start with the basics of PostgreSQL and talk about the features of MVCC and PostgreSQL.
Here, we first use a table to show the level of transaction in PostgreSQL, and we can turn it back when we see the later chapters.
1. Basic Knowledge 1.1 Transaction ID
When a transaction begins, the transaction management system in PostgreSQL assigns a unique identifier, the transaction ID (TXID), to the transaction. The TXID in PostgreSQL is defined as a 32-bit unsigned integer, which means it can record about 4.2 billion transactions. Usually TXID is transparent to us, but we can use the functions inside PostgreSQL to get the txid of the current transaction:
postgres=# BEGIN;BEGINpostgres=# SELECT txid_current(); txid_current -------------- 233(1 row)
In all Txid, there are several values that have special meanings:
Txid = 0 means Invalid Txid, usually used as the validity of the judgment Txid;
Txid = 1 means Bootstrap Txid, in the present case, only when initializing the database at intidb time
Txid = 2 means Frozen Txid, which is generally used in vacuum (mentioned later).
Remember this assumption: TXID Small firms modify tuples that are visible to txid large transactions, and vice versa.
1.2 tuples (tuple) structure
About the structure of tuples can say a lot of things, I also talked about some in this article. But we don't talk that deep here. About tuples (tuple) can be divided into ordinary tuples and toast tuples in principle. For the sake of simplification, we'll just talk about simple tuples.
The structure diagram of a simple tuple is as follows:
What we need to understand about the transaction-related structure is the heaptupleheaderdata structure, which is the "head" part of a tuple.
There are several fields that need us to understand below:
T_xmin is the txid of the transaction that inserts this tuple.
T_xmax is the txid of the transaction that updates or deletes this tuple. If this tuple is not deleted or updated, then the T_xmax field is set to 0, that is, the field invalid
The ID of the command that inserts this tuple is saved in T_cid. There may be multiple commands in a transaction, and the commands in the transaction are numbered sequentially (incrementing from 0). For the following transactions: Begin;insert; INSERT END. The first insert has a t_cid of 0 and the second insert has a t_cid of 1.
The identifier (that is, TID) for the tuple is saved in T_ctid. It points to the tuple itself or to the new "version" of the tuple. Because PostgreSQL's modification of records does not directly modify the user data in the tuple, it regenerates a tuple, and the old tuple points to the new tuple through T_ctid. If a record is modified more than once, there will be more than one version of the record. Each version is concatenated through T_ctid to form a version chain. With this version chain, we can find the latest version. The actual T_ctid is a two-tuple (x, y). where x (numbering starting from 0) represents the Page,y (starting from 1) of the tuple, which is represented in the first position of the page.
Insert/delete/update Operations for 1.3 tuples (tuple)
As mentioned above, the modification of records in PostgreSQL is not a direct modification of the tuple structure, but rather a re-generation of a tuple, with the old tuple pointing to the new tuple through T_ctid. So here we explain in turn how insert/delete/update is implemented.
Let's start with a schematic, the page header,line pointer these here talk about irrelevant structures I don't draw. We focus on the position relationship of tuple in page.
1.3.1 Insert Operation
Insert operation is the simplest, directly put a tuple into the corresponding page inside can. As follows:
Here we assume
So in the tuple_1 we can see:
T_xmin is set to 99, which is the txid,99 of the transaction.
T_xmax is set to 0 because the tuple has not been updated or deleted
T_ctid is set to 0, it is obvious that it was inserted by the first command of the transaction
T_ctid is set to (0,1), which is the first tuple of the table, so it is placed in the 1th position of the No. 0 page.
How can I verify it from outside the database? PostgreSQL provides a plugin: Pageinspect
We can look at the information on the database page as follows:
postgres=# CREATE EXTENSION pageinspect;CREATE EXTENSIONpostgres=# CREATE TABLE tbl (data text);CREATE TABLEpostgres=# INSERT INTO tbl VALUES('A');INSERT 0 1postgres=# SELECT lp as tuple, t_xmin, t_xmax, t_field3 as t_cid, t_ctid FROM heap_page_items(get_raw_page('tbl', 0)); tuple | t_xmin | t_xmax | t_cid | t_ctid -------+--------+--------+-------+-------- 1 | 99 | 0 | 0 | (0,1)(1 row)
1.3.2 Delete operation
As mentioned above, the delete operation in PostgreSQL only logically "deleted" the tuple, in fact, the tuple still exists in the database storage page, but the tuple has been some processing, making it become "invisible" in the query.
Let's say we delete the previously inserted tuple in a transaction with TXID 111, and the result is as follows:
As we can see, the T_xmax field of the Tuple_1 tuple has been modified and changed to 111, which is the txid of the transaction that deleted the Tuple_1.
When a transaction with TXID of 111 is committed, Tuple_1 becomes an invalid tuple, becoming a "dead tuple".
As you can see from the diagram, this tuple_1 remains on the database page. As the database runs, it can be predicted that this "dead tuple" will be more and more. They will be cleaned out when the Vacuum command is run, and the vacuum is not the focus of our discussion, omitted here.
1.3.3 Update operation
Update looks a little bit more complicated than the previous two operations. As we have said above, PostgreSQL's modification of records does not directly modify the user data in the tuple. PostgreSQL's handling of update is to delete old data before adding new data . The "translate" is the old tuple that will be updated is marked as delete, and then a new tuple is inserted.
Likewise, we assume that a record is inserted in a transaction with a TXID of 99 and that it is updated 2 times in a transaction with TXID 100. The process is as follows:
According to the principle of "delete old data first, add new data", when executing the first update command in a transaction with TXID 100, the operation is divided into two steps:
Delete old tuple tuple_1: sets the T_xmax value of the tuple_1 to be the Txid of the current transaction;
New tuple tuple_2: Set tuple_2 t_xmin,t_xmax,t_cid,t_ctid, set tuple_1 t_ctid point to Tuple_2
That
For Tuple_1:
t_xmax = 100;t_ctid : (0,1) => (0,2)
For tuple_2:
t_xmin = 100;t_xmax = 0;t_cid = 0;t_ctid = (0,2)
Similarly, the second Update command in a transaction that executes TXID 100 is also true:
For tuple_2:
t_xmax = 100;t_ctid : (0,2) => (0,3)
For Tuple_3:
t_xmin = 100;t_xmax = 0;t_cid = 1;t_ctid = (0,3)
When a transaction with TXID 100 is committed, Tuple_1 and tuple_2 become "dead tuple", whereas if the transaction is abort, tuple_2 and tuple_3 become "dead tuple".
Having said this, you should have a rough understanding of the implementation of these operations within the database. At the same time, you may be very curious about this ctid. This is a marker that marks the location of a tuple. So the question is, how is the tuple inserted position selected? Are they all stored in order? If that's the case, I think it would be nice to just record the migration of a tuple directly inside the page. The problem is that this location is selected as "technical", and PostgreSQL has a mechanism called FSM, which is the free Space Map. By discovering the free space on each page of a table, it is better to put it there. In particular, it is interesting to explore through the other plugin pg_freespacemap of PostgreSQL.
postgres=# CREATE EXTENSION pg_freespacemap;CREATE EXTENSIONpostgres=# SELECT *, round(100 * avail/8192 ,2) as "freespace ratio" FROM pg_freespace('accounts'); blkno | avail | freespace ratio -------+-------+----------------- 0 | 7904 | 96.00 1 | 7520 | 91.00 2 | 7136 | 87.00 3 | 7136 | 87.00 4 | 7136 | 87.00 5 | 7136 | 87.00....
With these basics, we can continue to discuss some of the more in-depth content.
2.MVCC Basic 2.1 Transaction Commit log (commit log)
The logs used by PostgreSQL are primarily xlog and clog, which are transaction logs and transaction commit logs. Xlog is a general log record, known as a log record, that records the process of transaction updates to the data and the final state of the transaction. Clog in the general database textbook is not mentioned, in fact, clog is an auxiliary form of Xlog, records the final state of the transaction. Because each Xlogh record is relatively large, if you need to determine the state of a transaction through the log, then using clog is much more efficient than using Xlog. While clog occupies a very limited amount of space, it is stored in shared memory and can be read quickly.
Let's take a look at the working principle of clog and its maintenance process.
First, in clog, PostgreSQL defines four transaction states:
IN_PROGRESS,COMMITTED,ABORTED,SUB_COMMITTED.
In addition to the last state, the other states are "people as their name", not much to say. And the sub_committed is for the sub-transaction, here is not much to discuss first.
Because it is stored in shared memory, the clog is stored as a 8KB-sized page. The internal transaction state of the clog page is stored in a similar "array" format. Where the subscript of "array" is the ID of the transaction, the value in "array" is the current state of the transaction corresponding to the subscript. Since clog only records the 4 states of a transaction, it can be represented by just 2 bits, that is, one byte can represent 4 transaction states. In a clog page (8KB), you can represent 8k*8b/2b=32k=2^15 clog records.
The following example:
At the T1 moment: Txid is committed for a 200 transaction, then the state of the transaction in clog Txid 200 is changed from In_progress to committed.
At T2 time: TXID is 201 for the transaction abort, then the state of the transaction in Clog Txid 201 is changed from In_progress to aborted.
The whole clog has been traced back to the state of all transactions in the transaction system. If there is a new transaction, add it directly after the array. When the current page (8KB size) is already full, add a new page to record.
When the PostgreSQL database is shutdown or checkpoint, the data in the clog is written back to the Pg_clog directory. Open the Pg_clog directory and we'll find some files named 0000,0001. These are the clog files. The maximum size of these files is 256KB. In other words, a file can store up to 256kb/8kb=32 of clog. If there are currently 8 clog, then 0000 files can be saved. And if there are currently 35 clog, then you need to 0000,0001 two files to store, and so on.
When the PostgreSQL database is started, the clog record under the Pg_clog folder is read in shared memory.
2.2 Transaction Snapshot (Transaction Snapshot)
Finally, I want to talk about the transaction snapshot.
Transactional snapshots I think it's a very figurative word, it's easy to literally understand its meaning. The so-called "snapshot", is holding the camera "click" A record of the current moment of information, after the shutter has been pressed after the changes we all can not detect, that is invisible.
Similarly, a transaction snapshot is when a transaction is executed, which transaction at a certain point in time is active, so-called active, which means that the transaction is either executing (in progress) or not starting.
PostgreSQL has built-in functions to get the current snapshot of the data: Txid_current_snapshot () function
This function allows you to get the current snapshot, but the snapshot information is a bit of a headache to interpret.
postgres=# SELECT txid_current_snapshot(); txid_current_snapshot ----------------------- 100:104:100,102(1 row)
The above data snapshot is a sequence of numbers consisting of: xmin:xmax:xip_list
Let me show you the meaning of each field.
The Txid of the oldest active transaction, which is the active transaction with the smallest TXID value, and all transactions txid less than that value, if 1. The status is commited visible,2. State aborted is dead,
The first unassigned Txid, all transactions with Txid greater than this value have not yet started at snapshot generation, i.e. not visible
The txid of all active transactions when the snapshot was generated.
In contrast to the above explanations, let's take a two representative example:
Txid less than 100 of transactions are not active;
TXID transactions that are greater than or equal to 100 are active.
Txid less than 100 of transactions are not active;
TXID transactions that are greater than or equal to 104 are active;
Txid of 100 and 102 are active, and 101 and 103 are not active.
So we're wondering, what's the point of taking so much space to determine if a transaction is active?
Here it comes.
In PostgreSQL, when the isolation level of a transaction is read committed, the snapshot is retrieved once for each command in the transaction, and if the isolation level of a transaction is repeatable When read or serializable, the transaction gets snapshot only when the first command executes .
Keep in mind that the differences above are caused by the difference in the visibility of the tuple, thus achieving different isolation levels.
Let's take a picture to explain the above.
We assume that the above three transactions are executed sequentially, where the isolation level of both transaction_a and Transaction_b is read Committed,transaction_c The isolation level is repeatable read. We t1~t5 the time period to explain:
T1
Transaction_a begins and executes the first command, at which time Txid and snapshot are obtained. The transaction system assigns the Transaction_a TXID to 200 and gets the current snapshot as 200:200:
T2
Transaction_b begins and executes the first command, at which time Txid and snapshot are obtained. The transaction system assigns Transaction_b TXID to 201 and gets the current snapshot as 200:200: Because Transaction_a is executing, Transaction_b cannot see the modifications in transaction_a.
T3
Transaction_c begins and executes the first command, at which time Txid and snapshot are obtained. The transaction system assigns Transaction_c TXID to 202 and gets the current snapshot as 200:200: Because Transaction_a is executing, Transaction_c cannot see Transaction_ Changes in a and transaction_b.
T4
Transaction_a has made a commit. The transaction management system removed the transaction_a information.
T5
Transaction_b and Transaction_c execute their select commands, respectively.
At this point, Transaction_b gets a new snapshot (because its isolation level is read COMMITTED), the snapshot is 201:201:. Because the TRANSACTION_A has been submitted. Transaction_a is visible to Transaction_b.
At the same time, because the isolation level of Transaction_c is repeatable READ, it still uses the first command to obtain snapshot 200:200, so transaction_a and Transaction_ B is still not visible to Transaction_c.
2.3-Tuple visibility rules
The rule of the so-called tuple visibility (tuple Visibility) is to use:
1.tuple中的t_xmin和t_xmax字段;2.clog3.当前的snapshot
To determine whether a tuple is visible or invisible to the execution statement in the current transaction. The so-called visible and invisible means that the current command handles the tuple when it is processed.
For the sake of brevity, we dodged the question of child affairs and about T_ctid. Only the simplest case is discussed.
We select 10 rules and divide them into three categories for illustration.
1. The status of T_xmin is aborted
We know that T_xmin is the transaction txid when a tuple is insert. If the state of the transaction is aborted, indicating that the transaction is canceled, it is natural that the tuple's insert is invalid and not visible. So the Rule1 is:
Rule 1: If Status (tuple.t_xmin) = Aborted⇒tuple is Invisible
2. The status of T_xmin is In_progress
If the t_xmin status of a tuple is in_progress, then it is very likely that it is not visible .
Because:
If this tuple is inserted by another transaction (not the current transaction), then the tuple is obviously not visible because the tuple is not committed (PostgreSQL does not support READ UNCOMMITTED).
Rule 2: If Status (t_xmin) = In_progress && T_xmin≠current_txid⇒tuple is Invisible
If this tuple is committed by the current transaction, and the T_xmax value of the tuple is not 0, that is, the tuple is inserted by the current transaction, but is removed by the current transaction update or delete, so it is obviously not visible.
Rule 3: If Status (t_xmin) = in_progress && t_xmin = Current_txid && t_xmax≠invaild⇒tuple is I Nvisible
Conversely, if this tuple is committed by the current transaction, and the T_xmax value of the tuple is 0, it indicates that the tuple is inserted by the current transaction and has not been modified, so it is visible.
Rule 4: If Status (t_xmin) = in_progress && t_xmin = current_txid && T_xmax = Invaild⇒tuple is V Isible
3. The status of T_xmin is committed
In contrast to the above, if a tuple's t_xmin state is committed, then it is quite possible that it is visible .
Let's list the rules and explain them later.
Rule 5: If Status (t_xmin) = COMMITTED && Snapshot (t_xmin) = Active⇒tuple is Invisible
Rule 6: If Status (t_xmin) = COMMITTED && Snapshot (t_xmin) ≠active && (T_xmax = INVALID | | Status (T_xmax) = aborted) ⇒tuple is Visible
Rule 7: If Status (t_xmin) = COMMITTED && Status (t_xmax) = in_progress && T_xmax = current_txid⇒ Tuple is Invisible
Rule 8: If Status (t_xmin) = COMMITTED && Status (t_xmax) = in_progress && t_xmax≠current_txid⇒ Tuple is Visible
Rule 9: If Status (t_xmin) = COMMITTED && Status (t_xmax) = COMMITTED && Snapshot (t_xmax) = active ⇒tuple is Visible
Rule: If Status (t_xmin) = COMMITTED && Status (t_xmax) = COMMITTED && Snapshot (T_xmax) ≠active ⇒tuple is Invisible
Rule5 is more obvious, for a tuple, the transaction in which it was inserted is committed (commited), and the transaction is active under the current snapshot, indicating that the transaction is in progress or not yet the command in the current transaction Started (forget the words, see the contents of Section 2.2), so that the transaction is inserted into the tuple pairs at the current is not visible;
Rule6, obviously, the tuple has not been modified or modified its transaction is abort = > The tuple has not been modified; the transaction x that inserts the tuple is inactive in the current snapshot ( Inactive indicates that transaction x is either committed or abort for the currently executing SQL command, so it is visible;
Rule7, if the tuple is the current transaction update or delete, the nature of this tuple for us is the old version, not visible;
Rule8, in contrast to Rule7, this tuple is changed by another transaction x (update or delete), and transaction x is not committed (PostgreSQL does not support READ UNCOMMITTED), so the modified tuple is not visible to us, we can see the current tuple, So the current tuple is visible;
Rule9, although the transaction x that modifies this tuple has already been committed, the transaction x is active in the current snapshot, that is, the command in the current transaction is in progress or not yet started (the second time this hypothesis is used), So the modification of transaction x is not visible to the current command, so we see this tuple;
Rule10, in contrast to the previous one, modifies this tuple's transaction x has been committed, and transaction x is inactive in the current snapshot (inactive indicates that the transaction x is either committed or abort for the SQL command currently being executed). So for the command in the current transaction, this transaction x has already been committed, so this tuple is already modified for the command in the current transaction, which is the old version, so it is invisible.
MVCC in 3.PostgreSQL
Also confessed some basic knowledge, the following formally entered the MVCC.
3.1-Tuple visibility detection
In this section we will talk about the detection of "tuple visibility" in PostgreSQL.
The detection of the so-called "tuple visibility" is the use of tuple visibility rules to determine whether a tuple is visible or invisible to the execution statements in the current transaction. We know that in PostgreSQL a tuple is a multi-version, and for a command in a transaction it needs to find a tuple of that version that the current command should see in the corresponding transaction.
Detection of tuple visibility not only helps to find the correct "version" of the tuple, but also can be used to resolve exceptions defined in the ANSI SQL-92 standard:
脏读;不可重复读;幻读
That is, different levels of transaction isolation can be implemented.
Let's just go to the example:
For simplicity, the isolation level of the transaction for TXID=200 is the isolation level of the transaction for read committed,txid=201 we have a read COMMITTED or repeatable read two case discussion.
The order in which the commands are executed is as follows:
T1 : txid=200 's transaction begins
T2 : txid=201 's transaction begins
T3 : txid=200 and txid=201 transactions execute the SELECT command separately
T4 : txid=200 Transaction Execution Update command
T5 : txid=200 and txid=201 transactions execute the SELECT command separately
T6 : txid=200 's transaction commit
T7 : txid=201 's transaction Execution Select command
Let's take a look at how PostgreSQL performs "tuple visibility" detection.
T3 :
At T3 time, the current table has only tuple_1, according to Rule6 the tuple is visible to all transactions;
T5 :
At the T5 moment the situation is different, we discuss the two transactions separately.
for txid = 200 transactions , at this moment, we know that tuple_1 is invisible ( according to Rule7), tuple_2 visible ( according to Rule4);
Therefore, the return result of the SELECT statement is:
postgres=# -- txid 200postgres=# SELECT * FROM tbl; name ------ Hyde(1 row)
for TXID = 201 Transactions , at this moment, we know that tuple_1 is not visible ( according to Rule8), tuple_2 is also invisible ( according to Rule2);
Therefore, the return result of the SELECT statement is:
postgres=# -- txid 201postgres=# SELECT * FROM tbl; name -------- Jekyll(1 row)
As we can see, the TXID = 201 Transaction does not read uncommitted updates for transactions TXID = 200, that is, to avoid dirty read problems. No dirty reads are caused in all of PostgreSQL's transaction isolation levels.
T7 :
At T7 time, only TXID = 201 of the transaction is still running, TXID = 200 of the transaction has been committed. Now we have two cases to discuss the behavior of TXID = 201 transactions.
1) TXID = 201 The isolation level of the transaction is read COMMITTED
At this point, because the transaction for TXID = 200 has been committed, the snapshot retrieved at this time is 201:201:. Therefore, we can know that tuple_1 is invisible ( according to Rule10), tuple_2 is visible ( according to Rule6),
Therefore, at the Read committed level, the return result of the SELECT statement is:
postgres=# -- txid 201 (READ COMMITTED)postgres=# SELECT * FROM tbl; name ------ Hyde(1 row)
As we can see, when the transaction is at the isolation level of Read Committed, the result is different for the same select two times, that is, non-repeatable reads .
2) TXID = 201 The isolation level of the transaction is repeatable READ
At this point, although a transaction of TXID = 200 has been committed, we know that at repeatable read/serializable, the transaction only gets one snapshot at the time the first command is executed, so snapshot still remains unchanged at this time: 200:200. Therefore, we can know that tuple_1 is invisible ( according to Rule9), tuple_2 is visible ( according to Rule5),
Therefore, at the Read committed level, the return result of the SELECT statement is:
postgres=# -- txid 201 (READ COMMITTED)postgres=# SELECT * FROM tbl; name -------- Jekyll(1 row)
We can see that when the transaction is at the isolation level of repeatable read, the results obtained by the same select and two times are not changed, that is, the non-repeatable read is avoided.
Here, we have solved the problem of dirty read and non-repeatable reading , then there is a phantom read . However, does Phantom read exist in the transaction at PostgreSQL at the isolation level of repeatable read?
We know that the definition of phantom reads: A transaction re-executes a query that returns a rowset that matches a search condition, and finds that the rowset that satisfies the condition has changed because of another recently committed transaction.
Obviously, in PostgreSQL, as a result of the snapshot isolation mechanism, we continue to be able to discover that the repeatable read/serializable isolation level eliminates the illusion of reading (I don't write it), that is, in the repeatable Read begins to avoid the illusion of reading, which is not the same as other databases, and the isolation level provided in PostgreSQL is more stringent.
About this series, but also left "how to avoid lost Update" and "SSI (Serializable Snapshot isolation)" Two issues not discussed, length is too long, not to write a paper, allow the landlord to drink tea, stay in the next article to say it ~