A brief introduction to Greenplum's MVCC multi-version control (mainly related to Cmin,cmax,xmin,xmax instructions)

Source: Internet
Author: User

familiar with the Greenplum database of friends should know that the base of the GP is using PostgreSQL database to implement the MPP architecture, and for the transaction control this piece, is also the use of PostgreSQL Multi-version control MVCC, to achieve the read and write separation, Obviously, the performance of database queries per second is improved.

at the read commit transaction isolation level, the query request only reads data changes for transactions that have been committed before the query request, and does not affect the current version of the data;

In the DML statement, the current version is manipulated. Therefore, the purpose of reading and writing separation is achieved, and the database concurrency ability is improved.

Let's review the MVCC Multi-version control in PostgreSQL.

in PostgreSQL , each transaction is given a transaction ID called XID. The transaction is not only a set of statements that are wrapped by begin-commit, but also a single insert, update, or DELETE statement. When a transaction begins, PostgreSQL increments the XID and assigns it to the transaction. PostgreSQL also stores transaction-related information on every row of records in the system, which is used to determine whether a row of records is visible to the current transaction. For example, when you insert a row of records, PostgreSQL stores the XID of the current transaction in this line and calls it xmin. Only those records that have been committed and xmin are smaller than the XID of the current transaction are visible to the current transaction. This means that you can start a new transaction and then insert a row of records until you commit (commit), and the line of records you insert is never visible to other transactions. After the commit, the new transaction created after the other will be able to see the new record because they meet the xmin < XID conditions, and the transaction that created the row is completed.

The mechanism is similar for DELETE and UPDATE, but the difference is that they use a value called Xmax to determine the visibility of the data for PostgreSQL. This image shows how MVCC works in transaction isolation in two concurrent transactions that insert/read data.

PostgreSQL using tags such as Xmin,xmax,cmin,cmax to implement multiple versions, they mean:

xmin : When a record (tuple) is created , the transaction ID is logged at this time, and each update is updated later.

Xmax: When a tuple or lock is updated or deleted , the transaction ID is logged at this time, or 0 if the record is not deleted.

Cmin: Command ID for inserting the tuple's command in the Insert transaction (starting from 0 )

Cmax: Command ID to delete the tuple's command in the Insert transaction (starting from 0 )

However, for the Greenplum database, after all, it is based on multiple Postgres instances to implement the MPP schema database, so the value of the above tag may be different from a single postgres. This is illustrated in our example below.

# load data, non-parallel, if you load data in parallel, consider using gpfdist or gpload, etc.

zhangyun_db=# COPY TEST_MVCC from '/home/gpadmin/mvcc.txt ' with delimiter as ' | ' null as ';

COPY 4

zhangyun_db=# select * from TEST_MVCC;

ID | Name

----+-----------

4 | Hadoop

3 | Greenplum

2 | Hive

1 | Spark

(4 rows)

zhangyun_db=# Select t.*, T.xmin, T.xmax, T.cmin, T.cmax from TEST_MVCC t;

ID |  name | xmin | Xmax | Cmin | Cmax

----+------------+--------+------+------+------

8 | Flink |    449908 |    0 |    0 | 0

4 | Hadoop |    449906 |    0 |    0 | 0

5 | HBase |    449775 |    0 |    0 | 0

7 | PostgreSQL |    457913 |    0 |    0 | 0

2 | Hive |    449910 |    0 |    0 | 0

3 | Greenplum |    449909 |    0 |    0 | 0

6 | Hawq |    449899 |    0 |    0 | 0

1 | Spark |    449905 |    0 |    0 | 0

(8 rows)

As can be seen, the xmin of the 8 Records is different (if it is a PostgreSQL database, this should be the same, because this data is created by the same transaction copy).

The other xmax are 0, indicating that the data has not been deleted since it was imported.

Let's show you how to perform an update in Greenplum data:

Please open two Linux terminals A and B for easy data comparison and viewing.

first executes at Terminal a , but does not commit:

zhangyun_db=# begin;

BEGIN

zhangyun_db=# Update TEST_MVCC Set name = ' Hive on Spark ' where id = 2;

UPDATE 1

Terminal B View:

zhangyun_db=# Select T.*, T.xmin, T.xmax, T.cmin, T.cmax from TEST_MVCC t;

ID |  name |  xmin | Xmax | Cmin | Cmax

----+------------+--------+--------+------+------

4 | Hadoop |      449906 |    0 |    0 | 0

7 | PostgreSQL |      457913 |    0 |    0 | 0

6 | Hawq |      449899 |    0 |    0 | 0

2 | Hive | 449910 |    450412 |    0 | 0

1 | Spark |      449905 |    0 |    0 | 0

8 | Flink |      449908 |    0 |    0 | 0

5 | HBase |      449775 |    0 |    0 | 0

3 | Greenplum |      449909 |    0 |    0 | 0

(8 rows)

As you can see, the Xmax of the data row with ID 2 has changed, but the data itself is unchanged because the transaction for Terminal A has not yet been committed.

Next, we perform the commit action in Terminal A, as follows:

zhangyun_db=# commit;

COMMIT

at the same time in Terminal B again view:

zhangyun_db=# Select T.*, T.xmin, T.xmax, T.cmin, T.cmax from TEST_MVCC t;

ID |  name | xmin | Xmax | Cmin | Cmax

----+---------------+--------+------+------+------

2 | Hive on Spark |    450412 |    0 |    0 | 0

6 | Hawq |    449899 |    0 |    0 | 0

5 | HBase |    449775 |    0 |    0 | 0

7 | PostgreSQL |    457913 |    0 |    0 | 0

4 | Hadoop |    449906 |    0 |    0 | 0

3 | Greenplum |    449909 |    0 |    0 | 0

8 | Flink |    449908 |    0 |    0 | 0

1 | Spark |    449905 |    0 |    0 | 0

(8 rows)

you can see the record with ID 2 and its xmin has changed.

according to the above results, do not know whether we have found that for Greenplum , update or delete have not modified the value of Cmin and Cmax.

in PostgreSQL , cmin and Cmax are used to determine whether the row version changes caused by other commands within the same transaction are visible. If all the commands within a transaction are executed in strict order, then each command can always see all the changes in the previous transaction without using the command ID. However, in a transaction, there are instances where commands are executed alternately, such as querying using cursors. When you fetch a cursor, you see a snapshot of the data when the cursor is declared rather than fetch execution, that is, changes to the data after the cursor is declared are not visible to the cursor.

This piece of content, follow-up extraction time analysis source and then write an article for analysis.

A brief introduction to Greenplum's MVCC multi-version control (mainly related to Cmin,cmax,xmin,xmax instructions)

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.