Amazon Aurora: How to implement a distributed database without using a consistency protocol

Source: Internet
Author: User

Summary: This is the second Amazon Aurora paper, published at SIGMOD 2018. The title is very attractive: avoiding the use of a consistency protocol for I/O, commits, and membership changes. While everyone today relies on consistency protocols (Raft, Multi-Paxos), Aurora proposes doing without one, the main argument being that the existing protocols are too heavy.

This is the second Amazon Aurora paper, published at SIGMOD 2018. The title is very attractive: avoiding the use of a consistency protocol for I/O, commits, and membership changes. While everyone today relies on consistency protocols (Raft, Multi-Paxos), Aurora proposes doing without one. The main argument is that the existing protocols are too heavy and bring extra network overhead; this is understandable given that Aurora keeps 6 copies, so its main bottleneck is the network. So, what does Aurora actually do?

Since many of Aurora's details are still not public, much of what follows is my own interpretation, together with questions I put to the authors; if anything is wrong, discussion is welcome.

The paper mainly answers that question:

Aurora is able to avoid distributed consensus during writes and commits by managing consistency points in the database instance rather than establishing consistency across multiple storage nodes.

In Aurora, the storage tier has no authority to decide whether to accept a write; it must accept whatever the database sends. It is then up to the database tier to decide whether the SCL, PGCL, and VCL can move forward. In other words, the storage tier by itself is not a strongly consistent system, only a quorum system; it needs the database tier's cooperation to achieve strong consistency.

This is also where Aurora differs from most current system designs. Most systems today are built on an underlying strongly consistent, stable KV store (or block store), with protocol parsing and translation done in the compute nodes above. Aurora proposes that the underlying layer only needs to be a quorum system, and that the storage tier plus the database tier together achieve strong consistency.

For example, in Spanner, each spanserver is itself a multi-replica group whose data consistency is guaranteed by Multi-Paxos, and the F1 layer above mainly does protocol translation, turning SQL into KV requests against the spanservers.

Our PolarDB is the same kind of design: the underlying storage node, PolarStore, is a stable, reliable, strongly consistent system, while the compute node PolarDB above it is stateless.

So how exactly does Aurora implement this?

Terms:

LSN: Log Sequence Number
Each redo log record has a unique, monotonically increasing Log Sequence Number (LSN), generated by the database. Because Aurora has a single-writer, multi-reader architecture, keeping the LSN monotonically increasing is easy.
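
Since there is a single writer instance, LSN allocation can be as simple as a monotonically increasing counter. A trivial sketch (illustrative only, not Aurora's code):

```python
# Sketch: with a single writer, LSN allocation is just a monotonic counter.
import itertools

lsn_allocator = itertools.count(start=1)

def next_lsn():
    """Allocate the next redo-record LSN on the (single) writer instance."""
    return next(lsn_allocator)

print(next_lsn(), next_lsn(), next_lsn())   # -> 1 2 3
```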

SCL: Segment Complete LSN
The SCL (Segment Complete LSN) is the largest LSN known to the current segment such that all records up to it have been received by this node, i.e. the data up to the SCL is contiguous. The difference from the VCL is that the VCL is the completed LSN acknowledged by all nodes, while the SCL is only what this node itself considers complete: the VCL can be thought of as the storage tier's commitIndex, and the SCL as merely the lastLogIndex recorded by the current node. Aurora also uses the SCL when nodes gossip with each other to fetch missing log records.
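
A minimal sketch of how a segment might advance its SCL (the names and the use of consecutive integer LSNs are my own simplification, not Aurora's actual log format): the SCL only moves forward while there is no gap in the received records.

```python
# Sketch: how a storage segment could advance its SCL (Segment Complete LSN).
# Illustrative only; field names and consecutive-integer LSNs are assumptions.

class Segment:
    def __init__(self):
        self.received = set()   # LSNs of log records this segment has received
        self.scl = 0            # highest LSN up to which the log is gap-free

    def receive(self, lsn):
        """Accept a log record (possibly out of order) and advance the SCL."""
        self.received.add(lsn)
        # The SCL can only move forward while the next LSN is present;
        # a missing record (a gap) stops it, even if later LSNs have arrived.
        while self.scl + 1 in self.received:
            self.scl += 1

seg = Segment()
for lsn in [1, 2, 5, 3]:        # record 4 has not arrived yet
    seg.receive(lsn)
print(seg.scl)                  # -> 3: records 1..3 are contiguous, 5 sits beyond the gap
```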

VCL: Volume Complete LSN
The VCL is the LSN up to which the storage tier has confirmed completion, i.e. the storage tier guarantees that all data at or below the VCL has been confirmed durable; once confirmed, that data is guaranteed to survive recovery. During storage node recovery, data beyond the VCL must be deleted. The VCL is roughly equivalent to a commitIndex. This guarantee exists only at the storage node level; the database may later instruct the storage tier to truncate the log starting from some earlier point.

The VCL is only the storage tier's assurance to the database that, within the storage tier, the multiple nodes have synchronized up to this point and are consistent. The VCL is provided by the storage nodes.

PGCL: Protection Group Complete LSN
Each shard (protection group) has its own SCL, and this per-shard SCL is called the PGCL. In other words, the SCL mentioned above is the database-wide SCL: each shard has its own PGCL, and the database's SCL equals the largest PGCL.

CPL: Consistency Point LSN
The CPL is provided by the database to tell the storage tier which log points can be treated as persisted consistency points; this is essentially the same technique file systems use to make multiple operations atomic.

Why is the CPL needed? Think of it this way: the database needs to tell the storage tier which logs it has actually confirmed. Some log records may already have been handed to the storage tier but later need to be truncated or rolled back, so the CPL tells the storage tier exactly which logs the database really needs. Again, this is the same technique a file system uses to make multiple operations atomic, which is why the last record of every MTR (mini-transaction) is a CPL.
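
A rough illustration of marking the last record of each MTR as a CPL (the record layout and names are assumptions on my part; Aurora's real log format is not public):

```python
# Sketch: tagging the last record of each mini-transaction (MTR) as a CPL.
# Names and layout are illustrative assumptions, not Aurora's internals.

from dataclasses import dataclass

@dataclass
class LogRecord:
    lsn: int
    payload: bytes
    is_cpl: bool = False   # Consistency Point LSN marker

def build_mtr(next_lsn, payloads):
    """Turn one mini-transaction's changes into log records, marking only
    the final record as a CPL so the storage tier never treats a
    half-applied MTR as a consistent point."""
    records = [LogRecord(next_lsn + i, p) for i, p in enumerate(payloads)]
    records[-1].is_cpl = True
    return records, next_lsn + len(payloads)

lsn = 1
mtr1, lsn = build_mtr(lsn, [b"page A delta", b"page B delta"])
mtr2, lsn = build_mtr(lsn, [b"page C delta"])
print([(r.lsn, r.is_cpl) for r in mtr1 + mtr2])
# -> [(1, False), (2, True), (3, True)]
```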

VDL: Volume Durable LSN
Because the database marks more than one CPL, the largest CPL that is still no larger than the VCL is called the VDL (Volume Durable LSN). The VCL means the storage tier has confirmed durability of everything at or below it; the CPL is the database telling the storage tier which points it considers consistent; so the VDL is the point that the database layer has confirmed and the storage tier has also confirmed as durably persisted, i.e. the position the database now regards as committed.

So the VDL is the commit position confirmed at the database layer. In general the VCL will be larger than the VDL. The VDL is determined by the database, and it is the VDL that the database layer generally cares about, because the VCL may include the uncommitted part of a transaction.
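
In other words, the VDL is the largest CPL that does not exceed the VCL. A tiny sketch of that rule (my own restatement of the relationship described above):

```python
# Sketch: VDL is the largest CPL that is still <= VCL.

def compute_vdl(cpls, vcl):
    """Return the Volume Durable LSN given the CPLs marked by the database
    and the VCL reported by the storage tier."""
    durable_cpls = [c for c in cpls if c <= vcl]
    return max(durable_cpls) if durable_cpls else 0

# CPLs marked by the database at 900, 1000, 1100; storage has confirmed
# everything up to 1007, so the volume is durably consistent up to 1000.
print(compute_vdl([900, 1000, 1100], vcl=1007))   # -> 1000
```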

MTR: Mini Transaction
So the transaction commit process works like this: each transaction has a corresponding "commit LSN". After issuing the commit, the transaction goes off and does other work. When is the client told that the commit succeeded? When the VDL (advanced by the database once the storage service confirms the writes) becomes greater than or equal to the "commit LSN", a dedicated thread notifies the waiting clients that their transactions have completed.
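
A sketch of that asynchronous acknowledgement path (a hypothetical outline with made-up names such as CommitTracker, not Aurora's actual threading model): the thread that processes storage acknowledgements advances the VDL and wakes up every transaction whose commit LSN is now covered.

```python
# Sketch: asynchronous commit acknowledgement driven by VDL advancement.
# A hypothetical outline, not Aurora's actual implementation.

import heapq

class CommitTracker:
    def __init__(self):
        self.vdl = 0
        self.waiting = []            # min-heap of (commit_lsn, txn_id)

    def register_commit(self, txn_id, commit_lsn):
        """Called when a transaction finishes issuing its commit record;
        the dedicated thread (not the caller) acknowledges it later."""
        heapq.heappush(self.waiting, (commit_lsn, txn_id))

    def on_vdl_advance(self, new_vdl):
        """Called when storage acks move the VDL forward; every transaction
        with commit LSN <= VDL is now durably committed."""
        self.vdl = max(self.vdl, new_vdl)
        acked = []
        while self.waiting and self.waiting[0][0] <= self.vdl:
            _, txn_id = heapq.heappop(self.waiting)
            acked.append(txn_id)
        return acked                 # notify these waiting clients

t = CommitTracker()
t.register_commit("txn-1", commit_lsn=120)
t.register_commit("txn-2", commit_lsn=180)
print(t.on_vdl_advance(150))         # -> ['txn-1']
```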

If a transaction fails to commit, how does the subsequent recovery handle it?

First of all, this recovery is done per PG. When the database comes up, it reads enough copies to form a quorum and derives the VDL from their contents: because the last record of each MTR is a CPL, the largest of these CPLs is the VDL. It then sends this VDL to the other copies, the redo logs greater than the VDL are removed, and the PG's data is restored.
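
A rough sketch of that per-PG recovery flow as just described (replica_cpls and read_quorum are illustrative names; real recovery involves more bookkeeping than shown here):

```python
# Sketch: crash recovery for one protection group (PG), following the
# description above: read a quorum of copies, take the largest CPL seen
# as the recovery VDL, then truncate every redo record above it.

def recover_pg(replica_cpls, read_quorum):
    """replica_cpls: per-replica lists of the CPLs each copy holds.
    Returns the LSN above which redo records must be discarded."""
    sample = replica_cpls[:read_quorum]            # any read quorum will do
    vdl = max((c for cpls in sample for c in cpls), default=0)
    return vdl

replica_cpls = [
    [100, 200],   # copy 1
    [100, 200],   # copy 2
    [100],        # copy 3 lags behind
]
print(recover_pg(replica_cpls, read_quorum=3))     # -> 200: truncate LSNs > 200
```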

SCN: the transaction's commit redo record
That is, the redo record that commits a transaction: the largest LSN among the redo records generated by that transaction. It is used mainly to check whether the transaction has been made durable.

Durability of committed transactions is guaranteed by ensuring that the SCN is definitely smaller than the VCL, so Aurora must wait until the VCL is no lower than the current transaction's SCN before returning to the client.

PGMRPL: Protection Group Minimum Read Point LSN
This LSN is mainly used for garbage collection: it is the lowest read point the database still needs, and data below this LSN can be cleaned up. A storage node therefore only accepts read requests whose read point lies between the PGMRPL and the SCL.
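
A one-function sketch of that acceptance check on the storage node (parameter names are mine):

```python
# Sketch: a storage segment only serves reads whose read point falls between
# PGMRPL (the oldest LSN any reader still needs) and its own SCL.

def can_serve_read(read_point_lsn, pgmrpl, scl):
    """True if this segment can satisfy a read at read_point_lsn."""
    return pgmrpl <= read_point_lsn <= scl

print(can_serve_read(500, pgmrpl=300, scl=800))   # -> True
print(can_serve_read(200, pgmrpl=300, scl=800))   # -> False: already garbage-collected
```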

So what does the write process look like?

If a transaction in the database tier needs to commit, it may involve multiple shards (protection groups), so multiple MTRs are generated, and the MTRs submit their log records to the storage tier in sequence. Each MTR may contain multiple log records, and the last LSN of those records is the CPL. The storage tier moves its local SCL forward. After receiving acknowledgements from a majority of storage nodes, the database tier moves its own VDL forward. The next request the database tier sends carries the new VDL, so the other storage nodes update their view of it. A condensed sketch of one such round is shown below.
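
To tie the pieces together, here is a sketch of one write round as just described, assuming 6 copies per protection group and a 4/6 write quorum (Aurora's published configuration); the class and method names are illustrative, not Aurora's code.

```python
# Sketch: one round of the write path. 6 copies, 4/6 write quorum.
import random

WRITE_QUORUM = 4

class StorageNode:
    def __init__(self):
        self.log = []
        self.known_vdl = 0

    def append(self, records, known_vdl):
        """Accept redo records unconditionally (the storage tier has no vote)
        and learn the database's latest VDL piggybacked on the request."""
        if random.random() < 0.1:          # simulate a slow or unreachable copy
            return False
        self.log.extend(records)
        self.known_vdl = max(self.known_vdl, known_vdl)
        return True

def write_mtr(nodes, records, current_vdl):
    """Send one MTR's redo records to all copies; once a write quorum acks,
    the database (not the storage tier) advances its durable point."""
    acks = sum(node.append(records, current_vdl) for node in nodes)
    if acks >= WRITE_QUORUM:
        return records[-1]["lsn"]          # the MTR's CPL is now a VDL candidate
    return current_vdl                     # no quorum: durable point does not move

nodes = [StorageNode() for _ in range(6)]
mtr = [{"lsn": 101, "data": "..."}, {"lsn": 102, "data": "..."}]
print(write_mtr(nodes, mtr, current_vdl=100))
```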

So what does the read process look like?

In Aurora's quorum scheme, reads do not go through a quorum.

From the master node's point of view: during quorum writes, the master learns and records each storage node's current VDL, so it can send the read directly to a storage node that has the latest data.

For slave nodes: while the master writes redo records to the storage nodes, it also asynchronously ships the redo log to the slave nodes, along with updates to the VDL, VCL, SCL and so on, and each slave node builds its own local cache from this. The slave node also holds the global per-storage-node VDL information, so it too can read directly from a storage node that has the latest data.
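
A sketch of that single-node read routing: the database instance keeps, per storage node, how far that node is known to be complete, and sends the read to any node that already covers the read point. Names and structure are my own illustration.

```python
# Sketch: routing a read to a single storage node instead of a read quorum.
# node_state is the database instance's bookkeeping (learned during quorum
# writes / gossip); names are illustrative assumptions.

def pick_read_node(node_state, read_point_lsn):
    """node_state: {node_id: highest LSN known complete on that node}.
    Return any node that already has everything up to the read point,
    so no quorum read is needed."""
    candidates = [n for n, lsn in node_state.items() if lsn >= read_point_lsn]
    return candidates[0] if candidates else None   # None -> wait or fall back

state = {"node-a": 980, "node-b": 1010, "node-c": 1005}
print(pick_read_node(state, read_point_lsn=1000))   # -> 'node-b'
```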

Personal Opinion:

The paper starts from three properties: quorum I/O, locally observable state, and monotonically increasing log ordering; these are what allow Aurora to do without a consistency protocol. Let's go through them one by one.

The monotonically increasing log ordering here is guaranteed by the LSN, which plays a role similar to a Lamport logical clock (since there is only one writer node, and if the writer node goes down there is a recovery process, it is easy to keep the LSN increasing).

Locally observable state is the state the current node sees; each node sees a different state. Every node (database nodes as well as storage nodes) has its own view of the SCL, VCL, VDL and so on, and this information represents the current node's state. As the Lamport logical clock paper says, in a distributed system there is no way to judge the order of the states on two different nodes; a partial order between two states can only be established if there is message passing between them. Here it is quorum I/O that establishes this partial order.

Quorum I/O effectively establishes a partial-order relationship each time a quorum I/O is acknowledged. For example, after a successful write, I can be sure that the state of the current database node is ahead of the other storage nodes. After a gossip round in which nodes confirm information with each other, the state of the node that initiated the confirmation is likewise ahead of the other nodes. Therefore, after a write, or after gossip following a restart, there must be some node whose state is ahead of all the others, and that node's state is the state of the current system; the SCL, VCL, and VDL it holds are the consistent values.
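
As a quick worked check of why this holds, Aurora's published quorum configuration (6 copies, write quorum of 4, read quorum of 3) guarantees that any read quorum overlaps any write quorum, so some node in the overlap always carries the most recent state:

```python
# Worked check of the quorum arithmetic behind this argument,
# using Aurora's published 6-copy configuration.
V, Vw, Vr = 6, 4, 3
assert Vw + Vr > V        # every read quorum intersects every write quorum
assert 2 * Vw > V         # any two write quorums also intersect
print("quorum intersection holds")
```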

On every read, the node being read must be at the front of the current partial order, because the state being read was produced by either a write or a recovery operation; both of those get agreement from more than half of the nodes and therefore sit at the front of the partial order, and it is that node's local observable state that is obtained. So the read is guaranteed to return the latest content.

An important reason the Aurora system is relatively easy to implement is that there is only one writer, which guarantees that only one event can occur at a time.

At first glance the system feels somewhat ad hoc, lacking the relatively complete theoretical proofs that Paxos/Raft have, but when I asked the authors, they said the implementation was also verified with TLA+.
