Distributed consistency algorithms-Paxos and consistency-paxos

Last Update:2016-06-27 Source: Internet

Author: User


Paxos is a message-passing consensus algorithm proposed by Leslie Lamport in 1990. It solves the problem of how a distributed system can reach agreement on a value (a resolution). In engineering practice, Paxos can be used for multi-replica consistency, distributed locks, name management, and sequence number allocation. For example, in a distributed database system, if every node starts from the same initial state and executes the same sequence of operations, all nodes end up in the same state. To ensure that every node executes the same command sequence, a consensus algorithm must be run for each command so that all nodes see the same command. This article first describes the original Paxos algorithm (Basic Paxos), mainly its two-phase process, and then focuses on its variant, Multi-Paxos, an optimization of Basic Paxos that is better suited to engineering practice. Finally, a Q & A section lists my questions about the Paxos algorithm and my understanding of them.

Concepts and terminology
Proposer: the initiator of proposals. It handles client requests and sends them to the cluster as proposals so that the cluster can decide whether the value is approved.
Acceptor: the approver of proposals. It processes the proposals it receives; each reply is a vote, and it stores some state that determines whether it will accept a value.
Replica: a node, that is, a server in the distributed system, generally a separate physical or virtual machine. In Paxos, a replica takes on both the proposer and the acceptor roles.
ProposalId: the number of a proposal. A proposal with a larger number has higher priority.
Paxos Instance: one run of Paxos, used to reach agreement on a single value among multiple nodes, for example the value at one log sequence number (logIndex). Different logIndexes belong to different Paxos instances.
AcceptedProposal: the number of the proposal that the acceptor has accepted in a Paxos instance.
AcceptedValue: the value of the proposal that the acceptor has accepted in a Paxos instance.
MinProposal: in a Paxos instance, the smallest proposal number that the acceptor is still willing to accept, that is, the largest number it has seen so far. It is updated continuously.
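To make the ordering of ProposalIds concrete, here is a minimal sketch. It assumes a ProposalId is a (counter, server_id) pair compared field by field; the class and function names are illustrative and do not come from any particular Paxos implementation.

```python
from typing import NamedTuple


class ProposalId(NamedTuple):
    """Hypothetical proposal number: compared by counter first, then by server id."""
    counter: int    # monotonically increasing counter (a timestamp would also work)
    server_id: int  # breaks ties between nodes, so no two nodes produce the same id


def next_proposal_id(highest_seen: ProposalId, server_id: int) -> ProposalId:
    """Return an id strictly greater than the highest id this node has seen."""
    return ProposalId(highest_seen.counter + 1, server_id)


a = ProposalId(3, 1)                   # would be written "3.1"
b = ProposalId(3, 5)                   # would be written "3.5"
c = next_proposal_id(b, server_id=1)   # would be written "4.1"
print(a < b < c)
```

Because tuples compare lexicographically, a proposal from a later round always outranks one from an earlier round, and the server id only matters within the same round.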

Basic-Paxos Algorithm
A system built on the Paxos protocol can provide service as long as more than half of its nodes are online and able to communicate with each other. A Paxos instance consists of two phases: the prepare phase and the accept phase. The steps are as follows:

1. The proposer obtains a ProposalId n. To ensure that ProposalIds increase monotonically and are unique, n can be generated from timestamp + serverId;
2. The proposer broadcasts a prepare(n) request to all nodes;
3. Each acceptor compares n with minProposal. If n > minProposal, this is a newer proposal, and the acceptor sets minProposal = n. In either case it returns (acceptedProposal, acceptedValue);
4. After the proposer has received replies from more than half of the nodes, it checks whether any reply carries an acceptedValue. If so, some value has already been accepted, and the proposer replaces its own value with the acceptedValue that has the highest acceptedProposal;
5. If no acceptedValue was returned, no value has been accepted yet in the current Paxos instance, and the proposer keeps its own value. It then enters the second phase and broadcasts accept(n, value) to all nodes;
6. Each acceptor compares n with minProposal. If n >= minProposal, it sets acceptedProposal = minProposal = n and acceptedValue = value. In either case it returns minProposal;
7. After the proposer has received replies from more than half of the nodes, if any returned value is greater than n, a newer proposal exists and the proposer goes back to step 1; otherwise the value is chosen.
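The two phases can be sketched in a few lines of Python. This is a single-process illustration, not a production implementation: it assumes proposal numbers are (counter, server_id) tuples, that every acceptor replies, and that acceptor state lives in memory (a real acceptor must persist it before replying). It follows the common formulation in which the proposer adopts the returned acceptedValue with the highest acceptedProposal and then proceeds to the accept phase. All names are illustrative.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple

Pid = Tuple[int, int]  # assumed encoding: (counter, server_id)


@dataclass
class Acceptor:
    """State an acceptor keeps for ONE Paxos instance."""
    min_proposal: Pid = (0, 0)
    accepted_proposal: Optional[Pid] = None
    accepted_value: Any = None

    def prepare(self, n: Pid) -> Tuple[Optional[Pid], Any]:
        # Prepare phase: remember the newest proposal number seen,
        # then report whatever has already been accepted.
        if n > self.min_proposal:
            self.min_proposal = n
        return self.accepted_proposal, self.accepted_value

    def accept(self, n: Pid, value: Any) -> Pid:
        # Accept phase: accept unless a newer prepare has been seen.
        if n >= self.min_proposal:
            self.accepted_proposal = self.min_proposal = n
            self.accepted_value = value
        return self.min_proposal


def propose(acceptors: List[Acceptor], n: Pid, value: Any) -> Optional[Any]:
    """Run one round with number n; return the chosen value, or None if rejected."""
    replies = [a.prepare(n) for a in acceptors]
    accepted = [(p, v) for p, v in replies if p is not None]
    if accepted:
        # Some value was already accepted: adopt the one with the highest number.
        value = max(accepted, key=lambda pv: pv[0])[1]
    results = [a.accept(n, value) for a in acceptors]
    if any(r > n for r in results):
        return None  # a newer proposal exists; the caller retries with a higher n
    return value


nodes = [Acceptor() for _ in range(3)]
first = propose(nodes, (1, 1), "red")
second = propose(nodes, (2, 2), "blue")
stale = propose(nodes, (1, 5), "green")
print(first, second, stale)
```

In this run the second proposer asks for "blue" but ends up confirming "red", because once a value has been chosen every later proposal must carry it. The third call uses a number lower than one already seen, so its accept is rejected and the caller would have to retry.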
From the above process, we can see that under concurrency a proposer may be overtaken by a newer proposal again and again and have to restart from step 1, resulting in low performance. In more serious cases agreement may never be reached, which is the so-called "livelock". For example:

1. As a proposer, S1 initiates prepare(3.1) and reaches a majority on S1, S2, and S3;
2. Then S5, as a proposer, initiates prepare(3.5) and reaches a majority on S3, S4, and S5;
3. S1 initiates accept(3.1, value1). Because S3 has already promised 3.5 > 3.1, the accept request cannot reach a majority, and S1 generates a new proposal;
4. S1 initiates prepare(4.1) and reaches a majority on S1, S2, and S3;
5. S5 initiates accept(3.5, value5). Because S3 has already promised 4.1 > 3.5, the accept request cannot reach a majority, and S5 generates a new proposal;
6. S5 initiates prepare(5.5) and reaches a majority on S3, S4, and S5. As a result, the accept(4.1, value1) that S1 then initiates will also fail.

......
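The interleaving above can be replayed with a small simulation. This is a simplified sketch: the acceptor only tracks the highest proposal number it has seen and ignores values, proposal numbers are (counter, server_id) tuples standing for 3.1, 3.5 and so on, and each proposer only ever talks to the three nodes named in the example. All names are illustrative.

```python
class Acceptor:
    """Simplified acceptor: tracks only the highest proposal number seen."""

    def __init__(self):
        self.min_proposal = (0, 0)

    def prepare(self, n):
        if n > self.min_proposal:
            self.min_proposal = n
        return self.min_proposal == n   # True if this prepare was honoured

    def accept(self, n):
        return n >= self.min_proposal   # True if the accept is not blocked


servers = {name: Acceptor() for name in ("S1", "S2", "S3", "S4", "S5")}
MAJORITY = len(servers) // 2 + 1
q1 = ["S1", "S2", "S3"]   # nodes reached by proposer S1
q5 = ["S3", "S4", "S5"]   # nodes reached by proposer S5


def prepare(quorum, n):
    return sum(servers[s].prepare(n) for s in quorum) >= MAJORITY


def accept(quorum, n):
    return sum(servers[s].accept(n) for s in quorum) >= MAJORITY


outcomes = []
n1, n5 = (3, 1), (3, 5)
prepare(q1, n1)
for _ in range(3):                    # three rounds of the duel; it could go on forever
    prepare(q5, n5)                   # S5 overtakes S1 on S3
    outcomes.append(accept(q1, n1))   # S1's accept is blocked by S3
    n1 = (n5[0] + 1, 1)
    prepare(q1, n1)                   # S1 overtakes S5 on S3
    outcomes.append(accept(q5, n5))   # S5's accept is blocked by S3
    n5 = (n1[0] + 1, 5)
print(outcomes)
```

Every accept in the run fails, because S3, the node both proposers share, has always just seen a newer prepare from the other side.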

Role of the prepare stage
As described above, Basic Paxos needs two phases to decide a value. Because of these multiple rounds, performance is low: at least two network round trips (RTTs) are required. Can the prepare phase be omitted? Consider the following example:

1. S1 first initiates accept(1, red) and reaches a majority on S1, S2, and S3. red is persisted on S1, S2, and S3.
2. Then S5 initiates accept(5, blue). Because S3 sees this as a newer proposal, it changes its acceptedValue to blue.
3. S5 reaches a majority on S3, S4, and S5, and blue is persisted on S3, S4, and S5.
4. The final result is that S1 and S2 hold red, while S3, S4, and S5 hold blue.

Therefore, both phases are indispensable. The role of the prepare phase is to block older proposals and to return any proposal that has already been accepted. We can also see that if S1 were the only proposer, there would be no problem. This is the idea behind Multi-Paxos, described below.
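The red/blue scenario can be reproduced with a toy acceptor that has no prepare phase. The class below is hypothetical and deliberately broken: it shows what goes wrong, not how Paxos works.

```python
class AcceptOnlyNode:
    """Hypothetical acceptor for a protocol WITHOUT a prepare phase."""

    def __init__(self):
        self.accepted_proposal = 0
        self.accepted_value = None

    def accept(self, n, value):
        # Nothing forces the proposer to learn earlier values first,
        # so a newer proposal simply overwrites the old one.
        if n >= self.accepted_proposal:
            self.accepted_proposal, self.accepted_value = n, value
            return True
        return False


nodes = {name: AcceptOnlyNode() for name in ("S1", "S2", "S3", "S4", "S5")}

red_votes = sum(nodes[s].accept(1, "red") for s in ("S1", "S2", "S3"))
blue_votes = sum(nodes[s].accept(5, "blue") for s in ("S3", "S4", "S5"))

final = {name: node.accepted_value for name, node in nodes.items()}
print(red_votes, blue_votes)   # both proposers collected three acks
print(final)
```

Both proposers count three acknowledgements and conclude that their value was chosen, yet the cluster ends up split between red and blue.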

Multi-Paxos Algorithm
Basic Paxos reaches agreement on a single value. Multi-Paxos reaches agreement on a series of consecutive Paxos instances. The key point is that Multi-Paxos has a leader, which is the only proposer in the system. Within its lease period all proposals carry the same ProposalId, so the prepare phase can be skipped and each proposal only goes through the accept phase. Because one ProposalId can correspond to multiple values, the protocol is called Multi-Paxos.

Election
First, we need a leader. Leader election is itself a run of the Paxos algorithm, except that this time the value being decided is "who is the leader". Any node can initiate a proposal, so under concurrency more than one node may believe it has been elected; for example, A and B may be elected leader one after the other. To avoid frequent elections, a newly elected node should immediately establish its authority (let the other nodes know that it is the leader) by writing a special log entry (the start-working log) that confirms its identity. By the majority principle, only one leader's start-working log can reach a majority. After confirming its identity, the leader uses a lease mechanism to maintain its leadership, so that other proposers do not initiate proposals. During the leader's term there are no concurrent conflicts, so the prepare phase can be skipped and the leader goes directly to the accept phase. It follows that once a leader has been elected, every log entry in its term needs only one network RTT (round-trip time) to reach agreement.
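The leader-based flow can be sketched as follows. This is an in-memory illustration under several assumptions: a single prepare round stands in for the election, leases and heartbeats are left out, and the log recovery a new leader must perform before appending is not shown. The class names are illustrative.

```python
from typing import Dict, List, Tuple

Pid = Tuple[int, int]  # assumed encoding: (counter, server_id)


class LogAcceptor:
    """Acceptor with one slot per log index; one proposal number covers all slots."""

    def __init__(self) -> None:
        self.min_proposal: Pid = (0, 0)
        self.log: Dict[int, Tuple[Pid, str]] = {}

    def prepare(self, n: Pid) -> bool:
        if n > self.min_proposal:
            self.min_proposal = n
        return self.min_proposal == n

    def accept(self, n: Pid, index: int, value: str) -> bool:
        if n >= self.min_proposal:
            self.min_proposal = n
            self.log[index] = (n, value)
            return True
        return False


class Leader:
    def __init__(self, proposal_id: Pid, acceptors: List[LogAcceptor]) -> None:
        self.proposal_id = proposal_id
        self.acceptors = acceptors
        self.next_index = 0
        # One prepare round for the whole term; afterwards only accepts are sent.
        granted = sum(a.prepare(proposal_id) for a in acceptors)
        if granted <= len(acceptors) // 2:
            raise RuntimeError("election failed: no majority")

    def append(self, value: str) -> bool:
        """Replicate one log entry with a single accept round (one RTT)."""
        index = self.next_index
        acks = sum(a.accept(self.proposal_id, index, value) for a in self.acceptors)
        if acks > len(self.acceptors) // 2:
            self.next_index += 1
            return True
        return False


cluster = [LogAcceptor() for _ in range(3)]
leader = Leader((1, 1), cluster)
results = [leader.append(v) for v in ("x=1", "x=2", "x=3")]
print(results)
print(sorted(cluster[0].log))
```

Each append reuses the same ProposalId and needs only the accept round. If another node later wins an election with a higher ProposalId, the old leader's accepts are rejected, which is how a stale leader finds out that its term is over.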

Master recovery process
Paxos places no restrictions on who may stand for election: any node can take part and eventually become the leader. The newly elected leader is therefore not guaranteed to hold all the logs, and its log may even be empty. Before it can actually provide service, it must go through a recovery process to obtain all committed logs. The new master queries all members for their maximum logId. After receiving replies from a majority, it takes the largest logId as the end point of log recovery. The point of requiring a majority is that this end point then covers every log that has reached agreement; it may of course also cover logs that did not reach a majority. Having obtained this logId, the master runs the Paxos protocol for each logId one by one from the beginning, and the system cannot provide service until the new master has obtained all the logs.
As an optimization, a confirm mechanism is introduced: the leader notifies the other acceptors of the logIds that have been agreed, and each acceptor writes a confirm log to its log file. After a restart, the new master scans its local logs and does not run Paxos again for logs that already have a confirm log. Likewise, when serving a client request, it must run a new round of Paxos for any log that has no confirm log. Because there is no strict requirement on where the confirm log is written, confirms can be sent in batches. To avoid having to run Paxos on too many already committed logs during a restart, the confirm log should not lag too far behind the latest committed logId.
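The recovery procedure can be sketched as two small functions. This is an illustration under stated assumptions: logIds start at 1, the set of confirmed logIds is already known from the local confirm log, and run_paxos is a hypothetical callback that runs a full Paxos instance for one logId and returns the chosen value.

```python
from typing import Callable, Dict, List, Set


def recovery_end_point(max_log_ids: List[int], cluster_size: int) -> int:
    """Pick the recovery end point from the replies of a majority of members."""
    if len(max_log_ids) <= cluster_size // 2:
        raise RuntimeError("no majority replied; cannot recover yet")
    return max(max_log_ids)


def recover(local_log: Dict[int, str], confirmed: Set[int], end_point: int,
            run_paxos: Callable[[int], str]) -> List[int]:
    """Re-run Paxos for every logId up to end_point that has no confirm log.

    Returns the logIds that had to be re-run.
    """
    replayed = []
    for log_id in range(1, end_point + 1):
        if log_id in confirmed:
            continue                      # already known to be agreed
        local_log[log_id] = run_paxos(log_id)
        confirmed.add(log_id)
        replayed.append(log_id)
    return replayed


# Three of five members replied with their maximum logId.
end = recovery_end_point([7, 9, 8], cluster_size=5)
log = {i: f"op{i}" for i in range(1, 8)}   # the new master holds logs 1..7
done = set(range(1, 7))                    # its confirm log covers 1..6
replayed = recover(log, done, end, lambda i: f"op{i}")
print(end, replayed)
```

The closer the confirm log is to the latest committed logId, the shorter the list of logIds that must be replayed, which is why the confirm log should not be allowed to lag too far behind.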

Performance Optimization
Confirming one log in Basic Paxos requires at least two disk writes and two network RTTs (one each for the prepare phase and the accept phase). Multi-Paxos uses a one-phase commit (the prepare phase is omitted), which shortens the confirmation of one log to one RTT and one disk write. The confirm mechanism shortens the recovery time of a new master. To improve performance further, a batch of logs can be committed as a group that either succeeds or fails as a whole; like group commit, this trades response time (RT) for throughput.

Security (Exception Handling)
1. Leader exception
During its term, the leader periodically sends a heartbeat to every node to tell them that it is still alive (working normally). If a node receives no heartbeat within the timeout period, it tries to start an election. If the leader fails, all nodes time out one after another; a new master is then elected, goes through the recovery process, and finally provides service. The exceptions we usually refer to fall into the following three types:
(1) Process crash (OS crash)
A leader process crash and an OS crash are handled in the same way. If the restart takes longer than the heartbeat timeout, the other nodes conclude that the leader has failed and elect a new one.
(2) Node network exception (network partition of the node)
A network exception on the leader also causes the other nodes to miss heartbeats, but in this case the leader may still be alive and the network is merely jittering. The heartbeat timeout therefore cannot be set too short; otherwise network jitter would trigger frequent elections. In another case, the IDC in which a node is located is partitioned from the rest, while nodes within the same IDC can still communicate with each other. If the nodes in that IDC form a majority, service continues normally. If not, for example with four nodes split evenly across two IDCs, neither IDC holds a majority after the partition, and no leader can be elected. For this reason the number of Paxos nodes is generally odd, and the distribution of nodes across IDCs should also be considered during deployment.
(3) Disk fault
In the first two cases the disk is intact, so the logs the node has received and the corresponding confirm logs are all still there. If the disk fails, the node that rejoins is like a brand-new node with no log or proposal information. This can cause a problem: the node may promise a proposal whose number is smaller than the largest proposalID it had already promised, which violates the Paxos rules. Therefore, after such a node restarts, it must not take part in Paxos instances right away; it must first catch up with the leader. Only once it has observed one complete Paxos instance does the node leave the state in which it may not promise or ack.
2. Follower exception (downtime, disk damage, etc.)
A follower exception is much simpler to handle, because a follower does not itself provide external service (its logs may be incomplete). As long as the leader can still reach a majority, it can keep providing service. After a follower restarts, it may not promise until it has caught up with the leader.
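For the IDC partition case discussed under network exceptions, the majority rule is easy to check with a few lines. The helper names are illustrative, and the sketch assumes that the IDCs are completely cut off from each other.

```python
def has_majority(reachable: int, cluster_size: int) -> bool:
    """True if `reachable` nodes (counting this one) form a strict majority."""
    return reachable > cluster_size // 2


def surviving_idcs(idc_sizes):
    """Indexes of the IDCs that can still serve once all IDCs are isolated."""
    total = sum(idc_sizes)
    return [i for i, size in enumerate(idc_sizes) if has_majority(size, total)]


even_split = surviving_idcs([2, 2])       # four nodes in two IDCs
odd_split = surviving_idcs([3, 2])        # five nodes in two IDCs
three_way = surviving_idcs([2, 2, 1])     # five nodes in three IDCs
print(even_split, odd_split, three_way)
```

With four nodes split two and two, neither side can elect a leader. With five nodes split three and two, the larger IDC keeps serving. An odd node count alone is not enough, though: if five nodes are spread over three IDCs and all links fail, no IDC holds a majority.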

Q & A
1. What is the difference between Paxos data synchronization and the traditional one-master-N-slave data synchronization?
In general, high availability for traditional databases is built on master/slave replication: one master and one slave, two copies in total. When the master crashes, an HA tool performs the switchover and promotes the slave to master. Where strong consistency is required, replication can run in strong synchronous mode; both Oracle and MySQL offer a similar replication mode. However, if the standby's network jitters or the standby crashes, log synchronization fails and the service becomes unavailable. To address this, a multi-copy mode with one master and N slaves can be introduced. Let us compare two setups with three replicas each: the traditional one-master-two-slave mode and a Paxos-based one-master-two-slave architecture. In the traditional mode, the master can return success as soon as any one slave has received and persisted the log, which to some extent solves the problems of network jitter and standby crashes. However, an HA tool is still needed to switch masters, and then the question becomes how to ensure the availability of the HA tool itself. Paxos-based multi-copy synchronization adds a consensus protocol on top of one master and N slaves, so that the availability of the whole system is determined entirely by the three replicas, with no need for an additional HA tool. In fact, many systems rely on third-party services such as ZooKeeper to implement distributed locks, so that the HA tools on multiple nodes obtain consistent master/slave information. In essence, these are also built on a Paxos-style consensus protocol.

2. What is the relationship between distributed transactions and the Paxos protocol?
In the database field, distributed transactions come up whenever distributed systems are discussed. The Paxos protocol and distributed transactions do not operate at the same level. The role of a distributed transaction is to guarantee the atomicity of a transaction that spans nodes: the nodes involved either all commit (execute successfully) or all abort (roll back). This is usually guaranteed through two-phase commit (2PC), which involves one coordinator and several participants. In the first phase, the coordinator asks each participant whether the transaction can be executed; a participant replies "agree" if local execution succeeded and "cancel" if it failed. In the second phase, the coordinator decides based on the votes from the first phase: the transaction is committed only if all participants agree, and is rolled back otherwise. The biggest problems with 2PC are that the coordinator is a single point of failure (a standby is needed) and that the protocol is blocking: when any participant fails, the others have to wait (a timeout mechanism can be added). The Paxos protocol, by contrast, solves the consistency problem among multiple replicas, for example synchronizing logs so that every node has the same log, or electing a new primary when the primary fails so that the vote is consistent and the primary is unique. In short, 2PC guarantees the atomicity of a transaction across multiple data shards, while Paxos guarantees the consistency of one data shard across multiple replicas. The two are complementary and by no means substitutes for each other. Paxos can also be used to remove the single point of failure of the 2PC coordinator: when the coordinator fails, a new coordinator is elected to continue providing service.
In engineering practice, Google Spanner and Google Chubby use Paxos to synchronize multiple copies of logs.
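For contrast with Paxos, the two phases of 2PC can be sketched as below. This is a failure-free, in-memory illustration: participant crashes, timeouts, and coordinator failover are not modelled, and the class and method names are illustrative.

```python
from typing import List


class Participant:
    def __init__(self, name: str, can_commit: bool) -> None:
        self.name = name
        self.can_commit = can_commit   # whether local execution will succeed
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: execute locally and vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Coordinator logic: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]   # phase 1: collect all votes
    decision = all(votes)
    for p in participants:                        # phase 2: broadcast the decision
        if decision:
            p.commit()
        else:
            p.rollback()
    return decision


ok_group = [Participant("shard-a", True), Participant("shard-b", True)]
bad_group = [Participant("shard-a", True), Participant("shard-b", False)]
ok = two_phase_commit(ok_group)
bad = two_phase_commit(bad_group)
print(ok, [p.state for p in ok_group])
print(bad, [p.state for p in bad_group])
```

Note the difference in quorum: 2PC needs a yes from every participant, because each one holds different data, whereas Paxos needs only a majority, because all replicas hold the same data.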

3. How can Paxos be applied to traditional database replication protocols?
The replication protocol is effectively a customized application of Paxos. Voting on a series of logs and confirming that each one reaches a majority is equivalent to confirming that the log has been persisted on a majority of nodes. The replicas stay in sync with the master by applying the persisted logs. Because of the ACID properties, a database always moves from one consistent state to another; each transaction changes the database state and generates a log. Because client operations are ordered, the order of the logs must be preserved: on every replica, all logs must not only be persisted but also be kept in order. Each log is identified by a logID, and logIDs are strictly increasing (they define the order). The leader holds a vote on each log to reach a majority. If a leader switch happens midway, the new leader must vote again on any logID that is a "hole" in its own log to confirm that log's validity.
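The ordering requirement can be illustrated with a small helper that applies logs strictly in logID order and stops at the first hole. It is a sketch that assumes logIDs are consecutive integers starting at 1 and that every entry in the dictionary has already been agreed by a majority. The function name is illustrative.

```python
from typing import Callable, Dict


def apply_in_order(log: Dict[int, str], applied_up_to: int,
                   apply: Callable[[str], None]) -> int:
    """Apply consecutive logs after `applied_up_to`; return the last logID applied."""
    next_id = applied_up_to + 1
    while next_id in log:
        apply(log[next_id])
        next_id += 1
    return next_id - 1


state = []
log = {1: "INSERT a", 2: "UPDATE a", 4: "DELETE a"}   # logID 3 is a hole
last = apply_in_order(log, 0, state.append)
print(last, state)

log[3] = "UPDATE b"            # the hole is filled after another vote
last = apply_in_order(log, last, state.append)
print(last, state)
```

The replica refuses to apply logID 4 until logID 3 is known, which is why a new leader has to vote again on the holes in its log before it can move on.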

4. Can non-leader nodes of Multi-Paxos provide services?
In the Multi-Paxos protocol, only the leader is guaranteed to hold all the persisted logs. Of course, a log persisted locally has not necessarily reached a majority, so for logs without a confirm the leader must vote again before returning the latest result to the client. A non-leader node does not necessarily have all the latest data and would have to check with the leader. Therefore, in typical engineering implementations, all read and write services are provided by the leader.

5. What should I do if the client request fails?
Suppose the client sends a request to the leader and the leader crashes before replying. From the client's point of view, the operation may have succeeded or failed, so the client needs to check the result and decide whether to retry. If the old leader persisted the log locally but did not reach a majority, the new leader first obtains the maximum logID from the replicas as the recovery end point and then runs Paxos on every log that has no local confirm. If a majority is reached at that point, the log is applied; if not, it is not applied. The client then checks whether the operation succeeded. In practice, the details depend on the client timeout, the leader election time, and the log recovery time.

References

https://ramcloud.stanford.edu/~ongaro/userstudy/paxos.pdf
http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/papers/paper2-1.pdf
http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
https://zhuanlan.zhihu.com/p/20417442
http://my.oschina.net/hgfdoing/blog/666781
http://www.cnblogs.com/foxmailed/p/5487533.html
http://www.wtoutiao.com/p/1a7mSx6.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
