The Paxos algorithm is a consensus algorithm proposed by Leslie Lamport in 1990. The problem Paxos solves is how a distributed system can agree on a value (a resolution). In engineering practice, Paxos can be used to build multi-replica consistency, distributed locks, name management, sequence number allocation, and so on. For example, in a distributed database system, if every node starts in the same initial state and executes the same sequence of operations, they will all end up in the same final state. To ensure that every node executes the same sequence of commands, a consensus algorithm is run for each instruction so that every node sees the same instructions. This article first describes the original Paxos algorithm (Basic Paxos), focusing on its two-phase commit process, then turns to the Paxos variant Multi-Paxos, an optimization of Basic Paxos better suited to engineering practice, and finally presents, in Q&A form, the questions I ran into while studying Paxos and my understanding of them.
Concepts and terminology
Acceptor: the approver of proposals. It handles the proposals it receives (its replies are votes) and stores the state used to decide whether to accept a value.
Replica: a node (replica), i.e. a server in the distributed system, typically a separate physical or virtual machine; in Paxos it plays both the proposer and the acceptor role.
ProposalId: every proposal carries a number, and a proposal with a higher number has higher priority.
Paxos instance: one run of Paxos that agrees on a single value across the nodes, identified by a log sequence number (logIndex); different logIndex values belong to different Paxos instances.
AcceptedProposal: within a Paxos instance, the number of the proposal that has been accepted.
AcceptedValue: within a Paxos instance, the value of the proposal that has been accepted.
MinProposal: within a Paxos instance, the smallest proposal number the acceptor will still accept; it is continuously updated as higher-numbered proposals arrive.
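To make these terms concrete, the per-instance acceptor state can be sketched as a small Python record. This is only an illustration; the field names simply mirror the terminology above and do not come from any particular implementation.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class AcceptorState:
    """Durable state an acceptor keeps for one Paxos instance (one logIndex)."""
    min_proposal: int = 0                     # highest proposal number promised so far
    accepted_proposal: Optional[int] = None   # number of the accepted proposal, if any
    accepted_value: Optional[Any] = None      # value of the accepted proposal, if any
```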
Basic Paxos algorithm
1. The proposer obtains a proposalId; to keep proposalIds increasing, they can be generated from a timestamp plus the serverId.
2. The proposer broadcasts prepare(n) to all nodes.
3. Each acceptor compares n with its minProposal. If n > minProposal, this is a newer proposal, so the acceptor sets minProposal = n; it then returns its (acceptedProposal, acceptedValue).
4. When the proposer has received replies from a majority, if any reply contains an acceptedValue, a value has already been accepted somewhere; the proposer adopts that acceptedValue locally (the one with the highest acceptedProposal) and jumps back to step 1 to generate a higher-numbered proposal.
5. If no higher-priority proposal exists in the current Paxos instance, the proposer enters the second phase and broadcasts accept(n, value) to all nodes.
6. Each acceptor compares n with its minProposal. If n >= minProposal, it sets acceptedProposal = minProposal = n and acceptedValue = value, persists them locally, and then replies; otherwise it replies with its current minProposal.
7. When the proposer has received replies from a majority, if any reply carries a value greater than n, a newer proposal exists and the proposer jumps back to step 1; otherwise the value is chosen (agreement is reached).
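The two phases can be sketched as follows. This is a minimal, single-process illustration of the message handlers, reusing the AcceptorState record above; RPC, persistence, failure handling and the majority check are omitted, and helper names such as make_proposal_id and propose are my own, not from the original algorithm description.

```python
import time

def make_proposal_id(server_id: int) -> int:
    # timestamp plus serverId keeps proposal ids increasing and unique across servers
    return int(time.time() * 1000) * 1000 + server_id

# ---- acceptor side ----
def on_prepare(state: AcceptorState, n: int):
    """Phase 1b: promise to ignore proposals numbered below n."""
    if n > state.min_proposal:
        state.min_proposal = n
    return state.accepted_proposal, state.accepted_value

def on_accept(state: AcceptorState, n: int, value) -> int:
    """Phase 2b: accept the value unless a higher-numbered prepare has been seen."""
    if n >= state.min_proposal:
        state.accepted_proposal = state.min_proposal = n
        state.accepted_value = value   # a real acceptor persists this before replying
    return state.min_proposal

# ---- proposer side ----
def propose(value, server_id: int, acceptors):
    """Drive one Paxos instance; the local 'acceptors' list stands in for remote nodes."""
    while True:
        n = make_proposal_id(server_id)
        # Phase 1: prepare (for brevity, replies from all acceptors are gathered here)
        replies = [on_prepare(a, n) for a in acceptors]
        accepted = [(p, v) for p, v in replies if p is not None]
        if accepted:
            # adopt the value of the highest-numbered accepted proposal
            value = max(accepted, key=lambda pv: pv[0])[1]
        # Phase 2: accept
        results = [on_accept(a, n, value) for a in acceptors]
        if all(r <= n for r in results):
            return value   # no newer proposal was seen: the value is chosen
        # a higher-numbered proposal exists elsewhere: retry with a new proposal number
```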
Two proposers competing within the same Paxos instance can keep invalidating each other's proposals, for example:
1. S1, as a proposer, initiates prepare(3.1) and reaches a majority on S1, S2 and S3;
2. S5, as a proposer, then initiates prepare(3.5) and reaches a majority on S3, S4 and S5;
3. S1 initiates accept(3.1, value1); because S3 has already seen proposal 3.5 > 3.1, the accept request cannot reach a majority, and S1 tries to regenerate its proposal;
4. S1 initiates prepare(4.1) and reaches a majority on S1, S2 and S3;
5. S5 initiates accept(3.5, value5); because S3 has already seen proposal 4.1 > 3.5, the accept request cannot reach a majority, and S5 tries to regenerate its proposal;
6. S5 initiates prepare(5.5) and reaches a majority on S3, S4 and S5, which causes the subsequent accept(4.1, value1) from S1 to fail;
...... (the two proposers can keep preempting each other this way, a livelock)
The role of the prepare phase
From the description of Basic Paxos above, finalizing a value requires two phases and possibly many rounds, so performance is low: at least two network RTTs per value. Could the prepare phase simply be omitted? Consider the following scenario:
1. S1 first initiates accept(1, red) and reaches a majority on S1, S2 and S3; red is persisted on S1, S2 and S3;
2. S5 then initiates accept(5, blue); S3 receives the newer proposal and changes its acceptedValue to blue;
3. S3, S4 and S5 thus reach a majority, and blue is persisted on S3, S4 and S5;
4. The final result: the value on S1 and S2 is red, while the value on S3, S4 and S5 is blue, and no agreement is reached.
So both phases are essential. The role of the prepare phase is to block old proposals and to return any acceptedValue that has already been accepted. It also shows that if only S1 ever proposed, there would be no such problem; this is exactly the idea behind Multi-Paxos, discussed next.
Multi-Paxos algorithm
Basic Paxos agrees on a single value; Multi-Paxos runs a sequence of numbered Paxos instances to agree on a sequence of values. The most important difference is that Multi-Paxos has a leader. The leader is the only proposer in the system; all proposals within its lease period share the same proposalId, so the prepare phase can be skipped and each log entry only goes through the accept phase. One proposalId therefore corresponds to multiple values, hence the name Multi-Paxos.
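The leader's fast path can be sketched on top of the handlers above. This is only a sketch under the stated assumptions: a single leader holds a valid lease, each acceptor is modelled as a map from logIndex to the AcceptorState defined earlier, and lease renewal and failover are omitted.

```python
from collections import defaultdict

class MultiPaxosLeader:
    """Within its lease, the leader reuses one proposalId and skips the prepare phase."""
    def __init__(self, proposal_id: int, acceptors):
        self.proposal_id = proposal_id       # constant for the whole lease period
        self.acceptors = acceptors           # per acceptor: logIndex -> AcceptorState
        self.next_log_index = 0

    def append(self, value):
        log_index = self.next_log_index
        self.next_log_index += 1
        # one accept round per log entry; no prepare phase while the lease holds
        results = [on_accept(states[log_index], self.proposal_id, value)
                   for states in self.acceptors]
        acks = sum(1 for r in results if r <= self.proposal_id)
        return log_index if acks > len(self.acceptors) // 2 else None

# usage: three acceptors, each holding one AcceptorState per Paxos instance (logIndex)
acceptors = [defaultdict(AcceptorState) for _ in range(3)]
leader = MultiPaxosLeader(make_proposal_id(server_id=1), acceptors)
leader.append("set x=1")
leader.append("set y=2")
```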
Election
New master recovery process
Since Paxos places no restriction on which node may take part in leader election and eventually become leader, there is no guarantee that a newly elected leader contains all committed logs; there may be holes. Therefore, before actually serving requests, the new leader runs a recovery process to obtain all committed logs. The new leader asks all members for their maximum logId and, after receiving responses from a majority, takes the largest logId as the end point of log recovery. The majority matters here because it guarantees that the recovery end point covers every log entry that has reached agreement (it may of course also cover entries that never reached a majority). Once the end point is known, the leader runs the Paxos protocol from the beginning for every logId, because the system cannot serve requests until the new leader has all committed logs.

To optimize this, a confirm mechanism is introduced: once a logId has reached agreement, the leader tells the other acceptors, and each acceptor writes a confirm record for that logId into its log file. After a restart, the new leader scans its local log and does not re-run Paxos for entries that already have a confirm record; entries without a confirm record still require another round of Paxos before the corresponding client request can be answered. Because the confirm record does not need to sit at any strict position, confirm records can be sent in batches. To avoid re-running Paxos on too many entries after a restart, the confirm records must not fall too far behind the most recently committed logId.
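The recovery flow described above might look roughly like this. It is a sketch only: max_log_id(), the confirm_log set and the rerun_paxos callback are assumed interfaces used for illustration, not part of any concrete system.

```python
def recover_new_leader(confirm_log, replicas, rerun_paxos):
    """New-leader recovery sketch. confirm_log: set of logIds already known to be chosen;
    each replica is assumed to expose max_log_id(); rerun_paxos(log_id) runs a full
    Paxos instance for that entry."""
    # 1. ask members for their maximum logId (majority counting omitted here);
    #    the largest answer becomes the recovery end point
    recovery_end = max(r.max_log_id() for r in replicas)
    # 2. re-run Paxos only for entries that carry no confirm record
    for log_id in range(recovery_end + 1):
        if log_id in confirm_log:
            continue          # already confirmed as chosen: no extra round needed
        rerun_paxos(log_id)   # confirms the entry or fills the hole before serving requests
```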
Performance optimization
In Basic Paxos, committing one log entry requires at least two disk writes (one in the prepare/promise phase and one in the accept phase) and two network RTTs (prepare and accept). Multi-Paxos uses a one-phase commit (omitting the prepare phase), shortening a log commit to one RTT and one disk write, and the confirm mechanism shortens the recovery time of a new leader. To improve performance further, a batch of log entries can be committed as a group that either all succeeds or all fails, similar to group commit, trading a little latency for throughput.
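The group-commit idea could be sketched like this, reusing the MultiPaxosLeader above; the queue-based loop and the 2 ms window are illustrative choices, not from the original text.

```python
import queue
import time

def group_commit_loop(leader: "MultiPaxosLeader", requests: "queue.Queue", window: float = 0.002):
    """Group-commit sketch: gather requests for a short window, then commit the whole
    batch as a single log entry, so the batch succeeds or fails as a unit
    (a little latency is traded for throughput)."""
    while True:
        batch = [requests.get()]             # block until the first request arrives
        deadline = time.time() + window
        while time.time() < deadline:
            try:
                batch.append(requests.get(timeout=max(0.0, deadline - time.time())))
            except queue.Empty:
                break
        leader.append(batch)                 # one accept round carries the whole batch
```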
Fault tolerance (exception handling)
1. Leader failure
During its term, the leader must periodically send heartbeats to every node to announce that it is still alive (working normally). If a node receives no heartbeat within the timeout period, it attempts to start a leader election. When the leader really fails, all nodes time out, enter the election process, elect a new leader, and the new leader then runs the recovery process before serving requests again. The failures we usually talk about fall into the following three categories (a minimal election-timer sketch follows them):
(1) Process crash (OS crash)
A crash of the leader process is similar to an OS crash: as long as the restart takes longer than the heartbeat timeout, the other nodes consider the leader dead and trigger a new election.
(2) Node network failure (network partition)
A network failure at the leader also causes the other nodes to stop receiving heartbeats, but the leader may still be alive and only suffering network jitter, so the heartbeat timeout must not be set too short, or jitter will easily cause frequent re-elections. Another case is a partition of the IDC a node sits in: nodes within the same IDC can still talk to each other. If the nodes of one IDC can form a majority, service continues normally; if not, for example with 4 nodes split across two IDCs, after a partition neither IDC can reach a majority and no leader can be elected. For this reason the number of Paxos nodes is usually odd, and the distribution of nodes across IDCs must be considered at deployment time.
(3) Disk failure
In the previous two failure types the disk is intact, i.e. the accepted logs and the corresponding confirm records are still there. If the disk fails, the node rejoins as if it were a brand-new node with no log or proposal information. The danger is that such a node might promise a proposal smaller than one it had already promised before the failure, which violates the Paxos invariant. Therefore, after restarting, the node must not take part in Paxos instances; it has to catch up with the leader first, and only after it has observed one complete Paxos instance may it leave the cannot-promise/ack state.
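The heartbeat timeout mentioned above can be sketched as a follower-side timer. This is only an illustration: the heartbeat_timeout value, the extra random wait and the start_election callback are assumptions, and a real implementation would tune the timeout well above normal network jitter.

```python
import random
import time

class ElectionTimer:
    """Follower-side election trigger: if no heartbeat arrives within the timeout,
    the node tries to start a new leader election."""
    def __init__(self, heartbeat_timeout: float, start_election):
        self.heartbeat_timeout = heartbeat_timeout
        self.start_election = start_election
        self.last_heartbeat = time.time()

    def on_heartbeat(self):
        self.last_heartbeat = time.time()    # the leader is still alive

    def tick(self):
        # called periodically; a small random extra wait reduces simultaneous candidacies
        if time.time() - self.last_heartbeat > self.heartbeat_timeout + random.uniform(0, 0.1):
            self.start_election()
```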
2. Follower failure (crash, disk corruption, etc.)
Handling follower failures is much simpler, because a follower does not serve requests itself (its log may be incomplete). For the leader, as long as a majority can still be reached, it can keep serving. After a follower restarts, it has no promise capability until it has caught up with the leader.
Q&A
1. How does Paxos-based data synchronization compare with traditional one-master, N-standby synchronization?
In general, traditional database high availability is built on a master-standby setup: one master with one or two standbys; when the master crashes, an HA tool switches over and promotes a standby to master. In strongly consistent scenarios, replication can run in strong-synchronization mode, and both Oracle and MySQL offer similar replication modes. However, if the standby suffers network jitter or crashes, log synchronization fails and the service becomes unavailable. To address this, a one-master, N-standby multi-replica form can be introduced. Compare two three-replica setups, one traditional one-master-two-standby and one Paxos-based one-master-two-standby. In the traditional setup, log synchronization can return success as long as one replica has received and persisted the log, which to some extent solves the problems of network jitter and standby crashes. But if the master itself has a problem and the HA tool must switch over, then the availability of the HA switching tool itself becomes the issue. Paxos-based multi-replica synchronization essentially adds a consensus protocol on top of the one-master, N-standby setup, so the availability of the whole system is determined entirely by the three replicas, with no extra HA tool required. In fact, many systems, in order to give their HA tools a consistent view of which node is master, use ZooKeeper or similar third-party services to implement a distributed lock, which in essence is itself built on Paxos.
2. What is the relationship between distributed transactions and the Paxos protocol?
In the database field, when distributed systems come up, people immediately think of distributed transactions. The Paxos protocol and distributed transactions are not the same thing. The role of a distributed transaction is to guarantee the atomicity of a transaction across nodes: the nodes involved either all commit (execute successfully) or all abort (roll back). The atomicity of distributed transactions is usually ensured by 2PC (Two-Phase Commit), which involves one coordinator and several participants. In the first phase, the coordinator asks each participant whether its part of the transaction can be executed; a participant replies yes (local execution succeeded) or no (local execution failed). In the second phase, the coordinator makes a decision based on the first-phase votes: it commits if and only if all participants agreed to commit, and rolls back otherwise. The biggest problems with 2PC are that the coordinator is a single point (it needs a backup node) and that the protocol is blocking: if any participant fails, everyone waits (this can be mitigated with a timeout mechanism). The Paxos protocol, by contrast, solves consistency among multiple replicas, for example log synchronization that keeps the logs on all nodes consistent, or leader election on master failure that guarantees a unique, agreed-upon leader. In short, 2PC guarantees the atomicity of a transaction over multiple data shards, while Paxos guarantees the consistency of the same data shard over multiple replicas, so the two are complementary rather than alternatives. The single-point problem of the 2PC coordinator can itself be solved with Paxos: when the coordinator fails, a new coordinator is elected and continues providing service. In engineering practice, Google Spanner and Google Chubby both use Paxos for multi-replica log synchronization.
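For comparison, the 2PC flow described above can be sketched as follows; the prepare/commit/rollback methods on the participants are assumed interfaces, and timeouts and coordinator failover are deliberately left out.

```python
def two_phase_commit(coordinator_log: list, participants, txn) -> bool:
    """Minimal 2PC sketch. Each participant is assumed to expose
    prepare(txn), commit(txn) and rollback(txn)."""
    # Phase 1: ask every participant whether its local part of txn can be executed
    votes = [p.prepare(txn) for p in participants]
    decision = all(votes)                    # commit iff every participant voted yes
    coordinator_log.append((txn, decision))  # persist the decision before phase 2
    # Phase 2: broadcast the decision
    for p in participants:
        p.commit(txn) if decision else p.rollback(txn)
    return decision
```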
3. How is Paxos applied to a traditional database replication protocol?
The replication protocol is essentially a custom application built on Paxos: a series of log entries are each put to a vote, and once an entry reaches a majority it is considered successfully persisted on a majority of replicas. Replicas stay in sync with the master node by applying the logs that have been persisted. Because of the ACID properties of a database, each transaction in essence moves the database from one consistent state to another, and every operation produces a log entry. Since client operations are ordered, the logs must be ordered as well: on any replica it is not enough that all logs are persisted; their order must also be preserved. Each log entry is marked with a logId, and logIds increase strictly (marking the order). The leader drives a vote on each entry until a majority is reached; if the leader changes along the way, the new leader must re-vote on any "holes" it finds to confirm the validity of those entries.
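The in-order apply requirement can be sketched from the replica's side; the db.apply() call and the confirm_log set are assumed interfaces used only to illustrate the ordering rule.

```python
def apply_in_order(db, log: dict, confirm_log: set, applied_up_to: int) -> int:
    """Replica-side apply sketch: entries are applied strictly in logId order and only
    after they are confirmed as chosen; the loop stops at the first gap."""
    log_id = applied_up_to + 1
    while log_id in log and log_id in confirm_log:
        db.apply(log[log_id])   # each entry moves the database to its next consistent state
        applied_up_to = log_id
        log_id += 1
    return applied_up_to
```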
4. Can a non-leader node in Multi-Paxos serve requests?
In the Multi-Paxos protocol, only the leader is guaranteed to contain all committed logs. Even so, the entries it has persisted locally have not necessarily reached a majority, so for any entry without a confirm record another round of voting is needed before the latest result can be returned to the client. A non-leader node does not necessarily hold all of the latest data and would have to confirm with the leader, so in typical engineering implementations all reads and writes are served by the leader.
5. How is a client request handled when the leader fails?
The client sends a request to the leader, and the leader crashes before replying. From the client's point of view, the operation may or may not have succeeded, so the client must check the result and decide whether to retry. If the leader persisted the entry locally but crashed before it reached a majority, the new leader first obtains the maximum logId from the replicas as the recovery end point and then re-runs Paxos on every entry that has no local confirm record. If such an entry reaches a majority at that point, it is applied (the operation succeeded); if not, it is discarded. When the client checks, it learns whether its operation succeeded. In concrete engineering practice this also involves the client timeout, the leader-election time, and the log recovery time.