Distributed Consistency Algorithm -- Paxos
The Paxos algorithm is a consensus algorithm proposed by Leslie Lamport in 1990. The problem Paxos solves is how a distributed system can reach agreement on a single value (a resolution). In engineering practice, Paxos can be used to implement multi-replica consistency, distributed locks, name management, sequence-number allocation, and more. For example, in a distributed database system, if the initial state of every node is consistent and every node executes the same sequence of operations, then all nodes end up in a consistent state. To guarantee that every node executes the same command sequence, a "consistency algorithm" is run for each instruction, ensuring that the instructions seen by every node are identical. This article first describes the original Paxos algorithm (Basic Paxos), focusing on its two-phase commit process; it then turns to the Paxos variant Multi-Paxos, an optimization of Basic Paxos that is better suited to engineering practice; finally, a Q&A section presents the questions I had while learning Paxos, together with my understanding of them.
Concepts and terminology
Proposer: The proposal initiator. It handles client requests and sends proposals to the cluster to decide whether a value can be approved.
Acceptor: The proposal approver. It handles received proposals; its reply is a vote, and it stores some state in order to decide whether to accept a value.
Replica: A node or replica, i.e., a server in the distributed system, typically a single physical or virtual machine; it usually plays both the proposer and acceptor roles in Paxos.
ProposalId: Every proposal carries a number; proposals with higher numbers have higher priority.
Paxos Instance: The process by which Paxos reaches agreement on the same value across multiple nodes; it can be identified by a log sequence number (logIndex), where different logIndex values belong to different Paxos instances.
AcceptedProposal: Within a Paxos instance, the number of the proposal that has been accepted.
AcceptedValue: Within a Paxos instance, the value of the proposal that has been accepted.
MinProposal: Within a Paxos instance, the smallest proposal number the acceptor will still accept; it is continuously updated as higher-numbered proposals arrive.
Basic-Paxos algorithm
A system based on the Paxos protocol requires that more than half of its nodes are online and able to communicate with one another normally. At its core, a Paxos instance consists of two phases: the prepare phase and the accept phase. The flow is as follows (a minimal code sketch is given after the list):
1. Obtain a proposalId; to guarantee that proposalIds increase monotonically, they can be generated from a timestamp plus a serverId;
2. The proposer broadcasts prepare(n) requests to all nodes;
3. The acceptor compares n with minProposal: if n > minProposal, this is a newer proposal, so it sets minProposal = n; in either case it returns (acceptedProposal, acceptedValue);
4. When the proposer has received replies from a majority, if any reply carries an acceptedValue, a value has already been accepted; the proposer saves the acceptedValue from the highest-numbered reply locally as its own value, then jumps back to step 1 to generate a higher-numbered proposal;
5. If no higher-priority proposal appeared in this Paxos instance, the proposer enters the second phase and broadcasts accept(n, value) to all nodes;
6. The acceptor compares n with minProposal: if n >= minProposal, it sets acceptedProposal = minProposal = n and acceptedValue = value, persists them locally, and returns success; otherwise it returns minProposal;
7. When the proposer has received replies from a majority, if any reply contains a value > n, a newer proposal exists and the proposer jumps back to step 1; otherwise the value has been agreed upon.
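To make the two phases concrete, here is a minimal single-process sketch in Python. It models the acceptor state (minProposal, acceptedProposal, acceptedValue) and one proposer round over an in-memory list of acceptors. All names are illustrative; a real implementation needs RPC, durable storage, and retry with a higher proposalId on rejection. For brevity, the sketch carries an adopted value straight into the accept round instead of restarting at step 1.

```python
# Minimal sketch of the Basic-Paxos roles described above (illustrative only).
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Acceptor:
    min_proposal: int = 0                 # minProposal
    accepted_proposal: int = 0            # acceptedProposal
    accepted_value: Optional[str] = None  # acceptedValue

    def prepare(self, n: int) -> Tuple[int, Optional[str]]:
        # Step 3: promise to ignore proposals numbered below n,
        # and report any value already accepted.
        if n > self.min_proposal:
            self.min_proposal = n
        return (self.accepted_proposal, self.accepted_value)

    def accept(self, n: int, value: str) -> int:
        # Step 6: accept unless a higher-numbered prepare has been seen.
        if n >= self.min_proposal:
            self.accepted_proposal = self.min_proposal = n
            self.accepted_value = value   # persisted to disk in practice
        return self.min_proposal          # a reply > n signals rejection

def propose(acceptors, n: int, value: str) -> str:
    # Steps 2-4: prepare round (assume these replies form a majority).
    replies = [a.prepare(n) for a in acceptors]
    accepted = [r for r in replies if r[1] is not None]
    if accepted:
        value = max(accepted)[1]          # adopt value of highest acceptedProposal
    # Steps 5-7: accept round.
    results = [a.accept(n, value) for a in acceptors]
    if any(r > n for r in results):
        raise RuntimeError("rejected; retry with a higher proposalId")
    return value                          # this value is now chosen

# Example: propose([Acceptor() for _ in range(3)], n=1, value="red") -> "red"
```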
From the flow above we can see that, under concurrency, steps 4 and 7 may be retried frequently, which hurts performance; in the worst case agreement is never reached at all, a situation known as a "livelock". For example:
1. S1, acting as a proposer, initiated prepare(3.1) and reached a majority on S1, S2, and S3;
2. Subsequently S5, acting as a proposer, initiated prepare(3.5) and reached a majority on S3, S4, and S5;
3. S1 initiated accept(3.1, value1); because S3 had seen proposal 3.5 > 3.1, the accept request could not reach a majority, and S1 tried to regenerate a proposal;
4. S1 initiated prepare(4.1) and reached a majority on S1, S2, and S3;
5. S5 initiated accept(3.5, value5); because S3 had seen proposal 4.1 > 3.5, the accept request could not reach a majority, and S5 tried to regenerate a proposal;
6. S5 initiated prepare(5.5) and reached a majority on S3, S4, and S5, causing the subsequent accept(4.1, value1) initiated by S1 to fail;
......
The role of the prepare phase
As the description of Basic-Paxos shows, finalizing a value takes two phases and possibly many rounds, so performance is low: at least two network RTTs per value. Can the prepare phase be omitted, then? Consider the following scenario:
1. S1 first initiated accept(1, red) and reached a majority on S1, S2, and S3; red was persisted on S1, S2, and S3;
2. Subsequently S5 initiated accept(5, blue); having received a newer proposal, S3 changed its acceptedValue to blue;
3. S3, S4, and S5 then formed a majority, and blue was persisted on S3, S4, and S5;
4. The final result is that S1 and S2 hold the value red while S3, S4, and S5 hold blue, and no agreement is reached.
So both phases are essential: the role of the prepare phase is to block old proposals and to return any acceptedProposal/acceptedValue that has already been accepted. You can also see that if only S1 ever made proposals, there would be no problem; that observation is exactly the Multi-Paxos we discuss next.
Multi-Paxos algorithm
Basic Paxos reaches agreement on a single value; Multi-Paxos runs a continuous sequence of Paxos instances to agree on a series of values. The most important difference is that the Multi-Paxos protocol has a leader. The leader is the only proposer in the system; all proposals within its lease period share the same proposalId, so the prepare phase can be skipped and each proposal goes through only the accept process. One proposalId can thus correspond to multiple values, hence the name Multi-Paxos.
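Under that assumption, a stable leader can append one log entry per accept round. The sketch below (names illustrative, not from any library) keys acceptor state by logIndex and reuses a single proposalId for the whole lease:

```python
# Sketch: Multi-Paxos with a stable leader; no prepare round during the lease.

class MultiPaxosAcceptor:
    def __init__(self):
        self.min_proposal = 0   # promised proposalId, shared across the log
        self.log = {}           # logIndex -> (acceptedProposal, acceptedValue)

    def accept(self, n: int, log_index: int, value: str) -> int:
        if n >= self.min_proposal:
            self.min_proposal = n
            self.log[log_index] = (n, value)  # persisted to disk in practice
        return self.min_proposal              # a reply > n signals rejection

class Leader:
    def __init__(self, acceptors, proposal_id: int):
        self.acceptors = acceptors
        self.n = proposal_id    # fixed for the entire lease period
        self.next_index = 0

    def append(self, value: str) -> int:
        idx, self.next_index = self.next_index, self.next_index + 1
        acks = sum(1 for a in self.acceptors
                   if a.accept(self.n, idx, value) <= self.n)
        if acks <= len(self.acceptors) // 2:
            raise RuntimeError("no majority; lease lost, re-run election")
        return idx              # entry chosen in one RTT and one disk write
```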
Election
First we need a leader. Electing one is in essence itself a run of the Paxos algorithm, except that this time the value Paxos decides is "who is the leader". Since any node can initiate a proposal, under concurrency there may temporarily be multiple masters, for example both A and B believing they are the leader. To avoid frequent re-election, a newly elected leader node must immediately establish its authority (let the other nodes know it is the leader) by writing a special log (the start-working log) to confirm its identity. By the majority principle, only one leader's start-working log can reach a majority. Once its identity is confirmed, the leader maintains it through a lease mechanism so that other proposers stop initiating proposals. This begins the leader's term: because there are no concurrent conflicts, the prepare phase can be skipped and proposals go directly to the accept phase. The analysis shows that once a leader is elected, every log entry within the leader's term needs only one network RTT (round-trip time) to reach agreement. A minimal lease sketch follows.
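A hedged sketch of the lease idea, with an illustrative timeout value; a real system must also account for clock skew between the leader and the other nodes:

```python
import time

LEASE_TIMEOUT = 5.0  # seconds; illustrative value only

class LeaseHolder:
    """Sketch: a node acts as leader only while its lease is fresh."""
    def __init__(self):
        self.lease_granted_at = None

    def on_majority_ack(self):
        # Called when the start-working log (or a later heartbeat)
        # reaches a majority of nodes.
        self.lease_granted_at = time.monotonic()

    def is_leader(self) -> bool:
        # Skip the prepare phase only while the lease is still valid.
        return (self.lease_granted_at is not None and
                time.monotonic() - self.lease_granted_at < LEASE_TIMEOUT)
```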
New master recovery process
Since Paxos places no restriction on which nodes may run for master, any node can become leader, so there is no guarantee that the newly elected leader holds all the logs; there may be holes. Before actually serving requests, the new master must therefore run a recovery process to obtain all committed logs. It queries all members for their maximum logId and, after receiving a majority of responses, selects the largest logId as the end point of log recovery. A majority suffices here because that end point is guaranteed to cover every log entry that has reached agreement (it may also cover entries that never reached a majority, which is harmless). Once it has the end point, the new master runs the Paxos protocol from scratch for every logId, because the system cannot provide service until the new master holds all the logs. As an optimization, a confirm mechanism is introduced: once a logId has reached agreement, the leader tells the other acceptors, and each acceptor writes a confirm log for that logId into its log file. After a restart, the new master scans its local log and does not re-run Paxos for entries that already have a confirm log; when serving a client request, it still needs to re-run a round of Paxos for any entry without one. Since confirm logs need no strict position, they can be written in batches. To ensure that not too many entries must be re-run through Paxos after a restart, the confirm point must be kept within a bounded distance of the most recently committed logId. A recovery sketch follows.
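A minimal sketch of that recovery flow under stated assumptions: `max_log_ids` holds the maximum logId reported by a majority of members, `confirmed` is the set of logIds with a local confirm log, and `run_paxos_instance` is a caller-supplied callback that runs a full prepare and accept round for one logId (all three names are hypothetical):

```python
from typing import Callable, Dict, List, Set

def recover(max_log_ids: List[int], confirmed: Set[int],
            run_paxos_instance: Callable[[int], str]) -> Dict[int, str]:
    # The largest logId among a majority of replies covers every chosen
    # entry, so it can serve as the log recovery end point.
    end = max(max_log_ids)
    log: Dict[int, str] = {}
    for log_id in range(end + 1):
        if log_id in confirmed:
            continue                              # confirm log: known chosen
        log[log_id] = run_paxos_instance(log_id)  # re-run a full Paxos round
        confirmed.add(log_id)                     # written lazily/batched in practice
    return log
```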
Performance optimization
With Basic-Paxos, confirming one log entry requires at least two disk writes and two network RTTs (one each for the prepare phase and the accept phase). Multi-Paxos uses a one-phase commit (omitting the prepare phase), shortening one log confirmation to a single RTT and a single disk write, and through the confirm mechanism it also shortens the recovery time of a new master. To improve performance further, a batch of log entries can be committed as one group, all succeeding or all failing together, similar to group commit, trading response time for throughput. A batching sketch follows.
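A hedged sketch of the group-commit idea: incoming requests are queued and flushed as one batch, so the whole batch shares one accept round and one disk write. `append_batch` stands in for the leader's accept round and is an assumption, not a real API:

```python
import queue

def group_commit_loop(requests: "queue.Queue[str]", append_batch,
                      max_batch: int = 64) -> None:
    # Collect requests and commit them together: one Paxos accept round
    # and one disk write per batch, trading latency for throughput.
    while True:
        batch = [requests.get()]            # block for the first request
        while len(batch) < max_batch:
            try:
                batch.append(requests.get_nowait())
            except queue.Empty:
                break
        append_batch(batch)                 # all succeed or all fail together
```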
Security (Exception handling)
1. Leader exceptions
During its term, the leader must periodically send heartbeats to every node to tell them it is still alive (working). If a node receives no heartbeat within the timeout period, it attempts to initiate an election. When the leader fails, every node times out and enters the election process; a new master is elected, runs the recovery process, and finally provides service to the outside. The exceptions we usually speak of fall into the following three categories:
(1). Process crash (OS crash)
Whether the leader process crashes or the OS crashes, as long as the restart takes longer than the heartbeat timeout, the other nodes will conclude that the leader is down and trigger re-election.
(2). Node network exception (network partition)
A network exception on the leader likewise causes the other nodes to miss heartbeats, but the leader may actually be alive and merely suffering network jitter; therefore the heartbeat timeout must not be set too short, otherwise jitter will cause frequent re-elections. Another situation is a partition between IDCs, in which nodes within the same IDC can still communicate with each other. If the nodes in one IDC can form a majority, the system keeps serving normally; if not, say with four nodes split across two IDCs, after a partition neither IDC can reach a majority and no leader can be elected. For this reason the number of Paxos nodes is generally odd, and the distribution of nodes across IDCs must also be considered at deployment time.
(3). Disk Failure
In the first two exception types the disk is intact, that is, the received logs and the corresponding confirm logs are still there. If the disk fails, the rejoining node is like a brand-new node with no log or proposal information. This can lead to the node promising a proposal smaller than one it had already promised, which violates the Paxos invariant. Therefore, after restarting, such a node must not take part in Paxos instances; it first needs to catch up with the leader, and only once it has observed a complete Paxos instance does it leave the state of being unable to promise/ack.
2. Follower exceptions (downtime, disk corruption, etc.)
Handling follower exceptions is much simpler, because a follower does not itself provide service (its log may be incomplete); for the leader, service can continue as long as a majority is reachable. After a follower restarts, it has no promise capability until it has caught up with the leader.
Q&A
1. How does Paxos-based data synchronization compare with the traditional one-master-N-standby synchronization?
In general, high availability for traditional databases is based on a master/standby design: one master and one standby give two copies, and when the master crashes, an HA tool performs the switchover and promotes the standby to master. In scenarios requiring strong consistency, replication can be run in strong-synchronization mode; Oracle and MySQL both have similar replication modes. However, if the standby suffers network jitter or crashes, log synchronization fails and the service becomes unavailable. For this reason we can move to a one-master-N-standby multi-replica form. Comparing two three-replica setups, one based on a traditional one-master-two-standby design and the other on Paxos with one master and two standbys: in the traditional design, a log write can return once any one standby has received and persisted it, which to some extent solves the problems of network jitter and standby crashes; but if the master itself fails and the HA tool has to switch over, guaranteeing the availability of the HA tool becomes an issue in its own right. Paxos-based multi-replica synchronization essentially adds a consistency protocol on top of one-master-N-standby, so the availability of the whole system is determined entirely by the three replicas, with no need for an extra HA tool. In fact, many systems, in order for their HA tooling to agree on who the master is, use ZooKeeper or a similar third-party service as a distributed lock, which is in essence also implemented on top of Paxos.
2. What is the relationship between distributed transactions and the Paxos protocol?
In the database domain, whenever distributed systems come up, distributed transactions get mentioned. The Paxos protocol and distributed transactions are not the same thing. The role of a distributed transaction is to guarantee the atomicity of a cross-node transaction: the nodes involved either all commit (execute successfully) or all abort (roll back). The atomicity of a distributed transaction is usually ensured by 2PC (two-phase commit), which involves a coordinator and several participants. In the first phase, the coordinator asks each participant whether its part of the transaction can execute; a participant replies yes (local execution succeeded) or cancel (local execution failed). In the second phase, the coordinator decides based on the first-phase votes: it commits if and only if all participants agreed, and otherwise rolls back. The biggest problems with 2PC are that the coordinator is a single point (so it needs a backup node) and that the protocol is blocking: if any one participant fails, everyone waits (this can be mitigated with a timeout mechanism). The Paxos protocol, by contrast, resolves consistency among multiple replicas, for example keeping each node's log consistent through log synchronization, or guaranteeing a unique, agreed-upon winner in leader election when the master fails. In short, 2PC guarantees the atomicity of a transaction across multiple data shards, while Paxos guarantees the consistency of one data shard across multiple replicas, so the two are complementary rather than alternatives. The 2PC coordinator single-point problem can itself be solved with the Paxos protocol: when the coordinator fails, a new coordinator is elected to continue providing service. In engineering practice, Google Spanner and Google Chubby both use Paxos to implement multi-replica log synchronization. A 2PC sketch follows.
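To make the contrast concrete, here is a minimal sketch of 2PC; the `Participant` class is a trivial stand-in for a node running its local transaction branch:

```python
class Participant:
    """Trivial stand-in: a real participant executes its local branch."""
    def __init__(self, ok: bool = True):
        self.ok = ok
    def prepare(self) -> bool:
        return self.ok              # vote yes iff local execution succeeded
    def commit(self) -> None:
        pass                        # make the local branch durable
    def rollback(self) -> None:
        pass                        # undo the local branch

def two_phase_commit(participants) -> bool:
    # Phase 1 (voting): ask every participant whether it can commit.
    votes = [p.prepare() for p in participants]
    decision = all(votes)           # commit iff *all* participants vote yes
    # Phase 2 (decision): broadcast the outcome. Any failure here blocks,
    # which is why the single-point coordinator itself can be replicated
    # with Paxos.
    for p in participants:
        p.commit() if decision else p.rollback()
    return decision

# two_phase_commit([Participant(), Participant(ok=False)]) -> False (rollback)
```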
3. How do I apply Paxos to a traditional database replication protocol?
The replication protocol is effectively a custom application of Paxos: a series of log entries each reach a majority through voting, and reaching a majority is equivalent to the log entry having been successfully persisted on a majority of replicas. Replicas stay in sync with the master node by applying the persisted logs. Because of the ACID properties of a database, each transaction moves the database from one consistent state to another; every transactional operation changes the database state and produces a log entry. Since client operations are ordered, the order of the logs must be preserved: on every replica we must guarantee not only that all logs are persisted, but also that they are applied in order. Each log entry is tagged with a logId, and logIds increase strictly (marking the order); the leader has each log entry voted on until it reaches a majority. If the leader switches midway, the new leader must re-initiate voting for any "holes" in its log to confirm the validity of those entries.
4. Can a non-leader node in Multi-Paxos provide service?
In the Multi-Paxos protocol, only the leader is guaranteed to contain all of the chosen logs; even so, its locally persisted entries have not necessarily reached a majority, so for any entry without a confirm log another round of voting is required before the latest result is returned to the client. A non-leader node does not necessarily hold all of the latest data and would need to confirm it through the leader, so in typical engineering implementations all read and write services are provided by the leader.
5. How is a client request handled when the leader fails mid-request?
The client sends a request to the leader, and the leader crashes before replying. From the client's point of view, the operation may or may not have succeeded, so the client needs to check the result of the operation and decide whether to retry. If the leader persisted the entry locally but crashed before it reached a majority, the new leader first obtains the maximum logId from the replicas as the recovery end point and re-runs Paxos for every local entry without a confirm log; if the entry reaches a majority at that point, the operation is applied, otherwise it is not. When the client checks, it learns whether its operation succeeded. Of course, in concrete engineering practice this also involves the client's timeout setting as well as the election and log-recovery times.