1. Overview
The Paxos algorithm for implementing a fault-tolerant distributed system has a reputation for being obscure and hard to understand, perhaps because its original presentation was "Greek" to many readers. In fact, it is among the most straightforward of distributed algorithms. At its heart, Paxos is a consensus algorithm, also known as the "synod" algorithm. (Translator's note: I do not agree with the common domestic practice of translating "consensus algorithm" as "consistency algorithm". The English word for that concept is "consistency", whereas "consensus" expresses the idea of reaching agreement.)
The Paxos consensus algorithm follows almost unavoidably from a few basic conditions that it must satisfy, and the next section describes those conditions. The final section of this article discusses the complete Paxos algorithm, which is obtained by applying consensus to the state machine approach for building distributed systems, a method that is well known because it appears so often in articles on distributed systems theory.
2. Consensus algorithm
2.1 Origin of the problem
Assume a collection of processes, each of which can propose a value. A consensus algorithm ensures that a single one of the proposed values is chosen. If no value is proposed, then naturally no value is chosen. Once a value has been chosen, the processes should be able to learn it. The safety requirements for consensus are:
- Only a value that has been proposed may be chosen
- Only a single value is chosen
- A process never learns that a value has been chosen unless it actually has been chosen
We do not specify the liveness requirements precisely here. The general goal is to ensure that some proposed value is eventually chosen and that, once a value has been chosen, a process can eventually learn it. The consensus algorithm involves three agent roles: proposers, acceptors, and learners. In an implementation, a single process may play more than one role, but this article does not discuss how roles are assigned to processes.
Now assume that agents communicate with one another by sending messages. We use the customary asynchronous, non-Byzantine model, in which:
- Agents operate at arbitrary speed, may fail by stopping, and may restart. Since all agents may fail after a value is chosen and then restart, a solution is impossible unless a failed and restarted agent can remember some information.
- Messages can take arbitrarily long to be delivered and can be duplicated or lost, but they are never corrupted.
2.2 Choosing a value
The easiest way to choose a value is to have a single acceptor. A proposer sends a proposal to the acceptor, which chooses the first proposed value it receives. This approach is simple but unsatisfactory, because the failure of that one acceptor makes any further progress impossible: it is a single point of failure.
So we try another way of choosing a value. This time we use multiple acceptors, and a proposer sends a proposed value to a set of acceptors. An acceptor may accept the proposed value, or it may not. The value is chosen only when a large enough set of acceptors has accepted it. How large is large enough? To ensure that only a single value is chosen, we let a large enough set be any majority of the acceptors. Because any two majorities have at least one acceptor in common, this works if an acceptor can accept at most one value.
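The claim that any two majorities share at least one acceptor can be checked by brute force for a small cluster. The following is only an illustrative sketch; the acceptor names and the helper functions `majorities` and `all_majorities_intersect` are made up for this example and do not come from the algorithm itself.

```python
from itertools import combinations

def majorities(acceptors):
    """Yield every subset that contains more than half of the acceptors."""
    quorum = len(acceptors) // 2 + 1
    for size in range(quorum, len(acceptors) + 1):
        for subset in combinations(acceptors, size):
            yield set(subset)

def all_majorities_intersect(acceptors):
    """Return True if every pair of majorities shares at least one acceptor."""
    groups = list(majorities(acceptors))
    # An empty intersection is falsy, so all() fails if any pair is disjoint.
    return all(a & b for a, b in combinations(groups, 2))

# Hypothetical five-acceptor cluster.
acceptors = ["a1", "a2", "a3", "a4", "a5"]
print(all_majorities_intersect(acceptors))
```

The same check fails if "large enough" is weakened to exactly half of an even-sized cluster, which is why a strict majority is required.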
In the absence of failure or message loss, we want a value to be chosen even if only one value is proposed by a single proposer. This suggests the following requirement:
P1: An acceptor must accept the first proposal that it receives.
But this requirement raises a problem. Several values could be proposed by different proposers at about the same time, leading to a situation in which every acceptor has accepted a value, but no single value has been accepted by a majority of them. Even with just two proposed values, if each is accepted by about half of the acceptors, the failure of a single acceptor could make it impossible to learn which of the values was chosen.
P1, together with the requirement that a value is chosen only when it has been accepted by a majority of acceptors, implies that an acceptor must be allowed to accept more than one proposal. To keep track of the different proposals that an acceptor may accept, Paxos assigns a (natural) number to each proposal, so a proposal consists of a proposal number and a value. To prevent confusion, Paxos requires that different proposals have different numbers. How this is achieved depends on the implementation, so for now we simply assume it. A value is chosen when a single proposal with that value has been accepted by a majority of the acceptors. In that case, we say that the proposal (as well as its value) has been chosen.
Paxos allows multiple proposals to be chosen, provided that all chosen proposals have the same value. By induction on the proposal number, it suffices to guarantee:
P2: If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v.
Since proposal numbers are totally ordered, condition P2 guarantees the crucial safety property that only a single value is chosen.
To be chosen, a proposal must be accepted by at least one acceptor. So Paxos can satisfy P2 by satisfying:
P2a: If a proposal with value v is chosen, then every higher-numbered proposal accepted by any acceptor has value v.
Paxos still has to maintain P1 to ensure that some proposal is chosen. Because communication is asynchronous, a proposal could be chosen while some particular acceptor c has never received any proposal. Suppose a new proposer then wakes up and issues a higher-numbered proposal with a different value. P1 requires c to accept this proposal, which violates P2a. What do we do now? Maintaining both P1 and P2a requires strengthening P2a to:
P2b: If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v.
A proposal must be issued by a proposer before an acceptor can accept it, so P2b implies P2a, which in turn implies P2. Now let us see how to satisfy P2b. Assume that some proposal with number m and value v is chosen; we need to show that any proposal issued with number n > m also has value v. The proof is easier by induction on n. That is, we may assume that every proposal issued with a number in [m, n-1] has value v, and prove under that assumption that the proposal numbered n also has value v. For the proposal numbered m to be chosen, there must be some set C consisting of a majority of acceptors such that every acceptor in C accepted it. Combining this with the induction assumption, the hypothesis that m is chosen implies:
- Every acceptor in C has accepted a proposal with a number in [m, n-1], and every proposal with a number in [m, n-1] that has been accepted by any acceptor has value v.
Since any set S consisting of a majority of acceptors contains at least one member of C, we can conclude that the proposal numbered n has value v by ensuring that the following condition holds:
P2c: For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either (a) no acceptor in S has accepted any proposal numbered less than n, or (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S.
Therefore, if P2c is satisfied, P2b is satisfied as well. To maintain P2c, a proposer that wants to issue a proposal numbered n must learn the highest-numbered proposal with number less than n, if any, that has been or will be accepted by each acceptor in some majority of acceptors. Learning about proposals that have already been accepted is easy; predicting future acceptances is much harder. So instead of trying to predict the future, the proposer controls it by extracting a promise that there will be no such acceptances. In other words, the proposer asks the acceptors not to accept any more proposals numbered less than n. This leads to the following algorithm for issuing proposals:
1. A proposer chooses a new proposal number n and sends a request to each member of some set of acceptors, asking it to respond with:
(a) a promise never again to accept a proposal numbered less than n, and
(b) the proposal with the highest number less than n that it has accepted, if any.
Paxos calls such a request a prepare request with number n.
2. If the proposer receives the requested responses from a majority of the acceptors, it can issue a proposal with number n and value v, where v is the value of the highest-numbered proposal among the responses. If the responders reported no proposals, the proposer may use any value. The proposer then issues its proposal by sending, to some set of acceptors, a request that the proposal be accepted. (This set need not be the same set of acceptors that responded to the initial prepare request.) We call this an accept request.
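The rule for picking the value in step 2 can be written down in a few lines. This is a minimal sketch rather than a full proposer: the `Proposal` type, the `choose_value` function, and the convention that an acceptor with nothing to report responds with `None` are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional, Sequence

class Proposal(NamedTuple):
    number: int
    value: str

def choose_value(responses: Sequence[Optional[Proposal]],
                 num_acceptors: int,
                 own_value: str) -> Optional[str]:
    """Pick the value for the accept request, or None if no majority responded.

    Each entry of `responses` is the highest-numbered proposal that one
    acceptor reports having accepted, or None if it has accepted nothing.
    """
    if len(responses) <= num_acceptors // 2:
        return None  # not a majority: the proposer must not proceed
    reported = [p for p in responses if p is not None]
    if not reported:
        return own_value  # no constraint: any value may be proposed
    # Otherwise the proposer is bound to the highest-numbered reported value.
    return max(reported, key=lambda p: p.number).value
```

For example, with five acceptors and responses `[Proposal(2, "a"), None, Proposal(5, "b")]`, the proposer has to propose "b", whatever value it originally wanted.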
That is the proposer's algorithm. What about the acceptor? It can receive two kinds of requests from proposers: prepare requests and accept requests. An acceptor can ignore any request without compromising safety, so we only need to say when it is allowed to respond to one. It can always respond to a prepare request. It can respond to an accept request, accepting the proposal, if and only if it has not promised not to. In other words:
P1a: An acceptor can accept a proposal numbered n if and only if it has not responded to a prepare request with a number greater than n.
Observe that P1a subsumes P1. We now have a complete algorithm for choosing a value that satisfies the required safety properties, assuming that proposal numbers are unique. The final algorithm is obtained by making one small optimization.
Suppose an acceptor receives a prepare request numbered n, but it has already responded to a prepare request with a number greater than n, thereby promising not to accept any new proposal numbered n. There is then no reason for the acceptor to respond to the new prepare request, since it will not accept the proposal numbered n that the proposer wants to issue. So we have the acceptor ignore such a prepare request. We also have it ignore a prepare request for a proposal that it has already accepted. This is the optimization mentioned in the previous paragraph.
With this optimization, an acceptor needs to remember only the highest-numbered proposal that it has ever accepted and the number of the highest-numbered prepare request to which it has responded. Because P2c must hold regardless of failures, an acceptor must remember this information even if it fails and then restarts. Note that a proposer can always abandon a proposal and forget all about it, as long as it never tries to issue another proposal with the same number.
Putting the actions of the proposer and the acceptor together, we see that the algorithm operates in the following two phases:
Phase 1
(a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
(b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.
Phase 2
(a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
(b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request with a number greater than n.
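The acceptor's half of the two phases, steps 1(b) and 2(b), might be sketched as follows. This is an in-memory illustration only: the names `Acceptor`, `Promise`, `on_prepare` and `on_accept` are hypothetical, state is not written to stable storage as a real acceptor would need, and treating an accepted proposal as an implicit promise is a conservative choice of this sketch, not something the text requires.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Promise:
    number: int                          # the prepare number being answered
    accepted: Optional[Tuple[int, str]]  # highest-numbered accepted proposal

class Acceptor:
    """Hypothetical in-memory acceptor; a real one must persist its state."""

    def __init__(self) -> None:
        self.promised = -1  # highest prepare number responded to so far
        self.accepted: Optional[Tuple[int, str]] = None  # (number, value)

    def on_prepare(self, n: int) -> Optional[Promise]:
        """Phase 1(b): return a promise, or None to ignore the request."""
        if n <= self.promised:
            return None
        self.promised = n
        return Promise(n, self.accepted)

    def on_accept(self, n: int, value: str) -> bool:
        """Phase 2(b): accept unless a higher-numbered prepare was answered."""
        if n < self.promised:
            return False
        # Assumption of this sketch: accepting n also counts as promising n,
        # so a later, lower-numbered request cannot overwrite this proposal.
        self.promised = n
        self.accepted = (n, value)
        return True
```

Returning `None` or `False` models ignoring a request. As the text notes, an acceptor may ignore any request without compromising safety, so erring on the side of refusal is always allowed.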
A proposer can make multiple proposals, as long as it follows the algorithm for each one. It can also abandon a proposal in the middle of the protocol at any time. (Correctness is maintained even though requests and responses for the proposal may arrive at their destinations long after the proposal was abandoned.) It is probably a good idea to abandon a proposal if some proposer has begun trying to issue a higher-numbered one. Therefore, if an acceptor ignores a prepare or accept request because it has already received a prepare request with a higher number, it should inform the proposer, which should then abandon its proposal. This is a performance optimization that does not affect correctness.
2.3 Learning a chosen value
To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors. The most obvious approach is to have each acceptor, whenever it accepts a proposal, respond to all learners and send them the proposal. This lets learners find out about a chosen value as soon as possible, but it requires each acceptor to respond to each learner, so the number of responses equals the product of the number of acceptors and the number of learners.
The assumption of non-Byzantine failures makes it easy for one learner to find out from another learner that a value has been accepted. We can have the acceptors report their acceptances to a distinguished learner, which in turn informs the other learners when a value has been chosen. This approach requires an extra round of communication for all the learners to discover the chosen value. It is also less reliable, since the distinguished learner could fail. However, it requires a number of responses equal only to the sum of the number of acceptors and the number of learners.
More generally, the acceptors could report their acceptances to some set of distinguished learners, each of which then informs all the learners when a value has been chosen. A larger set of distinguished learners provides greater reliability at the cost of greater communication complexity.
Because messages can be lost, a value could be chosen with no learner ever finding out. A learner could ask the acceptors which proposals they have accepted, but the failure of an acceptor could make it impossible to know whether a majority had accepted a particular proposal. In that case, learners will find out which value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.
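A learner's bookkeeping can be illustrated with a small sketch. It assumes that each acceptor reports every proposal it accepts together with its own identifier; the `Learner` class and the `on_accepted` method are hypothetical names. Duplicate reports are harmless because the acceptor identifiers are kept in a set.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple

class Learner:
    """Hypothetical learner that counts acceptances per proposal."""

    def __init__(self, num_acceptors: int) -> None:
        self.quorum = num_acceptors // 2 + 1
        # (proposal number, value) -> ids of acceptors that reported it
        self.votes: Dict[Tuple[int, str], Set[str]] = defaultdict(set)
        self.chosen: Optional[str] = None

    def on_accepted(self, acceptor_id: str, number: int, value: str) -> Optional[str]:
        """Record one acceptance and return the chosen value, if known."""
        self.votes[(number, value)].add(acceptor_id)  # a set absorbs duplicates
        if len(self.votes[(number, value)]) >= self.quorum:
            self.chosen = value
        return self.chosen
```

With three acceptors, a value becomes known as chosen once two distinct acceptors have reported accepting the same proposal. Receiving the same report twice from one acceptor does not count as a second vote.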
2.4 Progress
It is easy to construct a scenario in which two proposers keep issuing a sequence of proposals with increasing numbers, none of which is ever chosen. Proposer p completes phase 1 for a proposal numbered n1. Another proposer q then completes phase 1 for a proposal numbered n2 > n1. Proposer p's phase 2 accept requests for the proposal numbered n1 are ignored, because the acceptors have all promised not to accept any new proposal numbered less than n2. So proposer p begins and completes phase 1 for a new proposal numbered n3 > n2, causing the phase 2 accept requests of proposer q to be ignored in turn. This "tragedy" can go on forever, and the consensus algorithm makes no progress.
To guarantee progress, Paxos selects a distinguished proposer as the only one that tries to issue proposals. If the distinguished proposer can communicate successfully with a majority of acceptors, and if it uses a proposal with a number greater than any already used, then it will succeed in issuing a proposal that is accepted. By abandoning a proposal and trying again whenever it learns of a request with a higher proposal number, the distinguished proposer will eventually choose a number that is large enough. If enough of the system (proposer, acceptors, and communication network) is working properly, liveness can be achieved by electing a single distinguished proposer. (A liveness problem can be understood roughly as a situation in which something, such as deadlock or starvation, prevents the system from making progress.) Results in the literature imply that a reliable algorithm for electing a proposer must use either randomness or real time, for example by using timeouts. But whether or not the election succeeds, safety is guaranteed.
2.5 Implementation
The Paxos algorithm assumes a network of processes. In its consensus algorithm, each process plays the roles of proposer, acceptor, and learner. The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner. The Paxos consensus algorithm is exactly the one described above, with requests and responses sent as ordinary messages. (Response messages are tagged with the corresponding proposal number to prevent confusion.) Stable storage, which is preserved across failures, holds the information that an acceptor must remember. An acceptor records its intended response in stable storage before actually sending the response.
All that remains is to describe the mechanism that guarantees that no two proposals are ever issued with the same number. Different proposers choose their numbers from disjoint sets of numbers, so two different proposers never issue a proposal with the same number. Each proposer remembers, in stable storage, the highest-numbered proposal it has tried to issue, and begins phase 1 with a proposal number higher than any it has already used.
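The text does not say how the disjoint sets of numbers are built. One common scheme, shown here purely as an assumption, gives process i out of N processes the numbers i, i + N, i + 2N, and so on. The function name `next_proposal_number` and its parameters are hypothetical; `floor` stands for the highest number the proposer has used or heard about, which it would read from stable storage.

```python
def next_proposal_number(process_id: int, num_processes: int, floor: int) -> int:
    """Return the smallest number owned by `process_id` that exceeds `floor`.

    Process i owns the numbers i, i + N, i + 2N, ... (N = num_processes),
    so the sets owned by different processes never intersect.
    """
    # Number of whole "rounds" of size N to skip so the result exceeds floor.
    rounds = (floor - process_id) // num_processes + 1
    return process_id + max(rounds, 0) * num_processes
```

For example, with three processes, process 1 owns 1, 4, 7, and so on. If the highest number it has seen is 4, its next proposal number is 7. The new number would be saved to stable storage before phase 1 begins.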
3. Implement a state machine
A simple way to implement a distributed system is as a collection of clients that issue commands to a central server. The server can be described as a deterministic state machine that performs client commands in some sequence. The state machine has a current state; it performs a step by taking a command as input, producing an output, and moving to a new state. For example, the clients of a distributed banking system might be tellers, and the state machine's state might consist of the account balances of all users. A withdrawal would be performed by executing a state machine command that decreases an account's balance by the withdrawal amount and produces the old and new balances as output.
An implementation that uses a single central server fails if that server fails. We therefore use a collection of servers, each of which independently implements the state machine. Because the state machine is deterministic, all servers produce the same sequence of states and outputs if they all execute the same sequence of commands. A client issuing a command can then use the output generated for it by any server.
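The banking example can be made concrete with a toy state machine. This is a sketch under simplifying assumptions: the `BankStateMachine` class, the command tuple format, and the "deposit" operation are invented for illustration, and there is no overdraft checking. It shows that replicas applying the same command sequence end in the same state with the same outputs.

```python
from typing import Dict, List, Tuple

Command = Tuple[str, str, int]  # (operation, account, amount)

class BankStateMachine:
    """Deterministic toy state machine: the state is the account balances."""

    def __init__(self, balances: Dict[str, int]) -> None:
        self.balances = dict(balances)

    def apply(self, command: Command) -> Tuple[int, int]:
        """Execute one command and return (balance before, balance after)."""
        op, account, amount = command
        before = self.balances[account]
        if op == "withdraw":
            self.balances[account] = before - amount
        elif op == "deposit":
            self.balances[account] = before + amount
        else:
            raise ValueError(f"unknown operation: {op}")
        return before, self.balances[account]

# Three replicas start from the same state and apply the same sequence.
commands: List[Command] = [("deposit", "alice", 50), ("withdraw", "alice", 30)]
replicas = [BankStateMachine({"alice": 100}) for _ in range(3)]
outputs = [[replica.apply(c) for c in commands] for replica in replicas]
```

Every replica produces the outputs `[(100, 150), (150, 120)]`, so a client can use the answer from whichever server replies.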
To guarantee that all servers execute the same sequence of state machine commands, we implement a sequence of separate instances of the Paxos consensus algorithm, where the value chosen by the i-th instance is the i-th state machine command in the sequence. Each server plays all the roles (proposer, acceptor, and learner) in each instance of the algorithm. For now, assume that the set of servers is fixed, so all instances of the consensus algorithm use the same set of agents.
In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm. Clients send commands to the leader, which decides where in the sequence each command should appear. If the leader decides that a certain client command should be the 135th command, it tries to have that command chosen as the value of the 135th instance of the consensus algorithm. It will usually succeed. It might fail because of failures, or because another server also believes itself to be the leader and has a different idea of what the 135th command should be. But the consensus algorithm ensures that at most one command can be chosen as the 135th one.
Key to the efficiency of this approach is that, in the Paxos consensus algorithm, the value to be proposed is not chosen until phase 2. Recall that after the proposer completes phase 1, either (1) the value to be proposed is determined, or (2) the proposer is free to propose any value.
We first describe how the Paxos state machine implementation works during normal operation, and later discuss what can go wrong. Consider what happens when the previous leader has just failed and a new leader has been selected. (System startup is a special case in which no commands have yet been proposed.)
The new leader, being a learner in all instances of the consensus algorithm, should know most of the commands that have already been chosen. Suppose it knows commands 1-134, 138, and 139, that is, the values chosen in instances 1-134, 138, and 139 of the consensus algorithm. (We will see later how such a gap in the command sequence can arise.) It then executes phase 1 of instances 135-137 and of all instances greater than 139. (How it does this is described below.) Suppose that the outcome of these executions determines the value to be proposed in instances 135 and 140, but leaves the proposed value unconstrained in all other instances. The leader then executes phase 2 for instances 135 and 140, thereby choosing commands 135 and 140.
The leader, as well as any other server that learns all the commands the leader knows, can now execute commands 1-135. However, it cannot execute commands 138-140, which it also knows, because commands 136 and 137 have yet to be chosen. The leader could take the next two commands requested by clients to be commands 136 and 137. Instead, we let it fill the gap immediately by proposing, as commands 136 and 137, a special "no-op" command that leaves the state unchanged. (It does this by executing phase 2 of instances 136 and 137.) Once these no-op commands have been chosen, commands 138-140 can be executed.
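The gap-filling step in this scenario can be traced with a small sketch. The helper names `find_gaps` and `executable_prefix` and the `NO_OP` marker are hypothetical, and writing a no-op straight into the dictionary merely stands in for running phase 2 of the corresponding instance.

```python
from typing import Dict, List

NO_OP = "no-op"

def find_gaps(chosen: Dict[int, str]) -> List[int]:
    """Return instance numbers below the highest chosen one with no command."""
    if not chosen:
        return []
    return [i for i in range(1, max(chosen)) if i not in chosen]

def executable_prefix(chosen: Dict[int, str]) -> int:
    """Return the largest k such that commands 1..k have all been chosen."""
    k = 0
    while k + 1 in chosen:
        k += 1
    return k

# State after the new leader has chosen commands 135 and 140.
chosen = {i: f"cmd{i}" for i in list(range(1, 136)) + [138, 139, 140]}
gaps = find_gaps(chosen)            # instances 136 and 137
before = executable_prefix(chosen)  # commands 1-135 can be executed
for i in gaps:
    chosen[i] = NO_OP               # stands in for phase 2 with a no-op
after = executable_prefix(chosen)   # commands 1-140 can be executed
```

Before the gap is filled, only commands 1-135 can be executed. Once the two no-ops are in place, the executable prefix extends through 140.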
Commands 1-140 have now been chosen. The leader has also completed phase 1 for all instances greater than 140, and it is free to propose any value in phase 2 of those instances. It assigns command number 141 to the next command requested by a client, proposing it as the value in phase 2 of instance 141. It proposes the next client command it receives as command 142, and so on.
The leader can propose command 142 before it learns that its proposed command 141 has been chosen. It is possible for all the messages it sent in proposing command 141 to be lost, and for command 142 to be chosen before any other server has learned what the leader proposed as command 141. When the leader fails to receive the expected response to its phase 2 messages in instance 141, it retransmits those messages. If all goes well, its proposed command will be chosen. However, the leader could fail first, leaving a gap in the sequence of chosen commands. In general, suppose a leader can get α commands ahead, that is, it can propose commands i + 1 through i + α after commands 1 through i have been chosen. A gap of up to α - 1 commands could then arise.
A newly chosen leader executes phase 1 for infinitely many instances of the consensus algorithm: in the scenario above, for instances 135-137 and all instances greater than 139. Using the same proposal number for all instances, it can do this by sending a single, reasonably short message to the other servers. In phase 1, an acceptor responds with more than a simple acknowledgment only if it has already received a phase 2 message from some proposer. (In the scenario, this was the case only for instances 135 and 140.) Thus a server, acting as acceptor, can respond for all instances with a single, reasonably short message. Executing these infinitely many instances of phase 1 therefore poses no problem.
Since failure of the leader and election of a new one should be rare events, the effective cost of executing a state machine command, that is, of reaching consensus on the command or value, is the cost of executing only phase 2 of the consensus algorithm. It can be shown that phase 2 of the Paxos consensus algorithm has the minimum possible cost of any algorithm for reaching agreement in the presence of faults. Paxos is therefore essentially optimal.
The discussion of normal operation assumes that there is always a single leader, except for a brief period between the failure of the current leader and the election of a new one. In abnormal circumstances, the leader election might fail. If no server is acting as leader, then no new commands will be proposed. If multiple servers think they are leaders, they can all propose values in the same instance of the consensus algorithm, which could prevent any value from being chosen. As noted earlier, however, safety is preserved: two different servers will never disagree on the value chosen as the i-th state machine command. Election of a single leader is needed only to ensure progress.
If the set of servers can change, there must be some way of determining which servers implement which instances of the consensus algorithm. The easiest way to do this is through the state machine itself. The current set of servers can be made part of the state and changed with ordinary state machine commands. We can allow a leader to get α commands ahead by letting the set of servers that execute instance i + α of the consensus algorithm be specified by the state after execution of the i-th state machine command. This permits a simple implementation of an arbitrarily sophisticated reconfiguration algorithm.