This article is based on the paper: Paxos Made Practical
Paxos has three steps in its implementation (a sketch of the acceptor's side follows the list):
1) The proposer S1 selects a proposal number n; the number should embed the unique identifier of the proposer machine, so that two different machines can never pick the same proposal number. The proposer then broadcasts prepare(n). A machine receiving this message either rejects it (it has already received a prepare with a number greater than n), replies prepare-result(n', v') (it has already accepted a proposal whose highest number is n' < n and whose value is v'), or replies prepare-result(0, nil) (it had not received any proposal before this message). If a majority accepts the prepare message, the proposer moves on to the second step.
2) If some reply is prepare-result(n', v'), S1 sets the proposal's value to v' (taking the v' that comes with the largest n' if there is more than one); if every reply is prepare-result(0, nil), the value can be chosen arbitrarily. The proposer S1 then broadcasts propose(n, v). As in the first step, a machine receiving this message either rejects it (it has received a prepare(n') with n' > n) or accepts it.
3) If a majority (the proposer included) accepts the propose message, the proposer S1 broadcasts decide(n, v) to announce that the group has agreed on the proposal.
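To make the two message-handling rules concrete, here is a minimal sketch of the acceptor's side in C. All of the names and types below (acceptor_t, prepare_result_t, the fixed-size value buffer, the way make_proposal_number embeds the machine id) are assumptions made for this sketch; only the accept/reject rules themselves come from the steps above.

/* Sketch of an acceptor for the two Paxos phases described above.
 * Types, field names, and the value encoding are illustrative only. */
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t promised_n;        /* highest n seen in any prepare(n) */
    uint64_t accepted_n;        /* highest n accepted in a propose, 0 if none */
    char     accepted_value[64];
} acceptor_t;

typedef struct {
    int      ok;                /* 0 = reject */
    uint64_t n;                 /* previously accepted n', or 0 if none */
    char     value[64];         /* previously accepted v', valid when n > 0 */
} prepare_result_t;

/* Proposal numbers embed the proposer's machine id in the low bits, so
 * two machines can never pick the same number (assumed encoding). */
uint64_t make_proposal_number(uint64_t round, uint16_t machine_id)
{
    return (round << 16) | machine_id;
}

/* Step 1, acceptor side: reject if a prepare with a greater number has
 * been seen, otherwise promise n and report any accepted (n', v'). */
prepare_result_t on_prepare(acceptor_t *a, uint64_t n)
{
    prepare_result_t r;
    memset(&r, 0, sizeof r);
    if (n < a->promised_n)
        return r;                              /* reject: saw a greater prepare */
    a->promised_n = n;
    r.ok = 1;
    r.n = a->accepted_n;                       /* 0 means prepare-result(0, nil) */
    memcpy(r.value, a->accepted_value, sizeof r.value);
    return r;
}

/* Step 2, acceptor side: accept propose(n, v) unless a prepare with a
 * greater number arrived in the meantime. */
int on_propose(acceptor_t *a, uint64_t n, const char *v)
{
    if (n < a->promised_n)
        return 0;                              /* reject */
    a->promised_n = n;
    a->accepted_n = n;
    strncpy(a->accepted_value, v, sizeof a->accepted_value - 1);
    a->accepted_value[sizeof a->accepted_value - 1] = '\0';
    return 1;                                  /* accept */
}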
In fact there is a problem: in a fault-tolerant system, new machines are brought in to replace old ones. So does the majority mentioned above refer to the old set of machines, the new set, or both? How is the new set guaranteed to be safe? And what should be done if a machine fails and a new machine has not received the decide message?
One of the papers that introduces the use of the Paxos algorithm is Viewstamped Replication, but it has two problems: 1) it is difficult to understand how to use it to replicate a simple system; 2) it assumes that the set of machines is fixed and does not consider machines dynamically joining and leaving. The rest of this article explains how the paper uses Paxos and addresses the limitations of Viewstamped Replication.
First we look at the state machine. It is deterministic: if two identical state machines start from the same initial state and receive the same sequence of requests, they produce the same replies. A replication system provides two libraries, one for the server and one for the client. The server side offers three functions:
id_t newgroup (char *path);
id_t joingroup (id_t group, char *path);
int run (char *path, sockaddr *other_cohort, buf (*execute) (buf));
The newgroup function creates a state machine (a new group). To join an existing group, call joingroup. The third parameter of run, execute, is the implementation of the state machine.
On the client side there is a function invoke that corresponds to execute:
buf invoke (id_t group, sockaddr *cohort, buf request);
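To make the division of labor concrete, here is a hedged sketch of how an application might use the two libraries. The header name replica.h, the group path, the buf layout, and the trivial echo state machine are all hypothetical; only newgroup, joingroup, run, and invoke come from the signatures above.

/* Sketch of a replicated echo service built on the library above.
 * "replica.h" is a hypothetical header assumed to declare id_t, buf,
 * newgroup, joingroup, run, and invoke; everything else here is made
 * up for illustration. */
#include <string.h>
#include <sys/socket.h>
#include "replica.h"

/* The state machine itself: deterministic, request in, reply out. */
static buf echo_execute(buf request)
{
    return request;                     /* trivially deterministic */
}

/* Server side: the first cohort creates the group, later cohorts join
 * an existing one, and every cohort then runs the state machine. */
static void start_cohort(id_t existing, struct sockaddr *other_cohort)
{
    id_t group = existing ? joingroup(existing, "/var/echo-group")
                          : newgroup("/var/echo-group");
    (void)group;
    run("/var/echo-group", other_cohort, echo_execute);   /* serves requests */
}

/* Client side: invoke submits a request to the group and returns the
 * reply that execute produced on the replicas. */
static buf client_call(id_t group, struct sockaddr *cohort, char *msg)
{
    buf request;
    request.base = msg;                 /* assumed buf layout: pointer + length */
    request.len  = strlen(msg);
    return invoke(group, cohort, request);
}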
execute is generally implemented with RPC. Unfortunately, not every RPC server is deterministic. For example, when a file server receives a write request, it sets the file node's modification time; so even if two machines run the same file server code and perform the same write request, they end up with different time values for the same write, and their states diverge. We can solve this problem by letting one machine pick the nondeterministic time value and having the other machines execute the request using that machine's choice. To do this we introduce a new function, choose:
buf choose (buf request);
It works together with a new form of the execute function:
buf execute (buf request, buf extra);
In the file service example, one machine first runs choose, which picks a modification time, puts it into a buffer, and returns that buffer; execute is then run with the buffer as its second argument extra (that is, with the chosen modification time). In other words, for each request, choose is first called on one machine to pick any values that would otherwise cause divergence and return them as its result; that result is then broadcast to the other machines and passed to execute as its extra argument, so all replicas stay consistent.
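Here is a sketch, under stated assumptions, of what the choose/execute pair could look like for the file-write example: the buf layout, the 8-byte timestamp encoding in extra, and the apply_write helper are inventions for illustration, and only the division of labor between choose and execute comes from the text.

/* Sketch of choose/execute for the file-write example. The buf type,
 * the timestamp encoding, and apply_write are assumptions. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef struct { char *base; size_t len; } buf;   /* assumed buffer type */

/* Hypothetical helper: perform the write and stamp the file's mtime. */
static void apply_write(buf request, int64_t mtime)
{
    (void)request; (void)mtime;   /* write the data and set mtime (elided) */
}

/* choose runs on one machine only: it picks every value that could
 * otherwise diverge (here, the modification time) and returns it. */
buf choose(buf request)
{
    (void)request;
    buf extra;
    extra.len  = sizeof(int64_t);
    extra.base = malloc(extra.len);
    int64_t mtime = (int64_t)time(NULL);    /* the nondeterministic value */
    memcpy(extra.base, &mtime, sizeof mtime);
    return extra;                           /* broadcast to all replicas */
}

/* execute runs on every machine with the same request and the same
 * extra, so every replica records the identical modification time. */
buf execute(buf request, buf extra)
{
    int64_t mtime = 0;
    if (extra.len >= sizeof mtime)
        memcpy(&mtime, extra.base, sizeof mtime);
    apply_write(request, mtime);            /* deterministic given extra */
    buf reply = { NULL, 0 };                /* assumed empty success reply */
    return reply;
}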
►normal-case operation
One machine in the server group is the primary, while the others are backups; the term view denotes a set of active cohorts with a designated primary (a cohort is one machine in the group). The system assigns a unique view-id to each view, and the view-id increases whenever a view change occurs (that is, whenever a new view is formed). First, the overall flow:
The client sends a request to the primary; the primary logs the request and broadcasts it to the backups; each backup records the operation op and returns an acknowledgment to the primary. Once the primary knows that a majority (itself included) has recorded op, it executes op and returns the result to the client. A more detailed discussion follows:
Before sending a request, the client must know the server group's current primary and view-id; obtaining this information requires the client to know the identity of every cohort in the group, which is not discussed here. Here is the client's request to the primary:
Note the vid field: it is the view-id of the current cohort set, and it exists so that a new primary will not re-execute the request after a view change occurs. After the primary receives a request, it appends a timestamp to it, and the requests within a view are executed sequentially in timestamp order. Combining the view-id and the timestamp into a viewstamp therefore gives an execution order for all requests, whether or not they belong to the same view. In the replicate_arg message that the primary sends to the backups, this viewstamp identifies the request's position in the execution order; the message also carries a viewstamp called committed, smaller than viewstamp, marking the requests that have already been executed and whose replies have been returned to the clients. Thus, if the reply to the client is lost or the primary suddenly fails, execution only needs to resume from committed onward.
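Here is a sketch of how a viewstamp and the replicate_arg message might be laid out. The struct fields are guesses made for illustration; only the concepts (view-id, per-view timestamp, committed) come from the description above.

/* Sketch of a viewstamp and the primary-to-backup replicate message.
 * Field layouts are assumptions, not the paper's wire format. */
#include <stdint.h>

typedef struct {
    uint64_t view_id;   /* increases on every view change */
    uint64_t ts;        /* timestamp assigned by the primary within the view */
} viewstamp_t;

/* Viewstamps order requests across views: compare the view-id first,
 * then the timestamp within the view. */
int viewstamp_cmp(viewstamp_t a, viewstamp_t b)
{
    if (a.view_id != b.view_id)
        return a.view_id < b.view_id ? -1 : 1;
    if (a.ts != b.ts)
        return a.ts < b.ts ? -1 : 1;
    return 0;
}

/* What the primary sends to the backups for each request: the request's
 * own viewstamp plus the committed viewstamp, i.e. the point up to which
 * requests have been executed and answered to clients. */
typedef struct {
    viewstamp_t vs;          /* execution position of this request */
    viewstamp_t committed;   /* everything up to here has been executed */
    /* ...the request payload itself follows (elided)... */
} replicate_arg_t;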
The backups record the request sent by the primary; the timestamp mentioned earlier helps a backup notice if it has missed an operation, and a backup sends an acknowledgment once it has recorded the request. After the primary has received acknowledgments from a majority, it executes the request and returns the result to the client. Note one point here: the backups do not execute the request at this stage; they only record it.
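Under the same caveat, here is a sketch of the primary's commit rule; the data structure and names are assumptions, and only the rule itself (execute once a majority, counting the primary, has recorded the operation) comes from the text.

/* Sketch of the primary deciding when an operation may be executed.
 * The fixed-size ack array and all names are assumptions. */
#include <stdbool.h>

#define NCOHORTS 5              /* assumed size of the current view */

typedef struct {
    int  ncohorts;              /* cohorts in the current view */
    bool recorded[NCOHORTS];    /* recorded[i]: cohort i has logged the op */
    bool executed;              /* set once the op has been run */
} pending_op_t;

/* The primary logs the op itself, so its own slot starts out true. */
static bool majority_recorded(const pending_op_t *op)
{
    int count = 0;
    for (int i = 0; i < op->ncohorts; i++)
        if (op->recorded[i])
            count++;
    return count > op->ncohorts / 2;        /* strict majority, primary included */
}

/* Called when a backup's acknowledgment arrives: once a majority has
 * recorded the op, the primary executes it and replies to the client. */
void on_backup_ack(pending_op_t *op, int backup_index)
{
    op->recorded[backup_index] = true;
    if (!op->executed && majority_recorded(op)) {
        op->executed = true;
        /* execute the operation and send the reply to the client (elided) */
    }
}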
►view-change protocol
The process of creating a new view after a failure is similar to Paxos: first a new view-id is proposed, and then the new view itself. Figure 4 shows this view-change protocol:
When the view manager (the cohort that proposes the new view-id) has picked a new view-id, it sends a view_change_arg message to the other cohorts; they send their replies back, after which the view manager picks the new view and its primary.
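As a rough illustration only, the two phases might exchange messages along these lines; every field below is a guess, since the text only says that a new view-id is agreed on first and the new view with its primary second.

/* Sketch of view-change messages. All fields are assumptions made for
 * illustration, not the paper's actual message formats. */
#include <stdint.h>

typedef struct {
    uint64_t new_view_id;     /* the proposed view-id, like prepare(n) in Paxos */
    uint64_t old_view_id;     /* the view the sender currently belongs to */
} view_change_arg_t;

typedef struct {
    int      accepted;        /* 0 if a higher view-id has already been seen */
    uint64_t latest_view_id;  /* the most recent view this cohort knows about */
    /* ...plus whatever state the view manager needs to pick the new view (elided) */
} view_change_result_t;

typedef struct {
    uint64_t view_id;         /* the agreed new view-id */
    int      primary_index;   /* which cohort is primary in the new view */
    int      ncohorts;        /* membership of the new view (details elided) */
} new_view_arg_t;             /* proposed in the second phase, like propose(n, v) */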
►optimizations
One optimization concerns read-only operations: the protocol as described requires every request, including read-only ones, to be broadcast to the backups. The optimization taken here is that if a majority of the backups promise not to form a new view within the next 60 seconds, the primary may temporarily answer read-only requests by itself instead of broadcasting them to the backups. Another optimization reduces the number of replicas: handling a failure requires at least three replicas, but we can cut this to two and let the third machine play an observer role, that is, it does not take part in executing requests in the normal case, and only joins the consistency protocol to form a view with the other two machines when a failure or a network partition occurs.
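For the read-only optimization, here is a small sketch of the check the primary might perform; the bookkeeping is an assumption, and only the 60-second window and the majority requirement come from the text.

/* Sketch of the primary's lease check for answering reads locally.
 * The array layout (slot 0 = the primary itself) is an assumption. */
#include <stdbool.h>
#include <time.h>

#define LEASE_SECONDS 60
#define MAX_COHORTS   8

typedef struct {
    int    ncohorts;                   /* cohorts in the current view */
    time_t promise[MAX_COHORTS];       /* when each backup last promised not to
                                          form a new view for LEASE_SECONDS */
} primary_state_t;

/* The primary may answer a read-only request by itself only while a
 * majority of the view (itself included) is still covered by a promise. */
bool can_serve_read_locally(const primary_state_t *p, time_t now)
{
    int covered = 1;                   /* the primary counts itself */
    for (int i = 1; i < p->ncohorts; i++)
        if (difftime(now, p->promise[i]) < LEASE_SECONDS)
            covered++;
    return covered > p->ncohorts / 2;  /* otherwise fall back to broadcasting */
}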