Paxos made practical

Last Update:2014-09-10 Source: Internet

Author: User

This article is based on the paper "Paxos Made Practical".

A Paxos implementation proceeds in three steps:

1) Proposer S1 selects a proposal number n. The number must include the unique identifier of the proposing machine, so that two different machines never use the same proposal number. The proposer broadcasts the message prepare(n). Each machine that receives this message either rejects it (if it has already received a prepare message with a number greater than n), or replies with prepare-result(n', v') (if the highest-numbered proposal it has received is n' < n, with value v'), or replies with prepare-result(0, nil) (if it had not received any proposal before this message). If a majority accepts the prepare message, the proposer moves on to step 2.

2) If any response is prepare-result(n', v'), S1 sets the value v of its proposal to the v' from the response with the highest n'. If every response is prepare-result(0, nil), S1 may choose any value. S1 then broadcasts the message propose(n, v). As in step 1, each machine that receives the message either rejects it (if it has already received a prepare(n'') with n'' > n) or accepts it.

3) If a majority (including the proposer) accepts the propose message, S1 broadcasts decide(n, v) to indicate that the group has reached agreement on this proposal.
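The three steps above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not code from the paper: the names make_proposal, struct acceptor, on_prepare, on_propose and pick_value are hypothetical, a proposal number is assumed to be a round counter combined with the machine id, values are plain strings, and prepare-result reports the last proposal the machine accepted. Networking and persistence are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical proposal number: round counter in the high bits, unique
   machine id in the low bits, so two machines never pick the same number.
   Rounds start at 1 so that 0 can mean "no proposal yet". */
uint64_t make_proposal(uint32_t round, uint32_t machine_id) {
    return ((uint64_t)round << 32) | machine_id;
}

/* State kept by each machine that answers prepare and propose messages. */
struct acceptor {
    uint64_t promised;      /* highest prepare(n) seen so far, 0 = none */
    uint64_t accepted_n;    /* number of the last accepted proposal, 0 = none */
    const char *accepted_v; /* value of that proposal, NULL stands for nil */
};

struct prepare_result {
    bool ok;       /* false means the prepare was rejected */
    uint64_t n;    /* n', or 0 if nothing was accepted before */
    const char *v; /* v', or NULL if nothing was accepted before */
};

/* Step 1: reject if a greater prepare was already seen, otherwise reply
   with prepare-result(n', v') or prepare-result(0, nil). */
struct prepare_result on_prepare(struct acceptor *a, uint64_t n) {
    struct prepare_result r = { false, 0, NULL };
    if (n < a->promised)
        return r;
    a->promised = n;
    r.ok = true;
    r.n = a->accepted_n;
    r.v = a->accepted_v;
    return r;
}

/* Step 2, receiver side: reject propose(n, v) if a greater prepare was
   seen, otherwise accept and remember the proposal. */
bool on_propose(struct acceptor *a, uint64_t n, const char *v) {
    if (n < a->promised)
        return false;
    a->promised = n;
    a->accepted_n = n;
    a->accepted_v = v;
    return true;
}

/* Step 2, proposer side: adopt the value of the highest-numbered
   prepare-result; if all results were (0, nil), keep the proposer's own. */
const char *pick_value(const struct prepare_result *rs, size_t count,
                       const char *own) {
    uint64_t best = 0;
    const char *v = own;
    for (size_t i = 0; i < count; i++) {
        if (rs[i].ok && rs[i].v != NULL && rs[i].n > best) {
            best = rs[i].n;
            v = rs[i].v;
        }
    }
    return v;
}
```

A proposer would call pick_value only after collecting successful prepare results from a majority, and would broadcast decide(n, v) only after a majority returned true from on_propose.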

In fact, there is a problem here. In a fault-tolerant system, machines come and go: new machines replace old ones over time. Does "majority" then refer to the old set of machines, the new set, or both? How does the new set preserve safety? And what happens if a machine fails and no machine in the new set has received the decide message?

One paper related to the Paxos algorithm is Viewstamped Replication, but it has two problems: 1) it is hard to tell from it how to replicate a simple system; 2) it assumes that the set of machines is fixed and does not consider machines joining or leaving dynamically. Next, we explain in detail how to use Paxos in practice and how to overcome these limitations of Viewstamped Replication.

First, let's look at the state machine, which is assumed to be deterministic: if two identical state machines start from the same initial state and receive the same sequence of requests, they produce the same responses. A replication system consists of two libraries, one for the server and one for the client. The server library provides three functions:

id_t newgroup (char *path);

id_t joingroup (id_t group, char *path);

int run (char *path, sockaddr *other_cohort, buf (*execute) (buf));

The newgroup function creates a new replicated state machine group, and joingroup adds a machine to an existing group. The third parameter of run, execute, is the implementation of the state machine.

On the client side, the function invoke corresponds to execute:

buf invoke (id_t group, sockaddr *cohort, buf request);

Typically, execute is implemented using RPC. Unfortunately, not all RPC servers are deterministic. For example, when a file server receives a write request, it sets the modification time of the file's inode. Even if two machines run the same file server code and perform the same write request, they end up with different time values for the same write operation and are therefore in different states. The fix is to let one machine pick the non-deterministic time value and have the other machines execute the request using that value. To this end, we introduce a new function, choose:

buf choose (buf request);

It is paired with a new version of execute:

buf execute (buf request, buf extra);

In the file server example, one machine first runs choose, which selects a modification time and returns it in a buffer. Every machine then runs execute, whose second parameter, extra, carries the contents of that buffer (that is, the modification time). In other words, for each request, choose is called on one machine to fix the values that would otherwise cause divergence, and its result is broadcast to the other machines as the extra argument of execute, so that all replicas stay consistent.
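The split between choose and execute can be illustrated with a small C sketch. It makes several assumptions that are not in the original: buf is modeled as a hypothetical fixed-size struct buf, the replicated state is a toy struct file_state with only a size and a modification time, and the execute step is written as execute_write with an explicit state pointer so that two replicas can be simulated in one process.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical fixed-size stand-in for the article's buf type. */
struct buf {
    size_t len;
    unsigned char data[64];
};

/* Toy replicated state: the size and modification time of one file. */
struct file_state {
    uint64_t size;
    int64_t mtime;
};

/* choose runs on one machine only. It fixes the non-deterministic part
   of the request (here, the current time) and returns it in a buffer. */
struct buf choose(struct buf request) {
    (void)request; /* this toy choice does not depend on the request */
    struct buf extra = { 0 };
    int64_t now = (int64_t)time(NULL);
    memcpy(extra.data, &now, sizeof now);
    extra.len = sizeof now;
    return extra;
}

/* execute runs on every replica. It never reads the local clock; the
   modification time comes from extra, so the result is deterministic. */
struct buf execute_write(struct file_state *fs, struct buf request,
                         struct buf extra) {
    int64_t mtime = 0;
    if (extra.len == sizeof mtime)
        memcpy(&mtime, extra.data, sizeof mtime);
    fs->size += request.len; /* pretend the request bytes are appended */
    fs->mtime = mtime;

    struct buf reply = { 0 };
    memcpy(reply.data, &fs->size, sizeof fs->size);
    reply.len = sizeof fs->size;
    return reply;
}
```

Calling choose once and passing the same extra to execute_write on two separate file_state values leaves both with the same mtime, which is the property the paragraph above describes.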

Normal-case operation

One machine in the server group is the primary, and the others are backups. The term view denotes a set of active cohorts together with a designated primary (a cohort is a machine in the group). The system assigns each view a unique view-ID, which is incremented whenever a view change occurs (that is, whenever a new view is formed). The normal-case flow is as follows:

The client sends a request to the primary, which logs the request and broadcasts it to the backups. Each backup logs the operation op and returns an acknowledgment to the primary. Once the primary knows that a majority (including itself) has logged the operation, it executes op and returns the result to the client. In more detail:

Before sending a request, the client must know the current primary and view-ID of the server group. To learn them, the client must know the identity of each cohort in the group. With that information, the client sends its request to the primary.

Note the vid field of the request: it is the view-ID of the current cohort set, and it exists to prevent a new primary from re-executing the request after a view change. When the primary receives a request, it assigns the request a timestamp, and requests within a view are executed in timestamp order. Combining the view-ID and the timestamp yields a viewstamp, which orders all requests, whether or not they belong to the same view. In the replicate_arg message that the primary sends to the backups, the viewstamp identifies the request's position in the execution order. The message also carries a second viewstamp, committed: every request with a smaller viewstamp has already been executed and its result returned to the client. As a result, when a reply from the primary to a client is lost or the primary suddenly fails, only the requests after committed need to be re-executed.
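The ordering that a viewstamp defines can be sketched as a lexicographic comparison in C. The struct layout and the names viewstamp_cmp and is_committed are assumptions made for illustration: the view-ID is modeled as a single integer, and the committed mark is treated as exclusive, matching the "smaller viewstamp" wording above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical viewstamp: a view-ID plus a timestamp within that view. */
struct viewstamp {
    uint64_t vid; /* view-ID, grows with every view change */
    uint64_t ts;  /* timestamp assigned by the primary inside the view */
};

/* Compare view-IDs first, then timestamps. Returns a negative value,
   zero, or a positive value, like strcmp. */
int viewstamp_cmp(struct viewstamp a, struct viewstamp b) {
    if (a.vid != b.vid)
        return a.vid < b.vid ? -1 : 1;
    if (a.ts != b.ts)
        return a.ts < b.ts ? -1 : 1;
    return 0;
}

/* True if the request stamped vs lies strictly below the committed mark,
   meaning its result has already been returned to the client. */
bool is_committed(struct viewstamp vs, struct viewstamp committed) {
    return viewstamp_cmp(vs, committed) < 0;
}
```

Because the view-ID is compared first, any request from an earlier view sorts before every request of a later view, regardless of its timestamp.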

The backups log the requests sent by the primary, and the timestamps described above help them detect missing operations. Once a request is logged, the backup sends an acknowledgment. After receiving acknowledgments from a majority, the primary executes the request and returns the result to the client. One question remains: do the backups not need to execute the request themselves?
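The majority rule that the primary applies before executing an operation can be sketched in C. The names struct pending_op, primary_log and on_backup_ack are hypothetical, and the sketch assumes each backup acknowledges an operation at most once; a real implementation would track which backups have answered.

```c
#include <stdbool.h>
#include <stddef.h>

/* Bookkeeping the primary keeps for one logged operation. */
struct pending_op {
    size_t group_size; /* cohorts in the current view, primary included */
    size_t logged;     /* cohorts known to have logged the operation */
    bool executed;     /* set once the primary has executed it */
};

/* Smallest number of cohorts that forms a majority of the group. */
size_t majority(size_t group_size) {
    return group_size / 2 + 1;
}

/* Marks the operation as executed when a majority has logged it.
   Returns true only on the call that crosses the threshold. */
static bool try_execute(struct pending_op *op) {
    if (!op->executed && op->logged >= majority(op->group_size)) {
        op->executed = true;
        return true;
    }
    return false;
}

/* The primary logs the operation itself, so the count starts at 1. */
bool primary_log(struct pending_op *op, size_t group_size) {
    op->group_size = group_size;
    op->logged = 1;
    op->executed = false;
    return try_execute(op);
}

/* Called for each acknowledgment from a backup. A true result means
   the primary should now execute the operation and reply to the client. */
bool on_backup_ack(struct pending_op *op) {
    if (op->logged < op->group_size)
        op->logged++;
    return try_execute(op);
}
```

In a group of five, the second acknowledgment from a backup brings the count to three and triggers execution; later acknowledgments are counted but do not trigger it again.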

View-change protocol

When a fault occurs, a view change creates a new view through a process similar to Paxos: a new view-ID is proposed first, and then the new view itself is proposed. Figure 4 shows the view-change protocol.

The view manager selects a new view-ID and sends a view_change_arg message to the other cohorts. Based on their replies, it then selects the new view and its primary.

Optimizations

In the protocol as described, every request, including read-only operations, must be broadcast to the backups. One optimization removes this cost for reads: if a majority of the backups promise not to form a new view for 60 seconds, the primary can answer read-only requests directly during that period without broadcasting them to the backups. A second optimization concerns the number of replicas. Tolerating a fault requires at least three replicas. This can be reduced to two full replicas plus a third machine that acts as an observer: it does not execute requests in normal operation, and it joins the consistency protocol to form a view with the other machines only when a fault or network partition occurs.
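The read-only optimization amounts to a lease check on the primary, which can be sketched in C. Everything named here is an assumption for illustration: LEASE_SECONDS, can_serve_read_locally and the promised_at array are hypothetical, "majority of the backups" is read literally as more than half of the backups, and all times are taken from the primary's own clock. A real implementation would also have to allow for clock drift between machines.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LEASE_SECONDS 60 /* promise length mentioned in the article */

/* promised_at[i] is the time, in seconds on the primary's clock, at which
   backup i promised not to form a new view; 0 means no promise.
   Returns true if more than half of the backups hold a promise that is
   still valid at time now, so a read may be answered without a broadcast. */
bool can_serve_read_locally(const int64_t *promised_at, size_t n_backups,
                            int64_t now) {
    size_t valid = 0;
    for (size_t i = 0; i < n_backups; i++) {
        if (promised_at[i] > 0 && now >= promised_at[i] &&
            now - promised_at[i] < LEASE_SECONDS)
            valid++;
    }
    return valid > n_backups / 2;
}
```

When the check fails, the primary falls back to the normal path and broadcasts the read-only request to the backups like any other operation.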

This article is an English version of an article originally published in Chinese on aliyun.com.
