Paxos made practical


This article summarizes the paper "Paxos Made Practical".


Paxos proceeds in three steps:

1) Proposer S1 selects a proposal number n, which must contain the unique identifier of the proposer's machine so that two different machines never use the same proposal number. The proposer broadcasts the message prepare(n). A machine that receives this message either rejects it (because it has already received a prepare message with a number greater than n), or replies with prepare-result(n', v') (where n' < n is the highest-numbered proposal it has received and v' is that proposal's value), or replies with prepare-result(0, nil) (because it had not received any proposal before this message). If a majority accepts the prepare message, the proposer moves on to step 2.

2) If any response is prepare-result(n', v') with a non-nil v', S1 sets the value v of its proposal to the v' with the highest n'; if every response is prepare-result(0, nil), S1 may choose v freely. S1 then broadcasts the message propose(n, v). As in step 1, a machine that receives this message either rejects it (because it has received prepare(n'') with n'' > n) or accepts it.

3) If a majority (including the proposer) accepts the propose message, S1 broadcasts decide(n, v) to indicate that the group has reached agreement on this proposal.
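
The receiving side of steps 1 and 2 can be sketched in C as follows. This is only an illustrative sketch, not code from the paper: the type and function names (acceptor, prepare_result, proposal_number, on_prepare, on_propose), the MAX_MACHINES bound, and the way the machine identifier is packed into the proposal number are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MACHINES 16u  /* assumed upper bound on machine identifiers */

/* Assumed encoding: round in the high part, machine id (< MAX_MACHINES)
   in the low part, so two machines never produce the same number.
   Rounds start at 1 so that 0 can mean "no proposal". */
unsigned proposal_number(unsigned round, unsigned machine_id)
{
    return round * MAX_MACHINES + machine_id;
}

/* Hypothetical per-machine state for the receiving side. */
typedef struct {
    unsigned promised_n;    /* highest prepare number seen, 0 = none */
    unsigned accepted_n;    /* number of the last accepted proposal, 0 = none */
    const char *accepted_v; /* value of that proposal, NULL = nil */
} acceptor;

typedef struct {
    bool ok;        /* false means the prepare was rejected */
    unsigned n;     /* n' in prepare-result(n', v') */
    const char *v;  /* v' in prepare-result(n', v') */
} prepare_result;

/* Step 1: handle prepare(n). */
prepare_result on_prepare(acceptor *a, unsigned n)
{
    prepare_result r = { false, 0, NULL };
    if (a->promised_n > n)
        return r;            /* reject: a prepare greater than n was seen */
    a->promised_n = n;
    r.ok = true;
    r.n = a->accepted_n;     /* stays (0, nil) if nothing was accepted yet */
    r.v = a->accepted_v;
    return r;
}

/* Step 2: handle propose(n, v). Returns true if the proposal is accepted. */
bool on_propose(acceptor *a, unsigned n, const char *v)
{
    if (a->promised_n > n)
        return false;        /* reject: saw prepare(n'') with n'' > n */
    a->promised_n = n;
    a->accepted_n = n;
    a->accepted_v = v;
    return true;
}
```

The proposer side (counting replies until a majority is reached, then broadcasting propose and decide) is omitted here to keep the sketch short.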

 

There is a problem here, though. In a fault-tolerant system, new machines join and old machines leave over time. When we speak of a majority, do we mean a majority of the old set of machines, of the new set, or of both? How can a new set be formed safely? And what happens if machines fail and none of the new machines ever receives the decide message?

 

One of the papers related to the Paxos algorithm is Viewstamped Replication, but it has two problems: 1) it is hard to see from it how to replicate a simple system; 2) it assumes that the set of machines is fixed and does not consider machines dynamically joining or leaving. Below we explain in detail how to use Paxos and how to overcome these limitations of Viewstamped Replication.

 

First, consider the state machine, which here is deterministic: if two identical state machines start from the same initial state and receive the same sequence of requests, they produce the same responses. A replication system provides two libraries, one for the server and one for the client. The server library provides three functions:

id_t newgroup(char *path);

id_t joingroup(id_t group, char *path);

int run(char *path, sockaddr *other_cohort, buf (*execute)(buf));

The newgroup function creates a new state machine group. To join an existing group, a machine calls joingroup. The third parameter of run, execute, is the implementation of the state machine.

On the client, the function invoke corresponds to execute:

buf invoke(id_t group, sockaddr *cohort, buf request);

Typically, execute is implemented by an RPC server. Unfortunately, not all RPC servers are deterministic. For example, when a file server receives a write request, it sets the modification time in the file's inode. Even if two machines run the same file server code and perform the same write request, they will end up with different time values for the same write and therefore in different states. We can solve this by letting one machine pick the non-deterministic time value and having the other machines execute using that machine's choice. To this end, we introduce a new function, choose:

buf choose(buf request);

It is paired with a new version of execute:

buf execute(buf request, buf extra);

In the file server example, one machine first calls choose, which selects a modification time, puts it into a buffer, and returns the buffer. Then execute is called with that buffer's content (the modification time) as its second parameter, extra. In other words, for each request, choose is first called on one machine to pick the values that would otherwise cause divergence; the result is then broadcast to the other machines, which pass it to execute, so that all replicas stay consistent.
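
The choose/execute split for the file server example can be sketched as below. This is a simplified illustration under stated assumptions: instead of the opaque buf type it uses a plain long for the extra value, and file_state, choose_mtime, and execute_write are hypothetical names, not functions from the paper.

```c
#include <assert.h>
#include <time.h>

/* Toy replica state: just what a write request changes in this example. */
typedef struct {
    long mtime;   /* modification time stored in the inode */
    int nwrites;  /* number of writes applied so far */
} file_state;

/* choose: runs on one machine only and picks the non-deterministic value. */
long choose_mtime(void)
{
    return (long)time(NULL);
}

/* execute: runs on every replica with the same extra value, so the
   result no longer depends on each machine's local clock. */
void execute_write(file_state *f, long extra_mtime)
{
    f->nwrites += 1;
    f->mtime = extra_mtime;
}
```

If each replica called time() inside execute_write instead, two replicas applying the same write could store different modification times, which is exactly the divergence described above.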

 

Normal-case operation

One machine in the server group is the primary, and the others are backups. The term view denotes a set of active cohorts with a designated primary (a cohort is a machine in the group). The system assigns each view a unique view-ID, which is incremented whenever a view change occurs (that is, whenever a new view is formed). The normal-case flow is as follows:

 

The client sends a request to the primary, which logs the request and broadcasts it to the backups. Each backup logs the operation and returns an acknowledgment to the primary. Once the primary knows that a majority (including itself) has logged the operation, it executes the operation and returns the result to the client. The details are as follows:
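
The majority rule that the primary applies before executing an operation can be sketched like this. The pending_op structure and the functions op_start, op_ack, and op_has_majority are hypothetical names for illustration; a real implementation would also have to ignore duplicate acknowledgments from the same backup, which this sketch does not do.

```c
#include <assert.h>
#include <stdbool.h>

/* Bookkeeping the primary might keep for one in-flight operation. */
typedef struct {
    int group_size;  /* cohorts in the current view, primary included */
    int logged;      /* cohorts known to have logged the operation */
} pending_op;

/* The primary logs the request itself before broadcasting it. */
void op_start(pending_op *op, int group_size)
{
    op->group_size = group_size;
    op->logged = 1;
}

/* Record one acknowledgment from a backup. */
void op_ack(pending_op *op)
{
    op->logged += 1;
}

/* True once a majority of the view, primary included, has logged the op. */
bool op_has_majority(const pending_op *op)
{
    return 2 * op->logged > op->group_size;
}
```

With five cohorts, for example, the primary needs acknowledgments from two backups before it may execute the operation and reply to the client.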

Before sending a request, the client must know the server group's current primary and view-ID; to learn these, it must know the identity of each cohort in the group. The request that the client sends to the primary includes a field named vid.

 

The vid field is the view-ID of the current set of cohorts. It exists to prevent a new primary from re-executing the request when a view change occurs. On receiving requests, the primary attaches a timestamp to each one, and requests within a view are executed in timestamp order. Combining the view-ID and the timestamp into a viewstamp gives an execution order for all requests, whether or not they belong to the same view. In the replicate_arg message that the primary sends to the backups, the viewstamp identifies the request's position in the execution order. The message also carries a viewstamp called committed: every request with a viewstamp smaller than committed has been executed and its result returned to the client. Thus, if the primary's reply to the client is lost or the primary suddenly fails, only the requests after committed need to be re-executed.

The backups log the requests sent by the primary, and the timestamps mentioned earlier let them detect missing operations. After logging a request, a backup sends an acknowledgment. After receiving acknowledgments from a majority, the primary executes the request and returns the result to the client. This raises a question: do the backups not need to execute the request?
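
The ordering that viewstamps impose can be sketched as a two-level comparison: view-ID first, then timestamp. The layout of the viewstamp struct (two unsigned integers) and the function names are assumptions made for this example, and treating committed as an exclusive bound in needs_replay is one possible reading of the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout: a view-ID plus a timestamp within that view. */
typedef struct {
    unsigned vid;  /* view-ID */
    unsigned ts;   /* timestamp assigned by the primary of that view */
} viewstamp;

/* Compare by view-ID first, then by timestamp.
   Returns a negative, zero, or positive value, like strcmp. */
int viewstamp_cmp(viewstamp a, viewstamp b)
{
    if (a.vid != b.vid)
        return a.vid < b.vid ? -1 : 1;
    if (a.ts != b.ts)
        return a.ts < b.ts ? -1 : 1;
    return 0;
}

/* Assumption: requests strictly below committed are done for good;
   anything at or above it may have to be re-executed after a failure. */
bool needs_replay(viewstamp request, viewstamp committed)
{
    return viewstamp_cmp(request, committed) >= 0;
}
```

Because the view-ID dominates the comparison, any request from an earlier view sorts before every request of a later view, regardless of its timestamp.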

 

View-change protocol

A failure triggers a view change, and the process of forming a new view is similar to Paxos: a new view-ID is proposed first, and then a new view. Figure 4 shows the process of the view-change protocol:

 

After the view manager selects a new view-ID, it sends a view_change_arg message to the other cohorts. Based on their replies, it then selects the new view and its primary.

 

Optimizations

By default, the protocol requires every request, including read-only operations, to be broadcast to the backups. One optimization proposed in the paper is this: if a majority of the backups promise not to form a new view for 60 seconds, the primary can, during that period, answer read-only requests without broadcasting them to the backups. Another optimization concerns the number of replicas: tolerating a fault normally requires at least three replicas. This can be reduced to two replicas plus a third machine that acts as an observer and does not execute requests during normal operation. Only when a failure or network partition occurs does the observer join the consistency protocol to form a view with the other machines.
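
The read-only optimization amounts to a lease check on the primary. The sketch below is an illustration only: lease_table, MAX_COHORTS, and can_serve_read_locally are hypothetical names, times are plain seconds, and the rule implemented is the one stated above (a majority of the backups hold an unexpired promise). Clock skew between machines is ignored.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_COHORTS 16  /* assumed upper bound on the view size */

typedef struct {
    int group_size;                    /* cohorts in the view, primary included */
    long promise_expiry[MAX_COHORTS];  /* per-backup promise expiry in seconds,
                                          0 = no promise; entries
                                          0..group_size-2 are the backups */
} lease_table;

/* True if the primary may answer a read-only request at time now
   without broadcasting it to the backups. */
bool can_serve_read_locally(const lease_table *t, long now)
{
    int backups = t->group_size - 1;
    int holding = 0;
    for (int i = 0; i < backups; i++) {
        if (t->promise_expiry[i] > now)
            holding++;  /* this backup's promise is still valid */
    }
    return 2 * holding > backups;
}
```

A backup that granted a promise at time t for 60 seconds would be recorded with expiry t + 60; once enough promises lapse, the primary has to fall back to broadcasting reads like any other request.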
