Paxos Algorithm and Zookeeper analysis

Source: Internet
Author: User
Tags: zookeeper

Reprinted from http://blog.csdn.net/xhh198781/article/details/10949697

1. Paxos Algorithm

1.1 Basic Definitions

The participants in the algorithm are divided into three roles, and each participant may play several roles at the same time:

⑴ Proposer: proposes proposals, each consisting of a proposal number and a proposed value;

⑵ Acceptor: may accept a proposal after receiving it;

⑶ Learner: can only "learn" proposals that have already been approved;

The algorithm must preserve the following basic consistency semantics:

⑴ A value can be approved (chosen) only if it was proposed by some proposer (a value that has not yet been approved is called a "proposal");

⑵ In a single execution instance of the Paxos algorithm, only one value is approved (chosen);

⑶ Learners can only learn values that have been approved (chosen);

The three semantics above can be refined into four constraints:

⑴ P1: An acceptor must accept the first proposal it receives;

⑵ P2A: Once a proposal with value V has been approved (chosen), any proposal subsequently accepted by any acceptor must also have value V;

⑶ P2B: Once a proposal with value V has been approved (chosen), any proposal subsequently issued by any proposer must also have value V;

⑷ P2C: If a proposal with number n has value V, then there exists a majority of acceptors such that either none of them has accepted any proposal numbered less than n, or, among the proposals numbered less than n that they have accepted, the one with the largest number has value V;

1.2 Basic Algorithm (Basic Paxos)

The algorithm (proposing and approving a value) is divided into two phases:

1. Prepare phase:

(1). When a proposer wishes to propose value V1, it first sends prepare requests to a majority of acceptors. The prepare request carries only a sequence number <SN1>;

(2). When an acceptor receives the prepare request <SN1>, it checks the sequence number <SN2> of the last prepare request it replied to:

a). If SN2 > SN1, it ignores the request and this round of approval terminates;

b). Otherwise, it looks up the last accept request it approved, <SNx,Vx>, and replies with <SNx,Vx>; if it has never approved any proposal, it simply replies <OK>;

2. Accept (approval) phase:

(1a). After a while, the proposer has received replies from some acceptors. The replies fall into the following cases:

a). The replies form a majority and all of them are <OK>: the proposer sends an accept request carrying the proposal <SN1,V1>;

b). The replies form a majority, but some of them carry previously approved proposals such as <SN2,V2>, <SN3,V3>, ...: the proposer takes the reply with the largest sequence number, say <SNx,Vx>, and sends an accept request carrying the proposal <SN1,Vx>;

c). The replies do not form a majority: the proposer increases its sequence number to SN1+1 and retries from phase 1;

(1b). After a while, the proposer has received replies to its accept request. These fall into the following cases:

a). The replies form a majority: the proposed value is confirmed as accepted;

b). The replies do not form a majority: the value was not accepted; the proposer increases its sequence number to SN1+1 and retries from phase 1;

(2). Upon receiving an accept request, an acceptor accepts it and replies, provided that doing so does not violate the promises it has made to other proposers (i.e., it has not already replied to a prepare request with a higher sequence number).
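To make the acceptor's side of the two phases concrete, here is a minimal sketch in Java. The class and method names are illustrative, not from any Paxos library, and the state is kept in memory on a single node.

```java
// Illustrative single-node acceptor state for Basic Paxos (not library code).
// prepare() implements the prepare-phase check described above; accept()
// accepts only if no higher-numbered prepare has already been promised.
public class PaxosAcceptor {

    // Reply to a prepare request: a rejection, a plain OK, or the last <SNx,Vx>.
    public static final class PrepareReply {
        public final boolean ok;
        public final long acceptedSn;      // meaningful only if acceptedValue != null
        public final String acceptedValue; // null means "no proposal accepted yet"

        PrepareReply(boolean ok, long acceptedSn, String acceptedValue) {
            this.ok = ok;
            this.acceptedSn = acceptedSn;
            this.acceptedValue = acceptedValue;
        }
    }

    private long promisedSn = -1;   // SN2: highest prepare sequence number replied to
    private long acceptedSn = -1;   // SNx: sequence number of the last accepted proposal
    private String acceptedValue;   // Vx: value of the last accepted proposal, null if none

    // Prepare phase, acceptor side.
    public synchronized PrepareReply prepare(long sn1) {
        if (promisedSn > sn1) {
            // SN2 > SN1: ignore the request (reject).
            return new PrepareReply(false, -1, null);
        }
        promisedSn = sn1;
        // Reply with the last accepted <SNx,Vx>, or a simple OK if none.
        return new PrepareReply(true, acceptedSn, acceptedValue);
    }

    // Accept phase, acceptor side: accept unless it would break the promise
    // already made to a prepare request with a higher sequence number.
    public synchronized boolean accept(long sn1, String value) {
        if (sn1 < promisedSn) {
            return false;
        }
        promisedSn = sn1;
        acceptedSn = sn1;
        acceptedValue = value;
        return true;
    }
}
```

On the proposer side, once a majority of prepare replies has arrived, case b above requires proposing the value from the reply with the largest accepted sequence number; only if every reply was a plain <OK> may the proposer use its own value V1.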

1.3 Algorithm Optimization (Fast Paxos)

Under contention, the Paxos algorithm converges very slowly and may even livelock: for example, when three or more proposers send prepare requests, it is hard for any single proposer to collect replies from more than half of the acceptors and move on to the second phase. To avoid this contention and speed up convergence, the algorithm introduces a leader role. In normal operation exactly one participant plays the leader at any given time, the other participants play acceptors, and all participants play learners.

In this optimized algorithm only the leader may propose, which removes the contention and lets the algorithm converge quickly to agreement; at this point Paxos essentially degenerates into a two-phase commit protocol. In abnormal situations the system may temporarily have several leaders, but this does not break the consistency guarantee: each leader can still issue its own proposals, and the optimized algorithm simply falls back to the original Paxos algorithm.

A leader's workflow consists of three phases:

(1). Learning phase: learn from the other participants the data (resolutions) it does not yet know;

(2). Synchronization phase: bring a majority of participants to a consistent view of the data (resolutions);

(3). Service phase: serve clients and issue proposals on their behalf;

1.3.1 Learning Phase

When a participant becomes the leader, it needs to know the outcome of most Paxos instances, so it immediately starts an active learning process. Suppose the new leader already knows Paxos instances 1-134, 138, and 139. It then executes the first phase of the protocol for instances 135-137 and for all instances greater than 139. If only instances 135 and 140 turn out to have a determined value, the leader ends up knowing instances 1-135 and 138-140.
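As a rough illustration of how the leader decides which instances to probe (the helper names are assumed for this sketch; this is not ZooKeeper or Paxos library code), the gap instances below the highest known instance id can be computed as follows, with every id above the maximum probed as well:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: compute which Paxos instance ids a new leader must probe
// (run phase 1 for), given the ids whose outcome it already knows.
public class LearningPhase {

    // Returns the gap instances below the highest known id; instances above
    // the highest known id are probed separately.
    static List<Long> gapsToProbe(Set<Long> knownInstances) {
        TreeSet<Long> known = new TreeSet<>(knownInstances);
        List<Long> gaps = new ArrayList<>();
        if (known.isEmpty()) {
            return gaps;
        }
        for (long id = 1; id < known.last(); id++) {
            if (!known.contains(id)) {
                gaps.add(id);
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        TreeSet<Long> known = new TreeSet<>();
        for (long id = 1; id <= 134; id++) known.add(id);
        known.add(138L);
        known.add(139L);
        // Prints [135, 136, 137]; instances > 139 are probed as well.
        System.out.println(gapsToProbe(known));
    }
}
```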

1.3.2 Synchronization Phase

At this point the leader knows Paxos instances 1-135 and 138-140. It re-executes instances 1-135 so that a majority of participants agree on them. It does not execute instances 138-140 immediately; instead it waits until the service phase, when instances 136 and 137 have been filled in. The reason for filling these gaps is to prevent future leaders from having to keep learning these gap instances, which carry no determined value.

1.3.3 Service Phase

The leader turns each client request into a corresponding Paxos instance, and it can of course execute multiple Paxos instances concurrently. If the leader fails at this point, some Paxos instances are likely to be left incomplete, producing gaps.

1.3.4 Problems

(1). By what rules is the leader elected?

(2). How do acceptors detect that the current leader has failed, and how do clients learn who the current leader is?

(3). When multiple leaders exist, how are the unwanted leaders eliminated?

(4). How can the set of acceptors be extended dynamically?

2. ZooKeeper

2.1 Overall Architecture

A ZooKeeper cluster has three main roles, and each node plays exactly one of them:

(1). Leader: accepts proposal requests from all followers, initiates proposals and coordinates the voting on them, and is responsible for internal data exchange (synchronization) with all followers;

(2). Follower: serves clients directly and takes part in voting on proposals, while exchanging data with the leader (synchronization);

(3). Observer: serves clients directly but does not take part in voting on proposals; it also exchanges data with the leader (synchronization);

2.2 Basic Design of QuorumPeer

ZooKeeper's per-node QuorumPeer design is quite flexible. A QuorumPeer consists mainly of four components: the client request receiver (ServerCnxnFactory), the data engine (ZKDatabase), the elector (Election), and the core functional component (Leader/Follower/Observer). Specifically:

(1). ServerCnxnFactory maintains the connections with clients (receiving client requests and sending the corresponding responses);

(2). ZKDatabase stores, loads, and looks up data (a directory-tree-structured key-value store, plus the operation log and client sessions);

(3). Election elects the leader node of the cluster;

(4). Leader/Follower/Observer carries out the core responsibilities of the QuorumPeer node in its current role; a structural sketch follows.
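The following skeleton is only a hedged illustration of how these four components might fit together in a node's main loop; the names mirror the components above, but this is not the actual ZooKeeper QuorumPeer source.

```java
// Illustrative skeleton only -- not the actual org.apache.zookeeper QuorumPeer
// class, just a sketch of the four components described above.
public class QuorumPeerSketch {

    enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING }

    interface ServerCnxnFactory { void startAcceptingClients(); } // client request receiver
    interface ZKDatabase       { byte[] get(String path); }       // data engine (tree + log + sessions)
    interface Election         { long lookForLeader(); }          // elects the cluster leader

    private final ServerCnxnFactory cnxnFactory; // maintains client connections
    private final ZKDatabase zkDb;               // stores/loads/locates data
    private final Election electionAlg;          // leader election strategy
    private ServerState state = ServerState.LOOKING;

    QuorumPeerSketch(ServerCnxnFactory cnxnFactory, ZKDatabase zkDb, Election electionAlg) {
        this.cnxnFactory = cnxnFactory;
        this.zkDb = zkDb;
        this.electionAlg = electionAlg;
    }

    // Core loop: elect a leader, then play whichever role this node ended up with.
    void run() {
        cnxnFactory.startAcceptingClients();
        while (true) {
            switch (state) {
                case LOOKING:
                    long leaderId = electionAlg.lookForLeader();
                    state = (leaderId == myId()) ? ServerState.LEADING : ServerState.FOLLOWING;
                    break;
                case LEADING:
                    leadAndSyncFollowers();        // coordinate proposals, sync followers
                    state = ServerState.LOOKING;
                    break;
                case FOLLOWING:
                case OBSERVING:
                    followLeaderAndServeClients(); // serve clients, vote (followers only)
                    state = ServerState.LOOKING;
                    break;
            }
        }
    }

    private long myId() { return 1L; }                        // placeholder id
    private void leadAndSyncFollowers() { /* sketch only */ }
    private void followLeaderAndServeClients() { /* sketch only */ }
}
```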

2.3 QuorumPeer Workflow

2.3.1 Leader Responsibilities

Follower confirmation: the leader waits for followers to connect and register. If a quorum of valid follower registrations arrives within the configured time, the confirmation succeeds; otherwise it fails. A hedged sketch of this quorum wait follows.
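A minimal sketch of this confirmation step, assuming a hypothetical registration callback; the names and the timeout handling are illustrative, not taken from the ZooKeeper Leader implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative quorum wait: the leader collects follower registrations and
// succeeds only if a quorum of them arrives before the deadline.
public class FollowerConfirmation {

    private final int clusterSize;
    private final Set<Long> registeredFollowers = new HashSet<>();

    public FollowerConfirmation(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    // Called (hypothetically) whenever a follower connects and registers.
    public synchronized void register(long followerId) {
        registeredFollowers.add(followerId);
        notifyAll();
    }

    // Returns true if a quorum (> half, counting the leader itself) registered in time.
    public synchronized boolean waitForQuorum(long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        // "+ 1" counts the leader itself toward the quorum.
        while (registeredFollowers.size() + 1 <= clusterSize / 2) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false; // confirmation fails: not enough registrations in time
            }
            wait(remaining);
        }
        return true; // confirmation succeeds
    }
}
```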

2.3.2 Follower Responsibilities

2.4 Election Algorithms

2.4.1 LeaderElection Algorithm

The election thread is the thread started by the server that initiates the election; its main job is to count the poll results and pick the recommended server. The election thread first sends an inquiry to all servers (including itself). Each queried server responds according to its current state. When the election thread receives a reply, it verifies that the inquiry was initiated by itself (by checking that the XID matches), then records the responder's id (myid) in the list of queried servers, and finally extracts the proposed leader information (id, zxid) from the reply and stores it in the election's voting record table. Once all servers have been queried, the results are filtered and tallied: the server with the largest zxid becomes the server that the current server recommends (it may be the current server itself or another server, depending on the poll results, but every server votes in the first round). If the winning server receives the votes of n/2+1 servers, it becomes the recommended leader, and each server sets its own state according to the information about the winner. Each server repeats this process until a leader is elected.

Initialize ballot (first round): each quorum node initially votes for itself;

Collect ballots: collect every quorum node's current ballot over UDP (single-threaded, synchronous mode), with a 200 ms timeout;

Tally votes: 1). count the votes received by each quorum node;

2). build a new ballot for yourself (the ballot with the largest zxid, then the largest myid);

Election succeeds: some quorum node holds more than half of the votes;

Update ballot: if this round fails, the current quorum node picks the appropriate ballot (largest zxid, then myid) from the ballots it collected and uses it as its vote in the next round. A hedged tallying sketch follows.
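A hedged sketch of the tallying step (illustrative types, not the actual LeaderElection implementation): count the collected ballots, declare success on a strict majority, and otherwise carry the best (zxid, myid) ballot into the next round.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative tally for a LeaderElection-style round.
public class BallotTally {

    // A ballot names a proposed leader by (zxid, myid).
    public static final class Ballot {
        final long zxid;
        final long myid;
        Ballot(long zxid, long myid) { this.zxid = zxid; this.myid = myid; }
        boolean beats(Ballot other) {
            // Larger zxid wins; ties are broken by larger myid.
            return zxid != other.zxid ? zxid > other.zxid : myid > other.myid;
        }
    }

    // Returns the elected leader's myid if some candidate got more than half of
    // the votes, or -1 if this round failed (the caller then re-votes with bestSeen()).
    public static long tally(List<Ballot> collected, int clusterSize) {
        Map<Long, Integer> votes = new HashMap<>();
        for (Ballot b : collected) {
            votes.merge(b.myid, 1, Integer::sum);
        }
        for (Map.Entry<Long, Integer> e : votes.entrySet()) {
            if (e.getValue() > clusterSize / 2) {
                return e.getKey(); // election succeeds
            }
        }
        return -1; // no quorum: update own ballot to bestSeen(collected) and retry
    }

    // The ballot to carry into the next round: the largest (zxid, myid) seen.
    public static Ballot bestSeen(List<Ballot> collected) {
        Ballot best = collected.get(0);
        for (Ballot b : collected) {
            if (b.beats(best)) best = b;
        }
        return best;
    }
}
```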

Handling of abnormal cases

1). A server joins during the election

When a server starts up it initiates an election, and its election thread carries out the process described above. Each server then learns which server currently holds the largest zxid. If that server has not yet received n/2+1 votes, the next round of voting takes place; each server votes for the server with the largest zxid, and the process repeats until a leader is finally elected.

2). A server exits during the election

As long as n/2+1 servers remain alive there is no problem; if fewer than n/2+1 servers survive, no leader can be elected.

3). The leader dies during the election process

By the time the election has produced a leader, each server has already determined what state it should be in (following). Because the leader has died we no longer care about it; the other followers simply continue the normal process. When that process completes, every follower sends ping messages to the leader; if a follower cannot ping the leader, it changes its state from following back to looking and initiates a new round of election.

4). The leader dies after the election has completed

The process is as above.

5). Double-Master problem

Leader election guarantees that only one recognized leader is produced, and a follower's re-election happens essentially at the same time as the old leader's recovery and exit: when a follower can no longer ping the leader it assumes the leader has failed and starts a new election, while a leader that receives pings from no more than half of the followers exits the leader role and rejoins the election.

2.4.2 FastLeaderElection Algorithm

FastLeaderElection is the standard Fast Paxos style implementation: a server first proposes to all servers that it itself should become leader; when another server receives the proposal it resolves the conflict by comparing epoch and zxid, accepts the other side's proposal if it wins, and then sends a message back indicating that the proposal has been accepted, completing the exchange.

The FastLeaderElection algorithm collects the votes of other nodes asynchronously, and while doing so it analyzes the ballots and handles them differently according to each voter's current state, in order to speed up the leader election process.

Each server has a receive thread pool and a send thread pool. When no election is in progress, both pools block until a message arrives, at which point they wake up and process it. Each server also has an election thread (the thread that can initiate an election).

1). Actively initiating an election (election thread)

First the server increments its own logicalclock by 1, then generates a notification message and puts it into the send queue; one message is generated for every server configured in the system, so each server is guaranteed to receive it. If the current server is in the LOOKING state, it loops over the receive queue checking for messages; when a message arrives it is handled according to the state carried in the message.

2). Sending side (send thread pool)

A notification message to be sent is converted into a ToSend message and sent to the other party, after which the sender waits for the reply.

3). Receiving side (receive thread pool)

A received message is converted into a notification message and put into the receive queue. If the sender's epoch is smaller than the local logicalclock, a message is sent back so the sender can update its epoch; if the sender is in the LOOKING state while the local server is already in the FOLLOWING or LEADING state, a message is also sent back (the current leader has already been chosen, so the sender converges as quickly as possible). A hedged sketch of the ballot ordering used to resolve conflicts follows.
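The conflict between epoch and zxid mentioned above is resolved by a total order over ballots. The sketch below is modeled on, but not copied from, ZooKeeper's comparison logic: an incoming vote wins if it has a larger epoch, or the same epoch and a larger zxid, or the same epoch and zxid and a larger server id.

```java
// Illustrative ballot ordering for a FastLeaderElection-style algorithm.
// A received vote replaces the local one only if it is strictly "newer"
// under the (epoch, zxid, serverId) lexicographic order.
public class VoteComparison {

    public static final class Vote {
        final long epoch;    // election epoch (logicalclock)
        final long zxid;     // last logged transaction id of the proposed leader
        final long serverId; // myid of the proposed leader
        public Vote(long epoch, long zxid, long serverId) {
            this.epoch = epoch;
            this.zxid = zxid;
            this.serverId = serverId;
        }
    }

    // true if the received vote should supersede the current local vote
    public static boolean newVoteWins(Vote received, Vote current) {
        if (received.epoch != current.epoch) {
            return received.epoch > current.epoch;
        }
        if (received.zxid != current.zxid) {
            return received.zxid > current.zxid;
        }
        return received.serverId > current.serverId;
    }

    public static void main(String[] args) {
        Vote mine = new Vote(5, 0x500000003L, 1);
        Vote theirs = new Vote(5, 0x500000004L, 2);
        System.out.println(newVoteWins(theirs, mine)); // true: larger zxid in the same epoch
    }
}
```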

2.4.3 AuthFastLeaderElection Algorithm

The AuthFastLeaderElection algorithm is essentially the same as FastLeaderElection, except that authentication information is added to the messages; this algorithm has been deprecated in recent ZooKeeper releases.

2.5 ZooKeeper's API

The client API provides the following operations: create, delete, exists, getData, setData, getACL, setACL, getChildren, sync, multi, createSession, closeSession. Each operation is characterized along four dimensions: whether it has a synchronous form, whether it has an asynchronous form, whether it can set a watch, and whether it involves authority (ACL) authentication.
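A short usage sketch of the standard ZooKeeper Java client covering several of the calls above; the connect string and znode path are placeholders, and error handling is omitted.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Minimal usage sketch of the standard ZooKeeper Java client.
// The connect string and znode path are placeholders.
public class ZkApiExample {

    public static void main(String[] args) throws Exception {
        // Connect; the Watcher here just logs session/connection events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000,
                (WatchedEvent event) -> System.out.println("event: " + event));

        String path = "/demo-node";

        // create (synchronous form); the ACL list is supplied at creation time.
        if (zk.exists(path, false) == null) {
            zk.create(path, "v1".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // getData with a watch flag; the Stat receives the znode's metadata.
        Stat stat = new Stat();
        byte[] data = zk.getData(path, true, stat);
        System.out.println("data=" + new String(data, StandardCharsets.UTF_8)
                + " version=" + stat.getVersion());

        // setData is conditional on the expected version (optimistic concurrency).
        zk.setData(path, "v2".getBytes(StandardCharsets.UTF_8), stat.getVersion());

        // getChildren also accepts a watch flag.
        List<String> children = zk.getChildren("/", false);
        System.out.println("children of /: " + children);

        // delete is also version-conditional; -1 means "any version".
        zk.delete(path, -1);

        zk.close();
    }
}
```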

2.6 Request Processing in ZooKeeper

2.6.1 Follower node handling of client read and write requests

2.6.2 Leader node handling of write requests

It is worth noting that on a follower or leader, read operations run in parallel with one another, whereas reads and writes are serialized with respect to each other: while the CommitRequestProcessor is processing a write request, it blocks all subsequent read and write requests. A hedged illustration follows.
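As an analogy for this behavior (not the actual CommitRequestProcessor code), a plain ReadWriteLock gives the same read-parallel, write-serial discipline.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Analogy only: reads proceed in parallel with each other, while a write
// excludes both reads and other writes, mirroring the described behavior
// of the commit path (not the actual CommitRequestProcessor implementation).
public class ReadParallelWriteSerial {

    private final ConcurrentHashMap<String, byte[]> store = new ConcurrentHashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public byte[] read(String path) {
        lock.readLock().lock();          // many readers may hold this at once
        try {
            return store.get(path);
        } finally {
            lock.readLock().unlock();
        }
    }

    public void write(String path, byte[] data) {
        lock.writeLock().lock();         // blocks all reads and writes until done
        try {
            store.put(path, data);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```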

Unreliable communication: message delay, duplicate message delivery, message loss.
