Zookeeper Paxos algorithm

Source: Internet
Author: User

Original source: http://rdc.taobao.com/blog/cs/?p=162

This article describes how the servers in a ZooKeeper cluster elect a leader. The election algorithm is a Paxos variant (mainly Fast Paxos). Two implementations are covered here: LeaderElection and FastLeaderElection.

We need to know the following points first:
    • How a server knows about the other servers


The number of servers in a ZooKeeper cluster is fixed, and the IP address and port that each server uses for the election are listed in the configuration file.

    • Whether there is any way other than IP and port to identify a server

Each server also has a unique numeric ID. Servers are numbered according to the configuration file. This step is done manually at deployment time: create a file named myid in the directory where the data files are stored, and write the server's own number into it. This number is used to break ties when several servers propose the same value.

    • The necessary condition for becoming leader

A server must get the agreement of N/2 + 1 servers (that is, N/2 + 1 servers must agree that this server holds the largest ZXID of all servers).

    • Whether the election uses UDP or TCP

The election mainly uses UDP, although there is also a TCP implementation. Both implementations described here use UDP.

    • What states a ZooKeeper server can be in

LOOKING: the initialization state

LEADING: the leader state

FOLLOWING: the follower state

    • If all ZXIDs are the same (for example, right after initialization), no server may be able to gather N/2 + 1 votes on ZXID alone

Each server in ZooKeeper has a unique ID, and the IDs are ordered by size. In this case ZooKeeper recommends the server with the largest ID as leader.

    • How the leader knows that the followers are still alive, and how the followers know that the leader is still alive

The leader periodically pings the followers, and the followers periodically ping the leader. When a follower finds that it cannot ping the leader, it changes its state to LOOKING and starts a new round of election.
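
The quorum rule and the tie-break on server ID described above can be sketched in a few lines of Java. This is a minimal illustration and not ZooKeeper's actual code: the class and method names (ElectionRules, quorumSize, hasQuorum, beats) are hypothetical.

```java
// Hypothetical helper illustrating the rules above; not ZooKeeper source code.
final class ElectionRules {
    private ElectionRules() {}

    // Votes needed to win in a cluster of n servers: N/2 + 1 (integer division).
    static int quorumSize(int n) {
        return n / 2 + 1;
    }

    // True when a candidate holding `votes` votes has reached the quorum.
    static boolean hasQuorum(int votes, int n) {
        return votes >= quorumSize(n);
    }

    // True when candidate A is preferred over candidate B:
    // the larger ZXID wins, and equal ZXIDs are resolved by the larger server ID.
    static boolean beats(long zxidA, long idA, long zxidB, long idB) {
        return zxidA > zxidB || (zxidA == zxidB && idA > idB);
    }

    public static void main(String[] args) {
        System.out.println(quorumSize(5));      // prints 3
        System.out.println(hasQuorum(2, 5));    // prints false
        System.out.println(beats(7, 1, 7, 3));  // prints false: same ZXID, smaller ID
    }
}
```

With five servers, for example, a candidate needs three votes, so the cluster can lose two servers and still elect a leader.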

Terminology

ZooKeeper server: one server in a ZooKeeper cluster, hereinafter referred to as the server.

ZXID (ZooKeeper transaction ID): the ZooKeeper transaction ID. It is the key factor in the election, because it determines which server the current server votes for (in other words, it is the value proposed in the election; there is only one such value, and it also serves as an ID).

myid/ID (ZooKeeper server ID): the ZooKeeper server ID. It is also a factor in whether a server can become leader.

Epoch/logicalclock: mainly used to describe whether the leader has changed. Each server starts with an epoch whose initial value is 0. The epoch is incremented by 1 when a new election starts, and incremented by 1 again when the election completes.

Tag/sequencer: the message number.

XID: a randomly generated number that serves the same function as the epoch.

Fast Paxos message flow graph vs. Basic Paxos message flow graph
    • Basic Paxos message flow graph

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(N)             // propose to all servers
   |         |<---------X--X--X       |  |  Promise(N,{Va,Vb,Vc})  // reply to the proposer whether the proposal is accepted (if not, go back to the previous step)
   |         X--------->|->|->|       |  |  Accept!(N,Vn)          // send the acceptance message to everyone
   |         |<---------X--X--X------>|->|  Accepted(N,Vn)         // reply to the proposer that the proposal has been accepted
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |

    • Fast Paxos message flow graph

The election process without conflict:

Client    Leader         Acceptor      Learner
   |         |          |  |  |  |       |  |
   |         X--------->|->|->|->|       |  |  Any(N,I,Recovery)
   |         |          |  |  |  |       |  |
   X------------------->|->|->|->|       |  |  Accept!(N,I,W)   // propose to all servers; each server accepts the proposal after receiving the message
   |         |<---------X--X--X--X------>|->|  Accepted(N,I,W)  // send a message to the proposer that the proposal is accepted
   |<------------------------------------X--X  Response(W)
   |         |          |  |  |  |       |  |

First implementation: LeaderElection

LeaderElection is the simplest implementation of Fast Paxos. When a server starts, it asks every other server which server it wants to vote for. After receiving all the replies, it works out which server has the largest ZXID and sets that server as the one it will vote for next.

Each server has a response thread and an election thread. Let's look at what each thread does.

Response thread

Its main job is to passively accept requests from other servers and to reply according to the server's current state. Every reply carries the server's own ID and the XID. The rest of the reply depends on the state:

LOOKING state:

Information about the server it currently wants to recommend (ID, ZXID)

LEADING state:

Its myid, and the ID of the server it last recommended

FOLLOWING state:

The ID of the current leader, and the last transaction ID (ZXID) it processed
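
The state-dependent reply can be pictured as a simple switch. The article does not show the actual reply format, so everything in this sketch is an assumption made for illustration: the type names (ResponderState, ResponderReply, ResponderSketch), the field layout, and the use of -1 where a state does not report a ZXID.

```java
// Hypothetical sketch of the response thread's reply logic; all names are illustrative.
enum ResponderState { LOOKING, LEADING, FOLLOWING }

final class ResponderReply {
    final long senderId;  // ID of the replying server (every reply carries it)
    final long xid;       // XID copied from the query, so the asker can match the reply
    final long leaderId;  // server ID reported in the reply
    final long zxid;      // ZXID reported in the reply, or -1 if none (assumed sentinel)

    ResponderReply(long senderId, long xid, long leaderId, long zxid) {
        this.senderId = senderId;
        this.xid = xid;
        this.leaderId = leaderId;
        this.zxid = zxid;
    }
}

final class ResponderSketch {
    static ResponderReply reply(ResponderState state, long myId, long xid,
                                long recommendedId, long recommendedZxid,
                                long currentLeaderId, long lastProcessedZxid) {
        switch (state) {
            case LOOKING:
                // Report the server it wants to recommend: (ID, ZXID).
                return new ResponderReply(myId, xid, recommendedId, recommendedZxid);
            case LEADING:
                // Report its own ID plus the ID of the server it last recommended.
                return new ResponderReply(myId, xid, recommendedId, -1L);
            default:
                // FOLLOWING: report the current leader and the last processed ZXID.
                return new ResponderReply(myId, xid, currentLeaderId, lastProcessedZxid);
        }
    }
}
```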

Election thread

The election thread is the thread on the current server that initiates the election. Its main job is to tally the poll results and choose the server to recommend. The election thread first sends a query to all servers (including itself), and each queried server replies according to its current state. When the election thread receives a reply, it verifies that the reply belongs to a query it initiated (by checking that the XID matches), then records the replier's ID (myid) in the list of servers queried in this round, and finally stores the leader information (ID, ZXID) proposed by the replier in the voting record table for this election. Once all servers have been queried, the results are filtered and tallied to work out which server wins this round, and the server with the largest ZXID becomes the server that the current server will recommend (this may be itself or another server, depending on the poll results, but in the first round every server votes for itself). If the winning server has N/2 + 1 votes, the current server sets the winner as its recommended leader and sets its own state according to the winner's information. Each server repeats this process until a leader is elected.
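
One polling round as described above can be sketched as follows. This is an illustration under stated assumptions, not the LeaderElection source: the names PollVote and PollRound are hypothetical, and the replies are assumed to have already passed the XID check.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of tallying one polling round; not ZooKeeper source code.
final class PollVote {
    final long id;    // ID of the server recommended by the replier
    final long zxid;  // ZXID of that recommended server

    PollVote(long id, long zxid) {
        this.id = id;
        this.zxid = zxid;
    }
}

final class PollRound {
    // Among the proposals collected in this round, pick the one with the largest
    // ZXID (ties broken by the larger ID). This is the server to recommend next.
    static PollVote nextRecommendation(List<PollVote> replies) {
        PollVote best = null;
        for (PollVote v : replies) {
            if (best == null || v.zxid > best.zxid
                    || (v.zxid == best.zxid && v.id > best.id)) {
                best = v;
            }
        }
        return best;
    }

    // Returns the ID of the server that received at least N/2 + 1 votes,
    // or -1 if no server reached the quorum in this round.
    static long winner(List<PollVote> replies, int clusterSize) {
        Map<Long, Integer> counts = new HashMap<>();
        for (PollVote v : replies) {
            counts.merge(v.id, 1, Integer::sum);
        }
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            if (e.getValue() >= clusterSize / 2 + 1) {
                return e.getKey();
            }
        }
        return -1L;
    }
}
```

When winner returns -1, the server would adopt nextRecommendation as its vote and poll again, which matches the "repeat until a leader is elected" loop in the text.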

Now that we know what each thread does, let's look at the election process.

    • A server joins during the election

When a server starts, it initiates an election, and the process is driven by its election thread. Each server then learns which server currently has the largest ZXID. If that server did not get N/2 + 1 votes in this round, every server votes for the server with the largest ZXID in the next round. The process repeats until a leader is elected.

    • A server exits during the election

As long as N/2 + 1 servers stay alive there is no problem. If fewer than N/2 + 1 servers survive, no leader can be elected.

    • The leader dies during the election

Once a leader has been elected, the state that each server should take (FOLLOWING) has already been determined. Because the leader has died, we ignore it for now, and the other followers carry on with the normal process. When that process completes, every follower sends a ping message to the leader. If the ping fails, the follower changes its state (FOLLOWING ==> LOOKING) and initiates a new round of election.

    • The leader dies after the election

This is handled in the same way as a leader dying during the election, so it is not described again here.
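
The follower's reaction to a dead leader is a single state transition, sketched below. The state names follow the article; the class FollowerMonitor and the method onPingResult are hypothetical, and how the ping itself is sent is left out.

```java
// Hypothetical sketch: the state names follow the article, the rest is illustrative.
enum PeerState { LOOKING, LEADING, FOLLOWING }

final class FollowerMonitor {
    private PeerState state = PeerState.FOLLOWING;

    PeerState state() {
        return state;
    }

    // Called with the result of each periodic ping to the leader.
    // A failed ping while FOLLOWING moves the server back to LOOKING,
    // which is the trigger for a new round of election.
    void onPingResult(boolean leaderReachable) {
        if (state == PeerState.FOLLOWING && !leaderReachable) {
            state = PeerState.LOOKING;
        }
    }
}
```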

Second implementation: FastLeaderElection

FastLeaderElection is the standard Fast Paxos implementation. A server first proposes to all servers that it become leader. When another server receives the proposal, it resolves any conflict in epoch and ZXID, accepts the proposal, and then sends a message back to the proposer to say that it has accepted the proposal.

Data structures

Local message structure:

static public class Notification {
    long leader;                   // ID of the recommended server
    long zxid;                     // ZXID (ZooKeeper transaction ID) of the recommended server
    long epoch;                    // describes whether the leader has changed (each server starts with a logicalclock whose initial value is 0)
    QuorumPeer.ServerState state;  // current state of the sender
    InetSocketAddress addr;        // sender's IP address
}

Network message structure:

static public class ToSend {
    int type;                      // message type
    long leader;                   // server ID
    long zxid;                     // ZXID of the server
    long epoch;                    // epoch of the server
    QuorumPeer.ServerState state;  // state of the server
    long tag;                      // message number
    InetSocketAddress addr;
}

Server implementation

Each server has a receive thread pool (3 threads) and a send thread pool (3 threads). When no election is in progress, both pools block until a message arrives, at which point they unblock and process it. Each server also has an election thread (the thread that can initiate an election). Let's look at what each thread does:

Passive side, handling received messages (receive thread pool):

Notification: first check whether the ZXID and epoch recommended by the current server are still valid, using the condition (currentServer.epoch <= currentMsg.epoch && (currentMsg.zxid > currentServer.zxid || (currentMsg.zxid == currentServer.zxid && currentMsg.id > currentServer.id))). If the condition holds, the current recommendation is out of date, so update it with the ZXID, epoch, and ID from the message. Then convert the received message into a Notification message, put it in the receive queue, and send an ACK message to the sender.

ACK: put the message number in the ACK queue, then check whether the sender's state is LOOKING. If it is not, a leader has already been elected, so convert the received message into a Notification message and put it in the receive queue.
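
The validity check applied to an incoming Notification translates directly into a boolean method. The condition is taken from the text above; the class name ProposalCheck and the method name messageWins are hypothetical.

```java
// Hypothetical sketch of the check applied to an incoming notification.
final class ProposalCheck {
    private ProposalCheck() {}

    // Returns true when the proposal carried by the message should replace the
    // current server's recommendation: the message is not from an older epoch,
    // and it names a server with a larger ZXID (or the same ZXID and a larger ID).
    static boolean messageWins(long curEpoch, long curZxid, long curId,
                               long msgEpoch, long msgZxid, long msgId) {
        return curEpoch <= msgEpoch
                && (msgZxid > curZxid || (msgZxid == curZxid && msgId > curId));
    }
}
```

For example, with equal epochs and equal ZXIDs, a message recommending server 3 replaces a current recommendation of server 2, but not the other way round.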

Active side, sending messages (send thread pool):

Notification: convert the Notification message to be sent into a ToSend message, send it to the recipient, and wait for the reply. If no reply arrives before the wait times out, retry three times. If there is still no reply after the retries, check whether the current election (epoch) has changed. If it has not changed, put the message back in the send queue, and repeat until a leader is elected or the recipient replies.

ACK: simply send the server's own information to the other party.
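
The retry behaviour of the send side can be sketched as below. This is only one reading of the text: it assumes "three times" means three attempts in total, and it models the network send as a caller-supplied predicate that returns true when an ACK arrived in time. The names SendRetrySketch, deliver, and MAX_ATTEMPTS are hypothetical.

```java
import java.util.Queue;
import java.util.function.LongSupplier;
import java.util.function.Predicate;

// Hypothetical sketch of the send-side retry loop; not ZooKeeper source code.
final class SendRetrySketch {
    static final int MAX_ATTEMPTS = 3;  // assumption: three attempts in total

    private SendRetrySketch() {}

    // Tries to deliver one message. `trySend` returns true when an ACK arrived in time.
    // Returns true if the message was acknowledged. Otherwise, if the election epoch
    // is still the one the message was created in, the message is put back in the
    // send queue so that a later pass tries again.
    static <M> boolean deliver(M msg, long epochAtSend, Predicate<M> trySend,
                               LongSupplier currentEpoch, Queue<M> sendQueue) {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            if (trySend.test(msg)) {
                return true;
            }
        }
        if (currentEpoch.getAsLong() == epochAtSend) {
            sendQueue.add(msg);  // same election still running: try again later
        }
        return false;  // a changed epoch means the message is obsolete and is dropped
    }
}
```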

Election initiator (election thread):

First the server increments its own epoch by 1, then generates Notification messages and puts them in the send queue. One message is generated for each server in the configuration, so that every server receives it. While the current server's state is LOOKING, the thread loops, taking messages from the receive queue, and processes each message according to the sender's state carried in the message.

LOOKING state:

First check whether the epoch in the message is valid, that is, whether it is larger than the current server's epoch. If it is larger, update the epoch, then check whether the (ZXID, ID) in the message is larger than that of the currently recommended server. If so, update the recommended values, put newly generated Notification messages in the send queue, and clear the voting statistics. If the epoch in the message is smaller, do nothing. If the epochs are equal, check the (ZXID, ID) in the message, and if it is larger, update the current server's recommendation and put newly generated Notification messages in the send queue. Finally, put the sender's IP and vote into the statistics table, compute the result, and set the server's state according to the result.

LEADING state:

Put the sender's IP and vote into the statistics table (the table here is a separate one), compute the result, and set the server's state according to the result.

FOLLOWING state:

Put the sender's IP and vote into the statistics table (the table here is a separate one), compute the result, and set the server's state according to the result.
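
The handling of a message from a LOOKING sender can be sketched as follows. This is a simplified reading of the description above, not the FastLeaderElection source. In particular, it assumes that a newer epoch always clears the statistics and triggers a new broadcast, and it replaces the send queue with a mustBroadcast flag. The name LookingHandler and its fields are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the election thread's handling of a LOOKING sender.
final class LookingHandler {
    long epoch;             // this server's logical clock
    long proposedId;        // ID of the server currently recommended
    long proposedZxid;      // ZXID of the server currently recommended
    boolean mustBroadcast;  // set when new notifications should be queued for sending
    final Map<String, Long> votes = new HashMap<>();  // sender address -> recommended ID

    LookingHandler(long epoch, long myId, long myZxid) {
        this.epoch = epoch;
        this.proposedId = myId;      // every server starts by recommending itself
        this.proposedZxid = myZxid;
    }

    private boolean better(long zxid, long id) {
        return zxid > proposedZxid || (zxid == proposedZxid && id > proposedId);
    }

    void onMessage(String senderAddr, long msgEpoch, long msgId, long msgZxid) {
        if (msgEpoch < epoch) {
            return;  // message from an older election round: do nothing
        }
        if (msgEpoch > epoch) {
            epoch = msgEpoch;      // catch up with the newer round
            votes.clear();         // earlier votes belong to the older round
            mustBroadcast = true;  // assumption: always re-announce in a new round
        }
        if (better(msgZxid, msgId)) {
            proposedId = msgId;    // adopt the better candidate
            proposedZxid = msgZxid;
            mustBroadcast = true;
        }
        votes.put(senderAddr, msgId);  // record the vote in the statistics table
    }
}
```

After each call, the votes map is what a quorum check would be run against to decide whether the server moves to LEADING or FOLLOWING.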

Now that we know what each thread does, let's look at the election process, which is the same as in the first implementation.

    • A server joins during the election

When a server starts, it initiates an election, and the process is driven by its election thread. The server tells the other servers its ZXID and epoch, and eventually every server learns which server has the largest ZXID. In the next round each server votes for the server with the largest ZXID. The process repeats until a leader is elected.

    • A server exits during the election

As long as N/2 + 1 servers stay alive there is no problem. If fewer than N/2 + 1 servers survive, no leader can be elected.

    • The leader dies during the election

Once a leader has been elected, the state that each server should take (FOLLOWING) has already been determined. Because the leader has died, we ignore it for now, and the other followers carry on with the normal process. When that process completes, every follower sends a ping message to the leader. If the ping fails, the follower changes its state (FOLLOWING ==> LOOKING) and initiates a new round of election.

    • The leader dies after the election

This is handled in the same way as a leader dying during the election, so it is not described again here.
