[Distributed] ZooKeeper Leader Election

Source: Internet
Author: User

I. Preface

Earlier we studied the details of the ZooKeeper server. For cluster startup, a very important part is leader election, so that is what we study next.

II. Leader election

  2.1 Leader election overview

Leader election is key to ensuring the consistency of distributed data. A server in a ZooKeeper cluster must enter leader election in either of the following two situations.

(1) The server is initializing at startup.

(2) The server loses its connection to the leader while running.

Each situation is analyzed below.

  1. Leader election during server startup

Leader election requires at least two machines; here we use a cluster of three servers as the example. During cluster initialization, when only Server1 has started, it cannot carry out and complete a leader election on its own. Once the second server, Server2, starts, the two machines can communicate with each other, each tries to find a leader, and the leader election process begins. The process is as follows.

  (1) Each server casts a vote. Because this is the initial state, Server1 and Server2 each vote for themselves as leader. Each vote contains the myid and ZXID of the server being nominated, written as (myid, ZXID). Server1's vote is therefore (1, 0) and Server2's vote is (2, 0), and each server sends its vote to the other machines in the cluster.

  (2) Each server receives votes from the other servers. When a server in the cluster receives a vote, it first checks that the vote is valid, for example whether it belongs to the current round and whether it comes from a server in the LOOKING state.

  (3) Each server processes the votes. For every vote it receives, a server compares the other server's vote with its own (a "PK", that is, a head-to-head comparison) using the following rules:

    • Check the ZXID first. The server with the larger ZXID is preferred as leader.

    • If the ZXIDs are equal, compare the myid. The server with the larger myid becomes the leader.

Server1's own vote is (1, 0) and it receives Server2's vote (2, 0). It first compares the ZXIDs, which are both 0, and then the myids. Server2's myid is larger, so Server1 updates its vote to (2, 0) and sends the vote again. Server2 does not need to update its vote; it simply resends its previous vote to all machines in the cluster.

  (4) Each server counts the votes. After each round of voting, a server tallies the votes to see whether more than half of the machines have accepted the same vote. For Server1 and Server2, the tally shows that two machines in the cluster have accepted the vote (2, 0), which is more than half of three, so the leader is considered elected.

  (5) Each server changes its state. Once the leader is determined, every server updates its state: a follower changes to FOLLOWING, and the leader changes to LEADING.
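The five steps above can be sketched in a few lines of Python. This is only an illustration of the (myid, ZXID) comparison and the majority count for the three-machine example, not ZooKeeper's code; the names `beats`, `exchange`, and `elected` are made up for this sketch, and the network, rounds, and validity checks are left out.

```python
from collections import Counter

CLUSTER_SIZE = 3  # the three-machine cluster from the example


def beats(candidate, current):
    """True if vote `candidate` wins the PK: larger ZXID first, then larger myid."""
    return (candidate[1], candidate[0]) > (current[1], current[0])


def exchange(votes):
    """Every started server sees all current votes and keeps the best one."""
    updated = {}
    for server, own in votes.items():
        chosen = own
        for other in votes.values():
            if beats(other, chosen):
                chosen = other
        updated[server] = chosen
    return updated


def elected(votes, cluster_size):
    """Return the leader's myid once more than half of the cluster agrees, else None."""
    vote, count = Counter(votes.values()).most_common(1)[0]
    return vote[0] if count > cluster_size // 2 else None


# Server1 and Server2 have started; each first votes for itself as (myid, ZXID).
votes = {1: (1, 0), 2: (2, 0)}
no_leader_yet = elected(votes, CLUSTER_SIZE)

votes = exchange(votes)
leader = elected(votes, CLUSTER_SIZE)
print(votes, leader)
```

Before the exchange no vote has a majority; afterwards both servers hold (2, 0), which is two out of three machines, so Server2 is the leader.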

  2. Leader election while the servers are running

While ZooKeeper is running, the leader and the non-leader servers each perform their own duties. A non-leader server going down or a new server joining does not affect the leader. But once the leader itself goes down, the whole cluster suspends external service and enters a new round of leader election, whose process is basically the same as the election at startup. Suppose three servers, Server1, Server2, and Server3, are running and the current leader is Server2. If the leader goes down at some moment, a leader election begins. The process is as follows.

(1) Change state. After the leader goes down, the remaining non-observer servers change their state to LOOKING and enter the leader election process.

(2) Each server casts a vote. While running, the ZXID on each server may differ. Suppose Server1's ZXID is 123 and Server3's ZXID is 122. In the first round, Server1 and Server3 each vote for themselves, producing the votes (1, 123) and (3, 122), and each sends its vote to all machines in the cluster.

(3) Receive votes from the other servers. Same as at startup.

(4) Process the votes. Same as at startup. Because Server1 has the larger ZXID, Server1 becomes the leader.

(5) Count the votes. Same as at startup.

(6) Change the server state. Same as at startup.
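To see why Server1 wins here even though Server3 has the larger myid, the comparison can be written out as below. This is a minimal sketch; `pk_winner` is a name invented for the illustration.

```python
def pk_winner(votes):
    """Pick the winning (myid, ZXID) vote: larger ZXID first, then larger myid."""
    return max(votes, key=lambda vote: (vote[1], vote[0]))


# Server2 (the old leader) is down; Server1 and Server3 vote for themselves.
first_round = [(1, 123), (3, 122)]
winner = pk_winner(first_round)

# With equal ZXIDs the larger myid would win instead.
tie_winner = pk_winner([(1, 123), (3, 123)])
print(winner, tie_winner)
```

The ZXID is checked before the myid, so the server with the newer data wins; the myid only breaks ties.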

  2.2 Analysis of the leader election algorithm

Since version 3.4.0, ZooKeeper retains only the TCP version of the FastLeaderElection algorithm. When a machine enters leader election, the cluster may be in one of the following two states:

· A leader already exists in the cluster.

· No leader exists in the cluster.

The first case usually arises when a machine starts late: the cluster was already working normally before it started. When such a machine tries to elect a leader, it is told who the current leader is, and it only needs to connect to the leader machine and synchronize its state. The case where no leader exists is more complex and involves the following steps.

(1) First vote. Whatever triggered the election, every machine in the cluster is now trying to elect a leader, that is, it is in the LOOKING state, and a LOOKING machine sends a message, called a vote, to all other machines. A vote contains the SID (the server's unique identifier) and the ZXID (transaction ID) of the nominated server and is written (SID, ZXID). Suppose the ZooKeeper cluster consists of 5 machines with SIDs 1, 2, 3, 4, 5 and ZXIDs 9, 9, 9, 8, 8 respectively, and the machine with SID 2 is the leader. At some moment the machines with SIDs 1 and 2 fail, so the cluster starts a leader election. In the first vote each machine votes for itself, so the votes cast by the machines with SIDs 3, 4, and 5 are (3, 9), (4, 8), and (5, 8).

(2) Changing the vote. After casting its own vote, each machine also receives votes from the other machines and processes them according to certain rules. This processing is the core of the whole leader election algorithm. The terms used are defined below.

    vote_sid: the SID of the server nominated as leader in the received vote.

    vote_zxid: the ZXID of the server nominated as leader in the received vote.

    self_sid: the current server's own SID.

    self_zxid: the current server's own ZXID.

Each received vote is processed by comparing (vote_sid, vote_zxid) with (self_sid, self_zxid).

Rule one: if vote_zxid is greater than self_zxid, accept the received vote and send that vote out again.

Rule two: if vote_zxid is less than self_zxid, keep your own vote and change nothing.

Rule three: if vote_zxid equals self_zxid and vote_sid is greater than self_sid, accept the received vote and send that vote out again.

Rule four: if vote_zxid equals self_zxid and vote_sid is less than self_sid, keep your own vote and change nothing.

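Applying these rules to the example cluster gives the following changes. Server 3 receives (4, 8) and (5, 8); both have a smaller ZXID than its own, so by rule two it keeps (3, 9). Servers 4 and 5 each receive (3, 9), whose ZXID is larger than their own, so by rule one they change their votes to (3, 9) and send the new vote out again.

(3) Determining the leader. After the second round of voting, each machine in the cluster again receives the votes of the other machines and tallies them. If a machine finds that more than half of the machines have cast the same vote, the machine whose SID appears in that vote is the leader. Here Server3 becomes the leader.

As these rules show, the newer the data on a server (the larger its ZXID), the more likely it is to become leader, and the better the guarantee that data can be recovered. If the ZXIDs are equal, the server with the larger SID has the better chance.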
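The four rules and the five-machine example can be checked with a short Python sketch. It is an illustration only: `handle_vote` is a hypothetical helper, and it compares each received vote against the vote the server currently holds (which starts out as its own (SID, ZXID)), without any real messaging.

```python
from collections import Counter

ENSEMBLE_SIZE = 5  # SIDs 1 to 5; machines 1 and 2 have failed


def handle_vote(current, received):
    """Apply rules one to four to votes written as (sid, zxid).

    Returns (vote to hold afterwards, whether it changed and must be resent).
    """
    self_sid, self_zxid = current
    vote_sid, vote_zxid = received
    if vote_zxid > self_zxid:   # rule one
        return received, True
    if vote_zxid < self_zxid:   # rule two
        return current, False
    if vote_sid > self_sid:     # rule three
        return received, True
    return current, False       # rule four (also covers an identical vote)


first_ballot = {3: (3, 9), 4: (4, 8), 5: (5, 8)}
held = dict(first_ballot)
for sid in held:
    for sender, vote in first_ballot.items():
        if sender != sid:
            held[sid], _ = handle_vote(held[sid], vote)

vote, count = Counter(held.values()).most_common(1)[0]
leader = vote[0] if count > ENSEMBLE_SIZE // 2 else None
print(held, leader)
```

All three surviving machines end up holding (3, 9). Three votes is more than half of five, so the machine with SID 3 is the leader.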

  2.3 Leader election implementation details

  1. Server states

A server has four states: LOOKING, FOLLOWING, LEADING, and OBSERVING.

  LOOKING: looking for a leader. A server in this state assumes that the cluster currently has no leader and therefore needs to enter leader election.

  FOLLOWING: follower state. The current server's role is follower.

  LEADING: leader state. The current server's role is leader.

  OBSERVING: observer state. The current server's role is observer.

  2. Vote data structure

Each vote contains two basic pieces of information, the SID and ZXID of the nominated server. In ZooKeeper, a vote (Vote) contains the following fields.

  id: the SID of the nominated leader.

  zxid: the transaction ID of the nominated leader.

  electionEpoch: a logical clock used to determine whether multiple votes belong to the same election round. The value is an auto-incrementing sequence on each server and is incremented by 1 each time the server enters a new round of voting.

  peerEpoch: the epoch of the nominated leader.

  state: the current state of the server.
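As a rough picture of this structure, the fields can be mirrored in a small Python dataclass. This is a sketch and not ZooKeeper's Vote class: the snake_case field names and the helper `same_round` are assumptions made for the illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ServerState(Enum):
    LOOKING = "LOOKING"
    FOLLOWING = "FOLLOWING"
    LEADING = "LEADING"
    OBSERVING = "OBSERVING"


@dataclass(frozen=True)
class Vote:
    id: int              # SID of the nominated leader
    zxid: int            # transaction ID of the nominated leader
    election_epoch: int  # logical clock: which election round the vote belongs to
    peer_epoch: int      # epoch of the nominated leader
    state: ServerState = ServerState.LOOKING  # state of the server that sent the vote


def same_round(a, b):
    """Two votes can only be compared if they belong to the same election round."""
    return a.election_epoch == b.election_epoch


vote_a = Vote(id=3, zxid=9, election_epoch=1, peer_epoch=1)
vote_b = Vote(id=4, zxid=8, election_epoch=2, peer_epoch=1)
print(same_round(vote_a, vote_a), same_round(vote_a, vote_b))
```

Making the dataclass frozen keeps votes hashable, so they can be counted or used as dictionary keys when tallying.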

  3. QuorumCnxManager: network I/O

During startup, each server starts a QuorumCnxManager, which handles the low-level network communication between servers during leader election.

(1) Message queues. QuorumCnxManager maintains a set of queues internally for received messages, messages waiting to be sent, and the message senders. Apart from the receive queue, the queues are grouped by SID into collections of queues. For example, if the cluster has three machines besides this one, a separate send queue is created for each of those three machines, and the queues do not interfere with one another.

    recvQueue: the message receive queue, which holds messages received from other servers.

    queueSendMap: the message send queues, which hold messages waiting to be sent, grouped by SID.

    senderWorkerMap: the collection of senders. Each SendWorker is a message sender that corresponds to one remote ZooKeeper server and is responsible for sending messages to it; the senders are also grouped by SID.

    lastMessageSent: the most recently sent messages, one kept for each SID.

(2) Establishing connections. To exchange votes, every pair of machines in the ZooKeeper cluster needs a network connection. At startup, QuorumCnxManager creates a ServerSocket that listens on the leader election port (conventionally 3888). Once it is listening, ZooKeeper can keep accepting connection requests from other servers and handles each TCP connection request as it arrives. To avoid creating duplicate TCP connections between two machines, ZooKeeper only lets the server with the larger SID initiate a connection to another machine; otherwise the connection is dropped. When a server receives a connection request, it compares its own SID with that of the remote server to decide whether to accept. If it finds that its own SID is larger, it closes the incoming connection and then actively connects to the remote server itself. Once a connection is established, a message sender (SendWorker) and a message receiver (RecvWorker) are created and started for the remote server's SID.

(3) Receiving and sending messages. Receiving: RecvWorker handles message reception. Because ZooKeeper assigns a separate RecvWorker to each remote server, each RecvWorker only needs to keep reading messages from its TCP connection and save them to recvQueue. Sending: because ZooKeeper likewise assigns a separate SendWorker to each remote server, each SendWorker only needs to keep taking a message from its send queue and sending it, also storing the message in lastMessageSent. If a SendWorker finds that the send queue for its server is empty, it takes the most recently sent message from lastMessageSent and sends it again. This covers the case where the receiver went down just before or just after receiving the message, so that the message was never processed correctly. ZooKeeper also guarantees that the receiver handles duplicate messages correctly.
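The connection rule and the per-SID queues described above can be modelled without any sockets. The sketch below is a toy under stated assumptions: `handle_incoming` and `CnxManagerSketch` are names invented here, a plain list stands in for the TCP link, and there are no threads, so it only shows the bookkeeping, not the real QuorumCnxManager.

```python
from collections import defaultdict, deque


def handle_incoming(my_sid, remote_sid):
    """What a listening server does with a connection opened by remote_sid."""
    if my_sid > remote_sid:
        return "close_and_dial_back"  # the larger SID must be the initiator
    return "accept"


class CnxManagerSketch:
    """Toy stand-in for the queues described above; no sockets or threads."""

    def __init__(self):
        self.recv_queue = deque()                  # one shared receive queue
        self.queue_send_map = defaultdict(deque)   # one send queue per remote SID
        self.last_message_sent = {}                # last message sent per remote SID

    def to_send(self, sid, message):
        self.queue_send_map[sid].append(message)

    def send_worker_step(self, sid, wire):
        """One SendWorker iteration; `wire` is a list standing in for the TCP link."""
        queue = self.queue_send_map[sid]
        if queue:
            message = queue.popleft()
            self.last_message_sent[sid] = message
        elif sid in self.last_message_sent:
            message = self.last_message_sent[sid]  # queue empty: resend the last one
        else:
            return
        wire.append(message)

    def recv_worker_step(self, message):
        """One RecvWorker iteration: store what was read from the link."""
        self.recv_queue.append(message)


manager = CnxManagerSketch()
wire_to_2 = []
manager.to_send(2, "vote (3, 9)")
manager.send_worker_step(2, wire_to_2)   # sends the queued vote
manager.send_worker_step(2, wire_to_2)   # queue empty, so the vote is resent
print(handle_incoming(3, 1), handle_incoming(1, 3), wire_to_2)
```

Because the message may be resent, the receiving side has to tolerate duplicates, which is the guarantee mentioned at the end of the paragraph above.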

  4. FastLeaderElection: the core of the election algorithm

  • External vote: a vote sent by another server.

  • Internal vote: the server's own current vote.

  • Election round: the round of the ZooKeeper server's leader election, that is, logicalclock.

  • PK: comparing the internal vote with an external vote to decide whether the internal vote needs to change.

  (1) Vote management

  • sendqueue: the vote send queue, which holds votes waiting to be sent.

  • recvqueue: the vote receive queue, which holds the external votes received.

  • WorkerReceiver: the vote receiver. It continuously takes election messages from other servers out of QuorumCnxManager, converts each one into a vote, and saves it to recvqueue. If during this process it finds that the external vote's election round is smaller than the current server's, it ignores the external vote and immediately sends out its own internal vote.

  • WorkerSender: the vote sender. It continuously takes votes waiting to be sent from sendqueue and passes them to the underlying QuorumCnxManager.

  (2) Algorithm core

The figure (not reproduced here) shows how the FastLeaderElection module interacts with the underlying network I/O. The basic leader election flow is as follows.

1. Increment the election round. ZooKeeper requires all valid votes to be in the same round, so when a new round of voting starts, the server first increments logicalclock.

2. Initialize the vote. Before a new round of voting, each server initializes its own vote; in this phase each server nominates itself as leader.

3. Send the initial vote. After initializing the vote, the server casts its first vote: ZooKeeper puts the newly initialized vote into sendqueue, and the sender WorkerSender sends it out.

4. Receive external votes. Each server keeps taking external votes from recvqueue. If a server finds that it cannot get any external vote, it immediately checks whether it still has valid connections to the other servers in the cluster. If not, it establishes the connections at once; if the connections already exist, it sends its current internal vote again.

5. Check the election round. After sending the initial vote, the server starts processing external votes, and it treats each external vote differently depending on its election round.

    • The external vote's election round is greater than the internal vote's. If the server finds that its own round lags behind that of the server that sent the vote, it immediately updates its own round (logicalclock), clears all the votes it has collected, and then PKs its initialized vote against the external vote to decide whether to change its internal vote. Finally it sends out the internal vote.

    • The external vote's election round is less than the internal vote's. If the external vote's round lags behind the server's own round, ZooKeeper simply ignores the external vote, does nothing further, and returns to step 4.

    • The external vote's election round equals the internal vote's. The vote PK can begin.

6. Vote PK. During the PK, the internal vote must be changed if any of the following conditions is met.

· If the election round of the leader nominated in the external vote is greater than that of the internal vote, the vote must be changed.

· If the election rounds are equal, the ZXIDs are compared; if the external vote's ZXID is greater, the vote must be changed.

· If the ZXIDs are also equal, the SIDs are compared; if the external vote's SID is greater, the vote must be changed.

7. Change the vote. If the PK determines that the external vote is better than the internal vote, the server changes its vote, that is, it overwrites the internal vote with the information from the external vote, and then sends out the changed internal vote.

8. Archive the vote. Whether or not the vote was changed, the external vote just received is put into the vote collection recvset for archiving. recvset records all external votes the current server has received in this round of leader election, keyed by the SID of the sending server, for example {(1, vote1), (2, vote2), ...}.

9. Count the votes. Once the vote is archived, the server tallies the votes to see whether more than half of the servers in the cluster have endorsed the current internal vote. If so, voting ends. Otherwise, return to step 4.

10. Update the server state. Once voting can end, the server updates its state. It first checks whether the leader endorsed by more than half of the servers is itself. If so, it sets its state to LEADING; if not, it sets its state to FOLLOWING or OBSERVING depending on its role.

These 10 steps are the core of FastLeaderElection. Steps 4 to 9 repeat for several rounds until a leader is elected.
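To tie the ten steps together, here is a simplified, single-threaded Python simulation. It is a sketch under several assumptions and not the FastLeaderElection source: the class `Peer` and function `run_election` are invented for the example, votes are modelled as (epoch, zxid, sid) tuples so that tuple comparison performs the PK of step 6, every notification is delivered to every peer including the sender, and step 4's connection checks, timeouts, and observers are omitted.

```python
from collections import deque


class Peer:
    """One LOOKING server. Votes are (epoch, zxid, sid) tuples, so tuple
    comparison performs the PK in the order given in step 6."""

    def __init__(self, sid, zxid, epoch, ensemble_size):
        self.sid, self.zxid, self.epoch = sid, zxid, epoch
        self.ensemble_size = ensemble_size
        self.logicalclock = 0
        self.state = "LOOKING"
        self.vote = None
        self.recvset = {}

    def own_vote(self):
        return (self.epoch, self.zxid, self.sid)

    def notification(self):
        return (self.sid, self.logicalclock, self.vote)

    def start_election(self):
        self.logicalclock += 1          # step 1
        self.vote = self.own_vote()     # step 2
        self.recvset = {}
        return self.notification()      # step 3

    def receive(self, sender, election_round, external):
        """Steps 5 to 10 for one external vote; returns a vote to resend or None."""
        resend = None
        if election_round < self.logicalclock:      # step 5: stale, ignore
            return None
        if election_round > self.logicalclock:      # step 5: we are behind
            self.logicalclock = election_round
            self.recvset = {}
            self.vote = max(external, self.own_vote())
            resend = self.notification()
        elif external > self.vote:                  # steps 6 and 7
            self.vote = external
            resend = self.notification()
        self.recvset[sender] = external             # step 8
        agreeing = sum(1 for v in self.recvset.values() if v == self.vote)
        if agreeing > self.ensemble_size // 2:      # step 9
            leader_sid = self.vote[2]               # step 10
            self.state = "LEADING" if leader_sid == self.sid else "FOLLOWING"
        return resend


def run_election(peers):
    """Deliver every notification to every peer (itself included) until quiet."""
    pending = deque(peer.start_election() for peer in peers)
    while pending:
        sender, election_round, vote = pending.popleft()
        for peer in peers:
            reply = peer.receive(sender, election_round, vote)
            if reply is not None:
                pending.append(reply)
    return {peer.sid: peer.state for peer in peers}


peers = [Peer(3, 9, 1, 5), Peer(4, 8, 1, 5), Peer(5, 8, 1, 5)]
states = run_election(peers)
print(states)
```

Running it on the earlier five-machine example (only SIDs 3, 4, and 5 alive) ends with server 3 in LEADING and servers 4 and 5 in FOLLOWING, matching the analysis in section 2.2.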

III. Summary

Writing this post taught me the specifics of leader election, which lays a good foundation for the code analysis to come. Thank you all for reading ~
