How is a leader elected in a ZooKeeper cluster? ZooKeeper offers three algorithms, selected in the configuration file through the "electionAlg" item: 0 corresponds to the LeaderElection algorithm, 1 and 2 correspond to the AuthFastLeaderElection algorithm (without and with authentication, respectively), and 3 corresponds to the FastLeaderElection algorithm. FastLeaderElection is used by default. I have not studied the other two algorithms, so I will not say more about them.
To understand this algorithm, it helps to have some theoretical background in the Paxos algorithm.
1) Data recovery phase
First, each ZooKeeper server reads the data currently stored on its disk. Every piece of data in ZooKeeper has a corresponding ID value, and this value only increases. In other words, the newer the data, the larger its ID.
2) Sending your own vote for the first time
After reading the data, each ZooKeeper server sends out its vote for leader, which contains the following items:
1) The ID of the proposed leader (that is, the server ID written in the configuration file). At the initial stage, each server puts its own ID here, so every server votes for itself as leader.
2) The largest data ID held by the server. The larger this value, the more recent the data stored on that server.
3) The value of the logical clock, which starts at 0 and increases with every election round. This means that (a) within the same election round the value should be the same on all servers, and (b) the larger the logical clock, the more recent the election round.
4) The state of this machine in the current election, which is one of LOOKING, FOLLOWING, OBSERVING, or LEADING. The names are self-explanatory.
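To make the four items concrete, here is a minimal sketch of the data one election message carries. The class name ElectionMessage and its field names are made up for illustration and are not ZooKeeper's own types; the sketch only mirrors the list above.

```java
/** Hypothetical sketch of the data carried by one election message. */
final class ElectionMessage {
    enum State { LOOKING, FOLLOWING, OBSERVING, LEADING }

    final long leader; // ID of the server this sender currently votes for
    final long zxid;   // largest data ID behind that vote (the sender's own at first)
    final long epoch;  // logical clock, identifies the election round
    final State state; // state of the sender in the current election

    ElectionMessage(long leader, long zxid, long epoch, State state) {
        this.leader = leader;
        this.zxid = zxid;
        this.epoch = epoch;
        this.state = state;
    }

    /** First vote of a round: every server proposes itself. */
    static ElectionMessage initialVote(long myId, long myLastZxid, long epoch) {
        return new ElectionMessage(myId, myLastZxid, epoch, State.LOOKING);
    }

    @Override
    public String toString() {
        return "leader=" + leader + ", zxid=" + zxid
                + ", epoch=" + epoch + ", state=" + state;
    }

    public static void main(String[] args) {
        // Server 3, whose newest data ID is 256, starts election round 1.
        System.out.println(initialVote(3, 256, 1));
    }
}
```

Keeping the fields final makes a message an immutable snapshot, so a vote that is changed later is simply sent as a new message.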
After each server has sent the above data to the other servers in the cluster, it also receives data from them and processes it as follows:
1) If the server that sent the data is still in the election phase (LOOKING state), first compare the logical clock values. There are three cases:
a) If the received logical clock is greater than the local logical clock, this is a newer election round. The local logical clock must be updated, and the votes previously collected from other servers must be cleared, because they are no longer valid. The server then decides whether to update its own vote. The decision is based on the proposed leader ID and the largest data ID, in this order of priority: first compare the data IDs, and the larger one wins; if they are equal, compare the leader IDs, and the larger one wins. Finally, the server broadcasts its latest vote (that is, the data items described above) to the other servers. The code is as follows:
    if (n.epoch > logicalclock) {
        // Newer election round: adopt its clock and drop stale votes
        logicalclock = n.epoch;
        recvset.clear();
        if (totalOrderPredicate(n.leader, n.zxid, getInitId(), getInitLastLoggedZxid()))
            updateProposal(n.leader, n.zxid);
        else
            updateProposal(getInitId(), getInitLastLoggedZxid());
        sendNotifications();
    }
The totalOrderPredicate function compares the leader ID and data ID in the received packet with the corresponding values held locally. A return value of true means the local vote needs to be updated, so updateProposal is called to update it.
b) If the received logical clock is smaller than the local logical clock, the sender is in an older election round. In this case it is only necessary to send the local data to the sender.
c) If the logical clocks on both sides are the same, just call totalOrderPredicate to decide whether the local vote needs to be updated. If it is updated, broadcast the latest vote.
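The comparison rule and the three logical clock cases above can be sketched as follows. This is an illustration of the rules as described in the text, not ZooKeeper's source: the class VoteOrder, the method names beats and classify, and the Action enum are all hypothetical.

```java
/** Hypothetical sketch of the vote ordering and the three logical clock cases. */
final class VoteOrder {
    /** What to do with a vote received from a server that is still LOOKING. */
    enum Action { ADOPT_NEWER_ROUND, REPLY_WITH_OWN_VOTE, COMPARE_IN_SAME_ROUND }

    /**
     * True when the received candidate should replace the current one:
     * the larger data ID wins, and on a tie the larger server ID wins.
     */
    static boolean beats(long newId, long newZxid, long curId, long curZxid) {
        return newZxid > curZxid || (newZxid == curZxid && newId > curId);
    }

    /** Maps the received logical clock to case a), b) or c). */
    static Action classify(long receivedEpoch, long localEpoch) {
        if (receivedEpoch > localEpoch) {
            return Action.ADOPT_NEWER_ROUND;      // case a)
        }
        if (receivedEpoch < localEpoch) {
            return Action.REPLY_WITH_OWN_VOTE;    // case b)
        }
        return Action.COMPARE_IN_SAME_ROUND;      // case c)
    }

    public static void main(String[] args) {
        System.out.println(beats(1, 9, 5, 8)); // newer data beats a larger server ID
        System.out.println(beats(2, 8, 1, 8)); // same data, larger server ID wins
        System.out.println(classify(3, 2));
    }
}
```

Comparing the data ID before the server ID is what keeps a server with stale data from winning just because it has a large ID.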
In outline, each incoming message is handled as follows:
1. Determine whether the message comes from an observer. If it does, tell the observer which leader this server currently believes in; otherwise go to step 2.
2. Determine whether the message is a vote; if so, go to step 3.
3. Create a vote from the message.
4. If the current server is in the LOOKING state, it puts the vote into its own ballot box. If the server that sent the vote is also in the LOOKING state and its vote belongs to an older election round, the current server notifies it of the vote for the new round.
5. If the current server is not in the LOOKING state but the server that sent the vote is, the current server tells it the current leader information.
After these three cases have been handled, two more checks are made:
1) The server checks whether it has collected the election status of all servers. If so, it sets its own role (following or leading) according to the election result and exits the election process.
2) Even if it has not heard from all servers, it can check whether the latest proposed leader is supported by more than half of the servers. If so, it tries to receive more data for 200 ms. If no new data arrives, everyone is taken to have accepted this result, so the server likewise sets its role and exits the election process.
The code is as follows:
    /*
     * Only proceed if the vote comes from a replica in the
     * voting view.
     */
    if (self.getVotingView().containsKey(n.sid)) {
        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.epoch));
        // If we have received from all nodes, then terminate
        if ((self.getVotingView().size() == recvset.size())
                && (self.getQuorumVerifier().getWeight(proposedLeader) != 0)) {
            self.setPeerState((proposedLeader == self.getId())
                    ? ServerState.LEADING : learningState());
            leaveInstance();
            return new Vote(proposedLeader, proposedZxid);
        } else if (termPredicate(recvset, new Vote(proposedLeader, proposedZxid, logicalclock))) {
            // Verify if there is any change in the proposed leader
            while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
                if (totalOrderPredicate(n.leader, n.zxid, proposedLeader, proposedZxid)) {
                    recvqueue.put(n);
                    break;
                }
            }
            /*
             * This predicate is true once we don't read any new
             * relevant message from the reception queue
             */
            if (n == null) {
                self.setPeerState((proposedLeader == self.getId())
                        ? ServerState.LEADING : learningState());
                if (LOG.isDebugEnabled()) {
                    LOG.debug("About to leave FLE instance: leader=" + proposedLeader
                            + ", Zxid = " + proposedZxid
                            + ", My id = " + self.getId()
                            + ", My state = " + self.getPeerState());
                }
                leaveInstance();
                return new Vote(proposedLeader, proposedZxid);
            }
        }
    }
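The two ideas behind this termination step, "more than half of the servers support the proposed leader" and "nothing new arrives within the wait window", can be sketched in isolation. The class QuorumCheck and the methods hasMajority and quietFor are hypothetical names. The sketch assumes a plain head count, so it ignores the weights that the getWeight call above hints at, and it simplifies the wait to a single poll.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the majority test and the final wait. */
final class QuorumCheck {
    /**
     * votes maps a voter's server ID to the leader ID it voted for.
     * Returns true when strictly more than half of the cluster backs candidate.
     */
    static boolean hasMajority(Map<Long, Long> votes, long candidate, int clusterSize) {
        long supporters = votes.values().stream()
                .filter(v -> v == candidate)
                .count();
        return supporters > clusterSize / 2;
    }

    /**
     * Returns true when no message arrives within waitMs milliseconds.
     * Simplification: a message that does arrive is consumed here.
     */
    static boolean quietFor(BlockingQueue<?> incoming, long waitMs)
            throws InterruptedException {
        return incoming.poll(waitMs, TimeUnit.MILLISECONDS) == null;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<Long, Long> votes = new HashMap<>();
        votes.put(1L, 3L);
        votes.put(2L, 3L);
        votes.put(3L, 3L);
        // Three of five servers back server 3, which is more than half.
        System.out.println(hasMajority(votes, 3L, 5));
        // An empty queue stays quiet for the whole 200 ms window.
        System.out.println(quietFor(new LinkedBlockingQueue<Object>(), 200));
    }
}
```

The short wait exists because a majority seen at one instant may still be overturned by a better vote that is already in flight.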
2) If the server that sent the data is not in the election state, that is, it is already in the FOLLOWING or LEADING state, make the following two checks:
a) If the logical clock is the same, save the data to recvset. If the sender reports that it is the leader, or if more than half of the servers have elected that leader, set the local role and exit the election process.
b) Otherwise, the message does not match the current logical clock, which means a result has already been reached in another election round. The vote is therefore added to the outofelection collection, and outofelection is used to judge whether the election can be concluded. If it can, the server also saves that logical clock, sets its role, and exits the election process.
The code is as follows:
    if (n.epoch == logicalclock) {
        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.epoch));
        if ((n.state == ServerState.LEADING)
                || (termPredicate(recvset, new Vote(n.leader, n.zxid, n.epoch, n.state))
                        && checkLeader(outofelection, n.leader, n.epoch))) {
            self.setPeerState((n.leader == self.getId())
                    ? ServerState.LEADING : learningState());
            leaveInstance();
            return new Vote(n.leader, n.zxid);
        }
    }
    outofelection.put(n.sid, new Vote(n.leader, n.zxid, n.epoch, n.state));
    if (termPredicate(outofelection, new Vote(n.leader, n.zxid, n.epoch, n.state))
            && checkLeader(outofelection, n.leader, n.epoch)) {
        synchronized (this) {
            logicalclock = n.epoch;
            self.setPeerState((n.leader == self.getId())
                    ? ServerState.LEADING : learningState());
        }
        leaveInstance();
        return new Vote(n.leader, n.zxid);
    }
    break;
    } // closes an enclosing block not shown in this excerpt
    } // closes an enclosing block not shown in this excerpt
Here is a simple example that illustrates the whole election process.
Suppose a ZooKeeper cluster has five servers with IDs 1 to 5, all freshly installed with no historical data, so in terms of stored data they are all the same. Let's see what happens when these servers are started one after another.
1) Server 1 starts. It is the only server running, so the votes it sends out get no response, and it stays in the LOOKING state.
2) Server 2 starts. It communicates with server 1, which started earlier, and they exchange their votes. Since neither has historical data, server 2 wins because its ID is larger. However, fewer than the required majority of servers have agreed to elect it (in this example the majority is 3), so servers 1 and 2 both remain in the LOOKING state.
3) Server 3 starts. By the reasoning above, server 3 becomes the boss among servers 1, 2, and 3. Unlike before, three servers have now elected it, so it becomes the leader of this election.
4) Server 4 starts. By the same reasoning, server 4 should in theory be the boss among servers 1, 2, 3, and 4. But more than half of the servers have already elected server 3, so it can only accept its fate as a follower.
5) Server 5 starts. Same as server 4, it becomes a follower.
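The startup sequence above can be replayed with a tiny simulation. This is a deliberately simplified model rather than the real protocol: it assumes every server holds identical data, that servers start strictly in ID order, and that messages are delivered instantly. The class StartupSimulation and its simulate method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified replay of the five-server startup example. */
final class StartupSimulation {
    /**
     * Servers 1..clusterSize start one by one with identical data.
     * Returns the leader ID known after each startup step, or -1 while
     * every running server is still LOOKING.
     */
    static List<Integer> simulate(int clusterSize) {
        List<Integer> leaderAfterStep = new ArrayList<>();
        int leader = -1;
        for (int started = 1; started <= clusterSize; started++) {
            if (leader == -1) {
                // With equal data, the largest ID among running servers wins the comparison.
                int candidate = started;
                // It only becomes leader once more than half of the cluster backs it.
                if (started > clusterSize / 2) {
                    leader = candidate;
                }
            }
            // Once a leader exists, servers that start later simply follow it.
            leaderAfterStep.add(leader);
        }
        return leaderAfterStep;
    }

    public static void main(String[] args) {
        System.out.println(simulate(5)); // prints [-1, -1, 3, 3, 3]
    }
}
```

The result matches the walkthrough: nothing is decided after servers 1 and 2, server 3 becomes leader as soon as a majority exists, and servers 4 and 5 join as followers even though their IDs are larger.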
Fastleader election algorithm