A comprehensive look at how ZooKeeper works (article from the Baidu Search R&D Department)

Last Update:2018-07-28 Source: Internet

Author: User



ZooKeeper is a distributed, open source coordination service for distributed applications. It provides a simple set of primitives on which distributed applications can build synchronization, configuration maintenance, naming services, and so on. ZooKeeper is a subproject of Hadoop; its history is not recounted here. In distributed applications, engineers often do not use lock mechanisms well, and message-based coordination is unsuitable for some applications, so a reliable, scalable, distributed, configurable coordination mechanism is needed to unify the state of the system. That is the purpose of ZooKeeper. This article briefly analyzes how ZooKeeper works; how to use ZooKeeper is not its focus.

1 Basic concepts of ZooKeeper

1.1 Roles

There are three main categories of roles in Zookeeper, as shown in the following table:

The system model is shown in the figure:

1.2 Design Purpose

1. Eventual consistency: No matter which server a client connects to, it is shown the same view. This is ZooKeeper's most important property.

2. Reliability: Simple, robust, and with good performance. If message m is accepted by one server, it will be accepted by all servers.

3. Timeliness: ZooKeeper guarantees that a client obtains server updates, or learns of a server failure, within a bounded time interval. However, because of network delay and other factors, ZooKeeper cannot guarantee that two clients see newly updated data at the same moment. If you need the latest data, call the sync() interface before reading.

4. Wait-free: Slow or failed clients must not interfere with the requests of fast clients, so that no client is held up by another.

5. Atomicity: Updates can only succeed or fail with no intermediate state.

6. Ordering: This includes global order and partial order. Global order means that if message a is published before message b on one server, then a is published before b on all servers. Partial order means that if message b is published after message a by the same sender, a is ordered before b.

2 Working principle of ZooKeeper

The core of ZooKeeper is atomic broadcast, a mechanism that keeps the servers synchronized. The protocol that implements it is called the Zab protocol. Zab has two modes: recovery mode (leader election) and broadcast mode (synchronization). When the service starts or the leader crashes, Zab enters recovery mode. Recovery mode ends once a leader has been elected and a majority of servers have finished synchronizing their state with the leader. State synchronization guarantees that the leader and the other servers have the same system state.

To guarantee that transactions are ordered consistently, ZooKeeper identifies each transaction with an increasing transaction ID (zxid). Every proposal is stamped with a zxid when it is issued. In the implementation, the zxid is a 64-bit number. Its high 32 bits are the epoch, which identifies whether leadership has changed: each time a leader is elected it gets a new epoch, which marks that leader's term. The low 32 bits are an incrementing counter.
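The bit layout of the zxid can be sketched in a few lines of Python. This is only an illustration of the 32/32 split described above; the function names (`make_zxid`, `zxid_epoch`, `zxid_counter`, `first_zxid_of_new_epoch`) are made up for this example and are not ZooKeeper APIs.

```python
EPOCH_SHIFT = 32                 # the epoch lives in the high 32 bits
COUNTER_MASK = (1 << 32) - 1     # the counter lives in the low 32 bits


def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch and a counter into one 64-bit transaction ID."""
    return (epoch << EPOCH_SHIFT) | (counter & COUNTER_MASK)


def zxid_epoch(zxid: int) -> int:
    """Extract the epoch (the leader's term) from a zxid."""
    return zxid >> EPOCH_SHIFT


def zxid_counter(zxid: int) -> int:
    """Extract the per-epoch counter from a zxid."""
    return zxid & COUNTER_MASK


def first_zxid_of_new_epoch(last_zxid: int) -> int:
    """Hypothetical helper: a new leader bumps the epoch and resets the counter."""
    return make_zxid(zxid_epoch(last_zxid) + 1, 0)
```

Because the epoch occupies the high bits, any zxid issued under a newer leader compares greater than every zxid issued under an older one, which is what lets servers order transactions across leader changes with a plain integer comparison.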

Each server is in one of three states while it works:

LOOKING: the current server does not know who the leader is and is searching for it.

LEADING: the current server is the elected leader.

FOLLOWING: a leader has been elected and the current server synchronizes with it.

2.1 Leader election process

When the leader crashes or loses contact with a majority of followers, ZK enters recovery mode, in which a new leader must be elected and all servers restored to a correct state. ZK has two election algorithms: one based on basic Paxos and one based on fast Paxos. The system's default election algorithm is fast Paxos. The basic Paxos process is introduced first:

1. The election thread is the thread started by the server that initiates the election. Its main function is to tally the votes and select the recommended server.

2. The election thread first sends an inquiry to all servers (including itself).

3. When the election thread receives a reply, it verifies that the reply answers an inquiry it initiated itself (by checking that the zxid matches), obtains the other server's ID (myid) and stores it in the list of servers queried in this round, and finally obtains the leader information (ID, zxid) proposed by the other server and stores it in the voting record table for the current election.

4. After receiving replies from all servers, it works out which server has the largest zxid and sets that server as the one to vote for next.

5. The thread sets the server with the largest zxid as the leader recommended by the current server. If the winning server has obtained n/2 + 1 votes, the recommended leader is set as the winner, and the server sets its own state according to the winner's information. Otherwise, the process continues until a leader is elected.
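Steps 4 and 5 can be sketched as two small functions. This is a simplified, single-process illustration under stated assumptions, not ZooKeeper's election code: the names `pick_candidate` and `tally` are hypothetical, replies are modeled as plain dictionaries, and breaking zxid ties by the larger server ID is an assumption that the text above does not specify.

```python
from collections import Counter
from typing import Dict, Optional


def pick_candidate(last_zxids: Dict[int, int]) -> int:
    """Step 4: choose the server with the largest zxid.

    last_zxids maps server ID -> last zxid reported by that server.
    Ties are broken by the larger server ID (an assumption for this sketch).
    """
    return max(last_zxids, key=lambda sid: (last_zxids[sid], sid))


def tally(votes: Dict[int, int], ensemble_size: int) -> Optional[int]:
    """Step 5: return the winner if it holds at least n/2 + 1 votes.

    votes maps voter ID -> the server ID that voter recommends.
    Returns None when no server has a majority yet, in which case the
    caller would run another round.
    """
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    needed = ensemble_size // 2 + 1
    return candidate if count >= needed else None
```

For example, in a five-server ensemble a candidate needs three votes: `tally({1: 3, 2: 3, 3: 3}, 5)` elects server 3, while two split votes leave the election undecided.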

From this analysis we can conclude that, for the leader to obtain the support of a majority of servers, the total number of servers is best chosen as an odd number 2n+1, and the number of surviving servers must not be less than n+1.
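The majority arithmetic is easy to check directly. The helper names below (`quorum_size`, `tolerated_failures`) are hypothetical and exist only for this illustration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers may fail while a majority still survives."""
    return ensemble_size - quorum_size(ensemble_size)
```

With these definitions, an ensemble of 5 servers (n = 2) needs 3 survivors and tolerates 2 failures. An ensemble of 6 needs 4 survivors and still tolerates only 2 failures, which is why an even-sized ensemble adds a server without adding fault tolerance.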

Each server repeats the above process after it starts. In recovery mode, a server that has just recovered from a crash or has just started restores its data and session information from disk snapshots; ZK writes a transaction log and takes snapshots periodically so that state can be restored during recovery. The flowchart for leader election is as follows:

In the fast Paxos process, a server first proposes itself to all servers as leader. When the other servers receive the proposal, they resolve conflicts on epoch and zxid, accept the proposal, and send a message back to the proposer saying that the proposal is accepted. Repeating this process eventually elects a leader. The flowchart is as follows:

2.2 Synchronization Process

After the leader is elected, ZK enters the state synchronization process:

1. The leader waits for servers to connect.

2. Each follower connects to the leader and sends it its largest zxid.

3. The leader determines the synchronization point from the follower's zxid.

4. After synchronization completes, the leader notifies the follower that it is now in the up-to-date state.

5. Once the follower receives the UPTODATE message, it can again accept client requests and serve them.
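The five steps above can be sketched as an in-memory exchange. This is a minimal sketch under strong assumptions: the follower's log is a prefix of the leader's log (so nothing needs to be truncated), snapshots are ignored, and the message tuples and function names (`entries_to_send`, `synchronize`) are invented for the example.

```python
from typing import List, Tuple

Txn = Tuple[int, str]  # (zxid, operation); a hypothetical log entry shape


def entries_to_send(leader_log: List[Txn], follower_last_zxid: int) -> List[Txn]:
    """Step 3: everything after the follower's largest zxid is the sync point."""
    return [txn for txn in leader_log if txn[0] > follower_last_zxid]


def synchronize(leader_log: List[Txn], follower_log: List[Txn]) -> bool:
    """Bring follower_log up to date in place; return True once UPTODATE arrives."""
    # Step 2: the follower reports its largest zxid (0 if its log is empty).
    last_zxid = follower_log[-1][0] if follower_log else 0

    # Steps 3 and 4: the leader sends the missing entries, then UPTODATE.
    messages = [("TXN", txn) for txn in entries_to_send(leader_log, last_zxid)]
    messages.append(("UPTODATE", None))

    # Step 5: the follower applies the entries and starts serving on UPTODATE.
    up_to_date = False
    for kind, payload in messages:
        if kind == "TXN":
            follower_log.append(payload)
        elif kind == "UPTODATE":
            up_to_date = True
    return up_to_date
```

A follower holding only the first transaction of a three-entry leader log would receive the remaining two entries followed by the UPTODATE message, after which both logs are identical.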

The flowchart looks like this:

2.3 Work Flow

2.3.1 Leader Work Flow

The leader has three main functions: 1. Recover data; 2. Maintain heartbeats with learners, receive learner requests, and determine the message type of each request; 3. Process each learner message according to its type. The main learner message types are PING, REQUEST, ACK, and REVALIDATE.

A PING message carries a learner's heartbeat. A REQUEST message is a proposal sent by a follower, which may be a write request or a synchronization request. An ACK message is a follower's reply to a proposal; once more than half of the followers have acknowledged it, the leader commits the proposal. A REVALIDATE message is used to extend a session's validity period.
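The ACK rule can be illustrated with a small tracker. This is a sketch of the "more than half of the followers" rule as stated above, not ZooKeeper's implementation; `ProposalTracker` and its fields are hypothetical names, and whether the leader's own vote counts toward the majority is deliberately left out.

```python
from typing import Dict, Set


class ProposalTracker:
    """Counts follower ACKs per proposal and reports when to commit."""

    def __init__(self, follower_count: int) -> None:
        self.follower_count = follower_count
        self.acks: Dict[int, Set[str]] = {}   # zxid -> followers that ACKed
        self.committed: Set[int] = set()      # zxids already committed

    def ack(self, zxid: int, follower_id: str) -> bool:
        """Record one ACK; return True only when it triggers the commit."""
        if zxid in self.committed:
            return False                      # late ACK, already committed
        voters = self.acks.setdefault(zxid, set())
        voters.add(follower_id)               # a set ignores duplicate ACKs
        if len(voters) * 2 > self.follower_count:
            self.committed.add(zxid)
            del self.acks[zxid]
            return True
        return False
```

With four followers, the third distinct ACK is the one that triggers the commit; duplicate ACKs from the same follower and ACKs arriving after the commit have no further effect.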
The leader's workflow diagram is shown below. The actual implementation is considerably more complex than the diagram and starts three threads to implement these functions.

2.3.2 Follower Work Flow

The follower has four main functions: 1. Send requests to the leader (PING, REQUEST, ACK, and REVALIDATE messages); 2. Receive and process messages from the leader; 3. Receive client requests and, if a request is a write, forward it to the leader for a vote; 4. Return results to the client.

The follower message loop handles the following messages from the leader:

1. PING message: a heartbeat message.

2. PROPOSAL message: the leader issues a proposal and asks the follower to vote.

3. COMMIT message: information about the most recent proposal on the server side.

4. UPTODATE message: indicates that synchronization is complete.

5. REVALIDATE message: depending on the leader's revalidate result, either close the session being revalidated or allow it to accept messages.

6. SYNC message: returns the sync result to the client. This message is originally initiated by the client to force the latest updates.
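A message loop of this shape can be sketched as a dispatcher over the message type. The sketch below covers only PING, PROPOSAL, COMMIT, and UPTODATE, and leaves REVALIDATE and SYNC out. The `Follower` class, its fields, and the string message kinds are assumptions made for illustration and do not mirror ZooKeeper's classes.

```python
from typing import Dict, List, Optional, Tuple


class Follower:
    """Toy follower that buffers proposals and applies them on COMMIT."""

    def __init__(self) -> None:
        self.pending: Dict[int, str] = {}          # proposed, not yet committed
        self.applied: List[Tuple[int, str]] = []   # committed, in arrival order
        self.serving = False                       # True after UPTODATE

    def handle(self, kind: str, zxid: Optional[int] = None,
               op: Optional[str] = None) -> Optional[str]:
        """Process one leader message; return the reply to send, if any."""
        if kind == "PING":
            return "PING"                  # answer the heartbeat
        if kind == "PROPOSAL":
            self.pending[zxid] = op        # remember it, then vote
            return "ACK"
        if kind == "COMMIT":
            self.applied.append((zxid, self.pending.pop(zxid)))
            return None
        if kind == "UPTODATE":
            self.serving = True            # may accept client requests again
            return None
        raise ValueError(f"unhandled message kind: {kind}")
```

Keeping proposals in a pending map until the matching COMMIT arrives reflects the two-step flow in the list above: the follower votes on a proposal first and applies it only after the leader reports that it has been committed.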

The follower's workflow diagram is shown below. In the actual implementation, the follower uses five threads to implement these functions.

The observer workflow is not described separately. The only difference between the observer workflow and the follower workflow is that an observer does not participate in votes initiated by the leader.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
