Principles of ZooKeeper

Source: Internet
Author: User
Document directory
  • 2.3.1 leader Workflow
  • 2.3.2 follower Workflow

ZooKeeper is an open-source coordination service for distributed applications. It exposes a simple set of primitives on which distributed applications can build synchronization, configuration maintenance, and naming services. ZooKeeper started as a subproject of Hadoop; its history is not covered here. In distributed applications, locking mechanisms are hard for engineers to use correctly, and message-based coordination does not suit every application, so a reliable, scalable, distributed, and configurable coordination mechanism is needed to keep the state of the system consistent. That is the purpose of ZooKeeper. This article briefly analyzes how ZooKeeper works; it does not focus on how to use it.


1 Basic Concepts of ZooKeeper

1.1 Roles

ZooKeeper has three types of roles: the leader, learners (followers and observers), and clients.

System Model:

1.2 Design goals

1. Eventual consistency: whichever server a client connects to, it is shown the same view. This is ZooKeeper's most important property.

2. Reliability: ZooKeeper is simple, robust, and performs well. If a message m is accepted by one server, it will be accepted by all servers.

3. Timeliness: ZooKeeper guarantees that a client learns of server updates or server failures within a bounded time interval. Because of network latency and similar factors, however, it cannot guarantee that two clients see newly updated data at the same moment. If you need the latest data, call sync() before reading.

4. Wait-free: slow or failed clients must not interfere with the requests of fast clients, so no client is held up waiting on another.

5. Atomicity: an update either succeeds or fails; there is no intermediate state.

6. Ordering: this covers global order and partial order. Global order means that if message A is published before message B on one server, A is published before B on all servers. Partial order means that if message B is published after message A by the same sender, A is placed before B.

2 Working Principle of ZooKeeper

The core of ZooKeeper is atomic broadcast, which keeps the servers synchronized. The protocol that implements it is called Zab. Zab has two modes: recovery mode (leader election) and broadcast mode (synchronization). Zab enters recovery mode when the service starts or after the leader crashes. Recovery mode ends once a leader has been elected and a majority of servers have synchronized their state with the leader. State synchronization ensures that the leader and the other servers hold the same system state.

To keep transactions in a consistent order, ZooKeeper identifies each transaction with a monotonically increasing transaction ID (zxid). Every proposal is stamped with a zxid when it is issued. In the implementation, a zxid is a 64-bit number. Its high 32 bits hold the epoch, which identifies changes of leadership: each time a leader is elected it gets a new epoch that marks its period of rule. The low 32 bits are an incrementing counter.
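The 64-bit layout described above can be illustrated with a few lines of bit arithmetic. This is a minimal sketch, not ZooKeeper's own code; the function names `make_zxid`, `split_zxid`, and `first_zxid_of_next_epoch` are hypothetical.

```python
from typing import Tuple

EPOCH_SHIFT = 32                 # the epoch lives in the high 32 bits
COUNTER_MASK = (1 << 32) - 1     # the counter lives in the low 32 bits


def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch and a counter into one 64-bit zxid."""
    if not (0 <= epoch <= COUNTER_MASK and 0 <= counter <= COUNTER_MASK):
        raise ValueError("epoch and counter must each fit in 32 bits")
    return (epoch << EPOCH_SHIFT) | counter


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Return (epoch, counter) for a zxid."""
    return zxid >> EPOCH_SHIFT, zxid & COUNTER_MASK


def first_zxid_of_next_epoch(zxid: int) -> int:
    """A newly elected leader starts a new epoch with the counter reset to 0."""
    epoch, _ = split_zxid(zxid)
    return make_zxid(epoch + 1, 0)
```

Because the epoch occupies the high bits, any zxid from a later epoch compares greater than every zxid from an earlier one, which is what lets servers order transactions across leader changes with a plain integer comparison.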

Each server is in one of three states while it runs:

  • Looking: The current server does not know who the leader is.
  • Leading: The current server is the elected leader.
  • Following: A leader has been elected and the current server synchronizes with it.

2.1 Leader election process

When the leader crashes or loses contact with a majority of its followers, ZooKeeper enters recovery mode, in which a new leader must be elected and all servers restored to a correct state. ZooKeeper has two election algorithms: one based on basic Paxos and one based on fast Paxos. The default is the fast Paxos variant. The basic Paxos process is described first:

  1. The election thread is the thread in which the current server initiates the election. Its main job is to tally the votes and select the server to recommend;
  2. The election thread first sends a query to all servers, including itself;
  3. When a reply arrives, the election thread checks that it answers the thread's own query (by checking that the zxid matches), records the responder's ID (myid) in the list of servers queried in this round, and then stores the leader the responder proposes (ID, zxid) in the voting record for this election;
  4. After replies have arrived from all servers, the thread determines which server has the largest zxid and sets that server as the one to vote for in the next round;
  5. The thread sets the server with the largest zxid as the leader this server will recommend. If the server that wins this round has received votes from more than half of the servers (n/2 + 1), it becomes the recommended leader and the current server sets its own state according to the winner's information. Otherwise the process repeats until a leader is elected.
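Steps 3 to 5 above can be sketched as a single tallying function. This is a simplified illustration under stated assumptions, not the actual election code: replies are modelled as a dictionary from the responder's myid to the (leader id, zxid) pair it proposes, ties on zxid are broken by the larger server id (an assumption of this sketch), and the name `tally_round` is hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

Vote = Tuple[int, int]  # (proposed leader id, proposed leader zxid)


def tally_round(replies: Dict[int, Vote],
                ensemble_size: int) -> Tuple[Optional[int], Vote]:
    """Tally one round of replies.

    Returns (elected, next_vote): `elected` is the leader id if some
    candidate holds a majority, otherwise None; `next_vote` is the
    candidate with the largest zxid, to be proposed in the next round.
    """
    if not replies:
        raise ValueError("no replies to tally")

    # Step 4: the candidate with the largest zxid becomes the next vote.
    # Tie-break on the larger server id (assumption of this sketch).
    next_vote = max(replies.values(), key=lambda v: (v[1], v[0]))

    # Step 5: count how many servers currently propose each candidate.
    counts = Counter(leader for leader, _ in replies.values())
    winner, votes = counts.most_common(1)[0]

    quorum = ensemble_size // 2 + 1
    elected = winner if votes >= quorum else None
    return elected, next_vote
```

In a three-server ensemble where every server initially votes for itself, the first round elects nobody but yields the next vote; once two servers propose the same candidate, that candidate holds n/2 + 1 votes and the election ends.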

From this process we can conclude that, for the leader to win the support of a majority of servers, the total number of servers should be an odd number, 2n + 1, and at least n + 1 of them must be alive.
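The arithmetic behind this sizing rule is short enough to write out. The sketch below uses hypothetical helper names and simply computes the majority size and the number of failures an ensemble can survive.

```python
def quorum_size(n_servers: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return n_servers // 2 + 1


def tolerated_failures(n_servers: int) -> int:
    """How many servers may fail while a majority is still alive."""
    return n_servers - quorum_size(n_servers)
```

For 5 servers the quorum is 3 and 2 failures are tolerated; for 6 servers the quorum rises to 4 but still only 2 failures are tolerated. Adding a sixth server therefore buys no extra fault tolerance, which is why odd sizes of the form 2n + 1 are the usual choice.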

Every server repeats this process after it starts. In recovery mode, a server that is recovering from a crash or has just started restores its data and session information from the snapshot on disk. ZooKeeper records a transaction log and takes snapshots periodically so that state can be restored on recovery. The flowchart of the leader election is as follows:

In the fast Paxos process, a server first proposes itself as leader to all servers. When another server receives the proposal, it resolves any epoch and zxid conflict, accepts the proposal, and replies to the proposer that it has accepted. This repeats until a leader is elected. The flowchart is as follows:
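As a rough illustration of the conflict-resolution step in the fast Paxos process, a server can compare an incoming proposal with the one it currently supports and keep whichever ranks higher. The text above only says that epoch and zxid conflicts are resolved; comparing epoch first, then zxid, and finally breaking ties on server id is an assumption of this sketch, and the `Vote` type and function names are hypothetical.

```python
from typing import NamedTuple


class Vote(NamedTuple):
    epoch: int       # election epoch of the proposed leader
    zxid: int        # last zxid seen by the proposed leader
    server_id: int   # myid of the proposed leader


def supersedes(new: Vote, current: Vote) -> bool:
    """True if `new` should replace `current`.

    Ordering: higher epoch wins, then higher zxid, then higher
    server id (the id tie-break is an assumption of this sketch).
    """
    return (new.epoch, new.zxid, new.server_id) > (
        current.epoch, current.zxid, current.server_id)


def receive(current: Vote, incoming: Vote) -> Vote:
    """Return the vote this server supports after seeing `incoming`."""
    return incoming if supersedes(incoming, current) else current
```

Since every server applies the same total order, repeated exchanges of proposals converge on a single candidate.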

2.2 Synchronization process

After a leader has been elected, ZooKeeper enters the state synchronization process.

  1. The leader waits for servers to connect;
  2. Each follower connects to the leader and sends it the largest zxid it has seen;
  3. The leader determines the synchronization point from the follower's zxid;
  4. When synchronization is complete, the leader notifies the follower that it is now up to date (uptodate);
  5. After the follower receives the uptodate message, it can again accept client requests and serve them.
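Step 3 of the list above, finding the synchronization point, can be sketched as a lookup in the leader's ordered log. This is a deliberately simplified model: the log is a sorted list of committed zxids, the function name `missing_transactions` is hypothetical, and the case where a follower holds transactions the leader does not have is ignored.

```python
import bisect
from typing import List


def missing_transactions(leader_log: List[int],
                         follower_last_zxid: int) -> List[int]:
    """Return the zxids the follower still needs, in order.

    `leader_log` must be sorted ascending. The synchronization point
    is the first entry strictly greater than the follower's last zxid.
    """
    start = bisect.bisect_right(leader_log, follower_last_zxid)
    return leader_log[start:]
```

An empty result means the follower already matches the leader, so the leader can send the uptodate notification immediately.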

The flowchart is as follows:

2.3 Workflow

2.3.1 Leader workflow

The leader has three main functions:

  1. Restore data;
  2. Maintain heartbeats with learners, receive learner requests, and determine the message type of each request;
  3. Process each message according to its type. Learner messages are of four types: Ping message, request message, ACK message, and revalidate message.

The Ping message carries the learner's heartbeat. The request message is a proposal sent by a follower, and covers write requests and sync requests. The ACK message is a follower's reply to a proposal; once more than half of the followers have acknowledged it, the leader commits the proposal. The revalidate message extends the validity period of a session.
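The ACK handling described above amounts to counting acknowledgements per proposal and committing once a majority is reached. The sketch below is an illustration only: the `Proposal` class is hypothetical, and treating the voter set as a fixed set of server ids (with acknowledgements from non-voters such as observers ignored) is an assumption.

```python
from typing import Iterable, Set


class Proposal:
    """Tracks acknowledgements for one proposal (illustrative only)."""

    def __init__(self, zxid: int, voters: Iterable[int]) -> None:
        self.zxid = zxid
        self.voters: Set[int] = set(voters)  # ids of servers allowed to vote
        self.acks: Set[int] = set()
        self.committed = False

    def ack(self, server_id: int) -> bool:
        """Record an ACK; return True once the proposal is committed."""
        if server_id in self.voters:         # ACKs from non-voters are ignored
            self.acks.add(server_id)
        if not self.committed and len(self.acks) > len(self.voters) // 2:
            self.committed = True            # more than half have acknowledged
        return self.committed
```

Using a set for `acks` means a duplicate ACK from the same server cannot push a proposal over the threshold.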
The leader workflow is shown in the following figure. The actual implementation is considerably more complex and uses three threads.

2.3.2 Follower workflow

The follower has four main functions:

  1. Send requests to the leader (Ping, request, ACK, and revalidate messages);
  2. Receive and process messages from the leader;
  3. Receive client requests. If a request is a write request, forward it to the leader for voting;
  4. Return the result to the client.

The follower processes the following types of messages from the leader in a loop:

  1. PingMessage: a heartbeat message;
  2. ProposalMessage: a proposal initiated by the leader, on which the follower must vote;
  3. CommitMessage: carries information about the server's latest proposal;
  4. uptodateMessage: indicates that synchronization is complete;
  5. revalidateMessage: depending on the leader's revalidate result, close the session awaiting revalidation or allow it to accept messages;
  6. syncMessage: return the sync result to the client. This message is originally initiated by the client to force an update to the latest state.
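The message loop above can be pictured as a dispatch on message type. The following sketch covers only four of the six types (revalidate and sync are omitted for brevity); the `Follower` class, the string message names, and the return values are all hypothetical and do not mirror ZooKeeper's real packet format.

```python
from typing import Dict, List, Optional, Tuple


class Follower:
    """Minimal model of how a follower reacts to leader messages."""

    def __init__(self) -> None:
        self.pending: Dict[int, str] = {}            # zxid -> proposed change
        self.committed: List[Tuple[int, str]] = []   # applied changes, in order
        self.serving = False                         # True after UPTODATE

    def handle(self, msg_type: str, zxid: Optional[int] = None,
               payload: Optional[str] = None) -> Optional[str]:
        """Process one leader message; return the reply type, if any."""
        if msg_type == "PING":
            return "PING"                            # answer the heartbeat
        if msg_type == "PROPOSAL":
            self.pending[zxid] = payload             # remember it and vote
            return "ACK"
        if msg_type == "COMMIT":
            # Apply a previously proposed change (KeyError if unknown zxid).
            self.committed.append((zxid, self.pending.pop(zxid)))
            return None
        if msg_type == "UPTODATE":
            self.serving = True                      # may serve clients again
            return None
        raise ValueError(f"unhandled message type: {msg_type}")
```

Keeping proposals in `pending` until the matching commit arrives reflects the two-step flow in the list: the follower votes on a proposal first and applies it only after the leader commits it.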

The follower workflow is shown in the following figure. In the actual implementation, the follower uses five threads.

The observer workflow is not described separately. The only difference from the follower is that the observer does not take part in votes initiated by the leader.
