ZooKeeper Learning Series (III): ZooKeeper Basic Principles

Source: Internet
Author: User

ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives on which distributed applications can build synchronization, configuration maintenance, naming services, and so on. ZooKeeper started as a subproject of Hadoop and is now a top-level Apache project. In distributed applications, engineers often cannot use lock mechanisms well, and message-based coordination is unsuitable for some applications, so a reliable, scalable, distributed, configurable coordination mechanism is needed to unify the state of the system. That is the purpose of ZooKeeper. This article briefly analyzes how ZooKeeper works.

1 Basic Concepts of ZooKeeper

1.1 Roles

There are three main categories of roles in Zookeeper, as shown in the following table:

The system model is shown in the figure:

1.2 Design Purpose

1. Eventual consistency: whichever server a client connects to, it is shown the same view. This is ZooKeeper's most important property.

2. Reliability: simple, robust, and with good performance. If message m is accepted by one server, it will be accepted by all servers.

3. Timeliness: ZooKeeper guarantees that a client obtains server updates, or learns of a server failure, within a bounded time interval. Because of network delay and similar factors, however, ZooKeeper cannot guarantee that two clients see a newly updated value at the same moment. If you need the latest data, call the sync() interface before reading.

4. Wait-free: slow or failed clients must not interfere with the requests of fast clients, so no client is held up waiting on another.

5. Atomicity: Updates can only succeed or fail with no intermediate state.

6. Order: both global order and partial order. Global order means that if message a is published before message b on one server, then a is published before b on every server. Partial order means that if message b is published after message a by the same sender, then a is ordered before b.

2 Working Principle of ZooKeeper

The core of ZooKeeper is atomic broadcast, a mechanism that keeps the individual servers synchronized. The protocol that implements it is called the Zab protocol. Zab has two modes: recovery mode (leader election) and broadcast mode (synchronization). When the service starts or the leader crashes, Zab enters recovery mode. Recovery mode ends once a leader has been elected and a majority of servers have finished synchronizing their state with the leader. State synchronization guarantees that the leader and the other servers hold the same system state.

To guarantee that transactions are applied in a consistent order, ZooKeeper identifies each transaction with a monotonically increasing transaction ID (zxid). Every proposal is stamped with a zxid when it is issued. In the implementation, a zxid is a 64-bit number. The high 32 bits hold the epoch, which identifies whether the leadership has changed: each time a new leader is elected it gets a new epoch, marking that leader's term. The low 32 bits are an incrementing counter.
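The bit layout described above can be illustrated with a minimal Python sketch. The function names (make_zxid, epoch_of, counter_of) are made up for this example and are not part of ZooKeeper; the only thing taken from the text is the split into a high 32-bit epoch and a low 32-bit counter.

```python
EPOCH_SHIFT = 32
COUNTER_MASK = (1 << 32) - 1  # low 32 bits


def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch (high 32 bits) and a counter (low 32 bits) into one zxid."""
    return (epoch << EPOCH_SHIFT) | (counter & COUNTER_MASK)


def epoch_of(zxid: int) -> int:
    """Return the leader epoch stored in the high 32 bits."""
    return zxid >> EPOCH_SHIFT


def counter_of(zxid: int) -> int:
    """Return the per-epoch transaction counter stored in the low 32 bits."""
    return zxid & COUNTER_MASK


if __name__ == "__main__":
    zxid = make_zxid(epoch=2, counter=5)
    print(hex(zxid), epoch_of(zxid), counter_of(zxid))
```

Because the epoch sits in the high bits, any zxid from a newer epoch compares greater than every zxid from an older epoch, so ordinary integer comparison is enough to order transactions across leader changes.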

Each server is in one of three states while it is working:

LOOKING: the current server does not know who the leader is and is searching for one.

LEADING: the current server is the elected leader.

FOLLOWING: a leader has been elected, and the current server is synchronized with it.

2.1 Leader Election Process

When the leader crashes or loses contact with a majority of its followers, ZK enters recovery mode, in which a new leader must be elected so that all servers can be brought back to a correct state. ZK has two election algorithms: one based on basic Paxos and one based on fast Paxos. The system default is the fast variant (the FastLeaderElection class in the implementation). We first describe the basic Paxos process:

1. The election thread is the thread started by the server that initiates the election. Its main job is to tally the votes and select the recommended server.

2. The election thread first sends an inquiry to all servers (including itself).

3. When the election thread receives a reply, it checks that the reply belongs to the inquiry it initiated (by verifying that the zxid matches), obtains the responder's ID (myid) and stores it in the list of servers queried in this round, and finally obtains the leader information (id, zxid) proposed by the responder and stores it in the vote record for this election.

4. After all servers have replied, it determines the server with the largest zxid and sets that server as the one to vote for next.

5. The thread sets the server with the largest zxid as the leader that the current server recommends. If the winning server has received at least N/2 + 1 votes (where N is the total number of servers), it becomes the leader and the current server sets its own state according to that result. Otherwise the process repeats until a leader is elected.
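Steps 4 and 5 above can be sketched roughly as follows. This is a simplified illustration, not ZooKeeper's election code: the function names are hypothetical, network messaging is left out, and breaking zxid ties by the larger server id is an assumption of this sketch that the text above does not state.

```python
from collections import Counter


def pick_candidate(replies):
    """Step 4: choose the server with the largest zxid.

    replies maps server id -> last zxid reported by that server.
    Ties are broken by the larger server id (an assumption of this sketch).
    """
    return max(replies, key=lambda sid: (replies[sid], sid))


def tally(votes, ensemble_size):
    """Step 5: return the winner if it holds at least N/2 + 1 votes, else None.

    votes maps voter id -> the candidate id that voter recommends.
    """
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    return candidate if count >= ensemble_size // 2 + 1 else None


if __name__ == "__main__":
    replies = {1: 0x100000003, 2: 0x100000007, 3: 0x100000005}
    candidate = pick_candidate(replies)
    print("recommended:", candidate)
    print("winner:", tally({1: candidate, 2: candidate, 3: candidate}, 5))
```

When tally returns None, no candidate has a majority yet and the caller would start another round, which matches the "repeat until a leader is elected" wording in step 5.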

From this process we can conclude that, for the leader to obtain the support of a majority of servers, the total number of servers is best kept odd (2n+1), and the number of surviving servers must not fall below n+1.
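The majority arithmetic behind this conclusion is small enough to check directly. The helper names below are invented for this example.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers may fail while a majority is still alive."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for size in (3, 4, 5, 6):
        print(size, quorum_size(size), tolerated_failures(size))
```

An ensemble of 5 and an ensemble of 6 both tolerate two failures, so the extra server in the even-sized ensemble adds no fault tolerance. That is the usual reason for choosing 2n+1 servers.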

Each server repeats the above process after it starts. In recovery mode, a server that has just started or has just recovered from a crash restores its data and session information from a disk snapshot. ZK writes a transaction log and takes periodic snapshots so that state can be restored at recovery time. The flowchart for leader election is as follows:

In the fast Paxos process, a server first proposes itself to all servers as leader. When the other servers receive the proposal, they resolve any epoch and zxid conflicts, accept the proposal, and send a message back to the proposer saying that the proposal has been accepted. Repeating this process eventually elects a leader. The flowchart is as follows:

2.2 Synchronization Process

After the leader is selected, ZK enters the state synchronization process.

1. The leader waits for servers to connect.

2. Each follower connects to the leader and sends it its largest zxid.

3. The leader determines the synchronization point from the follower's zxid.

4. When synchronization is complete, the leader notifies the follower that it is now in the UPTODATE state.

5. Once the follower receives the UPTODATE message, it can again accept client requests.
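Step 3 can be pictured with a deliberately simplified sketch: given the leader's committed zxids in order and the largest zxid a follower reports, find what the follower is still missing. The function name is hypothetical, and the sketch covers only a follower that is behind the leader. The real implementation has further cases (for example, a follower whose log is ahead of the leader's) that are ignored here.

```python
from bisect import bisect_right


def transactions_to_send(leader_log, follower_zxid):
    """Return the leader's zxids that are newer than the follower's last zxid.

    leader_log must be sorted in ascending zxid order.
    """
    sync_point = bisect_right(leader_log, follower_zxid)
    return leader_log[sync_point:]


if __name__ == "__main__":
    leader_log = [0x100000001, 0x100000002, 0x100000003, 0x200000001]
    missing = transactions_to_send(leader_log, 0x100000002)
    print([hex(z) for z in missing])
```

After the returned transactions have been delivered, the leader would send the UPTODATE notification from step 4.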

The flowchart looks like this:

2.3 Work Flow

2.3.1 Leader Work Flow

The leader has three main functions:

1. Recover data.

2. Maintain heartbeats with the learners, receive learner requests, and determine the message type of each request.

3. Handle each learner message according to its type. The main types are PING, REQUEST, ACK, and REVALIDATE messages.

A PING message carries a learner's heartbeat. A REQUEST message is a proposal sent by a follower, covering both write requests and synchronization requests. An ACK message is a follower's reply to a proposal: once more than half of the followers have acknowledged it, the leader commits the proposal. A REVALIDATE message is used to extend a session's validity period.
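The ACK counting rule described above can be sketched as a small tracker. The class and method names are hypothetical, and the threshold follows the wording of this article (more than half of the followers); it is an illustration rather than ZooKeeper's own bookkeeping.

```python
class ProposalTracker:
    """Collects follower ACKs per proposal and commits on a majority."""

    def __init__(self, follower_count):
        self.follower_count = follower_count
        self.acks = {}        # zxid -> set of follower ids that have ACKed
        self.committed = []   # zxids in the order they were committed

    def propose(self, zxid):
        """Register an outstanding proposal."""
        self.acks[zxid] = set()

    def ack(self, zxid, follower_id):
        """Record one ACK; return True if this ACK caused the commit."""
        if zxid not in self.acks:
            return False  # unknown proposal, or already committed
        self.acks[zxid].add(follower_id)  # a set ignores duplicate ACKs
        if len(self.acks[zxid]) > self.follower_count / 2:
            del self.acks[zxid]
            self.committed.append(zxid)
            return True
        return False


if __name__ == "__main__":
    tracker = ProposalTracker(follower_count=4)
    tracker.propose(1)
    for follower in ("a", "b", "c"):
        print(follower, tracker.ack(1, follower))
```

With four followers, the third distinct ACK is the first to exceed half, so that is the call that returns True.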
The leader's workflow diagram is shown below. The actual implementation is much more complex than the diagram and starts three threads to do this work.

2.3.2 Follower Work Flow

A follower has four main functions:

1. Send requests to the leader (PING, REQUEST, ACK, and REVALIDATE messages).

2. Receive and handle messages from the leader.

3. Receive client requests and, if a request is a write, forward it to the leader for a vote.

4. Return results to the client.

The follower's message loop handles the following messages from the leader:

1. PING message: a heartbeat.

2. PROPOSAL message: the leader issues a proposal and asks the followers to vote on it.

3. COMMIT message: information about the latest proposal on the server side.

4. UPTODATE message: indicates that synchronization is complete.

5. REVALIDATE message: depending on the leader's REVALIDATE result, either close the session being revalidated or allow it to keep accepting messages.

6. SYNC message: returns the sync result to the client. This message is originally initiated by the client to force the latest updates.
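A message loop of this shape might be sketched as below. Everything here is hypothetical: the class name, the string message types, and the return values are invented for illustration, and REVALIDATE and SYNC are left out to keep the sketch short.

```python
class FollowerSketch:
    """Toy follower that reacts to a few leader message types."""

    def __init__(self):
        self.pending = {}     # zxid -> proposed data awaiting COMMIT
        self.applied = []     # (zxid, data) pairs in commit order
        self.serving = False  # becomes True once UPTODATE arrives

    def on_message(self, msg_type, zxid=None, data=None):
        """Handle one leader message; return the reply type, if any."""
        if msg_type == "PING":
            return "PING"                 # answer the heartbeat
        if msg_type == "PROPOSAL":
            self.pending[zxid] = data     # remember it and vote
            return "ACK"
        if msg_type == "COMMIT":
            # Raises KeyError if the proposal was never seen; a real
            # implementation would need to handle that case.
            self.applied.append((zxid, self.pending.pop(zxid)))
            return None
        if msg_type == "UPTODATE":
            self.serving = True           # may serve clients again
            return None
        raise ValueError("unhandled message type: " + msg_type)


if __name__ == "__main__":
    follower = FollowerSketch()
    print(follower.on_message("PROPOSAL", zxid=1, data="x"))
    follower.on_message("COMMIT", zxid=1)
    print(follower.applied)
```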

The follower's workflow diagram is shown below. In the actual implementation, a follower uses five threads to do this work.

The observer workflow is not described separately. The only difference from a follower is that an observer does not take part in votes initiated by the leader.


Mainstream application scenarios of ZooKeeper (excluding the official examples):

(1) Configuration management
Centralized configuration management is very common in application clusters. A typical company runs a central configuration service internally so that different application clusters can each share their own configuration, and so that every machine in a cluster is notified when the configuration changes.

ZooKeeper makes this kind of centralized configuration management easy to implement. For example, store all of APP1's configuration in the /app1 znode. Every APP1 machine sets a watch on this node as soon as it starts (zk.exists("/app1", true)) and implements the Watcher callback. When the data in the /app1 znode changes, every machine is notified and its Watcher callback runs. The application then fetches the new data (zk.getData("/app1", false, null)).
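The watch-then-reload pattern can be tried out without a running ensemble by using an in-memory stand-in. The FakeZnode and ConfigClient classes below are hypothetical and are not the ZooKeeper client API. They only mimic one property of ZooKeeper watches: a watch fires once and must be registered again to receive the next change.

```python
class FakeZnode:
    """In-memory stand-in for a single znode with one-shot data watches."""

    def __init__(self, data):
        self.data = data
        self._watchers = []

    def get_data(self, watcher=None):
        """Return the data, optionally leaving a one-shot watch behind."""
        if watcher is not None:
            self._watchers.append(watcher)
        return self.data

    def set_data(self, data):
        """Change the data and fire every registered watch exactly once."""
        self.data = data
        watchers, self._watchers = self._watchers, []
        for notify in watchers:
            notify()


class ConfigClient:
    """One application machine that keeps a local copy of the config."""

    def __init__(self, node):
        self.node = node
        self.config = None
        self.reload()

    def reload(self):
        # Re-register the watch on every read, because it fires only once.
        self.config = self.node.get_data(watcher=self.reload)


if __name__ == "__main__":
    app1 = FakeZnode("timeout=30")
    machines = [ConfigClient(app1), ConfigClient(app1)]
    app1.set_data("timeout=60")
    print([m.config for m in machines])
```

In this sketch the read itself re-registers the watch, so a client never misses a second change because it forgot to watch again.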

The example above is only coarse-grained configuration monitoring. For fine-grained data, monitoring can be organized hierarchically. All of this can be designed and controlled as needed.
(2) Cluster management
In an application cluster, each machine often needs to know which machines in the cluster (or in another cluster it depends on) are alive. When a machine becomes unavailable because of a crash, a network disconnection, or a similar event, every other machine should be notified quickly and without human intervention.

ZooKeeper makes this easy to implement as well. Suppose the ZooKeeper servers hold a znode called /app1servers. Every machine in the cluster creates an EPHEMERAL node under it at startup: for example, server1 creates /app1servers/server1 (the IP address can be used as the name to guarantee uniqueness) and server2 creates /app1servers/server2. Both server1 and server2 then watch the parent node /app1servers, so any change to the data or children of this parent node is reported to the clients watching it. An EPHEMERAL node has one important property: it disappears when the connection between client and server is broken or the session expires. So when a machine crashes or loses its connection, its node disappears, every client in the cluster watching /app1servers is notified, and each one fetches the latest list.
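The membership behaviour just described can be imitated with a small in-memory model. FakeMembership is a hypothetical class, not the ZooKeeper client API. It models only two things from the text: ephemeral children vanish when their owning session ends, and child watches are one-shot and must be set again.

```python
class FakeMembership:
    """In-memory stand-in for a parent znode with ephemeral children."""

    def __init__(self):
        self._owner = {}      # child name -> id of the session that made it
        self._watchers = []   # one-shot child watches

    def create_ephemeral(self, name, session_id):
        """Add a child tied to a session and notify the watchers."""
        self._owner[name] = session_id
        self._notify()

    def get_children(self, watcher=None):
        """Return the sorted child names, optionally leaving a watch."""
        if watcher is not None:
            self._watchers.append(watcher)
        return sorted(self._owner)

    def expire_session(self, session_id):
        """Drop every child owned by the session, as if it had expired."""
        gone = [n for n, s in self._owner.items() if s == session_id]
        for name in gone:
            del self._owner[name]
        if gone:
            self._notify()

    def _notify(self):
        watchers, self._watchers = self._watchers, []
        for notify in watchers:
            notify()


if __name__ == "__main__":
    servers = FakeMembership()

    def on_change():
        print("alive:", servers.get_children(watcher=on_change))

    servers.get_children(watcher=on_change)
    servers.create_ephemeral("server1", session_id="s1")
    servers.create_ephemeral("server2", session_id="s2")
    servers.expire_session("s1")  # simulates server1 crashing
```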

Another scenario is master election within a cluster: as soon as the master fails, a new master can be chosen from the slaves immediately. The steps are the same as before, except that the node each machine creates under /app1servers at startup is of type EPHEMERAL_SEQUENTIAL, so every node is numbered automatically.

By convention we treat the node with the smallest number as the master. When the machines watch /app1servers and fetch the server list, every machine in the cluster applies the same rule (the smallest-numbered node is the master), so the master is chosen. If that master goes down, its znode disappears and the new server list is pushed to the clients. Each machine again takes the smallest-numbered node as the master, which gives dynamic master election.
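The "smallest number wins" rule reduces to a one-line selection over the child list. The sketch below assumes child names that end in the 10-digit zero-padded counter ZooKeeper appends to sequential nodes. The function names and the "server-" prefix are made up for this example.

```python
def sequence_of(child_name: str) -> int:
    """Extract the trailing 10-digit sequence number from a child name."""
    return int(child_name[-10:])


def elect_master(children):
    """Return the child with the smallest sequence number, or None if empty."""
    if not children:
        return None
    return min(children, key=sequence_of)


if __name__ == "__main__":
    children = ["server-0000000003", "server-0000000001", "server-0000000002"]
    print(elect_master(children))
    children.remove("server-0000000001")  # the master's znode disappears
    print(elect_master(children))
```

Because every machine runs the same deterministic rule over the same list, they all agree on the master without exchanging any further messages.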
