Ideas behind implementing distributed coordination with ZooKeeper


ZooKeeper is one of the many excellent pieces of software that the Hadoop ecosystem has contributed to the open source community.

My previous project used ZooKeeper mainly for service registration and discovery, in a way similar to Alibaba's Dubbo. ZooKeeper offers far more than that, though: it provides a series of very powerful features, which are covered below. This article is only my personal understanding; if there are mistakes, please point them out so that others are not misled.

1. What is ZooKeeper

ZooKeeper's name is amusing: it means "animal keeper", which fits because much of the software in the Hadoop ecosystem is named after animals: Hadoop, Hive, Pig, and so on. ZooKeeper is widely used as a distributed coordination service in the Hadoop world; HBase, for example, ships with ZooKeeper by default. Its main features are configuration maintenance, distributed locks, leader election, distributed queues, and more. ZooKeeper itself can run as a cluster, providing high availability, and it exposes a small set of simple, easy-to-use APIs that make development straightforward.

2. ZooKeeper Data Model

The naming service ZooKeeper provides looks very much like a UNIX file system: a hierarchical tree of znodes addressed by slash-separated paths (the official website has a diagram of this tree).


Each node in this tree is called a znode. A znode can hold both data and child nodes. Because ZooKeeper is positioned as a coordinator, the data stored in a znode is usually small, typically state information, location information, and the like; the official recommendation is that znode data not exceed 1 MB. Znodes come in two types, ephemeral (temporary) and persistent (permanent), and each type has an ordered (sequential) and an unordered variant. The ephemeral type deserves special attention, because many of ZooKeeper's higher-level features are built on it. When a client connects to ZooKeeper, the two establish a session whose state is maintained on the server. If the session times out, all ephemeral nodes created by that client are removed; persistent nodes, by contrast, do not disappear even after the client that created them exits. Ephemeral nodes cannot have child nodes, but they can hold data. Combined with the watcher mechanism, znodes make very rich and flexible functionality possible.
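
A minimal sketch of these node types with the standard Java client, assuming a local server at 127.0.0.1:2181 (the paths /demo, /demo/worker, and /demo/task- are made up for illustration):

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher.Event.KeeperState;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZnodeTypesDemo {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // 5 s session timeout; when the session expires, every ephemeral
            // node created by this client disappears with it.
            ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 5000, event -> {
                if (event.getState() == KeeperState.SyncConnected) connected.countDown();
            });
            connected.await();

            // Persistent node: survives after this client disconnects.
            zk.create("/demo", "state-info".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Ephemeral node: tied to this session; it can hold data
            // but cannot have children.
            zk.create("/demo/worker", "host:port".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // The "ordered" variant appends an increasing suffix,
            // e.g. /demo/task-0000000001.
            zk.create("/demo/task-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            zk.close(); // closing the session removes both ephemeral nodes
        }
    }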

3. ZooKeeper Cluster Structure


ZooKeeper supports both standalone and cluster deployment. Cluster deployment is recommended for production, because it has no single point of failure; ZooKeeper also recommends an odd number of nodes, since the cluster stays available as long as more than half of the machines are up. A ZooKeeper cluster has two main roles, leader and follower. A client can connect to any node in the cluster and read from whichever node it is connected to, but write operations are not handled by followers: a follower forwards them to the leader, which performs an atomic broadcast, thereby ensuring that the data on every node in the cluster stays consistent. ZooKeeper considers a write complete as soon as more than half of the nodes have synchronized it, which means fewer than half of the nodes may briefly hold data that is not the latest. ZooKeeper's consistency is therefore not strong consistency but eventual consistency; a client can, however, call sync() to force the most recent data to be read.
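
For reference, a typical three-node cluster configuration might look like the zoo.cfg sketch below (hostnames and paths are placeholders; each server additionally needs a myid file in dataDir containing its own server number):

    # zoo.cfg, identical on all three servers
    tickTime=2000
    # ticks a follower may take to connect to and sync with the leader
    initLimit=10
    # ticks a follower may lag behind the leader before being dropped
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # server.<myid>=<host>:<quorum-port>:<leader-election-port>
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888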

4. Replication

High availability in ZooKeeper is achieved through data redundancy: a single piece of data lives on multiple nodes, and ZooKeeper requires the same data to be present on more than half of the nodes, which determines how much downtime can be tolerated. ZooKeeper recommends configuring the cluster with an odd number of nodes, because both writes and leader elections need more than half of the nodes to complete or agree. For example, with 3 nodes at least 2 must be healthy, so the tolerance (the number of nodes allowed to fail) is 1; with 4 nodes at least 3 must be healthy, so the tolerance is still 1. More machines for the same tolerance hardly seems worth it, so ZooKeeper recommends deploying an odd number of nodes in the cluster, though this is not mandatory.

Now look at why a write must commit on more than half of the nodes for the operation to count as successful. With 2t+1 ZooKeeper nodes, a write must commit on at least t+1 of them. Only under this rule do any two updates overlap on at least one node: two updates involve at least 2t+2 commits in total, so at least one node carries both update records. ZooKeeper uses the ZAB protocol to broadcast data atomically; every write operation is written to the log first and then applied to the in-memory state. Each ZooKeeper node maintains a global counter, the zxid, which increases with every znode change. When the leader dies, the remaining followers elect the node with the largest zxid as the new leader. Before the new leader can serve, a data recovery phase is required: the new leader merely has the most data (the largest zxid), not necessarily the latest data everywhere, so the leader and followers synchronize to the latest state through a merge process and complete the recovery of the whole data set.


(Figure: with the 5 ZooKeeper nodes above, two may be down at any time, and the remaining three nodes can always recover the full data set A B C D E.)
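
To make the majority arithmetic concrete, here is a tiny plain-Java calculation (nothing ZooKeeper-specific) of quorum size and failure tolerance for a few ensemble sizes:

    public class QuorumMath {
        public static void main(String[] args) {
            for (int n = 3; n <= 6; n++) {
                int quorum = n / 2 + 1;      // a strict majority must acknowledge
                int tolerance = n - quorum;  // nodes that may be down at once
                System.out.printf("nodes=%d quorum=%d tolerance=%d%n",
                        n, quorum, tolerance);
            }
            // Prints 3->1, 4->1, 5->2, 6->2: an even count adds a machine
            // without adding tolerance, hence the odd-number recommendation.
            // Any two majorities of t+1 out of 2t+1 nodes share at least one
            // node, so consecutive committed writes always overlap somewhere.
        }
    }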

5. Watch mechanism

ZooKeeper allows a client to set a listener (watcher) on a znode or on the data in a znode. When the znode changes, the server triggers the watcher, and the client runs a callback to carry out its own business logic. Watches in ZooKeeper are one-shot: once a watcher has been triggered, it is invalidated, and the watcher must be registered again. The exists, getData, and getChildren calls can each specify whether to register a watcher, using either the default watcher or a custom one; the operations that can trigger a watcher are create, delete, setData, and setACL.
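
A minimal sketch of the one-shot pattern with the standard Java client (session setup is omitted; the path /config is assumed for illustration). The key point is that the callback re-reads the data and re-registers the watch in a single getData call:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class DataWatch implements Watcher {
        private final ZooKeeper zk;

        public DataWatch(ZooKeeper zk) throws Exception {
            this.zk = zk;
            watch(); // register the initial watch
        }

        // getData registers this object as the watcher on /config.
        private byte[] watch() throws Exception {
            return zk.getData("/config", this, new Stat());
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    // Watches are one-shot: re-register while reading the
                    // new value, then run the business logic.
                    byte[] latest = watch();
                    System.out.println("config changed: " + new String(latest));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }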

6. Configuration Management

With a standalone deployment or just a few machines, configuration changes can be applied by hand. But if a cluster has hundreds of application nodes, how do you guarantee that a configuration change is rolled out quickly and without error? ZooKeeper solves this problem easily.


The approach: create a permanent znode on ZooKeeper and write the configuration entry into it, and have every application node listen for changes to the data of that znode. As soon as any client modifies the data, all listening clients receive a change notification.
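
A sketch of the publish side under the same assumptions as before (local server at 127.0.0.1:2181; the znode /config and its JDBC-URL payload are made up). Every client that registered a data watch on /config, as in the watcher sketch in section 5, receives one NodeDataChanged notification:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher.Event.KeeperState;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ConfigPublisher {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 5000, e -> {
                if (e.getState() == KeeperState.SyncConnected) connected.countDown();
            });
            connected.await();

            // Create the permanent config node once; ignore if it exists.
            try {
                zk.create("/config", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException ignored) { }

            // Push a new value; version -1 means "regardless of current version".
            // Every client holding a data watch on /config is notified once.
            zk.setData("/config", "jdbc:mysql://db2:3306/app".getBytes(), -1);
            zk.close();
        }
    }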

7. Leader election

ZooKeeper itself provides a leader election mechanism. The rough idea: all nodes create an ephemeral ordered znode under a common path and then listen for changes to those nodes; each node compares the smallest sequence number with the number of the znode it created, and whoever created the smallest is elected leader. When the leader deletes its node deliberately or goes down, the ephemeral node disappears, the surviving nodes observe the change, a second leader election is triggered, and so on. In fact, Curator (an excellent wrapper around the ZooKeeper API) provides good implementations of most of the recipes ZooKeeper mentions (two-phase commit being an exception). Developing applications directly on the low-level ZooKeeper API leaves many corner cases to consider; Curator encapsulates all of this, so if you want to write a ZooKeeper application, using Curator is recommended.

Leader election is widely used. Curator offers two different election implementations: in one, instances take turns acquiring leadership and releasing it; in the other, an instance holds leadership permanently until it exits. The two styles suit different kinds of cluster applications; a sketch of the second follows.
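
A minimal sketch of the hold-until-exit style using Curator's LeaderLatch (the path /election/demo and the id node-1 are placeholders). The rotating style corresponds to Curator's LeaderSelector, whose takeLeadership() callback relinquishes leadership when it returns:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class ElectionDemo {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Every candidate creates an ephemeral sequential znode under the
            // latch path; the one with the smallest sequence number leads.
            LeaderLatch latch = new LeaderLatch(client, "/election/demo", "node-1");
            latch.start();
            latch.await();                 // block until this instance is elected
            if (latch.hasLeadership()) {
                System.out.println("node-1 is now the leader");
            }

            latch.close();                 // give up leadership; the next lowest takes over
            client.close();
        }
    }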

8. A simple failure detector based on ZooKeeper

When the nodes of a cluster start up, one of them creates a permanent unordered node /cluster; it does not matter which node creates it. Each node then registers itself under /cluster by creating a child znode such as /cluster/node-*, and it is essential that these children are ephemeral nodes. All applications in the cluster watch /cluster for child changes. If a node in the cluster goes down, ZooKeeper deletes its ephemeral node (/cluster/node-*), and every application watching /cluster receives a change notification. For service discovery, the notification can be used to update the list of available services; for failover, the application acting as leader can take the follow-up action. A sketch follows below.
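
A rough sketch under the usual assumptions (local server at 127.0.0.1:2181; the payload host:port is a placeholder):

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher.Event.EventType;
    import org.apache.zookeeper.Watcher.Event.KeeperState;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ClusterMembership {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 5000, e -> {
                if (e.getState() == KeeperState.SyncConnected) connected.countDown();
            });
            connected.await();

            // Permanent parent; whichever node gets here first creates it.
            try {
                zk.create("/cluster", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException ignored) { }

            // Register this member as an ephemeral sequential child; it is
            // deleted automatically when this session dies.
            zk.create("/cluster/node-", "host:port".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            watchMembers(zk);
            Thread.sleep(Long.MAX_VALUE); // keep the session (and our znode) alive
        }

        // Child watches are one-shot, so re-register on every notification.
        static void watchMembers(ZooKeeper zk) throws Exception {
            List<String> members = zk.getChildren("/cluster", event -> {
                if (event.getType() == EventType.NodeChildrenChanged) {
                    try {
                        watchMembers(zk);
                    } catch (Exception ignored) { }
                }
            });
            System.out.println("live members: " + members);
        }
    }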
