ZooKeeper architecture design and key points for its application

ZooKeeper is an open-source distributed coordination service that began as a subproject of Apache Hadoop. It addresses problems that recur in distributed applications, such as unified naming, state synchronization, cluster management, and configuration management. It runs in either standalone or distributed mode; in distributed mode it provides a high-performance, reliable coordination service for distributed applications. Using ZooKeeper greatly simplifies the implementation of distributed coordination and lowers the cost of developing distributed applications.

Overall Architecture

The overall architecture of ZooKeeper as a distributed coordination service is as follows:

A ZooKeeper cluster consists of a set of server nodes. One node in the set acts as the leader and the others are followers. A client may connect to any server in the cluster; when it issues a write request, the request is forwarded to the leader, and the resulting data changes on the leader are synchronized to the followers.
After receiving a data change request, the leader first writes the change to local disk so that it can be recovered after a failure. A change is applied to memory only after it has been persisted to disk.
ZooKeeper uses a custom atomic messaging protocol that guarantees, at the message layer, that the data and state of all nodes in the coordination system stay consistent. Using this protocol, each follower keeps its local ZooKeeper data synchronized with the leader and then serves clients independently from its local copy.
When the leader fails, failover is fast: the message layer elects a new leader, which takes over as the center of the coordination cluster, processes client write requests, and synchronizes (broadcasts) data changes to the other followers.
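
The write path described above (log first, then apply to memory, then propagate to followers) can be pictured with a toy model. This is only a sketch: the `Server` and `Leader` classes are hypothetical stand-ins rather than ZooKeeper classes, and the real protocol also involves acknowledgements and leader election, which are left out here.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy stand-in for one ZooKeeper server (hypothetical, not the real API)."""
    name: str
    log: list = field(default_factory=list)     # stands in for the on-disk transaction log
    memory: dict = field(default_factory=dict)  # stands in for the in-memory data

    def persist(self, txn):
        # Step 1: record the change durably before anything else.
        self.log.append(txn)

    def apply(self, txn):
        # Step 2: only a change that is already in the log may reach memory.
        assert txn in self.log, "change must be logged before it is applied"
        path, data = txn
        self.memory[path] = data


class Leader(Server):
    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers

    def write(self, path, data):
        txn = (path, data)
        self.persist(txn)
        self.apply(txn)
        # Broadcast the change so every follower holds the same data.
        for follower in self.followers:
            follower.persist(txn)
            follower.apply(txn)
```

After `Leader("leader", [f1, f2]).write("/config", b"v1")`, both followers hold `b"v1"` under `/config` and can answer reads from their own copy.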

Design Essentials

ZooKeeper's design is a trade-off built around the following four goals. We describe each one in terms of the design and its characteristics:

    • Simple
    • Processes in a distributed application coordinate with one another through the ZooKeeper namespace, which is shared, hierarchical, and, more importantly, simple enough to be understood as easily as the directory structure of an ordinary file system.

    In ZooKeeper, each node in the namespace is called a znode. Each znode is identified by a path and holds associated data and metadata, along with a list of its child nodes. Unlike a traditional file system, ZooKeeper keeps its data in memory, which gives distributed synchronization services high throughput and low latency.
    The ZooKeeper data model has the following key points:

    1. Each znode stores coordination-related data such as status information, configuration content, and location information. ZooKeeper was designed for this kind of data, which is small, on the order of bytes to kilobytes.

    2. A znode maintains a stat structure that includes version numbers for data changes and ACL changes, as well as timestamps. Each time a znode's data changes, its version number is incremented, and a client that reads the data also receives the version that the data corresponds to.

    3. Each znode has an ACL that restricts who can access it.

    4. Reads and writes of the data stored at a znode are atomic: a read returns all of the data and a write replaces all of it.

    5. A client can set a watch on a znode. When the znode's data changes, ZooKeeper notifies the client, which triggers the logic the client has attached to the watch.

    6. Each client that connects to ZooKeeper establishes a session. Over its lifetime a session is in one of three states: CONNECTING, CONNECTED, or CLOSED.

    7. ZooKeeper supports ephemeral nodes, which are tied to the session that created them. When that session ends, the node is deleted.
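
One way to picture the version number from point 2 is as a guard on updates: a writer states which version it last read, and the write is refused if the data has changed since. The sketch below is a toy in-memory model under that assumption; `Znode` and `BadVersion` are illustrative names, not the ZooKeeper client API.

```python
class BadVersion(Exception):
    """Raised when the caller's expected version is stale (hypothetical name)."""


class Znode:
    def __init__(self, data=b""):
        self.data = data
        self.version = 0  # number of changes made to this znode's data

    def get(self):
        # A read returns the data together with the version it belongs to.
        return self.data, self.version

    def set(self, data, expected_version):
        # Refuse the write if the data changed after the caller read it.
        if expected_version != self.version:
            raise BadVersion(f"expected {expected_version}, found {self.version}")
        self.data = data
        self.version += 1
        return self.version
```

Two clients that both read version 0 and then both try to write will see exactly one write succeed; the other gets `BadVersion` and must read again before retrying.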

  • Redundancy
  • ZooKeeper uses a replicated cluster architecture: the data of each node is replicated and propagated across the cluster so that every server holds a synchronized copy, which gives the service its reliability and availability. As noted earlier, ZooKeeper keeps data in memory for performance, so replicating the data for redundant storage is essential to avoid a single point of failure (SPOF).

  • Ordered
  • ZooKeeper stamps every transaction that changes state with a number that reflects its position in the order of all ZooKeeper transactions, so any set of transactions is guaranteed to be ordered. Building on this property, ZooKeeper supports higher-level abstractions such as synchronization primitives.

  • Fast
  • ZooKeeper serves both reads and writes, and it is especially fast for read-dominated workloads. Applications built on ZooKeeper perform best when reads are more common than writes, at a read-to-write ratio of about 10:1.

Data Model

ZooKeeper has a hierarchical namespace structured like a file system directory tree, which makes it simple and intuitive. The znode, described earlier, is its most important concept. Related concepts include watches, ACLs, ephemeral nodes, and sequence nodes.

  • Znode structure
  • ZooKeeper uses a zxid (ZooKeeper Transaction Id) to identify each change to node data. Every change receives a unique zxid, and a change with a smaller zxid happened before a change with a larger one, so all changes are ordered. According to the reference documentation, the stat structure of a znode consists of the following fields:

    • czxid–the zxid of the change that caused this znode to be created.

    • mzxid–the zxid of the change that last modified this znode.

    • ctime–the time in milliseconds from epoch when this znode was created.

    • mtime–the time in milliseconds from epoch when this znode was last modified.

    • version–the number of changes to the data of this znode.

    • cversion–the number of changes to the children of this znode.

    • aversion–the number of changes to the ACL of this znode.

    • ephemeralOwner–the session ID of the owner of this znode if the znode is an ephemeral node. If it is not an ephemeral node, it will be zero.

    • dataLength–the length of the data field of this znode.

    • numChildren–the number of children of this znode.
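
The ZooKeeper internals documentation describes the zxid as a 64-bit number whose high 32 bits hold the leader epoch and whose low 32 bits hold a counter. The helper functions below are illustrative names rather than part of any ZooKeeper API; they only sketch how such a number can be packed, unpacked, and compared.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch, counter):
    """Pack an epoch and a per-epoch counter into a single zxid."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def split_zxid(zxid):
    """Return the (epoch, counter) pair encoded in a zxid."""
    return zxid >> COUNTER_BITS, zxid & COUNTER_MASK


def happened_before(zxid_a, zxid_b):
    """A smaller zxid identifies an earlier change."""
    return zxid_a < zxid_b
```

Because the epoch occupies the high bits, any change made under a newer leader compares greater than every change made under an older one, regardless of the counter values.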

  • Watches (Monitoring)
  • A watch in ZooKeeper is a one-time trigger. If a client sets a watch on a znode and the znode's data changes, ZooKeeper sends the client a change notification, which fires the watch event. If the data changes again and the client has not set a new watch after receiving the first notification, ZooKeeper sends no further notification.
    ZooKeeper notifies the watching client asynchronously. However, it guarantees that a client sees the watch event before it sees the changed data of the znode it is watching. Because of network latency, different clients may see a change to a znode at different times, but the order in which they see changes is guaranteed to be consistent.
    A znode can carry two kinds of watches: data watches, which fire when the znode's data changes, and child watches, which fire when the znode's children change. Calling getData() or exists() sets a data watch; calling getChildren() sets a child watch. A call to setData() triggers the data watches registered on the znode. A successful create() triggers the data watch on the znode being created and the child watch on its parent. A successful delete() triggers both the data watch and the child watch on the deleted znode, as well as the child watch on its parent.
    In addition, while a client is disconnected from the ZooKeeper server it receives no watch notifications; they can be delivered only after the connection to a server is re-established.
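
The one-time nature of a watch is easy to get wrong, so here is a toy model of it. `WatchedNode` is a hypothetical in-memory class, not the ZooKeeper client; it only mimics the rule that a watcher registered on a read fires for the next change and must then be registered again.

```python
class WatchedNode:
    """Toy znode with one-shot data watches (hypothetical, not the client API)."""

    def __init__(self, data=b""):
        self.data = data
        self._watchers = []

    def get_data(self, watcher=None):
        # Reading with a watcher registers it for the next change only.
        if watcher is not None:
            self._watchers.append(watcher)
        return self.data

    def set_data(self, data):
        self.data = data
        # Detach every watcher before notifying: each fires at most once.
        fired, self._watchers = self._watchers, []
        for watcher in fired:
            watcher(data)


events = []
node = WatchedNode(b"v0")
node.get_data(watcher=events.append)
node.set_data(b"v1")                  # fires the watch
node.set_data(b"v2")                  # no watch is registered, nothing fires
node.get_data(watcher=events.append)  # register again
node.set_data(b"v3")                  # fires the new watch
```

At the end `events` holds `[b"v1", b"v3"]`: the change to `b"v2"` went unnoticed because the watch had not been set again, which is why a watcher usually re-reads the data and re-registers itself as its first step.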

  • Sequence Nodes (Sequence node)
  • When creating a znode, you can ask ZooKeeper to append a monotonically increasing counter to the end of the path name. The counter is formatted as 10 digits with zero padding, so a path prefix of qn- produces a sequence such as the following:

    qn-0000000001, qn-0000000002, qn-0000000003, qn-0000000004, qn-0000000005, qn-0000000006, qn-0000000007

    The counter is unique to the parent znode, so each name in the sequence is unique among its siblings. Its maximum value is 2147483647, the largest signed 32-bit integer.
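
The naming rule can be reproduced in a few lines. This is a sketch of the format only; `sequential_name` is a hypothetical helper, and in a real cluster the counter is maintained by the server, not by the client.

```python
INT_MAX = 2147483647  # largest value of a signed 32-bit counter


def sequential_name(prefix, counter):
    """Build a child name from a prefix and a zero-padded 10-digit counter."""
    if not 0 <= counter <= INT_MAX:
        raise ValueError("counter outside the signed 32-bit range")
    return f"{prefix}{counter:010d}"


names = [sequential_name("qn-", n) for n in (3, 1, 2)]
ordered = sorted(names)  # plain string sorting matches creation order
```

Because the counter is zero-padded to a fixed width, sorting the names as strings gives the same order as sorting by counter, which is what queue and lock recipes built on sequence nodes rely on.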

  • ACLs (Access control list)
  • An ACL controls access to a znode. It applies only to the specific znode it is set on and does not extend to that znode's children. The following five permissions are supported:

      • CREATE: can create a child node

      • READ: can get the znode's data and the list of its children

      • WRITE: can modify the znode's data

      • DELETE: can delete a child node

      • ADMIN: can set permissions

    ZooKeeper has four built-in ACL schemes:

      • world: has a single ID, which represents anyone

      • auth: uses no ID and represents any authenticated user

      • digest: uses a hash generated from a username:password string as the ACL ID (the reference documentation describes it as an MD5 hash)

      • ip: uses the client host IP address as the ACL ID
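
A small model helps show why an ACL on a parent says nothing about its children. The permission bit values below match the constants in ZooKeeper's Java client (READ=1, WRITE=2, CREATE=4, DELETE=8, ADMIN=16); everything else, including the `acls` table, the IP address, and the `allowed` function, is a hypothetical illustration.

```python
from enum import IntFlag


class Perm(IntFlag):
    READ = 1
    WRITE = 2
    CREATE = 4
    DELETE = 8
    ADMIN = 16


# Each znode carries its own ACL: a list of (scheme, id, permissions) entries.
acls = {
    "/app": [("world", "anyone", Perm.READ)],
    "/app/config": [("ip", "10.0.0.5", Perm.READ | Perm.WRITE)],
}


def allowed(path, scheme, ident, perm):
    """Check one permission against the ACL of exactly this znode."""
    for acl_scheme, acl_id, perms in acls.get(path, []):
        if (acl_scheme, acl_id) == (scheme, ident) and (perms & perm) == perm:
            return True
    return False
```

Here anyone can read `/app`, but that grant does not carry over to `/app/config`, which only the listed IP address can read or write.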

  • ZooKeeper Session
  • A session is established when a client connects to the ZooKeeper cluster. A session moves through the following states:

      While the client is connecting, the session state is CONNECTING; once the connection is established, it becomes CONNECTED. During normal operation a session is always in one of these two states: if the connection is lost, the client returns to CONNECTING and tries to reach a server again. If an unrecoverable error occurs, such as session expiration or authentication failure, or if the application explicitly closes the session, the state becomes CLOSED.
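
The session life cycle can be written down as a small transition table. The state names follow the text above; the event names (`connected`, `disconnected`, and so on) are illustrative labels chosen for this sketch, not identifiers from the ZooKeeper client.

```python
from enum import Enum


class State(Enum):
    CONNECTING = "CONNECTING"
    CONNECTED = "CONNECTED"
    CLOSED = "CLOSED"


# Allowed transitions, keyed by (current state, event).
TRANSITIONS = {
    (State.CONNECTING, "connected"): State.CONNECTED,
    (State.CONNECTED, "disconnected"): State.CONNECTING,
    (State.CONNECTING, "session_expired"): State.CLOSED,
    (State.CONNECTING, "auth_failed"): State.CLOSED,
    (State.CONNECTING, "close"): State.CLOSED,
    (State.CONNECTED, "close"): State.CLOSED,
}


def step(state, event):
    """Return the next state; CLOSED is terminal and unknown events are ignored."""
    return TRANSITIONS.get((state, event), state)
```

A session that sees `connected`, `disconnected`, `connected`, `close` passes through CONNECTED, CONNECTING, CONNECTED and ends in CLOSED, after which no event changes its state.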


Application Pitfalls

Not every distributed application is well suited to building its coordination on ZooKeeper. Drawing on the ZooKeeper documentation and on problems that can arise in practice, the main pitfalls are summarized below:

  1. Missed change notifications on a znode
    After a client connects to a ZooKeeper server, it maintains a TCP connection to it. While in the CONNECTED state, a client that has set a watch on a znode receives notifications of changes to that node and can then run whatever logic it has attached. If the client is disconnected from the server by a network failure, however, it receives no notifications for changes made to the znode while it is disconnected.
    Therefore, if you rely on watches, you must keep the client in the CONNECTED state to be sure that no change notification for a watched znode is lost, and treat any disconnection as a period in which changes may have been missed.

  2. Stale list of ZooKeeper cluster nodes
    When interacting with a ZooKeeper cluster, a client typically holds a list of the cluster's nodes, or a subset of that list. Two situations can arise:
    In the first, every node in the client's list is active and able to provide the coordination service, so the client accesses the cluster without any problem.
    In the second, some nodes in the client's list have dropped out of the cluster because of a failure, and the client cannot obtain service if it connects to one of these defunct nodes.
    An application that uses a ZooKeeper cluster must therefore handle this case explicitly: skip the failed node, look for another valid node to continue the business process, or check the cluster and restore it to a healthy state.

  3. Performance problems caused by configuration
    If the Java heap size is set inappropriately, ZooKeeper can run short of memory and data starts being swapped between RAM and the file system, which degrades ZooKeeper's performance significantly and may affect the applications that depend on it.
    To avoid swapping, give ZooKeeper enough Java heap while leaving room for the memory used by the operating system and its cache, so that swapping is avoided or kept within a limited range.

  4. Performance of the transaction log storage device
    ZooKeeper syncs transactions to a storage device. If that device is not dedicated but shares a disk with other I/O-intensive applications, ZooKeeper's performance suffers. For every client request that changes znode data, ZooKeeper writes the transaction log to the storage device before it responds, so giving the log a dedicated device yields a significant performance gain for the whole service and for the applications that depend on it.

  5. Performance problems caused by storing large amounts of data in a znode
    ZooKeeper is designed to store only a small amount of coordination data per znode. If a znode holds a large amount of data, every change to it must be written to the transaction log on the storage device and replicated across the cluster, which inevitably causes latency and performance problems.
    So if a large amount of data is involved, store the data itself in another system and keep only a small mapping to it in ZooKeeper, such as a pointer or reference.
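
The last pitfall suggests keeping only a pointer in ZooKeeper. A minimal sketch of that pattern follows, with two dictionaries standing in for ZooKeeper and for an external store. The threshold, the `inline:` and `ref:` prefixes, and the function names are all assumptions made for illustration.

```python
import hashlib

MAX_INLINE_BYTES = 1024  # illustrative threshold, not a ZooKeeper setting

blob_store = {}  # stands in for an external store such as a file system
znodes = {}      # stands in for ZooKeeper: path -> small byte string


def put(path, payload):
    """Keep small payloads inline; store large ones elsewhere and keep a pointer."""
    if len(payload) <= MAX_INLINE_BYTES:
        znodes[path] = b"inline:" + payload
        return
    key = hashlib.sha256(payload).hexdigest()
    blob_store[key] = payload
    znodes[path] = b"ref:" + key.encode("ascii")


def get(path):
    """Resolve a znode value back to the payload it stands for."""
    kind, _, rest = znodes[path].partition(b":")
    if kind == b"inline":
        return rest
    return blob_store[rest.decode("ascii")]
```

With this layout a 10 KB payload costs ZooKeeper only a 68-byte znode value, so the transaction log and the replication traffic stay small no matter how large the referenced data grows.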


Reference links

    • http://zookeeper.apache.org/

    • http://zookeeper.apache.org/doc/r3.3.4/zookeeperOver.html

    • http://wiki.apache.org/hadoop/ZooKeeper/PoweredBy

    • http://www.ibm.com/developerworks/cn/opensource/os-cn-zookeeper/

    • http://zookeeper.apache.org/doc/r3.3.4/recipes.html

    • http://zookeeper.apache.org/doc/r3.3.4/zookeeperProgrammers.html
