Zookeeper Getting Started

Zookeeper Introduction

Zookeeper is a highly available distributed data-management and coordination framework built on the Paxos algorithm. It provides a set of primitives that higher-level applications can use for synchronization, configuration management, naming services, master election, distributed locks, distributed queues, and so on.

Zookeeper provides the following service guarantees:

    • Sequential consistency: updates from a client are applied in the order in which they were sent
    • Atomicity: an update either succeeds or fails; there is no partial result
    • Single system image: a client sees the same view of the service no matter which server it is connected to
    • Reliability: once an update has been applied, it persists until it is overwritten by another update
    • Timeliness: the view of the system seen by clients is guaranteed to be up to date within a certain time bound

Zookeeper's Design Goals

1. Simple

Zookeeper lets distributed processes coordinate with one another through a shared hierarchical namespace organized much like a file system. The data unit in the namespace is called a znode, the equivalent of a file or directory. Unlike a typical file system, which persists data to storage, Zookeeper keeps its data in memory, which is how it achieves high throughput and low latency.
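
A minimal sketch of the hierarchical namespace using the standard Java client (the connect string and paths are assumptions):

    import org.apache.zookeeper.*;

    public class ZnodeHierarchyDemo {
        public static void main(String[] args) throws Exception {
            // Connect to a local Zookeeper server.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});

            // Znodes form a tree addressed by slash-separated paths, like files and directories.
            zk.create("/app", "app config".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.create("/app/worker", "worker metadata".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            zk.close();
        }
    }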

2. High performance, high availability, and strict sequential access

High performance allows Zookeeper to be used in large-scale distributed systems; high availability avoids a single point of failure; and strict ordering means complex synchronization operations can be implemented at the client.

3. Clustering

Zookeeper itself also runs as a cluster (an ensemble):

A client connects to a single server and maintains a TCP connection over which it sends requests, receives responses, obtains watch events, and sends heartbeats. If the TCP connection to the server breaks, the client connects to another server. Every server that makes up the Zookeeper service knows about the others; each maintains an in-memory image of the state, with transaction logs and snapshots kept in persistent storage. As long as a majority of the servers are available, the Zookeeper service is available.
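
A minimal sketch of this from the client side: the standard Java client takes a comma-separated server list and transparently reconnects to another server from the list if the current connection breaks (host names and port are assumptions):

    import org.apache.zookeeper.ZooKeeper;

    // The client picks one server from the list; if that connection breaks,
    // it retries the remaining servers for as long as the session is alive.
    ZooKeeper zk = new ZooKeeper(
            "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181",
            30000,   // session timeout in milliseconds
            event -> System.out.println("state: " + event.getState()));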

4. Ordered

Zookeeper stamps each update with a number that reflects the order of all transactions. Later operations can use this ordering to implement higher-level abstractions, such as synchronization primitives.

5. Fast

This is particularly noticeable in read-dominated workloads. Zookeeper applications run on thousands of machines, and Zookeeper performs best when reads outnumber writes, at a ratio of around 10:1.

Architecture of Zookeeper

All the servers together make up the Zookeeper service; they communicate with one another via an atomic broadcast protocol (ZAB, described below), and a leader is elected to broadcast all change messages to the followers. All followers communicate with the leader: they accept change messages from it and apply them to their own in-memory state. A follower forwards write requests from its clients to the leader, while clients' read requests are served directly by the follower without being forwarded.

ZooKeeper Atomic Broadcast (ZAB) protocol

Yahoo published the ZAB protocol paper in 2011. Its key points are:

    • A leader is elected, and only the leader can propose a transaction. Because the order in which messages are sent matters, internal communication is done over TCP, which preserves ordering.
    • Commit uses a two-phase protocol, but without abort.
    • Message transmission is based on state increments.
    • The protocol provides high availability and high performance.

Zookeeper Data Model

Zookeeper's data model looks very much like a standard file system: a name is a sequence of path elements separated by the "/" delimiter, and every node in the Zookeeper namespace is identified by its path.

The view structure of Zookeeper is similar to a standard UNIX file system, but it does not borrow the file-system concepts of directories and files; instead it uses its own node concept, the znode. A znode is the smallest data unit in Zookeeper: each znode can both store data and have child nodes mounted under it, and together the znodes form a hierarchical namespace, i.e. a tree.

Znode node types

Every Zookeeper node has a life cycle that depends on its type. In Zookeeper, node types divide into persistent nodes (persistent), ephemeral nodes (ephemeral), and sequential nodes (sequential); these are combined at creation time and can produce the following four node types:

1. Persistent node (persistent)

A persistent node remains in existence after it is created until a delete operation explicitly removes it; it does not disappear when the client session that created it ends.

2. Persistent sequential node (persistent_sequential)

The basic characteristics of this type match the node type above. The extra feature is that in ZK, each parent node maintains a creation order for its first-level children, recording the order in which each child was created. When you create a child node with this property set, ZK automatically appends a monotonically increasing number suffix to the given node name to form the new name. The upper limit of this suffix is the maximum value of a signed integer.

3. Ephemeral node (ephemeral)

Unlike persistent nodes, the life cycle of an ephemeral node is bound to the client session: if the client session ends, the node is automatically removed. Note that this refers to session expiry, not a mere disconnection. In addition, child nodes cannot be created under ephemeral nodes.

4. Ephemeral sequential node (ephemeral_sequential)

Such a node combines the two features above: its life cycle is bound to the client session, and ZK appends an automatically incremented number suffix to its name at creation time.
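
A minimal sketch of creating each of the four node types with the standard Java client; the paths and payloads are illustrative, and zk is assumed to be an already-connected org.apache.zookeeper.ZooKeeper handle:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;

    // Persistent: survives its creator's session until explicitly deleted.
    zk.create("/app/config", "v1".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Persistent sequential: the server appends a counter, e.g. /app/job0000000003.
    zk.create("/app/job", "payload".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);

    // Ephemeral: removed automatically when the creating session expires.
    zk.create("/app/alive", new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // Ephemeral sequential: session-bound and numbered; the basis of lock recipes.
    zk.create("/app/lock-", new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);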

Node information
[zk: localhost:2181(CONNECTED) 4] get /yinshi.monitor.alive.check
?t 10.232.102.191:2181 1353595654255
cZxid = 0x300000002
ctime = Thu Dec 23:29:53 CST 2011
mZxid = 0xe00008bbf
mtime = Thu Jul 07:17:34 CST 2012
pZxid = 0x300000002
cversion = 0
dataVersion = 2164293
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 39
numChildren = 0

The above is output from the ZK command line, and it shows clearly what information a ZK node carries. The more important fields include the node's data content, the transaction IDs of the node's creation and last modification (cZxid, mZxid), the creation and modification times (ctime, mtime), the current data version (dataVersion), the data length, and the number of children.

Zookeeper is designed to store data used to manage services: status information, configuration, location information, and so on. The data stored at each node is therefore usually small, from a few bytes to a few kilobytes; the default upper limit is 1 MB.

A znode maintains a stat structure that includes version numbers for data changes and ACL changes, plus timestamps, to allow cache validation and coordinated updates. Each time a znode's data changes, its version number is incremented. Whenever a client retrieves data, it also receives the data's version number.
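
A minimal sketch of a coordinated update using the stat version with the standard Java client; zk and the path are assumptions, and newValue() stands in for whatever transformation the application applies:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.data.Stat;

    Stat stat = new Stat();
    byte[] data = zk.getData("/app/config", false, stat);  // read the data and its version

    try {
        // The write succeeds only if nobody has changed the node since our read.
        zk.setData("/app/config", newValue(data), stat.getVersion());
    } catch (KeeperException.BadVersionException e) {
        // Someone else updated the node first; re-read and retry.
    }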

Reads and writes of a znode's data are atomic: a read returns all of the znode's data, and a write replaces all of it. Each node has an access control list (ACL) that restricts who can do what.

Tips: node size

The jute.maxbuffer property in the configuration sets the maximum size of the data a node can store; the default is 0xfffff bytes, i.e. roughly 1 MB. Exceeding this value can make the Zookeeper system unstable or even crash it, and a plain restart will not fix the problem, so creating nodes with more than 1 MB of data is prohibited. Given that node data above roughly 10 KB already has a noticeable impact on performance, it is recommended to keep data sizes within 10 KB wherever possible.
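
If a deployment genuinely needs larger nodes, jute.maxbuffer is a Java system property that must be raised consistently on every server and client JVM. A minimal sketch; the 4 MB value is only an example, and raising the limit is discouraged for the reasons above:

    // Must take effect before the ZooKeeper serialization classes are loaded,
    // e.g. via the JVM flag -Djute.maxbuffer=4194304 on servers and clients alike.
    System.setProperty("jute.maxbuffer", String.valueOf(4 * 1024 * 1024));  // 4 MB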

Zookeeper Basic API

One of Zookeeper's design goals is easy programming, so it supports only the following operations (a usage sketch follows the list):

    • create: creates a node at a location in the namespace
    • delete: deletes a node
    • exists: tests whether a node exists at a location
    • get data: reads the data of a node
    • set data: writes data to a node
    • get children: retrieves the list of children of a node
    • sync: waits for data to be propagated (synchronizes the client's view)
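
A minimal sketch exercising these operations with the standard Java client; the connect string and paths are assumptions, and error handling is omitted:

    import org.apache.zookeeper.*;
    import org.apache.zookeeper.data.Stat;
    import java.util.List;

    public class BasicApiDemo {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});

            String path = zk.create("/demo", "hello".getBytes(),
                                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            Stat stat = zk.exists(path, false);          // null if the node does not exist
            byte[] data = zk.getData(path, false, stat); // read data and fill in the stat
            zk.setData(path, "world".getBytes(), stat.getVersion());

            List<String> children = zk.getChildren("/", false);
            System.out.println("children of /: " + children);

            zk.delete(path, -1);                         // -1 skips the version check
            zk.close();
        }
    }
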
ZooKeeper Session

A long-lived connection is established between the client and a server. Once the connection is up, the server generates a session ID and returns it to the client. The client periodically sends ping packets to check and maintain the connection. Once the session ends or times out, all ephemeral nodes it created are deleted. The client can choose an appropriate session timeout for its situation.
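
A minimal sketch of inspecting the negotiated session with the standard Java client (the connect string is an assumption):

    import org.apache.zookeeper.ZooKeeper;

    ZooKeeper zk = new ZooKeeper("localhost:2181",
                                 15000,   // requested session timeout in milliseconds
                                 event -> {});

    // The server may adjust the timeout within its configured bounds.
    System.out.println("session id: 0x" + Long.toHexString(zk.getSessionId()));
    System.out.println("negotiated timeout: " + zk.getSessionTimeout() + " ms");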

ZooKeeper Watches

A watch is an event-listening mechanism that the client installs on the server; when a watched change occurs, the server sends the client a notification. The client delivers all events as synchronous callbacks, in order, on a single thread. A watch is removed automatically each time it fires; to keep listening for the event, you must re-install the watch. There is therefore no guarantee of tracking every change, and you should avoid installing a large number of watches on the same node. The rules for creating and triggering watches: the read APIs (exists, get data, get children) are how watches are installed, and the write APIs (create, set data, delete) are what trigger them. For example, a watch installed via exists on a node fires when that node is created.
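
A minimal sketch of the one-shot semantics, re-registering the watch each time it fires; zk and the path are assumptions:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;

    Watcher watcher = new Watcher() {
        @Override
        public void process(WatchedEvent event) {
            System.out.println("event: " + event.getType() + " on " + event.getPath());
            try {
                // A watch fires only once; re-register to keep listening.
                zk.exists("/app/config", this);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    };

    // Initial registration: fires once when /app/config is created, deleted, or changed.
    zk.exists("/app/config", watcher);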

ZooKeeper Observer

An observer is in effect a follower that does not take part in elections. Increasing the number of observers further raises the cluster's serving capacity without increasing the cost of re-electing a leader. Observers also give good support for cross-datacenter deployments: reads are served locally while writes go off-site to the leader.
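
A minimal configuration sketch for a three-voter ensemble plus one observer; the host names are illustrative. The observer's own zoo.cfg carries peerType=observer, and every server's configuration marks the observer in the server list:

    # In the observer's zoo.cfg:
    peerType=observer

    # In every server's zoo.cfg (voters and observer alike):
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888
    server.4=obs1.example.com:2888:3888:observer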

Zookeeper Roles
    • Leader: the leader does not accept client requests; it is responsible for initiating and resolving votes on proposals and for the final state updates.
    • Follower: a follower receives client requests, returns results to the client, and takes part in votes initiated by the leader.
    • Observer: an observer can accept client connections and forwards write requests to the leader, but it does not take part in voting; it only synchronizes the leader's state. Observers provide a way to scale the system.
    • Learner: followers and observers, both of which synchronize state from the leader, are collectively called learners.

The composition of Zookeeper, viewed from a higher level:

The replicated database is an in-memory database holding all the data of the namespace. Update operations are logged to disk so they can be replayed on recovery, and a write is serialized to disk before it is applied to the in-memory database. A Zookeeper service is usually made up of 2n+1 servers; as long as n+1 of them (a majority) are available, the whole system remains available.

When a follower receives a client request: for a read, the follower answers directly from its local in-memory database; for a write, which changes the state of the system, the request goes to the leader, who proposes it for a vote, and once more than half of the servers accept it, the result is returned to the client.

The core of Zookeeper is atomic broadcast, the mechanism that keeps the servers in sync with one another. The protocol that implements it is called the ZAB protocol. ZAB has two modes: recovery mode and broadcast mode. When the service starts, or after the leader crashes, ZAB enters recovery mode; recovery ends once a leader has been elected and a majority of servers have synchronized their state with the leader. This state synchronization ensures that the leader and the servers hold the same system state.

Broadcast mode must ensure that proposals are processed in order, so ZK guarantees this with a monotonically increasing transaction ID, the zxid. Every proposal is stamped with a zxid when it is issued. The zxid is a 64-bit number: the high 32 bits are an epoch that identifies whether the leadership has changed (each newly elected leader gets a new epoch), and the low 32 bits are an incrementing counter. When the leader crashes, or the leader loses a majority of its followers, ZK enters recovery mode, which elects a new leader and restores all servers to a correct state.
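
A minimal sketch of how the 64-bit zxid packs the epoch and the counter, using the mZxid from the node dump earlier in this article as the sample value:

    long zxid = 0xe00008bbfL;           // e.g. the mZxid shown above

    long epoch   = zxid >>> 32;         // high 32 bits: the leader epoch
    long counter = zxid & 0xffffffffL;  // low 32 bits: the per-epoch counter

    System.out.printf("epoch=%d counter=%d%n", epoch, counter);  // epoch=14 counter=35775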

Election and synchronization process

ZK's implementation is based on the Paxos algorithm (mainly Fast Paxos). Specifically, election proceeds as follows (a vote-comparison sketch follows the list):

    1. After each server starts, it asks the other servers whom they are voting for.
    2. When queried, a server replies with the ID of its currently recommended leader and the zxid of the last transaction it processed (at system startup, every server recommends itself).
    3. After receiving all replies, a server determines which server has the largest zxid and sets that server's information into its next vote.
    4. The winner of this process is the server with the highest number of votes; if a server holds more than half of the votes, it is elected leader. Otherwise, the process repeats until a leader is elected.
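
A minimal sketch of the vote-comparison rule implied by steps 3 and 4. This is illustrative rather than ZooKeeper's actual FastLeaderElection code: the larger zxid wins, with the server ID as the tie-breaker:

    // Returns true if the candidate vote should replace this server's current vote.
    static boolean shouldSwitchVote(long candidateZxid, long candidateId,
                                    long currentZxid, long currentId) {
        if (candidateZxid != currentZxid) {
            return candidateZxid > currentZxid;  // prefer the most up-to-date server
        }
        return candidateId > currentId;          // tie-breaker: larger server ID wins
    }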

Also, in recovery mode, a server that is recovering from a crash or has just started first restores its data and session information from a disk snapshot (ZK records a transaction log and snapshots it periodically, precisely so that state can be restored on recovery). After the leader is elected, Zookeeper enters the state synchronization phase.

The state synchronization process is as follows:

    1. The leader starts up and waits for servers to connect.
    2. A follower connects to the leader and sends it its largest zxid.
    3. The leader determines the synchronization point from the follower's zxid.
    4. When synchronization is complete, the leader notifies the follower that it is now in the uptodate state.
    5. On receiving the uptodate message, the follower can begin accepting client requests again.

Zookeeper Performance Data

1. Create/delete/read performance

[Figures: node creation, node deletion, and node read performance tests]

2. Combined read/write performance curve

[Figure: combined read/write performance curve]

Common Application Scenarios for Zookeeper

1. Naming service

A naming service is a common scenario in distributed systems. With a naming service, a client application can look up information such as the address of a resource or service and its provider by a specified name. A named entity can be a machine in a cluster, a provided service address, a remote object, and so on: all things we can collectively give names to. A common case is the service address list in distributed service frameworks. By calling the node-creation API that ZK provides, it is easy to create a globally unique znode that all applications read the address from, avoiding hard-coded addresses. A sketch follows.
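
A minimal sketch of the pattern; the paths and address format are assumptions, and the parent znodes are assumed to exist. A provider publishes its address under a well-known name, and clients resolve the name at runtime instead of hard-coding the address:

    // Provider side: publish the service address under a well-known znode.
    zk.create("/services/order-service/addr", "10.0.0.7:8080".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // Client side: resolve the name to an address at runtime.
    byte[] addr = zk.getData("/services/order-service/addr", false, null);
    System.out.println("order-service at " + new String(addr));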

2. Distributed Notification/coordination

The watcher registration and asynchronous notification mechanism in Zookeeper can implement notification and coordination between different systems in a distributed environment, giving near-real-time handling of data changes. The usual approach is for the different systems to register watches on the same znode in ZK, monitoring changes to it (both the znode itself and its children); when one system updates the znode, the other systems receive the notification and respond accordingly.

3. Distributed lock

This scenario rests largely on the strong data consistency that Zookeeper guarantees. Lock services fall into two categories: those that maintain exclusivity, and those that control ordering.

    • Exclusivity means that of all the clients trying to obtain the lock, only one can ultimately succeed. The usual practice is to treat a znode on ZK as the lock and implement it with create: all clients try to create the /distribute_lock node, and the client whose creation succeeds holds the lock.
    • Controlling ordering means that all clients trying to obtain the lock will eventually be scheduled to run, but in a global order. The procedure is much like the above, except that /distribute_lock exists in advance and each client creates an ephemeral sequential node under it (specified at creation time via CreateMode.EPHEMERAL_SEQUENTIAL). The parent node (/distribute_lock) maintains the sequence, guaranteeing the creation order of its children and thereby a global ordering across clients; a sketch of this recipe follows the list.
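
A minimal sketch of the ordering lock, assuming /distribute_lock already exists and zk is a connected handle. For brevity it re-checks all children on every change rather than watching only its immediate predecessor, as a production recipe would to avoid the herd effect:

    import org.apache.zookeeper.*;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;

    // Acquire: create an ephemeral sequential node, then wait to be the smallest.
    String me = zk.create("/distribute_lock/lock-", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    String myName = me.substring("/distribute_lock/".length());

    while (true) {
        CountDownLatch changed = new CountDownLatch(1);
        List<String> children =
                zk.getChildren("/distribute_lock", event -> changed.countDown());
        Collections.sort(children);
        if (children.get(0).equals(myName)) {
            break;             // smallest sequence number: the lock is ours
        }
        changed.await();       // the child list changed; re-check the order
    }

    // ... critical section ...

    zk.delete(me, -1);         // release: the next client in order can proceed
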
4. Distributed queues

There are two simple kinds: a regular FIFO queue, and a queue whose members must all arrive before executing in a unified order. The first, the FIFO queue, works on the same principle as the ordering-control scenario in the distributed lock service above, so it is not repeated here.

The second kind of queue is an enhancement of the FIFO queue. Usually a /queue/num node is created in advance under the /queue znode and assigned the value n (or n is assigned to /queue directly), indicating the size of the queue; then, each time a member joins, a check determines whether the queue has reached size n and execution can begin. A typical scenario: in a distributed environment, a large task A needs many subtasks to finish (or their conditions to be ready) before it can proceed. Whenever one of the subtasks completes (becomes ready), it creates its own ephemeral sequential node (CreateMode.EPHEMERAL_SEQUENTIAL) under /tasklist; when the children of /tasklist reach the specified number, the next step can proceed in order. A sketch follows.
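
A minimal sketch of this barrier-style queue, assuming /tasklist exists and stores the required member count n as its data (that layout is an assumption; the text above also suggests a separate /queue/num child for the same purpose):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;

    // Each subtask announces readiness with an ephemeral sequential child.
    zk.create("/tasklist/task-", new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    // Wait until the required number of subtasks have checked in.
    int n = Integer.parseInt(new String(zk.getData("/tasklist", false, null)));
    while (true) {
        CountDownLatch changed = new CountDownLatch(1);
        List<String> done = zk.getChildren("/tasklist", event -> changed.countDown());
        if (done.size() >= n) {
            break;             // all subtasks are ready: task A may proceed
        }
        changed.await();
    }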
