Principles of ZooKeeper

Last Update:2014-10-22 Source: Internet

Author: User

Tags zookeeper client


ZooKeeper is an open-source coordination service for distributed applications. It exposes a small set of simple primitives on which distributed applications can build synchronization, configuration maintenance, and naming services. ZooKeeper began as a subproject of Hadoop; its history is not covered here. In distributed applications, locking mechanisms are hard for engineers to use correctly, and message-based coordination does not suit every application, so a reliable, scalable, distributed, and configurable coordination mechanism is needed to keep the state of the system consistent. That is the purpose of ZooKeeper. This article briefly analyzes how ZooKeeper works and does not focus on how to use it.

1 Basic Concepts of ZooKeeper

1.1 Roles

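ZooKeeper has three types of roles. The role table from the original article is not available here; the roles discussed in the rest of this article are the leader, the follower, and the observer (followers and observers are together called learners).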

System Model:

1.2 Design Goals

1. Eventual consistency: whichever server a client connects to, it is shown the same view. This is ZooKeeper's most important property.

2. Reliability: the service is simple and robust and performs well. If a message m is accepted by one server, it will be accepted by all servers.

3. Timeliness: ZooKeeper guarantees that a client learns of server updates or server failures within a bounded time interval. Because of network latency and similar factors, however, ZooKeeper cannot guarantee that two clients see newly updated data at the same moment. If you need the latest data, call the sync() interface before reading.

4. Wait-free: slow or failed clients cannot interfere with the requests of fast clients, so every client's requests can make progress without waiting on other clients.

5. Atomicity: an update either succeeds or fails; there is no intermediate state.

6. Ordering: this covers both global order and partial order. Global order means that if message A is published before message B on one server, message A is published before message B on all servers. Partial order means that if message B is published after message A by the same sender, message A is ordered before message B.

2 Working Principles of ZooKeeper

The core of ZooKeeper is atomic broadcast, which keeps the servers synchronized. The protocol that implements it is called Zab. Zab has two modes: recovery mode (leader election) and broadcast mode (synchronization). Zab enters recovery mode when the service starts or after the leader crashes. Recovery mode ends once a leader has been elected and a majority of servers have synchronized their state with the leader. State synchronization ensures that the leader and the other servers hold the same system state.

To keep transactions consistently ordered, ZooKeeper identifies each transaction with a monotonically increasing transaction ID (zxid). Every proposal is stamped with a zxid when it is issued. In the implementation, a zxid is a 64-bit number. The high 32 bits hold the epoch, which indicates whether leadership has changed: each newly elected leader gets a new epoch that identifies its term of leadership. The low 32 bits are an incrementing counter.
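
The zxid layout described above can be sketched in a few lines of C. The helper names (zxid_make, zxid_epoch, zxid_counter, zxid_next_epoch) are hypothetical and are not part of the ZooKeeper API; an unsigned 64-bit type is used here only to keep the bit arithmetic simple.

```c
#include <stdint.h>

/* Illustrative helpers only; these names are not ZooKeeper functions. */

/* Build a zxid: epoch in the high 32 bits, counter in the low 32 bits. */
uint64_t zxid_make(uint32_t epoch, uint32_t counter) {
    return ((uint64_t)epoch << 32) | (uint64_t)counter;
}

/* Extract the leader epoch (high 32 bits). */
uint32_t zxid_epoch(uint64_t zxid) {
    return (uint32_t)(zxid >> 32);
}

/* Extract the per-epoch counter (low 32 bits). */
uint32_t zxid_counter(uint64_t zxid) {
    return (uint32_t)(zxid & 0xFFFFFFFFu);
}

/* A newly elected leader starts a new epoch with the counter reset to 0. */
uint64_t zxid_next_epoch(uint64_t zxid) {
    return zxid_make(zxid_epoch(zxid) + 1u, 0u);
}
```

Because the epoch occupies the high bits, any zxid from a later epoch compares greater than every zxid from an earlier one, whatever the counter values are.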

Each server is in one of three states while it is running:

LOOKING: the server does not know who the leader is.
LEADING: the server is the elected leader.
FOLLOWING: a leader has been elected and the server is synchronized with it.

2.1 Leader Election Process

When the leader crashes or loses the majority of its followers, ZooKeeper enters recovery mode. In recovery mode a new leader must be elected so that all servers can be brought back to a correct state. ZooKeeper has two kinds of election algorithm: one based on basic Paxos and one based on fast Paxos. The default is the fast variant (implemented in ZooKeeper as FastLeaderElection). We first describe the basic Paxos process:

1. The election thread is the thread on the current server that initiates the election. Its main job is to tally the votes and choose the server to recommend;
2. The election thread first sends a query to all servers (including itself);
3. When a reply arrives, the election thread checks that it answers a query it sent (by verifying that the zxid matches), records the other server's ID (myid) in the list of servers queried so far, and stores the leader information proposed by that server (ID, zxid) in the voting record for the current election;
4. After replies from all servers have been received, the thread works out which server has the largest zxid and makes it the server to vote for in the next round;
5. The thread sets the server with the largest zxid as the leader recommended by the current server. If that server has received n/2 + 1 votes, it becomes the elected leader and the current server sets its own state according to the winner's information. Otherwise the process repeats until a leader is elected.
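
Steps 4 and 5 above can be sketched as follows. This is a simplified illustration rather than ZooKeeper's election code: the struct and function names are hypothetical, and breaking ties by the larger server ID is an assumption that the steps above do not state.

```c
#include <stddef.h>
#include <stdint.h>

/* One reply collected by the election thread: the leader that the
 * replying server currently proposes. Field names are illustrative. */
struct vote {
    int      leader_id;  /* myid of the proposed leader */
    uint64_t zxid;       /* zxid reported for that proposal */
};

/* Step 4: pick the proposal with the largest zxid (requires n >= 1).
 * Assumption: ties are broken in favor of the larger server ID. */
struct vote pick_candidate(const struct vote *votes, size_t n) {
    struct vote best = votes[0];
    for (size_t i = 1; i < n; i++) {
        if (votes[i].zxid > best.zxid ||
            (votes[i].zxid == best.zxid &&
             votes[i].leader_id > best.leader_id)) {
            best = votes[i];
        }
    }
    return best;
}

/* Step 5: the candidate wins once at least ensemble_size/2 + 1 of the
 * collected votes name it. Returns 1 if it has won, 0 otherwise. */
int has_won(const struct vote *votes, size_t n,
            int candidate_id, size_t ensemble_size) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (votes[i].leader_id == candidate_id) {
            count++;
        }
    }
    return count >= ensemble_size / 2 + 1;
}
```

In a five-server ensemble, for example, has_won() returns 1 only when at least three of the collected votes name the same candidate.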

From this process we can conclude that, for the leader to win the support of a majority of servers, an ensemble is normally sized as an odd number of servers, 2n + 1, and at least n + 1 of them must be alive. An even-sized ensemble also works, but it tolerates no more failures than the next smaller odd size.
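
The majority arithmetic behind this conclusion is small enough to write down directly. The function names below are hypothetical and serve only to illustrate the calculation.

```c
/* Smallest number of servers that forms a majority of the ensemble. */
unsigned quorum_size(unsigned ensemble_size) {
    return ensemble_size / 2 + 1;
}

/* Number of servers that can fail while a majority can still be formed. */
unsigned tolerated_failures(unsigned ensemble_size) {
    if (ensemble_size == 0) {
        return 0;  /* an empty ensemble tolerates nothing */
    }
    return ensemble_size - quorum_size(ensemble_size);
}
```

With 5 servers the quorum is 3 and 2 failures are tolerated; with 6 servers the quorum rises to 4 but still only 2 failures are tolerated, which is why odd sizes are the usual choice.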

Every server repeats this process after it starts. In recovery mode, a server that has just started or is recovering from a crash restores its data and session information from a snapshot on disk. ZooKeeper writes a transaction log and takes snapshots periodically so that state can be restored during recovery. The specific flowchart of the leader election is as follows:

In the fast Paxos process, a server first proposes itself to all servers as the leader. When the other servers receive the proposal, they resolve any epoch and zxid conflicts, accept the proposal, and send back a message saying that they have accepted it. The process repeats until a leader is elected. The flowchart is as follows:

2.2 Synchronization Process

Once a leader has been selected, ZooKeeper enters the state synchronization process.

1. The leader waits for the servers to connect;
2. Each follower connects to the leader and sends it its largest zxid;
3. The leader determines the synchronization point from the follower's zxid;
4. When synchronization is complete, the leader notifies the follower that it is up to date (the uptodate message);
5. After receiving the uptodate message, the follower can again accept client requests and serve them.
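
Step 3 does not say how the leader chooses the synchronization point. One plausible decision rule is sketched below; the four modes, the function name, and the idea of an oldest retained zxid are assumptions made for illustration, and the real implementation is more involved.

```c
#include <stdint.h>

/* Hypothetical synchronization modes, for illustration only. */
enum sync_mode {
    SYNC_NONE,     /* follower already matches the leader */
    SYNC_DIFF,     /* send only the proposals the follower is missing */
    SYNC_TRUNC,    /* follower is ahead: roll it back to the leader's zxid */
    SYNC_SNAPSHOT  /* follower is too far behind: send a full snapshot */
};

/* Choose how to bring a follower up to date.
 * oldest_logged_zxid is assumed to be the oldest proposal the leader
 * still keeps in its log; leader_last_zxid is its newest proposal. */
enum sync_mode choose_sync(uint64_t follower_zxid,
                           uint64_t oldest_logged_zxid,
                           uint64_t leader_last_zxid) {
    if (follower_zxid == leader_last_zxid) {
        return SYNC_NONE;
    }
    if (follower_zxid > leader_last_zxid) {
        return SYNC_TRUNC;
    }
    if (follower_zxid >= oldest_logged_zxid) {
        return SYNC_DIFF;
    }
    return SYNC_SNAPSHOT;
}
```

Under these assumptions, a follower at zxid 70 talking to a leader that retains proposals 50 through 100 would receive only the missing proposals, while a follower at zxid 10 would receive a snapshot.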

The flowchart is as follows:

2.3 Workflow

2.3.1 Leader Workflow

The leader has three main functions:

1. Restore data;
2. Maintain heartbeats with the learners, receive learner requests, and determine the type of each request message;
3. Process each learner message according to its type. The message types are PING, REQUEST, ACK, and REVALIDATE.

The PING message carries a learner's heartbeat. The REQUEST message is a proposal sent by a follower, covering write requests and sync requests. The ACK message is a follower's reply to a proposal; once more than half of the followers have acknowledged it, the leader commits the proposal. The REVALIDATE message extends the validity period of a session.
The leader workflow is shown in the following figure. The actual implementation is considerably more complex than this and starts three threads to do the work.
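
The ACK rule described above (commit once more than half have acknowledged) can be sketched as follows. This is a minimal illustration with hypothetical names, not the leader's actual code; it assumes at most 64 voters so that the acknowledgements fit in one bitmask.

```c
#include <stdint.h>

/* Bookkeeping for one outstanding proposal. Illustrative only. */
struct proposal {
    uint64_t zxid;       /* transaction ID of the proposal */
    uint64_t acked;      /* bit i is set once voter i has sent an ACK */
    unsigned voters;     /* number of voters, assumed to be at most 64 */
    int      committed;  /* 1 once the proposal has been committed */
};

/* Count the set bits in x. */
static unsigned count_bits(uint64_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Record an ACK from voter `id`. Returns 1 if this ACK pushed the
 * proposal over the "more than half" threshold and committed it. */
int proposal_ack(struct proposal *p, unsigned id) {
    if (id >= p->voters || p->committed) {
        return 0;
    }
    p->acked |= (uint64_t)1 << id;  /* duplicate ACKs set the same bit */
    if (count_bits(p->acked) * 2 > p->voters) {
        p->committed = 1;
        return 1;
    }
    return 0;
}
```

Using a bitmask means a repeated ACK from the same voter is counted only once, so a retransmitted message cannot commit a proposal early.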

2.3.2 Follower Workflow

The follower has four main functions:

1. Send messages to the leader (PING, REQUEST, ACK, and REVALIDATE messages);
2. Receive and process messages from the leader;
3. Receive client requests. If a request is a write request, forward it to the leader for voting;
4. Return results to the client.

The follower's message loop handles the following types of messages from the leader:

1. PING message: a heartbeat;
2. PROPOSAL message: a proposal initiated by the leader on which the follower must vote;
3. COMMIT message: information about the latest proposal on the server;
4. UPTODATE message: indicates that synchronization is complete;
5. REVALIDATE message: depending on the leader's revalidation result, either close the session that was waiting for revalidation or allow it to accept messages;
6. SYNC message: return the sync result to the client. This message is originally initiated by the client to force it to see the latest updates.

The follower workflow is shown in the following figure. In the actual implementation, the follower does this work with five threads.

The observer workflow is not described separately. The only difference between an observer and a follower is that the observer does not take part in the votes initiated by the leader.

Mainstream application scenarios:

How the mainstream ZooKeeper application scenarios are implemented (excluding the official examples):

(1) Configuration Management
Centralized configuration management is very common in application clusters. Companies typically run a centralized configuration center so that different application clusters can share their configuration and every machine in a cluster can be notified when the configuration changes.

ZooKeeper makes this easy to implement. For example, store all of app1's configuration under the znode /app1. At startup, every machine in app1 watches the node /app1 (zk.exists("/app1", true)) and implements the Watcher callback. When the data under /app1 changes, every machine is notified, the Watcher method runs, and the application then fetches the data (zk.getData("/app1", false, null)).

The example above watches at a coarse granularity, which keeps it simple. Finer-grained data can be watched hierarchically; all of this is a matter of design.
(2) Cluster Management
In an application cluster, every machine often needs to know which machines in the cluster (or in other clusters it depends on) are alive, and needs to be told quickly, without human intervention, when a machine goes away because of a crash, a network disconnection, or some other reason.

ZooKeeper makes this easy as well. Suppose the ZooKeeper server has a znode named /app1servers. When each machine in the cluster starts, it creates an EPHEMERAL node under that znode: server1 creates /app1servers/server1 (using the IP address in the name guarantees uniqueness), server2 creates /app1servers/server2, and both then watch the parent node /app1servers, so that any client watching it is notified when the data or the children under it change. EPHEMERAL nodes have one very important property: the node disappears when the connection between client and server is lost or the session expires. So when a machine goes down or is disconnected, its node disappears, and every client in the cluster that watches /app1servers is notified and fetches the latest list.

Another scenario is electing a master for the cluster: as soon as the master fails, a new master is chosen from the remaining machines. The steps are the same as above, except that the node each machine creates under /app1servers at startup is of type EPHEMERAL_SEQUENTIAL, so every node is numbered automatically.

By convention, the node with the smallest number is the master. Watching /app1servers gives each machine the server list, and as long as every machine applies the same rule, that the lowest-numbered node is the master, a master is elected. When the master goes down, its znode disappears and the new server list is pushed to the clients; every machine again treats the lowest-numbered node as the master, which gives dynamic master election.
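
The "lowest number wins" rule can be sketched as a pure function over the list of child names. The helper names are hypothetical, and the sketch assumes that each child name ends in the 10-digit sequence suffix that sequential nodes carry.

```c
#include <stdlib.h>
#include <string.h>

/* Parse the trailing 10-digit sequence number of a child name such as
 * "server0000000003". Returns -1 if the name has no such suffix. */
long long sequence_of(const char *name) {
    size_t len = strlen(name);
    if (len < 10) {
        return -1;
    }
    const char *suffix = name + len - 10;
    for (int i = 0; i < 10; i++) {
        if (suffix[i] < '0' || suffix[i] > '9') {
            return -1;
        }
    }
    return strtoll(suffix, NULL, 10);
}

/* Return the index of the child with the smallest sequence number,
 * or -1 if no child carries a sequence suffix. */
int find_master(const char *children[], int count) {
    int best = -1;
    long long best_seq = 0;
    for (int i = 0; i < count; i++) {
        long long seq = sequence_of(children[i]);
        if (seq < 0) {
            continue;  /* skip names without a sequence suffix */
        }
        if (best < 0 || seq < best_seq) {
            best = i;
            best_seq = seq;
        }
    }
    return best;
}
```

Every machine would run the same function over the same child list, so all of them agree on the master without exchanging any further messages.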

Introduction to ZooKeeper Watches

The ZooKeeper C API is declared in include/zookeeper.h, which also contains most of the C API's constants and struct declarations. If you are not familiar with the C API, it is worth reading zookeeper.h or using doxygen to generate the C API documentation.

The most distinctive and hardest-to-understand feature of ZooKeeper is the watch. Every read operation in ZooKeeper, namely getData(), getChildren(), and exists(), can set a watch. A watch event can be understood as a one-time trigger. The official definition is: "A watch event is one-time trigger, sent to the client that sets the watch, which occurs when the data for which the watch was set changes." There are three points to understand here:

One-time trigger
When the watched data changes, a watch event is sent to the client. For example, if the client calls getData("/znode1", true) and the data on /znode1 is later changed or deleted, the client receives a watch event for /znode1. If /znode1 changes again, the client receives no further notification unless it sets the watch on /znode1 again.
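
The one-time behaviour can be illustrated with a tiny in-memory model. This is not ZooKeeper client code: the struct and the sim_get/sim_set functions are hypothetical stand-ins for a read that sets a watch and a write that fires it.

```c
/* A toy znode with a single data watch. Illustrative only. */
struct znode_sim {
    int data;         /* current value of the node */
    int watch_armed;  /* 1 while a data watch is set */
    int events_sent;  /* number of watch events delivered so far */
};

/* Read the data. A non-zero `watch` arms the watch, in the same way
 * that passing a watch flag to a read operation does. */
int sim_get(struct znode_sim *z, int watch) {
    if (watch) {
        z->watch_armed = 1;
    }
    return z->data;
}

/* Change the data. An armed watch fires once and is then cleared,
 * so later changes are silent until the watch is set again. */
void sim_set(struct znode_sim *z, int value) {
    z->data = value;
    if (z->watch_armed) {
        z->events_sent++;
        z->watch_armed = 0;
    }
}
```

In this model, two consecutive writes after a single watched read deliver exactly one event; the client has to read with the watch flag again to hear about the next change.
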
Sent to the client
The ZooKeeper client communicates with the server over a socket, so a network failure can prevent a watch event from reaching the client. Watch events are sent to the watcher asynchronously, but ZooKeeper itself provides an ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event. Network latency or other factors may cause different clients to see a watch event at different times, but all clients see events in the same order.
The data for which the watch was set
This refers to the different ways in which a znode can change. You can think of ZooKeeper as maintaining two lists of watches: data watches and child watches. getData() and exists() set data watches, and getChildren() sets child watches. Alternatively, think of the watch type as following the kind of data returned: getData() and exists() return information about the znode itself, while getChildren() returns the list of children. Therefore a successful setData() triggers the data watches set on that znode. A successful create() triggers the data watch set on the znode being created and the child watch of its parent. A successful delete() triggers both the data watch and the child watch of the znode being deleted, as well as the child watch of its parent.
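
The rules in the paragraph above can be summarized as a lookup from operation to the set of watches it triggers. The enum and flag names here are hypothetical and exist only for this illustration.

```c
/* Operations that modify a znode. Illustrative names. */
enum znode_op { OP_CREATE, OP_DELETE, OP_SET_DATA };

/* Kinds of watch that an operation can trigger. */
#define W_NODE_DATA    (1u << 0)  /* data watch on the znode itself */
#define W_NODE_CHILD   (1u << 1)  /* child watch on the znode itself */
#define W_PARENT_CHILD (1u << 2)  /* child watch on the znode's parent */

/* Return the set of watches triggered by a successful operation. */
unsigned watches_triggered(enum znode_op op) {
    switch (op) {
    case OP_SET_DATA:
        return W_NODE_DATA;
    case OP_CREATE:
        return W_NODE_DATA | W_PARENT_CHILD;
    case OP_DELETE:
        return W_NODE_DATA | W_NODE_CHILD | W_PARENT_CHILD;
    }
    return 0;  /* unknown operation: nothing is triggered */
}
```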

Watches in ZooKeeper are lightweight, so they are cheap to set, maintain, and dispatch. While a client is disconnected from the ZooKeeper server it receives no watch notifications. After it reconnects, previously registered watches are re-registered and triggered if needed, which is usually transparent to developers. There is one case in which a watch event can be lost: a watch is set through exists() on a znode that does not exist yet, and the client is disconnected during the interval in which the znode is created and then deleted. In that case the client is not notified of the event even after it reconnects.

ZooKeeper C API Constants and Structures

Structures and constants related to ACLs:

struct Id:

struct Id {
    char *scheme;
    char *id;
};

struct ACL:

struct ACL {
    int32_t perms;
    struct Id id;
};

struct ACL_vector:

struct ACL_vector {
    int32_t count;
    struct ACL *data;
};

Constants related to znode access permissions

const int ZOO_PERM_READ; // allows the client to read the znode's value and its list of children.
const int ZOO_PERM_WRITE; // allows the client to set the znode's value.
const int ZOO_PERM_CREATE; // allows the client to create children under the znode.
const int ZOO_PERM_DELETE; // allows the client to delete children of the znode.
const int ZOO_PERM_ADMIN; // allows the client to execute set_acl().
const int ZOO_PERM_ALL; // allows the client to perform all operations; equivalent to the bitwise OR of all the flags above.
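
Since the permission constants are bit flags, an ACL entry can be checked with a mask test. The sketch below is self-contained, so it uses stand-in PERM_* values rather than including zookeeper.h; the bit positions are believed to match the ZOO_PERM_* constants but should be treated as an assumption and checked against the header. The acl_allows() helper is hypothetical.

```c
#include <stdint.h>

/* Stand-ins for the ZOO_PERM_* constants (assumed bit positions). */
enum {
    PERM_READ   = 1 << 0,
    PERM_WRITE  = 1 << 1,
    PERM_CREATE = 1 << 2,
    PERM_DELETE = 1 << 3,
    PERM_ADMIN  = 1 << 4,
    PERM_ALL    = PERM_READ | PERM_WRITE | PERM_CREATE |
                  PERM_DELETE | PERM_ADMIN
};

/* Same shape as the structures shown in this article. */
struct Id  { char *scheme; char *id; };
struct ACL { int32_t perms; struct Id id; };

/* Returns 1 if the ACL entry grants every permission bit in `wanted`. */
int acl_allows(const struct ACL *acl, int32_t wanted) {
    return (acl->perms & wanted) == wanted;
}
```

A read-only entry for ('world', 'anyone') would therefore pass acl_allows(&acl, PERM_READ) and fail acl_allows(&acl, PERM_READ | PERM_WRITE).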

Constants related to ACL IDs

struct Id ZOO_ANYONE_ID_UNSAFE; // ('world', 'anyone')
struct Id ZOO_AUTH_IDS; // ('auth', '')

Three standard ACLs

struct ACL_vector ZOO_OPEN_ACL_UNSAFE; // (ZOO_PERM_ALL, ZOO_ANYONE_ID_UNSAFE)
struct ACL_vector ZOO_READ_ACL_UNSAFE; // (ZOO_PERM_READ, ZOO_ANYONE_ID_UNSAFE)
struct ACL_vector ZOO_CREATOR_ALL_ACL; // (ZOO_PERM_ALL, ZOO_AUTH_IDS)

Constants related to interest: ZOOKEEPER_WRITE, ZOOKEEPER_READ

These two constants identify the events of interest and tell ZooKeeper which events have occurred. They can be combined with bitwise OR to express multiple interests (write and read), and are generally used with the zookeeper_interest() and zookeeper_process() functions.

Constants related to node creation: ZOO_EPHEMERAL, ZOO_SEQUENCE

These are flags for the zoo_create() function. ZOO_EPHEMERAL creates an ephemeral (temporary) node. ZOO_SEQUENCE appends an incrementing sequence number to the node name (the suffix is normally padded to 10 digits, for example /xyz0000000000, /xyz0000000001, /xyz0000000002, ...). ZOO_EPHEMERAL and ZOO_SEQUENCE can also be combined.
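
The naming rule for sequential nodes can be sketched as below. In a real deployment the server assigns the counter and the client only passes the flag; this stand-alone function, with its hypothetical name and stand-in FLAG_* values (assumed to mirror ZOO_EPHEMERAL and ZOO_SEQUENCE), only shows what the resulting name looks like.

```c
#include <stdio.h>

/* Stand-ins for ZOO_EPHEMERAL and ZOO_SEQUENCE (assumed bit values). */
#define FLAG_EPHEMERAL (1 << 0)
#define FLAG_SEQUENCE  (1 << 1)

/* Write the final node name into `out`. With FLAG_SEQUENCE set, the
 * counter is appended as a zero-padded 10-digit suffix. Returns the
 * length of the name, as reported by snprintf(). */
int make_node_name(char *out, size_t out_len,
                   const char *path, int flags, int counter) {
    if (flags & FLAG_SEQUENCE) {
        return snprintf(out, out_len, "%s%010d", path, counter);
    }
    return snprintf(out, out_len, "%s", path);
}
```

Calling make_node_name(buf, sizeof buf, "/xyz", FLAG_EPHEMERAL | FLAG_SEQUENCE, 2) fills buf with /xyz0000000002, while the same call without FLAG_SEQUENCE leaves the path unchanged.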

Constants related to the connection state

The following constants describe the state of the ZooKeeper connection. They are usually passed as an argument to the watcher callback function.

ZOOAPI const int	ZOO_EXPIRED_SESSION_STATE
ZOOAPI const int	ZOO_AUTH_FAILED_STATE
ZOOAPI const int	ZOO_CONNECTING_STATE
ZOOAPI const int	ZOO_ASSOCIATING_STATE
ZOOAPI const int	ZOO_CONNECTED_STATE

Constants related to watch types

The following constants identify the type of a watch event. They are usually passed as the event type argument of the watcher callback function.

ZOO_CREATED_EVENT; // a node has been created (it did not exist before); the watch is set through zoo_exists().
ZOO_DELETED_EVENT; // a node has been deleted; the watch is set through zoo_exists() and zoo_get().
ZOO_CHANGED_EVENT; // a node has changed; the watch is set through zoo_exists() and zoo_get().
ZOO_CHILD_EVENT; // a child event; the watch is set through zoo_get_children() and zoo_get_children2().
ZOO_SESSION_EVENT; // the session has been lost.
ZOO_NOTWATCHING_EVENT; // a watch has been removed.

ZooKeeper C API Error Codes (ZOO_ERRORS)

ZOK	Normal return
ZSYSTEMERROR	System and server-side errors. The server never throws this error; it only marks the start of a range: error codes between this value and ZAPIERROR are system errors.
ZRUNTIMEINCONSISTENCY	A runtime inconsistency was found
ZDATAINCONSISTENCY	A data inconsistency was found
ZCONNECTIONLOSS	The connection between the ZooKeeper client and the server was lost
ZMARSHALLINGERROR	Error while marshalling or unmarshalling data
ZUNIMPLEMENTED	Operation is unimplemented
ZOPERATIONTIMEOUT	Operation timeout
ZBADARGUMENTS	Invalid arguments
ZINVALIDSTATE	Invalid zhandle state
ZAPIERROR	API errors. The server never throws this error; it only marks the start of a range: error codes beyond this value are API errors, while codes between ZSYSTEMERROR and this value are system errors.
ZNONODE	Node does not exist
ZNOAUTH	Not authenticated
ZBADVERSION	Version conflict
ZNOCHILDRENFOREPHEMERALS	Ephemeral nodes may not have children
ZNODEEXISTS	The node already exists
ZNOTEMPTY	The node has children
ZSESSIONEXPIRED	The session has been expired by the server
ZINVALIDCALLBACK	Invalid callback specified
ZINVALIDACL	Invalid ACL specified
ZAUTHFAILED	Client authentication failed
ZCLOSING	ZooKeeper is closing
ZNOTHING	Not an error; there are no server responses to process
ZSESSIONMOVED	Session moved to another server, so the operation is ignored
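
The two range markers make it possible to classify a return code without listing every error. The sketch below is self-contained and therefore hard-codes stand-in values (0, -1, and -100, which are believed to match ZOK, ZSYSTEMERROR, and ZAPIERROR in zookeeper.h); treat those values as an assumption. The enum and function names are hypothetical.

```c
/* Stand-ins for ZOK, ZSYSTEMERROR and ZAPIERROR (assumed values). */
#define RC_OK           0
#define RC_SYSTEMERROR  (-1)
#define RC_APIERROR     (-100)

/* Coarse classification of a return code. Illustrative names. */
enum err_class { ERR_OK, ERR_SYSTEM, ERR_API, ERR_UNKNOWN };

/* Classify a return code by the range it falls in:
 *   0            -> success
 *   -1 .. -99    -> system and server-side errors
 *   -100 or less -> API errors
 * Positive values are not part of the list above. */
enum err_class classify_error(int rc) {
    if (rc == RC_OK) {
        return ERR_OK;
    }
    if (rc <= RC_SYSTEMERROR && rc > RC_APIERROR) {
        return ERR_SYSTEM;
    }
    if (rc <= RC_APIERROR) {
        return ERR_API;
    }
    return ERR_UNKNOWN;
}
```

A caller might, for example, retry after a system error such as a lost connection but treat an API error such as a missing node as a final answer; that policy is a design choice and is not prescribed by the API.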

Watch event types:

ZOO_CREATED_EVENT: node creation event. The watch must be set on a node that does not exist yet, through zoo_exists(); the event fires when the node is created.
ZOO_DELETED_EVENT: node deletion event. This watch is set through zoo_exists() or zoo_get().
ZOO_CHANGED_EVENT: node data change event. This watch is set through zoo_exists() or zoo_get().
ZOO_CHILD_EVENT: child list change event. This watch is set through zoo_get_children() or zoo_get_children2().
ZOO_SESSION_EVENT: session loss event, triggered when the client is disconnected from the server or reconnects to it.
ZOO_NOTWATCHING_EVENT: watch removal event, triggered when the server, for some reason, will no longer watch a node on behalf of the client.

