ZooKeeper is an open-source distributed coordination service for distributed applications. It exposes a simple set of primitives on which distributed applications can build higher-level services such as synchronization, configuration maintenance, and naming. ZooKeeper began as a Hadoop sub-project, and its history needs no elaboration here. Distributed applications need a reliable, scalable, distributed, configurable coordination mechanism to keep system state consistent, because hand-rolled lock mechanisms are hard to use well and message-based coordination is a poor fit for some applications. That is the problem ZooKeeper exists to solve. This article briefly analyzes how ZooKeeper works; how to use ZooKeeper is not its focus.
1 Basic Concepts of ZooKeeper
1.1 Roles
There are three main categories of roles in Zookeeper, as shown in the following table:
System Model:
1.2 Design Purpose
1. Eventual consistency: no matter which server a client connects to, it is presented with the same view. This is ZooKeeper's most important guarantee.
2. Reliability: the service is simple, robust, and performant; once a message m is accepted by one server, it will eventually be accepted by all servers.
3. Timeliness: ZooKeeper guarantees that within a bounded time interval a client will either see server updates or learn that the server has failed. Because of network delays and other factors, however, ZooKeeper cannot guarantee that two clients see a fresh update at the same moment; a client that needs the latest data should call the sync() interface before reading.
4. Wait-free: a slow or failed client must not hold up the requests of fast clients, so every client's requests are served effectively.
5. Atomicity: an update either succeeds or fails; there is no intermediate state.
6. Ordering: this includes both global order and partial order. Global order means that if message a is published before message b on one server, then message a is published before message b on every server. Partial order means that if message b is published by the same sender after message a, then a must be ordered before b.
2 How ZooKeeper Works
At the core of ZooKeeper is atomic broadcast, the mechanism that keeps the servers in sync. The protocol that implements it is called the Zab protocol. Zab has two modes: recovery mode (leader election) and broadcast mode (synchronization). When the service starts, or after the leader crashes, Zab enters recovery mode; recovery mode ends once a leader has been elected and a majority of servers have synchronized their state with the leader. State synchronization ensures that the leader and the other servers hold the same system state.
To guarantee the sequential consistency of transactions, ZooKeeper identifies each transaction with a monotonically increasing transaction id, the zxid. Every proposal carries the zxid under which it was issued. A zxid is implemented as a 64-bit number: the high 32 bits are the epoch, which marks a change of leadership; each time a new leader is elected it gets a new epoch, identifying the current leader's reign. The low 32 bits are an incrementing counter.
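As a minimal sketch of this layout, the epoch and counter can be packed and unpacked with plain bit operations. The helper names zxid_make, zxid_epoch, and zxid_counter are illustrative, not part of any ZooKeeper API:

```c
#include <stdint.h>

/* Pack a leader epoch (high 32 bits) and a per-epoch counter (low 32 bits)
 * into one 64-bit zxid, and unpack them again. */
static inline uint64_t zxid_make(uint32_t epoch, uint32_t counter) {
    return ((uint64_t)epoch << 32) | counter;
}

static inline uint32_t zxid_epoch(uint64_t zxid) {
    return (uint32_t)(zxid >> 32);
}

static inline uint32_t zxid_counter(uint64_t zxid) {
    return (uint32_t)(zxid & 0xffffffffu);
}
```

Because the epoch occupies the high bits, any zxid issued under a new leader compares greater than every zxid of the previous epoch, which is exactly what makes plain numeric comparison of zxids meaningful.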
While running, each server is in one of three states:
Looking: the server does not yet know who the leader is and is searching for one.
Leading: the server is the elected leader.
Following: a leader has been elected and the server is synchronizing with it.
2.1 Leader Election Process
When the leader crashes, or loses a majority of its followers, ZK enters recovery mode, which must elect a new leader and bring all servers back to a correct state. ZK has two election algorithms: one based on basic Paxos and one based on fast Paxos. The system's default election algorithm is fast Paxos. The basic Paxos flow first:
1. The election is driven by an election thread started by the current server; its job is to tally the votes and pick the recommended server.
2. The election thread first sends an inquiry to all servers (including itself).
3. When a reply arrives, the election thread verifies that it answers its own inquiry (by checking that the zxid matches), records the responder's id (myid) in the current inquiry list, and records the leader information (id, zxid) proposed by the responder in the election's voting table.
4. After receiving replies from all servers, it determines the server with the largest zxid and makes that server its vote for the next round.
5. The thread recommends the server with the current largest zxid as leader. If that server gathers n/2 + 1 of the servers' votes, it wins: the thread sets it as the recommended leader and sets its own state based on the winner's information. Otherwise, the process repeats until a leader is elected.
From this flow we can conclude that for a leader to win the support of a majority of servers, the total number of servers should be an odd number 2n+1, and no fewer than n+1 servers must survive.
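The tallying in steps 4 and 5 and the majority rule can be sketched as follows. The names struct vote, pick_candidate, and has_quorum are hypothetical, and the zxid-then-id tie-break mirrors the common description of ZK's election rather than the actual source:

```c
#include <stddef.h>

/* One recorded vote: the proposed server and its last zxid. */
struct vote { int server_id; unsigned long long zxid; };

/* Prefer the vote with the largest zxid, breaking ties by the larger id. */
static int pick_candidate(const struct vote *votes, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (votes[i].zxid > votes[best].zxid ||
            (votes[i].zxid == votes[best].zxid &&
             votes[i].server_id > votes[best].server_id))
            best = i;
    }
    return votes[best].server_id;
}

/* A candidate wins once it holds a strict majority: n/2 + 1 votes. */
static int has_quorum(int votes_for, int ensemble_size) {
    return votes_for >= ensemble_size / 2 + 1;
}
```

With an ensemble of 2n+1, has_quorum demands n+1 votes, which is why an odd ensemble size and at least n+1 survivors are required.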
Each server repeats this process after it starts. In recovery mode, a server that has just recovered from a crash or just started also restores data and session information from disk snapshots; ZK writes a transaction log and takes periodic snapshots precisely to make this state recovery possible. The leader-election flowchart is as follows:
In the fast Paxos flow, a server first proposes itself as leader to all servers. When the other servers receive the proposal, they resolve any conflict between epoch and zxid, accept the proposal, and reply with an acceptance message. Repeating this process eventually elects the leader. The flowchart is as follows:
2.2 Synchronization Process
After a leader is selected, ZK enters the state synchronization phase.
1. The leader waits for servers to connect;
2. Each follower connects to the leader and sends it its largest zxid;
3. The leader determines the synchronization point from the follower's zxid;
4. When synchronization completes, the leader notifies the follower that it is now in the uptodate state;
5. On receiving the uptodate message, the follower can resume accepting client requests.
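Step 3, choosing the synchronization point, can be sketched as a pure function. The DIFF/TRUNC/SNAP distinction follows how ZooKeeper's leader is commonly described as bringing learners up to date; choose_sync and its thresholds are an assumption, not the real server logic:

```c
/* Hypothetical sketch: decide how to bring a follower up to date from its
 * last zxid.  SYNC_DIFF replays missing proposals from the leader's log,
 * SYNC_TRUNC rolls an over-eager follower back, and SYNC_SNAP ships a
 * full snapshot when the follower is too far behind. */
enum sync_action { SYNC_NONE, SYNC_DIFF, SYNC_TRUNC, SYNC_SNAP };

static enum sync_action choose_sync(unsigned long long follower_zxid,
                                    unsigned long long leader_zxid,
                                    unsigned long long oldest_logged_zxid) {
    if (follower_zxid == leader_zxid)
        return SYNC_NONE;                  /* already up to date */
    if (follower_zxid > leader_zxid)
        return SYNC_TRUNC;                 /* follower is ahead of the leader */
    if (follower_zxid >= oldest_logged_zxid)
        return SYNC_DIFF;                  /* gap is covered by the leader's log */
    return SYNC_SNAP;                      /* too far behind: full state transfer */
}
```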
The flowchart is as follows:
2.3 Workflow
2.3.1 Leader Workflow
The leader has three main jobs:
1. Recover data;
2. Maintain heartbeats with the learners, receive learner requests, and classify each request by message type;
3. Handle each learner message according to its type; the main types are PING, REQUEST, ACK, and REVALIDATE messages.
A PING message carries a learner's heartbeat. A REQUEST message carries a proposal sent by a follower, either a write request or a sync request. An ACK message is a follower's reply to a proposal; once more than half of the followers have ACKed, the leader commits the proposal. A REVALIDATE message extends a session's lifetime.
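The ACK rule above can be sketched in a few lines. The struct and function names are hypothetical, not the real leader implementation:

```c
/* Illustrative sketch: a proposal commits once more than half of the
 * ensemble has acknowledged it. */
struct proposal { unsigned long long zxid; int acks; int committed; };

static void on_ack(struct proposal *p, int ensemble_size) {
    p->acks++;
    if (!p->committed && p->acks > ensemble_size / 2)
        p->committed = 1;   /* quorum reached: a COMMIT would be broadcast here */
}
```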
The leader's workflow diagram is shown below; the actual implementation is considerably more complex, starting three threads to do this work.
2.3.2 Follower Workflow
A follower has four main jobs:
1. Send requests to the leader (PING, REQUEST, ACK, and REVALIDATE messages);
2. Receive messages from the leader and process them;
3. Receive client requests, forwarding write requests to the leader for voting;
4. Return results to the client.
The follower's message loop handles the following messages from the leader:
1. PING message: heartbeat;
2. PROPOSAL message: a proposal initiated by the leader, asking the follower to vote;
3. COMMIT message: information about the latest committed proposal on the server side;
4. UPTODATE message: indicates that synchronization is complete;
5. REVALIDATE message: depending on the leader's revalidation result, close the session being revalidated or allow it to accept messages again;
6. SYNC message: return the sync result to the client; it is originally initiated by the client to force a read of the latest update.
The follower's workflow diagram is shown below; in the actual implementation, the follower is built from five threads.
The observer's flow is not described separately; the only difference between an observer and a follower is that an observer does not vote in leader-initiated polls.
Mainstream application scenarios of ZooKeeper (beyond the official examples)
(1) Configuration management
Centralized configuration management is common in application clusters. A typical company implements a central configuration center so that different application clusters can share their respective configurations, and so that every machine in a cluster can be notified when the configuration changes.
ZooKeeper makes this centralized configuration management easy to implement. For example, put all of app1's configuration under the znode /app1. On startup, every app1 machine sets a watch on that node (zk.exists("/app1", true)) and implements the Watcher callback. Whenever the data under the /app1 znode changes, every machine is notified, the Watcher method runs, and the application can then fetch the new data (zk.getData("/app1", false, null)).
The example above is only coarse-grained configuration watching; data can also be watched hierarchically at finer granularity, all under the application's own design and control.
(2) Cluster management
In an application cluster, we often need every machine to know which machines in the cluster (or in some other cluster it depends on) are alive, and to notify every machine quickly when one drops out due to a crash or network partition, with no manual intervention.
ZooKeeper also makes this easy. Suppose there is a znode /app1servers on the ZooKeeper server side. When each machine in the cluster starts, it creates an EPHEMERAL node under it: server1 creates /app1servers/server1 (an IP address also works, as long as names do not repeat), server2 creates /app1servers/server2, and then server1 and server2 both watch the parent node /app1servers. Any change to the data or children under the parent notifies the clients watching that node. An EPHEMERAL node has one crucial property: it disappears when the client-server connection breaks or the session expires. So when a machine crashes or loses its link, the corresponding node vanishes, every client watching /app1servers is notified, and each then fetches the latest server list.
Another scenario is cluster master election: when the master dies, a new master can immediately be chosen from the slaves. The steps are the same as above, except that the node each machine creates under /app1servers at startup has type EPHEMERAL_SEQUENTIAL, so every node is numbered automatically.
By convention we take the lowest-numbered node as the master. Watching the /app1servers node yields the server list; as long as every machine in the cluster agrees that the lowest-numbered node is the master, a master is elected. When that master goes down, its corresponding znode disappears, the new server list is pushed to the clients, and each node again takes the lowest-numbered node as the master. Dynamic master election is thus achieved.
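The "lowest sequence number wins" rule can be sketched as a small helper. pick_master is a hypothetical client-side function operating on the child names returned for /app1servers; it assumes each name ends in the 10-digit suffix that EPHEMERAL_SEQUENTIAL nodes get:

```c
#include <stdlib.h>
#include <string.h>

/* Return the index of the lowest-numbered child (the master candidate),
 * or -1 when the child list is empty. */
static int pick_master(const char *const *children, int n) {
    int best = -1;
    long best_seq = 0;
    for (int i = 0; i < n; i++) {
        size_t len = strlen(children[i]);
        /* the sequence suffix is the trailing 10 digits of the name */
        const char *suffix = len > 10 ? children[i] + len - 10 : children[i];
        long seq = strtol(suffix, NULL, 10);
        if (best < 0 || seq < best_seq) {
            best = i;
            best_seq = seq;
        }
    }
    return best;
}
```

Because every client applies the same deterministic rule to the same child list, no extra coordination round is needed to agree on the master.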
ZooKeeper Watches Introduction
Declarations and descriptions of the ZooKeeper C API can be found in include/zookeeper.h; most of the C API's constants and struct declarations are also in zookeeper.h. If you run into problems using the C API, it is best to consult zookeeper.h, or to use Doxygen to generate the C API's reference documentation.
The most distinctive, and hardest to understand, feature of ZooKeeper is watches. All ZooKeeper read operations (getData(), getChildren(), and exists()) can set a watch. A watch event is best understood as a one-time trigger; the official definition is: a watch event is a one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes. This breaks down as follows:
One-time trigger
When the watched data changes, the watch event is sent to the client once. For example, if a client calls getData("/znode1", true) and the data on /znode1 later changes or is deleted, the client receives the watch event for /znode1; but if /znode1 changes again, the client receives no further notification unless it sets a new watch on /znode1.
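The one-shot semantics can be captured in a toy model (this simulates the behavior; it is not the real client library):

```c
/* A watch fires at most once per registration and must be armed again
 * explicitly, e.g. by calling getData(path, true) a second time. */
struct watch { int armed; int fired; };

static void set_watch(struct watch *w) {
    w->armed = 1;             /* what a read with watch=true would do */
}

static void on_node_changed(struct watch *w) {
    if (w->armed) {
        w->armed = 0;         /* disarm before delivering */
        w->fired++;           /* deliver the event exactly once */
    }
}
```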
Sent to the client
The ZooKeeper client and server communicate over sockets, so a network failure may prevent a watch event from reaching the client. Watch events are delivered asynchronously to the Watcher. ZooKeeper itself provides an ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event. Network latency and other factors may cause different clients to observe a watch event at different times, but the order in which different clients see everything is consistent.
The data for which the watch is set
This refers to the different ways a znode can change. Imagine that ZooKeeper maintains two watch lists: data watches and child watches. getData() and exists() set data watches; getChildren() sets child watches. Alternatively, think in terms of the data each call returns: getData() and exists() return information about the znode itself, while getChildren() returns the list of children. Thus setData() triggers the data watches set on a node (assuming the write succeeds); a successful create() triggers the data watch on the created node and the child watch on its parent; and a successful delete() triggers both the data watch and the child watch of the deleted node, as well as the child watch of its parent.
Watches in ZooKeeper are lightweight, and therefore easy to set, maintain, and dispatch. When a client loses contact with the ZooKeeper server, it receives no watch notifications; when it reconnects, previously registered watches are re-registered and triggered as needed, which is usually transparent to the developer. Only one situation can lose a watch event: a watch set via exists() on a znode that is created and then deleted during the interval in which the client is out of contact with the server; even after reconnecting, the client is never notified of that event.
ZooKeeper C API constants and partial structures (structs)
ACL-related structs and constants:
The struct Id layout:
struct Id { char *scheme; char *id; };
The struct ACL layout:
struct ACL { int32_t perms; struct Id id; };
The struct ACL_vector layout:
struct ACL_vector { int32_t count; struct ACL *data; };
Constants related to Znode access rights
const int ZOO_PERM_READ;    // allows the client to read the znode's value and list its children.
const int ZOO_PERM_WRITE;   // allows the client to set the znode's value.
const int ZOO_PERM_CREATE;  // allows the client to create child nodes under the znode.
const int ZOO_PERM_DELETE;  // allows the client to delete child nodes.
const int ZOO_PERM_ADMIN;   // allows the client to call set_acl().
const int ZOO_PERM_ALL;     // allows all operations; all of the above flags OR'ed together.
Constants related to ACL IDs
struct Id ZOO_ANYONE_ID_UNSAFE;  // ("world", "anyone")
struct Id ZOO_AUTH_IDS;          // ("auth", "")
Three types of standard ACLs
struct ACL_vector ZOO_OPEN_ACL_UNSAFE;   // (ZOO_PERM_ALL, ZOO_ANYONE_ID_UNSAFE)
struct ACL_vector ZOO_READ_ACL_UNSAFE;   // (ZOO_PERM_READ, ZOO_ANYONE_ID_UNSAFE)
struct ACL_vector ZOO_CREATOR_ALL_ACL;   // (ZOO_PERM_ALL, ZOO_AUTH_IDS)
Constants related to interest: ZOOKEEPER_WRITE, ZOOKEEPER_READ
These two constants identify the events a caller is interested in and the events that have occurred. Interest constants can be OR'ed together to express multiple interests (write and read); they are typically used with the zookeeper_interest() and zookeeper_process() functions.
Constants related to node creation: ZOO_EPHEMERAL, ZOO_SEQUENCE
These are flags for the zoo_create() function. ZOO_EPHEMERAL marks the created node as ephemeral; ZOO_SEQUENCE names the node with an incrementing sequence suffix (typically a 10-digit number after the node name, e.g. /xyz0000000000, /xyz0000000001, /xyz0000000002, ...). ZOO_EPHEMERAL and ZOO_SEQUENCE can likewise be OR'ed together.
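The naming scheme can be sketched client-side; sequential_name is an illustrative helper, not an API call, showing the zero-padded 10-digit suffix the server appends:

```c
#include <stdio.h>
#include <string.h>

/* Format the path a ZOO_SEQUENCE node would receive for a given sequence
 * number: the prefix followed by a zero-padded 10-digit counter. */
static void sequential_name(char *out, size_t out_len, const char *prefix, int seq) {
    snprintf(out, out_len, "%s%010d", prefix, seq);
}
```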
Constants related to connection state
The following constants describe the ZooKeeper connection state; they are typically passed as arguments to the watch callback function.
ZOOAPI const int ZOO_EXPIRED_SESSION_STATE;
ZOOAPI const int ZOO_AUTH_FAILED_STATE;
ZOOAPI const int ZOO_CONNECTING_STATE;
ZOOAPI const int ZOO_ASSOCIATING_STATE;
ZOOAPI const int ZOO_CONNECTED_STATE;
Constants related to watch types
The following constants identify the type of a watch event; they are typically passed as the first argument of the watch callback function.
ZOO_CREATED_EVENT;     // the node was created (it did not previously exist); watch set via zoo_exists().
ZOO_DELETED_EVENT;     // the node was deleted; watch set via zoo_exists() or zoo_get().
ZOO_CHANGED_EVENT;     // the node's data changed; watch set via zoo_exists() or zoo_get().
ZOO_CHILD_EVENT;       // a child-node event; watch set via zoo_get_children() or zoo_get_children2().
ZOO_SESSION_EVENT;     // a session event, triggered when the client disconnects from or reconnects to the server.
ZOO_NOTWATCHING_EVENT; // the watch was removed; the server will no longer trigger it for the client.
ZooKeeper C API error codes (zoo_errors)
ZOK: normal return.
ZSYSTEMERROR: system or server-side errors. The server never returns this value; it only marks a range: codes between ZSYSTEMERROR and ZAPIERROR (exclusive) are system errors.
ZRUNTIMEINCONSISTENCY: a runtime inconsistency was found.
ZDATAINCONSISTENCY: a data inconsistency was found.
ZCONNECTIONLOSS: the ZooKeeper client lost its connection to the server.
ZMARSHALLINGERROR: error while marshalling or unmarshalling data.
ZUNIMPLEMENTED: the operation is not implemented.
ZOPERATIONTIMEOUT: the operation timed out.
ZBADARGUMENTS: invalid arguments.
ZINVALIDSTATE: invalid zhandle state.
ZAPIERROR: API errors. The server never returns this value; it only marks a range: codes beyond ZAPIERROR are API errors, while codes between it and ZSYSTEMERROR are system errors.
ZNONODE: the node does not exist.
ZNOAUTH: not authenticated.
ZBADVERSION: version conflict.
ZNOCHILDRENFOREPHEMERALS: ephemeral nodes may not have children.
ZNODEEXISTS: the node already exists.
ZNOTEMPTY: the node has children.
ZSESSIONEXPIRED: the session has been expired by the server.
ZINVALIDCALLBACK: invalid callback specified.
ZINVALIDACL: invalid ACL specified.
ZAUTHFAILED: client authentication failed.
ZCLOSING: the ZooKeeper handle is closing.
ZNOTHING: not an error; there are no server responses to process.
ZSESSIONMOVED: the session moved to another server, so the operation was ignored.
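The two range markers can be checked with simple comparisons. This is a hedged sketch: error codes in zookeeper.h are negative integers, and the concrete values below restate the header as an assumption rather than quoting it:

```c
/* ZSYSTEMERROR and ZAPIERROR only mark the boundaries of the two ranges;
 * real errors fall strictly inside them. */
enum zk_err {
    ZOK = 0,
    ZSYSTEMERROR = -1,          /* range marker, never returned */
    ZCONNECTIONLOSS = -4,
    ZAPIERROR = -100,           /* range marker, never returned */
    ZNONODE = -101,
    ZSESSIONEXPIRED = -112
};

/* System errors lie strictly between the two markers. */
static int is_system_error(int rc) { return rc < ZSYSTEMERROR && rc > ZAPIERROR; }

/* API errors lie beyond ZAPIERROR. */
static int is_api_error(int rc)    { return rc < ZAPIERROR; }
```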