ZooKeeper is an important component of the Hadoop ecosystem. Its main function is to provide coordination services with consistency guarantees for distributed systems; the corresponding service at Google is called Chubby. This article introduces ZooKeeper in three parts: the first covers the basic principles of ZooKeeper, the second covers the client API that ZooKeeper provides, and the third covers some typical ZooKeeper application scenarios.
Zookeeper Fundamentals
1. Data Model
As shown in the figure, the ZooKeeper data model is structured much like a Unix file system: it can be viewed as a tree in which every node is called a znode. Each znode is uniquely identified by its path; for example, the first znode on the third level has the path /app1/c1. A small amount of data can be stored on each znode (1MB by default; the limit can be changed via configuration, but storing large amounts of data on a znode is generally not recommended), which will prove useful in the typical scenarios later in this article. In addition, each znode stores its own ACL information. Note that although the znode tree is structured like a Unix file system, its ACL model is completely different: each znode's ACL is independent, and child nodes do not inherit ACLs from their parents. For ZooKeeper ACLs, see the earlier article "Talking about ACLs in ZooKeeper".
2. Key Concepts
2.1 Znode
Znodes were introduced above. Based on their characteristics, znodes fall into two categories:
- Regular znode: a regular node that the user must explicitly create and delete
- Ephemeral znode: a temporary node; after creating it, the user can delete it explicitly, or the ZooKeeper server will delete it automatically when the session that created it ends
A znode can also have the sequential property: if it is specified at creation time, ZooKeeper automatically appends a monotonically increasing sequence number to the znode's name.
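A minimal sketch of creating the different znode types with the standard Java client (assuming an ensemble reachable at localhost:2181; the paths are purely illustrative):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeTypesDemo {
    public static void main(String[] args) throws Exception {
        // Connect with a 30s session timeout; the default watcher here ignores events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});

        // Regular (persistent) znode: stays until explicitly deleted.
        zk.create("/app1", "app data".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when this session ends.
        zk.create("/app1/worker", "worker info".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential znode: the server appends an increasing sequence number,
        // e.g. /app1/task-0000000003; the actual path is returned.
        String seqPath = zk.create("/app1/task-", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + seqPath);

        zk.close();
    }
}
```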
2.2 Session
Communication between a client and ZooKeeper requires a session, and every session has a timeout. Because the ZooKeeper cluster persists clients' session information, a client's connection can move transparently between ZooKeeper servers before its session expires.
In practice, if communication between client and server is frequent enough, no extra messages are needed to keep the session alive. Otherwise, the ZooKeeper client sends a heartbeat to the server every T/3 ms; if it receives no heartbeat response from the server within 2T/3 ms, it switches to another ZooKeeper server. Here T is the user-configured session timeout.
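A minimal sketch showing where the session timeout T is configured and how session state changes show up at the default watcher (the ensemble addresses are illustrative):

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionDemo {
    public static void main(String[] args) throws Exception {
        int sessionTimeoutMs = 30000; // this is the "T" discussed above

        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", sessionTimeoutMs,
            (WatchedEvent event) -> {
                // Session state transitions are delivered to the default watcher.
                if (event.getState() == Watcher.Event.KeeperState.Disconnected) {
                    System.out.println("disconnected; the client will try another server");
                } else if (event.getState() == Watcher.Event.KeeperState.Expired) {
                    System.out.println("session expired; its ephemeral nodes are gone");
                }
            });

        // The timeout negotiated with the server may differ from the requested one.
        System.out.println("negotiated session timeout: " + zk.getSessionTimeout() + " ms");
        zk.close();
    }
}
```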
2.3 Watcher
ZooKeeper supports watches: a client can set a watcher on a znode to observe changes to it. When a corresponding change happens on the znode, the watcher is triggered and the event is delivered to the client that set it. Note that watchers in ZooKeeper are one-shot: a watcher is removed once it fires, so a client that wants to keep watching must set the watcher again. This is somewhat like the one-shot mode in epoll.
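A minimal sketch of the re-registration pattern this implies: the watcher arms itself again every time it fires (the znode path is whatever the application cares about):

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class OneShotWatchDemo implements Watcher {
    private final ZooKeeper zk;
    private final String path;

    OneShotWatchDemo(ZooKeeper zk, String path) { this.zk = zk; this.path = path; }

    // Arm a watch on the znode; because watches are one-shot, process()
    // must call watchNode() again to keep observing further changes.
    void watchNode() throws Exception {
        zk.exists(path, this);
    }

    @Override
    public void process(WatchedEvent event) {
        System.out.println("event " + event.getType() + " on " + event.getPath());
        try {
            watchNode(); // re-register, otherwise no further events are delivered
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```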
3. Zookeeper Features
3.1 Read/Write (Update) Model
In a ZooKeeper cluster, reads can be served by any ZooKeeper server, which is the key to ZooKeeper's good read performance. Write requests are first forwarded to the leader, which broadcasts them to all followers using ZooKeeper's atomic broadcast protocol. Once the leader has received successful write ACKs from more than half of the servers, it considers the write successful, persists it, and tells the client that the write succeeded.
3.2 WAL and Snapshot
Like most distributed systems, ZooKeeper uses a WAL (write-ahead log): for each update, ZooKeeper writes the WAL first, then applies the update to the in-memory data, and finally notifies the client of the result. In addition, ZooKeeper periodically snapshots its in-memory directory tree to disk, similar to the fsimage in HDFS. The main purpose is, of course, data persistence; a secondary purpose is to speed up recovery after a restart, since recovering everything by replaying the WAL would be relatively slow.
3.3 FIFO
For each ZooKeeper client, all operations are executed in FIFO order, which is guaranteed by two basic properties: first, the network communication between ZooKeeper client and server is based on TCP, which preserves the order in which packets are transferred between client and server; second, the ZooKeeper server executes client requests strictly in FIFO order.
3.4 Linearizability
In ZooKeeper, all update operations are totally ordered and executed serially, which is the key to guaranteeing the correctness of ZooKeeper's functionality.
ZooKeeper Client API
The ZooKeeper client library provides a rich and intuitive API for user programs. The most commonly used calls are:
- create(path, data, flags): creates a znode at path and stores data on it; the commonly used flags are PERSISTENT, PERSISTENT_SEQUENTIAL, EPHEMERAL and EPHEMERAL_SEQUENTIAL
- delete(path, version): deletes the znode at path if its version matches; a version of -1 matches any version
- exists(path, watch): checks whether the specified znode exists and sets whether to watch it. With this form, the watcher is the one specified when the ZooKeeper instance was created; to set a specific watcher, call the overloaded version exists(path, watcher). The watch parameter of the APIs below behaves the same way
- getData(path, watch): reads the data of the specified znode and sets whether to watch it
- setData(path, data, version): updates the data of the specified znode if its version matches
- getChildren(path, watch): gets the names of all children of the specified znode and sets whether to watch it
- sync(path): waits for all updates issued before the sync to propagate to the server the client is connected to, so that subsequent reads see them; the path parameter is currently unused
- setACL(path, acl): sets the ACL of the specified znode
- getACL(path): gets the ACL of the specified znode
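A minimal sketch of these calls with the standard Java client (assuming an ensemble at localhost:2181; the real Java methods take a few extra arguments, such as ACL lists and version numbers, that the simplified signatures above omit):

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ApiDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});

        // create: path, data, ACL, flags
        zk.create("/demo", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists / getData / setData
        Stat stat = zk.exists("/demo", false);
        byte[] data = zk.getData("/demo", false, stat);
        System.out.println("read: " + new String(data));
        zk.setData("/demo", "v2".getBytes(), stat.getVersion());

        // getChildren
        List<String> children = zk.getChildren("/", false);
        System.out.println("children of /: " + children);

        // getACL / setACL
        Stat aclStat = new Stat();
        zk.setACL("/demo", zk.getACL("/demo", aclStat), aclStat.getAversion());

        // delete with version -1 (match any version)
        zk.delete("/demo", -1);
        zk.close();
    }
}
```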
Typical ZooKeeper Application Scenarios
1. Name Service (Nameservice)
In distributed applications, a complete naming mechanism is often needed to generate unique identifiers that are also easy for humans to recognize and remember. We know that each znode is uniquely identified by its path, that paths are concise and intuitive, and that a znode can also store a small amount of data; these properties are the basis of a unified name service. The following uses a name service for HDFS as an example to illustrate the basic approach:
- Goal: access a specified HDFS cluster through a simple name
- Define the naming rule: it should be concise and easy to remember. One option is [serviceScheme://][zkCluster]-[clusterName], e.g. hdfs://lgprc-example/, meaning the HDFS cluster used as an example, based on the lgprc ZooKeeper cluster
- Configure a DNS mapping: resolve the zkCluster identifier lgprc via DNS to the address of the corresponding ZooKeeper cluster
- Create the znode: create the /nameservice/hdfs/lgprc-example node on that ZooKeeper cluster and store the HDFS configuration under it
- To access the HDFS cluster named hdfs://lgprc-example/, a user program first resolves the address of the lgprc ZooKeeper cluster through DNS, then reads the HDFS configuration from the /nameservice/hdfs/lgprc-example znode, and finally obtains the actual HDFS access entry point from that configuration
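As a rough illustration of the last step, the sketch below reads the HDFS configuration for a given cluster name from ZooKeeper; the ensemble address lgprc-zk.example.com:2181 is a hypothetical stand-in for whatever the DNS mapping resolves to:

```java
import org.apache.zookeeper.ZooKeeper;

public class NameServiceDemo {
    // Resolve a name like "hdfs://lgprc-example/" to the HDFS configuration
    // stored in ZooKeeper. The zkAddress would normally come from the DNS mapping.
    public static String resolveHdfsConfig(String zkAddress, String clusterName)
            throws Exception {
        ZooKeeper zk = new ZooKeeper(zkAddress, 30000, event -> {});
        try {
            byte[] conf = zk.getData("/nameservice/hdfs/" + clusterName, false, null);
            return new String(conf);
        } finally {
            zk.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(resolveHdfsConfig("lgprc-zk.example.com:2181", "lgprc-example"));
    }
}
```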
2. Configuration Management (Config Management)
In distributed systems, the following scenario is common: many instances of a job are running, and most of their configuration items are identical at run time; if a configuration item has to be changed, updating the instances one by one is inefficient and error-prone. ZooKeeper solves this problem well. The basic steps are:
- Place the common configuration on a znode in ZooKeeper, such as /service/common-conf
- All instances are given the ZooKeeper cluster's entry address at startup and watch /service/common-conf while running
- When the cluster administrator modifies common-conf, all instances are notified, update their configuration based on the notification, and continue to watch /service/common-conf
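A minimal sketch of an instance loading and watching the shared configuration (assuming the /service/common-conf znode from the steps above; applyConfig is a hypothetical hook for the application's own reload logic):

```java
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

public class ConfigWatcherDemo {
    private final ZooKeeper zk;

    ConfigWatcherDemo(ZooKeeper zk) { this.zk = zk; }

    // Load the shared configuration and re-arm the watch every time it fires,
    // because ZooKeeper watches are one-shot.
    void loadAndWatch() throws Exception {
        byte[] conf = zk.getData("/service/common-conf", event -> {
            if (event.getType() == EventType.NodeDataChanged) {
                try {
                    loadAndWatch(); // re-read the config and watch again
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, null);
        applyConfig(new String(conf));
    }

    private void applyConfig(String conf) {
        System.out.println("applying config: " + conf); // placeholder for real reload logic
    }
}
```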
3. Group Membership Management
In a typical master-slave distributed system, the master needs to act as a "watcher" over all the slaves: when a slave joins or goes down, the master must detect it and react accordingly, so that the cluster's external service is not affected. Take HBase as an example: the HMaster manages all the RegionServers. When a new RegionServer joins, the HMaster needs to assign some regions to it; when a RegionServer dies, the HMaster needs to reassign the regions it was serving to other live RegionServers so that client access is not affected. The basic steps for using ZooKeeper in this scenario are:
- The master creates a /service/slaves node on ZooKeeper and registers a watcher on its child list
- After starting successfully, each slave creates an ephemeral node /service/slaves/${slave_id} that uniquely identifies it, and writes related information, such as its own address (ip/port), to that node
- The master receives a notification that a new child node has been added and handles it accordingly
- If a slave goes down, its node is ephemeral, so ZooKeeper deletes the node automatically once the slave's session times out
- The master receives a notification that a child node has disappeared and handles it accordingly
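A minimal sketch of both sides, assuming the /service/slaves parent node already exists; the slave IDs and address format are illustrative:

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMembershipDemo {
    // Slave side: register an ephemeral node carrying this slave's address.
    static void registerSlave(ZooKeeper zk, String slaveId, String address)
            throws Exception {
        zk.create("/service/slaves/" + slaveId, address.getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Master side: list the current slaves and watch for joins/departures,
    // re-arming the (one-shot) children watch on every notification.
    static void watchSlaves(ZooKeeper zk) throws Exception {
        List<String> slaves = zk.getChildren("/service/slaves", event -> {
            if (event.getType() == EventType.NodeChildrenChanged) {
                try {
                    watchSlaves(zk); // handle the membership change and watch again
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        System.out.println("current slaves: " + slaves);
    }
}
```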
4. Simple lock
We know that in traditional applications, synchronization between threads or processes can be achieved with mechanisms provided by the operating system. In a distributed system, however, the operating system cannot synchronize processes running on different machines. In this case a distributed coordination service such as ZooKeeper is needed. The following steps implement a simple mutex with ZooKeeper; it can be understood by analogy with a mutex used for thread synchronization:
- Multiple processes all try to create an ephemeral node at the agreed path /locks/my_lock
- ZooKeeper guarantees that only one process succeeds in creating the node; the process that succeeds holds the lock. Assume that process is A
- All other processes watch /locks/my_lock
- When process A no longer needs the lock, it can explicitly delete /locks/my_lock to release it; or, if process A crashes, ZooKeeper automatically deletes /locks/my_lock after A's session times out, which also releases the lock. At that point the other processes are notified by ZooKeeper and again try to create /locks/my_lock to grab the lock, and the cycle repeats
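A minimal sketch of this simple lock (assuming the /locks parent node already exists). Note that every waiter watches the same node, which is exactly the herd effect discussed in the next section:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleLockDemo {
    private final ZooKeeper zk;

    SimpleLockDemo(ZooKeeper zk) { this.zk = zk; }

    // Block until this process manages to create the ephemeral lock node.
    void lock() throws Exception {
        while (true) {
            try {
                zk.create("/locks/my_lock", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return; // we grabbed the lock
            } catch (KeeperException.NodeExistsException e) {
                // Someone else holds it: wait until the node is deleted,
                // then loop and try to create it again.
                CountDownLatch deleted = new CountDownLatch(1);
                if (zk.exists("/locks/my_lock", event -> {
                        if (event.getType() == EventType.NodeDeleted) {
                            deleted.countDown();
                        }
                    }) == null) {
                    continue; // already gone, retry immediately
                }
                deleted.await(10, TimeUnit.SECONDS); // retry periodically regardless
            }
        }
    }

    void unlock() throws Exception {
        zk.delete("/locks/my_lock", -1);
    }
}
```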
5. Simple Lock Without Herd Effect
A problem with the previous section's approach is that many processes compete for the same lock node, which can cause a herd effect. To solve this, the steps can be improved as follows (a sketch follows below):
- Each process creates an ephemeral sequential node /locks/lock_${seq} on ZooKeeper
- The process whose ${seq} is smallest is the current lock holder (${seq} is the sequence number generated by ZooKeeper)
- Every other process watches only the node whose sequence number is immediately smaller than its own, e.g. 2 watches 1, 3 watches 2, and so on
- When the current holder releases the lock, the process with the next larger sequence number is notified by ZooKeeper and becomes the new lock holder, and the cycle repeats
It is worth adding that leader election, which ZooKeeper is commonly used for in distributed systems, is implemented with exactly this mechanism: the process holding the lock is the current "master".
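A minimal sketch of the improved lock (again assuming the /locks parent node exists); each waiter watches only its immediate predecessor:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class NoHerdLockDemo {
    private final ZooKeeper zk;
    private String myNode; // e.g. "lock_0000000007"

    NoHerdLockDemo(ZooKeeper zk) { this.zk = zk; }

    void lock() throws Exception {
        // EPHEMERAL_SEQUENTIAL: the server appends an increasing sequence number.
        myNode = zk.create("/locks/lock_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL)
                .substring("/locks/".length());

        while (true) {
            List<String> nodes = zk.getChildren("/locks", false);
            Collections.sort(nodes);
            int myIndex = nodes.indexOf(myNode);
            if (myIndex == 0) {
                return; // smallest sequence number: we hold the lock
            }
            // Watch only the node just before ours to avoid the herd effect.
            String prev = "/locks/" + nodes.get(myIndex - 1);
            CountDownLatch gone = new CountDownLatch(1);
            if (zk.exists(prev, event -> {
                    if (event.getType() == EventType.NodeDeleted) gone.countDown();
                }) != null) {
                gone.await();
            }
        }
    }

    void unlock() throws Exception {
        zk.delete("/locks/" + myNode, -1);
    }
}
```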
6. Read/Write Lock
We know that a read/write lock differs from a mutex in that it has two modes, read and write: multiple reads can proceed concurrently, but write/write and read/write are mutually exclusive and cannot execute at the same time. With ZooKeeper, a small modification of the scheme above implements the traditional read/write lock semantics. The basic steps are:
- Each process creates an ephemeral sequential node /locks/lock_${seq} on ZooKeeper
- The node or nodes with the smallest ${seq} are the current lock holders; there can be more than one because multiple reads can run concurrently
- A process that needs the write lock watches the node with the next smaller sequence number
- A process that needs the read lock watches the closest preceding write-request node with a smaller sequence number
- When the watched node releases the lock, all processes watching it are notified; they become the new lock holders, and the cycle repeats
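A partial sketch of the read/write variant. The read-/write- name prefixes are an assumed convention for telling the two modes apart; the helper only decides which node, if any, a request has to watch (null means the lock is already held):

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ReadWriteLockDemo {
    // Create a request node whose name encodes the mode ("read-" or "write-").
    // Returns the full path, e.g. /locks/write-0000000003.
    static String request(ZooKeeper zk, boolean write) throws Exception {
        String prefix = write ? "/locks/write-" : "/locks/read-";
        return zk.create(prefix, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    }

    // children and myName are bare node names such as "read-0000000005".
    static String nodeToWatch(List<String> children, String myName, boolean write) {
        Collections.sort(children, (a, b) -> seq(a) - seq(b));
        String candidate = null;
        for (String child : children) {
            if (seq(child) >= seq(myName)) break;
            if (write) {
                candidate = child;               // a writer waits on the previous node
            } else if (child.startsWith("write-")) {
                candidate = child;               // a reader waits on the last writer before it
            }
        }
        return candidate;
    }

    private static int seq(String name) {
        return Integer.parseInt(name.substring(name.lastIndexOf('-') + 1));
    }
}
```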
7. Barrier
In a distributed system, a barrier has the following semantics: the client waits until multiple processes have finished their tasks before it proceeds to the next step. The basic steps for implementing a barrier with ZooKeeper are:
- The client creates a barrier node /barrier/my_barrier on ZooKeeper and starts the processes that execute the tasks
- The client watches /barrier/my_barrier via exists()
- After finishing its task, each task process checks whether the agreed condition has been met; if not, it does nothing, and if it has, it deletes the /barrier/my_barrier node
- The client receives the notification that /barrier/my_barrier has been deleted, which means the barrier is gone, and proceeds with the next task
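A minimal sketch of the barrier (assuming the client created /barrier/my_barrier as in the first step; the finishing condition itself is application-specific):

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

public class BarrierDemo {
    // Client side: block until /barrier/my_barrier is deleted by a task process.
    static void waitForBarrier(ZooKeeper zk) throws Exception {
        CountDownLatch removed = new CountDownLatch(1);
        // exists() both checks the node and arms a one-shot watch on it.
        if (zk.exists("/barrier/my_barrier", event -> {
                if (event.getType() == EventType.NodeDeleted) {
                    removed.countDown();
                }
            }) == null) {
            return; // barrier already gone
        }
        removed.await();
    }

    // Task-process side: delete the barrier node once the agreed
    // finishing condition has been reached.
    static void finishBarrier(ZooKeeper zk) throws Exception {
        zk.delete("/barrier/my_barrier", -1);
    }
}
```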
8. Double Barrier
A double barrier has the following semantics: it synchronizes both the start and the end of a task. The task starts once enough processes have entered the barrier, and the barrier is lifted once all processes have finished their respective tasks. The basic steps for implementing a double barrier with ZooKeeper are:
Entering the barrier:
- The client watches the /barrier/ready node and decides whether to start the task based on whether that node exists
- On entering the barrier, each task process creates an ephemeral node /barrier/process/${process_id}, then checks whether the number of nodes that have entered has reached the agreed value; if so, it creates the /barrier/ready node, otherwise it keeps waiting
- When the client is notified that /barrier/ready has been created, it starts the task execution
Leaving the barrier:
- The client watches /barrier/process; once it has no children, the task execution is considered finished and the client can leave the barrier
- Each task process deletes its own node /barrier/process/${process_id} after completing its task
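A minimal sketch of the double barrier, assuming the /barrier/process parent node exists and requiredCount is the agreed number of participants; the client-side wait polls for simplicity instead of re-arming a children watch:

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DoubleBarrierDemo {
    // Task-process side: register under /barrier/process and, if enough
    // processes have entered, announce readiness by creating /barrier/ready.
    static void enter(ZooKeeper zk, String processId, int requiredCount)
            throws Exception {
        zk.create("/barrier/process/" + processId, new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        List<String> entered = zk.getChildren("/barrier/process", false);
        if (entered.size() >= requiredCount) {
            try {
                zk.create("/barrier/ready", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException e) {
                // another process already announced readiness; that is fine
            }
        }
    }

    // Task-process side: remove our node once the task is done.
    static void leave(ZooKeeper zk, String processId) throws Exception {
        zk.delete("/barrier/process/" + processId, -1);
    }

    // Client side: wait until every task process has left the barrier.
    static void waitForAllToLeave(ZooKeeper zk) throws Exception {
        while (!zk.getChildren("/barrier/process", false).isEmpty()) {
            Thread.sleep(500);
        }
    }
}
```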