ZooKeeper is an important component of the Hadoop ecosystem. Its main function is to provide a coordination service for distributed systems; the corresponding service at Google is called Chubby. This article introduces ZooKeeper in three parts: the first part covers the basic principles of ZooKeeper, the second part covers the client API that ZooKeeper provides, and the third part covers some typical ZooKeeper application scenarios.
ZooKeeper Fundamentals
1. Data model
As shown in the figure above, the ZooKeeper data model is structured much like a Unix file system: it is a tree in which each node is called a ZNode. Every ZNode is uniquely identified by its path; for example, the first ZNode on the third level in the figure has the path /app1/c1. A small amount of data can be stored on each ZNode (the default limit is 1 MB and can be changed through configuration; storing large amounts of data on a ZNode is not recommended), which is useful in the typical scenarios described later. In addition, each ZNode stores its own ACL information. Note that although the ZNode tree resembles a Unix file system, its ACL model is completely different: each ZNode's ACL is independent, and child nodes do not inherit from their parents.
2. Important concepts
2.1 ZNode
ZNodes were introduced above. According to their lifecycle, ZNodes can be divided into the following two categories:
Regular ZNode: a regular ZNode that the user must create and delete explicitly
Ephemeral ZNode: a temporary ZNode that the user can delete explicitly, or that the ZooKeeper server deletes automatically when the session that created it ends
A ZNode can also have the Sequential property: if it is specified at creation time, ZooKeeper automatically appends a monotonically increasing sequence number to the ZNode's name.
2.2 Session
Communication between the client and ZooKeeper requires a session, which has a configurable timeout. Because the ZooKeeper cluster persists the client's session information, the client's connection can move transparently between ZooKeeper servers before the session times out.
In practice, if communication between the client and server is frequent enough, no additional messages are needed to keep the session alive. Otherwise, the ZooKeeper client sends a heartbeat to the server every t/3 ms; if it receives no heartbeat from the server within 2t/3 ms, it switches to a new ZooKeeper server. Here t is the session timeout configured by the user.
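As a minimal, hedged sketch of creating a client handle with a session timeout (the connection string, the 30-second timeout, and the class name are illustrative assumptions, not part of the original article):

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkConnect {
    public static void main(String[] args) throws IOException, InterruptedException {
        final CountDownLatch connected = new CountDownLatch(1);
        // 30000 ms is the session timeout t; on an otherwise idle connection the
        // client library sends heartbeats roughly every t/3 ms.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();   // wait until the session is established
        System.out.println("session id: 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}
```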
2.3 Watcher
ZooKeeper supports the Watch operation. A client can set a Watcher on a ZNode to watch for changes on that ZNode; when a corresponding change occurs, the Watcher is triggered and the client that set it is notified with the corresponding event. Note that ZooKeeper Watchers are one-shot: once triggered, a Watcher is removed, and the client must set it again if it wants to keep watching. This is similar to the oneshot mode in epoll.
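Below is a small sketch of re-registering a one-shot data watch. The path /app1/c1 is taken from the data-model figure; the class name and the assumption of an already-connected ZooKeeper handle are illustrative.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DataWatcher implements Watcher {
    private final ZooKeeper zk;
    private final String path;

    public DataWatcher(ZooKeeper zk, String path) throws KeeperException, InterruptedException {
        this.zk = zk;
        this.path = path;
        watch();                                        // set the initial watch
    }

    private void watch() throws KeeperException, InterruptedException {
        byte[] data = zk.getData(path, this, null);     // registers this object as the Watcher
        System.out.println(path + " = " + new String(data));
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                watch();                                // the watch is one-shot, so re-register it
            } catch (KeeperException | InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}
```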
3. ZooKeeper features
3.1 Read/write (update) model
In a ZooKeeper cluster, reads can be served by any ZooKeeper server, which is the key to ZooKeeper's good read performance. Write requests are first forwarded to the Leader, which then broadcasts the update to all Followers using ZooKeeper's atomic broadcast protocol. Once the Leader has received successful write acknowledgements from more than half of the servers, it considers the write successful, persists it, and tells the client that the write succeeded.
3.2 WAL and Snapshot
Like most distributed systems, ZooKeeper uses a write-ahead log (WAL). For each update, ZooKeeper first writes to the WAL, then applies the update to the in-memory data and notifies the client of the result. In addition, ZooKeeper periodically snapshots its in-memory directory tree to disk, similar to the FSImage in HDFS. The main purpose is data persistence; a secondary purpose is to speed up recovery after a restart, since replaying the entire WAL would be slow.
3.3 FIFO
For each ZooKeeper client, all operations follow FIFO order. This is guaranteed by two basic properties: first, network communication between the ZooKeeper client and server is based on TCP, and TCP guarantees the ordering of packets between client and server; second, the ZooKeeper server executes client requests strictly in FIFO order.
3.4 Linearizability
In ZooKeeper, all update operations are strictly ordered with respect to one another: updates are executed serially, which is the key to guaranteeing the correctness of ZooKeeper's semantics.
ZooKeeper Client API
The ZooKeeper client library provides a rich and intuitive API for user programs. Some commonly used calls are:
create(path, data, flags): creates a ZNode. path is the path, data is the data to store on the ZNode, and flags is commonly one of PERSISTENT, PERSISTENT_SEQUENTIAL, EPHEMERAL, EPHEMERAL_SEQUENTIAL
delete(path, version): deletes a ZNode. version can be used to delete a specific version; if version is -1, the version check is skipped and the node is deleted regardless of its version
exists(path, watch): checks whether the specified ZNode exists and sets whether to watch it. The Watcher used here is the one specified when the ZooKeeper instance was created; to set a specific Watcher, call the overloaded version exists(path, watcher). The APIs below that take a watch parameter behave similarly
getData(path, watch): reads the data on the specified ZNode and sets whether to watch it
setData(path, data, version): updates the data of the specified ZNode
getChildren(path, watch): gets the names of all child ZNodes of the specified ZNode and sets whether to watch it
sync(path): synchronizes all update operations issued before the sync, so that they have taken effect on more than half of the ZooKeeper servers. The path parameter is currently unused
setAcl(path, acl): sets the ACL information of the specified ZNode
getAcl(path): gets the ACL information of the specified ZNode
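A brief, hedged sketch of these calls with the official Java client (the connection string, paths and data values are illustrative only):

```java
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class BasicApiDemo {
    public static void main(String[] args) throws Exception {
        // In real code, wait for SyncConnected first (see the session sketch above).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {});

        // create(path, data, flags): here a regular (persistent) ZNode
        zk.create("/app1", "hello".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists(path, watch) / getData(path, watch): read it back
        Stat stat = zk.exists("/app1", false);
        byte[] data = zk.getData("/app1", false, stat);
        System.out.println("data = " + new String(data) + ", version = " + stat.getVersion());

        // setData(path, data, version): conditional update on the version just read
        zk.setData("/app1", "world".getBytes(), stat.getVersion());

        // getChildren(path, watch): list child ZNodes
        List<String> children = zk.getChildren("/app1", false);
        System.out.println("children = " + children);

        // delete(path, version): -1 skips the version check
        zk.delete("/app1", -1);
        zk.close();
    }
}
```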
ZooKeeper Typical Application Scenarios
1. Name Service (NameService)
Distributed applications usually require a complete naming scheme that can both generate unique identifiers and make them easy to recognize and remember. We know that each ZNode is uniquely identified by its path, and the path itself is simple and intuitive; in addition, a small amount of data can be stored on a ZNode. These properties are the basis for implementing a unified NameService. Taking a NameService for HDFS as an example, the basic steps are:
Goal: access a named HDFS cluster by a simple name
Define a naming rule: keep it simple and easy to remember. One possible scheme is [serviceScheme://][zkCluster]-[clusterName]; for example, hdfs://lgprc-example/ denotes an HDFS cluster based on the lgprc ZooKeeper cluster
Configure a DNS mapping: resolve the zkCluster id lgprc to the address of the corresponding ZooKeeper cluster through DNS
Create a ZNode: create the /NameService/hdfs/lgprc-example node on that ZooKeeper cluster and store the HDFS configuration under it
A user program that wants to access the hdfs://lgprc-example/ HDFS cluster first resolves the address of the lgprc ZooKeeper cluster through DNS, then reads the HDFS configuration from the /NameService/hdfs/lgprc-example node, and finally accesses HDFS according to that configuration
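A minimal sketch of the lookup step described above, assuming the lgprc ZooKeeper cluster address has already been obtained through the DNS mapping (the concrete host names below are made up):

```java
import org.apache.zookeeper.ZooKeeper;

public class HdfsNameService {
    public static void main(String[] args) throws Exception {
        // Address obtained from the DNS mapping for the "lgprc" zkCluster id (illustrative value).
        // In real code, wait for the session to be established before issuing requests.
        ZooKeeper zk = new ZooKeeper("lgprc-zk1:2181,lgprc-zk2:2181,lgprc-zk3:2181", 30000, event -> {});

        // Read the HDFS configuration stored under the NameService node for this cluster.
        byte[] conf = zk.getData("/NameService/hdfs/lgprc-example", false, null);
        System.out.println("HDFS config for hdfs://lgprc-example/:\n" + new String(conf));

        // The client would now build its HDFS configuration from this data and access the cluster.
        zk.close();
    }
}
```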
2. Configuration Management
In distributed systems we often encounter this scenario: a job runs as many instances, and most of their configuration items are identical. If changing one configuration item requires changing it instance by instance, that is inefficient and error-prone. ZooKeeper solves this kind of problem well; the basic steps are as follows:
Put the common configuration into a ZNode in ZooKeeper, for example /service/common-conf
Each instance is given the entry address of the ZooKeeper cluster when it starts, reads its configuration from /service/common-conf, and Watches that ZNode
When the cluster administrator modifies common-conf, every instance is notified, updates its configuration according to the notification, and continues to Watch /service/common-conf
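As a hedged illustration of the administrator side, pushing a new version of the shared configuration is a single setData on /service/common-conf; every instance watching that ZNode (for example with a re-registering Watcher like the one sketched earlier) is then notified. The connection string and configuration content below are made up:

```java
import org.apache.zookeeper.ZooKeeper;

public class UpdateCommonConf {
    public static void main(String[] args) throws Exception {
        // Administrator-side update: every instance watching /service/common-conf is notified.
        // In real code, wait for the session to be established before issuing requests.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {});

        String newConf = "log.level=INFO\nflush.interval.ms=5000";   // made-up configuration content
        zk.setData("/service/common-conf", newConf.getBytes(), -1);  // -1: skip the version check
        zk.close();
    }
}
```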
3. Group Membership
In a typical Master-Slave distributed system, the Master needs to manage all the Slaves. When a Slave joins or a Slave goes down, the Master must notice this and react accordingly, so that the service provided by the whole cluster is not affected. In HBase, for example, the HMaster manages all RegionServers: when a new RegionServer joins, the HMaster needs to assign some Regions to it; when a RegionServer dies, the HMaster needs to redistribute the Regions it was serving to the other live RegionServers so that client access is not affected. The basic steps for using ZooKeeper in this scenario are:
The Master creates a /service/slaves node on ZooKeeper and sets a Watch on it
After starting successfully, each Slave creates an Ephemeral node /service/slaves/${slave_id} and writes its own address (ip/port) and other related information to that node
When the Master is notified that a new child node has joined, it handles the event accordingly
If a Slave goes down, its node is ephemeral, so ZooKeeper deletes the node automatically once the Slave's session times out
When the Master is notified that a child node has disappeared, it handles the event accordingly
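A hedged sketch of both sides of this scheme, using the /service/slaves paths from the steps above (class and method names, and the idea of passing in an already-connected handle, are illustrative):

```java
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMembership {
    // Slave side: register an ephemeral node carrying this slave's address.
    // Assumes the Master has already created the /service/slaves parent node.
    static void registerSlave(ZooKeeper zk, String slaveId, String hostPort) throws Exception {
        zk.create("/service/slaves/" + slaveId, hostPort.getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Master side: watch the children of /service/slaves and re-list on every change.
    static void watchSlaves(ZooKeeper zk) throws Exception {
        Watcher membershipWatcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getType() == Event.EventType.NodeChildrenChanged) {
                    try {
                        watchSlaves(zk);        // re-register the watch and handle the change
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }
        };
        List<String> slaves = zk.getChildren("/service/slaves", membershipWatcher);
        System.out.println("current slaves: " + slaves);
        // The master would diff this list against its previous view and reassign work here.
    }
}
```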
4. Simple Mutex Lock (Simple Lock)
We know that in traditional applications, synchronization between threads or processes can be achieved with primitives provided by the operating system. In a distributed system, however, the operating system cannot help with synchronization across machines, and a distributed coordination service such as ZooKeeper is needed. Below is a simple mutex implemented with ZooKeeper, which can be understood by analogy with a mutex used for thread synchronization:
Multiple processes try to create an ephemeral node /locks/my_lock in the agreed directory
ZooKeeper guarantees that only one process can create the node successfully; the process that succeeds is the one that holds the lock (say it is process A)
The other processes Watch /locks/my_lock
When process A no longer needs the lock, it can delete /locks/my_lock explicitly to release it; or, if process A crashes, ZooKeeper deletes /locks/my_lock automatically after its session expires, which also releases the lock
At that point the other processes are notified by ZooKeeper and try to create /locks/my_lock again to acquire the lock, and so on
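A minimal sketch of this lock, assuming an already-connected handle and the /locks/my_lock path from the steps above (the /locks parent is assumed to exist):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleLock {
    private final ZooKeeper zk;

    public SimpleLock(ZooKeeper zk) {
        this.zk = zk;
    }

    /** Try once to grab the lock; returns true if this process now holds it. */
    public boolean tryLock() throws KeeperException, InterruptedException {
        try {
            zk.create("/locks/my_lock", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;                               // we created the node, so we hold the lock
        } catch (KeeperException.NodeExistsException e) {
            return false;                              // someone else holds the lock
        }
    }

    /** Release the lock by deleting the node (also happens automatically if our session dies). */
    public void unlock() throws KeeperException, InterruptedException {
        zk.delete("/locks/my_lock", -1);
    }
}
```

A caller that fails to acquire the lock would Watch /locks/my_lock via exists() and retry tryLock() when it is notified of the node's deletion.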
5. Mutex without Herd Effect (Simple Lock without Herd Effect)
There is a problem with the example in the previous section: every time the lock is released, many processes compete for it at once, causing the herd effect. To avoid this, the procedure above can be improved as follows:
Each process creates an Ephemeral Sequential node /locks/lock_${seq} on ZooKeeper (${seq} is the sequential number generated by ZooKeeper)
The process whose ${seq} is the smallest is the current lock holder
Every other process Watches only the node whose sequence number is immediately below its own, e.g. 2 watches 1, 3 watches 2, and so on
When the current holder releases the lock, the process with the next larger sequence number is notified by ZooKeeper and becomes the new holder, and so on
It is worth adding that Leader Election in distributed systems is usually implemented with ZooKeeper in exactly this way: the current lock holder is the current "master".
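A hedged sketch of this herd-free lock, using Ephemeral Sequential nodes under /locks as described above (the blocking style, class name, and the assumption that /locks already exists are illustrative):

```java
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SequentialLock {
    private final ZooKeeper zk;
    private String myNode;                              // e.g. "lock_0000000042"

    public SequentialLock(ZooKeeper zk) {
        this.zk = zk;
    }

    /** Blocks until this process holds the lock. */
    public void lock() throws Exception {
        String path = zk.create("/locks/lock_", new byte[0],
                                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        myNode = path.substring("/locks/".length());
        while (true) {
            List<String> children = zk.getChildren("/locks", false);
            Collections.sort(children);
            int index = children.indexOf(myNode);
            if (index == 0) {
                return;                                 // smallest sequence number: we hold the lock
            }
            // Watch only the node just before ours to avoid the herd effect.
            String previous = "/locks/" + children.get(index - 1);
            final Object monitor = new Object();
            synchronized (monitor) {
                if (zk.exists(previous, event -> {
                        synchronized (monitor) { monitor.notifyAll(); }
                    }) != null) {
                    monitor.wait();                     // woken up when the previous node goes away
                }
            }
        }
    }

    public void unlock() throws Exception {
        zk.delete("/locks/" + myNode, -1);
    }
}
```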
6. Read/Write Lock
We know that the difference between a read/write lock and a mutex is that it has read and write modes: multiple reads can proceed concurrently, while write/write and read/write accesses are mutually exclusive and cannot proceed at the same time. With a small change to the scheme above, ZooKeeper can also implement traditional read/write-lock semantics; the basic steps are:
Each process creates an Ephemeral Sequential node /locks/lock_${seq} on ZooKeeper, marking it as a read or a write request
The node (or nodes) with the smallest sequence numbers hold the lock; several of them can hold it at once, because multiple reads may proceed concurrently
A process that needs the write lock Watches the node with the next smaller sequence number
A process that needs the read lock Watches the nearest preceding node that corresponds to a write request
When the current holder releases the lock, all processes Watching that node are notified and become the new holders, and so on
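Below is a sketch of just the "which node to watch" decision from the steps above; the read_${seq}/write_${seq} naming convention and the assumption that the caller passes children sorted by sequence number are mine, not the article's:

```java
import java.util.List;

public class ReadWriteLockHelper {
    /**
     * Decide which ZNode (if any) a newly created request should watch.
     * Children are assumed to be named "read_${seq}" or "write_${seq}" (an
     * illustrative convention) and to be passed in sorted by ${seq}.
     * Returns null if the lock is already acquired.
     */
    static String nodeToWatch(List<String> sortedChildren, String myNode) {
        int me = sortedChildren.indexOf(myNode);
        if (me <= 0) {
            return null;                                 // smallest sequence number: lock acquired
        }
        if (myNode.startsWith("write_")) {
            // A writer conflicts with everything before it: watch the immediately preceding node.
            return sortedChildren.get(me - 1);
        }
        // A reader only conflicts with writers: watch the nearest preceding write request.
        for (int i = me - 1; i >= 0; i--) {
            if (sortedChildren.get(i).startsWith("write_")) {
                return sortedChildren.get(i);
            }
        }
        return null;                                     // no earlier writer: readers may proceed
    }
}
```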
7. Barrier
In distributed systems, a barrier has the following semantics: the client must wait until multiple processes have all completed their tasks before it can proceed to the next step. The basic steps for implementing a barrier with ZooKeeper are:
The Client creates a barrier node /barrier/my_barrier on ZooKeeper and starts the processes that execute the tasks
The Client Watches the /barrier/my_barrier node via exists()
After completing its task, each task process checks whether the agreed condition is met; if not, it does nothing, and if it is, it deletes the /barrier/my_barrier node
When the Client is notified that /barrier/my_barrier has been deleted, the barrier is gone and it continues with the next task
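A small sketch of the Client side waiting on the barrier, using the /barrier/my_barrier node from the steps above (class and method names, and the already-connected handle, are illustrative):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class BarrierClient {
    /** Blocks until /barrier/my_barrier has been deleted by the last task process. */
    static void waitForBarrier(ZooKeeper zk) throws Exception {
        final CountDownLatch barrierGone = new CountDownLatch(1);
        Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getType() == Event.EventType.NodeDeleted) {
                    barrierGone.countDown();
                }
            }
        };
        // exists() registers the watch; if the node is already gone, there is nothing to wait for.
        if (zk.exists("/barrier/my_barrier", watcher) != null) {
            barrierGone.await();
        }
        System.out.println("barrier removed, continuing with the next task");
    }
}
```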
8. Double Barrier
A double barrier has the following semantics: it can be used to synchronize both the start and the end of a task. The task starts only when enough processes have entered the barrier, and the barrier is withdrawn only after all processes have finished their respective tasks. The basic steps for implementing a double barrier with ZooKeeper are:
Entering the barrier:
The Client Watches the /barrier/ready node and decides whether to start its task based on whether the node exists
On entering the barrier, each task process creates a temporary node /barrier/process/${process_id}, then checks whether the number of processes that have entered the barrier has reached the required value: if it has, it creates the /barrier/ready node, otherwise it keeps waiting
When the Client is notified that /barrier/ready has been created, it starts executing its task
Leaving the barrier:
The Client Watches /barrier/process; when that node has no children, the tasks are considered finished and the Client can leave the barrier
Each task process deletes its own node /barrier/process/${process_id} when it finishes its task
Original link: http://www.wuzesheng.com/?p=2609