I. What is ZooKeeper?
The official description: the ZooKeeper distributed service framework is a sub-project of Apache Hadoop. It is mainly used to solve data management problems commonly encountered in distributed applications, such as unified naming services, state synchronization, cluster management, and management of distributed application configuration items.
That sounds rather abstract, so let's change the angle: first look at what ZooKeeper actually provides, then see what those features let us do.
II. What ZooKeeper provides
Simply put, ZooKeeper = a file system + a notification mechanism.
1. File system
ZooKeeper maintains a data structure similar to a file system:
Each sub-directory entry, such as NameService, is called a znode. Just as in a file system, we can freely add and remove znodes, and add and remove child znodes under a znode; the difference is that a znode can store data.
There are four types of znode:
1. PERSISTENT - persistent directory node
The node still exists after the client disconnects from ZooKeeper.
2. PERSISTENT_SEQUENTIAL - persistent, sequentially numbered directory node
The node still exists after the client disconnects from ZooKeeper; ZooKeeper appends a sequential number to the node name.
3. EPHEMERAL - temporary directory node
The node is deleted after the client disconnects from ZooKeeper.
4. EPHEMERAL_SEQUENTIAL - temporary, sequentially numbered directory node
The node is deleted after the client disconnects from ZooKeeper; ZooKeeper appends a sequential number to the node name.
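As a concrete illustration, here is a minimal Java sketch that creates one node of each type with the standard ZooKeeper client API. The server address localhost:2181 and the paths are assumptions made up for this example; production code would also wait for the session's SyncConnected event before issuing requests.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeTypes {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address is an assumption for this example)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // 1. PERSISTENT: the node survives client disconnects
        zk.create("/app", "cfg".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // 2. PERSISTENT_SEQUENTIAL: survives disconnects, name gets a sequence suffix
        //    (e.g. /app/task0000000001)
        zk.create("/app/task", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);

        // 3. EPHEMERAL: deleted automatically when this client's session ends
        zk.create("/app/worker1", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // 4. EPHEMERAL_SEQUENTIAL: deleted on session end, name is sequentially numbered
        zk.create("/app/lock-", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        zk.close();
    }
}
```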
2. Notification mechanism
The client registers a watch on the directory nodes it cares about, and ZooKeeper notifies the client whenever a watched node changes (its data changes, the node is deleted, or a child node is added or removed).
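A small sketch of registering such watches with the Java client, again assuming a server at localhost:2181 and an existing /app node. Note that ZooKeeper watches are one-shot: once a watch fires, it must be registered again.

```java
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class WatchDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, e -> { });

        // Register a one-shot watch while reading /app: we are notified once when
        // its data changes or it is deleted, and must re-register to keep watching.
        byte[] data = zk.getData("/app",
                event -> System.out.println("Notified: " + event.getType() + " on " + event.getPath()),
                new Stat());
        System.out.println("Current data: " + new String(data));

        // Additions and removals of child nodes are watched separately via getChildren.
        zk.getChildren("/app", event -> System.out.println("Children of /app changed"));

        Thread.sleep(60_000);   // keep the session alive long enough to receive a notification
        zk.close();
    }
}
```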
It really is that simple. Now let's see what we can do with it.
III. What can we do with ZooKeeper?
1. Naming Service
This seems to be the simplest use: creating a directory in the ZooKeeper file system gives you a unique path. When we use tborg and cannot know in advance which machines the upstream program will be deployed on, the upstream and downstream programs can agree on a path beforehand; through that path they can discover each other.
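A minimal sketch of the idea, assuming a made-up service name /services/order-service and address; the upstream program publishes itself under the agreed path and the downstream program looks it up by the same path:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class NamingService {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, e -> { });

        // Agreed parent directory for service names (created once if missing)
        if (zk.exists("/services", false) == null) {
            zk.create("/services", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // The upstream program publishes its address under the agreed unique path;
        // an ephemeral node disappears with it if the program goes away.
        zk.create("/services/order-service", "10.0.0.12:8080".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // The downstream program discovers it simply by reading the same path.
        byte[] addr = zk.getData("/services/order-service", false, null);
        System.out.println("order-service is at " + new String(addr));

        zk.close();
    }
}
```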
2. Configuration Management
Programs always need configuration, and if a program is deployed on many machines, changing the configuration one machine at a time becomes very tedious. So put all of this configuration on ZooKeeper instead: save it in a ZooKeeper directory node and have all the relevant applications watch that node. As soon as the configuration changes, each application is notified by ZooKeeper, fetches the new configuration from ZooKeeper, and applies it to the system.
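A sketch of this pattern, assuming the configuration lives at a made-up path /app/config; every notification triggers a re-read of the node and a re-registration of the watch:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

public class ConfigWatcher {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, e -> { });
        watchConfig(zk, "/app/config");
        new CountDownLatch(1).await();   // block forever; a real app runs its own lifecycle
    }

    // Read the configuration node and re-register a watch so we hear about the next change.
    static void watchConfig(ZooKeeper zk, String path) throws Exception {
        byte[] config = zk.getData(path, event -> {
            if (event.getType() == EventType.NodeDataChanged) {
                try {
                    watchConfig(zk, path);   // fetch the new value and watch again (watches are one-shot)
                } catch (Exception ex) {
                    ex.printStackTrace();
                }
            }
        }, null);
        System.out.println("Applying configuration: " + new String(config));
    }
}
```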
3. Cluster Management
Cluster management really comes down to two things: knowing whether machines leave or join, and electing a master.
For the first point, all machines agree to create temporary (ephemeral) directory nodes under the parent directory GroupMembers and then watch the parent node for child-change messages. Once a machine goes down, its connection to ZooKeeper is broken and the ephemeral node it created is deleted, so every other machine is notified that a sibling directory was removed; everyone then knows that machine is gone. A newly joining machine works the same way: all machines are notified that a new sibling directory has appeared, and the head count is up to date again.
For the second point, we change this slightly: all machines create temporary sequentially numbered directory nodes, and each time an election is needed, the machine with the smallest number is taken as the master.
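The following sketch combines both points, assuming a pre-existing persistent parent node /groupmembers: each machine registers itself with an ephemeral sequential node, watches the parent for membership changes, and treats the smallest sequence number as the master.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMember {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, e -> { });

        // Each machine announces itself with an ephemeral sequential node under the
        // agreed parent (/groupmembers is assumed to already exist as a persistent node).
        String me = zk.create("/groupmembers/member-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // Watch the parent: any join or crash changes the child list and triggers a notification.
        List<String> members = zk.getChildren("/groupmembers",
                event -> System.out.println("Membership changed: " + event.getPath()));

        // Election rule from the text: the member with the smallest sequence number is master.
        Collections.sort(members);
        boolean iAmMaster = me.endsWith(members.get(0));
        System.out.println("Members: " + members + ", I am master: " + iAmMaster);
    }
}
```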
4. Distributed Lock
With ZooKeeper's consistent file system, the locking problem becomes easy. Lock services fall into two categories: one maintains exclusive access, the other controls ordering.
For the first category, we treat a znode on ZooKeeper as the lock, implemented simply by creating it: every client tries to create the /distribute_lock node, and the client that succeeds holds the lock. As the sign in the restroom says, flush when you come and flush when you go: when you are done, delete the /distribute_lock node you created to release the lock.
For the second category, /distribute_lock already exists, and every client creates a temporary sequentially numbered directory node under it. Exactly as in master election, the client with the smallest number obtains the lock; it deletes its node when finished, and the next number takes its turn.
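A minimal sketch of the first category. Using an ephemeral node for /distribute_lock is an assumption beyond the text, chosen so that a crashed holder cannot leak the lock:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ExclusiveLock {
    // Try to take the lock: whoever creates /distribute_lock first owns it.
    static boolean tryLock(ZooKeeper zk) throws Exception {
        try {
            zk.create("/distribute_lock", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;                    // we created it, so the lock is ours
        } catch (KeeperException.NodeExistsException e) {
            return false;                   // someone else already holds the lock
        }
    }

    // Release the lock by deleting the node; being ephemeral, it is also removed
    // automatically if the holder crashes.
    static void unlock(ZooKeeper zk) throws Exception {
        zk.delete("/distribute_lock", -1);
    }

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, e -> { });
        if (tryLock(zk)) {
            System.out.println("Got the lock, doing work...");
            unlock(zk);
        }
        zk.close();
    }
}
```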
5. Queue Management
There are two types of queues:
1. A synchronization queue: the queue becomes usable only when all of its members have arrived; until then it waits for the remaining members.
2. A FIFO queue: enqueue and dequeue operations are performed in first-in, first-out order.
For the first type, each member creates a temporary directory node under the agreed directory and watches whether the number of child nodes has reached the required count.
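A sketch of such a synchronization queue, assuming a pre-existing persistent parent /queue and a made-up required size of 3; each member announces itself and then waits until the child count reaches the target:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SyncQueue {
    public static void main(String[] args) throws Exception {
        int required = 3;   // queue becomes usable once 3 members have arrived (made-up number)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, e -> { });

        // Announce this member with an ephemeral sequential node under the agreed parent
        // (/queue is assumed to already exist as a persistent node).
        zk.create("/queue/member-", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // Check-then-wait: re-register a watch on each pass and block until the child list changes.
        while (true) {
            CountDownLatch changed = new CountDownLatch(1);
            List<String> members = zk.getChildren("/queue", event -> changed.countDown());
            if (members.size() >= required) {
                break;                       // everyone has arrived, the queue is usable
            }
            changed.await();                 // woken by the next membership change
        }
        System.out.println("All " + required + " members arrived");
        zk.close();
    }
}
```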
The second type works on the same principle as the ordering scenario in the distributed lock service: entries are numbered on enqueue and dequeued in numeric order.
Now we know what ZooKeeper can do for us, but as programmers we are always keen to know how ZooKeeper does it. Maintaining a file system on a single node is not difficult, but having a cluster maintain a file system while keeping the data consistent is very difficult.
IV. Distribution and data replication
As a cluster that provides a consistent data service, ZooKeeper naturally has to replicate data across all of its machines. The benefits of data replication:
1. Fault tolerance
If one node fails, the whole system does not stop working; another node can take over its work.
2. Improved scalability
The load can be spread across multiple nodes, or nodes can be added to raise the system's capacity.
3. Improved performance
Clients can access a nearby node locally, which speeds up user access.
Judging by how transparent read and write access is to the client, data-replicated cluster systems fall into the following two types:
1. Write to master (WriteMaster)
Changes to the data are submitted to a designated node. Reads have no such restriction and can go to any node. In this case the client has to distinguish reads from writes, which is commonly called read/write splitting.
2. Write to any (WriteAny)
Changes to the data can be submitted to any node, just like reads. In this case the roles of the cluster nodes, and changes to them, are transparent to the client.
For ZooKeeper, the approach used is write-to-any. By adding machines, its read throughput and responsiveness scale very well; write throughput, however, inevitably drops as machines are added (which is also why it introduced the Observer role), and write responsiveness depends on the implementation: replicate with a delay and keep only eventual consistency, or replicate immediately and respond fast.
Our focus is on how the data on every machine in the cluster is kept consistent, and this is where the Paxos algorithm comes in.
V. Data consistency and the Paxos algorithm
It is said that the Paxos algorithm is as famously hard to understand as it is famous, so let's first look at how data consistency can be maintained at all. There is one principle:
In a distributed database system, if every node starts from the same initial state and executes the same sequence of operations, they will all end up in a consistent state.
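A tiny plain-Java illustration of this principle (nothing ZooKeeper-specific, just to make the reasoning concrete): two replicas that start from the same state and apply the same numbered operation log necessarily end in the same state.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class ReplicaDemo {
    public static void main(String[] args) {
        // The globally agreed, numbered sequence of write operations.
        List<Consumer<Map<String, String>>> log = List.of(
                state -> state.put("x", "1"),      // op #1
                state -> state.put("y", "2"),      // op #2
                state -> state.put("x", "3"));     // op #3

        Map<String, String> replicaA = new LinkedHashMap<>();
        Map<String, String> replicaB = new LinkedHashMap<>();

        // Both replicas start empty (identical initial state) and apply the log in the same order.
        log.forEach(op -> op.accept(replicaA));
        log.forEach(op -> op.accept(replicaB));

        // Same initial state + same operation sequence => same final state.
        System.out.println(replicaA.equals(replicaB));   // prints: true
    }
}
```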
The problem the Paxos algorithm solves is precisely ensuring that every node executes the same sequence of operations. At first that does not sound hard: let a master maintain a global write queue, and force every write to be numbered through that queue; then no matter how many nodes we write to, as long as writes are applied in numbered order, consistency is guaranteed. True, but what if the master goes down?
The Paxos algorithm assigns global numbers to writes by voting. At any moment only one write can be approved; concurrent writes compete for votes, and only a write that gains a majority is approved (so there is only ever one approved write at a time), while the losing writes have to start another round of voting. Round after round, year after year, every write thus ends up strictly numbered and ordered. The numbers increase strictly: if a node accepts a write numbered 100 and later receives one numbered 99 (because of network delay or any of many unforeseeable causes), it immediately realizes its data is inconsistent, stops serving clients automatically, and restarts its synchronization process. Any single node going down does not affect the data consistency of the whole cluster (with 2n+1 machines in total, consistency holds unless more than n of them fail).
To sum up: how is data consistency guaranteed? It is voted into being, one round at a time.
References (some pictures and text are taken directly from them):
http://blog.csdn.net/chen77716/article/details/6166675
http://blog.sina.com.cn/s/blog_5374d6e30100sn4l.html
http://rdc.taobao.com/team/jm/archives/448
http://www.ibm.com/developerworks/cn/opensource/os-cn-zookeeper/
VII. Analysis of ZooKeeper and Paxos