Original address: http://zookeeper.apache.org/doc/trunk/zookeeperOver.html
ZooKeeper
- ZooKeeper: A Distributed Coordination Service for Distributed Applications
- Design goals
- Data model and the hierarchical namespace
- Nodes and Ephemeral Nodes
- Conditional Updates and Watches
- Guarantees
- Simple API
- Implementation
- Uses
- Performance
- Reliability
- The ZooKeeper Project
ZooKeeper: A Distributed Coordination Service for Distributed Applications
ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.
Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.
Design goals
ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The name space consists of data registers, called znodes in ZooKeeper parlance, and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in memory, which means ZooKeeper can achieve high throughput and low latency numbers.
The ZooKeeper implementation puts a premium on high performance, highly available, strictly ordered access. The performance aspects of ZooKeeper mean it can be used in large, distributed systems. The reliability aspects keep it from being a single point of failure. The strict ordering means that sophisticated synchronization primitives can be implemented at the client.
ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.
The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of the state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client will connect to a different server.
ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions, such as synchronization primitives.
ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.
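The majority rule above reduces to simple arithmetic. The following is a minimal sketch of that arithmetic only; the function names are illustrative and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def service_available(ensemble_size: int, servers_up: int) -> bool:
    """The service stays available while a majority of servers is up."""
    return servers_up >= quorum_size(ensemble_size)


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers may fail before the majority is lost."""
    return ensemble_size - quorum_size(ensemble_size)
```

Under this rule an ensemble of 5 keeps working with 2 servers down, while an ensemble of 6 also tolerates only 2 failures, which is one reason odd ensemble sizes are a common choice.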
Data model and the hierarchical namespace
The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's name space is identified by a path.
Figure: ZooKeeper's Hierarchical Namespace
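To make the naming rule concrete, here is a small sketch that splits a slash-separated path into its elements and derives the parent path. The helper names and the example path /app1/p_1 are hypothetical, and the validation is deliberately simpler than whatever a real server enforces.

```python
from __future__ import annotations


def split_path(path: str) -> list[str]:
    """Split an absolute path such as '/app1/p_1' into its elements."""
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    if path == "/":
        return []  # the root has no elements
    elements = path[1:].split("/")
    if any(element == "" for element in elements):
        raise ValueError("path contains an empty element")
    return elements


def parent_of(path: str) -> str:
    """Return the path of the node one level up."""
    elements = split_path(path)
    if not elements:
        raise ValueError("the root has no parent")
    return "/" + "/".join(elements[:-1])
```

For example, `split_path("/app1/p_1")` yields the two elements `app1` and `p_1`, and `parent_of("/app1")` is the root `/`.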
Nodes and Ephemeral Nodes
Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.
Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.
The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.
ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends the znode is deleted. Ephemeral nodes are useful when you want to implement [TBD].
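The version number is what makes a coordinated (conditional) update possible: a writer passes the version it last read, and the write is refused if the data has changed since. The sketch below is a toy in-process model, not the ZooKeeper client API; the class and exception names are assumptions, and treating -1 as "any version" mirrors a convention of the real client rather than anything stated in this overview.

```python
from __future__ import annotations

from dataclasses import dataclass


class BadVersionError(Exception):
    """Raised when the caller's expected version is stale."""


@dataclass
class Znode:
    data: bytes = b""
    version: int = 0

    def get(self) -> tuple[bytes, int]:
        # A read returns all of the data together with its version.
        return self.data, self.version

    def set(self, data: bytes, expected_version: int = -1) -> int:
        # Refuse the write if someone else changed the node in between.
        if expected_version != -1 and expected_version != self.version:
            raise BadVersionError(
                f"expected version {expected_version}, found {self.version}"
            )
        self.data = data  # a write replaces all of the data
        self.version += 1
        return self.version
```

Two clients that both read version 0 and then both try to write with `expected_version=0` cannot both succeed: the second write raises `BadVersionError`, and that client has to re-read and retry.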
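The lifetime rule for ephemeral nodes can be modelled in a few lines. This is a toy sketch with hypothetical names (`ToyServer`, `close_session`); it only illustrates that a node tied to a session disappears when that session ends, and it ignores timeouts, replication, and everything else a real server does.

```python
class ToyServer:
    def __init__(self):
        self.nodes = {}   # path -> data
        self.owners = {}  # path -> owning session id (ephemeral nodes only)

    def create(self, path, data, session_id, ephemeral=False):
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = data
        if ephemeral:
            self.owners[path] = session_id

    def close_session(self, session_id):
        """End a session and delete every ephemeral node it created."""
        expired = [p for p, owner in self.owners.items() if owner == session_id]
        for path in expired:
            del self.nodes[path]
            del self.owners[path]
        return sorted(expired)
```

A persistent node created in the same session survives `close_session`; only the nodes created with `ephemeral=True` are removed.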
Conditional Updates and Watches
ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes. When a watch is triggered, the client receives a packet saying that the znode has changed. If the connection between the client and one of the ZooKeeper servers is broken, the client will receive a local notification. These can be used to [TBD].
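The key property described above is that a watch is one-shot: it fires once and is then removed. The following toy sketch models only that behaviour in a single process; `WatchedStore` and its methods are illustrative names, and callbacks stand in for the notification packet a real client would receive.

```python
from collections import defaultdict


class WatchedStore:
    def __init__(self):
        self.data = {}
        self.watches = defaultdict(list)  # path -> pending callbacks

    def get(self, path, watcher=None):
        """Read a node and optionally leave a one-shot watch on it."""
        if watcher is not None:
            self.watches[path].append(watcher)
        return self.data.get(path)

    def set(self, path, value):
        self.data[path] = value
        # Removing the watches before calling them makes each one fire once.
        for watcher in self.watches.pop(path, []):
            watcher(path)  # the notification says *that* the node changed
```

A client that wants to keep following a node therefore has to re-register the watch, typically by reading the node again from inside its callback.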
Guarantees
ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:
Sequential Consistency: updates from a client will be applied in the order that they were sent.
Atomicity: updates either succeed or fail. There are no partial results.
Single System Image: a client will see the same view of the service regardless of the server that it connects to.
For more information on these, and how they can be used, see [TBD]
Simple API
One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:
- create: creates a node at a location in the tree
- delete: deletes a node
- exists: tests if a node exists at a location
- get data: reads the data from a node
- set data: writes data to a node
- get children: retrieves a list of children of a node
- sync: waits for data to be propagated
For a more in-depth discussion of these, and how they can be used to implement higher-level operations, please refer to [TBD]
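To give a feel for how small this interface is, here is a toy single-process tree that offers the same seven operations. It is a sketch for illustration only: the class `ToyTree` and its method signatures are assumptions and do not match the Java or C bindings, and `sync` is a no-op because there is nothing to propagate inside one process.

```python
class ToyTree:
    def __init__(self):
        self._data = {"/": b""}  # path -> data; the root always exists

    @staticmethod
    def _parent(path):
        return path.rsplit("/", 1)[0] or "/"

    def create(self, path, data=b""):
        if path in self._data:
            raise KeyError(f"{path} already exists")
        if self._parent(path) not in self._data:
            raise KeyError(f"parent of {path} does not exist")
        self._data[path] = data

    def delete(self, path):
        if path == "/":
            raise ValueError("cannot delete the root")
        if self.get_children(path):
            raise ValueError(f"{path} still has children")
        del self._data[path]

    def exists(self, path):
        return path in self._data

    def get_data(self, path):
        return self._data[path]

    def set_data(self, path, data):
        if path not in self._data:
            raise KeyError(f"{path} does not exist")
        self._data[path] = data

    def get_children(self, path):
        return sorted(
            p.rsplit("/", 1)[1]
            for p in self._data
            if p != "/" and self._parent(p) == path
        )

    def sync(self):
        pass  # nothing to wait for in a single-process model
```

Because a node can hold data and children at the same time, `create("/app1", b"cfg")` followed by `create("/app1/p_1")` leaves `/app1` with both its data and one child.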
Implementation
The figure ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
Every ZooKeeper server services clients. Clients connect to exactly one server to submit requests. Read requests are serviced from the local replica of each server database. Requests that change the state of the service, write requests, are processed by an agreement protocol.
As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.
ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system will be when the write is applied and transforms this into a transaction that captures this new state.
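The "log first, then apply" ordering can be sketched as a tiny write-ahead log. This is an assumption-laden illustration, not ZooKeeper's on-disk format: the class name `LoggedStore`, the JSON-lines log, and the absence of snapshots are all simplifications chosen for brevity.

```python
import json
import os


class LoggedStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.tree = {}  # in-memory image: path -> data
        # Recovery: rebuild the in-memory image by replaying the log.
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as log:
                for line in log:
                    self._apply(json.loads(line))

    def _apply(self, txn):
        self.tree[txn["path"]] = txn["data"]

    def write(self, path, data):
        txn = {"path": path, "data": data}
        # 1. Serialize the update to disk and force it out of OS buffers.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(txn) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply it to the in-memory database.
        self._apply(txn)
```

If the process dies after step 1 but before step 2, the update is not lost: constructing a new `LoggedStore` on the same log file replays it.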
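One way to picture the last step is a leader that turns each write request into a transaction recording the resulting state (data plus new version), stamped with a sequence number, so that followers only have to apply transactions in order. The sketch below is a toy model and not the actual messaging protocol: the class names are hypothetical, the `zxid` field name is borrowed from ZooKeeper terminology, and proposals, acknowledgements, and leader election are left out entirely.

```python
class Leader:
    def __init__(self):
        self.state = {}     # path -> (data, version)
        self.next_zxid = 1  # order stamp for the next transaction

    def to_transaction(self, path, data):
        """Turn a write request into a transaction capturing the new state."""
        _, version = self.state.get(path, (None, -1))
        txn = {
            "zxid": self.next_zxid,
            "path": path,
            "data": data,
            "version": version + 1,  # computed once, by the leader
        }
        self.next_zxid += 1
        self.state[path] = (data, txn["version"])
        return txn


class Follower:
    def __init__(self):
        self.state = {}
        self.last_zxid = 0

    def apply(self, txn):
        if txn["zxid"] <= self.last_zxid:
            return  # duplicate delivery: state is already captured
        if txn["zxid"] != self.last_zxid + 1:
            raise ValueError("transactions must be applied in order")
        self.state[txn["path"]] = (txn["data"], txn["version"])
        self.last_zxid = txn["zxid"]
```

Because each transaction states the outcome rather than the request, a follower that receives the same transaction twice ends up in the same state as one that receives it once.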
Uses
The programming interface to ZooKeeper is deliberately simple. With it, however, you can implement higher-order operations, such as synchronization primitives, group membership, ownership, etc. Some distributed applications have used it to: [TBD: add uses from white paper and video presentation.] For more information, see [TBD]
Performance
ZooKeeper is designed to be highly performant. But is it? The results of the ZooKeeper development team at Yahoo! indicate that it is. (See ZooKeeper Throughput as the Read-Write Ratio Varies.) It is especially high performance in applications where reads outnumber writes, since writes involve synchronizing the state of all servers. (Reads outnumbering writes is typically the case for a coordination service.)
Figure: ZooKeeper Throughput as the Read-Write Ratio Varies
The figure ZooKeeper Throughput as the Read-Write Ratio Varies is a throughput graph of ZooKeeper release 3.2 running on servers with dual 2GHz Xeon processors and two SATA 15K RPM drives. One drive was used as a dedicated ZooKeeper log device. The snapshots were written to the OS drive. Write requests were 1K writes and the reads were 1K reads. "Servers" indicates the size of the ZooKeeper ensemble, the number of servers that make up the service. Additional servers were used to simulate the clients. The ZooKeeper ensemble was configured such that leaders do not allow connections from clients.
Note
In version 3.2 r/w performance improved by ~2x compared to the previous 3.1 release.
Benchmarks also indicate that it is reliable. Reliability in the Presence of Errors shows how a deployment responds to various failures. The events marked in the figure are the following:
Failure and recovery of a follower
Failure and recovery of a different follower
Failure of the leader
Failure and recovery of followers
Failure of another leader
Reliability
To show the behavior of the system over time as failures are injected, we ran a ZooKeeper service made up of 7 machines. We ran the same saturation benchmark as before, but this time we kept the write percentage at a constant 30%, which is a conservative ratio of our expected workloads.
Figure: Reliability in the Presence of Errors
There are a few important observations from this graph. First, if followers fail and recover quickly, then ZooKeeper is able to sustain a high throughput despite the failure. Second, and maybe more importantly, the leader election algorithm allows the system to recover fast enough to prevent throughput from dropping substantially. In our observations, ZooKeeper takes less than 200ms to elect a new leader. Third, as followers recover, ZooKeeper is able to raise throughput again once they start processing requests.
The ZooKeeper Project
ZooKeeper has been successfully used in many industrial applications. It is used at Yahoo! as the coordination and failure recovery service for Yahoo! Message Broker, which is a highly scalable publish-subscribe system managing thousands of topics for replication and data delivery. It is used by the Fetching Service for Yahoo! crawler, where it also manages failure recovery. A number of Yahoo! advertising systems also use ZooKeeper to implement reliable services.
All users and developers are encouraged to join the community and contribute their expertise. See the ZooKeeper Project on Apache for more information.