Zookeeper working principle (detailed)


Last Update:2018-08-21 Source: Internet

Author: User

Tags zookeeper


» Leader: initiates and resolves votes, and updates the system state
» Learner: includes followers and observers. A follower accepts client requests, returns results to the client, and votes during leader election
» Observer: accepts client connections and forwards write requests to the leader, but does not take part in voting and only syncs the leader's state. Observers exist to scale the system out and improve read speed
» Client: the initiator of requests

The core of Zookeeper is atomic broadcast, a mechanism that keeps the servers synchronized with one another. The protocol that implements this mechanism is called the Zab protocol. Zab has two modes: recovery mode (leader election) and broadcast mode (synchronization). When the service starts, or after the leader crashes, Zab enters recovery mode. Recovery mode ends once a leader has been elected and a majority of servers have finished synchronizing their state with the leader. State synchronization guarantees that the leader and the other servers have the same system state.

• To guarantee that transactions are applied in a consistent order, Zookeeper identifies each transaction with an incrementing transaction ID (zxid). Every proposal is tagged with a zxid when it is issued. In the implementation, the zxid is a 64-bit number. Its high 32 bits are the epoch, which identifies whether the leadership has changed: each time a leader is elected it gets a new epoch, marking that leader's period of rule. The low 32 bits are an incrementing counter.
• Each server is in one of three states while it runs:
LOOKING: the server does not know who the leader is and is searching for one
LEADING: the server is the elected leader
FOLLOWING: a leader has been elected and the server is synchronized with it

Other documents: http://www.cnblogs.com/lpshou/archive/2013/06/14/3136738.html

2. Reading and writing mechanism of zookeeper

» Zookeeper is a cluster of multiple servers
» One leader, multiple followers
» Each server keeps a copy of the data
» Globally consistent data
» Distributed reads and writes
» Update requests are forwarded to the leader, which carries them out

3. Zookeeper guarantees

» Ordered updates: update requests from the same client are executed in the order in which they were sent
» Atomic updates: a data update either succeeds or fails, with no partial result
» Single data view: a client sees the same view of the data regardless of which server it is connected to
» Timeliness: within a certain time bound, the client can read the latest data

4. Zookeeper node data operation process

Note:

1. The client sends a write request to a follower

2. The follower forwards the request to the leader

3. The leader receives the request, starts a vote, and notifies the followers to vote

4. The followers send their votes to the leader

5. The leader tallies the results; if the write is approved, it performs the write, notifies the followers to write as well, and then commits

6. The follower returns the result of the request to the client
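
The six steps can be sketched as a small in-memory simulation. This is only an illustration under simplifying assumptions: the class names, the `reachable` flag, and the dictionary used as the data store are invented for the example and are not Zookeeper's real classes, and real proposals travel over the network rather than through method calls.

```python
class Leader:
    def __init__(self):
        self.followers = []
        self.data = {}

    def handle_write(self, path, value):
        # Step 3: propose the write to every follower and collect votes.
        acks = 1  # the leader counts its own vote
        for follower in self.followers:
            # Step 4: each follower reports its vote back.
            if follower.vote(path, value):
                acks += 1
        # Step 5: commit only if more than half of the ensemble agreed.
        ensemble_size = len(self.followers) + 1
        if acks * 2 <= ensemble_size:
            return False
        self.data[path] = value
        for follower in self.followers:
            follower.commit(path, value)
        return True


class Follower:
    def __init__(self, leader, reachable=True):
        self.leader = leader
        self.reachable = reachable  # hypothetical stand-in for network health
        self.data = {}
        leader.followers.append(self)

    def vote(self, path, value):
        # A reachable follower acknowledges the proposal.
        return self.reachable

    def commit(self, path, value):
        if self.reachable:
            self.data[path] = value

    def client_write(self, path, value):
        # Steps 1-2: accept the client's write and forward it to the leader.
        ok = self.leader.handle_write(path, value)
        # Step 6: report the outcome back to the client.
        return "ok" if ok else "failed"
```

With one leader and two followers, a write still succeeds when one follower is unreachable, because two of the three servers form a majority.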

A follower has four main functions:
• 1. Send requests to the leader (Ping, Request, ACK, and Revalidate messages);
• 2. Receive messages from the leader and process them;
• 3. Receive client requests and, if a request is a write, forward it to the leader for a vote;
• 4. Return results to the client.
The follower's message loop handles the following messages from the leader:
• 1. Ping message: heartbeat message;
• 2. Proposal message: the leader initiates a proposal and asks the follower to vote;
• 3. Commit message: information about the latest proposal committed on the server side;
• 4. UpToDate message: indicates that synchronization is complete;
• 5. Revalidate message: depending on the leader's revalidation result, either close the session being revalidated or allow it to accept messages;
• 6. Sync message: returns the sync result to the client; this message is originally initiated by the client to force the latest updates.

5. Zookeeper leader election

• More than half must agree
– With 3 machines, 1 can go down: 2 > 3/2
– With 4 machines, 2 cannot go down: 2 is not greater than 4/2

• A proposes: "I want to elect myself. B, do you agree? C, do you agree?" B says, "I agree to elect A"; C says, "I agree to elect A." (Note that more than half have now agreed, so in the real world the election would already have succeeded. But the computer world is strict, and to understand the algorithm we continue the simulation.)
• Then B proposes: "I want to elect myself. A, do you agree?" A says, "More than half have already agreed to elect me, so your proposal is invalid"; C says, "A has already been elected by more than half, so B's proposal is invalid."
• Then C proposes: "I want to elect myself. A, do you agree?" A says, "More than half have already agreed to elect me, so your proposal is invalid"; B says, "A has already been elected by more than half, so C's proposal is invalid."
• Once the leader has been elected, the others are followers and can only obey the leader's orders. There is one small detail here: in practice, whichever server starts first becomes the leader.

6. Zxid

The state information of a znode contains czxid, so what is a zxid?
Each change to the Zookeeper state is assigned an incrementing transaction ID called the zxid. Because zxids are always increasing, if zxid1 is less than zxid2, then zxid1 must have occurred before zxid2.

Creating a node, updating a node's data, or deleting a node all change the Zookeeper state and therefore increase the zxid.

7. Zookeeper working principle

» The core of Zookeeper is atomic broadcast, a mechanism that keeps the servers synchronized with one another. The protocol that implements this mechanism is called the Zab protocol. Zab has two modes: recovery mode and broadcast mode.

When the service starts or the leader crashes, Zab enters recovery mode. Recovery mode ends once a leader has been elected and a majority of servers have finished synchronizing their state with the leader.

State synchronization guarantees that the leader and the other servers have the same system state.

» Once the leader has synchronized state with a majority of followers, it can begin broadcasting messages, that is, it enters broadcast mode. When a new server joins the Zookeeper service, it starts in recovery mode, discovers the leader, and synchronizes its state with the leader. Once synchronization is finished, it also takes part in message broadcasting. The Zookeeper service stays in broadcast mode until the leader crashes or loses the support of a majority of followers.

» Broadcast mode requires proposals to be processed in order, so ZK uses an incrementing transaction ID (zxid) to guarantee this. Every proposal is tagged with a zxid when it is issued.

In the implementation, the zxid is a 64-bit number. Its high 32 bits are the epoch, which identifies whether the leadership has changed: each time a leader is elected, it gets a new epoch. The low 32 bits are an incrementing counter.
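
The 64-bit layout can be illustrated with a few lines of bit arithmetic. The helper names `make_zxid` and `split_zxid` are made up for this sketch and are not part of any Zookeeper API; only the 32/32 split comes from the description above.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch, counter):
    # High 32 bits hold the epoch, low 32 bits hold the counter.
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def split_zxid(zxid):
    # Recover (epoch, counter) from a packed zxid.
    return zxid >> COUNTER_BITS, zxid & COUNTER_MASK
```

Because the epoch sits in the high bits, any zxid from a newer epoch compares greater than every zxid from an older epoch, even if the older counter was very large.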

» When the leader crashes or loses a majority of followers, ZK enters recovery mode. Recovery mode must elect a new leader and restore all servers to a correct state.

» Each server starts by asking the other servers whom they intend to vote for.
» In reply to such a query, a server reports, according to its own state, the ID of the leader it recommends and the zxid of its last transaction (at system start, every server recommends itself)
» After receiving all the replies, the server works out which server has the largest zxid and makes that server its choice for the next round of voting.
» The server that collects the most votes in this process is the winner; if the winner has more than half of the votes, it is elected leader. Otherwise the process repeats until a leader is elected.
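
A compressed sketch of the voting rule follows. It assumes every reachable server sees the same set of replies, so a single round is enough, and it breaks ties on zxid by choosing the larger server ID; both points are simplifying assumptions of this example, and the function name `elect_leader` is hypothetical.

```python
from collections import Counter


def elect_leader(last_zxids, ensemble_size):
    """last_zxids maps the ID of each reachable server to its last zxid.

    Returns the elected server ID, or None if nobody reaches a majority.
    """
    if not last_zxids:
        return None
    votes = Counter()
    for voter in last_zxids:
        # Each voter picks the server with the largest zxid
        # (ties broken by the larger server ID, an assumption here).
        choice = max(last_zxids, key=lambda sid: (last_zxids[sid], sid))
        votes[choice] += 1
    winner, count = votes.most_common(1)[0]
    # The winner needs more than half of the whole ensemble.
    return winner if count * 2 > ensemble_size else None
```

For example, with three servers whose last zxids are 5, 7, and 7, servers 2 and 3 tie on zxid and server 3 wins on ID. If only one server of three is reachable, the function returns `None` because one vote is not a majority.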

» The leader starts waiting for servers to connect
» Each follower connects to the leader and sends it its largest zxid
» The leader determines the synchronization point from the follower's zxid
» When synchronization is complete, the leader notifies the follower that it is now up to date
» Once the follower receives the UpToDate message, it can accept client requests and serve them again
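
One way to picture the synchronization point is as a filter over the leader's transaction log. The sketch below assumes the leader keeps its log as a list of `(zxid, transaction)` pairs in ascending order and that the follower is simply behind; the function name and log shape are invented for illustration, and a real server has to handle more cases than this.

```python
def transactions_to_send(leader_log, follower_last_zxid):
    # Everything after the follower's largest zxid still has to be replayed.
    return [(zxid, txn) for zxid, txn in leader_log if zxid > follower_last_zxid]
```

A follower that reports zxid 1 against a log holding zxids 1 to 3 would receive the entries for 2 and 3, after which the leader can send the UpToDate message.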

8. Data consistency and the Paxos algorithm

• The Paxos algorithm is said to be as hard to understand as it is well known, so let us first look at how data consistency is maintained. Here is one principle:
• In a distributed database system, if every node starts in the same initial state and every node executes the same sequence of operations, then they all end up in the same state.
• The problem Paxos solves is guaranteeing that every node executes the same sequence of operations. That sounds easy enough: have a master maintain a global write queue, and require every write to take a number in this queue. Then no matter how many nodes are written to, as long as the writes are applied in numbered order, consistency is guaranteed. That works, but what if the master goes down?
• Paxos assigns global numbers to writes by voting. At any one time only one write is approved; concurrent writes must compete for votes, and only the write that wins a majority is approved (so there is only ever one approved write at a time). The writes that lose have to start another round of voting. In this way, vote after vote, all writes are strictly numbered. The numbers are strictly increasing: if a node accepts a write numbered 100 and then receives a write numbered 99 (for unforeseen reasons such as network delay), it immediately realizes that its data is inconsistent, stops serving external requests, and restarts the synchronization process. The failure of any single node does not affect the data consistency of the cluster as a whole (with 2n+1 nodes in total, unless more than n fail).
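
The principle and the out-of-order check can be sketched with a toy replica. This is not Paxos itself: the numbers are assumed to have been agreed on already, and the `Replica` class and `OutOfOrderWrite` exception are names invented for the example.

```python
class OutOfOrderWrite(Exception):
    """Raised when a write arrives with a number that is not newer."""


class Replica:
    def __init__(self):
        self.state = {}
        self.last_number = 0

    def apply(self, number, key, value):
        # Numbers must strictly increase; anything else signals inconsistency.
        if number <= self.last_number:
            raise OutOfOrderWrite(
                "got write %d after write %d" % (number, self.last_number)
            )
        self.state[key] = value
        self.last_number = number
```

Two replicas that start empty and apply the same numbered writes end with identical `state`. A replica that has applied write 100 and then receives write 99 raises `OutOfOrderWrite`, which stands in for "stop serving and resynchronize".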
Summary
As a subproject of the Hadoop project, Zookeeper is an essential module in Hadoop cluster management. It is used mainly to coordinate data in the cluster:

for example, it manages the NameNode in a Hadoop cluster, as well as master election and state synchronization between servers in HBase.

For more on the Paxos algorithm, see the article "zookeeper fully analytic--paxos as the soul".

Recommended book: "From Paxos to Zookeeper: Distributed Consistency Principles and Practice"

9. Observer

• Zookeeper needs to guarantee high availability and strong consistency;
• More servers are needed to support more clients;
• As servers are added, latency in the voting phase increases, which hurts performance;
• To balance scalability against high throughput, the observer was introduced;
• Observers do not take part in voting;
• Observers accept client connections and forward write requests to the leader node;
• Adding more observer nodes improves scalability without hurting throughput

10. Why the number of servers in a Zookeeper cluster is generally odd

• The leader election algorithm follows a Paxos-style majority protocol;
• The core idea of Paxos: data is considered written successfully once a majority of servers have written it. With 3 servers, 2 successful writes are enough; with 4 or 5 servers, 3 successful writes are enough.
• The number of servers is generally odd (3, 5, 7). With 3 servers, at most 1 server may fail; with 4 servers, still at most 1 server may fail.

We can see that 3 servers and 4 servers have the same fault tolerance, so to save server resources we usually deploy an odd number of servers.
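
The arithmetic behind this can be checked directly. The two helper names below are made up for the example; they only encode the "more than half" rule stated above.

```python
def quorum_size(servers):
    # Smallest number of servers that is more than half of the ensemble.
    return servers // 2 + 1


def tolerated_failures(servers):
    # Servers that may fail while a quorum can still be formed.
    return servers - quorum_size(servers)
```

For 3, 4, and 5 servers the quorum sizes are 2, 3, and 3, and the tolerated failures are 1, 1, and 2. Going from 3 to 4 servers adds a machine without adding any fault tolerance.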

11. The Zookeeper data model

» Hierarchical directory structure, with naming that follows normal file system conventions
» Each node in Zookeeper is called a znode and is uniquely identified by its path
» A znode can contain both data and child nodes, but ephemeral znodes cannot have child nodes
» The data in a znode is versioned: every update increments the znode's version number, so an update or delete can carry the version it expects
» Client applications can set a watch on a node
» Nodes do not support partial reads or writes; data is read and written in full, in one operation
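
The versioning rule can be sketched as a conditional update. The toy `Znode` class below is an in-memory stand-in, not the Zookeeper client API; it follows the real convention that an expected version of -1 matches any version, and the `BadVersion` exception name is hypothetical.

```python
class BadVersion(Exception):
    """Raised when the caller's expected version is stale."""


class Znode:
    def __init__(self, data=b""):
        self.data = data
        self.version = 0

    def set_data(self, data, expected_version=-1):
        # -1 means "do not check"; otherwise the versions must match.
        if expected_version != -1 and expected_version != self.version:
            raise BadVersion(
                "expected %d, found %d" % (expected_version, self.version)
            )
        # The whole payload is replaced; there are no partial writes.
        self.data = data
        self.version += 1
        return self.version
```

Two clients that both read version 0 and then both try to write with `expected_version=0` cannot both succeed: the first write bumps the version to 1 and the second raises `BadVersion`.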

12. Zookeeper nodes

» There are two kinds of znode: ephemeral and persistent.
» A znode's type is fixed at creation time and cannot be changed afterwards
» Ephemeral znodes: when the client session ends, Zookeeper deletes the ephemeral znodes; an ephemeral znode cannot have child nodes
» Persistent znodes do not depend on the client session and are deleted only when a client explicitly deletes them
» There are four types of znode directory node:
» PERSISTENT
» EPHEMERAL
» PERSISTENT_SEQUENTIAL (persistent, sequentially numbered directory node)
» EPHEMERAL_SEQUENTIAL (ephemeral, sequentially numbered directory node)
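
The four node types can be modelled with a small in-memory tree. The `ToyTree` class and its method names are invented for this sketch and are not a Zookeeper client; the ten-digit, zero-padded suffix per parent is modelled on how Zookeeper names sequential nodes.

```python
class ToyTree:
    def __init__(self):
        self.paths = {"/"}
        self.owners = {}    # ephemeral path -> owning session ID
        self.counters = {}  # parent path -> next sequence number

    def create(self, path, session=None, ephemeral=False, sequential=False):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.paths:
            raise KeyError("parent %s does not exist" % parent)
        if parent in self.owners:
            raise ValueError("ephemeral znodes cannot have children")
        if ephemeral and session is None:
            raise ValueError("an ephemeral znode needs an owning session")
        if sequential:
            # Append a zero-padded counter kept per parent node.
            number = self.counters.get(parent, 0)
            self.counters[parent] = number + 1
            path = "%s%010d" % (path, number)
        if path in self.paths:
            raise ValueError("%s already exists" % path)
        self.paths.add(path)
        if ephemeral:
            self.owners[path] = session
        return path

    def close_session(self, session):
        # Ending a session removes every ephemeral znode it created.
        for path in [p for p, s in self.owners.items() if s == session]:
            del self.owners[path]
            self.paths.discard(path)
```

Creating `/app/lock-` twice as an ephemeral sequential node yields `/app/lock-0000000000` and `/app/lock-0000000001`; closing the owning session removes both, while the persistent `/app` node remains.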

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
