ZooKeeper Source Reading (1): The Zab Protocol

Last Update:2018-07-26 Source: Internet

Author: User



ZooKeeper keeps an in-memory database that is organized as a tree. Each node of the tree is called a znode (the related code is in DataTree.java and DataNode.java).
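To make the tree structure concrete, here is a minimal sketch in Python. It is not ZooKeeper's Java implementation: the names ZNode and ZNodeTree and their methods are hypothetical, and only path-based create, get, and child listing are modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ZNode:
    """Hypothetical stand-in for one tree node (not ZooKeeper's DataNode)."""
    data: bytes = b""
    children: dict = field(default_factory=dict)  # child name -> ZNode


class ZNodeTree:
    """Toy in-memory tree addressed by slash-separated paths."""

    def __init__(self):
        self.root = ZNode()

    def _walk(self, path):
        # Follow each path component down from the root.
        node = self.root
        for part in [p for p in path.split("/") if p]:
            node = node.children[part]  # raises KeyError if a component is missing
        return node

    def create(self, path, data=b""):
        # Split "/a/b" into parent path "/a" and node name "b".
        parent_path, _, name = path.rstrip("/").rpartition("/")
        parent = self._walk(parent_path)
        if name in parent.children:
            raise KeyError(f"node exists: {path}")
        parent.children[name] = ZNode(data)

    def get(self, path):
        return self._walk(path).data

    def children(self, path):
        return sorted(self._walk(path).children)
```

For example, creating "/app" and then "/app/config" yields a root with one child, which in turn has one child that holds the data.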

A client can connect to any server in the ZooKeeper cluster.

A read request is answered directly from the local znode data of the server the client is connected to. A write is converted into a transaction and forwarded to the cluster's leader for processing. ZooKeeper's transaction commit guarantees that the write (update) is applied consistently on all machines in the cluster.
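The routing rule described above can be sketched as follows. This is an illustration under loud assumptions: Replica and Ensemble are hypothetical names, the leader is hard-coded instead of elected, and the commit step simply applies the update on every replica as a stand-in for Zab.

```python
class Replica:
    """One server's local copy of the data (hypothetical, greatly simplified)."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # path -> data


class Ensemble:
    def __init__(self, names):
        self.replicas = {n: Replica(n) for n in names}
        self.leader = names[0]  # assumption: leader is fixed; election is out of scope
        self.forwarded = []     # (receiving server, path) for each forwarded write

    def read(self, server, path):
        # Reads never leave the server the client is connected to.
        return self.replicas[server].store.get(path)

    def write(self, server, path, data):
        # A non-leader server forwards the write to the leader.
        if server != self.leader:
            self.forwarded.append((server, path))
        self._commit_on_leader(path, data)

    def _commit_on_leader(self, path, data):
        # Stand-in for the Zab commit: apply the update on every replica.
        for replica in self.replicas.values():
            replica.store[path] = data
```

A write sent to "s3" is recorded as forwarded and then becomes visible to a read on "s2", while a write sent to the leader itself is not forwarded.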

The protocol ZooKeeper uses to commit transactions is not Paxos but Zab (ZooKeeper Atomic Broadcast), which is adapted from the two-phase commit protocol.

Zab guarantees the following properties:

Reliable delivery: If a message M is committed by one server, then M will eventually be committed by all servers.

Total order: If one server commits A before B, then every server that commits both A and B commits A before B.

Causal order: For two committed messages A and B, if A causally precedes B, then A must be committed before B.

Causal precedence in the third property covers two cases: two messages from the same sender, where A was sent before B; or a message A sent by a previous leader and a message B sent by the current leader.
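The total order property can be expressed as a small check over delivery histories. The sketch below is only an illustration: the function name respects_total_order and the input shape (server name mapped to the list of messages it committed, in order) are assumptions made for the example.

```python
def respects_total_order(delivered_by_server):
    """Return True if any two servers agree on the relative order
    of every pair of messages that both of them committed."""
    logs = list(delivered_by_server.values())
    for i, log_a in enumerate(logs):
        position_in_a = {msg: k for k, msg in enumerate(log_a)}
        for log_b in logs[i + 1:]:
            # Messages committed by both servers, in log_b's order.
            common = [msg for msg in log_b if msg in position_in_a]
            positions = [position_in_a[msg] for msg in common]
            # If log_a orders them differently, positions is not ascending.
            if positions != sorted(positions):
                return False
    return True
```

Servers that have committed only a subset of the messages do not violate the property; only a disagreement on relative order does.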

A server running the Zab protocol is in one of two modes: broadcast mode or recovery mode. Recovery mode is entered when the leader fails or when the leader no longer has a quorum of followers.

Before the leader starts broadcasting, a quorum (majority) of followers must be synchronized with it.

When a server comes back online while a leader is active, it enters recovery mode and synchronizes with the leader.

Broadcast mode uses two-phase commit, but the protocol is simplified because no abort is needed. A follower either ACKs the proposal or abandons the leader, since ZooKeeper guarantees that there is only one leader at a time. The leader also does not need to wait for an ACK from every server; ACKs from a quorum are enough.

When a follower receives a proposal, it writes it to disk (batching writes where possible) and returns an ACK.

Once the leader has received ACKs from a quorum, it broadcasts a COMMIT message and delivers the message locally.

A follower delivers the message after it receives the COMMIT.
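The three broadcast steps above can be sketched as a single-process simulation. Everything here is an assumption made for illustration: the class and method names are hypothetical, "disk" is a Python list, the sequence number seq is a plain counter, and the leader counts itself as one ACK toward the quorum.

```python
class Follower:
    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.log = []        # proposals written to "disk" (a list in this sketch)
        self.delivered = []  # messages delivered after a COMMIT

    def on_proposal(self, seq, msg):
        if not self.online:
            return False     # an unreachable follower sends no ACK
        self.log.append((seq, msg))
        return True          # ACK

    def on_commit(self, seq):
        if not self.online:
            return
        for logged_seq, msg in self.log:
            if logged_seq == seq:
                self.delivered.append(msg)


class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []
        self.delivered = []
        self.counter = 0

    def quorum(self):
        # Majority of the whole ensemble: the followers plus the leader.
        return (len(self.followers) + 1) // 2 + 1

    def broadcast(self, msg):
        self.counter += 1
        seq = self.counter
        self.log.append((seq, msg))
        # Assumption: the leader's own logged proposal counts as one ACK.
        acks = 1 + sum(f.on_proposal(seq, msg) for f in self.followers)
        if acks < self.quorum():
            return False     # no quorum, so no COMMIT is sent
        self.delivered.append(msg)
        for f in self.followers:
            f.on_commit(seq)
        return True
```

With four followers (an ensemble of five), the quorum is three, so a broadcast still commits when two followers are offline and fails when three are. Note that there is no abort path: a proposal without a quorum is simply never committed.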

However, this simplified two-phase commit cannot handle leader failure on its own, so a recovery mode is added. Two problems must be solved when the leader changes.

Never forget delivered messages

The leader commits a transaction locally but crashes before the COMMIT reaches any follower. The new leader must ensure that this transaction is still committed.

Let go of messages that are skipped

The leader generates a proposal but crashes before any follower sees it. That proposal must be discarded when the server comes back.

Before proposing any new message, the new leader must ensure that all messages in its transaction log have been proposed and committed.

To make sure that followers have seen every proposal and every committed message, the leader sends each follower the proposals that follower has not seen, followed by commits for those proposals up to the last committed message.

Because proposals are stored in the followers' transaction logs in a guaranteed order, the order of commits is deterministic as well. This solves the first problem.

When the old leader restarts with a proposal that it never sent out, the new leader tells it to truncate its transaction log back to the last committed position of the epoch that follower belongs to. This solves the second problem.
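The two recovery rules can be sketched as one synchronization step between the new leader and a reconnecting server. This is a simplified model, not ZooKeeper's actual sync code: the function name sync_follower is hypothetical, log entries are (epoch, counter, message) tuples chosen for the example, and the new leader's log is assumed to be the authoritative history.

```python
def sync_follower(leader_log, follower_log):
    """Return (entries the follower keeps, entries the leader must send).

    Assumption: both logs are lists of (epoch, counter, message) tuples,
    and the new leader's log is the authoritative history.
    """
    # Length of the prefix on which both logs agree.
    common = 0
    for leader_entry, follower_entry in zip(leader_log, follower_log):
        if leader_entry != follower_entry:
            break
        common += 1
    # Entries past the common prefix were never seen by the new leader:
    # the follower truncates them ("let go of skipped messages").
    kept = follower_log[:common]
    # Entries the follower lacks are re-sent and then committed
    # ("never forget delivered messages").
    missing = leader_log[common:]
    return kept, missing
```

For a restarted old leader whose log ends with an unsent proposal (1, 3, "c"), the step drops that entry and returns the new leader's later entries as the ones to send. For a follower that merely lags behind, nothing is truncated.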

For a detailed proof of Zab, see "Zab: High-performance broadcast for primary-backup systems".

Reference documents:

http://zookeeper.apache.org/doc/r3.4.5/zookeeperOver.html

A simple totally ordered broadcast protocol

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
