ZAB protocol for ZooKeeper

Source: Internet
Author: User

ZooKeeper is a highly available coordination framework for distributed systems, and it implements its own consistency algorithm: ZAB (ZooKeeper Atomic Broadcast). ZAB is an atomic message broadcast protocol that borrows from the Paxos algorithm, with extensions and modifications, and it supports crash recovery. ZooKeeper uses a single master process, the Leader, to handle all client transaction requests, and ZAB broadcasts the resulting server state changes to all Followers in the form of transactions. Because transactions can depend on one another, ZAB guarantees that the changes broadcast by the Leader are applied in order: if a state change is applied, every change it depends on has already been applied. The crash recovery supported by ZAB ensures that a new Leader is elected when the Leader process crashes and that data integrity is preserved.
All transaction requests in ZooKeeper are processed by one master server, the Leader; the other servers are Followers. The Leader converts each client transaction request into a transaction Proposal, distributes the Proposal to all Followers in the cluster, and waits for their feedback. Once more than half of the servers in the cluster have acknowledged the Proposal (at least N/2 + 1 of N servers, using integer division), the Leader broadcasts a Commit message to the Followers, which tells them to commit that Proposal.
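
The majority rule above can be sketched in a few lines of Python. This is a minimal illustration, not ZooKeeper's code: the function names quorum_size and can_commit are hypothetical, and the sketch assumes that N counts every voting server, with the Leader's own acknowledgment included in the count.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority of the ensemble."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be positive")
    # Integer division: 5 servers -> 3, 4 servers -> 3, 3 servers -> 2.
    return ensemble_size // 2 + 1


def can_commit(ack_count: int, ensemble_size: int) -> bool:
    """True once enough servers have acknowledged a Proposal.

    Assumption of this sketch: ack_count includes the Leader's own ack.
    """
    return ack_count >= quorum_size(ensemble_size)
```

With five servers, can_commit(3, 5) is true and can_commit(2, 5) is false. A four-server cluster also needs three acknowledgments, which is why an even number of servers adds no extra fault tolerance.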

Protocol States

Every node running the ZAB protocol is in one of the following three states:
1. Looking: the election state, which a node enters after the system starts or after the Leader crashes;
2. Following: the state of a Follower node, which keeps its data synchronized with the Leader;
3. Leading: the state of the Leader node, which acts as the main process of the current cluster;

When ZooKeeper starts, every node is initially in the Looking state. The cluster then tries to elect a Leader, and the elected node switches to the Leading state. When a node finds that a Leader has already been elected in the cluster, it switches to the Following state and stays synchronized with the Leader. When a Follower loses contact with the Leader, it switches back to the Looking state and starts a new round of elections. Over the lifetime of a ZooKeeper cluster, each node keeps moving between the Looking, Following, and Leading states.
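
The state changes described above can be written as a small transition table. This is a sketch under assumptions: the event names (ELECTED_SELF, FOUND_LEADER, LOST_LEADER, LOST_QUORUM) and the function next_state are hypothetical labels chosen for illustration, not identifiers from ZooKeeper.

```python
from enum import Enum


class ServerState(Enum):
    LOOKING = "Looking"
    FOLLOWING = "Following"
    LEADING = "Leading"


class Event(Enum):
    ELECTED_SELF = "this node won the election"
    FOUND_LEADER = "another node was elected Leader"
    LOST_LEADER = "a Follower lost contact with the Leader"
    LOST_QUORUM = "the Leader lost contact with a majority"


# (current state, event) -> next state; hypothetical table for this sketch.
TRANSITIONS = {
    (ServerState.LOOKING, Event.ELECTED_SELF): ServerState.LEADING,
    (ServerState.LOOKING, Event.FOUND_LEADER): ServerState.FOLLOWING,
    (ServerState.FOLLOWING, Event.LOST_LEADER): ServerState.LOOKING,
    (ServerState.LEADING, Event.LOST_QUORUM): ServerState.LOOKING,
}


def next_state(state: ServerState, event: Event) -> ServerState:
    """Return the new state, or the unchanged state if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

Every path out of Following or Leading goes back through Looking, so a node never changes Leader without taking part in a new election.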

[Figure: state transition diagram for the Looking, Following, and Leading states]

After the Leader node is elected, ZAB enters the atomic broadcast phase. The Leader creates an operation queue for each Follower that is synchronized with it. In any given period, a Follower can be synchronized with only one Leader. The Leader and the Followers use heartbeats to detect each other. As long as the Leader receives a heartbeat from a Follower within the timeout period, that Follower stays connected to it. If the Leader does not receive heartbeats from enough Followers to keep a majority of the cluster within the timeout period, or the TCP connections are closed, the Leader ends its leadership for the current period and switches to the Looking state. All Follower nodes then abandon that Leader, switch to the Looking state, and start a new round of elections.
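
The heartbeat check the Leader performs can be sketched as follows. This is an illustrative model only: live_followers and leader_keeps_quorum are hypothetical names, timestamps are plain numbers passed in by the caller, and the sketch assumes the Leader counts itself toward the majority.

```python
from typing import Dict, Set


def live_followers(last_heartbeat: Dict[str, float], now: float, timeout: float) -> Set[str]:
    """Followers whose most recent heartbeat arrived within the timeout."""
    return {sid for sid, seen in last_heartbeat.items() if now - seen <= timeout}


def leader_keeps_quorum(
    last_heartbeat: Dict[str, float], now: float, timeout: float, ensemble_size: int
) -> bool:
    """True while the Leader plus its live Followers form a majority.

    Assumption of this sketch: the Leader counts itself as one live server.
    """
    alive = len(live_followers(last_heartbeat, now, timeout)) + 1
    return alive >= ensemble_size // 2 + 1
```

When leader_keeps_quorum returns false, the Leader in this model would give up leadership and return to the Looking state.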

Phases

The ZAB protocol defines four phases: election, discovery, synchronization, and broadcast. In the ZooKeeper implementation, only a node with the highest lastZXID (last transaction ID) is eligible to become the Leader during the election, so the elected Leader always holds the latest transaction log. For this reason, the ZooKeeper implementation merges discovery and synchronization into a single recovery phase.
1. Election: a Leader is selected among the nodes in the Looking state. The lastZXID of the Leader is always the latest;
2. Discovery: each Follower sends FOLLOWERINFO to the prospective Leader; this message includes the epoch of the Follower's previous period. The Follower then accepts the NEWLEADER command from the prospective Leader and checks that newEpoch is valid. The prospective Leader must ensure that the epoch and ZXID of every Follower are less than or equal to its own;
3. Synchronization: the data of the Followers is synchronized with the Leader. The Leader issues the synchronization commands to keep the cluster data consistent;
4. Broadcast: the Leader broadcasts Proposal and Commit messages, and the Followers accept them;
5. Recovery: the merged discovery and synchronization phase of the ZooKeeper implementation. After the Leader is elected in the election phase, the main task of this phase is to synchronize data with the Leader, which holds the highest ZXID, so that the cluster stays consistent;

Election
The election phase must ensure that the elected Leader has the highest ZXID; otherwise data consistency cannot be guaranteed in the recovery phase. During recovery, the Leader requires the Followers to synchronize their data to its own, and a Follower never requires the Leader to synchronize to it, so every elected Leader must have the latest ZXID.
During the election, the ZXIDs of the nodes are compared, and only a node with the highest ZXID can be elected Leader.
Election process:
1. Each node sends a Vote for a Leader to the other nodes and waits for replies;
2. If a Vote received by a node is larger than its own (that is, it carries a newer ZXID), the node adopts it and updates its own Vote; otherwise it rejects the received Vote;
3. Each node maintains a table of the votes it has seen. When a candidate has received votes from more than half of the nodes, voting ends and that candidate becomes the Leader;
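
The voting rule in the steps above can be sketched as a single-process simulation. It is a simplification under assumptions that are not in the text: every node sees every vote in one all-to-all exchange, and ties between equal ZXIDs are broken in favor of the larger server id. The names Vote, better_vote, and run_election are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# A vote is (last_zxid, server_id). Tuples compare left to right, so the
# ZXID decides first and the server id only breaks ties (sketch assumption).
Vote = Tuple[int, int]


def better_vote(current: Vote, received: Vote) -> Vote:
    """Step 2: adopt the received vote only if it is larger than our own."""
    return received if received > current else current


def run_election(last_zxids: Dict[int, int]) -> Optional[int]:
    """Return the elected server id, or None if no candidate reaches a majority."""
    if not last_zxids:
        return None
    # Step 1: every node starts by voting for itself.
    votes: Dict[int, Vote] = {sid: (zxid, sid) for sid, zxid in last_zxids.items()}
    quorum = len(votes) // 2 + 1
    broadcast = list(votes.values())  # snapshot of the initial votes sent to everyone
    for sid in votes:
        for received in broadcast:
            votes[sid] = better_vote(votes[sid], received)
    # Step 3: tally the votes and check for a majority.
    tally = Counter(candidate for _, candidate in votes.values())
    candidate, count = tally.most_common(1)[0]
    return candidate if count >= quorum else None
```

For example, run_election({1: 5, 2: 7, 3: 6}) selects server 2, because it holds the highest ZXID.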

ZAB uses the ZXID as the transaction ID. A ZXID is a 64-bit number. The low 32 bits are an incrementing counter: for each client transaction request, the Leader increases the counter by 1 when it generates the new transaction. The high 32 bits are the epoch number of the Leader's period. When a new Leader is elected, it takes the ZXID of the latest transaction Proposal in its local log, extracts the epoch from it, adds 1 to obtain the new epoch, and generates new ZXIDs whose low 32 bits start again from 0. ZAB uses the epoch to tell different Leader periods apart.
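
The ZXID layout described above amounts to simple bit arithmetic. The helper names below (make_zxid, epoch_of, counter_of, first_zxid_of_new_epoch) are hypothetical and chosen for illustration; only the 32/32 split comes from the text.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1  # 0xFFFFFFFF


def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch (high 32 bits) and a counter (low 32 bits) into one ZXID."""
    if not 0 <= counter <= COUNTER_MASK:
        raise ValueError("counter does not fit in 32 bits")
    return (epoch << COUNTER_BITS) | counter


def epoch_of(zxid: int) -> int:
    """High 32 bits: the Leader period that created the transaction."""
    return zxid >> COUNTER_BITS


def counter_of(zxid: int) -> int:
    """Low 32 bits: the position of the transaction within its epoch."""
    return zxid & COUNTER_MASK


def first_zxid_of_new_epoch(max_logged_zxid: int) -> int:
    """What a newly elected Leader does: bump the epoch and reset the counter to 0."""
    return make_zxid(epoch_of(max_logged_zxid) + 1, 0)
```

Because the epoch occupies the high bits, any ZXID from a newer epoch compares greater than every ZXID from an older one, so ordinary integer comparison orders transactions across Leader periods.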

Recovery
The Leader elected in the election phase has the latest ZXID. The main task of the recovery phase is to bring the data on the Follower nodes up to date based on the Leader's transaction log.
Leader: the Leader generates a new epoch and ZXID, receives the FOLLOWERINFO message (which contains the lastZXID of the sending node) from each Follower, and then sends NEWLEADER to the Followers. Based on the lastZXID reported by each Follower, the Leader sends it an update command chosen according to the synchronization policy.
Synchronization policy:
1. SNAP: if the Follower's data is too old, the Leader sends the SNAP command and transfers a snapshot to the Follower;
2. DIFF: the Leader sends the DIFF command, covering the transactions from Follower.lastZXID to Leader.lastZXID, so that the Follower can catch up;
3. TRUNC: when Follower.lastZXID is greater than Leader.lastZXID, the Leader sends the TRUNC command so that the Follower discards the log segment between Leader.lastZXID and Follower.lastZXID;
Follower: the Follower sends the FOLLOWERINFO command to the Leader. If the Leader rejects it, the Follower returns to the election phase. If the epoch in the command it receives from the Leader is smaller than its own current epoch, the Follower also returns to the election phase. Otherwise the Follower receives a SNAP, DIFF, or TRUNC command and synchronizes its data up to the given ZXID. After synchronizing successfully, it replies with ACKNEWLEADER and enters the next phase. Once a Follower has synchronized all transactions, the Leader adds it to the list of available Followers.
SNAP and DIFF ensure that committed data is consistent across the Follower nodes in the cluster. TRUNC discards data that was processed but never committed.
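
The choice between the three synchronization commands can be sketched as a single function. The text does not say when a Follower counts as "too old", so this sketch assumes the Leader keeps a window of recent transactions and uses a hypothetical parameter, oldest_retained_zxid, as the cutoff; choose_sync is likewise a hypothetical name.

```python
def choose_sync(follower_last_zxid: int, leader_last_zxid: int, oldest_retained_zxid: int) -> str:
    """Pick the synchronization command for one Follower.

    oldest_retained_zxid is an assumption of this sketch: the oldest
    transaction the Leader can still replay without sending a snapshot.
    """
    if follower_last_zxid > leader_last_zxid:
        # The Follower holds transactions the Leader never saw: discard them.
        return "TRUNC"
    if follower_last_zxid < oldest_retained_zxid:
        # Too far behind to replay individual transactions: send a snapshot.
        return "SNAP"
    # Behind (or equal) but within the window: replay the missing transactions.
    # When the two ZXIDs are equal, the DIFF in this sketch is simply empty.
    return "DIFF"
```

With a Leader at ZXID 100 that retains transactions from ZXID 50 onward, a Follower at 120 gets TRUNC, a Follower at 10 gets SNAP, and a Follower at 70 gets DIFF.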

Broadcast
When a client submits a transaction request, the Leader generates a transaction Proposal for the request and sends it to all Follower nodes in the cluster. After receiving acknowledgments from more than half of the nodes, the Leader commits the transaction. Because ZAB only needs a majority of acknowledgments to commit a transaction, the nodes can hold inconsistent data if the Leader crashes; ZAB relies on crash recovery to resolve this inconsistency. Message broadcasting uses TCP for communication, which preserves the order in which transactions are sent and received. When broadcasting, the Leader assigns a globally increasing ZXID (transaction ID) to each transaction Proposal, and the Proposals are processed in ZXID order.
The Leader allocates a queue for each Follower, places transactions in it in ZXID order, and sends them according to the FIFO rule of the queue. When a Follower receives a transaction Proposal, it writes the transaction to its local disk as a transaction log entry and, once the write succeeds, returns an Ack message to the Leader. After receiving Acks from more than half of the nodes, the Leader commits the transaction and broadcasts a Commit message to all Followers. When a Follower receives the Commit, it commits the transaction as well.
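
The Proposal, Ack, and Commit exchange can be sketched as an in-memory simulation. It is a toy model under stated assumptions: there is no network or disk, the classes Follower and Leader and their methods are hypothetical, an "online" flag stands in for node failures, and the Leader counts its own log write as one Ack.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (zxid, data)


class Follower:
    def __init__(self, online: bool = True) -> None:
        self.online = online
        self.log: List[Entry] = []        # proposals written to the "disk" log
        self.committed: List[Entry] = []  # proposals committed so far

    def on_proposal(self, zxid: int, data: str) -> bool:
        """Log the proposal and return True as the Ack; offline nodes never ack."""
        if not self.online:
            return False
        self.log.append((zxid, data))
        return True

    def on_commit(self, zxid: int) -> None:
        """Commit a previously logged proposal."""
        if self.online:
            self.committed.extend(e for e in self.log if e[0] == zxid)


class Leader:
    def __init__(self, followers: List[Follower], epoch: int) -> None:
        self.followers = followers
        self.epoch = epoch
        self.counter = 0
        self.committed: List[Entry] = []

    def broadcast(self, data: str) -> bool:
        """Propose one transaction; commit it only if a majority acks."""
        self.counter += 1
        zxid = (self.epoch << 32) | self.counter  # epoch high, counter low
        acks = 1  # assumption: the Leader's own log write counts as an Ack
        acks += sum(1 for f in self.followers if f.on_proposal(zxid, data))
        quorum = (len(self.followers) + 1) // 2 + 1
        if acks < quorum:
            return False  # not enough Acks: the proposal stays uncommitted
        self.committed.append((zxid, data))
        for f in self.followers:
            f.on_commit(zxid)
        return True
```

In a cluster of one Leader and four Followers, broadcast succeeds with two Followers offline (three Acks out of five) and fails once a third Follower goes offline. The Followers that logged the failed Proposal keep it uncommitted, which is the kind of leftover data the TRUNC command in the recovery phase is meant to clean up.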
