This article belongs to the Distributed Systems Learning Note series. The previous note covered the Paxos algorithm; this one corresponds to the fourth chapter of the original book and works through ZooKeeper's goals and guarantees and the ZAB protocol.
1. Introduction to ZooKeeper
1.1 ZooKeeper's Consistency Guarantees
ZooKeeper is a typical distributed data consistency solution on which distributed programs can build features such as data publish/subscribe, load balancing, naming services, distributed coordination and notification, cluster management, leader election, distributed locks, distributed queues, and more. ZooKeeper guarantees the following distributed consistency properties.
1. Sequential consistency:
Transaction requests originating from the same client are applied by ZooKeeper in exactly the order in which they were issued.
2. Atomicity:
An update either succeeds on all servers in the cluster or on none of them; there is no intermediate state.
3. Single system image:
No matter which server a client connects to, it sees the same view of the server-side data model.
4. Reliability:
Once an update succeeds, it persists until it is overwritten by a later update.
5. Timeliness:
ZooKeeper only guarantees that a client will be able to read the latest data state from the server within a certain amount of time.
1.2 ZooKeeper Design Goals
1.2.1 Simple data model
ZooKeeper lets distributed programs coordinate with one another through a shared, tree-structured namespace made up of data nodes called znodes. The structure resembles a file system, but ZooKeeper keeps all of the data in memory to increase server throughput and reduce latency.
1.2.2 Building a Cluster
ZooKeeper follows a client/server architecture: the server side provides consistent replication and storage, while the client side builds specific semantics, such as distributed locks, on top of it. In brief: each machine in a ZooKeeper ensemble keeps the current server state in memory and maintains communication with every other machine; as long as more than half of the machines are working properly, the cluster can serve requests. A ZooKeeper client picks any machine in the ensemble and creates a TCP connection to it.
1.2.3 Sequential access
ZooKeeper assigns each update request from a client a globally unique, monotonically increasing number that reflects the order of all transactions; applications can implement higher-level synchronization primitives on top of this ordering.
1.2.4 High performance
Data is stored in memory, which makes ZooKeeper especially suitable for read-dominated workloads.
2. ZooKeeper's ZAB Protocol
2.1 The ZAB Protocol
ZooKeeper is a highly available consistency coordination framework. It does not adopt the Paxos algorithm wholesale; instead it uses ZAB (ZooKeeper Atomic Broadcast) as the core algorithm for data consistency. ZAB is an atomic message broadcast protocol designed specifically for ZooKeeper that supports crash recovery.
Based on ZAB, ZooKeeper implements a primary/backup system architecture to keep the replicas in the cluster consistent. Concretely: a single main process, the leader, handles all client transaction requests and uses ZAB to broadcast each server state change to all followers as a transaction. On one hand, this lets ZooKeeper handle a large number of concurrent client requests well (as I understand it, ZK uses TCP plus a transaction ID to obtain a total order over transactions, and the leader's first-come-first-executed handling resolves causal order). On the other hand, because transactions may depend on one another, ZAB guarantees that the changes broadcast by the leader are processed in sequence, so that by the time a state change is processed, every state it depends on has already been processed. Finally, since the leader process may crash or exit abnormally at any time, ZAB must also be able to re-elect a leader while ensuring the integrity of the data.
The protocol core is as follows:
All transaction requests must be coordinated by a single, globally unique server, the leader; the remaining servers in the cluster are called followers. The leader translates each client request into a transaction proposal and distributes the proposal to all follower servers in the cluster. The leader then waits for follower feedback: once more than half of the followers have responded correctly, the leader distributes a commit message to all followers, instructing them to commit the preceding proposal.
2.2 Protocol Introduction
The book treats this part in more detail; I only summarize the key points. From the protocol core above, we can identify ZAB's two basic modes: message broadcast and crash recovery.
2.2.1 Message Broadcast:
When a client submits a transaction request, the leader node generates a transaction proposal for it, sends it to all follower nodes in the cluster, and once it has received feedback from more than half of the followers it starts to commit the transaction. ZAB is an atomic broadcast protocol, but it only requires ACKs from more than half of the follower nodes before committing; this means a leader crash can leave the replicas with inconsistent data, which ZAB handles with its crash-recovery mode. Message broadcast communicates over TCP, which guarantees the order in which transactions are sent and received. When broadcasting, the leader assigns each transaction proposal a globally increasing ZXID (transaction ID), and proposals are processed in ZXID order.
We introduced the two-phase commit protocol earlier; ZAB simplifies it:
- The interrupt (abort) logic is removed: a follower either ACKs a proposal or abandons the leader.
- The leader does not require all followers to respond successfully; a majority of ACKs is enough.
A follower that receives a proposal writes it to disk and returns an ACK. When the leader has received a majority of ACKs, it broadcasts a commit message and commits the transaction itself. Each follower commits the transaction when it receives the commit message.
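The simplified two-phase flow above can be sketched as a toy, single-threaded model. The `Leader` and `Follower` classes and their method names here are invented for illustration, not ZooKeeper's API: a follower simply persists and ACKs each proposal, and the leader commits as soon as a majority has ACKed.

```python
# A toy, single-threaded model of ZAB's simplified two-phase commit.
# Class and method names are invented for illustration, not ZooKeeper API.

class Follower:
    def __init__(self):
        self.log = []        # proposals persisted to "disk"
        self.committed = []  # transactions committed so far

    def on_proposal(self, zxid, value):
        # Persist the proposal, then ACK (there is no abort branch:
        # a follower either ACKs or abandons the leader entirely).
        self.log.append((zxid, value))
        return True

    def on_commit(self, zxid):
        for z, v in self.log:
            if z == zxid:
                self.committed.append((z, v))

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.committed = []

    def broadcast(self, zxid, value):
        # Phase 1: send the proposal and count ACKs.
        acks = sum(1 for f in self.followers if f.on_proposal(zxid, value))
        # Phase 2: commit as soon as a majority (not all) have ACKed.
        if acks > len(self.followers) // 2:
            self.committed.append((zxid, value))
            for f in self.followers:
                f.on_commit(zxid)
            return True
        return False
```

In this in-memory model every follower that ACKed ends up with the same committed list as the leader, which is exactly the property the majority-ACK rule is meant to preserve.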
2.2.2 Crash Recovery:
The message broadcast above describes ZAB under normal conditions. Once the leader server crashes, or loses contact with more than half of the followers, the protocol enters crash-recovery mode.
During crash recovery, two special cases must be handled to keep data consistent: a proposal that has been committed on the leader must eventually be committed by all followers, and a proposal that has been discarded must be skipped. Both requirements are met if the leader election algorithm guarantees that the newly elected leader holds the transaction proposal with the largest ZXID among all machines in the cluster: such a leader necessarily has every committed proposal, which spares it the work of checking which proposals to commit or discard.
Data synchronization
After leader election completes, and before formally accepting new transaction requests, the leader server must confirm that every proposal in its transaction log has been committed by more than half of the machines in the cluster, i.e. that data synchronization is complete.
1. The leader waits for servers to connect;
2. Each follower connects to the leader and sends it its largest ZXID;
3. The leader determines the synchronization point from the follower's ZXID;
4. After synchronization completes, the leader notifies the follower that it has reached UPTODATE status;
5. Once the follower receives the UPTODATE message, it can again accept client requests.
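The synchronization steps above can be sketched roughly as follows. `synchronize` is a hypothetical helper, not ZooKeeper code: the follower reports its largest ZXID, the leader replays every later proposal, and the follower is then marked UPTODATE.

```python
# Hypothetical sketch of the data synchronization steps: the follower
# reports its largest ZXID, the leader replays everything after that
# point, then the follower is marked UPTODATE.

def synchronize(leader_log, follower_log):
    """Both logs are lists of (zxid, value) sorted by zxid."""
    # Step 2: follower sends its largest zxid (-1 for an empty log).
    last_zxid = follower_log[-1][0] if follower_log else -1
    # Step 3: leader determines the sync point from that zxid.
    missing = [entry for entry in leader_log if entry[0] > last_zxid]
    # Step 4: replay the missing proposals to the follower.
    follower_log.extend(missing)
    # Step 5: the follower may now serve client requests again.
    return follower_log, "UPTODATE"
```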
ZAB uses the ZXID as the transaction number. A ZXID is 64 bits: the low 32 bits are an incrementing counter, which the leader increments by 1 for every new transaction generated from a client request; the high 32 bits are the leader's epoch number. When a new leader node is elected, it takes the ZXID of the largest transaction proposal in its local log, parses out the corresponding epoch, adds 1 to it as the new epoch, and restarts the low 32 bits from 0 when generating new ZXIDs. ZAB uses the epoch to distinguish different leader periods, which prevents different leader servers from erroneously proposing different transactions under the same ZXID and greatly simplifies data recovery.
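The 64-bit ZXID layout can be expressed directly with bit operations. The helpers below (`make_zxid`, `epoch_of`, `counter_of`, `next_epoch_zxid`) are illustrative names of my own, not ZooKeeper API:

```python
# Helpers for the 64-bit ZXID layout: high 32 bits = leader epoch,
# low 32 bits = in-epoch counter. Illustrative names, not ZooKeeper API.

EPOCH_SHIFT = 32
COUNTER_MASK = (1 << 32) - 1

def make_zxid(epoch, counter):
    return (epoch << EPOCH_SHIFT) | counter

def epoch_of(zxid):
    return zxid >> EPOCH_SHIFT

def counter_of(zxid):
    return zxid & COUNTER_MASK

def next_epoch_zxid(last_zxid):
    # On election, the new leader parses the epoch out of its largest
    # local zxid, adds 1, and restarts the counter from 0.
    return make_zxid(epoch_of(last_zxid) + 1, 0)
```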
2.3 Problem Description
Above we introduced ZAB's general content and its two basic modes, message broadcast and crash recovery. The original book also presents ZAB in terms of a system model and a formal problem description; here I only cover the concepts relevant to the problem description.
Main process cycle:
ZooKeeper uses a single main process, the leader, to handle all client transaction requests, uses ZAB to broadcast server state changes as transactions to all followers, and must guarantee the consistency of the transactions broadcast by successive main processes. The main process may start sending state-change messages only after recovery at the ZAB layer is complete. To achieve this, we assume that every process implements a ready(e) call, with which the ZAB layer notifies the application (the main process and all backup replication processes) that ZAB is ready to start broadcasting state changes. The ready call also sets a value for the variable instance, letting the main process determine its instance value. The instance value uniquely identifies the period of the current main process; when broadcasting, the main process sets the epoch field of each transaction ID from the instance value. We assume the value e is unique across all main-process instances, and ZAB guarantees this uniqueness.
Transactions:
A state change that the main process propagates to the backup processes is called a transaction. We assume a function call transaction(v, z) that implements the main process's broadcast of a state change. Each transaction call carries a pair ⟨v, z⟩: the transaction value v and the transaction identifier z (the ZXID). Each identifier z = ⟨e, c⟩ consists of two parts: an epoch e and a counter c. We write epoch(z) for the epoch part of a transaction identifier and counter(z) for its counter part. We say epoch e precedes epoch e' when e < e'. For a given main-process instance Pe, epoch(z) = instance = e, and for each new transaction we increment the counter c. We say transaction identifier z precedes z' when either epoch(z) < epoch(z'), or epoch(z) = epoch(z') and counter(z) < counter(z').
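The precedence rule on transaction identifiers translates directly into code; in this small sketch an identifier is modeled as an (epoch, counter) pair rather than a packed 64-bit integer:

```python
# The precedence rule for transaction identifiers; here an identifier
# is modeled as an (epoch, counter) pair.

def precedes(z, z_prime):
    e, c = z
    e2, c2 = z_prime
    # z precedes z' iff epoch(z) < epoch(z'), or the epochs are equal
    # and counter(z) < counter(z').
    return e < e2 or (e == e2 and c < c2)
```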
2.4 Algorithm Description
Viewed algorithmically, ZAB can be divided into three phases: discovery, synchronization, and broadcast. The discovery and synchronization phases together make up what we earlier called the recovery phase.
Phase one: discovery. Each follower node sends a FOLLOWERINFO message, containing the epoch of its previous period, to the prospective leader, then accepts the prospective leader's NEWLEADER instruction and checks the validity of the NEWEPOCH; the prospective leader ensures that each follower's epoch and ZXID are less than or equal to its own. Detailed steps:
F1.1 Follower F sends the epoch value CEPOCH(F.p) of its last accepted transaction proposal to the prospective leader L.
L1.1 When L has received CEPOCH(F.p) messages from more than half of the followers, it generates a NEWEPOCH(e') message and sends it to that quorum, where e' = max(received epochs) + 1.
F1.2 When a follower receives the NEWEPOCH(e') message from L, if it detects that its current CEPOCH(F.p) is less than e', it assigns e' to CEPOCH(F.p) and sends an ACK message back to L. The message contains the follower's epoch CEPOCH(F.p) and its historical transaction set hF.
L1.2 When L has received ACK messages from more than half of the followers, it selects one follower F from that quorum and uses its history as the initial transaction set Ie'.
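Steps F1.1 and L1.1 amount to a quorum check followed by e' = max(reported epochs) + 1; a minimal sketch (the function name and error handling are my own):

```python
# Minimal sketch of discovery steps F1.1/L1.1: the prospective leader
# waits for a quorum of CEPOCH messages, then proposes
# e' = max(reported epochs) + 1.

def discover_epoch(reported_epochs, ensemble_size):
    # A quorum is strictly more than half of the ensemble.
    if len(reported_epochs) <= ensemble_size // 2:
        raise RuntimeError("no quorum of CEPOCH messages yet")
    return max(reported_epochs) + 1
```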
Phase two: synchronization
After the discovery phase completes, the protocol enters the synchronization phase, in which the followers' data is synchronized with the leader's. The synchronization instructions are initiated by the leader, keeping the cluster's data consistent throughout. Detailed steps:
L2.1 Leader L sends e' and Ie' to all quorum followers in a NEWLEADER(e', Ie') message.
F2.1 When a follower receives the NEWLEADER(e', Ie') message from leader L: if it finds that its own CEPOCH(F.p) ≠ e', it directly enters the next round of the loop; if CEPOCH(F.p) = e', the follower applies the transactions and finally sends feedback to leader L.
L2.2 When leader L has received NEWLEADER(e', Ie') feedback from more than half of the followers, it sends a COMMIT message to all followers.
F2.2 When a follower receives the COMMIT message from the leader, it processes and commits, in order, all transactions in Ie' that it has not yet processed.
Phase three: broadcast
After the synchronization phase completes, ZAB can formally accept new client transaction requests and carry out the message broadcast process.
When the leader receives a request, it generates a proposal and then performs a two-phase commit. The leader node maintains a queue for each follower node, places transactions into the queues in ZXID order, and sends them according to the queues' FIFO discipline. When a follower node receives a transaction proposal, it writes the transaction to its local disk as a transaction log entry and, on success, sends an ACK message back to the leader. Once the leader has received ACKs from more than half of the follower nodes, it commits the transaction and broadcasts a COMMIT message to all followers; each follower begins to commit the transaction when it receives the COMMIT.
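The per-follower FIFO queues described above can be sketched with `collections.deque`; the class and method names are invented for illustration:

```python
# Sketch of the per-follower FIFO queues: proposals are enqueued in
# zxid order and drained first-in-first-out, so each follower sees
# transactions in exactly the leader's order. Names are illustrative.
from collections import deque

class BroadcastLeader:
    def __init__(self, follower_ids):
        self.queues = {fid: deque() for fid in follower_ids}

    def propose(self, zxid, value):
        # Append the proposal to every follower's queue in zxid order.
        for q in self.queues.values():
            q.append((zxid, value))

    def drain(self, fid):
        """Deliver all pending proposals to one follower, FIFO."""
        delivered = []
        q = self.queues[fid]
        while q:
            delivered.append(q.popleft())
        return delivered
```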
For the broadcast flow itself, see section 2.2 above; it is not expanded here. The figure in the book shows the messages exchanged between processes in each phase:
CEPOCH = Follower sends its last promised epoch to the prospective leader
NEWEPOCH = Leader proposes a new epoch e'
ACK-E = Follower acknowledges the new epoch proposal
NEWLEADER = Prospective leader proposes itself as the new leader of epoch e'
ACK-LD = Follower acknowledges the new leader proposal
COMMIT-LD = Commit the new leader proposal
PROPOSE = Leader proposes a new transaction
ACK = Follower acknowledges the leader's proposal
COMMIT = Leader commits a proposal
The book only introduces the three phases; here I add the election. There are several election methods; I cover only the default, fast leader election:
Restrictions:
As noted under crash recovery, to guarantee data consistency the election phase must ensure that the chosen leader holds the maximum ZXID; the implicit consequence is that it has seen all historically committed transactions.
So how does fast leader election choose the leader with the highest lastZxid?
Election process:
1. Each node sends a vote for its proposed leader to the other nodes and waits for replies;
2. When a node receives a vote, if that vote is larger than its own (its ZXID is newer), it updates its own vote; otherwise it rejects the vote;
3. Each node maintains a voting record table; when some node has received more than half of the votes, the election ends, that node becomes the leader, and voting is closed.
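A toy sketch of the vote rule above: a vote is adopted if it carries a larger ZXID (with server id as tie-breaker, as in ZooKeeper's default election); the majority bookkeeping is elided here, and all names are my own.

```python
# Toy sketch of fast leader election's vote rule: each vote is
# (last_zxid, server_id), and a node adopts any vote larger than its
# own, so the node with the largest zxid wins (server id breaks ties).

def better_vote(mine, theirs):
    # Python compares tuples lexicographically: zxid first, then id.
    return max(mine, theirs)

def elect(votes):
    # Every node repeatedly adopts the better vote it sees; with a full
    # exchange all nodes converge on the maximum, shortcut here.
    winner = votes[0]
    for v in votes[1:]:
        winner = better_vote(winner, v)
    # A real election also waits until a majority agrees on the winner.
    return winner
```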
2.5 Runtime Analysis:
Each node running the ZAB protocol is in one of three states:
1. LOOKING: the election state, entered at system startup or after the leader crashes;
2. FOLLOWING: the follower state, in which the follower synchronizes data with the leader;
3. LEADING: the leader state, meaning the cluster currently has a leader main process.
When ZooKeeper starts, every node's initial state is LOOKING, and the cluster tries to elect a leader node; the elected leader switches to the LEADING state. When a node discovers that a leader has already been elected in the cluster, it switches to the FOLLOWING state and synchronizes with the leader node; when a follower node loses contact with the leader, it switches back to the LOOKING state and a new round of election begins. Over the whole life cycle of ZooKeeper, each node transitions among the LOOKING, FOLLOWING, and LEADING states.
After a leader node is elected, ZAB enters the atomic broadcast phase, where the leader creates an operation sequence for each follower synchronized with it; within one period a follower synchronizes with only one leader. Leader and follower nodes use heartbeat detection to sense each other's presence. If the leader receives a follower's heartbeat within the timeout period, that follower stays connected; if the leader fails to receive heartbeats (or TCP disconnects) from more than half of the follower nodes within the timeout, it ends its current leader period and switches to the LOOKING state, all follower nodes likewise abandon the leader and switch to LOOKING, and a new round of election begins.
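The state transitions described in this section can be summarized as a small state machine. This is a simplification (real ZooKeeper, for instance, also has an OBSERVING state for observer nodes), and the event names are my own labels for the triggers described above:

```python
# A small state machine for the three node states and the transitions
# described above. Event names are illustrative labels.
from enum import Enum

class State(Enum):
    LOOKING = 1
    FOLLOWING = 2
    LEADING = 3

TRANSITIONS = {
    (State.LOOKING, "won_election"): State.LEADING,
    (State.LOOKING, "leader_found"): State.FOLLOWING,
    (State.FOLLOWING, "leader_lost"): State.LOOKING,
    (State.LEADING, "quorum_lost"): State.LOOKING,
}

def on_event(state, event):
    # Unknown (state, event) pairs leave the state unchanged.
    return TRANSITIONS.get((state, event), state)
```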
3. Differences and Connections between ZAB and Paxos
The author argues that ZAB is not a typical implementation of Paxos; rather, the two have different design goals.
Connections:
- Both have a role similar to a leader process, responsible for coordinating the operation of multiple follower processes.
- In both, the leader waits for correct feedback from more than half of the followers before committing a proposal.
Differences: ZAB additionally introduces a synchronization phase. The Paxos algorithm does not care about the logical (causal) order between requests, only the total order over the data; in practice few systems use Paxos directly without simplification or improvement. ZAB can be regarded as a simplification of the Paxos algorithm.
*******************************************************************************************
Reference:
http://www.solinx.co/archives/435?utm_source=tuicool&utm_medium=referral
http://my.oschina.net/zhengyang841117/blog/186676
"From Paxos to ZooKeeper: Distributed Consistency Principles and Practice", reading notes on the ZAB protocol