Zab: ZooKeeper's atomic broadcast protocol, which ensures that messages are delivered to all replicas in the same order.
ZooKeeper uses a protocol called Zab (ZooKeeper Atomic Broadcast) as the core of its consistent replication. Zab is designed for high throughput, low latency, robustness, and simplicity; scalability is not one of its requirements.
ZooKeeper's implementation consists of a client and a server. The server provides consistent replication and storage, while the client provides specific semantics built on top of it, such as distributed locks, election algorithms, and distributed mutexes. In terms of content, the server stores data state rather than bulk data, so ZooKeeper can be used as a small file system. Because this state is relatively small, it can be loaded entirely into memory, which greatly reduces access latency.
A server can be restarted after a crash. For fault tolerance, the server must "remember" its previous data state, so the data must be persisted. At very high throughput, however, disk I/O becomes the bottleneck. The solution is to buffer writes so that random writes become sequential writes.
Because ZooKeeper mainly operates on data state, it defines two safety properties to ensure state consistency:
- Total order: if message A is delivered before message B on one server, every server delivers A before B, so all servers see the same result.
- Causal order: if message A happens before message B (A causes B) and both are sent, A is always executed before B.
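The total order property can be checked mechanically. The sketch below is only an illustration, not ZooKeeper code: the function name `same_delivery_order` and the sample logs are hypothetical, and it assumes each server records the messages it has delivered in a list. A slower server may be behind, so each log only has to be a prefix of the longest one.

```python
def same_delivery_order(logs):
    """Return True if every server's log is a prefix of the longest log.

    `logs` must be a non-empty list of per-server delivery logs.
    """
    longest = max(logs, key=len)
    return all(log == longest[:len(log)] for log in logs)


# Hypothetical delivery logs from three servers; the second one lags behind.
consistent = [["A", "B", "C"], ["A", "B"], ["A", "B", "C"]]

# Two servers that delivered the same messages in different orders.
diverged = [["A", "B"], ["B", "A"]]

print(same_delivery_order(consistent))
print(same_delivery_order(diverged))
```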
To guarantee these two safety properties, ZooKeeper relies on TCP and a leader. TCP delivers messages on a connection in the order they were sent (first in, first out), which is used to preserve message order. The leader solves the causal order problem: whatever reaches the leader first is executed first. With a leader, the ZooKeeper architecture becomes master-slave, but in this mode the master (the leader) may crash. ZooKeeper therefore introduces a leader election algorithm to keep the system robust. To sum up, ZooKeeper's protocol has two phases:
- Atomic Broadcast
- Leader Election
1. Atomic Broadcast
Only one leader node exists at any given time. All other nodes are called followers. For an update request, if the client is connected to the leader, the leader executes the request; if the client is connected to a follower, the request must be forwarded to the leader for execution. Read requests, however, can be served directly by a follower. To read the latest data, the client needs to read from the leader. The read/write ratio of ZooKeeper is.
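The routing rule described above can be sketched as follows. This is a toy model under stated assumptions, not the ZooKeeper API: the `Node` class and its methods are hypothetical, and replication from leader to followers is deliberately left out so that the stale follower read is visible.

```python
class Node:
    """Toy node: writes go to the leader, reads are served locally."""

    def __init__(self, name, leader=None):
        self.name = name
        # A node created without a leader is treated as the leader itself.
        self.leader = leader if leader is not None else self
        self.data = {}

    def is_leader(self):
        return self.leader is self

    def write(self, key, value):
        if not self.is_leader():
            # Followers never execute updates; they forward them.
            return self.leader.write(key, value)
        self.data[key] = value
        return self.name  # name of the node that executed the update

    def read(self, key):
        # Local read: may be stale on a follower.
        return self.data.get(key)


leader = Node("leader")
follower = Node("follower", leader)

executed_by = follower.write("/x", 1)
fresh = leader.read("/x")
stale = follower.read("/x")  # replication is omitted in this sketch
```

Here `executed_by` is the leader's name even though the client contacted the follower, and the follower's local read returns nothing until the update has been replicated to it.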
The leader sends each request to the followers using a simplified version of two-phase commit (2PC), which differs from classic two-phase commit in two obvious ways:
- Because there is only one leader, a request the leader proposes to a follower is always accepted (no other leader interferes).
- Not every follower has to respond successfully; a majority is enough.
In general, a cluster of 2f + 1 nodes tolerates the failure of f nodes. Because any two majorities intersect, when the leader changes, the nodes in the intersection can be used to obtain the latest state of the system. If no majority exists (fewer than f + 1 nodes are alive), the protocol cannot proceed. There is one special case, however:
Suppose there are three nodes A, B, and C, with A as the leader. If B crashes, A and C can keep working normally, because A is the leader and A and C still form a majority. If A then crashes as well, no majority can be formed for leader election.
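The majority arithmetic is easy to work through in code. The helper names below are hypothetical and the sketch only restates the 2f + 1 rule from the text:

```python
def quorum_size(cluster_size):
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """Largest f such that the cluster has at least 2f + 1 nodes."""
    return (cluster_size - 1) // 2


def has_quorum(cluster_size, alive):
    return alive >= quorum_size(cluster_size)


# The three-node example: A, B, C.
after_b_crashes = has_quorum(3, alive=2)   # A and C remain
after_a_crashes = has_quorum(3, alive=1)   # only C remains
```

Note that a four-node cluster needs three nodes for a majority and so still tolerates only one failure, which is why odd cluster sizes are the usual choice.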
2. Leader Election
Leader election relies mainly on the Paxos algorithm. The biggest problem it faces is the interaction between old and new leaders: should the new leader continue from the old leader's state? The answer depends on when the old leader crashed, so we distinguish the following cases:
1. The old leader crashes before commit (the request has only been written locally)
2. The old leader crashes after commit, but only some followers have received the commit request
In the first case, only the old leader knows about the data. After the old leader restarts, it must synchronize with the new leader and delete that data from its local storage so that its state matches the rest of the cluster.
In the second case, the new leader should be able to obtain the latest data committed by the old leader from a majority. After the old leader restarts, it may still believe it is the leader and continue sending its unfinished requests; with two leaders active at the same time, the protocol would break. The solution is to include leader information in the ID of every message. ZooKeeper calls this ID the zxid. It is a 64-bit number: the high 32 bits identify the leader and are known as the epoch, which is incremented every time the leader changes; the low 32 bits are the message counter, which each new leader restarts from 0. With the zxid, a follower can easily tell whether a request comes from an old leader and reject it.
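The zxid layout described above can be sketched with plain bit operations. The function names are hypothetical and the acceptance rule is a simplification for illustration; it is not ZooKeeper's actual implementation.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch, counter):
    """Pack the epoch into the high 32 bits and the counter into the low 32."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def epoch_of(zxid):
    return zxid >> COUNTER_BITS


def counter_of(zxid):
    return zxid & COUNTER_MASK


def accepts(current_epoch, zxid):
    """Simplified follower rule: reject anything from an older epoch."""
    return epoch_of(zxid) >= current_epoch


old_leader_request = make_zxid(epoch=1, counter=5)
new_leader_request = make_zxid(epoch=2, counter=0)
```

Because the epoch occupies the high bits, any zxid from a newer epoch compares greater than every zxid from an older one, even though the counter restarts from 0.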
Because the old leader has to delete data (case 1), ZooKeeper's data store must support compensating (undo) operations, which requires keeping a log, as a database does.
3. Zab and Paxos
The authors of Zab consider Zab and Paxos to be different protocols. Zab can be seen as a simplified form of Paxos, but Paxos by itself cannot guarantee total order.
The key point is that the consistency Paxos provides does not meet ZooKeeper's requirements. For example, suppose the leader of a Paxos system is initially P1, and it initiates two transactions, <T1, V1> (the transaction with sequence number T1 writes the value V1) and <T2, V2>. The leader then changes to P2, which initiates a transaction <T1, V1'>. Later a new leader, P3, collects the proposals and arrives at the final execution sequence <T1, V1'>, <T2, V2>: T1 from P2 comes first, followed by T2 from P1.
Why this does not meet ZooKeeper's requirements:
ZooKeeper's data model is a tree, and many operations must pass a check before they can execute. For example, transaction T1 of P1 might create the node "/A" and T2 might create the node "/A/AA"; the child "/A/AA" can only be created after the parent "/A" exists. The transaction T1 initiated by P2 might instead create "/B". The sequence produced by P3 is then: create "/B", then create "/A/AA". Because "/A" has not been created, creating "/A/AA" fails.
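The failure is easy to reproduce with a toy tree. The function `apply_creates` below is hypothetical and models only the "parent must exist" check, nothing else about ZooKeeper's data model:

```python
def apply_creates(paths):
    """Apply create operations in order; a create fails if the parent is missing.

    Returns the resulting set of nodes and the list of paths that failed.
    """
    tree = {"/"}
    failed = []
    for path in paths:
        parent = path.rsplit("/", 1)[0] or "/"
        if parent in tree:
            tree.add(path)
        else:
            failed.append(path)
    return tree, failed


# Order intended by P1: the parent is created before the child.
_, failed_p1_order = apply_creates(["/A", "/A/AA"])

# Order produced by P3: P2's T1 ("/B") followed by P1's T2 ("/A/AA").
_, failed_p3_order = apply_creates(["/B", "/A/AA"])
```

With P1's order nothing fails, while with P3's order the create of "/A/AA" is rejected because "/A" was never created.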
Solution:
To avoid this, Zab must guarantee that transactions initiated by the same leader are applied in order, and that a newly elected leader can initiate transactions only after all transactions of the previous leader have been applied.
Zab's core idea is that only one node is the leader at any time. All update transactions are initiated by the leader and propagated to every replica (the followers) using a two-phase commit protocol. As soon as a majority of nodes have prepared a transaction successfully, they are told to commit it. Each follower must apply transactions in the order in which the leader asked it to prepare them. Because transactions processed by Zab are never rolled back, Zab's 2PC is optimized: when several transactions are pending, a commit notification for the largest zxid causes a follower to commit all earlier transactions as well.
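A minimal sketch of the quorum-based prepare step and the cumulative commit, under simplifying assumptions: the `Follower` class and `broadcast` function are hypothetical, zxids are plain integers, every reachable follower acknowledges, and networking and persistence are omitted.

```python
class Follower:
    def __init__(self):
        self.log = []        # (zxid, txn) pairs in the order they were proposed
        self.committed = -1  # largest zxid committed so far (-1 means none)

    def propose(self, zxid, txn):
        self.log.append((zxid, txn))
        return True  # acknowledgement

    def commit(self, zxid):
        # Committing zxid implicitly commits every earlier proposal too.
        self.committed = max(self.committed, zxid)

    def applied(self):
        return [txn for zxid, txn in self.log if zxid <= self.committed]


def broadcast(followers, reachable, zxid, txn):
    """Propose to reachable followers; commit once a majority has acked."""
    acks = 1  # the leader counts as one acknowledgement
    for i in reachable:
        if followers[i].propose(zxid, txn):
            acks += 1
    cluster_size = len(followers) + 1
    if acks < cluster_size // 2 + 1:
        return False
    for i in reachable:
        followers[i].commit(zxid)
    return True


followers = [Follower() for _ in range(4)]  # plus the leader: 5 nodes
ok = broadcast(followers, reachable=[0, 1], zxid=1, txn="create /A")
not_ok = broadcast(followers, reachable=[3], zxid=2, txn="create /B")

# Cumulative commit: one notification for zxid 2 covers zxid 1 as well.
f = Follower()
for zxid, txn in [(1, "a"), (2, "b"), (3, "c")]:
    f.propose(zxid, txn)
f.commit(2)
```

In this run the first broadcast reaches three of five nodes and commits, the second reaches only two and does not, and `f.applied()` contains the first two transactions in proposal order.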
There are several key points:
1. The leader and its followers detect failures through heartbeats;
2. A node that detects a failure and tries to become the new leader must first win the support of a majority of nodes and then synchronize transactions from the node with the most recent state; only after that synchronization completes does it officially become the leader and start initiating transactions;
3. Old and new leaders are told apart by an epoch that only ever increases;
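The synchronization step in point 2 can be sketched as picking, among the nodes that answered, the one with the largest last zxid. The function `pick_sync_source` is hypothetical, and the sketch assumes that the largest zxid identifies the most recent state; the real election protocol involves more than this comparison.

```python
def pick_sync_source(last_zxids, cluster_size):
    """Pick the node to synchronize from, or None if there is no majority.

    `last_zxids` maps each responding node's name to the last zxid in its log.
    """
    if len(last_zxids) < cluster_size // 2 + 1:
        return None
    return max(last_zxids, key=last_zxids.get)


# Hypothetical responses: zxids packed as (epoch << 32) | counter.
responses = {"B": (1 << 32) | 5, "C": (2 << 32) | 1}

source = pick_sync_source(responses, cluster_size=3)
no_majority = pick_sync_source({"C": (2 << 32) | 1}, cluster_size=3)
```

Node C is chosen even though its counter is smaller, because its epoch is newer; with a single response out of three nodes there is no majority and nothing is chosen.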
4. End
This article analyzes ZooKeeper only from the perspective of protocols and algorithms, not its source code. Because ZooKeeper changes between versions, the scenarios described here may not correspond exactly to the current implementation.
This article will continue to be improved and updated.