ZooKeeper keeps an in-memory database that is organized as a tree. Each node of the tree is called a znode (the related code is in DataTree.java and DataNode.java).
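To make the tree-of-znodes idea concrete, here is a minimal sketch in Python. It is not how DataTree.java is written; the class names ZNode and ZNodeTree and all of their methods are hypothetical and only illustrate a tree addressed by slash-separated paths.

```python
class ZNode:
    """One node of the in-memory tree: a data payload plus named children."""

    def __init__(self, data=b""):
        self.data = data
        self.children = {}  # child name -> ZNode


class ZNodeTree:
    """Toy tree addressed by slash-separated paths such as "/app/config"."""

    def __init__(self):
        self.root = ZNode()

    def _walk(self, path):
        # Follow each non-empty path component down from the root.
        node = self.root
        for name in filter(None, path.split("/")):
            node = node.children[name]  # KeyError if the znode is missing
        return node

    def create(self, path, data=b""):
        parent_path, _, name = path.rpartition("/")
        if not name:
            raise ValueError("cannot create the root znode")
        parent = self._walk(parent_path)
        if name in parent.children:
            raise ValueError(f"znode already exists: {path}")
        parent.children[name] = ZNode(data)

    def get(self, path):
        return self._walk(path).data

    def list_children(self, path):
        return sorted(self._walk(path).children)


tree = ZNodeTree()
tree.create("/app")
tree.create("/app/config", b"v1")
```

After these calls, tree.get("/app/config") returns the stored bytes and tree.list_children("/app") lists the single child "config".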
A client can connect to any server in the ZooKeeper cluster.
Read requests are answered directly from the znode data held locally by that server. Write requests are converted into transactions and forwarded to the cluster's leader for processing. ZooKeeper's transaction commit guarantees that a write (update) is applied consistently on all machines in the cluster.
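The split between local reads and leader-handled writes can be sketched as follows. This is a toy model under stated assumptions: Server, make_ensemble and their fields are hypothetical names, the first server is arbitrarily made the leader, and the leader applies a write on every server in one step as a stand-in for the real commit protocol described later.

```python
class Server:
    """Toy ensemble member holding its own replica of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}      # path -> value, the local replica
        self.leader = None  # filled in by make_ensemble
        self.ensemble = []  # all servers, filled in by make_ensemble

    def read(self, path):
        # Reads never leave this server: answer from the local replica.
        return self.data.get(path)

    def write(self, path, value):
        # A non-leader forwards the write to the leader.
        if self.leader is not self:
            return self.leader.write(path, value)
        # Stand-in for the commit protocol: apply the update everywhere.
        for server in self.ensemble:
            server.data[path] = value


def make_ensemble(names):
    """Wire up servers; the first one is (arbitrarily) the leader."""
    servers = [Server(n) for n in names]
    for s in servers:
        s.leader = servers[0]
        s.ensemble = servers
    return servers


s1, s2, s3 = make_ensemble(["s1", "s2", "s3"])
s2.write("/x", "1")  # forwarded to s1, then applied on every server
```

A client connected to s3 can then read "/x" locally without contacting the leader.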
The protocol ZooKeeper uses to commit transactions is not Paxos but Zab (ZooKeeper Atomic Broadcast), a protocol adapted from two-phase commit.
Zab guarantees the following properties:
Reliable delivery: If a message m is delivered (committed) by one server, then m will eventually be delivered by all servers.
Total order: If one server delivers message a before message b, then every server that delivers both a and b delivers a before b.
Causal order: For two delivered messages a and b, if a causally precedes b, then a must be delivered before b.
Causal precedence in the third property covers two cases: two messages sent by the same sender, where a was sent before b; and a message a sent by a previous leader, which precedes any message sent by the current leader.
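The total order property can be stated as a small check over delivery histories. The function below is only an illustration of the property, not part of Zab; the name violates_total_order and the input shape (server name mapped to a list of message ids in delivery order, each id appearing at most once per server) are assumptions made for this sketch.

```python
from itertools import combinations


def violates_total_order(deliveries):
    """Return True if two servers delivered a common pair in opposite orders.

    deliveries maps a server name to the list of message ids it delivered,
    in delivery order.
    """
    for log_a, log_b in combinations(deliveries.values(), 2):
        position_in_b = {msg: i for i, msg in enumerate(log_b)}
        # Messages both servers delivered, listed in log_a's order.
        common = [msg for msg in log_a if msg in position_in_b]
        positions = [position_in_b[msg] for msg in common]
        # If log_b agrees on the order, these positions are increasing.
        if positions != sorted(positions):
            return True
    return False


ok = violates_total_order({"s1": ["a", "b", "c"], "s2": ["a", "c"]})
bad = violates_total_order({"s1": ["a", "b"], "s2": ["b", "a"]})
```

Here ok is False, because s2 merely lags behind s1, while bad is True, because the two servers disagree on the order of a and b.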
A server in the Zab protocol operates in one of two modes: broadcast mode or recovery mode (entered when the leader fails or no longer has a quorum of followers).
Before a leader starts broadcasting, it must have a quorum (majority) of followers that are synchronized with it.
A server that comes back online while a leader is active starts in recovery mode and synchronizes with the leader.
Broadcast mode uses a two-phase commit, simplified so that no abort is needed. A follower either ACKs a proposal or abandons the leader, because ZooKeeper guarantees that there is only one leader at a time. The leader also does not need to wait for an ACK from every server; an answer from a quorum is enough.
A follower that receives a proposal writes it to disk (batching writes where it can) and returns an ACK.
Once the leader has received ACKs from a quorum, it broadcasts a COMMIT message and delivers the message itself.
A follower delivers the message after it receives the COMMIT.
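The three broadcast steps above can be sketched as a single-process simulation. Everything here is a simplification under stated assumptions: the Follower and Leader classes and the reachable flag are hypothetical, message passing is replaced by direct method calls, "disk" is a Python list, and the leader's own log write is counted toward the majority of the whole ensemble.

```python
class Follower:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable  # False simulates a follower that never answers
        self.log = []        # proposals written to the (simulated) disk log
        self.delivered = []  # messages delivered after a COMMIT

    def on_proposal(self, proposal):
        if not self.reachable:
            return False  # no ACK; note that there is no abort reply
        self.log.append(proposal)
        return True  # ACK

    def on_commit(self, proposal):
        if self.reachable:
            self.delivered.append(proposal)


class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []
        self.delivered = []

    def broadcast(self, proposal):
        self.log.append(proposal)
        # Assumption: the leader's own log write counts as one ACK.
        acks = 1 + sum(f.on_proposal(proposal) for f in self.followers)
        ensemble_size = len(self.followers) + 1
        if acks * 2 <= ensemble_size:
            return False  # no quorum, so nothing is committed
        for f in self.followers:
            f.on_commit(proposal)
        self.delivered.append(proposal)
        return True


f1 = Follower("f1")
f2 = Follower("f2", reachable=False)
leader = Leader([f1, f2])
committed = leader.broadcast("set /a 1")
```

With one of two followers unreachable, the leader still gathers two ACKs out of three servers, so committed is True and f1 delivers the message while f2 does not.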
However, this simplified two-phase commit cannot handle leader failure on its own, which is why recovery mode exists. Two issues must be addressed when the leader changes.
Never forget delivered messages
The leader went down after committing a transaction itself, but before the COMMIT reached any follower. The new leader must make sure that this transaction is committed as well.
Let go of messages that are skipped
The leader produced a proposal, but went down before any follower saw it. This proposal must be discarded when that server comes back.
Before a new leader proposes any new message, it must make sure that all messages in its transaction log have been proposed and committed.
To ensure that followers see every proposal and know which messages have been committed, the leader sends each follower the proposals that follower has not yet seen, followed by COMMITs for those proposals up to the last committed message.
Because proposals are stored in the followers' transaction logs in a guaranteed order, the order of commits is deterministic as well. This solves the first problem.
When the old leader, whose proposal was never sent out, restarts and reconnects, the new leader tells it to truncate its transaction log back to the last committed proposal of the epoch that its latest proposal belongs to. This solves the second problem.
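The two recovery rules can be illustrated with a small log synchronization sketch. This is not the epoch-based truncation that Zab actually performs: as a stand-in it keeps the longest prefix shared by both logs, which gives the same result in the simple scenario shown. The function name sync_follower and the (epoch, counter) tuples used as proposal ids are assumptions made for this example, and the new leader is assumed to have already committed everything in its own log.

```python
def sync_follower(leader_log, follower_log):
    """Bring follower_log in line with leader_log (modified in place).

    Both logs are lists of (epoch, counter) proposal ids in log order.
    Returns (truncated, missing): the ids dropped from the follower's log,
    and the ids the leader sends as PROPOSAL followed by COMMIT.
    """
    # Length of the prefix on which both logs agree.
    shared = 0
    while (shared < len(leader_log) and shared < len(follower_log)
           and leader_log[shared] == follower_log[shared]):
        shared += 1

    truncated = follower_log[shared:]  # proposals the new leader never saw
    missing = leader_log[shared:]      # proposals the follower never saw

    del follower_log[shared:]          # let go of skipped messages
    follower_log.extend(missing)       # never forget delivered messages
    return truncated, missing


# The old leader logged (1, 3) but crashed before any follower saw it.
old_leader_log = [(1, 1), (1, 2), (1, 3)]
new_leader_log = [(1, 1), (1, 2), (2, 1)]
truncated, missing = sync_follower(new_leader_log, old_leader_log)
```

After the call, the old leader's log has dropped (1, 3), picked up (2, 1), and matches the new leader's log.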
For a detailed proof of Zab, see "Zab: High-performance broadcast for primary-backup systems".
Reference documents:
http://zookeeper.apache.org/doc/r3.4.5/zookeeperOver.html
A Simple totally ordered broadcast protocol