ZooKeeper Consistency Protocol: Zab


ZooKeeper uses a protocol called Zab (ZooKeeper Atomic Broadcast) as the core of its consistent replication. According to its authors, Zab is a new algorithm designed around Yahoo's specific requirements: high throughput, low latency, robustness, and simplicity, though scalability was not a primary goal. The core content of the protocol is described below.

Note that this article discusses only the consistency protocol used by ZooKeeper, not its source-code implementation.

A ZooKeeper deployment consists of clients and servers. The servers provide a consistent replication and storage service; on top of it, clients build specific semantics such as distributed locks, election algorithms, and distributed mutual exclusion. In terms of stored content, the servers store data state rather than bulk data itself, so ZooKeeper can be used as a small file system. Because the data state is small, it can be loaded entirely into memory, largely eliminating access latency.

A server can be restarted after a crash. For fault tolerance, a server must "remember" its previous data state, so data must be persisted. At high throughput, however, disk I/O becomes a bottleneck; the solution is to buffer writes in a log so that random writes become sequential writes.

Considering that ZooKeeper mainly operates on data state, it defines two safety properties to guarantee state consistency:


  • Total order: if message A is sent before message B, every server delivers A and B in that same order.
  • Causal order: if message A happens before message B (A causes B), then A is always executed before B.
To guarantee these two safety properties, ZooKeeper relies on TCP and on a leader. TCP guarantees the total order of messages between a pair of processes (first sent, first delivered), while the leader resolves the causal-order problem: requests that reach the leader first are executed first. With a leader, ZooKeeper becomes a master-slave architecture; but in this mode the master (leader) can crash, so ZooKeeper introduces a leader-election algorithm to keep the system robust. In summary, ZooKeeper works in two phases:
  • Atomic Broadcast
  • Leader Election
1. Atomic Broadcast

At any given time there is exactly one leader; the other nodes are called followers. For an update request, if the client is connected to the leader, the leader executes the request itself; if the client is connected to a follower, the follower forwards the request to the leader for execution. Read requests, by contrast, can be served directly by a follower; a client that needs the very latest data must read from the leader. ZooKeeper targets read-dominated workloads. The leader replicates each update to the followers with a simplified version of two-phase commit, which differs from classical two-phase commit in two obvious ways:
  • Because there is only one leader at a time, a proposal issued by the leader is always accepted by the followers (there is no competing leader to interfere).
  • The leader does not need all followers to respond successfully; an acknowledgment from a majority is enough.
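The quorum-commit idea above can be sketched in a few lines (a toy illustration only, not ZooKeeper's actual implementation; names such as `Leader` and `quorum_size` are invented here):

```python
# Minimal sketch of Zab-style quorum commit (illustrative only).
# The leader proposes an update and commits once a majority of the
# ensemble, counting the leader itself, has acknowledged.

def quorum_size(n_servers: int) -> int:
    """Smallest majority of an ensemble of n_servers nodes."""
    return n_servers // 2 + 1

class Leader:
    def __init__(self, n_servers: int):
        self.n_servers = n_servers

    def broadcast(self, proposal, follower_acks):
        """follower_acks: followers that acknowledged this proposal.
        The leader's own implicit ack counts toward the majority."""
        acks = 1 + len(follower_acks)          # leader acks its own proposal
        if acks >= quorum_size(self.n_servers):
            return "COMMIT"                    # safe: any two majorities intersect
        return "PENDING"                       # keep waiting for more acks

leader = Leader(n_servers=5)                   # 2f + 1 = 5 tolerates f = 2 failures
print(leader.broadcast("create /node", follower_acks=["s2", "s3"]))  # COMMIT (3 of 5)
print(leader.broadcast("create /node", follower_acks=["s2"]))        # PENDING (2 of 5)
```

Note how the quorum rule, not unanimity, decides the commit: with five servers, two slow or crashed followers do not block progress.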
In general, with 2f + 1 nodes the system tolerates the failure of f nodes, because any two majorities intersect: when the leader changes, the nodes in the intersection can be used to recover the latest state of the system. If no majority survives (fewer than f + 1 live nodes), the protocol stops. There is one special case worth noting: with three nodes A, B, and C where A is the leader, if B crashes, A and C keep working normally, because A is still the leader and {A, C} forms a majority. If A then crashes as well, the remaining node cannot form a majority, so no leader can be elected and the system halts.

2. Leader Election

Leader election relies mainly on the Paxos algorithm; the algorithm itself is covered in other posts, so here we consider only the problems a leader change introduces. The biggest problem is the handover between old and new: should the new leader carry on the old leader's state? This depends on when the old leader crashed:
  1. The old leader crashed before the commit, having written the proposal only locally.
  2. The old leader crashed after issuing the commit, and some followers received the commit request.
In the first case, only the old leader knows about the data. When the old leader restarts, it must synchronize with the new leader and delete the uncommitted data from its local store to stay consistent. In the second case, the new leader can recover the latest committed data from a majority. After the old leader restarts, however, it may still believe it is the leader and continue sending its unfinished requests; with two leaders active at once, the protocol would fail. The solution is to embed leader information in the identifier of every message. In ZooKeeper this identifier is called the zxid, a 64-bit number: the high 32 bits carry the leader information, called the epoch, which increases on every leader change; the low 32 bits are the message counter, which restarts from 0 under each new leader. With the zxid, a follower can easily recognize whether a request comes from the old leader and reject it. Because data may be deleted on the old leader (case 1), ZooKeeper's data store must support a compensation operation, which requires database-style logging.

3. Zab and Paxos

The authors of Zab maintain that Zab is not the same as Paxos, and that Paxos was not used because it cannot guarantee total order:
Because multiple leaders can propose a value for a given instance two problems arise. First, proposals can conflict. Paxos uses ballots to detect and resolve conflicting proposals. Second, it is not enough to know that a given instance number has been committed, processes must also be able to figure out which value has been committed.
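Returning to the zxid scheme described in the previous section, its 64-bit layout can be sketched as follows (helper names are invented for illustration and are not ZooKeeper's real API):

```python
# Illustrative sketch of the 64-bit zxid: high 32 bits = epoch (leader
# generation), low 32 bits = per-epoch message counter. Helper names
# are invented for this sketch.

def make_zxid(epoch: int, counter: int) -> int:
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def epoch_of(zxid: int) -> int:
    return zxid >> 32

def counter_of(zxid: int) -> int:
    return zxid & 0xFFFFFFFF

def follower_accepts(current_epoch: int, zxid: int) -> bool:
    """A follower rejects any request stamped with an older epoch;
    this is how messages from a deposed leader are filtered out."""
    return epoch_of(zxid) >= current_epoch

old = make_zxid(epoch=3, counter=17)            # stamped by the old leader
new_epoch = 4                                   # a new leader has been elected
print(follower_accepts(new_epoch, old))                 # False: stale epoch
print(follower_accepts(new_epoch, make_zxid(4, 0)))     # True: counter restarts at 0
```

The epoch check is the whole mechanism: once followers have seen epoch 4, every leftover message from epoch 3 is identifiable and can be rejected.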
The Paxos algorithm is not concerned with the logical order between requests, only with agreeing on a total order of values. In practice, however, Paxos is rarely used directly; it is usually simplified and optimized, and it has several well-known simplified forms. One of them is that, given a stable leader, the protocol can be reduced to a single phase (phase 2). A single-phase protocol requires a strong leader, so the focus of the work becomes the leader election. Taking learner processes into account, a learning phase is also needed. Paxos can thus be simplified to two phases:
  • The original phase 2 (accept)
  • Learn
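These two simplified phases can be sketched as a toy (invented names; with a single strong leader, acceptors accept unconditionally, and learners adopt a value once a majority of acceptors has accepted it):

```python
# Toy sketch of the two simplified phases: a strong leader sends accept
# requests (the old phase 2), then learners learn the value once a
# majority of acceptors has accepted it. Not a full Paxos or Zab
# implementation; names are invented for illustration.

def majority(n: int) -> int:
    return n // 2 + 1

def phase2_accept(live_acceptors, value):
    """With a single strong leader there are no competing proposals,
    so every live acceptor simply accepts the leader's value."""
    return {acceptor_id: value for acceptor_id in live_acceptors}

def learn(accepted, n_acceptors):
    """Learners adopt a value once a majority of acceptors accepted it."""
    if len(accepted) >= majority(n_acceptors):
        return next(iter(accepted.values()))    # the chosen value
    return None                                 # no majority yet: nothing chosen

n = 3
print(learn(phase2_accept(["s1", "s2"], "v"), n))   # "v": 2 of 3 is a majority
print(learn(phase2_accept(["s1"], "v"), n))         # None: no majority
```

The learn step makes the majority condition explicit: a value is visible to learners only after it is durable on a quorum, mirroring Zab's commit rule.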
If a value counts as learned once a majority of learners has learned it, this is essentially the Zab protocol. Paxos emphasizes control of the agreement process and pays little attention to how the decision is propagated; Zab supplements exactly this part. Some have said that all distributed algorithms are simplified Paxos; although that is too absolute a claim, it holds in many cases. Whether the authors of Zab would agree is another question.

4. Conclusion

This article has analyzed ZooKeeper from the perspective of protocols and algorithms rather than source code. Because ZooKeeper versions change, the scenarios described here may not map onto a particular implementation. The article has also tried to show that Zab is a simplified form of Paxos.

[References]
  • A Simple Totally Ordered Broadcast Protocol
  • Paxos

 

From: http://blog.csdn.net/chen77716/article/details/7309915
