Editor's note: This is one of the more popular blog posts about the Raft algorithm. It explains the topic from a fresh angle and is well illustrated, so it is worth a read. Source: Why Raft is a more understandable distributed consistency algorithm
Consistency is one of the most revered problems in the distributed systems world, and research on it goes back decades.
The Byzantine Generals Problem
More than 30 years ago, Leslie Lamport published the Byzantine Generals Problem (see [1]).
Byzantium, located at present-day Istanbul in Turkey, was the capital of the Eastern Roman Empire. Because the Byzantine Empire's territory was vast, its armies were stationed far apart for defensive purposes, and the generals could communicate only by messenger. In wartime, all the generals of the Byzantine army had to reach a consensus on whether there was a chance of winning before attacking the enemy camp. However, the army might contain traitors and enemy spies, whose influence on the generals' decisions could throw the whole army into disorder, so that the outcome of the consensus would not represent the view of the majority. Given that some members are known to be unreliable, how can the remaining loyal generals reach agreement without being swayed by traitors or spies? This is the Byzantine Generals Problem. The Byzantine assumption is a model of the real world, where computers and networks can behave unpredictably because of hardware errors, network congestion or disconnection, and malicious attacks.
Lamport studied this class of problems for years and published a series of papers. Taken together, they answer the following three questions:
- Is there a solution to distributed consistency problems such as the Byzantine Generals Problem?
- If there is a solution, what conditions must be met?
- Given certain preconditions, what is a concrete solution?
Lamport answered the first two questions in the paper "The Byzantine Generals Problem". For the third, the later paper "The Part-Time Parliament" proposed an algorithm named Paxos. That paper relies heavily on mathematical proofs, and I basically could not follow it (I did not even recognize all of the mathematical symbols). Since most readers found it hard to understand, Lamport later wrote another paper, "Paxos Made Simple", which drops the mathematical notation entirely and reasons in plain English. I forced myself to read it through and came away feeling somewhat enlightened, but if you ask whether I understood it, by my own standard I still did not. My standard for understanding an algorithm is clear: I truly understand it only when I can map it into code in my head. After reading that paper I had only a sense of enlightenment, which does not meet that definition.
Lamport may think Paxos is simple, but perhaps that is only true for a mind like his. In practice it remains hard to understand, so Raft grew out of the desire for a more understandable alternative to Paxos. Understandability is one of the algorithm's main goals, as the title of the paper shows: "In Search of an Understandable Consensus Algorithm".
Before getting to the point, I am reminded of an old story that gives an intuitive feel for how much the understandability of a problem depends on the perspective taken.
How understandability differs across perspectives
I vaguely remember that about 20 years ago, when I was in junior high school, I came across an interesting puzzle in a book that was probably called "Divergent Thinking in Mathematics" (I do not remember the title clearly).
Two players, A and B, take turns placing black and white Go stones flat on a round table, one stone per turn. Stones may not overlap, and the first player who has nowhere to place a stone loses.
How can you make sure you win?
The question has two layers. First, is there a strategy that guarantees a win? Second, if there is, how do you prove it? Pause here and think for 10 seconds.
The figure above answers the question: the first player wins. It uses three different ways of thinking.
- Suppose the table is only as large as a single Go stone. The first player places a stone and the second player has nowhere to go.
- Suppose the table is infinitely large and the first player takes the center. Because a circle is symmetric, whenever the opponent can find a place to put a stone, you can always find the symmetric position on the opposite side.
- A circle can be filled with an odd number of mutually tangent small circles of equal diameter.
The three ways of thinking get progressively harder to understand. The first is minimalist thinking: very simple, but not mathematically rigorous. The second is limit thinking; combined with the first, it amounts to mathematical induction and is rigorous. The third is visual thinking that uses geometric concepts, which is hard to follow for people without a basic knowledge of geometry.
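The symmetry argument in the second approach can be checked numerically. The sketch below is only an illustration: it models the table as a disc centred at the origin, and the helper names `mirror` and `fits` as well as the radii are assumptions made for the example, not something from the puzzle book.

```python
import math

def mirror(point):
    """Reflect a stone's centre through the table centre (the origin)."""
    x, y = point
    return (-x, -y)

def fits(point, placed, table_radius, stone_radius):
    """True if a stone centred at `point` lies fully on the table
    and does not overlap any stone already placed."""
    x, y = point
    if math.hypot(x, y) + stone_radius > table_radius:
        return False
    return all(math.dist(point, q) >= 2 * stone_radius for q in placed)

# The first player opens in the centre, then answers every move with its mirror image.
placed = [(0.0, 0.0)]
for opponent_move in [(3.0, 0.0), (0.0, 2.5), (-2.0, -2.0)]:
    if fits(opponent_move, placed, table_radius=5.0, stone_radius=1.0):
        placed.append(opponent_move)
        reply = mirror(opponent_move)
        # Because the position is symmetric, the mirrored spot is still free.
        assert fits(reply, placed, table_radius=5.0, stone_radius=1.0)
        placed.append(reply)
```

As long as the position stays symmetric about the centre, every legal move by the opponent leaves its mirror image legal, which is why the first player never runs out of moves first.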
An easy-to-understand description of the Raft protocol
Although the Raft paper is easier to read than "Paxos Made Simple", it still digresses a fair amount and is relatively long. After reading it, I put it down and reflected, and decided that writing up my own summary was the more reliable way to make the knowledge truly mine. Here I apply the first, minimalist way of thinking from the Go stone puzzle to describe the Raft protocol and verify how it works.
There are three types of roles in a cluster organized by the Raft protocol:
- Leader (leader)
- Follower (the masses)
- Candidate (candidate)
As in a democratic society, the leader is elected by popular vote. At the start there is no leader, and every participant in the cluster is a follower. A general election is then launched in which every follower may run, at which point they all become candidates. Once a leader has been democratically elected, that leader's term begins and the election ends; all candidates other than the leader return to the follower role and obey the leader. This introduces the concept of a term of office, which Raft calls a term. These are more or less all the core concepts and terms of the Raft protocol, and they map so well onto real-world democracy that they are easy to understand. The transitions among the three roles are as follows, and they are easy to follow in light of the election process.
Leader election process
Applying minimalist thinking, the smallest Raft cluster needs three participants (say A, B, and C) so that a majority vote is possible. A, B, and C all start as Followers, and when an election is started there are three possible outcomes. In the first two a Leader is elected. In the third the vote is split (split vote): every participant votes for itself and nobody wins a majority. Each participant then waits for a random period (the election timeout) before starting a new vote, and this repeats until one participant receives a majority. The key is the randomized timeout: whoever wakes from its timeout first requests votes from the other two while they are still waiting, so they can only vote for it, and agreement is reached quickly.
Once elected, the Leader maintains its authority by periodically sending heartbeat messages to all Followers. If a Follower receives no heartbeat from the Leader for some time, it assumes the Leader may have crashed and starts a new leader election.
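The split-vote case and its resolution by a randomized timeout can be sketched in a few lines. This is a simplified model under stated assumptions, not a real Raft implementation: there are no logs or network, and the `Node` class, the function names, and the 150 to 300 ms timeout range are all chosen for illustration.

```python
import random

class Node:
    def __init__(self, name):
        self.name = name
        self.term = 0
        self.voted_for = None      # who this node voted for in self.term
        self.role = "follower"

    def request_vote(self, candidate, term):
        """Grant at most one vote per term, first come first served."""
        if term > self.term:       # a newer term resets the vote and the role
            self.term, self.voted_for, self.role = term, None, "follower"
        if term == self.term and self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False

def split_vote(nodes):
    """All nodes time out together: each votes for itself, nobody has a majority."""
    for n in nodes:
        n.term += 1
        n.role = "candidate"
        n.voted_for = n.name

def retry_after_random_timeout(nodes, rng):
    """The node whose random timeout expires first campaigns in a new term."""
    timeouts = {n.name: rng.uniform(150, 300) for n in nodes}  # ms, illustrative
    first = min(nodes, key=lambda n: timeouts[n.name])
    first.term += 1
    first.voted_for = first.name
    votes = 1 + sum(p.request_vote(first.name, first.term)
                    for p in nodes if p is not first)
    if votes >= len(nodes) // 2 + 1:   # majority of the cluster
        first.role = "leader"
    return first

nodes = [Node("A"), Node("B"), Node("C")]
split_vote(nodes)
leader = retry_after_random_timeout(nodes, random.Random(7))
print(leader.name, leader.role, leader.term)
```

Which node wins depends on the random draw, but whichever wakes first finds the other two still waiting, so it collects both of their votes in the new term.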
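The heartbeat rule amounts to a timer on each Follower that every heartbeat resets. A minimal sketch, assuming a hypothetical `FollowerTimer` class and timestamps passed in explicitly as seconds (a real node would read a clock):

```python
class FollowerTimer:
    def __init__(self, election_timeout, now=0.0):
        self.election_timeout = election_timeout  # seconds without a heartbeat tolerated
        self.last_heartbeat = now

    def on_heartbeat(self, now):
        """A heartbeat from the Leader resets the timer."""
        self.last_heartbeat = now

    def leader_suspected_dead(self, now):
        """True once the Leader has been silent for a full election timeout."""
        return now - self.last_heartbeat >= self.election_timeout

timer = FollowerTimer(election_timeout=0.3)
timer.on_heartbeat(0.1)
print(timer.leader_suspected_dead(0.3))   # Leader was heard recently enough
print(timer.leader_suspected_dead(0.45))  # silence exceeded the timeout
```

When `leader_suspected_dead` returns true, the Follower would become a candidate and start the election described above.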
The effect of the Leader node on consistency
The Raft protocol relies heavily on the availability of the Leader node to keep cluster data consistent. Data flows in one direction only, from the Leader node to the Follower nodes. When a Client submits data to the cluster's Leader, the data the Leader receives starts in the uncommitted state. The Leader then replicates the data to all Follower nodes and waits for their responses. Once it is sure that a majority (more than half) of the nodes in the cluster have received the data, it confirms receipt to the Client. Once the ACK has been sent to the Client, the data is in the committed state, and the Leader notifies the Followers that the data has been committed.
The Leader may crash at any stage of this process. Let us look at how the Raft protocol guarantees data consistency at each stage.
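The commit path described above can be sketched as follows. This is a toy model under assumptions: replication is a direct method call rather than an RPC, an unreachable Follower simply returns no acknowledgement, and the `Leader` and `Follower` classes are hypothetical names for this example.

```python
class Follower:
    def __init__(self, reachable=True):
        self.reachable = reachable
        self.log = []            # entries received from the Leader
        self.commit_index = 0    # number of entries known to be committed

    def append(self, entry):
        """Store the entry and acknowledge it, unless the node is unreachable."""
        if not self.reachable:
            return False
        self.log.append(entry)
        return True

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []
        self.commit_index = 0

    def submit(self, entry):
        """Return True (ACK to the Client) only once a majority holds the entry."""
        self.log.append(entry)                         # uncommitted on the Leader
        acks = 1 + sum(f.append(entry) for f in self.followers)  # Leader counts itself
        cluster_size = len(self.followers) + 1
        if acks <= cluster_size // 2:
            return False                               # no majority: stays uncommitted
        self.commit_index = len(self.log)              # entry is now committed
        for f in self.followers:                       # tell Followers about the commit
            if f.reachable:
                f.commit_index = min(self.commit_index, len(f.log))
        return True

f1, f2 = Follower(), Follower(reachable=False)
leader = Leader([f1, f2])
print(leader.submit("x=1"))   # two of three nodes hold the entry
```

With three nodes, the Leader plus one Follower is already a majority, so the write succeeds even though the second Follower is down.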
1. Before the data reaches the Leader node
A Leader crash at this stage does not affect consistency, so there is not much to say.
2. Data reaches the Leader node but has not been replicated to the Follower nodes
If the Leader crashes at this stage, the data is in the uncommitted state. The Client receives no ACK, treats the request as timed out, and can safely retry. The Followers do not have the data, so after a new Leader is elected the Client's retry can succeed. When the original Leader recovers, it rejoins the cluster as a Follower and resynchronizes its data from the new Leader of the current term, which forces its data to match the new Leader's.
3. Data reaches the Leader node and is successfully replicated to all Follower nodes, but the Followers have not yet acknowledged receipt to the Leader
If the Leader crashes at this stage, the data on the Followers is uncommitted but consistent, so the newly elected Leader can complete the commit. The Client does not know whether the submission succeeded and may retry it. For this case Raft requires RPC requests to be idempotent, that is, the implementation must deduplicate requests internally.
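One common way to get this idempotency is to tag every request with a client ID and a request ID and to remember the result of each one. The sketch below assumes that scheme; the `DedupStateMachine` class and its fields are hypothetical and only illustrate the idea.

```python
class DedupStateMachine:
    def __init__(self):
        self.data = {}
        self.seen = {}   # (client_id, request_id) -> result of the first application

    def apply(self, client_id, request_id, key, delta):
        """Apply an increment once; a retry with the same IDs returns the cached result."""
        token = (client_id, request_id)
        if token in self.seen:
            return self.seen[token]          # duplicate: do not apply again
        self.data[key] = self.data.get(key, 0) + delta
        self.seen[token] = self.data[key]
        return self.data[key]

sm = DedupStateMachine()
sm.apply("c1", 1, "counter", 5)
sm.apply("c1", 1, "counter", 5)   # Client retry after a lost ACK
print(sm.data["counter"])         # the increment was applied only once
```

An increment is used here on purpose: unlike a plain overwrite, applying it twice would corrupt the value, so it shows why the retry has to be recognized.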
4. Data reaches the Leader node and is successfully replicated to some of the Follower nodes, but the Followers have not yet acknowledged receipt to the Leader
If the Leader crashes at this stage, the data on the Followers is uncommitted and inconsistent. The Raft protocol requires that votes be cast only for a node with the most up-to-date data, so a node holding the latest data is elected Leader and then forces its data onto the other Followers. The data is not lost and eventually becomes consistent.
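In the Raft paper, "most up-to-date" is decided by comparing the last log entry of the candidate and the voter: the higher last term wins, and with equal terms the longer log wins. A minimal sketch of that check, assuming a log is represented simply as a list of term numbers (the helper names are made up for this example):

```python
def last_entry(log):
    """Return (last_log_term, last_log_index); indexes are 1-based, (0, 0) if empty."""
    return (log[-1], len(log)) if log else (0, 0)

def grant_vote(voter_log, candidate_log):
    """Vote only if the candidate's log is at least as up-to-date as the voter's.
    Tuple comparison checks the term first, then the index."""
    return last_entry(candidate_log) >= last_entry(voter_log)

# Stage 4 with nodes B and C after Leader A crashed: B received the new entry, C did not.
b_log = [1, 1]
c_log = [1]
print(grant_vote(voter_log=c_log, candidate_log=b_log))  # C votes for B
print(grant_vote(voter_log=b_log, candidate_log=c_log))  # B refuses to vote for C
```

Because B will not vote for C, only B can gather a majority, so the entry that reached B survives the Leader change.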
5. Data reaches the Leader node and is successfully replicated to all or a majority of the Follower nodes; it is committed on the Leader but uncommitted on the Followers
If the Leader crashes at this stage, the handling after a new Leader is elected is the same as in stage 3.
6. Data reaches the Leader node and is successfully replicated to all or a majority of the Follower nodes; it is committed on every node, but the Client has not yet received a response
If the Leader crashes at this stage, the data inside the cluster is in fact already consistent, and repeated Client retries, handled idempotently, do not affect consistency.
7. A network partition causes split brain, with two Leaders
A network partition separates the original Leader from the Followers. The Followers, no longer receiving the Leader's heartbeat, start an election and produce a new Leader, so there are now two Leaders. The original Leader is alone in its partition, and data submitted to it cannot be replicated to a majority of nodes, so it can never be committed. Data submitted to the new Leader can be committed. After the network recovers, the old Leader discovers that the cluster has a new Leader with a newer term, automatically steps down to Follower, and synchronizes data from the new Leader, bringing the cluster data back to a consistent state.
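Two small rules carry this scenario: a Leader can commit only when it can reach a majority, and any node that sees a higher term steps down. The sketch below illustrates both under simplifying assumptions; the `Peer` class and `can_commit` function are hypothetical names, and messages are reduced to the term number they carry.

```python
class Peer:
    def __init__(self, name, term, role):
        self.name, self.term, self.role = name, term, role

    def on_message(self, sender_term):
        """Any message carrying a higher term forces a step-down to Follower."""
        if sender_term > self.term:
            self.term = sender_term
            self.role = "follower"

def can_commit(reachable_peers, cluster_size):
    """A Leader can commit only if it plus the peers it reaches form a majority."""
    return 1 + reachable_peers > cluster_size // 2

# During the partition: old Leader A is alone, new Leader B still reaches C.
print(can_commit(reachable_peers=0, cluster_size=3))  # A cannot commit
print(can_commit(reachable_peers=1, cluster_size=3))  # B can commit

# After the partition heals, A hears from B, whose term is newer.
old_leader = Peer("A", term=1, role="leader")
old_leader.on_message(sender_term=2)
print(old_leader.role)
```

Since only one side of a partition can hold a majority, at most one of the two Leaders can ever commit data, which is what keeps the cluster consistent.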
This exhaustive analysis of every situation the smallest cluster (3 nodes) can face shows that the Raft protocol handles the consistency problem well and is easy to understand.
Summary
To close this article, here is a paraphrase of the conclusion from the last section of the Raft paper.
Correctness, efficiency, and conciseness are often the primary design goals of an algorithm.
Although these are all worthy goals, none of them can be achieved until developers turn the algorithm into a practical implementation.
So we believe that understandability is just as important.
Think about it: the Paxos algorithm was first made public by Leslie Lamport in 1990, but when did most of us first hear of it, and when did a usable implementation appear? The Raft algorithm was published in 2013, and reference [5] already lists open-source implementations in many different languages. That is the importance of understandability.
References
[1]. Leslie Lamport, Robert Shostak, Marshall Pease. The Byzantine Generals Problem. 1982
[2]. Leslie Lamport. The Part-Time Parliament. 1998
[3]. Leslie Lamport. Paxos Made Simple. 2001
[4]. Diego Ongaro and John Ousterhout. Raft Paper. 2013
[5]. Raft Website. The Raft Consensus Algorithm
[6]. Raft Demo. Raft Animated Demo