Paxos has been nearly synonymous with distributed consensus algorithms ever since it was introduced in 1990. Because it is hard to understand and hard to implement, however, there are only a few well-known implementations, such as Chubby, ZooKeeper, and libpaxos, and ZooKeeper actually uses Zab, which modifies Paxos substantially. For this reason, in 2013 Diego Ongaro and John Ousterhout of Stanford proposed Raft, a new consensus algorithm designed to be easier to understand and implement.
Like Paxos, Raft can keep serving requests as long as a majority of nodes (n/2+1) are working normally. Its advantage over Paxos is that it is easier to understand and implement. Raft breaks the algorithm into several sub-problems: leader election, log replication, and safety. The overall flow is as follows: when the cluster starts, the nodes elect a leader, which manages log replication. The leader receives transaction requests (log entries) from clients, replicates them to the other nodes in the cluster, and then tells those nodes to commit the entries; the leader is responsible for keeping the other nodes' logs in sync with its own. When the leader goes down, the remaining nodes start a new election and elect a new leader.
Roles
Raft involves three roles:
Leader: handles requests from clients, manages log replication, and sends heartbeats to the followers to maintain its leadership.
Follower: responds to log replication requests from the leader and to vote requests from candidates. All nodes start as followers.
Candidate: initiates an election. When Raft has just started or the leader is down, a node changes from follower to candidate and starts an election; if it wins the election, it changes from candidate to leader.
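The role changes described in this article can be written down as a small transition table. This is only a sketch: the event names (`election_timeout`, `won_majority`, and so on) are hypothetical labels chosen for illustration, not identifiers from Raft itself.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


# (current role, event) -> next role. Event names are illustrative assumptions.
TRANSITIONS = {
    (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,    # timer expired, start an election
    (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,   # no winner, start a new election
    (Role.CANDIDATE, "won_majority"): Role.LEADER,          # received a majority of votes
    (Role.CANDIDATE, "discovered_leader"): Role.FOLLOWER,   # a legitimate leader exists
    (Role.CANDIDATE, "discovered_higher_term"): Role.FOLLOWER,
    (Role.LEADER, "discovered_higher_term"): Role.FOLLOWER,
}


def next_role(role, event):
    """Return the role after `event`; events not in the table leave the role unchanged."""
    return TRANSITIONS.get((role, event), role)
```

For example, `next_role(Role.FOLLOWER, "election_timeout")` yields `Role.CANDIDATE`, while a leader that receives `"election_timeout"` stays a leader because that pair is not in the table.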
[Figure: Raft role state transition diagram]
Term (tenure)
Raft uses the concept of a term (tenure). Each election belongs to one term, and at most one leader can be elected in a term. Terms are numbered with consecutive integers; in this example, all followers start at term 1. When one follower's timer expires, it triggers an election: the follower becomes a candidate and increments its term to 2. The election can then play out in several ways:
1. If the election for term 2 fails to produce a leader or ends abnormally, the term is incremented to 3 and a new election starts.
2. If the election for term 2 produces a leader and that leader later goes down, another follower becomes a candidate, increments the term, and starts a new election.
3. If a leader or candidate discovers that its term is smaller than another node's, it reverts to follower and updates its term to the larger value.
4. If a follower discovers that its term is smaller than another node's, it updates its term to match.
Every term increment is accompanied by a new election. While Raft is running normally, all nodes agree on the current term, and as long as no node fails, a term continues indefinitely. When a node receives a request whose term is smaller than its own current term, it rejects the request.
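The term rules above can be condensed into one comparison that a node performs on every incoming message. The following is a minimal sketch; the function name and the string role labels are assumptions made for illustration.

```python
def handle_incoming_term(current_term, role, message_term):
    """Apply the term rules to an incoming message.

    Returns (new_term, new_role, accepted), where `accepted` only says that
    the message was not rejected for carrying a stale term.
    """
    if message_term < current_term:
        # The sender is behind: reject the request, keep our own state.
        return current_term, role, False
    if message_term > current_term:
        # We are behind: adopt the larger term; a leader or candidate steps down.
        return message_term, "follower", True
    # Same term: nothing to update.
    return current_term, role, True
```

A leader at term 3 that sees a message from term 5 ends up as a follower at term 5, whereas a message from term 2 is simply rejected.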
Election
Initially all nodes are followers, each with a different timer duration. When a node's timer expires, it triggers an election: the node increments its term, changes from follower to candidate, and sends vote requests (RequestVote RPCs) to the other nodes. There are several possible outcomes:
1. The candidate receives votes from a majority of nodes (N/2+1), becomes leader, and sends heartbeats to the other nodes to maintain its leadership.
2. The candidate receives an AppendEntries RPC from another node. If that node's term is at least as large as the candidate's own term, the candidate recognizes it as the new legitimate leader and reverts to follower; otherwise it rejects the request and remains a candidate.
3. The election times out; the candidate increments its term and starts a new election.
In each term, a node can vote at most once. If several candidates split the vote so that none of them receives more than half of the votes, each candidate increments its term, restarts its timer, and starts a new election. Because the timer durations are random, it is unlikely that multiple candidates will keep starting elections at the same time.
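The voting rules can be sketched in a few lines. Everything here is a simplified illustration: the `Voter` class, `run_election`, and the 150 to 300 ms timeout range are assumptions for the example, messages are plain method calls, and the check that a candidate's log is sufficiently up to date (which full Raft performs before granting a vote) is left out.

```python
import random


def election_timeout(low_ms=150, high_ms=300):
    """Pick a random timer duration; the range is an assumed example value."""
    return random.uniform(low_ms, high_ms)


def majority(cluster_size):
    """Smallest number of nodes that is more than half of the cluster (N/2+1)."""
    return cluster_size // 2 + 1


class Voter:
    """Tracks the single vote a node may cast per term."""

    def __init__(self, current_term=1):
        self.current_term = current_term
        self.voted_for = None

    def request_vote(self, candidate_id, candidate_term):
        if candidate_term < self.current_term:
            return False                      # stale candidate, reject
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            self.voted_for = None             # new term, the vote is free again
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id     # at most one vote per term
            return True
        return False


def run_election(candidate_id, term, other_nodes):
    """The candidate votes for itself, then asks every other node for its vote."""
    votes = 1 + sum(node.request_vote(candidate_id, term) for node in other_nodes)
    return votes >= majority(len(other_nodes) + 1)
```

With four other nodes (a cluster of five), a first candidate in term 2 collects every vote and wins; a second candidate asking the same nodes in the same term gets none, because each node has already voted in that term.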
Log Replication
To keep the nodes consistent, every node must execute the same sequence of operations in the same order. That is the purpose of log replication.
1. The leader receives a transaction request (a log entry) from a client, appends it to its local log, and replicates it to the followers with AppendEntries RPCs.
2. Each follower that receives the entry appends it to its local log and sends an ACK to the leader.
3. Once the entry has been acknowledged by a majority of the nodes, the leader marks it as committed, notifies the client, and sends AppendEntries RPCs telling the followers to commit the entry.
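The three steps above can be sketched as a single in-process function. This is a simplified illustration, not a full implementation: `Node`, `replicate`, and the `reachable` flag are hypothetical names, RPCs are plain method calls, the leader counts itself toward the majority, and retries for followers that missed an entry are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    term: int
    command: str


@dataclass
class Node:
    log: list = field(default_factory=list)
    commit_index: int = 0      # number of entries known to be committed
    reachable: bool = True     # False simulates a crashed or partitioned node

    def append_entries(self, entries, leader_commit):
        """Follower side: append the entries, learn the commit point, and ACK."""
        if not self.reachable:
            return False
        self.log.extend(entries)
        self.commit_index = min(leader_commit, len(self.log))
        return True


def replicate(leader, followers, term, command):
    """Return True if the entry was committed, False if no majority was reached."""
    entry = LogEntry(term, command)
    leader.log.append(entry)                                   # step 1
    acks = 1 + sum(f.append_entries([entry], leader.commit_index)
                   for f in followers)                         # step 2
    if acks >= (len(followers) + 1) // 2 + 1:                  # step 3
        leader.commit_index = len(leader.log)
        for f in followers:
            f.append_entries([], leader.commit_index)          # tell followers to commit
        return True
    return False
```

In a three-node cluster with one unreachable follower, `replicate` still commits, because the leader plus one follower form a majority. If both followers are unreachable, the entry stays in the leader's log uncommitted.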
Safety
1. At most one leader can be elected in each term.
2. A leader never deletes or overwrites entries in its own log; it only appends new ones.
3. If two logs contain an entry with the same index and the same term number, the logs are identical in all entries from the beginning up to that index.
4. If a log entry is committed in a given term, that entry is present in the logs of the leaders of all terms with larger numbers.
5. If the leader has committed a log entry at a given index, no other node will commit a different log entry at that index.
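Properties 3 and 5 are easy to express as checks over two logs, which can be handy in tests of an implementation. The helpers below are a sketch of such checks, not part of the protocol itself; the function names and the representation of a log as a list of `(term, command)` tuples are assumptions.

```python
def logs_match(log_a, log_b):
    """Property 3: if two logs share an entry with the same index and term,
    they must be identical from the start up to that index."""
    for i in range(min(len(log_a), len(log_b)) - 1, -1, -1):
        if log_a[i][0] == log_b[i][0]:
            # Highest index where the terms agree: the whole prefix must agree.
            return log_a[:i + 1] == log_b[:i + 1]
    return True  # no shared (index, term) pair, nothing to compare


def committed_prefixes_agree(log_a, commit_a, log_b, commit_b):
    """Property 5: entries committed on both nodes must be the same."""
    n = min(commit_a, commit_b)
    return log_a[:n] == log_b[:n]
```

Checking only the highest index at which the terms agree is enough, because an identical prefix up to that index covers every lower index as well.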
RequestVote RPC and AppendEntries RPC
Raft nodes communicate using two types of RPC, RequestVote and AppendEntries:
RequestVote RPC: requests a vote; sent by candidates during an election.
AppendEntries RPC: appends log entries; sent by the leader, and used both for log replication and as the heartbeat.
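As a rough illustration, the two request messages could be modeled as plain data classes. The field names below follow the arguments listed in the Raft paper, converted to Python naming; they are not taken from this article, and the `is_heartbeat` helper is a hypothetical convenience.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RequestVote:
    term: int                  # candidate's term
    candidate_id: str          # who is asking for the vote
    last_log_index: int = 0    # index of the candidate's last log entry
    last_log_term: int = 0     # term of the candidate's last log entry


@dataclass
class AppendEntries:
    term: int                  # leader's term
    leader_id: str
    prev_log_index: int = 0    # index of the entry just before the new ones
    prev_log_term: int = 0     # term of that entry
    entries: List[Tuple[int, str]] = field(default_factory=list)
    leader_commit: int = 0     # leader's commit index

    @property
    def is_heartbeat(self) -> bool:
        """An AppendEntries request with no entries serves as a heartbeat."""
        return not self.entries
```

Using one message type for both purposes means a follower handles heartbeats and replication with the same code path: `AppendEntries(term=2, leader_id="A")` is a heartbeat, and the same message with a non-empty `entries` list carries log data.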
Postscript
This article summarized Raft. Like Paxos, 2PC, and 3PC, which were covered in earlier articles, it is a distributed consensus algorithm with non-Byzantine fault tolerance: it accounts for lost, delayed, and out-of-order messages, but not for messages that have been tampered with. Starting with the next article, we will look at distributed consensus algorithms based on Byzantine fault tolerance, which are widely used in Bitcoin, Ethereum, and other blockchain products.