In Search of an Understandable Consensus Algorithm

Source: Internet
Author: User

This article introduces Raft, a consensus algorithm for distributed coordination. Raft was proposed by Diego Ongaro and John Ousterhout of Stanford University. Although it differs little from Paxos in functionality and performance, it is easier to understand, which was the authors' main goal in designing it.

Consensus algorithms generally have the following properties:

1. They keep the data correct under network latency, packet loss, or duplication.

2. They keep the system available even if some nodes are down.

3. They ensure log consistency without relying on timing. Faulty clocks or extreme message delays can, at worst, cause availability problems.

4. A command completes as soon as a majority of the nodes have executed it. A minority of slow nodes that have not yet finished does not hold it up.

Diego believes that the main disadvantages of Paxos are that it is hard to understand and that its architecture is not well suited to building real systems. Raft's guiding design principle is therefore understandability, which shows in many of the choices described below. Like Paxos, Raft adopts a centralized approach: all write operations go through the leader, and the other nodes only follow.

1. Raft divides servers into three states:

1.) leader

A cluster can have only one leader at a time.

2.) follower

A follower follows the leader and accepts the logs that the leader sends.

3.) candidate

If a follower does not receive the leader's heartbeat within the timeout period, it automatically becomes a candidate.

The state transition diagram (figure not reproduced here) works as follows. At the beginning, all servers are followers. A server that receives no heartbeat or command from the leader during the timeout period changes to candidate. A candidate that obtains a majority of votes becomes the leader. The leader stays leader until it goes down and restarts, or until it discovers a term newer than its own (explained later).
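
The transitions described above can be sketched as a small lookup table. This is a minimal Python sketch, not a full implementation: the event names (`election_timeout`, `won_majority`, `saw_current_leader`, `saw_higher_term`) are hypothetical labels chosen for illustration, and the candidate-to-follower transitions follow the Raft paper rather than anything spelled out in this article.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


# Hypothetical event names; each key is (current role, event).
TRANSITIONS = {
    (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
    (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,  # no winner: start a new round
    (Role.CANDIDATE, "won_majority"): Role.LEADER,
    (Role.CANDIDATE, "saw_current_leader"): Role.FOLLOWER,
    (Role.CANDIDATE, "saw_higher_term"): Role.FOLLOWER,
    (Role.LEADER, "saw_higher_term"): Role.FOLLOWER,
}


def next_role(role: Role, event: str) -> Role:
    """Return the role after `event`; events that do not apply leave the role unchanged."""
    return TRANSITIONS.get((role, event), role)
```

For example, `next_role(Role.FOLLOWER, "election_timeout")` yields `Role.CANDIDATE`, while a follower that somehow sees `won_majority` simply stays a follower.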


2. Raft terms

Raft divides time into terms, and each term has at most one leader. Terms act as a logical clock, so the cluster does not depend on wall-clock time: the term number alone is enough to tell whether a leader is still current.

When the leader goes down and a follower receives no heartbeat within the timeout period, the follower increments its term by 1, becomes a candidate, and initiates a vote.
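
One way to picture the term as a logical clock is the sketch below. It is a simplified Python illustration under stated assumptions: `TermState`, `observe_term`, and `start_election` are hypothetical names, and the rule that a server seeing a larger term steps down to follower is taken from the Raft paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TermState:
    current_term: int = 0
    voted_for: Optional[str] = None
    role: str = "follower"


def observe_term(state: TermState, seen_term: int) -> bool:
    """Process the term carried by any incoming message.

    Returns False when the message comes from an older term and should be rejected.
    """
    if seen_term < state.current_term:
        return False
    if seen_term > state.current_term:
        # A newer term exists: adopt it, forget the old vote, and step down.
        state.current_term = seen_term
        state.voted_for = None
        state.role = "follower"
    return True


def start_election(state: TermState, self_id: str) -> None:
    """Called on election timeout: bump the term, become candidate, vote for self."""
    state.current_term += 1
    state.role = "candidate"
    state.voted_for = self_id
```

No wall-clock timestamps are compared anywhere; only term numbers decide which message is stale.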


3. Raft communication

Raft servers communicate using remote procedure calls (RPCs). Only two RPC types, each with its response, are needed.

1.) RequestVote RPC

A candidate sends this RPC to the other servers when it initiates a vote. Each server replies based on its log and term information, as described under Leader Election.

2.) AppendEntries RPC

The leader uses this RPC to send heartbeats or replicate logs to the followers. The follower must also respond to the leader.
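
The two RPCs and their responses might be modeled as plain data classes, as in the sketch below. The field names follow the Raft paper (converted to Python naming); the class names and the `is_heartbeat` helper are assumptions made for illustration, and no transport layer is shown.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RequestVoteArgs:
    term: int              # candidate's term
    candidate_id: str
    last_log_index: int    # index of the candidate's last log entry
    last_log_term: int     # term of the candidate's last log entry


@dataclass
class RequestVoteReply:
    term: int              # responder's current term
    vote_granted: bool


@dataclass
class AppendEntriesArgs:
    term: int              # leader's term
    leader_id: str
    prev_log_index: int    # index of the entry just before the new ones
    prev_log_term: int     # term of that entry
    entries: List[Tuple[int, str]] = field(default_factory=list)  # (term, command)
    leader_commit: int = 0

    def is_heartbeat(self) -> bool:
        """A request that carries no entries serves as a heartbeat."""
        return not self.entries


@dataclass
class AppendEntriesReply:
    term: int              # responder's current term
    success: bool
```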

4. Leader Election

Every Raft server in the cluster starts as a follower, and each follower sets a random election timeout. Because no leader exists at this point, the first follower to time out changes to the candidate state, initiates an election, and votes for itself. On receiving a RequestVote request, each of the other servers decides whether to vote in favor based on its current state:

1.) If the candidate's term is less than its own term, it votes against. If the candidate's term is larger, it sets its own term to the candidate's term and then applies condition 2.

2.) If it has not yet voted in this term and the candidate's log is at least as up-to-date as its own, it votes in favor.
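
The two voting conditions can be sketched as follows. This is a minimal Python sketch, assuming log entries are `(term, command)` tuples and that "at least as up-to-date" means comparing the last entry's term first and the log length second, as the Raft paper defines it. The `Voter` class and its method names are hypothetical.

```python
from typing import List, Optional, Tuple

Entry = Tuple[int, str]  # (term, command)


class Voter:
    def __init__(self, log: Optional[List[Entry]] = None, current_term: int = 0):
        self.log = list(log or [])
        self.current_term = current_term
        self.voted_for: Optional[str] = None

    def last_log(self) -> Tuple[int, int]:
        """Return (term, index) of the last entry, or (0, 0) for an empty log."""
        if not self.log:
            return (0, 0)
        return (self.log[-1][0], len(self.log))

    def handle_request_vote(self, term: int, candidate_id: str,
                            last_log_term: int, last_log_index: int) -> bool:
        # Condition 1: reject candidates from an older term.
        if term < self.current_term:
            return False
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        # Condition 2: at most one vote per term, and only for a candidate
        # whose log is at least as up-to-date (later last term, then longer log).
        up_to_date = (last_log_term, last_log_index) >= self.last_log()
        if self.voted_for in (None, candidate_id) and up_to_date:
            self.voted_for = candidate_id
            return True
        return False
```

Note that a server adopts the larger term even when it ends up refusing the vote because the candidate's log is behind.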

A candidate that receives votes from more than half of the servers becomes the leader. At most one candidate can be elected in each round, because each server casts only one vote per term and a winner needs more than half of the votes. It is also possible that no one is elected in a round. In that case the servers wait for the next election timeout and start a new round. If many followers become candidates at the same time because their election timeouts coincide, no leader can be elected for the current term and another round is required. This is called a split vote.
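
The majority argument can be checked with a few lines of code. The sketch below is illustrative only: `majority` and `election_winner` are hypothetical helper names, and votes are represented as a simple mapping from voter to candidate, which already encodes the one-vote-per-server rule.

```python
from typing import Dict, Optional


def majority(cluster_size: int) -> int:
    """Smallest number of servers that is more than half of the cluster."""
    return cluster_size // 2 + 1


def election_winner(votes: Dict[str, str], cluster_size: int) -> Optional[str]:
    """Return the candidate holding a majority of votes, or None on a split vote.

    `votes` maps each voter id to the single candidate it voted for.
    """
    tally: Dict[str, int] = {}
    for candidate in votes.values():
        tally[candidate] = tally.get(candidate, 0) + 1
    for candidate, count in tally.items():
        if count >= majority(cluster_size):
            return candidate
    return None
```

Because two disjoint groups cannot both contain more than half of the servers, at most one candidate can reach `majority(cluster_size)` in a given term.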

To avoid split votes, Raft uses a random election timeout for each server, generally between 150 and 300 ms. This reduces the number of servers that become candidates at the same moment. In general, the follower whose timeout expires first is elected quickly and then sends heartbeats to the other servers to keep them from timing out.
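
A randomized timeout could be drawn as in the sketch below. The 150 to 300 ms default range is the example interval given in the Raft paper; the function names are hypothetical, and a seeded `random.Random` is passed in only so that the behavior is reproducible.

```python
import random
from typing import Dict


def election_timeout_ms(rng: random.Random, low: int = 150, high: int = 300) -> int:
    """Pick an election timeout uniformly from [low, high] milliseconds."""
    return rng.randint(low, high)


def first_to_time_out(timeouts: Dict[str, int]) -> str:
    """Return the server whose timeout expires first; it becomes the first candidate."""
    return min(timeouts, key=timeouts.get)
```

With independent draws per server, two servers rarely share the smallest timeout, so usually a single candidate starts the election and wins before the others time out.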


5. Log Replication

Only the leader can send logs to the other servers, through the AppendEntries RPC, and the followers then execute commands based on those logs. This process is called log replication. Let's take a look at what the log looks like.

In the figure (not reproduced here), each box is a log entry holding term information and a command. Each entry also has an index marking its position in the log. Once an entry is saved on a majority of servers, it can be considered committed. There are some further restrictions on committing, described later.

Logs have the following properties:

1) If two log entries have the same term and index, they hold the same command.

2) If two log entries have the same term and index, all preceding entries are also the same.

The first property holds because each term has only one leader, and the leader never overwrites its own log but only appends to it. The second holds because a follower that receives an AppendEntries RPC checks whether its own log contains the entry immediately preceding the new entries in the RPC. If it does not, the follower rejects the new entries.
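
The follower-side check can be sketched like this. It is a simplified Python illustration, assuming 1-based log indices and `(term, command)` entries; the function name `append_entries` is hypothetical, and the step that deletes conflicting entries follows the Raft paper rather than the text above.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Apply an AppendEntries request to `log` in place.

    `prev_index` is the 1-based index of the entry preceding `entries`
    (0 means the new entries start at the beginning of the log).
    Returns False when the consistency check fails.
    """
    # Reject if the follower has no entry at prev_index or its term differs.
    if prev_index > len(log):
        return False
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False
    for offset, entry in enumerate(entries):
        pos = prev_index + offset  # 0-based position of this entry
        if pos < len(log) and log[pos][0] != entry[0]:
            del log[pos:]  # conflicting entry: drop it and everything after it
        if pos >= len(log):
            log.append(entry)
    return True
```

Sending the same request twice leaves the log unchanged, because entries that already match are skipped.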

When a follower rejects the RPC because the preceding entry does not exist, the leader responds by trying one entry earlier, and repeats this until the follower accepts. This matters especially after a leader restart, because the leader then does not know which entries the followers have saved. It first tries to replicate its last entry, and normal log replication resumes only once it finds an entry that the follower can accept.
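
The back-off loop might look like the single-process simulation below. Several things here are assumptions: `sync_follower` and `follower_accepts` are hypothetical names, `next_index` starts just past the leader's last entry as in the Raft paper, and the follower's suffix is replaced wholesale, which is a simplification that is only safe because this simulation has no delayed or reordered messages.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def follower_accepts(follower_log: List[Entry], prev_index: int, prev_term: int) -> bool:
    """Consistency check: does the follower hold the entry at prev_index with prev_term?"""
    if prev_index == 0:
        return True
    return prev_index <= len(follower_log) and follower_log[prev_index - 1][0] == prev_term


def sync_follower(leader_log: List[Entry], follower_log: List[Entry]) -> int:
    """Make follower_log match leader_log; return how many attempts were rejected."""
    next_index = len(leader_log) + 1  # 1-based index of the next entry to send
    rejections = 0
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        if follower_accepts(follower_log, prev_index, prev_term):
            # Simplification: overwrite the follower's suffix with the leader's.
            follower_log[prev_index:] = leader_log[prev_index:]
            return rejections
        next_index -= 1  # rejected: try one entry earlier
        rejections += 1
```

The loop always terminates, because a request with `prev_index` equal to 0 is accepted unconditionally.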

 

6. Safety

To ensure that all servers can execute the same command under any conditions, some constraints need to be added.

1.) Election restrictions

The election must ensure that the elected leader contains all entries committed in previous terms. To that end, Raft requires that log entries flow only from the leader to the followers, and that the leader only appends to its log and never overwrites it. In addition, a candidate must obtain a majority of votes to win, and every committed entry is stored on a majority of servers. Any two majorities overlap, and a server grants its vote only to a candidate whose log is at least as up-to-date as its own, so a candidate that wins a majority has a log at least as up-to-date as those of a majority of servers and therefore contains every committed entry.

Note that when fewer than half of the original servers remain, no leader can be elected and the cluster cannot work. Therefore the case in which committed entries exist but a leader is elected from servers that do not hold them cannot arise.


2.) Entries from earlier terms cannot be committed directly.

When a leader marks entries as committed, it may only mark entries that have been replicated to a majority of servers and whose term is the current term. It cannot mark entries from a previous term as committed on the basis of replica counts. This constraint is designed to avoid the following situation, shown in the figure (not reproduced here):

(a) S1 is leader and goes down after the entry at index 2 has been copied to S2.

(b) S5 may time out before S2 and so become a candidate first. It can get the votes of S3 and S4 plus its own, so it becomes leader and starts a new term (the new term is 3, not 2, otherwise S5 could not be elected).

(c) S5 goes down right after being elected. S1 restarts, is elected leader, and opens a new term (term 4). The entry at index 2, from term 2, is now copied to a majority, but it is not committed.

(d) If S1 crashes again, S5 may be elected again. (After restarting, S5 still holds its earlier term, which is not ahead of the other servers, but its term grows after an election timeout, so it can be elected.) S5 then replicates its own term-3 entry, overwriting the term-2 entry at index 2 on the other servers.

(e) Only when an entry from the current term 4 has been copied to a majority, as S1 does here, is that entry committed. All preceding entries are then committed along with it.
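
The commit restriction can be sketched as below. This is a minimal Python sketch under stated assumptions: `advance_commit_index` is a hypothetical name, `match_index` lists the highest replicated index on every server including the leader itself, and log indices are 1-based.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def advance_commit_index(log: List[Entry], match_index: List[int],
                         current_term: int, commit_index: int) -> int:
    """Return the new commit index for the leader.

    An index n is committed only if a majority of servers store it AND the
    entry at n belongs to the leader's current term. Earlier entries are
    then committed implicitly, because they precede n.
    """
    majority = len(match_index) // 2 + 1
    for n in range(len(log), commit_index, -1):
        replicated = sum(1 for m in match_index if m >= n)
        if replicated >= majority and log[n - 1][0] == current_term:
            return n
    return commit_index
```

In scenario (c), S1 in term 4 holds `[(1, "a"), (2, "b"), (4, "c")]` with `match_index` of `[3, 2, 2, 1, 1]`: index 2 sits on three servers but carries term 2, so the commit index stays at 1. In scenario (e), `match_index` becomes `[3, 3, 3, 1, 1]`, the term-4 entry at index 3 reaches a majority, and the commit index jumps to 3, covering index 2 as well.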


7. Leader completeness

Definition: If an entry is committed in some term, that entry is present in the logs of the leaders of all later terms.

The key to the proof is the set of voters that form the majority electing a leader. Assume that an entry was committed in term T, and let U > T be the smallest term whose leader, leaderU, does not contain that entry. When leaderU was elected, at least one voter in its majority must have held the entry committed in term T, and yet it voted for leaderU. But the voting rule requires leaderU's log to be at least as up-to-date as the voter's log, which means it must contain the entry committed in term T. This is a contradiction.
