Raft consensus algorithm


Reposted from:

http://blog.csdn.net/cszhouwei/article/details/38374603

Why not Paxos

The Paxos algorithm was proposed by Leslie Lamport (the "La" in LaTeX, now at Microsoft Research) around 1990 as a message-passing-based consensus algorithm. Because the algorithm was hard to understand, it attracted little attention at first, and Lamport republished it eight years later, in 1998, in ACM Transactions on Computer Systems ("The Part-Time Parliament"). Even then Paxos drew little notice, so in 2001, feeling that his peers could not accept his sense of humor, Lamport restated it in a more accessible form ("Paxos Made Simple"). Clearly Lamport has a special attachment to Paxos. In recent years the widespread use of Paxos has confirmed its central place among distributed consensus algorithms. Google's three famous papers in 2006 opened the "cloud" era, and among them the Chubby lock service uses Paxos as the consensus algorithm within a Chubby cell, which set Paxos on the road to popularity. Lamport himself describes on his website the nine years it took to get the algorithm published.

"There is the only one consensus protocol, and that's spaxos-all other approaches was just broken versions of Paxos."Chubby Authors

"The Dirtylittle secret of the NSDI community is so at the very five people really, trulyunderstand every part of Paxos ;-). "NSDI Reviewer

Note: back then, no matter how much material I read, I could only barely understand Basic Paxos; lacking practical experience, Multi-Paxos still feels like a fog to me. The protagonist of this article is Raft, from the paper "In Search of an Understandable Consensus Algorithm". From the very beginning of its design, the authors took understandability as the highest standard, and this is reflected in many of their design choices.

Problem description

There are two models for node communication in distributed systems: shared memory and message passing. In a distributed system built on the message-passing model, the following faults are unavoidable: processes may be slow, crash, or restart, and messages may be delayed, lost, or duplicated (Byzantine failures are not considered here).

A typical scenario: in a distributed database system, if every node starts from the same initial state and executes the same sequence of operations, all nodes end up in the same final state. To guarantee that every node executes the same command sequence, a consensus algorithm is run for each instruction so that the instructions seen by every node agree. A general-purpose consensus algorithm applies to many scenarios and is a fundamental problem in distributed computing; research on consensus algorithms has continued since the 1980s.

Figure 1: Replicated State Machine Architecture

The Raft algorithm abstracts this class of problems as a "replicated state machine": each server keeps a log of user commands, and its local state machine executes them in order. Clearly, to keep the replicated state machines consistent, it is enough to keep the replicated logs consistent.
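As a minimal sketch of the replicated-state-machine idea (the key-value state machine and all names here are illustrative assumptions, not part of Raft itself), every replica that applies the same log in the same order reaches the same state:

```go
package main

import "fmt"

// LogEntry is one user command together with the term in which it was appended.
type LogEntry struct {
	Term    int
	Command string
}

// KVStateMachine is a toy deterministic state machine: identical logs applied
// in the same order always produce identical states on every server.
type KVStateMachine struct {
	data map[string]string
}

// Apply executes a single committed command. Commands here are "set key value".
func (m *KVStateMachine) Apply(e LogEntry) {
	var k, v string
	if _, err := fmt.Sscanf(e.Command, "set %s %s", &k, &v); err == nil {
		m.data[k] = v
	}
}

func main() {
	sm := &KVStateMachine{data: map[string]string{}}
	log := []LogEntry{
		{Term: 1, Command: "set x 1"},
		{Term: 1, Command: "set y 2"},
	}
	// Every replica applies the same log prefix in the same order.
	for _, e := range log {
		sm.Apply(e)
	}
	fmt.Println(sm.data) // map[x:1 y:2]
}
```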

Algorithm Description

In general, there are two ways to reach agreement in a distributed environment:

1. Symmetric, leader-less

All servers are peers, and a client may interact with any of them

2. Asymmetric, leader-based

At any moment, only one server has decision-making authority, and clients interact only with that leader

The "Designing for understandability" raft algorithm uses the latter, based on the following considerations:

1. Problem decomposition: normal operation & leader changes

2. Simpler normal operation: no conflicts during normal operation

3. More efficient than leader-less approaches

Basic concepts

Server states

The Raft algorithm divides servers into 3 roles:

1. Leader

Handles client interaction and log replication; at most one leader exists in the system at any time

2. Follower

Responds passively to RPC requests and never initiates RPCs on its own

3. Candidate

An intermediate state in the transition from follower to leader

Figure 2: Server States

Terms

As is well known, time synchronization in a distributed environment is itself a hard problem, yet some notion of time is indispensable for identifying stale information. To solve this, Raft divides time into terms, which can be regarded as a kind of "logical time." As shown in the figure below:

1. Each term has at most one leader

2. Some terms have no leader because the election failed

3. Each server maintains its own currentTerm locally (sketched below)

Figure 3: Terms
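As a minimal illustrative sketch (the type and field names are my own, not taken from any particular implementation), the three roles and the locally maintained term might be represented like this:

```go
package main

import "fmt"

// The three Raft roles described above.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// Server holds the state every Raft node maintains locally. currentTerm and
// votedFor must survive restarts; the role itself is volatile.
type Server struct {
	role        Role
	currentTerm int // latest term this server has seen (the "logical time")
	votedFor    int // candidate id voted for in currentTerm, -1 if none
}

func main() {
	s := &Server{role: Follower, currentTerm: 0, votedFor: -1}
	fmt.Printf("role=%v term=%d votedFor=%d\n", s.role, s.currentTerm, s.votedFor)
}
```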

Heartbeats and timeouts

1. Every server starts in the follower role and starts an election timer

2. A follower expects to receive RPCs from a leader or a candidate

3. The leader must broadcast heartbeats to reset the followers' election timers

4. If a follower's election timer expires, it assumes the leader has crashed and initiates an election (a sketch follows)
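A hedged sketch of the follower-side election timer described above; the 150-300 ms randomized range and the channel-based structure are illustrative assumptions, not something prescribed by the post:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// runElectionTimer waits for either a heartbeat or a timeout. Each heartbeat
// from the leader resets the timer; if the timer fires, the follower converts
// to candidate and starts an election.
func runElectionTimer(heartbeat <-chan struct{}, becomeCandidate func()) {
	for {
		// Randomized timeout (e.g. 150-300 ms) reduces split votes.
		timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
		select {
		case <-heartbeat:
			// Leader is alive; loop and pick a fresh random timeout.
		case <-time.After(timeout):
			becomeCandidate()
			return
		}
	}
}

func main() {
	hb := make(chan struct{})
	go func() {
		// Simulated leader: two heartbeats, then a crash (no more heartbeats).
		for i := 0; i < 2; i++ {
			time.Sleep(100 * time.Millisecond)
			hb <- struct{}{}
		}
	}()
	done := make(chan struct{})
	go runElectionTimer(hb, func() {
		fmt.Println("election timeout: converting to candidate")
		close(done)
	})
	<-done
}
```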

Leader election

A server increments its currentTerm, converts from follower to candidate, sets votedFor to itself, and issues RequestVote RPCs in parallel, retrying until one of the following conditions is met:

1. It receives votes from a majority of servers: it converts to leader and broadcasts heartbeats

2. It receives an AppendEntries RPC from a legitimate leader: it converts to follower

3. The election times out with no server elected: it increments currentTerm and starts a new election

Additional details:

1. While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If that leader's term is not less than the candidate's local currentTerm, the candidate recognizes the leader as legitimate and steps down to follower; otherwise it remains a candidate and keeps collecting votes.

2. A candidate may neither win the election nor receive an RPC from another leader. This typically happens when several nodes start elections at the same time (a split vote), and eventually every candidate times out. To reduce such conflicts, Raft adopts a randomized backoff strategy: each candidate restarts its election timer with a random value, which greatly reduces the probability of repeated conflicts (see the sketch below).
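The candidate's decision logic can be sketched roughly as follows; the RPC transport is omitted, and the reply type and vote counting over a simulated set of replies are illustrative assumptions:

```go
package main

import "fmt"

// VoteReply is what a RequestVote RPC returns.
type VoteReply struct {
	Term        int
	VoteGranted bool
}

// tallyVotes decides the outcome of one election round for a candidate in
// `term` on a cluster of `clusterSize` servers. The candidate's own vote is
// counted implicitly.
func tallyVotes(term, clusterSize int, replies []VoteReply) string {
	votes := 1 // the candidate votes for itself (votedFor = self)
	for _, r := range replies {
		if r.Term > term {
			// A newer term exists somewhere: step down to follower.
			return "follower"
		}
		if r.VoteGranted {
			votes++
		}
	}
	if votes > clusterSize/2 {
		// Strict majority: become leader and start broadcasting heartbeats.
		return "leader"
	}
	// No majority (e.g. split vote): stay candidate, time out, retry.
	return "candidate"
}

func main() {
	replies := []VoteReply{
		{Term: 2, VoteGranted: true},
		{Term: 2, VoteGranted: true},
		{Term: 2, VoteGranted: false},
		{Term: 2, VoteGranted: false},
	}
	fmt.Println(tallyVotes(2, 5, replies)) // leader: 3 of 5 votes
}
```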

Log replication

Figure 4: Log Structure

Normal operation flow:

1. The client sends a command to the leader

2. The leader appends the command to its local log

3. The leader broadcasts AppendEntries RPCs to the followers

4. Once the log entry is committed:

1) The leader applies the command to its local state machine and returns the result to the client

2) The leader notifies the followers of committed log entries through subsequent AppendEntries RPCs

3) Each follower applies committed log entries to its local state machine (see the sketch below)
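A simplified sketch of the follower-side AppendEntries consistency check behind steps 3 and 4; the field names follow the Raft paper, but the 0-based indexing and everything else here are simplifications of my own:

```go
package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

// appendEntries applies the follower-side rules: reject stale leaders, reject
// requests whose prevLogIndex/prevLogTerm do not match the local log (the Log
// Matching check), otherwise drop any conflicting suffix and append.
// Indexing is 0-based for simplicity; the paper uses 1-based indexes.
func appendEntries(log []Entry, currentTerm, leaderTerm, prevLogIndex, prevLogTerm int, entries []Entry) ([]Entry, bool) {
	if leaderTerm < currentTerm {
		return log, false // stale leader
	}
	if prevLogIndex >= 0 {
		if prevLogIndex >= len(log) || log[prevLogIndex].Term != prevLogTerm {
			return log, false // mismatch: leader will retry with an earlier index
		}
	}
	// Delete conflicting entries after prevLogIndex, then append the new ones.
	log = append(log[:prevLogIndex+1], entries...)
	return log, true
}

func main() {
	follower := []Entry{{1, "set x 1"}, {1, "set y 2"}, {2, "set z 9"}} // stale entry at index 2
	updated, ok := appendEntries(follower, 3, 3, 1, 1, []Entry{{3, "set z 3"}})
	fmt.Println(ok, updated) // true [{1 set x 1} {1 set y 2} {3 set z 3}]
}
```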

Safety

To ensure correctness, the Raft algorithm guarantees that the following properties hold at all times:

1. Election Safety

At most one leader can be elected in any given term

2. Leader Append-Only

A leader never overwrites or deletes entries in its local log; it only appends new entries

3. Log Matching

If the logs on two nodes contain an entry with the same index and term, then the logs are identical in all entries up through that index

4. Leader Completeness

If a log entry is committed in a given term, that entry will be present in the log of the leader of every subsequent term

5. State Machine Safety

Once a server has applied a log entry at a given index to its local state machine, no server will ever apply a different log entry at that index

Intuitive Explanation:

To aid understanding of the Raft algorithm's correctness, informal proofs of the above properties are given here.

"Electionsafety": to disprove the law, assuming that a term is simultaneously elected to produce two Leadera and Leaderb, according to the electoral process definition, A and B must simultaneously obtain more than half of the nodes of the vote, at least the presence of node n at the same time give A and B votes, contradictions

Leader Append-Only: in Raft the leader's authority is supreme; when a follower's log diverges from the leader's, it is always the leader that overwrites and repairs the follower, never the other way around.

Log Matching: in two steps. First, show that log entries with the same index and term hold the same command; this follows directly from Election Safety, since each term has at most one leader and a leader creates at most one entry per index. Second, show that all preceding entries are also identical, by induction: in the initial state all logs are empty, so the property holds trivially; afterwards, every AppendEntries RPC carries the index and term of the entry immediately preceding the new ones, and if the follower's check finds an inconsistency, it rejects the request and enters the repair process. Therefore, whenever an AppendEntries call succeeds, the leader can be confident that the follower's log matches its own up through the new entries.

Leader Completeness: to satisfy this property, Raft adds a further restriction: a candidate's RequestVote RPC carries information about its local log, and a follower rejects the request if its own log is "more complete". "More complete" means the last entry has a larger term, or the same term with a larger index. With this restriction the property can be proved by contradiction: suppose the leader of term X commits a log entry, and let Y be the smallest term with Y > X whose leader does not contain that entry. Then some node N must both have accepted the entry from the leader of term X (as part of the committing majority) and have voted for the leader of term Y (as part of its electoral majority), and the contradiction with the voting restriction above follows readily (the comparison is sketched below).

State Machine Safety: given Leader Completeness, this property follows immediately.
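The "more complete" comparison used in the Leader Completeness argument can be sketched as follows; a voter grants its vote only if the candidate's last log entry is at least as up to date as its own (function and parameter names are illustrative):

```go
package main

import "fmt"

// candidateUpToDate reports whether a candidate's log (identified by the term
// and index of its last entry) is at least as complete as the voter's log.
// The voter rejects the RequestVote if its own log is "more complete":
// a larger last term, or the same last term with a larger last index.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	fmt.Println(candidateUpToDate(3, 5, 2, 9)) // true: higher last term wins
	fmt.Println(candidateUpToDate(2, 4, 2, 9)) // false: same term, shorter log
}
```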

Cluster Membership Changes

In real systems, dynamically adding or removing machines is unavoidable due to hardware failures, load changes, and other factors. The simplest approach is to take the system offline temporarily, modify the configuration, and bring it back online, but this has two drawbacks:

1. The system is temporarily unavailable

2. Manual operation is error-prone

Figure 5: Switching Configurations Directly Online

A failed attempt: broadcast the configuration change through an ops tool. Obviously, in a distributed environment it is impossible for all nodes to switch to the new configuration at the same instant. It is easy to see that there is a window of time during which the old and new configurations can each form their own majority, producing a conflict.

Two-phase scheme: to avoid this conflict, Raft introduces a joint intermediate configuration and uses a two-phase approach. When the leader receives a configuration switch command (Cold -> Cnew), it replicates Cold,new as a log entry; any server adds the new configuration entry to its local log and bases all subsequent decisions on its most recent configuration entry, regardless of whether that entry is committed. Once the leader confirms that Cold,new has been committed, it uses the same procedure to replicate and commit Cnew. As shown in the configuration switch process below, this method eliminates the possibility of Cold and Cnew forming disjoint majorities at the same time, ensuring consistency throughout the switch (a sketch of the joint-majority check follows the figure).

Figure 6: Joint Consensus
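The joint-consensus rule that every decision during the Cold,new phase needs a majority in both the old and the new configuration can be sketched like this (server names and data structures are illustrative):

```go
package main

import "fmt"

// jointMajority reports whether the set of agreeing servers forms a majority
// in BOTH the old and the new configuration, which is what every decision
// during the C(old,new) phase must satisfy.
func jointMajority(agree map[string]bool, oldCfg, newCfg []string) bool {
	count := func(cfg []string) int {
		n := 0
		for _, s := range cfg {
			if agree[s] {
				n++
			}
		}
		return n
	}
	return count(oldCfg) > len(oldCfg)/2 && count(newCfg) > len(newCfg)/2
}

func main() {
	oldCfg := []string{"s1", "s2", "s3"}
	newCfg := []string{"s2", "s3", "s4", "s5"}
	agree := map[string]bool{"s2": true, "s3": true, "s4": true}
	fmt.Println(jointMajority(agree, oldCfg, newCfg)) // true: 2/3 old and 3/4 new
}
```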

Log compaction

As the system keeps running, the operation log grows, log replay takes longer, and system availability drops. Snapshots are the most common means of log compaction, and Raft is no exception. The approach is as follows:

Figure 7: Snapshot-based Log Compaction

Unlike Raft's other leader-based operations, snapshots are generated independently by each node. Besides compacting the log, snapshots are also used to bring state up to date on slow followers and newly added servers; Raft uses the InstallSnapshot RPC for this, which will not be elaborated here (a sketch follows).
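A minimal sketch of snapshot-based compaction; the lastIncludedIndex and lastIncludedTerm fields follow the paper's InstallSnapshot description, while the in-memory state map and 0-based indexing are simplifications of my own:

```go
package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

// Snapshot captures the state machine up to and including LastIncludedIndex,
// so every log entry at or below that index can be discarded.
type Snapshot struct {
	LastIncludedIndex int
	LastIncludedTerm  int
	State             map[string]string // serialized state machine, kept simple here
}

// compact discards the log prefix covered by the snapshot (0-based indexing).
func compact(log []Entry, snap Snapshot) []Entry {
	if snap.LastIncludedIndex+1 >= len(log) {
		return nil
	}
	return append([]Entry(nil), log[snap.LastIncludedIndex+1:]...)
}

func main() {
	log := []Entry{{1, "set x 1"}, {1, "set y 2"}, {2, "set x 3"}}
	snap := Snapshot{LastIncludedIndex: 1, LastIncludedTerm: 1, State: map[string]string{"x": "1", "y": "2"}}
	fmt.Println(compact(log, snap)) // [{2 set x 3}]
}
```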

Client interaction

A typical user interaction process:

1. The client sends a command to the leader

If the leader is unknown, the client picks any node; if that node is not the leader, it redirects the client to the leader

2. The leader appends the command to its log, waits for it to commit, applies it to the local state machine, and finally responds to the client

3. If the request times out, the client keeps retrying until it receives a response

An attentive reader may have spotted a loophole here: if the leader crashes after committing a command but before responding to the client, a naive client retry can cause the command to be executed more than once.

Raft's solution: the client assigns each command a unique identifier; before accepting a command, the leader checks its local log, and if the identifier is already present it responds directly with the earlier result. In this way, as long as the client itself does not crash, "exactly once" semantics can be achieved (see the sketch below).
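The deduplication idea can be sketched as follows; keying cached responses by a client-supplied command identifier is an assumption consistent with the description above, not a prescribed API:

```go
package main

import "fmt"

// Session tracks, per command identifier, the response already produced, so a
// retried command is answered without being applied a second time.
type Session struct {
	applied map[string]string // command id -> cached response
}

// Execute applies a command at most once for a given id.
func (s *Session) Execute(id, command string, apply func(string) string) string {
	if resp, ok := s.applied[id]; ok {
		return resp // duplicate retry: answer directly, do not re-apply
	}
	resp := apply(command)
	s.applied[id] = resp
	return resp
}

func main() {
	s := &Session{applied: map[string]string{}}
	counter := 0
	incr := func(cmd string) string { counter++; return fmt.Sprintf("counter=%d", counter) }

	fmt.Println(s.Execute("cmd-42", "incr", incr)) // counter=1
	fmt.Println(s.Execute("cmd-42", "incr", incr)) // retry: still counter=1
}
```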

Personal advice: try to make operations idempotent wherever possible; it simplifies the system design!

Development status

Although the Raft algorithm is still young, it has attracted wide attention in industry. I strongly recommend visiting its official website, http://raftconsensus.github.io, which has a wealth of learning material. Open-source implementations of Raft already cover almost every mainstream language (C/C++/Java/Python/JavaScript ...), which speaks to its popularity. It seems that whether a technique takes hold in industry sometimes depends most on whether it is understandable and implementable.

Application Scenarios

Tim Yang, in his article "Common application scenarios of Paxos in large-scale systems", lists some common applications of Paxos:

1. Database replication, log replication ...

2. Naming Service

3. Configuration Management

4. User Roles

5. Number Assignment

Note: scenarios such as distributed locks and data replication are easy to understand, but for "naming service"-style applications I am still unsure how the actual operation works. From some reading, ZooKeeper's watch mechanism can notify registered clients in real time when the configuration changes, but how is reliable delivery of the notification guaranteed, and can the system end up with both the old and the new configuration in use at once? If you have relevant experience, please contact me privately.

