A Brief Introduction to Raft

Source: Internet
Author: User

Consensus algorithm: Raft

Raft node states

A Raft cluster consists of several server nodes, usually 5, which allows the system to tolerate the failure of 2 nodes. At any given time, each node is in one of the following three states:

    • follower: All nodes start in the follower state. If a follower does not receive messages from a leader, it becomes a candidate.
    • candidate: A candidate requests votes from the other nodes. If it obtains votes from a majority, it becomes the leader. This process is called leader election.
    • leader: All modifications to the system go through the leader first.
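
As a rough illustration of the arithmetic above, the following sketch computes the majority size and the number of failures a cluster of a given size can tolerate. The names State, majority, and tolerated_failures are illustrative assumptions, not taken from any Raft library.

```python
from enum import Enum


class State(Enum):
    """The three states a Raft node can be in."""
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


def majority(cluster_size: int) -> int:
    """Smallest number of nodes that is more than half of the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Nodes that may fail while a majority is still running."""
    return cluster_size - majority(cluster_size)
```

For a 5-node cluster, majority(5) is 3 and tolerated_failures(5) is 2, which matches the figure quoted above.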

The Raft consensus algorithm

Raft simplifies the management of the replicated log by electing a leader. For example, log entries flow only from the leader to the followers.

With this leader-based approach, the Raft algorithm can be decomposed into three sub-problems:

Leader election: When the existing leader fails, a new leader must be chosen.

Log replication: The leader accepts log entries from clients and replicates them across the cluster.

Safety: If any server has applied a log entry at a given index to its state machine, no other server will apply a different log entry for that index.

Leader election

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they are all followers. If a follower receives no messages for a period of time (the election timeout), it assumes that there is no available leader and starts an election to choose a new one. To start an election, the follower increments its current term and transitions to the candidate state.

It then votes for itself and sends RequestVote RPCs in parallel to the other servers in the cluster. It remains a candidate until one of the following happens:
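
The transition from follower to candidate can be sketched as follows. This is a minimal sketch under stated assumptions: the Node class, its field names, and the dictionary used as the vote request are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Node:
    """Hypothetical per-node election state (names are assumptions)."""
    node_id: str
    current_term: int = 0
    state: str = "follower"
    voted_for: Optional[str] = None
    votes_received: Set[str] = field(default_factory=set)


def start_election(node: Node) -> dict:
    """Run when the election timeout fires without hearing from a leader."""
    node.current_term += 1          # move to a new term
    node.state = "candidate"        # follower becomes candidate
    node.voted_for = node.node_id   # the candidate votes for itself
    node.votes_received = {node.node_id}
    # The request that would be sent in parallel to every other server.
    return {
        "type": "RequestVote",
        "term": node.current_term,
        "candidate_id": node.node_id,
    }
```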

    • It wins the election.

      • A candidate becomes leader if it receives votes from more than half of the nodes. Each node votes on a first-come-first-served basis and casts at most one vote per term, which guarantees that at most one node can win a majority in a given term.
      • Once a node wins the election, it becomes leader and sends heartbeat messages to all other nodes so that they fall back to (or remain in) the follower state.
    • Another server becomes leader.

      While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. There are two cases:

      • If the leader's term is greater than or equal to the candidate's own term, the candidate reverts to the follower state.
      • If the leader's term is less than the candidate's own term, the candidate rejects the RPC and remains a candidate.
    • A period of time passes and no one wins.

      • Many followers may become candidates at the same time, so that the vote is split and no candidate obtains a majority. When this happens, each candidate times out, increments its term, and starts a new round of RequestVote RPCs. Without special handling, split votes could repeat indefinitely.
      • Raft uses randomized election timeouts to avoid this. Each server picks its timeout at random from an interval such as 150-300 ms. With this mechanism, in most cases only one server times out first, becomes a candidate, wins votes from a majority of servers, and becomes leader. Each server restarts its timer whenever it receives a heartbeat from the leader, so no election is triggered while the leader is working normally.
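
The randomized timeout can be sketched as a small timer class. This is only an illustration: ElectionTimer and its methods are hypothetical names, time is passed in explicitly as milliseconds to keep the sketch deterministic, and the 150-300 ms interval is the example range mentioned above.

```python
import random

ELECTION_TIMEOUT_MIN_MS = 150.0
ELECTION_TIMEOUT_MAX_MS = 300.0


class ElectionTimer:
    """Tracks when a server should give up waiting for a leader."""

    def __init__(self, rng: random.Random, now_ms: float) -> None:
        self.rng = rng
        self.deadline_ms = 0.0
        self.reset(now_ms)

    def reset(self, now_ms: float) -> None:
        """Called at startup and whenever a leader heartbeat arrives."""
        timeout = self.rng.uniform(ELECTION_TIMEOUT_MIN_MS,
                                   ELECTION_TIMEOUT_MAX_MS)
        self.deadline_ms = now_ms + timeout

    def expired(self, now_ms: float) -> bool:
        """True once the timeout has elapsed with no heartbeat."""
        return now_ms >= self.deadline_ms
```

Because every server draws a fresh random timeout on each reset, two servers rarely expire at the same moment, which is what keeps split votes uncommon.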

Log replication

Once a leader is elected, it begins accepting client requests, each containing a command to be applied to the state machine. The leader appends the command to its log as a new entry, then sends AppendEntries RPCs in parallel to the other servers. When the entry has been replicated on a majority of servers, the leader applies it to its state machine and returns the result to the client.

If a follower crashes or runs slowly, the leader retries AppendEntries indefinitely until all followers have stored the log entry.
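
One way to picture "replicated on a majority" is to find the highest log index that more than half of the servers already store. The sketch below assumes the leader tracks, for each follower, the highest index known to be replicated there; the function and parameter names are hypothetical. It deliberately ignores the extra rule about entries from previous terms that is covered in the safety discussion.

```python
from typing import List


def majority_replicated_index(leader_last_index: int,
                              follower_match_index: List[int]) -> int:
    """Highest log index stored on a majority of servers.

    leader_last_index: index of the leader's last log entry.
    follower_match_index: for each follower, the highest index
    known to be stored on that follower (0 if none).
    """
    indexes = sorted([leader_last_index, *follower_match_index],
                     reverse=True)
    quorum = len(indexes) // 2 + 1
    # The quorum-th largest value is held by at least `quorum` servers.
    return indexes[quorum - 1]
```

In a 5-server cluster where the leader is at index 5 and the followers are at 5, 4, 2, and 1, the function returns 4: entries up to index 4 are on three servers.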

Raft's log replication guarantees the following properties (the Log Matching Property):

    • If two entries in different logs have the same index and term, they store the same command.
    • If two entries in different logs have the same index and term, the logs are identical in all preceding entries.

The first property follows from these facts:

    • A leader creates at most one entry with a given index in a given term.
    • Log entries never change their position in the log.

The second property is guaranteed by the following:

    • AppendEntries performs a consistency check. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries.

If the follower does not find an entry in its log with the same index and term, it refuses the new entries, so the second property holds.
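
The consistency check can be sketched on the follower side as follows. This is a simplified illustration, not a complete AppendEntries handler: it leaves out term comparison with the sender and commit-index handling, and the Entry class and parameter names are assumptions. Log indexes are 1-based, and prev_index 0 means the new entries start at the very beginning of the log.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    term: int
    command: str


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Apply the consistency check, then store the new entries."""
    # Reject if the follower has no entry at prev_index.
    if prev_index > len(log):
        return False
    # Reject if the entry at prev_index has a different term.
    if prev_index > 0 and log[prev_index - 1].term != prev_term:
        return False
    for offset, entry in enumerate(entries):
        index = prev_index + 1 + offset
        if index <= len(log):
            if log[index - 1].term == entry.term:
                continue  # already stored, nothing to do
            # Conflict: drop this entry and everything after it.
            del log[index - 1:]
        log.append(entry)
    return True
```

A rejected request tells the leader that the follower's log diverges earlier, so the leader can retry with an earlier prev_index.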

Safety

Election restriction

In some consensus algorithms, a server can be elected leader even if it does not contain all previously committed log entries. Those algorithms need extra mechanisms to transfer the missing entries from other servers to the new leader, which adds complexity. Raft takes a simpler approach: it guarantees that all committed entries are already present on any newly elected leader, so in Raft log entries flow only from the leader to the followers.

To achieve this, Raft requires a candidate to receive votes from a majority of servers in order to be elected. Every committed entry is stored on a majority of servers, so at least one server in any voting majority holds all committed entries. A server grants its vote only if the candidate's log is at least as up-to-date as its own, so the elected leader must hold all committed entries.
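
The voting rule can be sketched as a vote handler that combines one vote per term with the log comparison: the log with the later last term is more up-to-date, and if the last terms are equal, the longer log is. The VoterState class and the function names below are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoterState:
    current_term: int
    voted_for: Optional[str]
    last_log_term: int
    last_log_index: int


def log_is_up_to_date(cand_last_term: int, cand_last_index: int,
                      voter_last_term: int, voter_last_index: int) -> bool:
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index


def handle_request_vote(voter: VoterState, term: int, candidate_id: str,
                        last_log_term: int, last_log_index: int) -> bool:
    """Return True if the voter grants its vote to the candidate."""
    if term < voter.current_term:
        return False                      # stale candidate
    if term > voter.current_term:
        voter.current_term = term         # new term: vote not yet cast
        voter.voted_for = None
    if voter.voted_for not in (None, candidate_id):
        return False                      # already voted for someone else
    if not log_is_up_to_date(last_log_term, last_log_index,
                             voter.last_log_term, voter.last_log_index):
        return False                      # candidate's log is behind
    voter.voted_for = candidate_id
    return True
```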

Committing entries from previous terms

A leader knows that an entry from its current term is committed once that entry is stored on a majority of servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating it. However, a leader cannot conclude that an entry from a previous term is committed merely because it is stored on a majority of servers. The scenario below, described in terms of figure panels (a) to (e), shows a situation where an old log entry stored on a majority of servers can still be overwritten by a future leader.

For example, in (c) a log entry has been replicated to a majority of servers, yet it can still be overwritten, as happens in (d). The sequence of events is as follows:

    • In (a), S1 is elected leader and replicates the log entry at index 2 to S2.
    • In (b), S1 crashes. S5 is elected leader with votes from S3, S4, and itself, and accepts a new log entry (term 3) from a client.
    • In (c), S5 crashes. S1 restarts, is elected leader, and continues replicating the log entry from term 2. S1 crashes again before that entry is committed.
    • In (d), S5 is elected leader again and overwrites the entry at log index 2 on the other servers with its own entry from term 3.

To eliminate this problem, Raft never commits log entries from previous terms by counting replicas. Only entries from the leader's current term are committed by counting replicas; once an entry from the current term has been committed this way, all preceding entries are committed indirectly because of the Log Matching Property. For example, in (e), if S1 replicates its entry from term 4 to a majority of servers before crashing, that entry is committed, the earlier entry from term 2 is committed along with it, and S5 cannot be elected leader.
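
The commit rule can be sketched as a function the leader might run after each replication round. This is a minimal sketch under stated assumptions: advance_commit_index and its parameters are hypothetical names, log_terms[i] holds the term of the entry at index i + 1, and match_index lists the highest replicated index on every server including the leader.

```python
from typing import List


def advance_commit_index(log_terms: List[int], match_index: List[int],
                         commit_index: int, current_term: int) -> int:
    """Return the new commit index for the leader.

    An index N is committed by counting replicas only if a majority
    stores it AND the entry at N was created in the current term.
    """
    quorum = len(match_index) // 2 + 1
    for n in range(len(log_terms), commit_index, -1):
        replicas = sum(1 for m in match_index if m >= n)
        if replicas >= quorum and log_terms[n - 1] == current_term:
            return n  # everything up to n is now committed
    return commit_index
```

With five servers and a leader in term 4 whose log has terms [1, 2, 4], a match_index of [3, 2, 2, 1, 1] leaves the commit index unchanged even though index 2 is on three servers, because that entry is from term 2. Once match_index becomes [3, 3, 3, 1, 1], index 3 (term 4) is committed and index 2 is committed with it.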

Safety argument

The argument is by contradiction. Suppose the leader for term T (leader T) commits a log entry during its term, but that entry is not stored by the leader of some future term. Let U be the smallest term greater than T whose leader (leader U) does not store the entry.

If S1 (the leader for term T) commits a new entry in its term and S5 is later elected leader for term U, then at least one server, such as S3, must have both accepted the entry from S1 and voted for S5.
    1. The committed entry must have been absent from leader U's log at the time of its election (leaders never delete or overwrite entries).
    2. Leader T replicated the entry on a majority of the cluster, and leader U received votes from a majority of the cluster. So at least one server (the voter) both accepted the entry from leader T and voted for leader U. The voter is the key to reaching a contradiction.
    3. The voter must have accepted the committed entry from leader T before voting for leader U; otherwise it would have rejected the AppendEntries request from leader T, because its current term would have been higher than T.
    4. The voter still stored the entry when it voted for leader U, because every intervening leader contained the entry (by assumption), leaders never remove entries, and followers remove entries only when they conflict with the leader.
    5. The voter granted its vote to leader U, so leader U's log must have been at least as up-to-date as the voter's. This leads to one of two contradictions.

      • First, if the voter and leader U shared the same last log term, then leader U's log must have been at least as long as the voter's, so it contained every entry in the voter's log. This is a contradiction, because the voter contained the committed entry and leader U was assumed not to.
      • Otherwise, leader U's last log term must have been larger than the voter's. It was also larger than T, because the voter's last log term was at least T (the voter contains the committed entry from term T). The earlier leader that created leader U's last log entry must have contained the committed entry (by assumption, leader U is the first leader that does not). Then, by the Log Matching Property, leader U's log must also contain the committed entry, which is a contradiction.
    6. Therefore the assumption does not hold: the leaders of all terms greater than T must contain all entries committed in term T. The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly.
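
Step 2 of the argument relies on the fact that any two majorities of the same cluster share at least one server. The brute-force check below illustrates this for a small cluster; the function names and server labels are chosen only for this example.

```python
from itertools import combinations
from typing import List, Set


def majorities(servers: List[str]) -> List[Set[str]]:
    """Every subset of servers that contains more than half of them."""
    quorum = len(servers) // 2 + 1
    return [set(group)
            for size in range(quorum, len(servers) + 1)
            for group in combinations(servers, size)]


def all_majorities_overlap(servers: List[str]) -> bool:
    """True if every pair of majorities has a server in common."""
    groups = majorities(servers)
    return all(a & b for a in groups for b in groups)
```

For the five servers S1 to S5 the check passes, so the set that stored the committed entry and the set that voted for leader U must have at least one member in common.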

Follower and candidate crashes

If a follower or candidate crashes, it is handled as follows:

    • RequestVote and AppendEntries RPCs sent to the crashed server are retried indefinitely until they succeed.
    • Raft RPCs are idempotent: a follower that receives an AppendEntries request containing entries it has already stored simply ignores those entries.
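
The combination of retries and idempotent handling can be sketched with a toy follower that drops its first few requests. Everything here is hypothetical and simplified: the Follower class, the send_with_retry helper, and the bounded max_attempts (a real leader would keep retrying for as long as it remains leader).

```python
from typing import Callable, List


class Follower:
    """Toy follower that is unreachable for its first few requests."""

    def __init__(self, drop_first: int) -> None:
        self.log: List[str] = []
        self.drop_remaining = drop_first

    def append(self, index: int, command: str) -> bool:
        if self.drop_remaining > 0:
            self.drop_remaining -= 1
            return False              # crashed or slow: no success reply
        if index <= len(self.log):
            return True               # already stored: ignore duplicate
        if index > len(self.log) + 1:
            return False              # gap in the log: refuse
        self.log.append(command)
        return True


def send_with_retry(send: Callable[[], bool], max_attempts: int) -> int:
    """Call send() until it succeeds; return the number of attempts."""
    for attempt in range(1, max_attempts + 1):
        if send():
            return attempt
    raise RuntimeError("peer still unreachable after max_attempts")
```

Resending the same entry after it has been stored leaves the follower's log unchanged, which is why retrying is safe.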

Timing and availability

Leader election is the aspect of Raft where timing is most critical. Raft can elect and maintain a steady leader as long as the system satisfies the following timing requirement:

broadcastTime << electionTimeout << MTBF
    • broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses.
    • electionTimeout is the election timeout described above.
    • MTBF (mean time between failures) is the average time between failures for a single server.

The election timeout must be much greater than the broadcast time so that followers do not start a new election merely because the leader's heartbeat has not yet arrived.

The election timeout must be much smaller than MTBF so that the system makes steady progress: a new leader can be elected well before further failures would leave fewer than a majority of servers running.

The broadcast time is typically between 0.5 ms and 20 ms, and the MTBF of a server is typically several months or more. The election timeout is therefore usually chosen between 10 ms and 500 ms, so when the leader fails, a new one can be elected within a relatively short time.
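
The timing requirement can be turned into a simple sanity check. The sketch below is an assumption-laden illustration: the function name is hypothetical, and treating "<<" as "at least 10 times smaller" is an arbitrary threshold chosen for the example, not something Raft prescribes.

```python
def timing_ok(broadcast_ms: float, election_timeout_ms: float,
              mtbf_ms: float, factor: float = 10.0) -> bool:
    """Check broadcastTime << electionTimeout << MTBF.

    `factor` is the assumed minimum ratio standing in for "<<".
    """
    return (broadcast_ms * factor <= election_timeout_ms
            and election_timeout_ms * factor <= mtbf_ms)


# One month expressed in milliseconds, as a stand-in for MTBF.
MONTH_MS = 30 * 24 * 3600 * 1000
```

With a 20 ms broadcast time and an MTBF of one month, a 300 ms election timeout passes this check, while a 50 ms timeout does not.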

Animated demonstration of Raft

HTTP://THESECRETLIVESOFDATA.C ...
