Distributed Consensus Algorithm: Raft



In the previous article, we covered the Paxos protocol. This article describes the Raft protocol, which is simpler and easier to implement than Paxos. For the Raft paper and project implementations, see https://raft.github.io/, which hosts extensive documentation and an animated demonstration that is very helpful for understanding the protocol.
Concepts and terminology
Leader: the node that serves client requests (generating and replicating log entries). At any time there is at most one leader in a Raft cluster.
Follower: a passive node. It sends no requests of its own and only responds to requests from the leader or a candidate. If it receives a client request, it forwards the request to the leader.
Candidate: an intermediate role used during elections. If a follower receives no heartbeat or log entries from the leader within the election timeout, it switches to the candidate state and starts an election.
TermId: the term number. Time is divided into terms; each election produces a new termId, and each term has at most one leader. TermId plays the role that proposalId plays in Paxos.
RequestVote: the vote request. A candidate sends it during an election and becomes leader after receiving responses from a quorum (majority).
AppendEntries: the message the leader uses to replicate log entries and to send heartbeats.
Election timeout: if a follower receives no message (log entries or heartbeat) within this period, it times out and starts an election.
The Raft protocol consists of three parts: leader election, log replication, and membership change. A sketch of the per-node state implied by these terms follows.
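
As a concrete reference for the terminology, here is a minimal sketch in Go of the state each node keeps. Field and type names follow the Raft paper where possible; everything else (the package layout, the sentinel-entry convention) is an illustrative assumption, not part of the protocol. The later sketches in this article reuse these types.

```go
package raft

// Role is the state a node is in at any moment.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// Entry is one replicated log record; every entry carries the term
// in which it was created.
type Entry struct {
	Term  uint64
	Index uint64
	Data  []byte
}

// Node gathers the per-node state implied by the terminology above. The log
// is assumed to start with a sentinel entry at index 0, so n.log[i].Index == i.
type Node struct {
	id          int
	role        Role
	currentTerm uint64  // latest term this node has seen (TermId)
	votedFor    int     // candidate voted for in currentTerm; -1 if none
	log         []Entry // the replicated log
	commitIndex uint64  // highest index known to be committed
	lastApplied uint64  // highest index applied to the state machine
}
```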

Principles and features of Raft Protocol
A. There is a single leader in the system, and all requests are handled by it. The leader replicates each request to the followers and replies to the client only after a majority has responded.
B. The leader never modifies its own log; it only appends.
C. Log entries flow only from the leader to the followers, and the leader contains all committed entries.
D. If a log entry reaches a majority in some term, that entry will be present in every future term.
E. If a node has applied the log entry at a given term and index, every other node applies the same entry at that position.
F. Consistency does not rely on the physical clocks of the nodes; it relies only on the logically monotonic growth of term-id and log-id.
G. Availability: as long as a majority of the machines are running and can communicate with each other, the system remains available. For example, a five-node cluster tolerates the failure of two nodes.
H. Understandability: compared with the Paxos protocol, the logic is clear and easy to understand, and there are many engineering implementations, whereas Paxos is hard to understand and has few complete engineering implementations.
I. The main implementation has three parts: leader election, log replication (including snapshot replication), and membership change; message types include vote requests, append entries (heartbeats), and install snapshot.

Leader Election Process
Keywords: randomized timeout, first-come-first-served voting
When a server starts, it begins in the follower state; likewise, after a leader fails, the remaining servers are all followers. If a follower receives no heartbeat from the leader within the election timeout, it switches to the candidate state and starts an election. To see why split votes are a danger, note that Raft stipulates each server may cast at most one vote per term. Suppose leader A fails in a five-node cluster (A, B, C, D, E), leaving four nodes, and that B and D both hold the latest log and start elections at the same time: each may collect two votes, no majority forms, and no leader can be elected. Raft avoids this with a randomized timeout mechanism: each server draws its election timeout at random from a fixed interval, so the timeouts differ. After the leader fails, the follower with the shortest timeout (and an up-to-date log) starts its election first and is usually elected. Once a candidate becomes leader, it immediately sends heartbeats to the other servers to suppress a new round of elections. A sketch of the randomized timeout follows.
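
A minimal sketch of the randomized timeout, assuming the commonly cited 150-300 ms range from the Raft paper (the exact interval is a tuning choice, not a protocol requirement):

```go
package raft

import (
	"math/rand"
	"time"
)

// randomElectionTimeout draws the election timeout uniformly from a fixed
// interval, here [150 ms, 300 ms). Because each follower draws a different
// value, one of them usually times out first and completes its election
// before the others even become candidates.
func randomElectionTimeout() time.Duration {
	const base = 150 * time.Millisecond
	return base + time.Duration(rand.Int63n(int64(base)))
}
```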

RequestVote arguments: (term, candidateId, lastLogTerm, lastLogIndex)
Candidate process (sketched in code below):
1. It receives no log entries (or heartbeat) from the leader within the timeout period.
2. It switches to the candidate state, increments currentTerm, and resets the election timeout.
3. It broadcasts a vote request to all nodes and waits for responses. Three outcomes are possible:
(1) It receives responses from a majority and becomes the leader.
(2) It receives a heartbeat from a leader whose term >= currentTerm and switches to the follower state; otherwise it remains a candidate.
(3) If no majority is reached within the timeout and no leader heartbeat arrives, the vote was probably split; it increments currentTerm and starts a new round of elections.
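
A sketch of the candidate side, reusing the Node and Entry types from earlier. The send callback is an assumed synchronous transport helper, not part of the protocol:

```go
package raft

// RequestVoteArgs mirrors the vote-request message above.
type RequestVoteArgs struct {
	Term         uint64
	CandidateId  int
	LastLogTerm  uint64
	LastLogIndex uint64
}

// startElection sketches candidate steps 2-3. It returns true if this
// node won the election.
func (n *Node) startElection(peers []int, send func(peer int, a RequestVoteArgs) (granted bool, term uint64)) bool {
	n.role = Candidate
	n.currentTerm++   // step 2: increment currentTerm
	n.votedFor = n.id // vote for ourselves

	last := n.log[len(n.log)-1]
	args := RequestVoteArgs{
		Term:         n.currentTerm,
		CandidateId:  n.id,
		LastLogTerm:  last.Term,
		LastLogIndex: last.Index,
	}

	votes := 1 // our own vote
	for _, p := range peers {
		granted, term := send(p, args)
		if term > n.currentTerm { // outcome (2): a newer term exists, step down
			n.currentTerm, n.role = term, Follower
			return false
		}
		if granted {
			votes++
		}
	}
	if votes > (len(peers)+1)/2 { // outcome (1): majority of the full cluster
		n.role = Leader
		return true
	}
	return false // outcome (3): likely a split vote; retry after a new random timeout
}
```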

Follower process (sketched in code below):
1. If the candidate's term < currentTerm, return the newer term so the candidate can update itself.
2. If the follower has not yet voted in this term (or has already voted for this candidateId), and the candidate's log (lastLogTerm, lastLogIndex) is at least as up to date as its own, it grants the vote.
Note: within one term, each node casts at most one vote, following the first-come-first-served principle.
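
The follower side, as a sketch under the same assumptions:

```go
package raft

// handleRequestVote enforces the two rules above: at most one vote per term,
// granted only if the candidate's log is at least as up to date as ours.
func (n *Node) handleRequestVote(a RequestVoteArgs) (granted bool, term uint64) {
	if a.Term < n.currentTerm {
		return false, n.currentTerm // rule 1: the stale candidate learns the newer term
	}
	if a.Term > n.currentTerm { // a newer term: adopt it and forget any old vote
		n.currentTerm, n.votedFor, n.role = a.Term, -1, Follower
	}
	last := n.log[len(n.log)-1]
	upToDate := a.LastLogTerm > last.Term ||
		(a.LastLogTerm == last.Term && a.LastLogIndex >= last.Index)
	if (n.votedFor == -1 || n.votedFor == a.CandidateId) && upToDate {
		n.votedFor = a.CandidateId // rule 2: first come, first served
		return true, n.currentTerm
	}
	return false, n.currentTerm
}
```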

Log replication process
Keywords: log continuity, majority, leader log immutable
When the leader sends a log entry to a follower, it also carries the term and index of the immediately preceding entry. On receiving them, the follower looks for the entry with that term number and index in its own log. If such an entry exists and matches, the new entries are accepted; otherwise the leader decrements the log index and retries until it finds a position where its log agrees with the follower's. The follower then deletes all entries after that position and appends the entries sent by the leader. Once the append succeeds, the follower's log is consistent with the leader's. Only after a majority of followers have acknowledged an entry is the transaction considered committed and success returned to the client.
AppendEntries arguments: (term, leaderId, prevLogIndex, prevLogTerm, entries[], leaderCommitIndex)
Leader process:
1. It receives a client request and persists the log entry locally.
2. It sends the entry to every node.
3. Once a majority responds, it commits and replies to the client.
Note:
(1) If lastLogIndex >= nextIndex for a follower, send the entries starting at nextIndex.
- If the follower returns success, update that follower's nextIndex and matchIndex.
- If it returns failure, the follower's log is inconsistent with the leader's; decrement nextIndex and retry.
(2) If there exists an N > commitIndex such that a majority of matchIndex[i] >= N and log[N].term == currentTerm, set commitIndex = N (sketched below).
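
Note (2) is the commit rule; here is a minimal sketch of it, where matchIndex maps each follower to the highest index known to be replicated there:

```go
package raft

// advanceCommitIndex finds the largest N > commitIndex replicated on a
// majority and commits it, but only if log[N] was created in currentTerm
// (the restriction explained in Q&A 7 below).
func (n *Node) advanceCommitIndex(matchIndex map[int]uint64, clusterSize int) {
	for N := uint64(len(n.log) - 1); N > n.commitIndex; N-- {
		count := 1 // the leader itself always holds the entry
		for _, m := range matchIndex {
			if m >= N {
				count++
			}
		}
		if count > clusterSize/2 && n.log[N].Term == n.currentTerm {
			n.commitIndex = N // everything up to and including N is committed
			return
		}
	}
}
```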

Follower process (sketched in code below):
1. Compare the incoming term with its own currentTerm; if term < currentTerm, return false.
2. If no entry matching (prevLogIndex, prevLogTerm) exists, the follower's log is still missing entries; return false.
3. If an existing entry conflicts with (prevLogIndex, prevLogTerm), delete the conflicting entry and everything after it, taking the leader's log as authoritative.
4. Append the entries sent by the leader.
5. If leaderCommitIndex > commitIndex, there is a new commit point: apply the newly committed entries and set commitIndex = min(leaderCommitIndex, index of last new entry).
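
The five follower steps, sketched with the same Node type and sentinel-entry assumption as before (the min call is the Go 1.21 builtin):

```go
package raft

// AppendEntriesArgs mirrors the append-entries message above; Entries is
// empty for a pure heartbeat.
type AppendEntriesArgs struct {
	Term              uint64
	LeaderId          int
	PrevLogIndex      uint64
	PrevLogTerm       uint64
	Entries           []Entry
	LeaderCommitIndex uint64
}

// handleAppendEntries returns false when the leader must back up nextIndex.
func (n *Node) handleAppendEntries(a AppendEntriesArgs) bool {
	if a.Term < n.currentTerm {
		return false // step 1: stale leader
	}
	if a.PrevLogIndex >= uint64(len(n.log)) {
		return false // step 2: we are still missing entries before prevLogIndex
	}
	if n.log[a.PrevLogIndex].Term != a.PrevLogTerm {
		n.log = n.log[:a.PrevLogIndex] // step 3: drop the conflicting suffix
		return false
	}
	n.log = append(n.log[:a.PrevLogIndex+1], a.Entries...) // step 4: append the leader's entries
	if a.LeaderCommitIndex > n.commitIndex {               // step 5: a new commit point
		n.commitIndex = min(a.LeaderCommitIndex, n.log[len(n.log)-1].Index)
	}
	return true
}
```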

Note: by default, when the logs do not match, the leader steps back one index at a time until it finds the match point. A simple optimization is to step back one term at a time; this reduces the number of network round trips and finds the match point sooner, at the cost of transmitting some unnecessary entries.

Snapshot Process
To keep logs from exhausting disk space, they must be cleaned up periodically, and a snapshot is taken before truncation so that new or lagging nodes can be restored from snapshot + logs.
Snapshot properties:
1. The snapshot records the last included log entry (termId, logIndex).
2. After a new snapshot is taken, the earlier log entries and earlier snapshots can be deleted.
Logs should not be deleted too aggressively: a machine recovering after a crash catches up fastest by replaying its logs, and if the needed logs are gone it must recover from a snapshot, which is comparatively slow.

Procedure for leader to send snapshots
InstallSnapshot arguments: (leaderTermId, lastIndex, lastTerm, offset, data[], done_flag)
1. If a follower's log lags too far behind (beyond a threshold), the leader sends a snapshot instead of log entries.
Note: snapshots must not be taken too frequently, or disk I/O pressure becomes high; on the other hand, stale logs still need periodic cleanup to relieve space pressure, and snapshots also speed up follower catch-up.

Procedure for follower to receive snapshots
1. If leaderTermId < currentTerm, return.
2. If this is the first chunk, create a new snapshot.
3. Write the data into the snapshot at the given offset.
4. If this is not the last chunk, wait for more chunks.
5. Once the snapshot is fully received, discard older snapshots.
6. Delete the log entries covered by the snapshot (see the sketch below).
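
A sketch of these six steps, with an in-memory buffer standing in for the snapshot file and applySnapshot as an assumed helper that hands the finished snapshot to the state machine and discards older snapshots:

```go
package raft

// InstallSnapshotArgs mirrors the parameters listed above.
type InstallSnapshotArgs struct {
	LeaderTermId uint64
	LastIndex    uint64 // last log entry covered by the snapshot
	LastTerm     uint64
	Offset       uint64
	Data         []byte
	Done         bool
}

// handleInstallSnapshot assumes chunks arrive in order of offset.
func (n *Node) handleInstallSnapshot(a InstallSnapshotArgs, buf *[]byte, applySnapshot func(data []byte)) {
	if a.LeaderTermId < n.currentTerm {
		return // step 1: stale leader
	}
	if a.Offset == 0 {
		*buf = nil // step 2: the first chunk starts a fresh snapshot
	}
	*buf = append((*buf)[:a.Offset], a.Data...) // step 3: write data at the offset
	if !a.Done {
		return // step 4: wait for more chunks
	}
	applySnapshot(*buf) // step 5: install it; older snapshots are discarded here
	// step 6: drop the log covered by the snapshot, keeping a sentinel entry
	// at the snapshot boundary so later consistency checks still work.
	n.log = []Entry{{Term: a.LastTerm, Index: a.LastIndex}}
}
```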

Cluster configuration change
C(old): the old configuration
C(new): the new configuration
C(old-new): the transition (joint) configuration; decisions must reach a majority in both C(old) and C(new).
Principle: two leaders must never exist during the configuration change.
Two-phase scheme: introduce the transition phase C(old-new).
Convention: a node uses the newest configuration it has received, even if that configuration entry is not yet committed, when computing majorities.
Change process:
1. The leader receives a request to switch from C(old) to C(new).
2. It creates the configuration log entry C(old-new); this entry must reach a majority in both C(old) and C(new).
3. Any node that has received C(old-new) uses the C(old-new) rule to decide whether a log entry has reached a majority (even if the C(old-new) entry itself is not yet committed).
Note: in stages 1-3, only a node still on C(old) can become leader, because C(old-new) has not yet reached a majority.
4. Once the C(old-new) entry commits (reaches a majority in both), neither C(old) nor C(new) alone can form a majority, so two leaders cannot appear.
5. The leader creates the configuration entry C(new) and broadcasts it to all nodes.
6. Likewise, any node that has received C(new) uses C(new) to decide whether a log entry has reached a majority.
Note: in stages 4-6, only a node whose log contains the C(old-new) configuration can become leader.
7. After the C(new) entry commits, C(old-new) can no longer reach a majority.
8. Nodes not in C(new) can shut down, and the change is complete.
Note: in stages 7-8, only a node whose log contains C(new) can become leader.
Therefore there is only one leader throughout the whole process. If the leader itself is not in C(new), it must step down automatically after the C(new) entry commits. The joint-majority rule is sketched below.
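
The heart of the argument is the joint-majority rule. A minimal sketch, where acks is the set of nodes that acknowledged an entry (or voted for a candidate):

```go
package raft

// Config lists the voting members of one configuration.
type Config struct {
	Members map[int]bool
}

// jointQuorum implements the C(old-new) rule: a set of acknowledgements
// counts only if it is a majority in BOTH configurations, so neither
// C(old) nor C(new) alone can elect a leader or commit an entry.
func jointQuorum(acks map[int]bool, cOld, cNew Config) bool {
	return majority(acks, cOld) && majority(acks, cNew)
}

func majority(acks map[int]bool, c Config) bool {
	count := 0
	for id := range c.Members {
		if acks[id] {
			count++
		}
	}
	return count > len(c.Members)/2
}
```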

Besides the two-phase scheme above, the Raft author also proposed a simpler single-phase scheme that adds or removes only one node at a time; this design needs no transition state, so we do not repeat it here. If you are interested, see his PhD thesis, listed in the references below.

Q & A
1. Is there a "livelock" in the Raft protocol? How is it solved?
A livelock is the counterpart of a deadlock. In a deadlock, two or more threads hold locks and wait on each other, so none can make progress. In a livelock, all worker threads (nodes) keep running, yet the system as a whole still cannot make progress; for example, in Basic Paxos, competing proposals can in some cases prevent any value from ever reaching a majority. Raft's election has only a single phase, so several nodes may start elections at the same time, splitting the vote so that no leader is chosen; if the same thing happens in the next round, the system cannot move forward. Raft solves this livelock with randomized election timeouts.

2. Does the Raft system require strong physical clock consistency between nodes?
The Raft protocol places no requirement on physical clock consistency across nodes and needs no time calibration via atomic clocks or NTP. It does, however, constrain the timeout settings, specifically:
broadcastTime << electionTimeout << MTBF (Mean Time Between Failures)
First, the broadcast time must be much smaller than the election timeout: the leader keeps followers from timing out by broadcasting heartbeats, and if a heartbeat round took longer than the timeout, followers would wrongly conclude the leader had crashed and trigger an election. Second, the election timeout must be much smaller than the mean time between failures: if the MTBF were shorter than the timeout, elections would be triggered constantly, and the system cannot serve requests while an election is in progress. In practice, broadcastTime is roughly one network RTT: under 1 ms within the same city, a few tens of ms between distant regions, and possibly several hundred ms across countries. The MTBF is at least on the order of months, so setting the election timeout to around 1 s to 5 s satisfies both inequalities. A set of illustrative constants follows.
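
Illustrative constants satisfying the inequality for a cross-region deployment; the exact numbers are assumptions for this example, not values mandated by the protocol:

```go
package raft

import "time"

const (
	HeartbeatInterval = 100 * time.Millisecond // broadcastTime: at least one network RTT
	ElectionTimeoutLo = 1 * time.Second        // lower bound of the random timeout range
	ElectionTimeoutHi = 5 * time.Second        // upper bound of the random timeout range
	// MTBF is on the order of months, so electionTimeout << MTBF holds easily.
)
```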

3. How does Raft ensure that the leader has all the committed logs?
On the one hand, while the leader stays the same, log entries flow only from the leader to the followers, and in case of conflict the leader's log prevails. On the other hand, across leader changes, the election mechanism provides the guarantee: a candidate must be recognized by a majority, and voters compare (lastLogTerm, lastLogIndex) to ensure the candidate's log is at least as up to date as their own, so the new leader's log is at least as new as those of a majority. Since every committed entry has already reached a majority, it follows that the new leader contains all committed entries and none are lost.

4. Why does the Raft protocol require log continuity? What are its advantages and disadvantages?
By the Raft election process, the leader must hold the most up-to-date (termId, logId) among a majority, which means the leader already contains all committed entries. The leader therefore never needs to fetch entries from other followers; entries always flow from the leader to the followers, which simplifies the logic. The disadvantage is that a follower must accept all preceding entries before it can accept a new one, and it only contributes a useful vote once it has caught up. If the logs diverge a lot, it takes a follower a long time to catch up, and any follower that has not caught up to the latest log cannot help form a majority, so under poor network conditions reaching a majority becomes difficult. Paxos, by contrast, allows "holes" in the log, which tolerates network jitter better, but the logic for handling the holes is complicated.

5. How does Raft ensure log continuity?
When the leader sends log entries to a follower, it carries the term and index of the immediately preceding entry. When the follower receives them, it looks for the entry with that term number and index in its own log; if the entry exists and matches, the new entries are accepted. Otherwise, the leader decrements the log index and retries until it finds a position where its log agrees with the follower's; the follower then deletes all entries after that position and appends the entries sent by the leader. Once the append succeeds, the follower's log is fully consistent with the leader's. The Paxos protocol, by contrast, does not require log continuity and allows out-of-order acceptance.

6. What if a node with a smaller termId reaches the majority first? Is that possible?
Yes. If a node with a smaller termId reaches the majority, it means the node holding the largest termId was once a leader (or candidate) whose entries never reached a majority, so those entries can be overwritten. That node, however, keeps initiating votes, and the new leader will eventually send a log entry to it. When the leader sees a returned term T > currentTerm while its own entry has not yet reached the majority, it reverts to follower, which gives the node with the larger termId a chance to become leader. But a node with a larger termId is not guaranteed to win: the leader first checks whether it has reached the majority, and if it has, it simply continues as leader.

7. Is a log entry that has reached a majority necessarily committed?
Not necessarily. Only an entry generated in the current term (current_term) that reaches a majority is considered committed; once committed, it is durable and never changes. In Raft, the leader never alters the termId of an existing entry; every entry carries its (termId, logIndex). When leadership changes frequently, an entry can be replicated to a majority at some moment and still be overwritten later, for example:
(The following walks through the scenario of Figure 8 in the Raft paper, with five servers S1 through S5.)
(a) S1 is leader with termId 2 and replicates a log entry with (termId, logIndex) = (2, 2) to S1 and S2.
(b) S1 crashes; S5 is elected leader with the votes of S3, S4, and S5, increments termId to 3, and writes an entry (3, 2) locally.
(c) S5 crashes; S1 restarts and is re-elected leader, incrementing termId to 4, and continues replicating the entry (2, 2) to a majority, but crashes before committing it.
(d) S1 crashes; S5 is elected leader with the votes of S2, S3, and S4, then replicates its entry (3, 2) to the majority and commits it. The entry (2, 2) is overwritten even though it had reached a majority.
(e) If, instead, S1 had replicated an entry of its current term 4 to the majority before crashing, S5 could not have become leader and (2, 2) would not be overwritten.

References
https://raft.github.io/raft.pdf
https://ramcloud.stanford.edu/~ongaro/thesis.pdf
https://ramcloud.stanford.edu/~ongaro/userstudy/paxos.pdf
