In the previous article we introduced the replicated (multi-copy) state machine. Raft implements a replicated state machine by maintaining a replicated log across multiple nodes.
Raft first elects a unique leader, which is responsible for managing the log: every operation that appends to the log or changes state goes through the leader. The leader accepts log requests from users, distributes the log entries to the other nodes in the system, and tells those nodes when they can safely apply the entries to their state machines. This simplifies management of the replicated log, because data flows only from the leader to the other nodes and the other nodes never send log entries to the leader. When the leader fails, Raft elects a new leader.
By using a leader, Raft decomposes the consistency problem into three relatively independent sub-problems:
Leader election: A leader must be elected when the system is initialized, and a new leader must be elected when the current leader fails.
Log distribution: The leader accepts log requests and distributes them to the other nodes.
Ensuring correctness: All copies of the log must be kept consistent.
This article describes only the leader election process; the other two sub-problems will be covered in the following articles.
State and log of a node
In a Raft cluster, every node is in one of three states at any given time: leader, follower, or candidate. For most of the time the system is running there is one leader, and all other nodes are followers. The leader accepts user requests, and each request is treated as a log entry.
In the context of Raft, a log is not a record of errors, warnings, or debug output, but a record of operations. For example, in a Raft cluster that provides key-value storage, the log contains operations such as "set a=100", "set b=3", and "delete c", all of which change the state of the key-value pairs. Reading the value of a is not a log entry, because it does not change the system state.
A follower is passive: it only responds to requests from the leader and from candidates, and never issues requests itself. The candidate state is a temporary state used during leader election. The following figure depicts the transitions between the three states:
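In code form, the three states and the transitions between them could be sketched roughly as follows. This is only an illustration: the event names are made up for this example, and the transitions listed are the ones described in the rest of this article.

```python
from enum import Enum


class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


# (current state, event) -> next state.
# The event names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    (State.FOLLOWER, "election_timeout"): State.CANDIDATE,
    (State.CANDIDATE, "election_timeout"): State.CANDIDATE,  # start a new election
    (State.CANDIDATE, "won_majority"): State.LEADER,
    (State.CANDIDATE, "saw_leader_or_higher_term"): State.FOLLOWER,
    (State.LEADER, "saw_higher_term"): State.FOLLOWER,
}


def next_state(state: State, event: str) -> State:
    """Return the state after `event`; unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

For instance, `next_state(State.FOLLOWER, "election_timeout")` yields `State.CANDIDATE`, while an event that has no entry for the current state leaves the node where it is.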
Term
Raft divides time into successive terms of arbitrary length, numbered with consecutive integers. Each term begins with a leader election, in which several candidates compete to become leader. Once a node becomes leader, the other nodes become followers, and that node remains leader for the rest of the term. If the leader fails, the other nodes hold a new election in a new term. There is never more than one leader in any term. Terms are an important concept in Raft: every node stores the value of its current term, and every message between nodes carries the sender's term. When a node detects that its own term is lower than another node's, it updates its term to the higher value. When a leader or candidate finds that its term is lower than another node's, it immediately converts to follower.
RPC
Raft nodes communicate by RPC. The core of the Raft algorithm needs only two RPCs: RequestVote and AppendEntries. RequestVote RPCs are initiated by candidates during leader elections. AppendEntries RPCs are initiated by the leader for heartbeats and log replication. Followers do not initiate any RPCs.
Leader election process
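The election relies on the term rule and the two RPCs just introduced, so it may help to see them side by side first. The sketch below is a simplification: the class and field names are assumptions made for this example, only the fields needed for this article are shown (the full RPCs carry more), and treating an AppendEntries message with no entries as a heartbeat is likewise an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestVote:
    term: int          # the candidate's current term
    candidate_id: str  # which node is asking for the vote


@dataclass
class AppendEntries:
    term: int       # the leader's current term
    leader_id: str
    # Assumption of this sketch: an empty list is a pure heartbeat.
    entries: List[str] = field(default_factory=list)


@dataclass
class Node:
    node_id: str
    current_term: int = 0
    state: str = "follower"

    def observe_term(self, msg_term: int) -> None:
        """Adopt a higher term seen in any message and fall back to follower."""
        if msg_term > self.current_term:
            self.current_term = msg_term
            self.state = "follower"
```

A node would call `observe_term(msg.term)` on every message it receives, so a stale leader or candidate steps down as soon as it hears from a node with a higher term.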
Raft uses a heartbeat mechanism to trigger leader elections. When the system starts, every node is initialized as a follower with its term set to 0, and starts a timer. When the timer expires, the follower converts to a candidate and immediately does the following:
1. Increments its own term number
2. Starts a new timer
3. Votes for itself
4. Sends RequestVote RPCs to all other nodes and waits for their replies
If the candidate receives votes from a majority of nodes before the timer expires, it becomes leader. If it receives an AppendEntries heartbeat RPC from another node, that node has already been elected leader, so the candidate converts to follower. If the timer expires before either of these happens, the candidate repeats steps 1-4 to start a new election.
As soon as a candidate has received votes from a majority of nodes and become leader, it sends AppendEntries heartbeat RPCs to all other nodes. When the remaining candidates receive the heartbeat RPC, they convert to followers, and the election ends.
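The candidate's side of an election, steps 1-4 and the possible outcomes, might look roughly like the sketch below. It is deliberately simplified and is not the implementation mentioned later in this article: the names `Candidate`, `run_round`, and `on_heartbeat` are hypothetical, the RequestVote RPC is replaced by plain synchronous callables, and the timer is left to the caller. The check that a heartbeat's term is at least the candidate's own term is an added assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Stand-in for a RequestVote RPC: (term, candidate_id) -> vote granted?
VoteFn = Callable[[int, str], bool]


@dataclass
class Candidate:
    node_id: str
    current_term: int = 0
    state: str = "follower"
    voted_for: Optional[str] = None

    def run_round(self, peers: Dict[str, VoteFn]) -> str:
        """Run one election round and return the resulting state."""
        self.state = "candidate"
        self.current_term += 1         # step 1: increment the term
        # step 2: restarting the timer is left to the caller in this sketch
        self.voted_for = self.node_id  # step 3: vote for itself
        votes = 1
        for ask in peers.values():     # step 4: request votes from all others
            if ask(self.current_term, self.node_id):
                votes += 1
        cluster_size = len(peers) + 1
        if votes > cluster_size // 2:  # strict majority of the whole cluster
            self.state = "leader"
        return self.state

    def on_heartbeat(self, leader_term: int) -> None:
        """Another node is already leader: fall back to follower."""
        # Assumption: only a heartbeat with a term >= our own counts.
        if leader_term >= self.current_term:
            self.current_term = leader_term
            self.state = "follower"
```

If `run_round` returns `"candidate"`, no majority was reached; the caller would wait for the timer to expire and then call `run_round` again, which starts a new election in the next term.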
Each follower can cast only one vote per term, on a first-come-first-served basis. Each follower also has a timer; if the timer expires before the follower has received a heartbeat RPC from the leader, the follower converts to a candidate and starts requesting votes. In other words, when the current leader fails, some follower will convert to a candidate and start an election.
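The one-vote-per-term, first-come-first-served rule can be sketched on the voter's side as follows. `Voter` and `handle_request_vote` are hypothetical names for this example, and the sketch covers only the checks described in this article; a complete RequestVote handler has further conditions that are outside its scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voter:
    current_term: int = 0
    voted_for: Optional[str] = None
    state: str = "follower"

    def handle_request_vote(self, term: int, candidate_id: str) -> bool:
        """Return True if the vote is granted to `candidate_id` for `term`."""
        if term < self.current_term:
            return False  # request from a stale term
        if term > self.current_term:
            # A newer term: adopt it, forget the old vote, become follower.
            self.current_term = term
            self.voted_for = None
            self.state = "follower"
        # First come, first served: grant only if no vote has been cast in
        # this term, or if the same candidate is asking again.
        if self.voted_for is None or self.voted_for == candidate_id:
            self.voted_for = candidate_id
            return True
        return False
```

With this rule, a second candidate asking in the same term is refused, but the same voter may vote again once a request arrives with a higher term.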
If several nodes start an election at the same time and none of them obtains a majority (this is called a split vote), the term number is incremented and a new election is held in the new term. Could split votes then repeat indefinitely, so that a leader is never elected?
Raft avoids this with randomized timeouts. A Raft system has an election timeout configuration item, and the timer duration for followers and candidates is recalculated each time, chosen at random between one and two times the configured value. Even if all nodes start at the same time, the randomized timeouts make it unlikely that they convert to candidates at the same moment, and the node that converts first requests votes first and so obtains a majority. In each term there is therefore only a small chance that several nodes request votes simultaneously and each obtains only a minority, and the chance of this happening in many consecutive terms is smaller still. In theory the probability is tiny, and in practice it can be treated as negligible. In my Raft implementation, a leader is almost always elected within 1-2 terms.
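A minimal sketch of the randomized timeout, assuming the configured election timeout is given in seconds. The function names and the 0.15 s figure in the usage note are illustrative only and are not taken from the implementation mentioned above.

```python
import random
from typing import Iterable


def election_timeout(base: float, rng: random.Random) -> float:
    """Pick a timeout between one and two times the configured value."""
    return rng.uniform(base, 2 * base)


def first_to_time_out(node_ids: Iterable[str], base: float,
                      rng: random.Random) -> str:
    """Give every node its own random timeout and return the node that
    expires first, i.e. the one that becomes a candidate first."""
    timeouts = {node: election_timeout(base, rng) for node in node_ids}
    return min(timeouts, key=timeouts.get)
```

For example, `first_to_time_out(["n1", "n2", "n3"], 0.15, random.Random())` draws a timeout in the range 0.15 s to 0.30 s for each node. Because the draws are independent, one node usually expires noticeably earlier than the rest and can collect votes before the others become candidates.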
In the next article I'll cover the RequestVote and AppendEntries RPCs in detail.