TiKV Source Parsing Series -- How to Use Raft


Overview

This document is mainly aimed at TiKV community developers. It introduces TiKV's system architecture, source code structure, and key workflows, so that after reading it, developers have a preliminary understanding of the TiKV project and can better participate in its development.

Note that TiKV is written in Rust, so readers need a general understanding of the Rust language. In addition, this document does not cover the details of TiKV's central control service, the Placement Driver (PD), but it does explain how some important TiKV workflows interact with PD.

TiKV is a distributed key-value system. It uses the Raft protocol to ensure strong data consistency, and supports distributed transactions using MVCC plus two-phase commit (2PC).

Architecture

The overall architecture of TiKV is relatively simple:

Placement Driver: The Placement Driver (PD) is responsible for the management and scheduling of the entire cluster.

Node: A node can be thought of as an actual physical machine; each node is responsible for one or more Stores.

Store: A Store uses RocksDB for actual data storage; usually one Store corresponds to one hard disk.

Region: A Region is the smallest unit of data movement, corresponding to an actual data range within a Store. Each Region has multiple replicas, each placed in a different Store, and together these replicas form a Raft group.

Raft

TiKV uses the Raft algorithm to achieve strong data consistency in a distributed environment. For details on Raft, refer to the paper "In Search of an Understandable Consensus Algorithm" and the official website; we do not explain it in depth here. Put simply, Raft is a replicated log plus state machine model. Writes can only go through the leader, which replicates each command to the followers via the log; once a majority of the nodes in the cluster have received a log entry, that entry is considered committed and can be applied to the state machine.

TiKV's Raft implementation is mainly ported from etcd's Raft and supports all of Raft's features, including:

    • Leader election

    • Log replication

    • Log compaction

    • Membership changes

    • Leader transfer

    • Linearizable / lease read

Note that TiKV and etcd handle membership changes slightly differently from the Raft paper: in TiKV a membership change takes effect only when its log entry is applied. The main goal of this is simplicity of implementation, but it carries a risk: if the cluster has only two nodes and we remove one of them, and a follower has not yet received the ConfChange log entry when the leader goes down unrecoverably, the entire cluster stops working. So we usually recommend that users deploy 3 or more nodes, in odd numbers.
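The availability argument above comes down to quorum arithmetic. The helpers below are hypothetical illustrations (not TiKV code) of why a 2-node cluster tolerates no failures while a 3-node cluster tolerates one, and why 4 nodes are no better than 3:

```rust
// Smallest majority of a cluster of the given size; Raft commits a log
// entry only once this many nodes have received it.
fn quorum(cluster_size: usize) -> usize {
    cluster_size / 2 + 1
}

// How many node failures still leave a functioning quorum.
fn tolerable_failures(cluster_size: usize) -> usize {
    cluster_size - quorum(cluster_size)
}
```

With 2 nodes, `quorum(2) == 2`, so losing either node (for example the leader, mid-ConfChange) halts the cluster; `tolerable_failures(3) == tolerable_failures(4) == 1`, which is why odd cluster sizes are recommended.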

The Raft library is standalone; users can easily embed it directly into their own applications and only need to handle storage and message delivery themselves. Below is a brief introduction to how to use Raft; the code is under /src/raft in the TiKV source tree.

Storage

First, we need to define our own Storage. Storage mainly stores Raft-related data; its trait is defined as follows:

pub trait Storage {
    fn initial_state(&self) -> Result<RaftState>;
    fn entries(&self, low: u64, high: u64, max_size: u64) -> Result<Vec<Entry>>;
    fn term(&self, idx: u64) -> Result<u64>;
    fn first_index(&self) -> Result<u64>;
    fn last_index(&self) -> Result<u64>;
    fn snapshot(&self) -> Result<Snapshot>;
}

We need to implement this Storage trait ourselves. The meaning of each interface is explained in detail below:

initial_state: Called when initializing the Raft Storage; it returns a RaftState, defined as follows:

pub struct RaftState {
    pub hard_state: HardState,
    pub conf_state: ConfState,
}

HardState and ConfState are protobuf messages, defined as:

message HardState {
    optional uint64 term   = 1;
    optional uint64 vote   = 2;
    optional uint64 commit = 3;
}

message ConfState {
    repeated uint64 nodes = 1;
}

HardState saves the last term recorded by this Raft node, which node it previously voted for, and the log index that has already been committed. ConfState holds the IDs of all nodes in the Raft cluster.

Persisting the RaftState is handled by the user, outside the Raft library itself.

entries: Returns the Raft log entries in the [low, high) range; max_size limits the maximum total size of the returned entries.

term returns the term of the log entry at the given index; first_index and last_index return the smallest and largest log index, respectively.

snapshot: Returns a snapshot of the current Storage. Sometimes the current Storage holds a relatively large volume of data and generating a snapshot is time-consuming, so we may have to build it asynchronously on another thread to avoid blocking the current Raft thread. In that case, this method can return a SnapshotTemporarilyUnavailable error; Raft then knows the snapshot is being prepared and will retry after a while.

Note that the Storage trait above covers only what the Raft library requires. We will also use this Storage to store data such as the Raft log itself, so we need to provide additional interfaces separately. In raft/storage.rs we provide a MemStorage for testing; you can refer to MemStorage to implement your own Storage.
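To make the trait concrete, here is a minimal in-memory implementation in the spirit of MemStorage. Entry, Snapshot, and the state structs are simplified stand-ins for the real protobuf types, and the error type is reduced to a String; only the indexing logic mirrors the real thing:

```rust
type Result<T> = std::result::Result<T, String>;

// Simplified stand-ins for the protobuf-generated types.
#[derive(Clone, Default)]
struct Entry { term: u64, index: u64, data: Vec<u8> }
#[derive(Clone, Default)]
struct Snapshot;
#[derive(Default)]
struct HardState { term: u64, vote: u64, commit: u64 }
#[derive(Default)]
struct ConfState { nodes: Vec<u64> }
#[derive(Default)]
struct RaftState { hard_state: HardState, conf_state: ConfState }

trait Storage {
    fn initial_state(&self) -> Result<RaftState>;
    fn entries(&self, low: u64, high: u64, max_size: u64) -> Result<Vec<Entry>>;
    fn term(&self, idx: u64) -> Result<u64>;
    fn first_index(&self) -> Result<u64>;
    fn last_index(&self) -> Result<u64>;
    fn snapshot(&self) -> Result<Snapshot>;
}

// entries[i] holds the log entry at index offset + i. This sketch
// assumes the log always holds at least one entry.
struct MemStorage { offset: u64, entries: Vec<Entry> }

impl Storage for MemStorage {
    fn initial_state(&self) -> Result<RaftState> { Ok(RaftState::default()) }

    fn entries(&self, low: u64, high: u64, max_size: u64) -> Result<Vec<Entry>> {
        if low < self.first_index()? || high > self.last_index()? + 1 {
            return Err("index out of range".to_owned());
        }
        let (lo, hi) = ((low - self.offset) as usize, (high - self.offset) as usize);
        // Stop once the accumulated payload would exceed max_size,
        // but always return at least one entry.
        let mut size = 0u64;
        let mut out = Vec::new();
        for e in &self.entries[lo..hi] {
            size += e.data.len() as u64;
            if !out.is_empty() && size > max_size { break; }
            out.push(e.clone());
        }
        Ok(out)
    }

    fn term(&self, idx: u64) -> Result<u64> {
        if idx < self.offset { return Err("log compacted".to_owned()); }
        self.entries.get((idx - self.offset) as usize)
            .map(|e| e.term)
            .ok_or_else(|| "entry missing".to_owned())
    }

    fn first_index(&self) -> Result<u64> { Ok(self.offset) }
    fn last_index(&self) -> Result<u64> {
        Ok(self.offset + self.entries.len() as u64 - 1)
    }
    fn snapshot(&self) -> Result<Snapshot> { Ok(Snapshot) }
}
```

A real implementation would persist entries and the RaftState durably (TiKV uses RocksDB) rather than keeping them in a Vec.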

Config

Before using Raft, we need to know some of its relevant configuration options, defined in Config. Only the following need special attention:

pub struct Config {
    pub id: u64,
    pub election_tick: usize,
    pub heartbeat_tick: usize,
    pub applied: u64,
    pub max_size_per_msg: u64,
    pub max_inflight_msgs: usize,
}

id: The unique identifier of the Raft node; within one Raft cluster, IDs must never be duplicated. Inside TiKV, IDs are guaranteed to be globally unique through PD.

election_tick: When a follower has not received a message from the leader for election_tick ticks, it starts a new election. TiKV's default is 50.

heartbeat_tick: The leader sends a heartbeat message to its followers every heartbeat_tick ticks. Default: 10.

applied: The last log index that has already been applied.

max_size_per_msg: Limits the maximum size of each message sent. Default: 1MB.

max_inflight_msgs: Limits the maximum number of in-flight messages during replication. Default: 256.

It is worth explaining tick in detail here: TiKV's Raft is tick-driven. Suppose we call Raft's tick function every 100ms; then after every heartbeat_tick ticks, the leader sends a heartbeat to its followers.
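The tick-driven timing can be sketched as follows. The field names follow the Config struct above, but the counting logic is illustrative, not TiKV's actual implementation:

```rust
// Tick-driven timing sketch, assuming tick() is called every 100 ms.
struct Config {
    heartbeat_tick: usize,
    election_tick: usize,
}

struct LeaderTimer {
    cfg: Config,
    heartbeat_elapsed: usize,
    heartbeats_sent: usize,
}

impl LeaderTimer {
    // Called once per 100 ms. With heartbeat_tick = 10, the leader sends
    // a heartbeat every 10 ticks, i.e. once per second. Symmetrically,
    // with election_tick = 50 a follower would wait 5 s of silence
    // before starting a new election.
    fn tick(&mut self) {
        self.heartbeat_elapsed += 1;
        if self.heartbeat_elapsed >= self.cfg.heartbeat_tick {
            self.heartbeat_elapsed = 0;
            self.heartbeats_sent += 1;
        }
    }
}
```

Expressing timeouts in ticks rather than wall-clock time keeps the Raft core deterministic and easy to test: the caller owns the clock.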

RawNode

We use RawNode to drive Raft. To construct a RawNode, we need to define a Raft Config and pass in an implemented Storage; the peers parameter is only for testing and should be left empty in practice. Once the RawNode object is created, we can use Raft. The functions we focus on are:

tick: We use the tick function to drive Raft periodically; in TiKV we call tick every 100ms.

propose: The leader writes a command sent by a client into the Raft log through propose, and replicates it to the other nodes.

propose_conf_change: Similar to propose, but used specifically for ConfChange commands.

step: When the node receives a message from another node, it calls step to drive Raft.

has_ready: Used to determine whether a RawNode has a pending Ready to process.

ready: Gets the Ready state of the current node; we first use has_ready to check whether a RawNode has one.

apply_conf_change: When a ConfChange log entry has been successfully applied, this must be called explicitly to drive Raft.

advance: Tells Raft that we have finished processing the current Ready, so it can begin the next iteration.

For RawNode, we focus on the concept of Ready, which is defined as follows:

pub struct Ready {
    pub ss: Option<SoftState>,
    pub hs: Option<HardState>,
    pub entries: Vec<Entry>,
    pub snapshot: Snapshot,
    pub committed_entries: Vec<Entry>,
    pub messages: Vec<Message>,
}

ss: If the SoftState changes, for example when nodes are added or removed, ss will be non-empty.

hs: If the HardState changes, for example when there is a new vote or the term changes, hs will be non-empty.

entries: Log entries that need to be saved to Storage before any messages are sent.

snapshot: If non-empty, the snapshot needs to be saved to Storage.

committed_entries: Raft log entries that have already been committed and can be applied to the state machine.

messages: Messages to be sent to other nodes. Usually they may only be sent after the entries have been saved successfully, but for the leader, messages can be sent first and entries saved afterwards; this is one of the optimizations mentioned in the Raft paper, and TiKV adopts it too.

When the outside code finds that a RawNode has a Ready, it gets the Ready and processes it as follows:

    1. Persist the non-empty ss and hs.

    2. If the node is the leader, send the messages first.

    3. If the snapshot is non-empty, save it to Storage, and apply the data inside the snapshot to the state machine asynchronously (this could also be done synchronously, but the snapshot is usually large and doing it synchronously would block the thread).

    4. Save the entries to Storage.

    5. If the node is a follower, send the messages.

    6. Apply the committed_entries to the state machine.

    7. Call advance to tell Raft that we have finished processing this Ready.
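The steps above can be sketched as a function that records what is done, in order. The types here are simplified stand-ins for the real Ready (strings and integers instead of protobuf messages), so the leader/follower ordering difference is easy to see and test:

```rust
// Simplified stand-in for the Ready struct above.
#[derive(Default)]
struct Ready {
    ss: Option<&'static str>,          // stand-in for SoftState
    hs: Option<&'static str>,          // stand-in for HardState
    entries: Vec<u64>,
    snapshot: Option<u64>,
    committed_entries: Vec<u64>,
    messages: Vec<u64>,
}

// Returns the processing steps in the order they are performed.
fn process_ready(rd: &Ready, is_leader: bool) -> Vec<&'static str> {
    let mut steps = Vec::new();
    // 1. Persist non-empty ss and hs.
    if rd.ss.is_some() || rd.hs.is_some() {
        steps.push("persist ss/hs");
    }
    // 2. The leader may send messages before persisting entries
    //    (the optimization from the Raft paper mentioned above).
    if is_leader && !rd.messages.is_empty() {
        steps.push("send messages");
    }
    // 3. Save the snapshot and apply its data to the state machine.
    if rd.snapshot.is_some() {
        steps.push("save snapshot");
    }
    // 4. Append entries to Storage.
    if !rd.entries.is_empty() {
        steps.push("append entries");
    }
    // 5. A follower sends messages only after entries are persisted.
    if !is_leader && !rd.messages.is_empty() {
        steps.push("send messages");
    }
    // 6. Apply committed entries to the state machine.
    if !rd.committed_entries.is_empty() {
        steps.push("apply committed entries");
    }
    // 7. Tell Raft we are done with this Ready.
    steps.push("advance");
    steps
}
```

The key invariant encoded here is that a follower must never acknowledge (send messages about) entries it has not yet persisted, while a leader can safely overlap sending with its own local write.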

(to be continued...)
