TiKV Source Code Parsing Series: The Design and Implementation of Multi-Raft

This series of articles is aimed at TiKV community developers, focusing on TiKV's system architecture, source code structure, and process flow. The goal is to give readers a preliminary understanding of the TiKV project so that they can better participate in TiKV development.
Note that TiKV is written in Rust, so readers should have a general understanding of the Rust language. Also, this series does not cover the details of TiKV's central control service, Placement Driver (PD), but it does show how some important TiKV processes interact with PD.
TiKV is a distributed KV system that uses the Raft protocol to guarantee strong data consistency, and it supports distributed transactions using MVCC + 2PC.
This article is Part II of the series.

Placement Driver

Before proceeding, let's briefly introduce Placement Driver (PD). PD is the global central controller of TiKV: it stores the metadata of the entire TiKV cluster and is responsible for cluster-wide scheduling, global ID generation, and the global TSO timing service.

PD is a very important central node. By embedding etcd, it automatically supports distributed scaling and failover, which solves the single point of failure problem. A detailed introduction to PD will be given in a separate article later.

In TiKV, the code that interacts with PD lives in the pd directory of the source tree. At this stage the interaction is implemented over an RPC protocol of our own definition, and the protocol is very simple. In pd/mod.rs we provide the Client trait used to interact with PD, and we implement the RPC client for it.

The PD Client trait is very simple; most of it consists of set/get operations on cluster meta information. A few methods deserve extra attention:

bootstrap_cluster: when we start a TiKV service, we first call is_cluster_bootstrapped to check whether the entire TiKV cluster has been initialized; if not, we create the first region on this TiKV service.

region_heartbeat: the region periodically reports its own information to PD for subsequent scheduling. For example, if the number of peers a region reports to PD is smaller than the configured number of replicas, PD adds a new peer replica to that region.

store_heartbeat: the store periodically reports its own information to PD for subsequent scheduling. For example, the store tells PD its current total and remaining disk space; if PD finds that a store has no space left, it will not consider migrating peers to that store.

ask_split / report_split: when a region finds that it needs to split, it tells PD through ask_split, and PD generates the ID of the newly split region; after the region splits successfully, it notifies PD through report_split.
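
To make the shape of this interface concrete, here is a minimal sketch of what such a Client trait could look like in Rust. The method names follow the descriptions above, but the parameter and result types are placeholders invented for illustration; they are not the actual definitions in pd/mod.rs.

// Placeholder types standing in for the protobuf-generated messages.
pub struct Store;
pub struct Region;
pub struct Peer;
pub struct StoreStats;
pub struct AskSplitResponse {
    pub new_region_id: u64,
    pub new_peer_ids: Vec<u64>,
}

pub type PdResult<T> = Result<T, String>;

// Rough shape of the PD Client trait discussed above (illustrative only).
pub trait Client {
    fn is_cluster_bootstrapped(&self) -> PdResult<bool>;
    fn bootstrap_cluster(&self, store: Store, first_region: Region) -> PdResult<()>;
    fn region_heartbeat(&self, region: Region, leader: Peer) -> PdResult<()>;
    fn store_heartbeat(&self, stats: StoreStats) -> PdResult<()>;
    fn ask_split(&self, region: Region) -> PdResult<AskSplitResponse>;
    fn report_split(&self, left: Region, right: Region) -> PdResult<()>;
}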

Note that PD will support the gRPC protocol later, so the Client API is subject to change.

Raftstore

Because TiKV aims to support TB+ scale data, far more than a single Raft cluster can handle, we need to use multiple Raft clusters, that is, Multi Raft. In TiKV, Multi Raft is implemented in Raftstore, and the code lives in the raftstore/store directory.

Region

Because we want to support Multi Raft, we need to shard the data so that each Raft group is responsible for only a subset of it.

The usual sharding algorithms are hash and range, and TiKV uses range to shard the data. The main reason for choosing range is that it keeps keys with the same prefix together, which makes operations such as scan easy; hash cannot support this. Range also handles split/merge much better than hash: in many cases only the meta information changes, and there is no need to move data around on a large scale.

Of course, the problem with range is that a single region may well become a performance hotspot due to frequent operations, but there are optimizations for that, such as having PD schedule these regions onto better machines and letting followers share the read load.

In short, in TiKV we use range to shard the data, and each shard forms one Raft group, which we represent with a region.
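
As a toy illustration (not TiKV code) of why range sharding keeps prefix scans cheap: keys that share a prefix remain adjacent in sorted (range) order, whereas a hash scatters them across shards.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn main() {
    let keys = ["user1:a", "user1:b", "user1:c", "user2:a"];

    // Range sharding: sort order keeps the "user1:" keys together,
    // so a prefix scan touches one contiguous range.
    let mut sorted = keys.to_vec();
    sorted.sort();
    println!("range order: {:?}", sorted);

    // Hash sharding: the same keys land on unrelated shards.
    for k in &keys {
        let mut h = DefaultHasher::new();
        k.hash(&mut h);
        println!("{} -> shard {}", k, h.finish() % 4);
    }
}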

The protobuf definition of region is as follows:

message Region {
    optional uint64 id                  = 1 [(gogoproto.nullable) = false];
    optional bytes  start_key           = 2;
    optional bytes  end_key             = 3;
    optional RegionEpoch region_epoch   = 4;
    repeated Peer   peers               = 5;
}

message RegionEpoch {
    optional uint64 conf_ver    = 1 [(gogoproto.nullable) = false];
    optional uint64 version     = 2 [(gogoproto.nullable) = false];
}

message Peer {
    optional uint64 id          = 1 [(gogoproto.nullable) = false];
    optional uint64 store_id    = 2 [(gogoproto.nullable) = false];
}

id: the unique identifier of the region, allocated globally by PD.

start_key, end_key: the range [start_key, end_key) this region covers. For the very first region, both the start and end key are empty, which TiKV handles specially.

region_epoch: whenever a region adds or removes a peer, splits, and so on, we consider the epoch of this region to have changed. In RegionEpoch, conf_ver is incremented on every ConfChange, while version is incremented on every split/merge.

peers: the node information contained in the current region. For a Raft group we usually have three replicas, each represented by a peer. A peer's id is also allocated globally by PD, and store_id indicates which store the peer lives on.
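
A minimal sketch of the range check implied by [start_key, end_key), where an empty end_key is treated as unbounded, matching the special handling of empty keys mentioned above. This is illustrative, not the actual TiKV utility code.

fn key_in_region(key: &[u8], start_key: &[u8], end_key: &[u8]) -> bool {
    // Byte slices compare lexicographically, the same order RocksDB uses.
    key >= start_key && (end_key.is_empty() || key < end_key)
}

fn main() {
    assert!(key_in_region(b"b", b"a", b"c"));
    assert!(!key_in_region(b"c", b"a", b"c")); // end_key is exclusive
    assert!(key_in_region(b"zzz", b"a", b"")); // empty end_key = unbounded
}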

RocksDB / Key Prefixes

For the actual data storage, everything, whether the Raft metadata and log or the state machine data, is stored in a single RocksDB instance. For details on RocksDB, refer to facebook/rocksdb.

We use different key prefixes to distinguish data such as Raft data and state machine data; see raftstore/store/keys.rs. For the actual state machine data we add the 'z' prefix, and for other local metadata (including Raft data) we uniformly add the 0x01 prefix.

The key formats of some important metadata are briefly described below; the leading 0x01 prefix is omitted.

    • 0x01: stores the StoreIdent. When initializing the store, we save the store's cluster ID, store ID, and other information under this key.

    • 0x02: stores Raft-related information. 0x02 is followed by the ID of the Raft region (8-byte big endian), followed by a suffix that identifies the subtype:

      • 0x01: stores the Raft log, followed by the log index (8-byte big endian)

      • 0x02: stores the RaftLocalState

      • 0x03: stores the RaftApplyState

    • 0x03: stores some region-local meta information. 0x03 is followed by the Raft region ID, followed by a suffix that identifies the subtype:

      • 0x01: stores the RegionLocalState
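
A minimal sketch of how the prefixes above could be combined into an actual key. The constant and function names are illustrative, not the exact keys.rs API; region ID and log index are written big endian so that keys sort numerically.

const LOCAL_PREFIX: u8 = 0x01;   // uniform prefix for local metadata
const RAFT_PREFIX: u8 = 0x02;    // Raft data: 0x01 0x02 | region_id | suffix ...
const RAFT_LOG_SUFFIX: u8 = 0x01;

fn raft_log_key(region_id: u64, log_index: u64) -> Vec<u8> {
    let mut key = Vec::with_capacity(1 + 1 + 8 + 1 + 8);
    key.push(LOCAL_PREFIX);
    key.push(RAFT_PREFIX);
    key.extend_from_slice(&region_id.to_be_bytes());
    key.push(RAFT_LOG_SUFFIX);
    key.extend_from_slice(&log_index.to_be_bytes());
    key
}

fn main() {
    println!("raft log key for region 1, index 10: {:x?}", raft_log_key(1, 10));
}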

These types are defined in protobuf as follows:

message RaftLocalState {
    optional eraftpb.HardState hard_state       = 1;
    optional uint64 last_index                  = 2;
}

message RaftApplyState {
    optional uint64 applied_index               = 1;
    optional RaftTruncatedState truncated_state = 2;
}

enum PeerState {
    Normal       = 0;
    Applying     = 1;
    Tombstone    = 2;
}

message RegionLocalState {
    optional PeerState state        = 1;
    optional metapb.Region region   = 2;
}

RaftLocalState: stores the HardState of the current Raft and the last log index.

RaftApplyState: stores the last log index applied by the current Raft and the truncated log information.

RegionLocalState: stores the region information and the state of the corresponding peer on this store. Normal indicates a normal peer, Applying indicates that the peer has not yet finished applying a snapshot, and Tombstone indicates that the peer has been removed from the region and can no longer participate in the Raft group.

Peer Storage

As mentioned earlier, we use RawNode to drive Raft. Because one region corresponds to one Raft group, a peer inside a region corresponds to one Raft replica. So we encapsulate the operations on RawNode inside Peer.

To use Raft, we need to define our own Storage, which is implemented in the PeerStorage struct in raftstore/store/peer_storage.rs.

When creating a PeerStorage, we first read the peer's RaftLocalState, RaftApplyState, last_term, and so on from RocksDB; these are cached in memory for fast subsequent access.

There are several things to note about PeerStorage:

The first is RAFT_INIT_LOG_TERM and RAFT_INIT_LOG_INDEX, whose value is 5 (any value greater than 1 would do). In TiKV, a peer is created in the following ways:

    1. Actively created: this is usually how the first peer of the first region is created. When it is initialized, we set its log term and index to 5.

    2. Passively created: when a region adds a peer replica, once the ConfChange command has been applied, the leader sends a message to the store of this new peer. When that store receives the message, finds that no corresponding peer exists, and that the message is valid, it creates a corresponding peer. At this point, however, the peer is uninitialized and knows nothing about its region, so we initialize its log term and index to 5. The leader can then tell that this follower has no data (there is a log gap from 0 to 5), and it will send a snapshot directly to this follower.

    3. Created by split: when a region splits into two regions, one of them inherits the meta information of the region before the split and only modifies its range, while the other gets newly created region meta information. The peer of this new region also starts with log term and index 5, because at this point the leader and followers all have the latest data and no snapshot is needed. (Note: the actual split situation is much more complex, and a snapshot may still be sent, but we won't go into that here.)

The next thing that requires attention is snapshot handling. Generating or applying a snapshot is a relatively time-consuming operation. So that snapshot processing does not block the main Raft thread, PeerStorage only updates the snapshot-related meta information synchronously, without blocking the subsequent Raft flow, and then handles the snapshot asynchronously in another thread. PeerStorage maintains a snapshot state, as follows:

pub enum SnapState {
    Relax,
    Generating(Receiver<Snapshot>),
    Applying(Arc<AtomicUsize>),
    ApplyAborted,
}

Note that Generating holds a channel Receiver. When the asynchronous snapshot generation finishes, it sends a message to the channel, so the next time Raft checks, the snapshot can be taken directly from the channel. Applying holds a shared atomic integer so that multiple threads can determine the current applying status, which includes:

pub const JOB_STATUS_PENDING: usize = 0;
pub const JOB_STATUS_RUNNING: usize = 1;
pub const JOB_STATUS_CANCELLING: usize = 2;
pub const JOB_STATUS_CANCELLED: usize = 3;
pub const JOB_STATUS_FINISHED: usize = 4;
pub const JOB_STATUS_FAILED: usize = 5;

For example, if the status is JOB_STATUS_RUNNING, an apply-snapshot operation is currently in progress. At this stage we do not allow FAILED; that is, if applying a snapshot fails, we panic.
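
To make the Generating(Receiver<Snapshot>) idea above concrete, here is a small self-contained sketch of the pattern: a worker thread generates the snapshot and hands it back over a channel, while the Raft thread only polls the receiver without blocking. The Snapshot type and function names are invented for illustration and are not the actual peer_storage.rs code.

use std::sync::mpsc::{channel, Receiver, TryRecvError};
use std::thread;

// Hypothetical stand-in for the real snapshot type.
struct Snapshot {
    data: Vec<u8>,
}

// Kick off asynchronous snapshot generation; the Raft thread only keeps
// the Receiver, mirroring SnapState::Generating(Receiver<Snapshot>).
fn start_generating() -> Receiver<Snapshot> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        // ... expensive scan of the state machine happens here ...
        let snap = Snapshot { data: vec![0u8; 1024] };
        let _ = tx.send(snap);
    });
    rx
}

fn main() {
    let rx = start_generating();
    // Later, when Raft asks for the snapshot, poll without blocking.
    loop {
        match rx.try_recv() {
            // The worker finished; hand the snapshot to Raft.
            Ok(snap) => {
                println!("snapshot ready: {} bytes", snap.data.len());
                break;
            }
            // Not ready yet: tell Raft to retry on a later tick.
            Err(TryRecvError::Empty) => thread::yield_now(),
            Err(TryRecvError::Disconnected) => panic!("snapshot worker died"),
        }
    }
}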

Peer

Peer encapsulates the Raft RawNode; our handling of Raft propose and ready is done inside Peer.

First, look at the propose function. Peer's propose is the entry point for external client commands. Peer determines the type of the command (a simplified sketch follows this list):

    • If it is a read-only operation and the leader is still within its lease, the leader can serve the read locally without going through the Raft process.

    • If it is a Transfer Leader operation, Peer first checks whether it is the leader and whether the follower that is to become the new leader has sufficiently up-to-date logs. If the conditions are satisfied, Peer calls RawNode's transfer_leader command.

    • If it is a Change Peer operation, Peer calls RawNode's propose_conf_change.

    • For everything else, Peer calls RawNode's propose directly.
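
The dispatch described in the list above can be summarized by a simplified sketch like the following; the enum, fields, and helper names are invented for illustration and are not the actual peer.rs code.

enum RaftCmd {
    Read { key: Vec<u8> },
    TransferLeader { to_peer: u64 },
    ChangePeer, // stands in for a ConfChange request
    Write { key: Vec<u8>, value: Vec<u8> },
}

struct Peer {
    is_leader: bool,
    lease_valid: bool,
}

impl Peer {
    fn propose(&mut self, cmd: RaftCmd) {
        match cmd {
            // Leader within its lease can serve reads locally, no Raft round.
            RaftCmd::Read { .. } if self.is_leader && self.lease_valid => self.local_read(cmd),
            RaftCmd::TransferLeader { to_peer } => self.transfer_leader(to_peer),
            RaftCmd::ChangePeer => self.propose_conf_change(),
            // Reads that cannot be served locally and all writes go through
            // the normal Raft log.
            _ => self.propose_normal(cmd),
        }
    }

    fn local_read(&self, _cmd: RaftCmd) { println!("served locally within lease"); }
    fn transfer_leader(&mut self, to: u64) { println!("transfer leadership to peer {}", to); }
    fn propose_conf_change(&mut self) { println!("propose_conf_change via RawNode"); }
    fn propose_normal(&mut self, _cmd: RaftCmd) { println!("propose via RawNode"); }
}

fn main() {
    let mut peer = Peer { is_leader: true, lease_valid: true };
    peer.propose(RaftCmd::Read { key: b"k".to_vec() });
    peer.propose(RaftCmd::Write { key: b"k".to_vec(), value: b"v".to_vec() });
}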

Before propose, Peer also saves the callback corresponding to this command in pending_cmd. When the corresponding log is applied, the callback is looked up via the command's unique UUID, called, and the corresponding result is returned to the client.

Another thing to focus on is Peer's handle_raft_ready family of functions. As described in the earlier Raft chapter, when a RawNode is ready, we need to do a series of things with the data in ready: write entries to Storage, send messages, apply committed_entries, and advance. These are all done within Peer's handle_raft_ready functions.

For committed_entries, Peer parses the actual command and invokes the corresponding processing flow: for example, exec_admin_cmd executes admin commands such as ConfChange and Split, while exec_write_cmd executes the usual data manipulation commands on the state machine. To guarantee data consistency, during execution Peer only stages the modified data in a RocksDB WriteBatch; only after the batch has been written to RocksDB atomically, and the write has succeeded, does it modify the corresponding in-memory meta information. If the write fails, we panic directly to guarantee data integrity.
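
A minimal sketch of the "stage everything in a WriteBatch, commit atomically, then update in-memory state" pattern described above, using the rust-rocksdb crate API as an illustration; TiKV has its own engine wrapper, so the names here are not the actual raftstore code.

use rocksdb::{WriteBatch, DB};

fn apply_committed_entries(db: &DB, kvs: &[(Vec<u8>, Vec<u8>)]) {
    let mut wb = WriteBatch::default();
    for (k, v) in kvs {
        // Execute each command against the batch only; nothing becomes
        // visible until the batch is written.
        wb.put(k, v);
    }
    // One atomic write. On failure we panic rather than leave the
    // in-memory meta information inconsistent with the disk state.
    db.write(wb).expect("atomic apply to RocksDB failed");
    // Only after the write succeeds would the apply state and other
    // in-memory meta information be advanced.
}

fn main() {
    let db = DB::open_default("/tmp/raftstore-demo").expect("open rocksdb");
    apply_committed_entries(&db, &[(b"k1".to_vec(), b"v1".to_vec())]);
}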

When Peer handles ready, we also pass in a Transport object that allows Peer to send messages. The Transport trait is defined as follows:

pub trait Transport: Send + Clone {
    fn send(&self, msg: RaftMessage) -> Result<()>;
}

It has only one function, send. The Transport implemented by TiKV sends any message that needs sending to the server layer, which then sends it to the other nodes.
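
A minimal sketch of such an implementation: a Transport that simply forwards messages to the server layer over a channel, mirroring the behaviour described above. The RaftMessage stand-in and the ChannelTransport name are invented for illustration, not the actual server code.

use std::sync::mpsc::Sender;

// Hypothetical stand-ins so the sketch compiles on its own.
pub struct RaftMessage;
pub type Result<T> = std::result::Result<T, String>;

// The trait from above, repeated so the sketch is self-contained.
pub trait Transport: Send + Clone {
    fn send(&self, msg: RaftMessage) -> Result<()>;
}

#[derive(Clone)]
pub struct ChannelTransport {
    tx: Sender<RaftMessage>,
}

impl Transport for ChannelTransport {
    fn send(&self, msg: RaftMessage) -> Result<()> {
        // The server's event loop receives the message and routes it to
        // the connection of the target store.
        self.tx.send(msg).map_err(|e| e.to_string())
    }
}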

Multi Raft

A Peer is just one replica of a single region. Because TiKV supports Multi Raft, a store has to manage the peer replicas of many regions, and these are managed uniformly inside the Store struct.

Store keeps the information of all its peers in: region_peers: HashMap<u64, Peer>

The key of region_peers is the region ID, and the value is the peer replica on this store.

Store uses Mio to drive the entire process (later we will use tokio-core to simplify the asynchronous logic).

We register a base Raft tick in Mio that fires every 100ms. On each tick, Store traverses all its peers and calls each peer's RawNode tick function in turn, driving Raft forward.
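
A minimal sketch of this loop, with the Peer and Store types reduced to placeholders; the method name on_raft_base_tick is illustrative, not necessarily the actual store.rs function.

use std::collections::HashMap;

// Hypothetical minimal Peer wrapping something like a Raft RawNode.
struct Peer;

impl Peer {
    fn tick(&mut self) {
        // In the real code this calls the underlying RawNode's tick(),
        // driving election and heartbeat timeouts for this Raft group.
    }
}

struct Store {
    // region id -> the peer replica living on this store.
    region_peers: HashMap<u64, Peer>,
}

impl Store {
    // Invoked by the Mio timer roughly every 100ms: tick every peer so
    // that all Raft groups on this store make progress.
    fn on_raft_base_tick(&mut self) {
        for peer in self.region_peers.values_mut() {
            peer.tick();
        }
    }
}

fn main() {
    let mut store = Store { region_peers: HashMap::new() };
    store.region_peers.insert(1, Peer);
    store.on_raft_base_tick();
}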

Store accepts requests from external clients, as well as Raft messages sent by other stores, through Mio's notify mechanism. For example, after receiving a Msg::RaftCmd message, Store calls propose_raft_command to handle it; after receiving a Msg::RaftMessage message, Store calls on_raft_message.

At the end of each EventLoop iteration, that is, in Mio's tick callback, Store handles on_raft_ready:

    1. Store iterates over all ready peers and calls handle_raft_ready_append for each; we use a single WriteBatch to handle all the ready append data and keep the related results.

    2. If the WriteBatch succeeds, post_raft_ready_append is called, which is mainly used to handle follower messages (leader messages were already sent in handle_raft_ready_append).

    3. Store then calls handle_raft_ready_apply to apply the related committed entries in turn, and finally calls on_ready_result to process the final result.

Server

The server layer is TiKV's network layer. At this stage TiKV uses Mio to implement all network handling, and the network protocol is a custom one, as follows:

message = header + body

header: | 0xdaf4 (2-byte magic value) | 0x01 (2-byte version) | msg_len (4 bytes) | msg_id (8 bytes) |

Every message uses this header + body layout. The body is the actual message data, encoded with protobuf. The header begins with a two-byte magic value, 0xdaf4, followed by the version number, then the length of the message and the message's unique ID.
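
A minimal sketch of how such a header could be encoded. The constants come from the text above, but whether msg_len covers the header and the exact byte order are assumptions made for illustration; this is not the actual TiKV codec.

const MSG_MAGIC: u16 = 0xdaf4;
const MSG_VERSION: u16 = 0x01;

fn encode_message(msg_id: u64, body: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(16 + body.len());
    buf.extend_from_slice(&MSG_MAGIC.to_be_bytes());           // 2-byte magic value
    buf.extend_from_slice(&MSG_VERSION.to_be_bytes());         // 2-byte version
    buf.extend_from_slice(&(body.len() as u32).to_be_bytes()); // 4-byte msg_len (body only, assumed)
    buf.extend_from_slice(&msg_id.to_be_bytes());              // 8-byte msg_id
    buf.extend_from_slice(body);                               // protobuf-encoded body
    buf
}

fn main() {
    let framed = encode_message(1, b"protobuf bytes");
    println!("total frame size: {} bytes", framed.len());
}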

On Linux, Mio is a wrapper around epoll, so users familiar with epoll should find it easy to use Mio for network development. The simple flow is as follows:

    • Bind a port, create a TcpListener object, and register it with Mio.

    • In the TcpListener's on_readable callback, call accept to obtain the newly created TcpStream socket and register it with Mio; we use this TcpStream to interact with the client afterwards.

    • The TcpStream handles the on_readable and on_writable callbacks.

At the same time, the server accepts messages from outside through Mio's notify mechanism. For example, in TiKV's Transport implementation, when Peer calls send, the message is sent to the server directly through a channel and handled inside notify, where the connection to the corresponding store is found and the message is sent to that remote store.

For snapshots, the server opens a new connection and sends them synchronously on a dedicated thread, so the code logic is much simpler and there is no need to handle much asynchronous IO. On the receiving end, when a message arrives, we first look at its type; if it is a snapshot, we enter the snapshot-receiving flow and hand the received data directly to the snapshot-related thread, which writes it into the corresponding snapshot file. Any other message is dispatched directly to its corresponding handling logic; see the server's on_conn_msg function.

Because the server only handles network IO, its logic is relatively simple and we won't explain it further here. However, given that TiKV currently uses a custom network protocol, which is not friendly for external clients to integrate with and lacks features such as pipelining and streaming, we will switch to gRPC later.

Summary

Here we have explained Multi Raft, the core of TiKV built on top of the Raft library. In the following chapters, we will describe transactions, the coprocessor, and how PD schedules the entire cluster.
(end of Part II)
