Design principles and use of the etcd raft library


Back in November 2013, when only a draft version of the raft paper could be downloaded from the Internet, I wrote a blog post with a brief analysis of it. Over the past four years, the raft protocol has been explained extensively, and raft has indeed been widely adopted. One of the best-known applications is etcd. etcd implements the raft protocol as a library, located at https://github.com/coreos/etcd/tree/master/raft , and then uses that library itself as an application.

This article does not explain the core of the raft protocol. Instead, it takes the perspective of a user of the etcd raft library and explains what you need to know in order to use it.

The library is somewhat cumbersome to use. There is an official usage example at https://github.com/coreos/etcd/tree/master/contrib/raftexample . Overall, the library implements the core of the raft protocol: the log-append logic, leader election, snapshots, and membership changes. One thing needs to be clear: the library does not implement sending and receiving messages over the network. It only places the messages to be sent in memory; a user-defined network transport layer must take them out and send them. On the receiving side, the transport must call a library function to pass each received message to the library, as explained in detail later. The library also defines a Storage interface that its users must implement themselves.
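The division of labor described above can be sketched with a toy in-process "network". This is only an illustration under stated assumptions: the `message`, `peer`, and `localTransport` types and the `step` and `send` functions are hypothetical stand-ins, not the library's API. The sketch shows an application-owned transport taking messages out of an in-memory outbox and handing each one to the receiving peer.

```go
package main

import "fmt"

// message is a hypothetical stand-in for pb.Message.
type message struct {
	From, To uint64
	Text     string
}

// peer stands in for one raft node. step plays the role of the library
// function that the receiving side calls to hand a message to raft.
type peer struct {
	id    uint64
	inbox []message
}

func (p *peer) step(m message) { p.inbox = append(p.inbox, m) }

// localTransport is a hypothetical in-process "network": it routes each
// outbound message to the peer named in its To field.
type localTransport struct {
	peers map[uint64]*peer
}

// send delivers msgs and reports how many had no known destination.
func (t *localTransport) send(msgs []message) (dropped int) {
	for _, m := range msgs {
		p, ok := t.peers[m.To]
		if !ok {
			dropped++ // unreachable peer: the message is simply lost
			continue
		}
		p.step(m)
	}
	return dropped
}

func main() {
	p2, p3 := &peer{id: 2}, &peer{id: 3}
	tr := &localTransport{peers: map[uint64]*peer{2: p2, 3: p3}}

	// The outbox is what the application would take out of the library.
	outbox := []message{
		{From: 1, To: 2, Text: "append"},
		{From: 1, To: 3, Text: "append"},
		{From: 1, To: 9, Text: "append"}, // no such peer in this demo
	}
	dropped := tr.send(outbox)
	fmt.Println(len(p2.inbox), len(p3.inbox), dropped)
}
```

A real transport would serialize the messages and send them over the network; the point here is only that routing and delivery live in application code.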

The storage interface is as follows:


// Storage is an interface that may be implemented by the application
// to retrieve log entries from storage.
//
// If any Storage method returns an error, the raft instance will
// become inoperable and refuse to participate in elections; the
// application is responsible for cleanup and recovery in this case.
type Storage interface {
    // InitialState returns the saved HardState and ConfState information.
    InitialState() (pb.HardState, pb.ConfState, error)
    // Entries returns a slice of log entries in the range [lo,hi).
    // MaxSize limits the total size of the log entries returned, but
    // Entries returns at least one entry if any.
    Entries(lo, hi, maxSize uint64) ([]pb.Entry, error)
    // Term returns the term of entry i, which must be in the range
    // [FirstIndex()-1, LastIndex()]. The term of the entry before
    // FirstIndex is retained for matching purposes even though the
    // rest of that entry may not be available.
    Term(i uint64) (uint64, error)
    // LastIndex returns the index of the last entry in the log.
    LastIndex() (uint64, error)
    // FirstIndex returns the index of the first log entry that is
    // possibly available via Entries (older entries have been incorporated
    // into the latest Snapshot; if storage only contains the dummy entry the
    // first log entry is not available).
    FirstIndex() (uint64, error)
    // Snapshot returns the most recent snapshot.
    // If snapshot is temporarily unavailable, it should return ErrSnapshotTemporarilyUnavailable,
    // so raft could know that Storage needs some time to prepare
    // snapshot and call Snapshot later.
    Snapshot() (pb.Snapshot, error)
}

These methods are called by the library. Anyone familiar with the raft protocol should find them easy to understand. The official example mentioned above, https://github.com/coreos/etcd/tree/master/contrib/raftexample , uses the library's own MemoryStorage together with etcd's wal and snap packages for persistence. On restart, it reads the log back from wal and snap to rebuild the MemoryStorage.
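To make the index conventions in the interface comments concrete, here is a minimal in-memory store. It is not the library's MemoryStorage. The `entry` and `memStore` types and the two error values are hypothetical simplifications, with no HardState, snapshot, or maxSize handling. It assumes only what the comments state: a dummy entry precedes the first real entry, its term is retained, and Entries covers the half-open range [lo, hi).

```go
package main

import (
	"errors"
	"fmt"
)

// entry is a hypothetical stand-in for pb.Entry.
type entry struct {
	Index uint64
	Term  uint64
	Data  []byte
}

var (
	errCompacted   = errors.New("requested index is unavailable due to compaction")
	errUnavailable = errors.New("requested entry is not yet available")
)

// memStore keeps the log in a slice. ents[0] is a dummy entry that records
// the index and term of the entry just before the first available one.
type memStore struct {
	ents []entry
}

func newMemStore() *memStore {
	return &memStore{ents: []entry{{}}} // dummy entry: index 0, term 0
}

// FirstIndex is the index of the first entry that Entries can return.
func (s *memStore) FirstIndex() uint64 { return s.ents[0].Index + 1 }

// LastIndex is the index of the last entry in the log.
func (s *memStore) LastIndex() uint64 {
	return s.ents[0].Index + uint64(len(s.ents)) - 1
}

// Term returns the term of entry i, for i in [FirstIndex()-1, LastIndex()].
func (s *memStore) Term(i uint64) (uint64, error) {
	offset := s.ents[0].Index
	if i < offset {
		return 0, errCompacted
	}
	if i > s.LastIndex() {
		return 0, errUnavailable
	}
	return s.ents[i-offset].Term, nil
}

// Entries returns the entries in the half-open range [lo, hi).
func (s *memStore) Entries(lo, hi uint64) ([]entry, error) {
	offset := s.ents[0].Index
	if lo <= offset {
		return nil, errCompacted
	}
	if hi > s.LastIndex()+1 || lo > hi {
		return nil, errUnavailable
	}
	return s.ents[lo-offset : hi-offset], nil
}

// Append adds entries that directly continue the log. Overwriting
// conflicting entries is deliberately left out of this sketch.
func (s *memStore) Append(es ...entry) error {
	for _, e := range es {
		if want := s.LastIndex() + 1; e.Index != want {
			return fmt.Errorf("gap in log: got index %d, want %d", e.Index, want)
		}
		s.ents = append(s.ents, e)
	}
	return nil
}

func main() {
	s := newMemStore()
	if err := s.Append(entry{Index: 1, Term: 1}, entry{Index: 2, Term: 1}, entry{Index: 3, Term: 2}); err != nil {
		fmt.Println("append failed:", err)
		return
	}
	ents, err := s.Entries(2, 4) // entries 2 and 3
	fmt.Println(s.FirstIndex(), s.LastIndex(), len(ents), err)
}
```

Keeping the dummy entry in slot 0 is what lets Term answer for FirstIndex()-1 without special cases.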

For a service this I/O- and network-intensive, the best way to improve throughput is batching. The etcd raft library does exactly that.

Here is the core abstraction that etcd raft provides for this, the Ready struct:

// Ready encapsulates the entries and messages that are ready to read,
// be saved to stable storage, committed or sent to other peers.
// All fields in Ready are read-only.
type Ready struct {
    // The current volatile state of a Node.
    // SoftState will be nil if there is no update.
    // It is not required to consume or store SoftState.
    *SoftState

    // The current state of a Node to be saved to stable storage BEFORE
    // Messages are sent.
    // HardState will be equal to empty state if there is no update.
    pb.HardState

    // ReadStates can be used for node to serve linearizable read requests locally
    // if its applied index is greater than the index in ReadState.
    // Note that the readState will be returned when raft receives msgReadIndex.
    // The returned is only valid for the request that requested to read.
    ReadStates []ReadState

    // Entries specifies entries to be saved to stable storage BEFORE
    // Messages are sent.
    Entries []pb.Entry

    // Snapshot specifies the snapshot to be saved to stable storage.
    Snapshot pb.Snapshot

    // CommittedEntries specifies entries to be committed to a
    // store/state-machine. These have previously been committed to stable
    // store.
    CommittedEntries []pb.Entry

    // Messages specifies outbound messages to be sent AFTER Entries are
    // committed to stable storage.
    // If it contains a MsgSnap message, the application MUST report back to raft
    // when the snapshot has been received or has failed by calling ReportSnapshot.
    Messages []pb.Message

    // MustSync indicates whether the HardState and Entries must be synchronously
    // written to disk or if an asynchronous write is permissible.
    MustSync bool
}

In other words, the Ready struct encapsulates a batch of updates, including:

    • pb.HardState: contains the latest term the current node has seen, whom the node voted for in that term, and the commit index known to the current node
    • Entries: log entries that need to be persisted to stable storage
    • Messages: messages that need to be sent to other peers
    • CommittedEntries: entries that have been committed but not yet applied to the state machine
    • Snapshot: a snapshot that needs to be persisted

The library's user keeps receiving Ready values from a channel provided by the node struct and processes each one. The channel is obtained through the following method:

func (n *node) Ready() <-chan Ready { return n.readyc }

To process a Ready, the application needs to:

    1. Persist HardState, Entries, and Snapshot to storage.
    2. Send Messages (the msgs mentioned above) to the other peers without blocking.
    3. Apply CommittedEntries (committed but not yet applied) to the state machine.
    4. If CommittedEntries contains an entry of the membership-change type, call node's ApplyConfChange() method to let the node know. (This differs from the raft paper, where a node applies a membership change as soon as it receives the log entry.)
    5. Call node.Advance() to tell the raft node that this batch of state updates has been processed, the state has moved forward, and the next Ready batch can be handed over.
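The steps above can be sketched as a small loop. This is a self-contained simulation, not code that uses the library. The `ready`, `entry`, `message`, and `app` types and the `fakeNode` and `process` functions are hypothetical stand-ins. Step 4 (membership changes) is left out to keep it short. The sketch shows the ordering (persist, then send, then apply, then acknowledge) and the one-batch-at-a-time handshake over two unbuffered channels.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for pb.Entry, pb.Message and Ready.
type entry struct {
	Index uint64
	Data  string
}

type message struct {
	To   uint64
	Text string
}

type ready struct {
	Term             uint64 // stands in for pb.HardState
	Entries          []entry
	CommittedEntries []entry
	Messages         []message
}

// app records what the loop did so the effect of each step is visible.
type app struct {
	term      uint64
	persisted []entry   // step 1: pretend stable storage
	sent      []message // step 2: handed to the transport
	applied   []string  // step 3: pretend state machine
	batches   int
}

// run receives batches until readyc is closed and acknowledges each one.
func (a *app) run(readyc <-chan ready, advancec chan<- struct{}) {
	for rd := range readyc {
		// 1. Persist HardState and Entries first.
		if rd.Term != 0 {
			a.term = rd.Term
		}
		a.persisted = append(a.persisted, rd.Entries...)
		// 2. Only then hand the messages to the transport.
		a.sent = append(a.sent, rd.Messages...)
		// 3. Apply committed entries to the state machine.
		for _, e := range rd.CommittedEntries {
			a.applied = append(a.applied, e.Data)
		}
		// 5. Acknowledge the batch so the next one can be produced.
		a.batches++
		advancec <- struct{}{}
	}
}

// fakeNode plays the part of node.run(): one batch out, one acknowledgement in.
func fakeNode(batches []ready, readyc chan<- ready, advancec <-chan struct{}) {
	for _, rd := range batches {
		readyc <- rd
		<-advancec
	}
	close(readyc)
}

// process wires both sides together with unbuffered channels.
func process(batches []ready) *app {
	readyc, advancec := make(chan ready), make(chan struct{})
	go fakeNode(batches, readyc, advancec)
	a := &app{}
	a.run(readyc, advancec)
	return a
}

func main() {
	a := process([]ready{
		{
			Term:     1,
			Entries:  []entry{{Index: 1, Data: "x=1"}},
			Messages: []message{{To: 2, Text: "append 1"}, {To: 3, Text: "append 1"}},
		},
		{CommittedEntries: []entry{{Index: 1, Data: "x=1"}}},
	})
	fmt.Println(a.term, len(a.persisted), len(a.sent), a.applied, a.batches)
}
```

Because both channels are unbuffered, the producer cannot emit a second batch until the first has been acknowledged, which is the behavior the Advance() step relies on.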

The application starts a raft replica by calling raft.StartNode(). Inside, this function starts a goroutine that runs

func (n *node) run(r *raft)

to provide the service.

The application calls

func (n *node) Propose(ctx context.Context, data []byte) error

to propose a request to raft. The call returns once raft has begun processing the request, not once the request has been committed.
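A minimal sketch of that hand-off, assuming an unbuffered proposal channel and a context for cancellation. The `miniNode` type, the `propose` and `proposeWithin` functions, `errStopped`, and `demoTimeout` are hypothetical names for illustration and are not the library's implementation. The call succeeds as soon as some receiver takes the data, and otherwise gives up when the context expires or the node has stopped.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errStopped = errors.New("node stopped")

// demoTimeout is an arbitrary deadline used only by this demo.
const demoTimeout = 100 * time.Millisecond

// miniNode is a hypothetical stand-in for the library's node struct.
type miniNode struct {
	propc chan []byte   // unbuffered: a send completes only when a receiver takes it
	done  chan struct{} // closed once the run loop has exited
}

func newMiniNode() *miniNode {
	return &miniNode{propc: make(chan []byte), done: make(chan struct{})}
}

// propose hands data to the run loop. A nil result means "accepted for
// processing", not "committed".
func (n *miniNode) propose(ctx context.Context, data []byte) error {
	select {
	case n.propc <- data:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	case <-n.done:
		return errStopped
	}
}

// proposeWithin is a convenience wrapper that gives up after timeout.
func proposeWithin(n *miniNode, data string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return n.propose(ctx, []byte(data))
}

func main() {
	n := newMiniNode()

	// Nobody is receiving from propc yet, so the proposal times out.
	fmt.Println(proposeWithin(n, "x=1", demoTimeout))

	// With a goroutine standing in for the run loop, the call succeeds.
	got := make(chan []byte, 1)
	go func() { got <- <-n.propc }()
	err := proposeWithin(n, "x=2", demoTimeout)
	fmt.Println(err, string(<-got))

	// After the node has stopped, callers are released immediately.
	close(n.done)
	fmt.Println(proposeWithin(n, "x=3", demoTimeout))
}
```

The application learns that a proposal was committed only later, when the corresponding entry shows up in CommittedEntries.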

Nodes are added or removed by calling

func (n *node) ProposeConfChange(ctx context.Context, cc pb.ConfChange) error

The node struct contains several important channels:

// node is the canonical implementation of the Node interface
type node struct {
    propc      chan pb.Message
    recvc      chan pb.Message
    confc      chan pb.ConfChange
    confstatec chan pb.ConfState
    readyc     chan Ready
    advancec   chan struct{}
    tickc      chan struct{}
    done       chan struct{}
    stop       chan struct{}
    status     chan chan Status

    logger Logger
}

    • propc: propc is an unbuffered channel. Data that the application writes through the Propose interface is wrapped in a Message and pushed to propc. node's run method takes the Message from propc, appends it to its own raft log, and puts the resulting messages into the mailbox (msgs []pb.Message in the raft struct). These msgs are packed into a Ready, which the application receives from readyc and sends out through its custom transport.
    • recvc: after the application's custom transport receives a message, it needs to call
func (n *node) Step(ctx context.Context, m pb.Message) error

to put the message into recvc. After some processing, the messages destined for the corresponding peers are placed in the mailbox and are later sent out via the custom transport.

    • readyc/advancec: both readyc and advancec are unbuffered channels. node.run() packs the relevant state updates (including the msgs mentioned above) into a Ready struct and puts it into readyc. The application takes the Ready from readyc, handles the state it contains, and when it is done calls
rc.node.Advance()

which pushes an empty struct into advancec to tell raft that the state contained in this Ready batch has been handled. Once notified through advancec, node.run() updates some internal state. For example, entries that have been persisted to storage are removed from memory (see the unstable struct).

    • tickc: the application periodically pushes an empty struct into tickc (via node's Tick() method), and node.run() then invokes the tick() function. On the leader, tick() triggers heartbeats to the other peers; on a follower, it checks whether a leader election should be started.
    • confc/confstatec: when the application takes CommittedEntries out of a Ready and finds an entry of the membership-change type, it needs to call
func (n *node) ApplyConfChange(cc pb.ConfChange) *pb.ConfState

This function pushes the ConfChange to confc, which is also an unbuffered channel. node.run() takes the ConfChange from confc and performs the actual addition or removal of peers. The latest membership is then pushed to confstatec, and ApplyConfChange receives it from confstatec and returns it to the application.
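The channel pattern described in this list can be condensed into a toy run loop. Everything here is a hypothetical simplification for illustration: `miniNode`, `confChange`, `status`, and the method names are not the library's types, and the election logic is reduced to a counter. The sketch shows a single goroutine that owns all state and is reached only through unbuffered channels, including the request/reply pair behind a membership change and a status query answered over a channel of channels.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical, simplified stand-ins for pb.ConfChange and Status.
type confChange struct {
	Add    bool
	NodeID uint64
}

type status struct{ Elapsed, Campaigns int }

type miniNode struct {
	tickc      chan struct{}
	confc      chan confChange
	confstatec chan []uint64
	statusc    chan chan status
	stop, done chan struct{}
}

func newMiniNode() *miniNode {
	return &miniNode{
		tickc: make(chan struct{}), confc: make(chan confChange),
		confstatec: make(chan []uint64), statusc: make(chan chan status),
		stop: make(chan struct{}), done: make(chan struct{}),
	}
}

// run owns all state; other goroutines reach it only through channels.
func (n *miniNode) run(peers []uint64, electionTimeout int) {
	defer close(n.done)
	members := map[uint64]bool{}
	for _, id := range peers {
		members[id] = true
	}
	elapsed, campaigns := 0, 0
	for {
		select {
		case <-n.tickc:
			elapsed++
			if elapsed >= electionTimeout {
				elapsed = 0
				campaigns++ // a follower would start an election here
			}
		case cc := <-n.confc:
			if cc.Add {
				members[cc.NodeID] = true
			} else {
				delete(members, cc.NodeID)
			}
			ids := make([]uint64, 0, len(members))
			for id := range members {
				ids = append(ids, id)
			}
			sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
			n.confstatec <- ids // reply with the new membership
		case c := <-n.statusc:
			c <- status{Elapsed: elapsed, Campaigns: campaigns}
		case <-n.stop:
			return
		}
	}
}

// The callers below block forever if the loop has stopped; a production
// version would also select on done.
func (n *miniNode) tick() { n.tickc <- struct{}{} }

// applyConfChange sends the change and waits for the resulting membership.
func (n *miniNode) applyConfChange(cc confChange) []uint64 {
	n.confc <- cc
	return <-n.confstatec
}

func (n *miniNode) getStatus() status {
	c := make(chan status)
	n.statusc <- c
	return <-c
}

func main() {
	n := newMiniNode()
	go n.run([]uint64{1, 2, 3}, 3)
	for i := 0; i < 4; i++ {
		n.tick()
	}
	fmt.Println(n.getStatus())
	fmt.Println(n.applyConfChange(confChange{Add: true, NodeID: 4}))
	fmt.Println(n.applyConfChange(confChange{Add: false, NodeID: 2}))
	close(n.stop)
	<-n.done
}
```

Because every request passes through the same select loop, the status query is answered only after all earlier ticks have been processed, so no mutex is needed.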

As you can see, using the etcd raft library still requires knowing quite a lot.
