Etcd Raft Source Code Analysis, Part 2: The Election Process
### 1.6 node tick and raft's tickElection

> Unless stated otherwise, everything in this section lives in raft/raft.go (a method receiver of `r *raft` tells you that you are in raft.go).

node's tick() method calls tick() on the raft struct in raft/raft.go. In section 1.2, raft.becomeFollower() set the raft struct's step function and set its tick function to tickElection. At the end of the previous section, node.run() received a message from the n.tickc channel and called raft.tick(), which in effect calls raft.tickElection().

```go
// tickElection is run by followers and candidates after r.electionTimeout.
func (r *raft) tickElection() {
	r.electionElapsed++

	if r.promotable() && r.pastElectionTimeout() {
		r.electionElapsed = 0
		r.Step(pb.Message{From: r.id, Type: pb.MsgHup})
	}
}
```

When a Follower or Candidate passes its election timeout, it sends itself a message of type MsgHup, which in turn leads to a call to r.campaign().

> If the message type is not MsgHup, MsgVote, or MsgPreVote, r.step(r, m) is called instead. For example, becomeFollower earlier set r.step = stepFollower, so this is where stepFollower() actually gets invoked. Looking at stepFollower(), you can see that the message types it handles do not include MsgHup, MsgVote, or MsgPreVote.

becomeFollower was called when raft.Node was first started; raft's Term is initially 0 and is later updated to 1. A pb.Message of type MsgHup is built without a Term, so its Term is 0: it takes the m.Term == 0 branch below and then continues into the pb.MsgHup branch.

> Note: the first switch block checks m.Term == 0 first. Apart from that case, when m.Term == r.Term none of the cases in the first switch match, so execution also reaches the second switch block.

```go
func (r *raft) Step(m pb.Message) error {
	// Handle the message term, which may result in our stepping down to a follower.
	switch {
	case m.Term == 0:
		// local message
	case m.Term > r.Term:
		...
	case m.Term < r.Term:
		...
		return nil
	}

	switch m.Type {
	case pb.MsgHup:
		if r.state != StateLeader {
			ents, err := r.raftLog.slice(r.raftLog.applied+1, r.raftLog.committed+1, noLimit)
			// (the handling of ents and err is elided in this excerpt)
			if r.preVote {
				r.campaign(campaignPreElection)
			} else {
				r.campaign(campaignElection)
			}
		} else {
			r.logger.Debugf("%x ignoring MsgHup because already leader", r.id)
		}
	case pb.MsgVote, pb.MsgPreVote:
		...
	default:
		// r.step must already have been assigned (r.step = xxx);
		// this call is what actually runs that function.
		r.step(r, m)
	}
	return nil
}
```

To recap: when raft calls becomeFollower, its state is updated to StateFollower (the code for becoming the other two roles is listed below as well):

- Becoming a Follower: the election timer starts, the tick function is tickElection, and once electionTimeout expires the node becomes a Candidate.
- Becoming a Candidate: the node increments its Term and votes for itself; the tick function is tickElection.
- Becoming a Leader: the tick function is tickHeartbeat, which periodically sends heartbeats to the Followers.

```go
// raft/raft.go
func (r *raft) becomeFollower(term uint64, lead uint64) {
	r.step = stepFollower
	r.reset(term)
	r.tick = r.tickElection
	r.lead = lead
	r.state = StateFollower
	r.logger.Infof("%x became follower at term %d", r.id, r.Term)
}

func (r *raft) becomeCandidate() {
	r.step = stepCandidate
	// Becoming a candidate increments Term by 1.
	r.reset(r.Term + 1)
	r.tick = r.tickElection
	// Vote for ourselves.
	r.Vote = r.id
	r.state = StateCandidate
	r.logger.Infof("%x became candidate at term %d", r.id, r.Term)
}

func (r *raft) becomeLeader() {
	r.step = stepLeader
	r.reset(r.Term)
	r.tick = r.tickHeartbeat
	r.lead = r.id
	r.state = StateLeader
	r.pendingConfIndex = r.raftLog.lastIndex()
	r.appendEntry(pb.Entry{Data: nil})
	r.logger.Infof("%x became leader at term %d", r.id, r.Term)
}
```

### 1.7 Campaigning for Leader

After electionTimeout expires, a Follower campaigns and becomes a Candidate. For now we ignore the two-phase (PreVote) variant:

1. Call becomeCandidate and set the vote message type to MsgVote.
2. If a majority of votes has already been obtained, call becomeLeader().
3. Send voteMsg to every other node.

The first time a Follower in a multi-node cluster runs campaign, the votes collected in step 2 (only its own) cannot be a majority, so it goes on to send MsgVote messages to the other nodes.

```go
func (r *raft) campaign(t CampaignType) {
	var term uint64
	var voteMsg pb.MessageType
	if t == campaignPreElection {
		r.becomePreCandidate()
		voteMsg = pb.MsgPreVote
		// PreVote RPCs are sent for the next term before we've incremented r.Term.
		term = r.Term + 1
	} else {
		r.becomeCandidate()
		voteMsg = pb.MsgVote
		term = r.Term
	}
	if r.quorum() == r.poll(r.id, voteRespMsgType(voteMsg), true) {
		// We won the election after voting for ourselves (which must mean that
		// this is a single-node cluster). Advance to the next state.
		if t == campaignPreElection {
			r.campaign(campaignElection)
		} else {
			r.becomeLeader()
		}
		return
	}
	for id := range r.prs {
		// No need to send a message to ourselves.
		if id == r.id {
			continue
		}
		var ctx []byte
		if t == campaignTransfer {
			ctx = []byte(t)
		}
		r.send(pb.Message{Term: term, To: id, Type: voteMsg, Index: r.raftLog.lastIndex(), LogTerm: r.raftLog.lastTerm(), Context: ctx})
	}
}
```

After the Candidate has sent MsgVote to the other nodes (the Followers), it becomes Leader once it has received votes from a majority of them. Next we look at how a Follower handles the MsgVote request sent by the Candidate. This involves an RPC call, which in Etcd is implemented by rafthttp.

### 1.8 Raft HTTP

Every EtcdServer has an HTTP server that receives messages sent by other nodes and returns responses to the sender:

```go
// etcdserver/api/rafthttp/http.go
func (h *pipelineHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	var m raftpb.Message
	// (reading the request body and decoding it into m is elided in this excerpt)
	if err := h.r.Process(context.TODO(), m); err != nil {
		switch v := err.(type) {
		case writerToResponse:
			v.WriteTo(w)
		}
		return
	}
}
```

`EtcdServer` implements the `Raft` interface from etcdserver/api/rafthttp/transport.go:

```go
// etcdserver/api/rafthttp/transport.go
type Raft interface {
	Process(ctx context.Context, m raftpb.Message) error
	IsIDRemoved(id uint64) bool
	ReportUnreachable(id uint64)
	ReportSnapshot(id uint64, status raft.SnapshotStatus)
}
```

> Q: s.r is the raftNode from etcdserver/raft.go, which is a struct, whereas the Step method is declared on the Node interface in raft/node.go. So how does calling Step on the raftNode struct end up in the Node interface's Step method?
>
> A: First look at the Node interface and the node struct in node.go. The node struct implements every method of the Node interface, so node can be regarded as the implementation of Node. The raftNode struct does not declare a raft.Node field itself, but its raftNodeConfig member does! This syntax is called embedding an interface in a struct.

EtcdServer's Process method calls the Step method of the Node interface in raft/node.go (implemented by the node struct in the same file):

```go
// etcdserver/server.go
func (s *EtcdServer) Process(ctx context.Context, m raftpb.Message) error {
	return s.r.Step(ctx, m)
}

// raft/node.go
func (n *node) Step(ctx context.Context, m pb.Message) error {
	// ignore unexpected local messages receiving over network
	if IsLocalMsg(m.Type) {
		return nil
	}
	return n.step(ctx, m)
}

func (n *node) step(ctx context.Context, m pb.Message) error {
	return n.stepWithWaitOption(ctx, m, false)
}

// Step advances the state machine using msgs.
// The ctx.Err() will be returned, if any.
func (n *node) stepWithWaitOption(ctx context.Context, m pb.Message, wait bool) error {
	if m.Type != pb.MsgProp {
		select {
		case n.recvc <- m:
			return nil
		case <-ctx.Done():
			return ctx.Err()
		case <-n.done:
			return ErrStopped
		}
	}
	...
	return nil
}
```

The Candidate sends its request to the Follower nodes; when a Follower receives the request, it puts the message on the node's recvc channel. Note that every node in an Etcd cluster starts a raftNode and a raft.Node, and every one of them runs node.run().

```go
// raft/node.go
func (n *node) run(r *raft) {
	...
	for {
		select {
		...
		case <-n.tickc:
			r.tick()
		case m := <-n.recvc:
			// filter out response message from unknown From.
			if pr := r.getProgress(m.From); pr != nil || !IsResponseMsg(m.Type) {
				r.Step(m)
			}
		}
	}
}
```

Recall that the Candidate earlier received the timer (ElectionTimeout) expiry from n.tickc, reached raft.go's `Step()` method through tickElection(), then campaigned and sent MsgVote to the Follower nodes. Here the Follower receives the MsgVote request, reads the message from n.recvc, and also calls `r.Step(m)`. The two nodes have different roles, yet both call the same raft.Step() method; the logic each one runs inside it is of course different.

### 1.9 The Follower Receives MsgVote and Votes

The message m that the Follower receives has Term=1, while its own raft.Term=0, so the steps are:

- m.Term > r.Term, so becomeFollower() is called.
- A MsgVoteResp is returned to the Candidate.

```go
// raft/raft.go
func (r *raft) Step(m pb.Message) error {
	// Handle the message term, which may result in our stepping down to a follower.
	switch {
	case m.Term == 0:
		// local message
	case m.Term > r.Term:
		switch {
		default:
			if m.Type == pb.MsgApp || m.Type == pb.MsgHeartbeat || m.Type == pb.MsgSnap {
				r.becomeFollower(m.Term, m.From)
			} else {
				// The Follower received a Candidate's MsgVote request and becomes a
				// Follower again (updating its state). As before, this only sets raft's
				// step and tick functions; the Follower logic itself has not run yet.
				r.becomeFollower(m.Term, None)
			}
		}
	case m.Term < r.Term:
		...
		return nil
	}

	switch m.Type {
	case pb.MsgHup:
		...
	case pb.MsgVote, pb.MsgPreVote:
		canVote := r.Vote == m.From || // We can vote if this is a repeat of a vote we've already cast...
			(r.Vote == None && r.lead == None) || // ...we haven't voted and we don't think there's a leader yet in this term...
			(m.Type == pb.MsgPreVote && m.Term > r.Term) // ...or this is a PreVote for a future term...
		if canVote && r.raftLog.isUpToDate(m.Index, m.LogTerm) { // ...and we believe the candidate is up to date.
			// The response message type is the matching vote response.
			r.send(pb.Message{To: m.From, Term: m.Term, Type: voteRespMsgType(m.Type)})
			if m.Type == pb.MsgVote {
				// Only record real votes.
				r.electionElapsed = 0 // reset the election counter
				r.Vote = m.From       // vote for the sender of this message, i.e. the Candidate
			}
		} else {
			// Reject the vote.
			r.send(pb.Message{To: m.From, Term: r.Term, Type: voteRespMsgType(m.Type), Reject: true})
		}
	default:
		err := r.step(r, m)
		if err != nil {
			return err
		}
	}
	return nil
}
```

When the Follower returns MsgVoteResp, its response to MsgVote, to the Candidate, the Candidate handles it much as the Follower handled the incoming message: it also calls r.Step().

### 1.10 The Candidate Becomes Leader

Because the Term of the message returned by the Follower equals the Term of the message the Candidate sent, execution goes straight to the default branch of the second switch:

```go
// raft/raft.go
func (r *raft) Step(m pb.Message) error {
	// Handle the message term, which may result in our stepping down to a follower.
	switch {
	case m.Term == 0:
		// local message
	case m.Term > r.Term:
		...
	case m.Term < r.Term:
		...
		return nil
	}

	switch m.Type {
	case pb.MsgHup:
		...
	case pb.MsgVote, pb.MsgPreVote:
		...
	default:
		// Calls raft's stepFunc, i.e. the stepCandidate that becomeCandidate set in section 1.7.
		err := r.step(r, m)
		if err != nil {
			return err
		}
	}
	return nil
}
```

None of the Step(m) calls traced so far reached the default branch. Here r.step(r, m) is finally called, and it resolves to stepCandidate:

```go
// stepCandidate is shared by StateCandidate and StatePreCandidate; the difference is
// whether they respond to MsgVoteResp or MsgPreVoteResp.
func stepCandidate(r *raft, m pb.Message) error {
	// Only handle vote responses corresponding to our candidacy (while in
	// StateCandidate, we may get stale MsgPreVoteResp messages in this term from
	// our pre-candidate state).
	var myVoteRespType pb.MessageType
	if r.state == StatePreCandidate {
		myVoteRespType = pb.MsgPreVoteResp
	} else {
		myVoteRespType = pb.MsgVoteResp
	}
	switch m.Type {
	case pb.MsgProp:
		r.logger.Infof("%x no leader at term %d; dropping proposal", r.id, r.Term)
		return ErrProposalDropped
	case pb.MsgApp:
		r.becomeFollower(m.Term, m.From) // always m.Term == r.Term
		r.handleAppendEntries(m)
	case pb.MsgHeartbeat:
		r.becomeFollower(m.Term, m.From) // always m.Term == r.Term
		r.handleHeartbeat(m)
	case pb.MsgSnap:
		r.becomeFollower(m.Term, m.From) // always m.Term == r.Term
		r.handleSnapshot(m)
	case myVoteRespType:
		// The Candidate received a Follower's vote response; tally the votes.
		gr := r.poll(m.From, m.Type, !m.Reject)
		r.logger.Infof("%x [quorum:%d] has received %d %s votes and %d vote rejections", r.id, r.quorum(), gr, m.Type, len(r.votes)-gr)
		switch r.quorum() {
		case gr:
			if r.state == StatePreCandidate {
				r.campaign(campaignElection)
			} else {
				r.becomeLeader() // enough votes: become Leader
				r.bcastAppend()  // send MsgApp requests to the other Follower nodes
			}
		case len(r.votes) - gr:
			// pb.MsgPreVoteResp contains future term of pre-candidate
			// m.Term > r.Term; reuse r.Term
			r.becomeFollower(r.Term, None)
		}
	case pb.MsgTimeoutNow:
		r.logger.Debugf("%x [term %d state %v] ignored MsgTimeoutNow from %x", r.id, r.Term, r.state, m.From)
	}
	return nil
}
```

To summarize: whenever a node receives an HTTP request it calls raft.Step(m), and if the message type is not MsgHup, MsgVote, or MsgPreVote, r.step(r, m) is called. raft's stepFunc is always set before any HTTP request is sent or received; it is assigned inside the three methods becomeFollower, becomeCandidate, and becomeLeader.

A: Follower1 turning into a Candidate:

1. becomeFollower is called, setting r.step = stepFollower.
2. The ticker's electionTimeout expires and becomeCandidate is called, setting r.step = stepCandidate.
3. campaign is called, sending MsgVote to all other nodes.

B: Follower2:

1. becomeFollower is called, setting r.step = stepFollower.
2. It receives the Candidate's MsgVote request, votes for the Candidate, and returns MsgVoteResp.

C: The Candidate turning into the Leader:

1. It receives the MsgVoteResp sent by a Follower and calls the r.step function, i.e. the stepCandidate from A:2.
2. It counts the votes; if it has a majority, it calls becomeLeader, setting r.step = stepLeader.
3. It sends MsgApp requests to the other Follower nodes.

D: Follower2:

1. It receives the Leader's MsgApp request and calls the r.step function, i.e. the stepFollower from B:1.
2. It handles the AppendEntries request and returns a MsgAppResp response to the Leader.

The diagram below shows the Candidate sending MsgVote to a Follower and the Follower returning MsgVoteResp to the Candidate:

![](https://img-blog.csdn.net/20180909105546425?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3pxaHh1eXVhbg==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)
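
The election flow summarized above (timeout, campaign, vote, tally) can be tried out in isolation. The following is only a minimal sketch under stated assumptions, not etcd's implementation: every type and function name in it (`node`, `message`, `runElection`, and so on) is hypothetical, the "network" is an in-memory queue, and PreVote, the log up-to-date check, rejection quorums, and heartbeats are all left out. It keeps only the two-switch shape of Step: check the term first, then dispatch on the message type.

```go
package main

import "fmt"

// Toy model of the election flow; all names are hypothetical, not etcd's API.
type state int
type msgType int

const (
	follower state = iota
	candidate
	leader
)
const (
	msgHup msgType = iota
	msgVote
	msgVoteResp
)

type message struct {
	typ            msgType
	from, to, term uint64
	reject         bool
}

type node struct {
	id, term, vote            uint64 // vote == 0: no vote cast in this term
	state                     state
	elapsed, timeout, granted int // granted assumes one response per peer
	peers                     []uint64
	send                      func(message) // hands a message to the "network"
}

// tick plays the role of tickElection: on timeout, step a local msgHup.
func (n *node) tick() {
	n.elapsed++
	if n.state != leader && n.elapsed >= n.timeout {
		n.elapsed = 0
		n.step(message{typ: msgHup, from: n.id})
	}
}

// step plays the role of raft.Step: check the term, then dispatch on type.
func (n *node) step(m message) {
	switch {
	case m.term == 0: // local message
	case m.term > n.term: // newer term: become follower and clear the vote
		n.state, n.term, n.vote = follower, m.term, 0
	case m.term < n.term: // stale message: ignore it
		return
	}
	switch m.typ {
	case msgHup: // campaign: bump the term, vote for self, ask the others
		n.state, n.term, n.vote, n.granted = candidate, n.term+1, n.id, 1
		for _, p := range n.peers {
			if p != n.id {
				n.send(message{typ: msgVote, from: n.id, to: p, term: n.term})
			}
		}
	case msgVote:
		grant := n.vote == 0 || n.vote == m.from
		if grant {
			n.vote, n.elapsed = m.from, 0
		}
		n.send(message{typ: msgVoteResp, from: n.id, to: m.from, term: n.term, reject: !grant})
	case msgVoteResp:
		if n.state == candidate && !m.reject {
			if n.granted++; n.granted >= len(n.peers)/2+1 { // quorum reached
				n.state = leader
			}
		}
	}
}

// runElection lets node 1 of a 3-node cluster time out, then delivers messages in order.
func runElection() []*node {
	var queue []message
	send := func(m message) { queue = append(queue, m) }
	ids := []uint64{1, 2, 3}
	nodes := make([]*node, len(ids))
	for i, id := range ids {
		nodes[i] = &node{id: id, peers: ids, timeout: 3, send: send}
	}
	for i := 0; i < 3; i++ { // only node 1's ticker fires in this run
		nodes[0].tick()
	}
	for len(queue) > 0 {
		m := queue[0]
		queue = queue[1:]
		nodes[m.to-1].step(m) // node IDs start at 1
	}
	return nodes
}

func main() {
	for _, n := range runElection() {
		fmt.Printf("node %d: state=%d term=%d vote=%d\n", n.id, n.state, n.term, n.vote)
	}
}
```

In this run node 1 ends up as leader at term 1, while nodes 2 and 3 stay followers with their vote recorded for node 1. The second vote response arrives after node 1 has already won and is ignored, which corresponds to the step function having been swapped on becoming Leader.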