TiKV Source Parsing Series: Placement Driver


This series of articles is aimed at TiKV community developers and focuses on the TiKV system architecture, source code structure, and process analysis. The goal is to give developers a preliminary understanding of the TiKV project after reading, so that they can participate in TiKV development more easily.

TiKV is a distributed KV system that uses the Raft protocol to guarantee strong data consistency and supports distributed transactions through the MVCC + 2PC approach.

This article is the third installment in the series.

Introduction

Placement Driver (abbreviated as PD below) is the global central master node of TiDB. It is responsible for scheduling the whole cluster, generating global IDs, and generating the global timestamp TSO. PD also holds the meta information of the entire TiKV cluster and provides routing capabilities to clients.

As the central master node, PD embeds etcd and thus supports automatic failover, so there is no need to worry about a single point of failure. Through etcd's Raft, PD also guarantees strong consistency of its data, so data loss is not a concern.

Architecturally, all of the data PD knows about the cluster is proactively reported by TiKV. When PD operates on the TiKV cluster, it only returns the relevant commands in the response to the heartbeat that TiKV sends, and lets TiKV handle them itself, rather than actively pushing commands to TiKV. This design is very simple: we can regard PD as a stateless service (of course PD still persists some information to etcd). Since all operations are passively triggered, even if PD crashes, the newly elected PD leader can serve immediately without caring about any previous intermediate state.

Initialization

PD embeds etcd, so we usually need to start at least three replicas to keep the data safe. At present, PD supports two ways of bootstrapping a cluster: the static initial-cluster mode and the dynamic join mode.

Before going further, we need to understand etcd's ports. By default etcd listens on two ports, 2379 and 2380. Port 2379 is mainly used by etcd to handle external client requests, while 2380 is used for communication between etcd peers.

Suppose we now have three PDs, pd1, pd2, and pd3, running on host1, host2, and host3 respectively.

For static initialization, when starting the three PDs we directly set initial-cluster to pd1=http://host1:2380,pd2=http://host2:2380,pd3=http://host3:2380 .

For dynamic initialization, we first start pd1, then start pd2 and add it to the cluster of pd1 with join set to http://host1:2379 . Then we start pd3 and add it to the cluster formed by pd1 and pd2, again with join set to http://host1:2379 .

As you can see, static initialization and dynamic initialization use entirely different ports, and the two modes are mutually exclusive; we can only use one of them to initialize a cluster. Etcd itself only supports the initial-cluster way, but for convenience PD also provides the join way.

join mainly uses the member-related APIs that etcd itself provides, such as add member and list member, so it uses port 2379, because we need to send commands to an existing etcd for execution. initial-cluster is etcd's own initialization mechanism, so port 2380 is used.
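
To illustrate the kind of call join relies on, here is a minimal sketch (not PD's actual code) that uses etcd's official clientv3 package to add a new member; the hosts are the hypothetical ones from the example above, and the import path may differ depending on the etcd version:

    package main

    import (
        "context"
        "log"
        "time"

        "github.com/coreos/etcd/clientv3" // import path may vary by etcd version
    )

    func main() {
        // Connect to an existing member of the cluster on the client port 2379.
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"http://host1:2379"},
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        // Register the new member by its peer URL (port 2380); the new PD can
        // then start and join the cluster.
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        resp, err := cli.MemberAdd(ctx, []string{"http://host3:2380"})
        cancel()
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("added member %d", resp.Member.ID)
    }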

Compared with initial-cluster, join has to consider many more cases (the prepareJoinCluster function in server/join.go has a detailed explanation), but join is much more natural to use, so later we will consider removing the initial-cluster initialization scheme.

Election

When PD starts, we need to elect a leader to serve external requests. Although etcd itself already has a Raft leader, we still prefer to use our own leader; that is, the PD leader and etcd's own leader are not the same thing.

When PD starts, leader election proceeds as follows:

  1. Check whether the cluster already has a leader. If there is one, watch that leader; as soon as the leader is found to be gone, start again from step 1.

  2. If there is no leader, start campaigning: create a lessor and write the relevant information through etcd's transaction mechanism, as follows:

    // Create a lessor.
    ctx, cancel := context.WithTimeout(s.client.Ctx(), requestTimeout)
    leaseResp, err := lessor.Grant(ctx, s.cfg.LeaderLease)
    cancel()

    // The leader key must not exist, so the CreateRevision is 0.
    resp, err := s.txn().
        If(clientv3.Compare(clientv3.CreateRevision(leaderKey), "=", 0)).
        Then(clientv3.OpPut(leaderKey, s.leaderValue, clientv3.WithLease(clientv3.LeaseID(leaseResp.ID)))).
        Commit()

    If the CreateRevision of the leader key is 0, it means no other PD has written it yet, so we can write our own leader information together with a lease. If the transaction fails, another PD has already become leader, so go back to step 1.

  3. After becoming leader, we periodically keep the lease alive:

    // Make the leader keepalived.
    ch, err := lessor.KeepAlive(s.client.Ctx(), clientv3.LeaseID(leaseResp.ID))
    if err != nil {
        return errors.Trace(err)
    }

    If this PD crashes, the previously written leader key is automatically deleted when the lease expires, so the other PDs watching it can start a new election.

  4. Initialize the Raft cluster, which mainly means reloading the cluster meta information from etcd, and then fetch the latest TSO information:

    // Try to create raft cluster.
    err = s.createRaftCluster()
    if err != nil {
        return errors.Trace(err)
    }

    log.Debug("sync timestamp for tso")
    if err = s.syncTimestamp(); err != nil {
        return errors.Trace(err)
    }
  5. After everything is done, update the TSO periodically, watch whether the lease has expired, and check whether the server has been asked to exit:

    for {
        select {
        case _, ok := <-ch:
            if !ok {
                log.Info("keep alive channel is closed")
                return nil
            }
        case <-tsTicker.C:
            if err = s.updateTimestamp(); err != nil {
                return errors.Trace(err)
            }
        case <-s.client.Ctx().Done():
            return errors.New("server closed")
        }
    }

TSO

We mentioned TSO earlier. TSO is a global timestamp and the cornerstone of TiDB's distributed transaction implementation. For PD, we must first ensure that it can allocate TSO for transactions quickly and at large scale, and we must also guarantee that the allocated TSO is monotonically increasing and never goes backwards.

TSO is an int64 consisting of two parts: physical time and logical time. Physical time is the current Unix time in milliseconds, while logical time is a counter with a maximum of 1 << 18. In other words, within 1 ms PD can allocate up to 262,144 TSOs, which is enough for most cases.
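
As an illustration of this layout, here is a minimal sketch, assuming the logical counter occupies the low 18 bits of the int64 and the physical milliseconds occupy the high bits; the helper names are made up for this example:

    // logicalBits reserves the low 18 bits of the int64 for the logical
    // counter, so one millisecond covers 1 << 18 = 262144 timestamps.
    const logicalBits = 18

    // composeTS packs physical milliseconds and a logical counter into one int64.
    func composeTS(physicalMs, logical int64) int64 {
        return physicalMs<<logicalBits + logical
    }

    // extractTS splits a packed timestamp back into its two parts.
    func extractTS(ts int64) (physicalMs, logical int64) {
        return ts >> logicalBits, ts & ((1 << logicalBits) - 1)
    }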

For the saving and allocation of TSO, PD does the following:

  1. When PD becomes leader, it gets the last saved time from etcd. If it finds that the local time is smaller than this value, it keeps waiting until the current time is greater than it:

    last, err := s.loadTimestamp()
    if err != nil {
        return errors.Trace(err)
    }

    var now time.Time

    for {
        now = time.Now()
        if wait := last.Sub(now) + updateTimestampGuard; wait > 0 {
            log.Warnf("wait %v to guarantee valid generated timestamp", wait)
            time.Sleep(wait)
            continue
        }
        break
    }
  2. Before PD can allocate TSO, it first requests a maximum time from etcd. For example, suppose the current time is T1 and the window requested each time is at most 3 s; PD saves the value T1 + 3s to etcd, and can then allocate TSO within this window directly in memory. When the current time T2 becomes greater than T1 + 3s, PD saves T2 + 3s to etcd and continues:

    if now.Sub(s.lastSavedTime) >= 0 {
        last := s.lastSavedTime
        save := now.Add(s.cfg.TsoSaveInterval.Duration)
        if err := s.saveTimestamp(save); err != nil {
            return errors.Trace(err)
        }
    }

    The advantage is that even if PD crashes, the newly started PD will begin allocating TSO from the maximum time previously saved, which is exactly the case handled in step 1.

  3. Because PD keeps an allocatable time window in memory, when a TSO is requested from outside, PD can compute it directly in memory and return it:

    resp := pdpb.Timestamp{}

    for i := 0; i < maxRetryCount; i++ {
        current, ok := s.ts.Load().(*atomicObject)
        if !ok {
            log.Errorf("we haven't synced timestamp ok, wait and retry, retry count %d", i)
            time.Sleep(200 * time.Millisecond)
            continue
        }

        resp.Physical = current.physical.UnixNano() / int64(time.Millisecond)
        resp.Logical = atomic.AddInt64(&current.logical, int64(count))
        if resp.Logical >= maxLogical {
            log.Errorf("logical part outside of max logical interval %v, please check ntp time, retry count %d", resp, i)
            time.Sleep(updateTimestampStep)
            continue
        }
        return resp, nil
    }

    Because the calculation happens entirely in memory, the performance is very high; in our internal tests PD can allocate millions of TSOs per second.

  4. If the client sent one TSO request to PD per transaction, the RPC overhead would be very high, so the client fetches TSOs from PD in batches. It first collects the TSO requests of a batch of transactions, say N of them, then sends a single command to PD with N as the parameter; PD receives the command, generates N TSOs, and returns them to the client. A minimal sketch of this batching follows below.
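
The following sketch illustrates the client-side batching described in step 4, under the assumption that the PD RPC returns N consecutive timestamps; the types and the allocate callback are hypothetical stand-ins, not the real client API:

    // tsoRequest represents one transaction waiting for a timestamp.
    type tsoRequest struct {
        done chan int64 // receives the timestamp allocated for this transaction
    }

    // batchAllocate sends a single request to PD for all pending transactions.
    // allocate stands in for the real PD RPC: it returns the first of `count`
    // consecutive timestamps.
    func batchAllocate(pending []*tsoRequest, allocate func(count int) (int64, error)) error {
        first, err := allocate(len(pending))
        if err != nil {
            return err
        }
        // Hand each waiting transaction its own timestamp from the batch.
        for i, req := range pending {
            req.done <- first + int64(i)
        }
        return nil
    }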

Heartbeat

At the very beginning we said that all the cluster data PD has is proactively reported by TiKV through heartbeats, and that PD's scheduling of TiKV is done in the heartbeat replies. PD handles two kinds of heartbeats: one is the heartbeat of the TiKV store itself, and the other is the heartbeat of each region, reported by its leader peer.

The store heartbeat is processed in the handleStoreHeartbeat function, which mainly caches the current state of the store carried in the heartbeat. The store state includes how many regions the store has, how many region leader peers are on the store, and so on; all of this is used for subsequent scheduling.
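
As a rough illustration of what such a cache entry might hold, here is a minimal sketch; the struct and field names are made up for this article and are not PD's actual types:

    // storeState is an illustrative view of what PD caches per store.
    type storeState struct {
        StoreID     uint64
        RegionCount int    // number of regions with a peer on this store
        LeaderCount int    // number of regions whose leader peer is on this store
        Available   uint64 // free space reported by the store, in bytes
    }

    // storeCache keeps the latest reported state of every store in memory.
    type storeCache struct {
        stores map[uint64]storeState
    }

    // handleStoreHeartbeat refreshes the cached state for one store.
    func (c *storeCache) handleStoreHeartbeat(s storeState) {
        c.stores[s.StoreID] = s
    }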

The region heartbeat is processed in handleRegionHeartbeat. Note that only the leader peer reports the information of its region; follower peers do not report. After receiving a region heartbeat, PD also puts it into the cache, and if PD finds that the region's epoch has changed, it saves the region's information to etcd as well. PD then performs concrete scheduling on the region, for example adding a new peer when it finds there are not enough peers, or removing a peer that is already broken, and so on; we will discuss scheduling in detail later.

Let us explain the region epoch here. A region's epoch contains conf_ver and version, which represent different aspects of the region's version. If a region has a membership change, that is, a peer is added or removed, conf_ver is increased by 1; if the region is split or merged, version is increased by 1.

Both PD and TiKV use the epoch to determine whether a region has changed and to reject dangerous operations. For example, if a region has been split and its version is now 2, a write request carrying version 1 is considered stale and rejected directly. The version change indicates that the region has changed, so the stale request might operate on a key that was in the previous region's range but is no longer in the new range.
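
As a small illustration of this check, here is a minimal sketch of how an epoch comparison could look; the struct mirrors the conf_ver/version fields described above but is simplified compared to the real protobuf definition:

    // RegionEpoch tracks how many times a region's membership (ConfVer)
    // and key range (Version) have changed.
    type RegionEpoch struct {
        ConfVer uint64
        Version uint64
    }

    // isEpochStale reports whether a request carrying reqEpoch was issued
    // against an older view of the region than currentEpoch; such requests
    // are rejected.
    func isEpochStale(reqEpoch, currentEpoch RegionEpoch) bool {
        return reqEpoch.Version < currentEpoch.Version ||
            reqEpoch.ConfVer < currentEpoch.ConfVer
    }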

Split/merge

As we said earlier, PD schedules a region when handling its heartbeat, and simply puts the relevant scheduling information in the heartbeat's return value so that TiKV can handle it by itself. After TiKV finishes processing, it reports again in the next heartbeat, and PD then knows whether the scheduling succeeded.

Membership changes are relatively easy: we have a configured maximum number of replicas, say three, so when a region's heartbeat shows only two peers, PD adds a peer, and if there are four peers, PD removes one (a small sketch of this check follows the split steps below). Split/merge is slightly more complicated, but still fairly simple. Note that at this stage we only support split; merge is still under development and not released, so here we only use split as an example:

    1. In TiKV, the leader peer periodically checks whether the space occupied by the region exceeds a threshold. Suppose we set the region size to 64 MB; if a region exceeds 96 MB, it needs to split.

    2. The leader peer first sends an ask-split command to PD, which is processed in handleAskSplit. Because we split one region into two, one of the new regions inherits all the meta information of the original region, while the other needs new information, such as the region ID and the new peer IDs, which PD generates and returns to the leader.

    3. The leader peer writes a split Raft log, which is executed at apply time, so that the region is split into two.

    4. After the split succeeds, TiKV reports it to PD. PD handles the report in handleReportSplit, updating the relevant cache information and persisting it to etcd.
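
Here is the membership-change check mentioned above as a minimal sketch; maxReplicas and the returned command strings are illustrative, not PD's real scheduling types:

    // maxReplicas is the configured number of replicas for every region.
    const maxReplicas = 3

    // checkReplicas decides which command to return in the heartbeat reply,
    // based only on how many peers the region currently has.
    func checkReplicas(peerCount int) string {
        switch {
        case peerCount < maxReplicas:
            return "add peer" // e.g. only 2 peers: ask TiKV to add one
        case peerCount > maxReplicas:
            return "remove peer" // e.g. 4 peers: ask TiKV to remove one
        default:
            return "" // replica count is as expected, nothing to do
        }
    }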

Routing

Because PD keeps all the meta information of the TiKV cluster, it naturally provides routing for clients. Suppose the client wants to write a value to some key:

    1. The client first asks PD which region the key belongs to, and PD returns the meta information of that region.

    2. The client caches this information so that it does not need to ask PD every time, and then sends the command directly to the region's leader peer.

    3. The region's leader may have moved to another peer; in that case TiKV returns a NotLeader error containing the new leader's address, and the client updates its cache and resends the request to the new leader.

    4. The region's version may also have changed, for example because of a split, in which case the key may now fall into a new region. The client then receives a StaleCommand error, fetches the routing information from PD again, and goes back to step 1. This retry loop is sketched below.
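
The following is a minimal sketch of this client-side routing loop, assuming hypothetical lookup/send helpers and simplified error types that only mirror the behavior described above:

    // NotLeaderError and StaleCommandError mirror the two TiKV errors
    // described above; they are simplified stand-ins, not the real protobuf
    // error types.
    type NotLeaderError struct{ NewLeader string }

    func (e *NotLeaderError) Error() string { return "not leader, try " + e.NewLeader }

    type StaleCommandError struct{}

    func (e *StaleCommandError) Error() string { return "stale command" }

    // writeKey routes a write the way the steps above describe. lookup asks PD
    // (or the local cache) which region/leader owns the key; send talks to one
    // TiKV peer. Both are hypothetical callbacks.
    func writeKey(
        key, value []byte,
        lookup func(key []byte) (regionID uint64, leader string),
        send func(leader string, regionID uint64, key, value []byte) error,
    ) error {
        regionID, leader := lookup(key) // steps 1/2: find the region and its leader
        for {
            err := send(leader, regionID, key, value)
            switch e := err.(type) {
            case nil:
                return nil // write succeeded
            case *NotLeaderError:
                leader = e.NewLeader // step 3: retry against the new leader
            case *StaleCommandError:
                regionID, leader = lookup(key) // step 4: region changed, refetch from PD
            default:
                return err
            }
        }
    }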

Summary

As the central scheduling module of the TiDB cluster, PD is designed to be as stateless as possible so that it can be scaled conveniently. This article mainly described how PD interacts with TiKV and TiDB. Later, we will describe PD's core scheduling functionality in detail, that is, how PD controls the whole cluster.
