This series of articles is mainly for TiKV community developers, focusing on the TiKV system architecture, source code structure, and process analysis. The goal is to give developers who read it a preliminary understanding of the TiKV project, so that they can participate in the development of TiKV more easily.
TiKV is a distributed KV system that uses the Raft protocol to guarantee strong data consistency, and supports distributed transactions with the MVCC + 2PC approach.
This article is the third in the series.
Introduction
Placement Driver (hereafter abbreviated as PD) is the global central master node of TiDB. It is responsible for scheduling the whole cluster, generating global IDs, and generating the global timestamp TSO. PD also holds the meta-information of the whole TiKV cluster and provides routing to clients.
As the central master node, PD supports automatic failover by embedding etcd, so there is no need to worry about a single point of failure. Meanwhile, PD relies on etcd's Raft to guarantee strong data consistency, so there is no need to worry about data loss either.
Architecturally, all of PD's data is learned through TiKV's proactive reports. Also, when PD operates on the TiKV cluster, it only returns the relevant commands in the response to TiKV's heartbeat and lets TiKV handle them itself, instead of actively sending commands to TiKV. This design is very simple: we can regard PD as a stateless service (of course, PD still persists some information to etcd). All operations are triggered passively, so even if PD crashes, the newly elected PD leader can serve immediately, without considering any previous intermediate state.
Initialization
PD embeds etcd, so we usually need to start at least three replicas to keep data safe. Currently PD supports two ways of bootstrapping a cluster: the static initial-cluster mode and the dynamic join mode.
Before going further, we need to understand etcd's ports. By default etcd listens on two ports, 2379 and 2380: 2379 is mainly used by etcd to handle external requests, while 2380 is used for communication between etcd peers.
Suppose we now have three PDs, pd1, pd2, and pd3, running on host1, host2, and host3 respectively.
For static initialization, when starting the three PDs we directly set initial-cluster to pd1=http://host1:2380,pd2=http://host2:2380,pd3=http://host3:2380.
For dynamic initialization, we start pd1 first, then start pd2 with join set to http://host1:2379 so that it joins pd1's cluster, and then start pd3 with join set to http://host1:2379 so that it joins the cluster formed by pd1 and pd2.
As you can see, static and dynamic initialization use two entirely different ports, and the two are mutually exclusive; that is, we can only use one of them to initialize a cluster. etcd itself only supports the initial-cluster way, but for convenience PD also provides the join way.
join mainly uses the member-related APIs provided by etcd itself, such as add member and list member, so it uses port 2379, because we need to send commands to etcd for execution. initial-cluster is etcd's own initialization, so it uses port 2380.
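To make this concrete, here is a minimal sketch of what the join flow does against etcd's member APIs through the 2379 client port. It is only an illustration, not PD's actual join code; the host names follow the example above, and the clientv3 import path depends on the etcd version being used.
package main

import (
    "context"
    "log"
    "time"

    "go.etcd.io/etcd/clientv3"
)

func main() {
    // Talk to an existing member through its client port (2379).
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"http://host1:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    // Register the new member's peer URL (2380) before starting it.
    addResp, err := cli.MemberAdd(ctx, []string{"http://host2:2380"})
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("added member %x", addResp.Member.ID)

    // List the members to build the cluster configuration for the new node.
    listResp, err := cli.MemberList(ctx)
    if err != nil {
        log.Fatal(err)
    }
    for _, m := range listResp.Members {
        log.Printf("member %s peer urls %v", m.Name, m.PeerURLs)
    }
}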
Compared with initial-cluster, join needs to consider many more cases (the prepareJoinCluster function in server/join.go has a detailed explanation), but join is much more natural to use. We may consider removing the initial-cluster initialization scheme later.
Election
When PD starts, we need to elect a leader to serve the outside world. Although etcd itself also has a Raft leader, we still prefer to use our own leader; that is, the PD leader and etcd's own leader are not the same thing.
After PD starts, the leader election proceeds as follows:
Check whether the cluster currently has a leader. If there is one, watch this leader; as soon as we find that the leader is gone, start again from step 1.
If there is no leader, start campaigning: create a lessor, and write the leader information through etcd's transaction mechanism, as follows:
// Create a lessor.
ctx, cancel := context.WithTimeout(s.client.Ctx(), requestTimeout)
leaseResp, err := lessor.Grant(ctx, s.cfg.LeaderLease)
cancel()

// The leader key must not exist, so the CreateRevision is 0.
resp, err := s.txn().
    If(clientv3.Compare(clientv3.CreateRevision(leaderKey), "=", 0)).
    Then(clientv3.OpPut(leaderKey, s.leaderValue, clientv3.WithLease(clientv3.LeaseID(leaseResp.ID)))).
    Commit()
The leader key can only be written when its CreateRevision is 0, that is, no other PD has written it yet; in that case we write our own leader information together with a lease. If the transaction fails to execute, another PD has already become the leader, so we go back to step 1.
After becoming the leader, we periodically keep the lease alive:
// Make the leader keepalived.
ch, err := lessor.KeepAlive(s.client.Ctx(), clientv3.LeaseID(leaseResp.ID))
if err != nil {
    return errors.Trace(err)
}
When the PD leader crashes, the leader key it wrote is automatically deleted because its lease expires, so the other PDs can watch this and start the election again.
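The watch mentioned in step 1 can be expressed with etcd's watch API. The sketch below is illustrative only, not PD's actual code: it blocks until a DELETE event is seen on the leader key, after which the caller can go back to campaigning.
package election

import (
    "context"

    "go.etcd.io/etcd/clientv3"
)

// watchLeader blocks until the leader key disappears (for example because
// the leader's lease expired), then returns so the caller can campaign again.
func watchLeader(ctx context.Context, cli *clientv3.Client, leaderKey string) {
    for resp := range cli.Watch(ctx, leaderKey) {
        for _, ev := range resp.Events {
            if ev.Type == clientv3.EventTypeDelete {
                return
            }
        }
    }
}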
Initialize the Raft cluster, which mainly means reloading the cluster meta-information from etcd, and then sync the latest TSO information:
// Try to create raft cluster.
err = s.createRaftCluster()
if err != nil {
    return errors.Trace(err)
}

log.Debug("sync timestamp for tso")
if err = s.syncTimestamp(); err != nil {
    return errors.Trace(err)
}
After all this is done, enter a loop that periodically updates the TSO, monitors whether the lease has expired, and checks whether the outside has asked the server to exit:
for {
    select {
    case _, ok := <-ch:
        if !ok {
            log.Info("keep alive channel is closed")
            return nil
        }
    case <-tsTicker.C:
        if err = s.updateTimestamp(); err != nil {
            return errors.Trace(err)
        }
    case <-s.client.Ctx().Done():
        return errors.New("server closed")
    }
}
TSO
We mentioned TSO earlier. TSO is a global timestamp and the cornerstone of TiDB's distributed transactions. So for PD we must ensure both that it can allocate TSOs for transactions quickly and in large volume, and that the allocated TSOs are always monotonically increasing and can never go backwards.
A TSO is an int64 consisting of two parts: physical time and logical time. The physical time is the current Unix time in milliseconds, while the logical time is a counter with a maximum of 1 << 18. In other words, within one millisecond PD can allocate at most 262144 TSOs, which is enough for most situations.
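As an illustration of the format, here is a minimal sketch that packs the two parts into one int64, assuming the usual physical << 18 | logical convention; it is not PD's exact code.
package main

import (
    "fmt"
    "time"
)

const (
    logicalBits = 18               // the logical counter occupies the low 18 bits
    maxLogical  = 1 << logicalBits // at most 262144 TSOs per millisecond
)

// composeTS packs the physical time (in milliseconds) and the logical counter
// into a single int64 TSO.
func composeTS(physical, logical int64) int64 {
    return physical<<logicalBits + logical
}

func main() {
    physical := time.Now().UnixNano() / int64(time.Millisecond)
    fmt.Println(composeTS(physical, 0))
    fmt.Println(composeTS(physical, 1)) // strictly larger: monotonic within the same millisecond
    fmt.Println(maxLogical)
}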
To keep TSO allocation safe across restarts, PD does the following:
When PD becomes the leader, it reads the last saved timestamp from etcd. If the local time is not yet larger than this value, it keeps waiting until the current time exceeds it:
last, err := s.loadTimestamp()
if err != nil {
    return errors.Trace(err)
}

var now time.Time

for {
    now = time.Now()
    if wait := last.Sub(now) + updateTimestampGuard; wait > 0 {
        log.Warnf("wait %v to guarantee valid generated timestamp", wait)
        time.Sleep(wait)
        continue
    }
    break
}
When PD allocates TSOs, it first requests a maximum time from etcd. For example, suppose the current time is T1 and each request covers a 3s window: PD saves the value T1 + 3s to etcd, and can then allocate TSOs within this window directly from memory. Once the current time T2 is larger than T1 + 3s, PD saves T2 + 3s to etcd and continues:
if now.Sub(s.lastSavedTime) >= 0 {
    last := s.lastSavedTime
    save := now.Add(s.cfg.TsoSaveInterval.Duration)
    if err := s.saveTimestamp(save); err != nil {
        return errors.Trace(err)
    }
}
The benefit of doing this is that even if PD crashes, the newly started PD leader will start allocating TSO from the maximum time saved last, which is exactly the case handled in step 1.
Because PD keeps an allocatable time window in memory, when a TSO is requested from outside, PD can compute the TSO directly in memory and return it:
resp := pdpb.Timestamp{}
for i := 0; i < maxRetryCount; i++ {
    current, ok := s.ts.Load().(*atomicObject)
    if !ok {
        log.Errorf("we haven't synced timestamp ok, wait and retry, retry count %d", i)
        time.Sleep(200 * time.Millisecond)
        continue
    }

    resp.Physical = current.physical.UnixNano() / int64(time.Millisecond)
    resp.Logical = atomic.AddInt64(&current.logical, int64(count))
    if resp.Logical >= maxLogical {
        log.Errorf("logical part outside of max logical interval %v, please check ntp time, retry count %d", resp, i)
        time.Sleep(updateTimestampStep)
        continue
    }

    return resp, nil
}
Because the TSO is computed in memory, the performance is very high; in our internal tests PD can allocate millions of TSOs per second.
If the client sent a TSO request to PD for every transaction, the cost of each RPC would still be very high, so the client batches TSO requests to PD. The client first collects the TSO requests of a batch of transactions, say N of them, and then sends a single command to PD with N as the parameter; PD receives the command, generates N TSOs, and returns them to the client.
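A minimal sketch of this batching idea is shown below. The names tsoRequest and getTSBatch are hypothetical placeholders, not the real pd-client API; the point is simply that N pending transactions share one round trip to PD.
package tsoclient

// tsoRequest represents one transaction waiting for a timestamp.
type tsoRequest struct {
    done chan int64 // receives the allocated timestamp
}

// getTSBatch stands in for a single RPC to the PD leader that asks for
// `count` timestamps at once and returns them in increasing order.
func getTSBatch(count int) []int64 {
    // ... one network round trip to PD instead of `count` round trips ...
    return make([]int64, count)
}

// flush answers a batch of pending transactions with one PD round trip.
func flush(pending []*tsoRequest) {
    ts := getTSBatch(len(pending))
    for i, req := range pending {
        req.done <- ts[i]
    }
}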
Heartbeat
As we said at the very beginning, all of PD's data about the cluster is reported proactively by TiKV through heartbeats, and PD's scheduling of TiKV is also done in the heartbeats. PD usually handles two kinds of heartbeats: one is the heartbeat of the TiKV store itself, and the other is the heartbeat reported by the leader peer of each region on that store.
Store heartbeats are handled by PD in the handleStoreHeartbeat function, which mainly caches into memory the store state carried in the heartbeat, such as how many regions the store has and how many region leader peers are on the store. This state is used for subsequent scheduling.
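For illustration, the per-store state cached from these heartbeats might look like the sketch below; the field names are illustrative and do not match PD's real structures exactly.
package cache

import "time"

// storeStats is an illustrative sketch of the per-store state PD keeps in
// memory from store heartbeats.
type storeStats struct {
    StoreID         uint64
    RegionCount     int       // regions that have a peer on this store
    LeaderCount     int       // region leader peers on this store
    Capacity        uint64    // total disk capacity, in bytes
    Available       uint64    // available disk space, in bytes
    LastHeartbeatTS time.Time // used to judge whether the store is still up
}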
Region heartbeats are handled by PD in handleRegionHeartbeat. Note that only the leader peer reports its region's information; follower peers do not. After receiving a region heartbeat, PD also puts it into the cache, and if PD finds that the region's epoch has changed, it saves the region's information to etcd as well. Then PD performs specific scheduling for this region, for example adding a new peer if the number of peers is insufficient, or removing a peer that is already broken; we will discuss this in detail later.
Here we explain the region epoch again. The region epoch has two fields, conf_ver and version, which represent different version states of the region. If a region has a membership change, that is, a peer is added or removed, its conf_ver is increased by 1; if the region is split or merged, its version is increased by 1.
Both PD and TiKV use the epoch to determine whether a region has changed, and to reject dangerous operations. For example, suppose the region has split and its version has become 2; if a write request arrives carrying version 1, we consider the request stale and reject it directly. Because a version change indicates that the region has changed, the stale request might need to operate on a key that is in the previous region's range but not in the new range.
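A minimal sketch of this epoch check (illustrative, not the actual TiKV/PD implementation): a request is considered stale if either part of its epoch is behind the current region epoch.
package epoch

// RegionEpoch mirrors the two fields described above.
type RegionEpoch struct {
    ConfVer uint64 // increased on membership change (add/remove peer)
    Version uint64 // increased on split or merge
}

// isEpochStale reports whether a request carrying reqEpoch should be
// rejected because the region has changed since the request was built.
func isEpochStale(reqEpoch, currentEpoch RegionEpoch) bool {
    return reqEpoch.ConfVer < currentEpoch.ConfVer ||
        reqEpoch.Version < currentEpoch.Version
}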
Split/merge
As we said earlier, PD schedules a region in that region's heartbeat and puts the scheduling commands directly into the heartbeat response, letting TiKV handle them itself. After TiKV finishes processing, it reports again in the next heartbeat, so PD knows whether the scheduling succeeded.
Membership changes are relatively easy, because we have a configured maximum number of replicas. Suppose it is 3: when a region heartbeat arrives and only two peers are found, PD adds a peer; if there are four peers, PD removes one (a minimal sketch of this check follows the split steps below). For region split/merge the situation is slightly more complicated, but still fairly simple. Note that at this stage we only support split; merge is still in development and not released, so here we only take split as an example:
In TiKV, the leader peer periodically checks whether the region's size exceeds a threshold. Suppose we set the region size to 64MB; if a region exceeds 96MB, it needs to split.
The leader peer first sends an ask-split command to PD, which handles it in handleAskSplit. Because we split one region into two, one of the new regions inherits all the meta information of the original region, while the other needs new information, such as a region ID and new peer IDs, which PD generates and returns to the leader.
The leader peer writes a split Raft log, which is executed at apply time, and the region is thereby split into two.
After the split succeeds, TiKV tells PD, and PD handles it in handleReportSplit, updating the related cache information and persisting it to etcd.
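As for the membership-change check mentioned before the split steps, the decision PD returns in the region heartbeat response essentially boils down to comparing the reported peer count with the configured number of replicas. The sketch below is illustrative only; the names do not come from PD's scheduler code.
package schedule

// operator is an illustrative stand-in for the command PD puts into the
// region heartbeat response.
type operator string

const (
    opNone       operator = "none"
    opAddPeer    operator = "add-peer"
    opRemovePeer operator = "remove-peer"
)

// checkReplicas sketches the decision described above: with maxReplicas = 3,
// a region reporting 2 peers gets an add-peer command and a region reporting
// 4 peers gets a remove-peer command.
func checkReplicas(peerCount, maxReplicas int) operator {
    switch {
    case peerCount < maxReplicas:
        return opAddPeer
    case peerCount > maxReplicas:
        return opRemovePeer
    default:
        return opNone
    }
}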
Routing
Because PD keeps all the meta-information of the TiKV cluster, it naturally provides routing to the client. Suppose the client wants to write a value to some key:
The client first asks PD which region the key belongs to, and PD returns the meta-information of that region.
The client caches it, so it does not need to ask PD every time, and then sends the command directly to the region's leader peer.
It is possible that the region's leader has moved to another peer. In that case TiKV returns a NotLeader error together with the new leader's address; the client updates its cache and re-sends the request to the new leader.
It is also possible that the region's version has changed, for example because of a split, and the key has fallen into a new region. The client then receives a StaleCommand error, re-fetches the routing information from PD, and goes back to step 1.
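Putting the routing steps together, a client write loop looks roughly like the sketch below. All of the helper names and error values are hypothetical; the real client uses the errors and RPCs defined in the TiKV/PD protobufs.
package route

import "errors"

// Illustrative error values standing in for the real TiKV error responses.
var (
    errNotLeader    = errors.New("NotLeader")
    errStaleCommand = errors.New("StaleCommand")
)

// region is a simplified view of the routing meta-information the client caches.
type region struct {
    id     uint64
    leader string // address of the leader peer
}

// sendToRegionLeader sketches the loop described above: find the region for
// the key (from the local cache or PD), send to the cached leader, and refresh
// the cache on NotLeader / StaleCommand errors.
func sendToRegionLeader(key, value []byte) error {
    for {
        r := lookupCacheOrPD(key) // step 1: region for key, from cache or PD
        err := sendToLeader(r, key, value)
        switch {
        case err == nil:
            return nil
        case errors.Is(err, errNotLeader):
            updateLeaderInCache(r) // the new leader address comes with the error
        case errors.Is(err, errStaleCommand):
            invalidateCache(key) // the region changed (e.g. split); ask PD again
        default:
            return err
        }
    }
}

// Placeholder helpers so the sketch stays self-contained.
func lookupCacheOrPD(key []byte) region              { return region{} }
func sendToLeader(r region, key, value []byte) error { return nil }
func updateLeaderInCache(r region)                   {}
func invalidateCache(key []byte)                     {}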
Summary
As the central scheduling module of a TiDB cluster, PD is designed to be stateless, which makes it easy to scale. This article mainly described how PD interacts with TiKV and TiDB. Later we will describe PD's core scheduling functionality in detail, that is, how PD controls the whole cluster.