Kubernetes Source Analysis--scheduler

Scheduler Source Analysis

The previous article introduced the working principles of the Kubernetes scheduler. This article analyzes the source code of the Kubernetes scheduler module.

Scheduler Source Structure

The Kubernetes scheduler module lives in the /kubernetes/plugin directory of the Kubernetes source tree. Because different organizations often customize the scheduling algorithm to their own needs, the scheduler is designed as a plugin, which makes custom development convenient (a sketch of how a custom predicate might be registered follows the directory tree below). The following is the directory structure of the scheduler, with notes on what each file does:

.
├── BUILD
├── OWNERS
├── cmd                                  # boot code for the scheduler module
│   └── kube-scheduler
│       ├── BUILD
│       ├── OWNERS
│       ├── app                          # startup logic of the SchedulerServer
│       │   ├── BUILD
│       │   ├── configurator.go          # implementation of the functions used by app Run during startup
│       │   ├── configurator_test.go
│       │   ├── options
│       │   │   ├── BUILD
│       │   │   └── options.go           # defines the SchedulerServer struct
│       │   └── server.go                # SchedulerServer startup logic
│       └── scheduler.go                 # entry function of the scheduler module
└── pkg
    └── scheduler
        ├── BUILD
        ├── OWNERS
        ├── algorithm                    # detailed implementations of the preselection (predicate) and optimization (priority) methods
        │   ├── BUILD
        │   ├── doc.go
        │   ├── listers.go               # defines the various Lister interfaces and their main implementations
        │   ├── predicates               # predicate policies
        │   │   ├── BUILD
        │   │   ├── error.go
        │   │   ├── metadata.go
        │   │   ├── predicates.go        # implementation of each predicate policy function
        │   │   ├── predicates_test.go
        │   │   ├── utils.go
        │   │   └── utils_test.go
        │   ├── priorities               # implementations of the built-in priority policies
        │   │   ├── BUILD
        │   │   ├── balanced_resource_allocation.go   # BalancedResourceAllocation: prefers nodes with balanced resource usage
        │   │   ├── balanced_resource_allocation_test.go
        │   │   ├── image_locality.go    # ImageLocalityPriority: prefers nodes that already have the pod's images
        │   │   ├── image_locality_test.go
        │   │   ├── interpod_affinity.go # InterPodAffinityPriority
        │   │   ├── interpod_affinity_test.go
        │   │   ├── least_requested.go   # LeastRequestedPriority
        │   │   ├── least_requested_test.go
        │   │   ├── metadata.go
        │   │   ├── metadata_test.go
        │   │   ├── most_requested.go    # MostRequestedPriority
        │   │   ├── most_requested_test.go
        │   │   ├── node_affinity.go     # CalculateNodeAffinityPriority
        │   │   ├── node_affinity_test.go
        │   │   ├── node_label.go        # NodeLabelPriority: whether a given label is present on the node
        │   │   ├── node_label_test.go
        │   │   ├── node_prefer_avoid_pods.go
        │   │   ├── node_prefer_avoid_pods_test.go
        │   │   ├── selector_spreading.go   # SelectorSpreadPriority
        │   │   ├── selector_spreading_test.go
        │   │   ├── taint_toleration.go  # TaintTolerationPriority
        │   │   ├── taint_toleration_test.go
        │   │   ├── test_util.go
        │   │   └── util
        │   │       ├── BUILD
        │   │       ├── non_zero.go
        │   │       ├── topologies.go
        │   │       └── util.go
        │   ├── scheduler_interface.go   # defines the SchedulerExtender and ScheduleAlgorithm interfaces
        │   ├── scheduler_interface_test.go
        │   └── types.go
        ├── algorithmprovider            # a combination of a set of predicates and priorities
        │   ├── BUILD
        │   ├── defaults
        │   │   ├── BUILD
        │   │   ├── compatibility_test.go
        │   │   └── defaults.go          # Kubernetes' default predicate and priority policies
        │   ├── plugins.go
        │   └── plugins_test.go
        ├── api
        │   ├── BUILD
        │   ├── latest
        │   │   ├── BUILD
        │   │   └── latest.go            # defines the API version
        │   ├── register.go
        │   ├── types.go                 # defines structs such as PredicatePolicy and PriorityPolicy
        │   ├── v1
        │   │   ├── BUILD
        │   │   ├── register.go
        │   │   └── types.go
        │   └── validation
        │       ├── BUILD
        │       ├── validation.go
        │       └── validation_test.go
        ├── equivalence_cache.go
        ├── extender.go
        ├── extender_test.go
        ├── factory
        │   ├── BUILD
        │   ├── factory.go               # generates the Scheduler from the scheduler configuration file or the default policies
        │   ├── factory_test.go
        │   ├── plugins.go
        │   └── plugins_test.go
        ├── generic_scheduler.go         # implements Schedule(), the main scheduling logic
        ├── generic_scheduler_test.go
        ├── metrics
        │   ├── BUILD
        │   └── metrics.go
        ├── scheduler.go                 # defines the Scheduler and Configurator interfaces and implements scheduleOne()
        ├── scheduler_test.go
        ├── schedulercache
        │   ├── BUILD
        │   ├── cache.go                 # implements the functions of the Cache interface and defines schedulerCache, etc.
        │   ├── cache_test.go
        │   ├── interface.go             # defines the Cache interface
        │   ├── node_info.go
        │   └── util.go
        ├── testing
        │   ├── BUILD
        │   ├── fake_cache.go
        │   └── pods_to_cache.go
        └── util
            ├── BUILD
            ├── backoff_utils.go
            └── backoff_utils_test.go
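
Because the scheduler is packaged as a plugin, custom predicate and priority policies are normally hooked in through the registration helpers in pkg/scheduler/factory/plugins.go and compiled into the scheduler binary. The sketch below shows roughly what registering a trivial predicate could look like; the import paths, the FitPredicate signature and the RegisterFitPredicate helper vary between Kubernetes versions, so treat them as assumptions rather than a drop-in example.

package main

// NOTE: the import paths and signatures below are assumptions based on the
// Kubernetes version this article describes; check your vendored source.
import (
    "k8s.io/kubernetes/pkg/api/v1"
    "k8s.io/kubernetes/plugin/pkg/scheduler/algorithm"
    "k8s.io/kubernetes/plugin/pkg/scheduler/factory"
    "k8s.io/kubernetes/plugin/pkg/scheduler/schedulercache"
)

// alwaysFit is a trivial predicate: every node fits every pod.
func alwaysFit(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    return true, nil, nil
}

func init() {
    // Register the predicate under a name that a scheduler policy file can then reference.
    factory.RegisterFitPredicate("AlwaysFit", alwaysFit)
}

func main() {}
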
Scheduler Operating Mechanism

The scheduler's ConfigFactory defines a podQueue that holds the pods waiting to be scheduled; whenever a new pod is created, it is added to this queue. scheduleOne() then takes the next pod from the queue and carries out the actual scheduling.

Once a suitable node has been selected, the pod is bound to that node.
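
To make the queue-then-schedule flow concrete, the following is a minimal, self-contained sketch of the same pattern: a FIFO queue of pods, a loop that pops one pod at a time, picks a node, and "binds" the pod to it. All of the types and names here are illustrative stand-ins, not the real scheduler API.

package main

import "fmt"

// Pod and Node are simplified stand-ins for the real API objects.
type Pod struct{ Name string }
type Node struct{ Name string }

// podQueue mimics the FIFO queue that the ConfigFactory fills with newly created pods.
type podQueue struct{ items []Pod }

func (q *podQueue) push(p Pod) { q.items = append(q.items, p) }

func (q *podQueue) pop() (Pod, bool) {
    if len(q.items) == 0 {
        return Pod{}, false
    }
    p := q.items[0]
    q.items = q.items[1:]
    return p, true
}

// scheduleOne pops the next pod, selects a node for it and "binds" it.
func scheduleOne(q *podQueue, nodes []Node) {
    pod, ok := q.pop()
    if !ok {
        return
    }
    // Node selection is a placeholder here; the real scheduler runs the
    // predicate and priority phases described in the sections below.
    dest := nodes[0]
    fmt.Printf("bound pod %s to node %s\n", pod.Name, dest.Name)
}

func main() {
    q := &podQueue{}
    q.push(Pod{Name: "nginx-1"})
    q.push(Pod{Name: "nginx-2"})
    nodes := []Node{{Name: "node-a"}, {Name: "node-b"}}

    for len(q.items) > 0 {
        scheduleOne(q, nodes)
    }
}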

(Figure: the complete scheduling flow of the scheduler.)
Scheduler Core Code Explanation

Scheduler Module Entrance

The main() function is the entry point of the scheduler module. It creates a SchedulerServer from the configuration and starts it.

func main() {
    s := options.NewSchedulerServer()
    s.AddFlags(pflag.CommandLine)

    flag.InitFlags()
    logs.InitLogs()
    defer logs.FlushLogs()

    verflag.PrintAndExitIfRequested()

    // Boot the scheduler server.
    if err := app.Run(s); err != nil {
        glog.Fatalf("scheduler app failed to run: %v", err)
    }
}
SchedulerServer Startup

app.Run(s) creates the scheduler, which watches for pods that need to be scheduled and performs the corresponding scheduling work.

func Run(s *options.SchedulerServer) error {
    kubecli, err := createClient(s)
    if err != nil {
        return fmt.Errorf("unable to create kube client: %v", err)
    }

    recorder := createRecorder(kubecli, s)

    sched, err := CreateScheduler(s, kubecli, recorder)
    if err != nil {
        return fmt.Errorf("error creating scheduler: %v", err)
    }

    go startHTTP(s)

    run := func(_ <-chan struct{}) {
        sched.Run() // start the scheduler and begin scheduling
        select {}
    }

    if !s.LeaderElection.LeaderElect {
        run(nil)
        panic("unreachable")
    }

    id, err := os.Hostname()
    if err != nil {
        return fmt.Errorf("unable to get hostname: %v", err)
    }

    // TODO: enable other lock types
    rl := &resourcelock.EndpointsLock{
        EndpointsMeta: metav1.ObjectMeta{
            Namespace: "kube-system",
            Name:      "kube-scheduler",
        },
        Client: kubecli,
        LockConfig: resourcelock.ResourceLockConfig{
            Identity:      id,
            EventRecorder: recorder,
        },
    }

    leaderelection.RunOrDie(leaderelection.LeaderElectionConfig{
        Lock:          rl,
        LeaseDuration: s.LeaderElection.LeaseDuration.Duration,
        RenewDeadline: s.LeaderElection.RenewDeadline.Duration,
        RetryPeriod:   s.LeaderElection.RetryPeriod.Duration,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: run,
            OnStoppedLeading: func() {
                glog.Fatalf("lost master")
            },
        },
    })
    panic("unreachable")
}
Overall Pod Scheduling Logic

scheduleOne() implements the scheduling of a single pod: it fetches the next pod, runs the scheduling algorithm for it, and binds the pod to the selected node.

func (s *Scheduler) scheduleOne() {
    pod := s.config.NextPod()
    if pod.DeletionTimestamp != nil {
        s.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
        glog.V(3).Infof("Skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
        return
    }

    glog.V(3).Infof("Attempting to schedule pod: %v/%v", pod.Namespace, pod.Name)
    start := time.Now()
    dest, err := s.config.Algorithm.Schedule(pod, s.config.NodeLister)
    if err != nil {
        // ### error handling elided
        return
    }
    metrics.SchedulingAlgorithmLatency.Observe(metrics.SinceInMicroseconds(start))

    // Optimistically assume that the binding will succeed and send it to apiserver
    // in the background.
    // If the binding fails, the scheduler will release the resources allocated to the
    // assumed pod immediately.
    assumed := *pod
    assumed.Spec.NodeName = dest
    if err := s.config.SchedulerCache.AssumePod(&assumed); err != nil {
        glog.Errorf("scheduler cache AssumePod failed: %v", err)
        return
    }

    // Bind the pod to the selected node in the background.
    go func() {
        defer metrics.E2eSchedulingLatency.Observe(metrics.SinceInMicroseconds(start))

        b := &v1.Binding{
            ObjectMeta: metav1.ObjectMeta{Namespace: pod.Namespace, Name: pod.Name},
            Target: v1.ObjectReference{
                Kind: "Node",
                Name: dest,
            },
        }

        bindingStart := time.Now()
        // If binding succeeded then the PodScheduled condition will be updated in apiserver
        // so that it's atomic with setting the host.
        err := s.config.Binder.Bind(b)
        if err := s.config.SchedulerCache.FinishBinding(&assumed); err != nil {
            glog.Errorf("scheduler cache FinishBinding failed: %v", err)
        }
        if err != nil {
            // ### error handling elided
            return
        }
        metrics.BindingLatency.Observe(metrics.SinceInMicroseconds(bindingStart))
        s.config.Recorder.Eventf(pod, v1.EventTypeNormal, "Scheduled", "Successfully assigned %v to %v", pod.Name, dest)
    }()
}
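
The notable design choice in scheduleOne() is optimistic binding: the pod is first assumed onto the chosen node in the scheduler cache, and the API call that records the binding runs asynchronously in a goroutine, so the scheduling loop can move on to the next pod immediately. The following self-contained sketch illustrates that pattern with toy types; the cache, binder and rollback names are illustrative, not the scheduler's real API.

package main

import (
    "fmt"
    "sync"
)

// assumeCache is a toy stand-in for the scheduler cache: it records which node
// a pod is assumed to be on before the binding is confirmed by the API server.
type assumeCache struct {
    mu      sync.Mutex
    assumed map[string]string // pod name -> node name
}

func (c *assumeCache) assume(pod, node string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.assumed[pod] = node
}

func (c *assumeCache) forget(pod string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    delete(c.assumed, pod)
}

// bind pretends to call the API server; a real binder can fail, in which case
// the assumed pod must be released again.
func bind(pod, node string) error {
    fmt.Printf("binding %s to %s\n", pod, node)
    return nil
}

func main() {
    cache := &assumeCache{assumed: map[string]string{}}
    var wg sync.WaitGroup

    pod, dest := "nginx-1", "node-a"

    // Optimistically assume the pod is on the node so that subsequent
    // scheduling decisions already account for its resource usage.
    cache.assume(pod, dest)

    // Perform the actual binding asynchronously, rolling back on failure.
    wg.Add(1)
    go func() {
        defer wg.Done()
        if err := bind(pod, dest); err != nil {
            cache.forget(pod) // release the optimistically reserved resources
        }
    }()

    wg.Wait()
}
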
Preselection and optimization

Schedule(), implemented in /kubernetes/plugin/pkg/scheduler/generic_scheduler.go, performs the node selection part of scheduling, covering the two major steps described earlier: preselection (predicates) and optimization (priorities).
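
Before looking at the two phases in detail, here is a minimal, self-contained sketch of the overall shape of that selection: filter the nodes with predicate functions, score the surviving nodes with priority functions, and pick the highest-scoring host. The types and helper names below are illustrative stand-ins; the real implementations are findNodesThatFit and PrioritizeNodes, shown in the next two sections.

package main

import (
    "errors"
    "fmt"
)

type Node struct {
    Name    string
    FreeCPU int // millicores, a toy attribute
}

// predicate answers: can a pod requesting reqCPU fit on the node at all?
type predicate func(n Node, reqCPU int) bool

// priority answers: how desirable is a feasible node (higher is better)?
type priority func(n Node, reqCPU int) int

// schedule filters the nodes with every predicate, scores the survivors with
// every priority, and returns the name of the best node.
func schedule(nodes []Node, reqCPU int, preds []predicate, prios []priority) (string, error) {
    var feasible []Node
    for _, n := range nodes {
        fits := true
        for _, p := range preds {
            if !p(n, reqCPU) {
                fits = false
                break
            }
        }
        if fits {
            feasible = append(feasible, n)
        }
    }
    if len(feasible) == 0 {
        return "", errors.New("no node fits the pod")
    }

    best, bestScore := feasible[0], -1
    for _, n := range feasible {
        score := 0
        for _, pr := range prios {
            score += pr(n, reqCPU)
        }
        if score > bestScore {
            best, bestScore = n, score
        }
    }
    return best.Name, nil
}

func main() {
    nodes := []Node{{Name: "node-a", FreeCPU: 500}, {Name: "node-b", FreeCPU: 2000}}
    enoughCPU := func(n Node, req int) bool { return n.FreeCPU >= req }
    leastRequested := func(n Node, req int) int { return n.FreeCPU - req }

    host, err := schedule(nodes, 1000, []predicate{enoughCPU}, []priority{leastRequested})
    if err != nil {
        panic(err)
    }
    fmt.Println("selected host:", host) // node-b
}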

Preselection

// Filters the nodes to find the ones that fit based on the given predicate functions.
// Each node is passed through the predicate functions to determine if it is a fit.
func findNodesThatFit(
    pod *v1.Pod,
    nodeNameToInfo map[string]*schedulercache.NodeInfo,
    nodes []*v1.Node,
    predicateFuncs map[string]algorithm.FitPredicate,
    extenders []algorithm.SchedulerExtender,
    metadataProducer algorithm.MetadataProducer,
) ([]*v1.Node, FailedPredicateMap, error) {
    var filtered []*v1.Node
    failedPredicateMap := FailedPredicateMap{}

    if len(predicateFuncs) == 0 {
        filtered = nodes
    } else {
        // Create the filtered list with enough space to avoid growing it
        // and to allow assigning.
        filtered = make([]*v1.Node, len(nodes))
        errs := []error{}
        var predicateResultLock sync.Mutex
        var filteredLen int32

        // We can use the same metadata producer for all nodes.
        meta := metadataProducer(pod, nodeNameToInfo)
        checkNode := func(i int) {
            nodeName := nodes[i].Name
            fits, failedPredicates, err := podFitsOnNode(pod, meta, nodeNameToInfo[nodeName], predicateFuncs)
            if err != nil {
                predicateResultLock.Lock()
                errs = append(errs, err)
                predicateResultLock.Unlock()
                return
            }
            if fits {
                filtered[atomic.AddInt32(&filteredLen, 1)-1] = nodes[i]
            } else {
                predicateResultLock.Lock()
                failedPredicateMap[nodeName] = failedPredicates
                predicateResultLock.Unlock()
            }
        }
        // Each node is checked against all of the configured predicate functions;
        // the concurrency is managed by workqueue.
        workqueue.Parallelize(16, len(nodes), checkNode)
        filtered = filtered[:filteredLen]
        if len(errs) > 0 {
            return []*v1.Node{}, FailedPredicateMap{}, errors.NewAggregate(errs)
        }
    }

    // If extenders are configured, filter the remaining nodes further.
    if len(filtered) > 0 && len(extenders) != 0 {
        for _, extender := range extenders {
            filteredList, failedMap, err := extender.Filter(pod, filtered)
            if err != nil {
                return []*v1.Node{}, FailedPredicateMap{}, err
            }
            for failedNodeName, failedMsg := range failedMap {
                if _, found := failedPredicateMap[failedNodeName]; !found {
                    failedPredicateMap[failedNodeName] = []algorithm.PredicateFailureReason{}
                }
                failedPredicateMap[failedNodeName] = append(failedPredicateMap[failedNodeName], predicates.NewFailureReason(failedMsg))
            }
            filtered = filteredList
            if len(filtered) == 0 {
                break
            }
        }
    }
    return filtered, failedPredicateMap, nil
}
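
A detail worth highlighting in findNodesThatFit is how results are collected concurrently: feasible nodes are written into a pre-allocated slice through an atomically incremented index, while errors and failure reasons are protected by a mutex. The self-contained sketch below reproduces just that pattern with plain strings; the names are illustrative, and the parallelize helper is a simplified stand-in for workqueue.Parallelize.

package main

import (
    "fmt"
    "strings"
    "sync"
    "sync/atomic"
)

// parallelize runs check(i) for i in [0, pieces) using a fixed number of workers,
// a simplified stand-in for the workqueue.Parallelize helper used above.
func parallelize(workers, pieces int, check func(int)) {
    indexes := make(chan int, pieces)
    for i := 0; i < pieces; i++ {
        indexes <- i
    }
    close(indexes)

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := range indexes {
                check(i)
            }
        }()
    }
    wg.Wait()
}

func main() {
    nodes := []string{"node-a", "node-b", "node-c", "node-d"}

    // Pre-allocate the result slice and fill it through an atomic index,
    // mirroring the filtered/filteredLen pair in findNodesThatFit.
    filtered := make([]string, len(nodes))
    var filteredLen int32

    var mu sync.Mutex
    failed := map[string]string{}

    checkNode := func(i int) {
        // Toy predicate: only nodes whose name contains "a" or "c" fit.
        if strings.ContainsAny(nodes[i], "ac") {
            filtered[atomic.AddInt32(&filteredLen, 1)-1] = nodes[i]
        } else {
            mu.Lock()
            failed[nodes[i]] = "name does not match toy predicate"
            mu.Unlock()
        }
    }

    parallelize(2, len(nodes), checkNode)
    filtered = filtered[:filteredLen]

    fmt.Println("feasible:", filtered)
    fmt.Println("failed:", failed)
}
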
Optimization

PrioritizeNodes() scores every node that survived the predicate phase and returns a HostPriorityList:
func PrioritizeNodes(
    pod *v1.Pod,
    nodeNameToInfo map[string]*schedulercache.NodeInfo,
    meta interface{},
    priorityConfigs []algorithm.PriorityConfig,
    nodes []*v1.Node,
    extenders []algorithm.SchedulerExtender,
) (schedulerapi.HostPriorityList, error) {
    // If no priority configs are provided, then the EqualPriority function is applied
    // to generate the priority list in the required format.
    // In other words, if no priority policy and no extender is configured, every node
    // gets the same score.
    if len(priorityConfigs) == 0 && len(extenders) == 0 {
        result := make(schedulerapi.HostPriorityList, 0, len(nodes))
        for i := range nodes {
            hostPriority, err := EqualPriorityMap(pod, meta, nodeNameToInfo[nodes[i].Name])
            if err != nil {
                return nil, err
            }
            result = append(result, hostPriority)
        }
        return result, nil
    }

    var (
        mu   = sync.Mutex{}
        wg   = sync.WaitGroup{}
        errs []error
    )
    appendError := func(err error) {
        mu.Lock()
        defer mu.Unlock()
        errs = append(errs, err)
    }

    results := make([]schedulerapi.HostPriorityList, 0, len(priorityConfigs))
    for range priorityConfigs {
        results = append(results, nil)
    }
    for i, priorityConfig := range priorityConfigs {
        if priorityConfig.Function != nil {
            // DEPRECATED
            wg.Add(1)
            go func(index int, config algorithm.PriorityConfig) {
                defer wg.Done()
                var err error
                results[index], err = config.Function(pod, nodeNameToInfo, nodes)
                if err != nil {
                    appendError(err)
                }
            }(i, priorityConfig)
        } else {
            results[i] = make(schedulerapi.HostPriorityList, len(nodes))
        }
    }
    processNode := func(index int) {
        nodeInfo := nodeNameToInfo[nodes[index].Name]
        var err error
        for i := range priorityConfigs {
            if priorityConfigs[i].Function != nil {
                continue
            }
            results[i][index], err = priorityConfigs[i].Map(pod, meta, nodeInfo)
            if err != nil {
                appendError(err)
                return
            }
        }
    }
    // Each node is scored by the priority functions; the concurrency is managed by workqueue.
    workqueue.Parallelize(16, len(nodes), processNode)
    for i, priorityConfig := range priorityConfigs {
        if priorityConfig.Reduce == nil {
            continue
        }
        wg.Add(1)
        go func(index int, config algorithm.PriorityConfig) {
            defer wg.Done()
            if err := config.Reduce(pod, meta, nodeNameToInfo, results[index]); err != nil {
                appendError(err)
            }
        }(i, priorityConfig)
    }
    // Wait for all computations to be finished.
    wg.Wait()
    if len(errs) != 0 {
        return schedulerapi.HostPriorityList{}, errors.NewAggregate(errs)
    }

    // Summarize all scores.
    result := make(schedulerapi.HostPriorityList, 0, len(nodes))
    // TODO: Consider parallelizing it.
    for i := range nodes {
        result = append(result, schedulerapi.HostPriority{Host: nodes[i].Name, Score: 0})
        for j := range priorityConfigs {
            // The total score of each node is the weighted sum of the scores of the
            // individual priority policies.
            result[i].Score += results[j][i].Score * priorityConfigs[j].Weight
        }
    }

    if len(extenders) != 0 && nodes != nil {
        combinedScores := make(map[string]int, len(nodeNameToInfo))
        for _, extender := range extenders {
            wg.Add(1)
            go func(ext algorithm.SchedulerExtender) {
                defer wg.Done()
                prioritizedList, weight, err := ext.Prioritize(pod, nodes)
                if err != nil {
                    // Prioritization errors from the extender can be ignored; let
                    // k8s/other extenders determine the priorities.
                    return
                }
                mu.Lock()
                for i := range *prioritizedList {
                    host, score := (*prioritizedList)[i].Host, (*prioritizedList)[i].Score
                    combinedScores[host] += score * weight
                }
                mu.Unlock()
            }(extender)
        }
        // Wait for all go routines to finish.
        wg.Wait()
        for i := range result {
            result[i].Score += combinedScores[result[i].Host]
        }
    }

    if glog.V(10) {
        for i := range result {
            glog.V(10).Infof("Host %s => Score %d", result[i].Host, result[i].Score)
        }
    }
    return result, nil
}
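
As the summation loop above shows, a node's final score is the weighted sum of its per-policy scores, plus any extender scores. As a quick worked example with made-up numbers: if LeastRequested gives a node 7 at weight 1 and BalancedResourceAllocation gives it 5 at weight 2, the node ends up with 7*1 + 5*2 = 17. The short sketch below computes exactly that; the policy names and numbers are purely illustrative.

package main

import "fmt"

// weightedScore mirrors the summation loop in PrioritizeNodes:
// total = sum(score_of_policy_j * weight_of_policy_j).
func weightedScore(scores, weights []int) int {
    total := 0
    for j := range scores {
        total += scores[j] * weights[j]
    }
    return total
}

func main() {
    // Made-up scores for one node from two priority policies.
    scores := []int{7, 5}  // e.g. LeastRequested, BalancedResourceAllocation
    weights := []int{1, 2} // the weights configured for those policies

    fmt.Println("final score:", weightedScore(scores, weights)) // 17
}
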
Summary

This article walked through the scheduler source code: the directory structure, the scheduling flow, and the core functions. Following the code along the scheduling process should help you understand the overall working mechanism of the scheduler.
