An Analysis of the Kubernetes Scheduler

Source: Internet
Author: User
Tags: error handling, reflector
1. Introduction to Kubernetes Scheduler

The Kubernetes scheduler runs on the master node. Its core job is to watch the apiserver for pods whose spec.nodeName is empty, and then, for each such pod, create a binding that records which node the pod should be dispatched to.

Where does it read the not-yet-scheduled pods from? The apiserver, of course. And how does it know a pod is unscheduled? It asks the apiserver for pods whose Spec.NodeName field is empty, and after a pod has been scheduled it writes the result back to the apiserver.

Although the principle of scheduling is simple, writing a good scheduler is not easy, because there are many things to consider:

Fault tolerance: spread the workload evenly across nodes to minimize the damage caused by a single node going down
Scalability: as the cluster grows, the scheduler must not become a performance bottleneck
High availability: schedulers can run as a group, so that a problem with any one scheduler does not affect scheduling for the whole cluster
Flexibility: different users have different scheduling requirements; a good scheduler also lets users configure different scheduling algorithms
Reasonable and efficient use of resources: the scheduler should maximize the cluster's resource utilization and prevent resources from being wasted

Unlike other components, the scheduler code lives in the plugin/ directory: plugin/cmd/kube-scheduler/ is the main entry point, and plugin/pkg/scheduler/ contains the concrete scheduling algorithms. This directory structure also shows that kube-scheduler is attached to the cluster as a plugin, so its final form should be easy for users to customize and extend.

2. Code Analysis

2.1 Start Process

Although it is placed in the plugin/ directory, kube-scheduler's boot process is the same as the other components: it creates a SchedulerServer, a struct holding the configuration the scheduler needs at startup, parses the command-line arguments to fill in the struct's fields, and at the end calls app.Run(s) to run the scheduler.


func main() {
    s := options.NewSchedulerServer()
    s.AddFlags(pflag.CommandLine)

    flag.InitFlags()
    logs.InitLogs()
    defer logs.FlushLogs()

    verflag.PrintAndExitIfRequested()

    app.Run(s)
}

app.Run(s) constructs the various instances based on the configuration and then runs the scheduler's core logic, which runs forever and never exits.


func Run(s *options.SchedulerServer) error {
    configFactory := factory.NewConfigFactory(leaderElectionClient, s.SchedulerName, s.HardPodAffinitySymmetricWeight, s.FailureDomains)
    config, err := createConfig(s, configFactory)
    ...
    sched := scheduler.New(config)

    run := func(_ <-chan struct{}) {
        sched.Run()
        select {}
    }
    ...
    // A highly available cluster running multiple kube-scheduler instances
    // uses the leader-election feature here
    ...

The main logic of the Run method is this: create the configuration the scheduler needs from the passed-in parameters (primarily the various structs it uses), then call scheduler.New to create a new Scheduler object from that config, and finally run the object to start the scheduling loop. Note that the Config object is itself created on the basis of the ConfigFactory.

Understanding how Config is created and what it contains is important for understanding how the scheduler works, so let's start by dissecting that code.

2.2 Creation of Config

The factory.NewConfigFactory method creates a ConfigFactory object, most of which is list-and-watch machinery used to keep the contents of various resources synchronized from the apiserver as input for scheduling. In addition, two members of the struct are particularly important: PodQueue, a queue holding the pods that have not yet been scheduled, and PodLister, which keeps pod status information in sync.


func NewConfigFactory(client clientset.Interface, schedulerName string, hardPodAffinitySymmetricWeight int, failureDomains string) *ConfigFactory {
    // schedulerCache holds pod and node information and is the source of truth
    // for both during the scheduling process
    schedulerCache := schedulercache.New(30*time.Second, stopEverything)

    informerFactory := informers.NewSharedInformerFactory(client, 0)
    pvcInformer := informerFactory.PersistentVolumeClaims()

    c := &ConfigFactory{
        Client:             client,
        PodQueue:           cache.NewFIFO(cache.MetaNamespaceKeyFunc),
        ScheduledPodLister: &cache.StoreToPodLister{},
        InformerFactory:    informerFactory,
        // A very important part of ConfigFactory is the various listers, which
        // fetch lists of the various resources and stay in real-time sync with
        // the apiserver
        NodeLister:       &cache.StoreToNodeLister{},
        PVLister:         &cache.StoreToPVFetcher{Store: cache.NewStore(cache.MetaNamespaceKeyFunc)},
        PVCLister:        pvcInformer.Lister(),
        PVCPopulator:     pvcInformer.Informer().GetController(),
        ServiceLister:    &cache.StoreToServiceLister{Indexer: cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc})},
        ControllerLister: &cache.StoreToReplicationControllerLister{Indexer: cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc})},
        ReplicaSetLister: &cache.StoreToReplicaSetLister{Indexer: cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc})},
        schedulerCache:   schedulerCache,
        StopEverything:   stopEverything,
        SchedulerName:    schedulerName,
        HardPodAffinitySymmetricWeight: hardPodAffinitySymmetricWeight,
        FailureDomains:   failureDomains,
    }

    // PodLister is created differently from the other listers: it is the
    // schedulerCache itself
    c.PodLister = schedulerCache

    // ScheduledPodLister holds pods that are already scheduled, i.e. pods whose
    // Spec.NodeName is not empty and whose phase is neither Failed nor Succeeded.
    // An informer is a layer of encapsulation over a reflector: the reflector
    // keeps the store up to date with the list-watch results in real time, and
    // the informer calls the corresponding handler on every update. The handlers
    // here propagate pod data from the store into the schedulerCache.
    c.ScheduledPodLister.Indexer, c.scheduledPodPopulator = cache.NewIndexerInformer(
        c.createAssignedNonTerminatedPodLW(),
        &api.Pod{},
        0,
        cache.ResourceEventHandlerFuncs{
            AddFunc:    c.addPodToCache,
            UpdateFunc: c.updatePodInCache,
            DeleteFunc: c.deletePodFromCache,
        },
        cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc},
    )

    // Likewise, synchronize node data into the schedulerCache
    _, c.nodePopulator = cache.NewInformer(
        c.createNodeLW(),
        &api.Node{},
        0,
        cache.ResourceEventHandlerFuncs{
            AddFunc:    c.addNodeToCache,
            UpdateFunc: c.updateNodeInCache,
            DeleteFunc: c.deleteNodeFromCache,
        },
    )
    ...
    return c
}

The ConfigFactory contains a variety of listers, which fetch information about the various resources in Kubernetes, while the schedulerCache holds the most up-to-date pod and node information needed during scheduling.

Then createConfig(s, configFactory) creates the Config object actually used by the scheduler, based on the configuration parameters and the configFactory.

func createConfig(s *options.SchedulerServer, configFactory *factory.ConfigFactory) (*scheduler.Config, error) {
    if _, err := os.Stat(s.PolicyConfigFile); err == nil {
        var (
            policy     schedulerapi.Policy
            configData []byte
        )
        configData, err := ioutil.ReadFile(s.PolicyConfigFile)
        ...
        if err := runtime.DecodeInto(latestschedulerapi.Codec, configData, &policy); err != nil {
            return nil, fmt.Errorf("invalid configuration: %v", err)
        }
        return configFactory.CreateFromConfig(policy)
    }
    return configFactory.CreateFromProvider(s.AlgorithmProvider)
}

createConfig thus has two ways of creating a scheduler.Config, depending on the configuration: from a policy file, or from a named algorithm provider. Both paths converge on CreateFromKeys:

func (f *ConfigFactory) CreateFromKeys(predicateKeys, priorityKeys sets.String, extenders []algorithm.SchedulerExtender) (*scheduler.Config, error) {
    // Get all the predicate functions
    predicateFuncs, err := f.GetPredicates(predicateKeys)

    // Priorities return PriorityConfigs rather than plain functions: first
    // because a priority also carries a weight, and second because the priority
    // implementations are migrating to a map-reduce style
    priorityConfigs, err := f.GetPriorityFunctionConfigs(priorityKeys)

    // The two metadata producers fetch metadata used during scheduling, such as
    // affinity, tolerations, pod ports (ports in use), and resource requests
    priorityMetaProducer, err := f.GetPriorityMetadataProducer()
    predicateMetaProducer, err := f.GetPredicateMetadataProducer()

    // Run the internal logic of the various informers, synchronizing resource
    // data from the apiserver into the listers and cache
    f.Run()

    // Construct the genericScheduler object, whose most important method is
    // Schedule, discussed below
    algo := scheduler.NewGenericScheduler(f.schedulerCache, predicateFuncs, predicateMetaProducer, priorityConfigs, priorityMetaProducer, extenders)
    ...
    // Return the final Config object
    return &scheduler.Config{
        SchedulerCache: f.schedulerCache,
        NodeLister:     f.NodeLister.NodeCondition(getNodeConditionPredicate()),
        Algorithm:      algo,
        Binder:         &binder{f.Client},
        PodConditionUpdater: &podConditionUpdater{f.Client},
        // NextPod pops the next unscheduled pod from the PodQueue
        NextPod: func() *api.Pod {
            return f.getNextPod()
        },
        // The error handler invoked when scheduling fails; it puts the pod back
        // into the PodQueue for the next round
        Error:          f.makeDefaultErrorFunc(&podBackoff, f.PodQueue),
        StopEverything: f.StopEverything,
    }, nil
}

Config is defined in the file plugin/pkg/scheduler/scheduler.go. It splits the scheduler's logic into several components that provide these features:

The NextPod() method returns the next pod that needs to be scheduled
The Algorithm.Schedule() method computes the scheduling result (a node) for a pod
The Error() method puts a pod back into the scheduling queue for retry when an error occurs
The SchedulerCache temporarily stores the pods being scheduled and reserves the resources they require, guaranteeing that resources do not conflict
Binder.Bind sends the scheduling result to the apiserver for persistence after a successful schedule
The Scheduler object combines these logical components to carry out the final scheduling task.
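As an illustration only, the decomposition above can be sketched as a struct of pluggable function fields. The types below are simplified stand-ins of my own, not the real scheduler.Config definition:

```go
package main

import "fmt"

// Simplified stand-ins for the real API types (illustrative assumption).
type Pod struct{ Name, NodeName string }
type Binding struct{ PodName, Node string }

// A toy Config mirroring the decomposition described above: each responsibility
// is a pluggable member, so the scheduler loop itself stays trivial.
type Config struct {
	NextPod  func() *Pod                // pop the next unscheduled pod
	Schedule func(*Pod) (string, error) // pick a node for the pod
	Bind     func(*Binding) error       // persist the result
	Error    func(*Pod, error)          // requeue the pod on failure
}

// scheduleOne wires the components together for a single pod.
func scheduleOne(c *Config) {
	pod := c.NextPod()
	node, err := c.Schedule(pod)
	if err != nil {
		c.Error(pod, err)
		return
	}
	c.Bind(&Binding{PodName: pod.Name, Node: node})
}

func main() {
	var bound Binding
	c := &Config{
		NextPod:  func() *Pod { return &Pod{Name: "web-1"} },
		Schedule: func(p *Pod) (string, error) { return "node-a", nil },
		Bind:     func(b *Binding) error { bound = *b; return nil },
		Error:    func(p *Pod, err error) {},
	}
	scheduleOne(c)
	fmt.Println(bound.PodName, "->", bound.Node)
}
```

Because every component is a field, each can be swapped or mocked independently, which is exactly what makes the real Config easy to assemble from a factory.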

Among Config's logical components, the Algorithm.Schedule() method is responsible for scheduling a pod. The value assigned to it is genericScheduler, an implementation of the scheduling-algorithm interface and also kube-scheduler's default scheduler; it handles the scheduling of a single pod and returns the result:


func NewGenericScheduler(
    cache schedulercache.Cache,
    predicates map[string]algorithm.FitPredicate,
    predicateMetaProducer algorithm.MetadataProducer,
    prioritizers []algorithm.PriorityConfig,
    priorityMetaProducer algorithm.MetadataProducer,
    extenders []algorithm.SchedulerExtender) algorithm.ScheduleAlgorithm {
    return &genericScheduler{
        cache:                 cache,
        predicates:            predicates,
        predicateMetaProducer: predicateMetaProducer,
        prioritizers:          prioritizers,
        priorityMetaProducer:  priorityMetaProducer,
        extenders:             extenders,
        cachedNodeInfoMap:     make(map[string]*schedulercache.NodeInfo),
    }
}

The scheduling-algorithm interface has only one method: Schedule. Its first parameter is the pod to be scheduled, and the second is an interface object that can return the node list. It returns the name of a node, indicating that the pod will be dispatched to that node.


type ScheduleAlgorithm interface {
    Schedule(*api.Pod, NodeLister) (selectedMachine string, err error)
}
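To make the contract concrete, here is a minimal toy implementation of an analogous interface. The types are simplified stand-ins (not the real api.Pod or algorithm.NodeLister), and the first-fit strategy is just an example:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-in for the real pod type (illustrative assumption).
type Pod struct{ Name string }

// NodeLister mirrors the "object that can return the node list" role.
type NodeLister interface {
	List() ([]string, error)
}

// The same single-method contract as ScheduleAlgorithm.
type ScheduleAlgorithm interface {
	Schedule(pod *Pod, lister NodeLister) (selectedMachine string, err error)
}

// staticNodes is a fixed node list satisfying NodeLister.
type staticNodes []string

func (s staticNodes) List() ([]string, error) { return s, nil }

// firstFit is a trivial algorithm: pick the first listed node.
type firstFit struct{}

func (firstFit) Schedule(pod *Pod, lister NodeLister) (string, error) {
	nodes, err := lister.List()
	if err != nil {
		return "", err
	}
	if len(nodes) == 0 {
		return "", errors.New("no nodes available")
	}
	return nodes[0], nil
}

func main() {
	var algo ScheduleAlgorithm = firstFit{}
	node, _ := algo.Schedule(&Pod{Name: "web-1"}, staticNodes{"node-a", "node-b"})
	fmt.Println(node)
}
```

Anything that implements this one method can be plugged into the scheduler, which is what makes custom schedulers possible.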

Once the Config is created, the next step is the creation and running of the Scheduler, which implements the core scheduling logic and continuously selects a suitable node for every pod that needs to be scheduled:

sched := scheduler.New(config)

run := func(_ <-chan struct{}) {
    sched.Run()
    select {}
}
To sum up, the relationships among ConfigFactory, Config, and Scheduler are shown in the following illustration:

ConfigFactory corresponds to the factory pattern: it generates a Config according to the given configuration and parameters, preparing in advance the various data the Config needs.
Config is the most important component in the scheduler; it implements the logic of each part of scheduling.
Scheduler uses the functionality provided by Config to complete the scheduling.
If scheduling is compared to cooking, constructing the Config is like preparing the ingredients and spices: washing the vegetables and pre-processing the ingredients. Scheduling itself is the process of turning the prepared ingredients into a dish.

2.3 The Logic of Scheduling

Continuing from the analysis above, let's look at how the Scheduler is created and run. The corresponding code is in the plugin/pkg/scheduler/scheduler.go file:

// The Scheduler struct itself is very simple; it puts everything into the Config object
type Scheduler struct {
    config *Config
}

// Creating a Scheduler just wraps the Config in the struct
func New(c *Config) *Scheduler {
    s := &Scheduler{
        config: c,
    }
    return s
}

func (s *Scheduler) Run() {
    go wait.Until(s.scheduleOne, 0, s.config.StopEverything)
}

func (s *Scheduler) scheduleOne() {
    pod := s.config.NextPod()
    dest, err := s.config.Algorithm.Schedule(pod, s.config.NodeLister)
    ...
    // "assumed" means a host has been selected for the pod, but the binding has
    // not yet been created in the apiserver. This state is stored separately in
    // the schedulerCache and temporarily occupies resources on the node.
    assumed := *pod
    assumed.Spec.NodeName = dest
    if err := s.config.SchedulerCache.AssumePod(&assumed); err != nil {
        return
    }

    // Perform the bind operation on the pod asynchronously
    go func() {
        b := &api.Binding{
            ObjectMeta: api.ObjectMeta{Namespace: pod.Namespace, Name: pod.Name},
            Target: api.ObjectReference{
                Kind: "Node",
                Name: dest,
            },
        }

        err := s.config.Binder.Bind(b)
        if err != nil {
            // The bind failed: remove the pod's information so the node
            // resources it occupied are released and can be used by other pods
            if err := s.config.SchedulerCache.ForgetPod(&assumed); err != nil {
                glog.Errorf("scheduler cache ForgetPod failed: %v", err)
            }
            s.config.PodConditionUpdater.Update(pod, &api.PodCondition{
                Type:   api.PodScheduled,
                Status: api.ConditionFalse,
                Reason: "BindingRejected",
            })
            return
        }
    }()
}

Scheduler.Run keeps calling s.scheduleOne(), scheduling one pod per iteration.

The corresponding scheduling logic is shown in the following illustration:

The next sections break this down step by step.

2.3.1 The next pod to be scheduled

The NextPod function is configFactory.getNextPod(), which returns, from a queue of never-scheduled pods, the next pod that should be scheduled by the current scheduler.

It pops out of configFactory.PodQueue a pod that should be scheduled by the current scheduler. A pod can name its desired scheduler through an annotation; if a scheduler finds that the name matches its own, the pod should be scheduled by it. If the value is empty, the pod is handled by the default scheduler.
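That name check can be sketched as follows. The annotation key and the exact matching rule are illustrative assumptions modeled on the era's alpha annotation, not a verbatim copy of getNextPod:

```go
package main

import "fmt"

// Illustrative annotation key; in the scheduler version discussed here the
// pod's desired scheduler was carried in an annotation rather than a spec field.
const schedulerAnnotationKey = "scheduler.alpha.kubernetes.io/name"

// responsibleForPod reports whether a scheduler named mySchedulerName should
// schedule a pod with the given annotations: an empty value means the default
// scheduler takes it; otherwise the names must match.
func responsibleForPod(annotations map[string]string, mySchedulerName string, isDefault bool) bool {
	name := annotations[schedulerAnnotationKey]
	if name == "" {
		return isDefault
	}
	return name == mySchedulerName
}

func main() {
	pod := map[string]string{schedulerAnnotationKey: "gpu-scheduler"}
	fmt.Println(responsibleForPod(pod, "gpu-scheduler", false)) // named scheduler claims it
	fmt.Println(responsibleForPod(nil, "gpu-scheduler", false)) // unannotated pod goes to the default
}
```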

PodQueue is a first-in, first-out queue:

PodQueue: cache.NewFIFO(cache.MetaNamespaceKeyFunc)

The implementation of this FIFO is in the pkg/client/cache/fifo.go file. The contents of PodQueue are synchronized in real time from the apiserver by a reflector; it holds the pods that need scheduling (Spec.NodeName is empty, and the phase is neither Succeeded nor Failed):

func (f *ConfigFactory) Run() {
    // Watch and queue pods that need scheduling.
    cache.NewReflector(f.createUnassignedNonTerminatedPodLW(), &api.Pod{}, f.PodQueue, 0).RunUntil(f.StopEverything)
    ...
}

func (factory *ConfigFactory) createUnassignedNonTerminatedPodLW() *cache.ListWatch {
    selector := fields.ParseSelectorOrDie("spec.nodeName==" + "" + ",status.phase!=" + string(api.PodSucceeded) + ",status.phase!=" + string(api.PodFailed))
    return cache.NewListWatchFromClient(factory.Client.Core().RESTClient(), "pods", api.NamespaceAll, selector)
}
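The selector built above is ultimately just a field-selector string. A self-contained sketch of its construction, using plain strings instead of client-go (the phase constants are illustrative copies of api.PodSucceeded and api.PodFailed):

```go
package main

import (
	"fmt"
	"strings"
)

// Phase values mirroring api.PodSucceeded / api.PodFailed (illustrative copies).
const (
	podSucceeded = "Succeeded"
	podFailed    = "Failed"
)

// unassignedNonTerminatedSelector builds the same field selector as
// createUnassignedNonTerminatedPodLW: nodeName empty, phase not terminal.
func unassignedNonTerminatedSelector() string {
	parts := []string{
		"spec.nodeName==",               // not yet scheduled
		"status.phase!=" + podSucceeded, // not finished successfully
		"status.phase!=" + podFailed,    // not failed
	}
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(unassignedNonTerminatedSelector())
}
```

The apiserver evaluates this selector server-side, so the scheduler's queue only ever receives pods it may actually need to schedule.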

2.3.2 Scheduling a single pod

After a pod is obtained, the specific scheduling algorithm is called to select a node.

dest, err := s.config.Algorithm.Schedule(pod, s.config.NodeLister)

As mentioned above, the default scheduling algorithm is genericScheduler.
Its code is in the plugin/pkg/scheduler/generic_scheduler.go file:

func (g *genericScheduler) Schedule(pod *api.Pod, nodeLister algorithm.NodeLister) (string, error) {
    // Step 1: get node information from the nodeLister
    nodes, err := nodeLister.List()
    ...
    // The schedulerCache holds the newest pod and node data; use it to refresh
    // cachedNodeInfoMap, the node information consulted during scheduling
    err = g.cache.UpdateNodeNameToInfoMap(g.cachedNodeInfoMap)

    // Step 2: run the predicates, filtering out the nodes that satisfy the
    // scheduling conditions
    filteredNodes, failedPredicateMap, err := findNodesThatFit(pod, g.cachedNodeInfoMap, nodes, g.predicates, g.extenders, g.predicateMetaProducer)
    if len(filteredNodes) == 0 {
        return "", &FitError{
            Pod:              pod,
            FailedPredicates: failedPredicateMap,
        }
    }

    // Step 3: run the priorities, ranking the eligible nodes
    metaPrioritiesInterface := g.priorityMetaProducer(pod, g.cachedNodeInfoMap)
    priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)
    if err != nil {
        return "", err
    }

    // Step 4: select one node from the final result
    return g.selectHost(priorityList)
}

The scheduling algorithm proceeds in four steps:

Gather the necessary data, namely the pod and node information. The pod is passed in as a parameter; node information comes in two forms: the node list obtained through the nodeLister, and cachedNodeInfoMap. The latter also tracks resource usage, such as how many pods are already scheduled on a node, how much has been requested, and how much remains allocatable
Perform the filtering step: based on the current pod and node information, filter out the nodes that are not suitable to run the pod
Perform the priority step: rank the nodes that are suitable for running the pod
Select the node with the highest final priority as the pod's scheduling result
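The four steps can be sketched end to end with toy types. Everything below is a simplified, hypothetical model (single CPU resource, trivial scoring), not the real generic_scheduler code:

```go
package main

import "fmt"

// Toy types: a node with free CPU (millicores) and a pod with a CPU request.
type node struct {
	name    string
	freeCPU int
}
type pod struct {
	name       string
	requestCPU int
}

// Step 2: predicates filter out nodes that cannot run the pod.
func filterNodes(p pod, nodes []node) []node {
	var fit []node
	for _, n := range nodes {
		if n.freeCPU >= p.requestCPU {
			fit = append(fit, n)
		}
	}
	return fit
}

// Step 3: priorities score the remaining nodes (here: most headroom wins).
func prioritizeNodes(p pod, nodes []node) map[string]int {
	scores := make(map[string]int)
	for _, n := range nodes {
		scores[n.name] = n.freeCPU - p.requestCPU
	}
	return scores
}

// Step 4: select the highest-scoring node.
func selectHost(scores map[string]int, nodes []node) (string, bool) {
	best, bestScore, found := "", 0, false
	for _, n := range nodes { // iterate the slice for deterministic order
		if s, ok := scores[n.name]; ok && (!found || s > bestScore) {
			best, bestScore, found = n.name, s, true
		}
	}
	return best, found
}

// schedule runs the whole pipeline for one pod.
func schedule(p pod, nodes []node) (string, bool) {
	fit := filterNodes(p, nodes)
	return selectHost(prioritizeNodes(p, fit), fit)
}

func main() {
	nodes := []node{{"node-a", 500}, {"node-b", 2000}, {"node-c", 100}}
	host, ok := schedule(pod{"web-1", 250}, nodes)
	fmt.Println(host, ok) // node-b has the most headroom
}
```

The real scheduler follows the same filter-then-rank-then-pick shape; the complexity lives in the many predicate and priority functions plugged into it.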

The following sections explain the filtering and prioritizing steps.
2.3.3 Filtering (predicates): removing unsuitable nodes

The scheduler's input is one pod (scheduling multiple pods is done by iterating) and multiple nodes; its output is a single node, indicating that the pod will be dispatched to that node.

How do we find the most suitable node for a pod? The first step is to remove the nodes that do not meet the scheduling conditions, a step Kubernetes calls predicates.

The function called for filtering is findNodesThatFit, and its code is in the plugin/pkg/scheduler/generic_scheduler.go file:

func findNodesThatFit(
    pod *api.Pod,
    nodeNameToInfo map[string]*schedulercache.NodeInfo,
    nodes []*api.Node,
    predicateFuncs map[string]algorithm.FitPredicate,
    extenders []algorithm.SchedulerExtender,
    metadataProducer algorithm.MetadataProducer,
) ([]*api.Node, FailedPredicateMap, error) {
    // filtered holds the nodes that pass the filter
    var filtered []*api.Node
    // failedPredicateMap holds the filtered-out nodes unsuitable for the pod
    failedPredicateMap := FailedPredicateMap{}

    if len(predicateFuncs) == 0 {
        filtered = nodes
    } else {
        filtered = make([]*api.Node, len(nodes))
        errs := []error{}
        var predicateResultLock sync.Mutex
        var filteredLen int32

        // meta can be queried for pod and scheduling information
        meta := metadataProducer(pod, nodeNameToInfo)

        // checkNode checks whether a single node can run the pod
        checkNode := func(i int) {
            nodeName := nodes[i].Name
            fits, failedPredicates, err := podFitsOnNode(pod, meta, nodeNameToInfo[nodeName], predicateFuncs)
            ...
            if fits {
                filtered[atomic.AddInt32(&filteredLen, 1)-1] = nodes[i]
            } else {
                predicateResultLock.Lock()
                failedPredicateMap[nodeName] = failedPredicates
                predicateResultLock.Unlock()
            }
        }

        // Use workqueue to run the checks in parallel, up to a maximum number
        // of concurrent workers
        workqueue.Parallelize(16, len(nodes), checkNode)
        filtered = filtered[:filteredLen]
        if len(errs) > 0 {
            return []*api.Node{}, FailedPredicateMap{}, errors.NewAggregate(errs)
        }
    }

    // On top of the basic filtering, continue with the extenders' filtering logic
    ...
    return filtered, failedPredicateMap, nil
}

The main task of the code above is concurrency control, error handling, and result collection for the pod-filtering work. Nodes that fail the filter are saved in the failedPredicateMap dictionary, whose key is the node name and whose value is the list of failure reasons; nodes that pass are saved in the filtered array.

For each pod, we check whether it can be dispatched to every node in the cluster (only nodes that are candidates for scheduling are included), and the per-node judgments are independent: whether the pod can be dispatched to one node has nothing to do with any other node (at least for now; if this assumption ever stops holding, the concurrency will need coordination). Concurrency can therefore be used to improve performance. It is implemented with workqueue, and the maximum number of concurrent workers is 16, a hard-coded value.
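The concurrency pattern described here (a preallocated result slice filled via an atomic index, a mutex-protected failure map) can be sketched without the k8s workqueue package, with a simple worker pool standing in for workqueue.Parallelize:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// parallelize runs doWork(0..pieces-1) on up to `workers` goroutines,
// a minimal stand-in for workqueue.Parallelize.
func parallelize(workers, pieces int, doWork func(i int)) {
	toProcess := make(chan int, pieces)
	for i := 0; i < pieces; i++ {
		toProcess <- i
	}
	close(toProcess)

	var wg sync.WaitGroup
	wg.Add(workers)
	for w := 0; w < workers; w++ {
		go func() {
			defer wg.Done()
			for i := range toProcess {
				doWork(i)
			}
		}()
	}
	wg.Wait()
}

// findNodesThatFit mirrors the concurrency structure described above:
// successes land in a preallocated slice via an atomic counter, failures in a
// mutex-protected map. fits is a toy predicate supplied by the caller.
func findNodesThatFit(nodes []string, fits func(node string) (bool, string)) ([]string, map[string]string) {
	filtered := make([]string, len(nodes))
	failed := map[string]string{}
	var lock sync.Mutex
	var filteredLen int32

	checkNode := func(i int) {
		ok, reason := fits(nodes[i])
		if ok {
			filtered[atomic.AddInt32(&filteredLen, 1)-1] = nodes[i]
		} else {
			lock.Lock()
			failed[nodes[i]] = reason
			lock.Unlock()
		}
	}
	parallelize(16, len(nodes), checkNode) // 16 workers, as in the real code
	return filtered[:filteredLen], failed
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	fit, failed := findNodesThatFit(nodes, func(n string) (bool, string) {
		if n == "node-b" {
			return false, "insufficient memory"
		}
		return true, ""
	})
	fmt.Println(len(fit), "fit,", len(failed), "failed")
}
```

Note the design choice: successes need only an atomic counter because each worker writes a distinct slice slot, while the failure map needs a mutex because map writes in Go are not safe for concurrent use.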

Whether a pod and a node match is judged by the podFitsOnNode function:

func podFitsOnNode(pod *api.Pod, meta interface{}, info *schedulercache.NodeInfo, predicateFuncs map[string]algorithm.FitPredicate) (bool, []algorithm.PredicateFailureReason, error) {
    var failedPredicates []algorithm.PredicateFailureReason
    for _, predicate := range predicateFuncs {
        fit, reasons, err := predicate(pod, meta, info)
        if err != nil {
            err := fmt.Errorf("SchedulerPredicates failed due to %v, which is unexpected.", err)
            return false, []algorithm.PredicateFailureReason{}, err
        }
        if !fit {
            failedPredicates = append(failedPredicates, reasons...)
        }
    }
    return len(failedPredicates) == 0, failedPredicates, nil
}

It loops over all the filtering methods defined in predicateFuncs and returns whether the node satisfies the scheduling conditions, along with any failure reasons. Every predicate function has this type:


type FitPredicate func(pod *api.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (bool, []PredicateFailureReason, error)

It accepts three parameters:

pod: the pod to be scheduled
meta: a means of obtaining pod and scheduling metadata during filtering
nodeInfo: information about the node being filtered
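A toy predicate with the same shape makes the contract tangible. The types here are simplified stand-ins (NodeInfo reduced to free memory, failure reasons reduced to plain strings instead of algorithm.PredicateFailureReason):

```go
package main

import "fmt"

// Simplified stand-ins for the real types (illustrative assumption).
type Pod struct {
	Name       string
	RequestMem int // MiB
}
type NodeInfo struct {
	Name    string
	FreeMem int // MiB
}

// Same three-argument, three-result shape as algorithm.FitPredicate,
// with the failure reasons reduced to strings.
type FitPredicate func(pod *Pod, meta interface{}, nodeInfo *NodeInfo) (bool, []string, error)

// podFitsMemory is a toy predicate: does the node have enough free memory?
func podFitsMemory(pod *Pod, meta interface{}, nodeInfo *NodeInfo) (bool, []string, error) {
	if nodeInfo.FreeMem < pod.RequestMem {
		reason := fmt.Sprintf("insufficient memory: want %d, free %d", pod.RequestMem, nodeInfo.FreeMem)
		return false, []string{reason}, nil
	}
	return true, nil, nil
}

// podFitsOnNode loops over the predicates exactly as described above,
// accumulating failure reasons instead of stopping at the first one.
func podFitsOnNode(pod *Pod, meta interface{}, info *NodeInfo, predicates map[string]FitPredicate) (bool, []string, error) {
	var failed []string
	for _, predicate := range predicates {
		fit, reasons, err := predicate(pod, meta, info)
		if err != nil {
			return false, nil, err
		}
		if !fit {
			failed = append(failed, reasons...)
		}
	}
	return len(failed) == 0, failed, nil
}

func main() {
	preds := map[string]FitPredicate{"PodFitsMemory": podFitsMemory}
	fit, reasons, _ := podFitsOnNode(&Pod{"web-1", 512}, nil, &NodeInfo{"node-a", 256}, preds)
	fmt.Println(fit, reasons)
}
```

Collecting all failure reasons, rather than returning on the first one, is what lets the scheduler report exactly why each node was rejected.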

The concrete predicate implementations are in plugin/pkg/scheduler/algorithm/predicates/predicates.go:

NoVolumeZoneConflict: whether the volumes the pod requests can be used in the zone where the node resides, determined by matching the node against the PVs
MaxEBSVolumeCount: whether the requested volumes exceed the maximum number supported by EBS (Elastic Block Store)
