Kubernetes Scheduler Module Code Learning
The scheduler module in Kubernetes is relatively easy to understand, but its work is important: it is responsible for selecting the most appropriate node for each pod that has not yet been assigned one. Its job is to find a suitable node for the pod and then, through the Binder, tell the apiserver that the pod is bound to that node; the kubelet is responsible for all subsequent work. The scheduler continuously fetches from the apiserver the list of pods that have not yet been assigned a node and the list of available nodes in the current cluster. With these two lists, the process of finding a node for a pod in the scheduler can be divided into the following two steps:
3.1 Predicate Strategy
From the node list, select the nodes that can run the pod based on the pod's current requirements; this is known as the predicates process. The predicates strategies mainly include:
3.1.1 NoDiskConflict
Check for volume conflicts on this host. If the host has already mounted a volume, other pods that use the same volume cannot be scheduled to it. The volumes of pods already running on the node are checked against the following three policies (a small sketch follows the list):
a) GCE allows the same volume to be mounted by multiple pods simultaneously, as long as all mounts are read-only;
b) Amazon EBS does not allow different pods to mount the same volume;
c) Ceph RBD does not allow any two non-read-only pods to share the same monitor, pool and image;
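As a rough illustration of the volume-conflict idea, here is a minimal sketch in Go using simplified stand-in types (Volume, Pod and noDiskConflict are illustrative names, not the actual kube-scheduler structures); it models the GCE-style rule where a shared volume is only acceptable when every mount is read-only:

```go
package main

import "fmt"

// Simplified stand-ins for the real API types.
type Volume struct {
	Name     string
	ReadOnly bool
}

type Pod struct {
	Name    string
	Volumes []Volume
}

// noDiskConflict reports whether the new pod's volumes conflict with the
// volumes of pods already running on the node. A conflict arises when the
// same volume is used and at least one of the mounts is writable.
func noDiskConflict(newPod Pod, existingPods []Pod) bool {
	for _, nv := range newPod.Volumes {
		for _, ep := range existingPods {
			for _, ev := range ep.Volumes {
				if nv.Name == ev.Name && (!nv.ReadOnly || !ev.ReadOnly) {
					return false // same volume, at least one writable mount
				}
			}
		}
	}
	return true
}

func main() {
	running := []Pod{{Name: "p1", Volumes: []Volume{{Name: "data", ReadOnly: true}}}}
	fmt.Println(noDiskConflict(Pod{Name: "p2", Volumes: []Volume{{Name: "data", ReadOnly: true}}}, running))  // true
	fmt.Println(noDiskConflict(Pod{Name: "p3", Volumes: []Volume{{Name: "data", ReadOnly: false}}}, running)) // false
}
```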
3.1.2 NoVolumeZoneConflict
Check, given zone restrictions, whether there is a volume conflict with deploying the pod on this host. Some volumes may carry zone scheduling constraints; the VolumeZonePredicate evaluates whether the pod meets those conditions based on the needs of its volumes. The prerequisite is that any volume's zone labels must exactly match the zone labels on the node. A node can have multiple zone labels (for example, a hypothetical replicated volume might allow zone-wide access). Currently this only supports PersistentVolumeClaims, and it only looks at labels within the scope of the PersistentVolume. Handling volumes defined inline in the pod spec (i.e. not using a PersistentVolume) may become more difficult, as it would likely require calling the cloud provider to determine the volume's zone during scheduling. The specific process is to traverse the zones of all PVCs declared by the pod and check whether the corresponding labels exist on the node; if any is missing, the node is not suitable.
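A rough sketch of the zone-label matching described above, assuming the zone labels of the PersistentVolumes bound to the pod's PVCs have already been resolved (the real predicate fetches them through the apiserver); types and names are illustrative:

```go
package main

import "fmt"

// volumeZoneLabels: zone labels carried by each PersistentVolume bound to the pod's PVCs.
// nodeLabels: labels on the candidate node.
func noVolumeZoneConflict(volumeZoneLabels []map[string]string, nodeLabels map[string]string) bool {
	for _, vl := range volumeZoneLabels {
		for k, v := range vl {
			// Every zone label on the volume must match exactly on the node.
			if nodeLabels[k] != v {
				return false
			}
		}
	}
	return true
}

func main() {
	node := map[string]string{"failure-domain.beta.kubernetes.io/zone": "us-east-1a"}
	vols := []map[string]string{{"failure-domain.beta.kubernetes.io/zone": "us-east-1a"}}
	fmt.Println(noVolumeZoneConflict(vols, node)) // true
}
```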
3.1.3 MaxEBSVolumeCount
Ensure that the number of mounted EBS storage volumes does not exceed the configured maximum (default 39). It counts both volumes used directly and volumes of this storage type used indirectly through PVCs. If deploying the new pod would push the number of volumes above the maximum, the pod cannot be scheduled to this host.
3.1.4 MaxGCEPDVolumeCount
Ensure that the number of mounted GCE PD storage volumes does not exceed the configured maximum (default 16). The rule is the same as above; a sketch covering both volume-count predicates follows.
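Since MaxEBSVolumeCount and MaxGCEPDVolumeCount reduce to the same counting check, here is one hedged sketch covering both, assuming the relevant volume IDs used by existing pods and requested by the new pod have already been collected:

```go
package main

import "fmt"

// withinMaxVolumeCount reports whether adding the new pod's volumes keeps the
// node under the per-node attachable-volume limit (39 for EBS, 16 for GCE PD
// by default). A volume already attached to the node is only counted once.
func withinMaxVolumeCount(existing, requested []string, maxVolumes int) bool {
	attached := map[string]bool{}
	for _, v := range existing {
		attached[v] = true
	}
	for _, v := range requested {
		attached[v] = true
	}
	return len(attached) <= maxVolumes
}

func main() {
	existing := []string{"vol-a", "vol-b"}
	fmt.Println(withinMaxVolumeCount(existing, []string{"vol-c"}, 39)) // true
	fmt.Println(withinMaxVolumeCount(existing, []string{"vol-a"}, 2))  // true: vol-a already attached
}
```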
3.1.5 PodFitsResources
Check that the host's resources meet the pod's requirements. Scheduling is based on the amount of resources actually requested, rather than the amount of resources actually in use. The checks are as follows (see the sketch after the list):
a) AllowedPodNumber: the maximum number of pods a node may run (110 by default). If the node is already running the maximum allowed number, it cannot run new pods.
b) Memory: the memory requested by the pod plus the memory requested by the pods already scheduled on the node cannot exceed the memory the node can allocate. The pod's request size must be configured at creation time, otherwise it defaults to 0. Note also that the node's allocatable memory is derived from its physical memory size, not from the node's remaining free memory.
c) CPU: the available CPU resources on the node are the number of CPUs on the node times 1000 (a node with 8 CPUs has 8000 millicores of available CPU); the required CPU resources are specified when creating the pod.
d) GPU: in experiments the available GPU resources were found to be 0, presumably because GPU support was not enabled.
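A compact sketch of the fit check based on requests rather than actual usage; the types and the allocatable numbers in main are simplified illustrations, not the real scheduler structures:

```go
package main

import "fmt"

type Resources struct {
	MilliCPU int64 // 1000 per core
	Memory   int64 // bytes
}

type NodeInfo struct {
	Allocatable Resources
	Requested   Resources // sum of requests of pods already scheduled here
	AllowedPods int
	PodCount    int
}

// podFitsResources checks pod count, CPU and memory against what the node can allocate.
func podFitsResources(podRequest Resources, node NodeInfo) bool {
	if node.PodCount+1 > node.AllowedPods {
		return false
	}
	if node.Requested.MilliCPU+podRequest.MilliCPU > node.Allocatable.MilliCPU {
		return false
	}
	if node.Requested.Memory+podRequest.Memory > node.Allocatable.Memory {
		return false
	}
	return true
}

func main() {
	node := NodeInfo{
		Allocatable: Resources{MilliCPU: 8000, Memory: 16 << 30},
		Requested:   Resources{MilliCPU: 6000, Memory: 12 << 30},
		AllowedPods: 110,
		PodCount:    30,
	}
	fmt.Println(podFitsResources(Resources{MilliCPU: 1000, Memory: 2 << 30}, node)) // true
	fmt.Println(podFitsResources(Resources{MilliCPU: 3000, Memory: 1 << 30}, node)) // false: CPU over-committed
}
```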
3.1.6 PodFitsHost
If a nodeName is specified when the pod is created, PodFitsHost checks whether it matches node.Name.
3.1.7 PodFitsHostPorts
Check whether the HostPort required by each container in the pod is already occupied by another container. If a required HostPort is unavailable, the pod cannot be scheduled to this host.
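A small sketch of the host-port check, assuming the ports already taken by scheduled pods have been collected into a set (names are illustrative):

```go
package main

import "fmt"

// podFitsHostPorts reports whether any HostPort the pod needs is already taken on the node.
func podFitsHostPorts(wantedPorts []int, usedPorts map[int]bool) bool {
	for _, p := range wantedPorts {
		if p == 0 {
			continue // 0 means "no HostPort requested"
		}
		if usedPorts[p] {
			return false
		}
	}
	return true
}

func main() {
	used := map[int]bool{80: true, 443: true}
	fmt.Println(podFitsHostPorts([]int{8080}, used)) // true
	fmt.Println(podFitsHostPorts([]int{80}, used))   // false
}
```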
3.1.8 PodSelectorMatches
Check that the node's labels satisfy the pod's NodeSelector attribute. In addition, extract the NodeAffinity from the pod's annotations and check whether the node's labels satisfy it. (NodeAffinity matching rules are more flexible than NodeSelector: NodeSelector is exact == matching, while NodeAffinity supports the In, NotIn, Exists, DoesNotExist, Gt and Lt operators.)
3.1.9 PodToleratesNodeTaints
Check whether the pod can tolerate the taints on the node. Extract the list of taints the pod can tolerate from the pod's annotations, then determine whether every taint on the node is covered by the pod's tolerations.
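A sketch of the toleration check with simplified Taint/Toleration types (the real types also support operators such as Exists and several effects; only the NoSchedule effect is modeled here):

```go
package main

import "fmt"

type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Value, Effect string
}

// tolerates reports whether a single toleration covers a taint.
func tolerates(tol Toleration, t Taint) bool {
	return tol.Key == t.Key && tol.Value == t.Value &&
		(tol.Effect == "" || tol.Effect == t.Effect)
}

// podToleratesNodeTaints: every NoSchedule taint on the node must be tolerated by the pod.
func podToleratesNodeTaints(tolerations []Toleration, taints []Taint) bool {
	for _, t := range taints {
		if t.Effect != "NoSchedule" {
			continue
		}
		tolerated := false
		for _, tol := range tolerations {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: "dedicated", Value: "db", Effect: "NoSchedule"}}
	fmt.Println(podToleratesNodeTaints([]Toleration{{Key: "dedicated", Value: "db"}}, taints)) // true
	fmt.Println(podToleratesNodeTaints(nil, taints))                                           // false
}
```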
3.1.10 CheckNodeMemoryPressure
Based on the resource requests of its containers (CPU, memory, GPU; requests must not exceed limits), a pod falls into one of three QoS classes:
1) Guaranteed: every container in the pod defines both requests and limits, and requests == limits != 0;
2) BestEffort: no container in the pod defines requests or limits;
3) Burstable: any pod that is neither BestEffort nor Guaranteed.
If the pod is BestEffort and the node's current MemoryPressure condition is true, the pod cannot be scheduled to that node (a node under memory pressure no longer accepts BestEffort pods).
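A sketch of the QoS classification and the memory-pressure check under these simplifications; ContainerResources and the helper names are illustrative, and the real code inspects every container's full request/limit lists:

```go
package main

import "fmt"

type ContainerResources struct {
	Requests, Limits map[string]int64 // e.g. "cpu", "memory"
}

// qosClass classifies a pod from its containers' requests and limits.
func qosClass(containers []ContainerResources) string {
	guaranteed, besteffort := true, true
	for _, c := range containers {
		if len(c.Requests) > 0 || len(c.Limits) > 0 {
			besteffort = false
		}
		for res, req := range c.Requests {
			if lim, ok := c.Limits[res]; !ok || lim != req || req == 0 {
				guaranteed = false
			}
		}
		if len(c.Requests) == 0 || len(c.Limits) == 0 || len(c.Requests) != len(c.Limits) {
			guaranteed = false
		}
	}
	switch {
	case besteffort:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

// checkNodeMemoryPressure: a BestEffort pod cannot land on a node under memory pressure.
func checkNodeMemoryPressure(containers []ContainerResources, memoryPressure bool) bool {
	return !(qosClass(containers) == "BestEffort" && memoryPressure)
}

func main() {
	be := []ContainerResources{{}} // no requests, no limits -> BestEffort
	fmt.Println(qosClass(be), checkNodeMemoryPressure(be, true)) // BestEffort false
}
```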
3.1.11 CheckNodeDiskPressure
When the node's current DiskPressure condition is true, the node does not accept any new pods.
3.1.12 MatchInterPodAffinity
This strategy is useful for some special scheduling requirements. Of the three affinity mechanisms (NodeAffinity, PodAffinity and PodAntiAffinity), the MatchInterPodAffinity strategy implements PodAffinity and PodAntiAffinity. PodAntiAffinity means the pod being scheduled cannot be deployed into the same topology domain as certain pods (a topology domain is defined by a node label, such as the node name). So if a node is running a pod whose PodAntiAffinity matches the pod currently being scheduled, the pod cannot be scheduled to that node; in other words, if a pod on a node does not want you, you cannot be scheduled to that node. PodAffinity is the opposite: if the pod to be scheduled defines PodAffinity, it can only be scheduled into a topology domain containing a running pod that satisfies its affinity (if no running pod satisfies its affinity, the pod stays Pending). The detailed detection process of this policy is as follows (a condensed sketch follows the summary below):
1) Check whether scheduling the pod here would violate the PodAntiAffinity of pods that are already running; if so, the node cannot be used to schedule the pod and we return, otherwise go to 2):
1.1) Traverse all currently running pods;
1.2) If an existingPod defines PodAntiAffinity, traverse every term in that PodAntiAffinity;
1.3) If the namespace and selector defined by the term match the namespace and labels of the pod being scheduled, and the node the existingPod runs on is in the same topology domain as the node being checked, the node cannot be used to schedule the pod; return;
2) Check the pod's own PodAffinity; if it is not satisfied, the node cannot be used to schedule the pod and we return, otherwise go to 3):
2.1) Traverse every term defined in the pod's PodAffinity;
2.2) Traverse all already running pods;
2.3) If an existingPod's namespace and labels satisfy the term, but the topology domain of the node running that existingPod differs from the topology domain of the node being checked, the node cannot be used to schedule the pod;
3) Check the pod's own PodAntiAffinity; if it is violated, return, otherwise the node can be used to schedule the pod:
3.1) Traverse every term defined in the pod's PodAntiAffinity;
3.2) Traverse all already running pods;
3.3) If an existingPod's namespace and labels satisfy a term and that existingPod runs in the same topology domain as the node being checked, the node cannot be used to schedule the pod.
The three steps can be summarized as follows: 1) if an already running pod does not like the pod, the pod cannot be scheduled into that running pod's topology domain; 2) the pod can only be scheduled into a topology domain containing pods it has affinity for; 3) the pod cannot be scheduled into the topology domain of pods it dislikes (anti-affinity).
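A condensed sketch of the three checks, with heavily simplified assumptions: a "term" is reduced to a single label equality plus a topology key, and namespaces are ignored (the real implementation uses full label selectors and namespace lists):

```go
package main

import "fmt"

type ExistingPod struct {
	Labels       map[string]string
	NodeLabels   map[string]string // labels of the node the pod runs on
	AntiAffinity []Term            // this pod's own anti-affinity terms
}

type Term struct {
	LabelKey, LabelValue string // pods the term selects
	TopologyKey          string // label key that defines the topology domain
}

func matches(t Term, labels map[string]string) bool {
	return labels[t.LabelKey] == t.LabelValue
}

func sameDomain(t Term, a, b map[string]string) bool {
	return a[t.TopologyKey] != "" && a[t.TopologyKey] == b[t.TopologyKey]
}

// interPodAffinityFits mirrors the 3-step flow: existing pods' anti-affinity,
// the new pod's affinity, then the new pod's own anti-affinity.
func interPodAffinityFits(podLabels map[string]string, affinity, antiAffinity []Term,
	nodeLabels map[string]string, existing []ExistingPod) bool {
	// 1) an existing pod rejects the new pod in this topology domain.
	for _, ep := range existing {
		for _, t := range ep.AntiAffinity {
			if matches(t, podLabels) && sameDomain(t, ep.NodeLabels, nodeLabels) {
				return false
			}
		}
	}
	// 2) every affinity term must be satisfied by some pod in this domain.
	for _, t := range affinity {
		ok := false
		for _, ep := range existing {
			if matches(t, ep.Labels) && sameDomain(t, ep.NodeLabels, nodeLabels) {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	// 3) the new pod's anti-affinity must not match any pod in this domain.
	for _, t := range antiAffinity {
		for _, ep := range existing {
			if matches(t, ep.Labels) && sameDomain(t, ep.NodeLabels, nodeLabels) {
				return false
			}
		}
	}
	return true
}

func main() {
	node := map[string]string{"kubernetes.io/hostname": "node-1"}
	running := []ExistingPod{{
		Labels:     map[string]string{"app": "db"},
		NodeLabels: node,
	}}
	anti := []Term{{LabelKey: "app", LabelValue: "db", TopologyKey: "kubernetes.io/hostname"}}
	fmt.Println(interPodAffinityFits(map[string]string{"app": "web"}, nil, anti, node, running)) // false
	fmt.Println(interPodAffinityFits(map[string]string{"app": "web"}, nil, nil, node, running))  // true
}
```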
Typical PodAntiAffinity application scenarios are:
1) Spreading a service's pods across topology domains to improve stability;
2) Spreading pods that may interfere with each other onto different nodes;
3) Isolating pods that need exclusive access to certain resources;
Typical PodAffinity application scenarios are:
1) Deploying a specific group of services into the same topology domain;
2) If pod1 relies on the services provided by pod2, deploying pod1 and pod2 into the same topology domain (for example the same machine room, or under the same switch) to reduce network latency;
3.2 Priority Strategy
Each node in the list filtered out in the previous step is scored according to the priority strategies. Each strategy gives a score between 0 and 10, which is multiplied by that strategy's weight to form its contribution; the contributions of all strategies are summed to give the node's final score. The node with the highest score is then selected as the pod's scheduling target; when several nodes share the highest score, a round-robin strategy is used to pick one. A minimal sketch of this weighted aggregation follows; the default scoring strategies are then described in turn:
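The sketch below shows only the weighted aggregation, with illustrative types (PriorityFunc, WeightedPriority); real priority functions receive much richer node and pod information:

```go
package main

import "fmt"

// A priority function scores a node between 0 and 10.
type PriorityFunc func(node string) float64

type WeightedPriority struct {
	Score  PriorityFunc
	Weight float64
}

// totalScore sums each strategy's 0-10 score multiplied by its weight.
func totalScore(node string, priorities []WeightedPriority) float64 {
	sum := 0.0
	for _, p := range priorities {
		sum += p.Weight * p.Score(node)
	}
	return sum
}

func main() {
	priorities := []WeightedPriority{
		{Weight: 1, Score: func(string) float64 { return 7 }}, // e.g. LeastRequestedPriority
		{Weight: 1, Score: func(string) float64 { return 4 }}, // e.g. BalancedResourceAllocation
	}
	fmt.Println(totalScore("node-1", priorities)) // 11
}
```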
3.2.1 LeastRequestedPriority
If the new pod were assigned to the node, the node's priority is determined by the ratio of the node's idle portion to its total capacity, i.e. (total capacity - capacity requested by the pods on the node - capacity requested by the new pod) / total capacity. CPU and memory are weighted equally, and the node with the largest ratio gets the highest score. Note that this priority function spreads pods across nodes according to resource consumption. (Note: when a pod is created without specifying memory and CPU requests, its CPU and memory are counted with defaults of 100 millicores and 200*1024*1024 bytes respectively.) The calculation formula is as follows:
score = (cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2
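The formula transcribed into a small Go function; the requested values are assumed to already include the pod being scheduled, with the 100-millicore / 200*1024*1024-byte defaults applied when the pod declares no request:

```go
package main

import "fmt"

// leastRequestedScore implements (capacity - sum(requested)) * 10 / capacity for one resource.
func leastRequestedScore(requested, capacity int64) float64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return float64(capacity-requested) * 10 / float64(capacity)
}

// leastRequestedPriority averages the CPU and memory scores, as in the formula above.
func leastRequestedPriority(requestedCPU, capacityCPU, requestedMem, capacityMem int64) float64 {
	return (leastRequestedScore(requestedCPU, capacityCPU) +
		leastRequestedScore(requestedMem, capacityMem)) / 2
}

func main() {
	// 4000m of 8000m CPU and 8GiB of 16GiB memory requested -> half idle -> score 5.
	fmt.Println(leastRequestedPriority(4000, 8000, 8<<30, 16<<30))
}
```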
3.2.2 BalancedResourceAllocation
Try to choose machines whose resource usage is more balanced after the pod is deployed. It calculates the CPU and memory fractions on the host, and the host's score is determined by the distance between the two fractions (as above, if a pod is created without specifying CPU and memory requests, CPU defaults to 100 millicores and memory to 200*1024*1024 bytes). The calculation formulas are as follows:
cpuFraction = (podRequestCPU + nodeRequestedCPU) / totalAllocatableCPU
memoryFraction = (podRequestMemory + nodeRequestedMemory) / totalAllocatableMemory
score = 10 - abs(cpuFraction - memoryFraction) * 10
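The same formulas as a sketch in Go; treating a fraction of 1 or more as score 0 is a simplifying assumption for over-committed nodes:

```go
package main

import (
	"fmt"
	"math"
)

func fraction(podRequest, nodeRequested, allocatable int64) float64 {
	return float64(podRequest+nodeRequested) / float64(allocatable)
}

// balancedResourceAllocation scores 10 when the CPU and memory fractions are
// equal and less as they diverge; over-committed nodes score 0.
func balancedResourceAllocation(podCPU, nodeCPU, allocCPU, podMem, nodeMem, allocMem int64) float64 {
	cpuFraction := fraction(podCPU, nodeCPU, allocCPU)
	memoryFraction := fraction(podMem, nodeMem, allocMem)
	if cpuFraction >= 1 || memoryFraction >= 1 {
		return 0
	}
	return 10 - math.Abs(cpuFraction-memoryFraction)*10
}

func main() {
	// CPU at 50% and memory at 75% after placing the pod -> score 7.5.
	fmt.Println(balancedResourceAllocation(1000, 3000, 8000, 4<<30, 8<<30, 16<<30))
}
```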
3.2.3 SelectorSpreadPriority
Pods belonging to the same service or replication controller should be spread across different hosts as much as possible. If zones are defined, the pods are also spread across hosts in different zones. When scheduling a pod, find the service or replication controller the pod belongs to, then look at the pods already running for that service or replication controller; the fewer such pods a host is running, the higher its score. The calculation is as follows (the algorithm below shows that the zone score accounts for 2/3 of the total; a sketch follows the list):
a) Obtain the list of services and RCs the pod belongs to;
b) Merge the selectors of the RCs and services obtained in a);
c) Iterate over all nodes and count, for each node, the number of running pods matching the selectors;
d) Traverse the per-node counts to find the maximum, maxCountByNodeName;
e) Count the number of pods matching the selectors in each zone;
f) Find the maximum per-zone count, maxCountByZone;
g) fScore = 10 * (float32(maxCountByNodeName - countsByNodeName[node.Name]) / float32(maxCountByNodeName))
h) If zones are defined:
zoneScore := 10 * (float32(maxCountByZone - countsByZone[zoneId]) / float32(maxCountByZone))
zoneWeighting = 2.0/3.0
fScore = (fScore * (1.0 - zoneWeighting)) + (zoneWeighting * zoneScore)
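A sketch of steps g) and h), assuming the per-node and per-zone matching-pod counts from steps c) through f) have already been computed:

```go
package main

import "fmt"

const zoneWeighting = 2.0 / 3.0

// selectorSpreadScore implements steps g) and h): fewer matching pods on a
// node (and in its zone) means a higher score.
func selectorSpreadScore(countOnNode, maxCountByNode, countInZone, maxCountByZone int, hasZones bool) float64 {
	fScore := 10.0
	if maxCountByNode > 0 {
		fScore = 10 * float64(maxCountByNode-countOnNode) / float64(maxCountByNode)
	}
	if hasZones && maxCountByZone > 0 {
		zoneScore := 10 * float64(maxCountByZone-countInZone) / float64(maxCountByZone)
		fScore = fScore*(1.0-zoneWeighting) + zoneWeighting*zoneScore
	}
	return fScore
}

func main() {
	// 1 matching pod on this node (max 4 per node), 2 in this zone (max 6 per zone).
	fmt.Println(selectorSpreadScore(1, 4, 2, 6, true))
}
```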
3.2.4 NodePreferAvoidPodsPriority
A node's annotations maintain a list of RCs (ReplicationControllers) and RSs (ReplicaSets) whose pods the node would prefer to avoid running. If the RC or RS of the pod being scheduled is in this list, the node receives a relatively low score (to prevent other scoring strategies from overriding this one, its weight is set to 1000). The specific scoring process is (a tiny sketch follows the list):
If the pod does not belong to any RC or RS, every node is scored 10;
Extract the preferAvoidPods list from the node's annotations;
Traverse the preferAvoidPods list to see whether the pod's RC or RS is included; if it is, the node scores 0, otherwise it scores 10;
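A tiny sketch of this scoring, with the pod's controller collapsed to a plain string such as "ReplicationController/web" (an illustrative encoding, not the real annotation format):

```go
package main

import "fmt"

// nodePreferAvoidPodsScore gives 10 unless the pod's controller (RC or RS)
// appears in the node's preferAvoidPods annotation, in which case it gives 0.
func nodePreferAvoidPodsScore(podController string, preferAvoidPods []string) int {
	if podController == "" {
		return 10 // pod not owned by any RC/RS: every node scores 10
	}
	for _, avoided := range preferAvoidPods {
		if avoided == podController {
			return 0
		}
	}
	return 10
}

func main() {
	avoid := []string{"ReplicationController/web"}
	fmt.Println(nodePreferAvoidPodsScore("ReplicationController/web", avoid)) // 0
	fmt.Println(nodePreferAvoidPodsScore("ReplicaSet/api", avoid))            // 10
}
```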
3.2.5 NodeAffinityPriority
This is the affinity mechanism in Kubernetes scheduling. Like node selectors (which restrict the pod to specified nodes at scheduling time), node affinity constrains which nodes a pod may run on, but it supports a richer set of operators (In, NotIn, Exists, DoesNotExist, Gt, Lt) rather than only exact matching of node labels. In addition, Kubernetes supports two types of selectors. One is the "hard" (requiredDuringSchedulingIgnoredDuringExecution) selector, which guarantees that the selected host satisfies all of the pod's rules; it is similar to the earlier NodeSelector, with a more expressive syntax added on top. The other is the "soft" (preferredDuringSchedulingIgnoredDuringExecution) selector, which serves as a hint to the scheduler: the scheduler will try, but does not guarantee, to satisfy all of its requirements. Each preferredDuringScheduling item has a weight, and every node that satisfies an item accumulates the corresponding weight. The specific algorithm is as follows (see the sketch after the list):
a) Obtain the preferredDuringScheduling items from the pod's annotations;
b) Traverse each item in the preferredDuringScheduling items;
c) Traverse each node; if the node's labels satisfy the item's selector, add the item's weight to the node's accumulated weight counts[node.Name], and update maxCount if counts[node.Name] exceeds maxCount;
d) After the above, compute each node's final score: fScore = 10 * (counts[node.Name] / maxCount)
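A sketch of steps a) through d), where each preferred term is reduced to a single required label plus a weight (illustrative types, not the real NodeAffinity structures):

```go
package main

import "fmt"

type PreferredTerm struct {
	Weight     int
	LabelKey   string
	LabelValue string
}

// nodeAffinityScores accumulates the weights of the preferred terms each node
// satisfies, then normalizes every node to a 0-10 score.
func nodeAffinityScores(terms []PreferredTerm, nodes map[string]map[string]string) map[string]float64 {
	counts := map[string]int{}
	maxCount := 0
	for _, term := range terms {
		for name, labels := range nodes {
			if labels[term.LabelKey] == term.LabelValue {
				counts[name] += term.Weight
				if counts[name] > maxCount {
					maxCount = counts[name]
				}
			}
		}
	}
	scores := map[string]float64{}
	for name := range nodes {
		if maxCount > 0 {
			scores[name] = 10 * float64(counts[name]) / float64(maxCount)
		} else {
			scores[name] = 0
		}
	}
	return scores
}

func main() {
	nodes := map[string]map[string]string{
		"node-1": {"disktype": "ssd"},
		"node-2": {"disktype": "hdd"},
	}
	terms := []PreferredTerm{{Weight: 5, LabelKey: "disktype", LabelValue: "ssd"}}
	fmt.Println(nodeAffinityScores(terms, nodes)) // map[node-1:10 node-2:0]
}
```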
3.2.6 TaintTolerationPriority
This differs from the PodToleratesNodeTaints predicate, which handles the TaintEffectNoSchedule effect (a taint that cannot be tolerated prevents scheduling); here the TaintEffectPreferNoSchedule effect is handled (a best-effort preference: the more of these taints the pod tolerates, the higher the node's score). The specific scoring process is as follows (a sketch follows the list):
Extract the pod's tolerations for the TaintEffectPreferNoSchedule effect from the pod's annotations;
Iterate through each node in the node list;
Extract the taints list from the node's annotations;
Count the taints of the node that the pod cannot tolerate and save the count to counts[node.Name]; if counts[node.Name] is greater than maxCount, update maxCount;
After the above, iterate through each node and compute its score: if maxCount > 0,
fScore = (1.0 - counts[node.Name]/maxCount) * 10; otherwise fScore = 10.
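A sketch of this scoring, with taints and tolerations reduced to plain strings (illustrative; the real code matches key, value and effect):

```go
package main

import "fmt"

// intolerableCount counts a node's PreferNoSchedule taints that the pod does not tolerate.
func intolerableCount(nodeTaints []string, podTolerations map[string]bool) int {
	n := 0
	for _, t := range nodeTaints {
		if !podTolerations[t] {
			n++
		}
	}
	return n
}

// taintTolerationScores: the more intolerable PreferNoSchedule taints a node
// has (relative to the worst node), the lower its score.
func taintTolerationScores(taintsByNode map[string][]string, tolerations map[string]bool) map[string]float64 {
	counts := map[string]int{}
	maxCount := 0
	for node, taints := range taintsByNode {
		counts[node] = intolerableCount(taints, tolerations)
		if counts[node] > maxCount {
			maxCount = counts[node]
		}
	}
	scores := map[string]float64{}
	for node := range taintsByNode {
		if maxCount > 0 {
			scores[node] = (1.0 - float64(counts[node])/float64(maxCount)) * 10
		} else {
			scores[node] = 10
		}
	}
	return scores
}

func main() {
	taints := map[string][]string{
		"node-1": {"dedicated=db"},
		"node-2": {},
	}
	fmt.Println(taintTolerationScores(taints, map[string]bool{})) // map[node-1:0 node-2:10]
}
```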
3.2.7 InterPodAffinityPriority
Analogous to the MatchInterPodAffinity predicate, this priority scores nodes by how well they satisfy the pod's soft (preferred) pod affinity and anti-affinity terms. To summarize the whole process: the predicates strategy filters out the list of nodes that can be scheduled, then each node in the filtered list is scored by the priority strategies. The nodes are ranked by their final scores, the highest-scoring node is selected, and the pod is bound to that node through the apiserver.
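Putting the whole section together, a minimal sketch of the predicates-then-priorities loop described here; the types and function signatures are illustrative, not the actual kube-scheduler API:

```go
package main

import "fmt"

type Node struct{ Name string }

type Predicate func(pod string, node Node) bool
type Priority func(pod string, node Node) float64 // returns 0-10, already weighted

// schedule filters the nodes with every predicate, scores the survivors with
// every priority, and returns the highest-scoring node.
func schedule(pod string, nodes []Node, predicates []Predicate, priorities []Priority) (Node, bool) {
	var filtered []Node
	for _, n := range nodes {
		fits := true
		for _, p := range predicates {
			if !p(pod, n) {
				fits = false
				break
			}
		}
		if fits {
			filtered = append(filtered, n)
		}
	}
	var best Node
	bestScore, found := -1.0, false
	for _, n := range filtered {
		score := 0.0
		for _, pr := range priorities {
			score += pr(pod, n)
		}
		if score > bestScore {
			best, bestScore, found = n, score, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{{"node-1"}, {"node-2"}}
	preds := []Predicate{func(_ string, n Node) bool { return n.Name != "node-1" }}
	prios := []Priority{func(_ string, _ Node) float64 { return 5 }}
	fmt.Println(schedule("my-pod", nodes, preds, prios)) // {node-2} true
	fmt.Println(schedule("my-pod", nil, preds, prios))   // {} false
}
```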