[TOC]
Introduction to Kubernetes Scheduling
In addition to letting the Kubernetes scheduler automatically select a node for a pod (the default behavior is to pick a node with sufficient resources and to spread the load as evenly as possible), there are situations where we want more control over where a pod is scheduled. For example, some machines in the cluster have better hardware (SSDs, more memory, and so on) and we want core services such as databases to run on them; or two services communicate with each other very frequently and we would like them to end up on the same machine, or at least in the same data center.
Kubernetes divides this kind of scheduling into two categories: node affinity and pod affinity.
Node Selector
This is the most common way to use labels: assign a label to a node, then specify that label with nodeSelector when creating the pod so that it is scheduled onto a matching node.
To label a node:
kubectl label nodes <node-name> <label-key>=<label-value>
Example:
kubectl label nodes k8s-node1 envir=live
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    envir: live
```
It should be noted that nodeSelector is simple and intuitive but not very flexible, and it will eventually be superseded by node affinity.
Node Affinity
Affinity has a counterpart, anti-affinity, which can be translated as "mutual exclusion". The image is apt: the way a pod chooses a node can be compared to magnets attracting and repelling each other, except that instead of a simple positive and negative pole, the attraction and repulsion between pods and nodes can be configured flexibly.
Advantages of Affinity:
- Matching supports richer logical combinations, not just exact string equality
- Scheduling rules are divided into soft (preferred) and hard (required) policies; under a soft policy, if no node meets the conditions, the pod ignores the rule and scheduling still completes
The main node affinity types are currently:
requiredDuringSchedulingIgnoredDuringExecution
Indicates that the pod must be deployed to a node that meets the conditions; if no such node exists, the scheduler keeps retrying. IgnoredDuringExecution means that once the pod is running, it keeps running even if the node's labels change and no longer satisfy the conditions the pod specified at deployment time.
requiredDuringSchedulingRequiredDuringExecution
Indicates that the pod must be deployed to a node that meets the conditions; if no such node exists, the scheduler keeps retrying. RequiredDuringExecution means that after the pod is deployed, if the node's labels change and no longer satisfy the conditions the pod specified, a node that does meet the requirements is selected again.
preferredDuringSchedulingIgnoredDuringExecution
Indicates a preference for deploying to a node that meets the conditions; if no such node exists, these conditions are ignored and the pod is deployed according to the normal logic.
preferredDuringSchedulingRequiredDuringExecution
Indicates a preference for deploying to a node that meets the conditions; if no such node exists, these conditions are ignored and the pod is deployed according to the normal logic. RequiredDuringExecution means that if node labels change later, the pod is rescheduled onto a node that satisfies the conditions.
The distinction between soft and hard policies is useful. A hard policy applies when the pod must run on a certain kind of node or it will not work at all, for example when the cluster contains nodes with different architectures and the service depends on features provided by a specific architecture. A soft policy applies when it is merely better to meet the conditions, for example when a service should preferably run in a certain zone to reduce network traffic. Which one to use is determined by the user's specific needs; there is no absolute technical rule.
The following is an official example:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: gcr.io/google_containers/pause:2.0
```
This pod defines both a requiredDuringSchedulingIgnoredDuringExecution and a preferredDuringSchedulingIgnoredDuringExecution node affinity. The first requires the pod to run on a node in one of the specified AZs (e2e-az1 or e2e-az2); the second prefers nodes that carry the another-node-label-key: another-node-label-value label.
The matching logic compares a label's value against a list of values; the available operators are:
- In: the label's value is in the given list
- NotIn: the label's value is not in the given list
- Exists: a label with the given key exists
- DoesNotExist: no label with the given key exists
- Gt: the label's value is greater than the given value (numeric comparison)
- Lt: the label's value is less than the given value (numeric comparison)
If nodeAffinity specifies multiple entries under nodeSelectorTerms, the node only needs to satisfy one of them; if a single entry contains multiple matchExpressions, the node must satisfy all of them for the pod to run there.
It should also be noted that there is no node anti-affinity, because NotIn and DoesNotExist already provide similar functionality, as the sketch below shows.
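The following is a minimal sketch (with hypothetical labels such as disktype and envir) illustrating both points: the two nodeSelectorTerms entries are ORed, the matchExpressions inside a single entry are ANDed, and NotIn is used to keep the pod out of a particular zone:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-logic-demo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Term 1: disktype=ssd AND a label with key "envir" exists
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
          - key: envir
            operator: Exists
        # Term 2 (ORed with term 1): the node is NOT in zone e2e-az3
        - matchExpressions:
          - key: failure-domain.beta.kubernetes.io/zone
            operator: NotIn
            values:
            - e2e-az3
  containers:
  - name: demo
    image: nginx
```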
Pod Affinity
With node affinity, we know how to let a pod flexibly choose a node during scheduling, but sometimes we want the scheduler to consider relationships between pods, not just the pod-to-node relationship. Pod affinity was introduced in Kubernetes 1.4.
Why is there such a need? For example, we may want service A and service B to be deployed on the same host, in the same data center, or in the same city whenever possible, because they communicate heavily over the network. Conversely, we may want data service C and data service D to be kept apart, because if they are placed together and that host or data center fails, the application becomes completely unavailable, whereas if they are separated the application remains available, even if degraded.
Pod affinity can be understood this way: when deciding whether to schedule a pod onto (or away from) a node N, the scheduler checks whether pods satisfying condition X are already running in N's topology domain. Condition X is a set of label selectors; because pods run in namespaces, the selector must also indicate which namespaces it applies to (it can also apply to all namespaces).
The topology domain refers to concepts such as nodes, racks, zones, and regions in the cluster, and is declared through the key of a built-in node label. This key is called the topologyKey and identifies the topology scope the node belongs to:
- kubernetes.io/hostname
- failure-domain.beta.kubernetes.io/zone
- failure-domain.beta.kubernetes.io/region
Like node affinity, pod affinity has both requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution, with the same meaning as before. To use affinity, add a podAffinity field under affinity; to use mutual exclusion (anti-affinity), add a podAntiAffinity field under affinity.
Define a reference target pod first:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-flag
  labels:
    security: "S1"
    app: "nginx"
spec:
  containers:
  - name: nginx
    image: nginx
```
Pod affinity scheduling
The following is an example of affinity scheduling:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: gcr.io/google_containers/pause:2.0
```
Once created, you can see that this pod is scheduled onto the same node as the reference pod above. If you remove the kubernetes.io/hostname label from that node, you will find that the pod stays in the Pending state, because no node satisfying the condition can be found.
Pod Anti-Affinity (Mutex) Scheduling
The following is an example of anti-affinity (mutex) scheduling:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: "failure-domain.beta.kubernetes.io/zone"
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: gcr.io/google_containers/pause:2.0
```
This example requires the new pod to be in the same zone as pods labeled security=S1, and prefers that it not be on the same node as pods labeled security=S2.
In principle, topologyKey can be any valid label key, but for performance and security reasons it has the following restrictions:
- An empty topologyKey is not allowed in requiredDuringScheduling pod affinity and pod anti-affinity definitions
- If the LimitPodHardAntiAffinityTopology admission controller is enabled, the topologyKey of requiredDuringScheduling pod anti-affinity is limited to kubernetes.io/hostname; to use a custom topologyKey you need to modify or disable this admission controller
- In preferredDuringScheduling pod anti-affinity definitions, an empty topologyKey is interpreted as the combination of kubernetes.io/hostname, failure-domain.beta.kubernetes.io/zone, and failure-domain.beta.kubernetes.io/region
Considerations when setting podAffinity rules:
- Alongside labelSelector and topologyKey you can also define a namespaces list specifying which namespaces the label selector matches pods in. By default it matches the namespace of the pod being defined; if the field is present but its value is empty, it matches all namespaces (see the sketch after this list).
- Only after all matchExpressions of the requiredDuringSchedulingIgnoredDuringExecution terms are satisfied can the system schedule the pod onto a node.
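The following is a minimal sketch (the namespace name production and the app=cache label are hypothetical) showing where the namespaces list sits in a pod affinity term:
```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - cache
      # Only pods in these namespaces are considered by the label selector
      namespaces:
      - production
      topologyKey: kubernetes.io/hostname
```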
Taints and Tolerations
Node affinity, described earlier, is a property of the pod that attracts it to a certain class of nodes. A taint, in contrast, lets a node repel certain pods.
Taints are used together with tolerations to keep pods away from unsuitable nodes. After one or more taints are set on a node, pods cannot run on that node unless they explicitly declare that they tolerate those taints. A toleration is a pod attribute that allows (but, note, does not require) the pod to run on a node carrying the taint.
The following is a simple example:
Add a taint to node1 with key key, value value, and effect NoSchedule. This means that unless a pod explicitly declares that it tolerates this taint, it will not be scheduled onto node1:
kubectl taint nodes node1 key=value:NoSchedule
Then declare a toleration on the pod. The following toleration matches the taint, so the pod can be scheduled onto node1:
```yaml
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
```
It can also be written as follows:
```yaml
tolerations:
- key: "key"
  operator: "Exists"
  effect: "NoSchedule"
```
The key and effect in the pod's toleration declaration need to match the taint's settings, and one of the following conditions must hold:
- operator is Exists, in which case no value needs to be specified
- operator is Equal and the values are equal
If operator is not specified, it defaults to Equal.
There are also two special cases, illustrated by the sketch after this list:
- An empty key combined with the Exists operator matches all keys and values
- An empty effect matches all effects
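A minimal sketch of these two special cases, assuming a pod spec fragment like the earlier examples:
```yaml
tolerations:
# Empty key with the Exists operator: tolerates every taint on the node
- operator: "Exists"
# Empty effect: matches all effects of taints whose key is "key"
- key: "key"
  operator: "Exists"
```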
Effect description
In the example above, the value of effect is NoSchedule. The possible values of effect are briefly described below:
- NoSchedule: if a pod has not declared that it tolerates this taint, the system will not schedule the pod onto the node carrying it.
- PreferNoSchedule: a soft version of NoSchedule. If a pod does not declare that it tolerates this taint, the system tries to avoid scheduling it onto this node, but this is not mandatory.
- NoExecute: defines the eviction behavior of pods, used for example in response to node problems. A taint with the NoExecute effect affects pods already running on the node as follows:
  - pods without a matching toleration are evicted immediately
  - pods with a matching toleration that does not specify tolerationSeconds remain bound to the node
  - pods with a matching toleration that specifies tolerationSeconds are evicted after the specified time
Starting with Kubernetes 1.6, an alpha feature was introduced that represents node problems as taints (currently only for node unreachable and node not ready, corresponding to the NodeCondition Ready values Unknown and False). When the TaintBasedEvictions feature is enabled (by adding TaintBasedEvictions=true to the --feature-gates parameter), the NodeController automatically sets these taints on nodes, and the normal eviction logic previously based on the Ready NodeCondition is disabled. Note that in the event of node problems, to preserve the existing rate limits on pod eviction, the system adds the taints in a rate-limited way, which prevents large numbers of pods from being evicted in certain scenarios (such as the master becoming temporarily unreachable). This feature works together with tolerationSeconds, allowing a pod to define how long it stays on a failing node before being evicted.
The system allows multiple taints to be set on the same node and multiple tolerations on the same pod. The Kubernetes scheduler processes them like a filter: taints that match a toleration are ignored, and the remaining taints determine the effect on the pod. In particular:
- If a remaining taint has effect NoSchedule, the scheduler does not schedule the pod onto this node.
- If no remaining taint has effect NoSchedule but at least one has effect PreferNoSchedule, the scheduler tries not to schedule the pod onto this node.
- If a remaining taint has effect NoExecute and the pod is already running on the node, it is evicted; if it is not yet running on the node, it will not be scheduled there.
Here is an example:
```
kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key2=value2:NoSchedule
```
Set two tolerations on the pod:
```yaml
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
```
The result is that the pod cannot be scheduled onto node1, because the third taint has no matching toleration. However, if the pod is already running on node1 when the third taint is added, it keeps running, because the pod tolerates the first two taints and the third taint's effect is only NoSchedule.
In general, when a taint with effect NoExecute is added to a node, all running pods without a matching toleration are evicted immediately, while pods with a matching toleration are never evicted. However, the system allows an optional tolerationSeconds field to be added to a toleration with the NoExecute effect, which specifies how long (in seconds) the pod may keep running on the node after the taint is added:
```yaml
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600
```
The example above means that if the pod is already running and a matching taint is added to the node, the pod continues running on the node for 3600 seconds and is then evicted. If the taint is removed within that time, the eviction is not triggered.
Common Application Scenarios
Dedicated Nodes
If you want to set aside a subset of nodes for specific applications to use exclusively, you can add a taint like this to those nodes:
kubectl taint nodes nodename dedicated=groupName:NoSchedule
Then add the corresponding toleration to the pods of these applications; pods with the toleration are allowed to use the tainted nodes just like other nodes. In addition, label these nodes and use nodeSelector or node affinity scheduling to require that the pods run only on nodes with that label, as sketched below.
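A minimal sketch combining the two halves: the toleration matches the dedicated=groupName taint from the command above, and the nodeSelector assumes the same nodes also carry a dedicated=groupName label (the pod name and image are placeholders):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dedicated-app
spec:
  # Toleration: allows the pod onto nodes tainted dedicated=groupName:NoSchedule
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "groupName"
    effect: "NoSchedule"
  # nodeSelector: forces the pod onto nodes labeled dedicated=groupName
  nodeSelector:
    dedicated: groupName
  containers:
  - name: app
    image: nginx
```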
Nodes with special hardware devices
In a cluster, a small number of nodes may have special hardware, such as GPUs. We naturally want to keep pods that do not need this hardware off these nodes, so that pods that do need it can be scheduled onto them smoothly. You can set a taint on such a node with one of the following commands:
```
kubectl taint nodes nodename special=true:NoSchedule
kubectl taint nodes nodename special=true:PreferNoSchedule
```
The corresponding toleration is then added to the pods so that the specific pods can use the specific hardware. As before, we can additionally use labels or some other mechanism to identify these pods and schedule them onto the servers with that hardware; a sketch of the toleration follows.
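A minimal sketch of the matching toleration for the special=true taint above (the pod name and image are placeholders):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-app
spec:
  # Tolerates the special=true taint so this pod may land on the special-hardware nodes
  tolerations:
  - key: "special"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx
```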
Coping with node failures
As mentioned earlier, when a node fails, taints can be set on it automatically through the TaintBasedEvictions feature, after which pods are evicted. However, in some scenarios, such as a network problem between the master and a node, an application with a lot of local state may still want to keep running on the node, expecting the network to recover quickly, so as to avoid being evicted. The pod's toleration can then be defined like this:
```yaml
tolerations:
- key: "node.alpha.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 6000
```
For the node not ready state, set the key to node.alpha.kubernetes.io/notReady.
If the pod does not specify a toleration for node.alpha.kubernetes.io/notReady, Kubernetes automatically adds a toleration of that type with tolerationSeconds=300 to the pod.
Similarly, if the pod does not specify a toleration for node.alpha.kubernetes.io/unreachable, Kubernetes automatically adds a toleration of that type with tolerationSeconds=300.
These automatically added tolerations mean that when a node problem is detected, the pod can keep running for 5 minutes by default before being evicted. These two default tolerations are added automatically by the DefaultTolerationSeconds admission controller.
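For illustration, the two default tolerations added by this admission controller look roughly like the following in the pod spec (a sketch, not output copied from a cluster):
```yaml
tolerations:
- key: "node.alpha.kubernetes.io/notReady"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.alpha.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
```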
Kubernetes Scheduling Policy