As soon as I arrived this morning, I received a "complaint" from a colleague: a node in the Kubernetes cluster on our private cloud seemed to be out of service, because the application deployed specifically on that node had hung and had not recovered for quite a while. This private cloud Kubernetes cluster runs v1.7.5 and was deployed before the two-day holiday. Kubernetes development has clearly accelerated lately, with new versions released continuously; as of this writing, the latest release is v1.8.1. The cluster had been running fairly stably, so what caused today's anomaly? I opened a terminal and started investigating.
I. The problem
There are three Kubernetes nodes in this small cluster. I first checked the state of all pods in the cluster and found that the pods on Node1 and Node2 were normal (Running), while the three pods on Node3 were all Pending: weave-net-rh6r4, kube-proxy-v4d1p and portal-3613605798-txq4l, the last being our application pod. Even kube-proxy, one of Kubernetes' own components, was abnormal, so something was clearly wrong with Node3. If you try to describe one of these pods at this point, you will probably fail: the pods are being restarted so frequently that a newly created pod lives only 1-2 seconds before it is deleted again, leaving kubectl no chance to show its status.
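For reference, a quick way to see which pods land on which node is kubectl's wide output (Node3's hostname is ubuntu-k8s-3; these exact commands are my own addition, not a quote from that morning's terminal session):

kubectl get pods --all-namespaces -o wide                        # shows a NODE column for every pod
kubectl get pods --all-namespaces -o wide | grep ubuntu-k8s-3    # only the pods on the problem node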
I then looked directly at the state of Node3, and sure enough, there were some warning events:
# kubectl describe node ubuntu-k8s-3
... ...
Events:
  FirstSeen  LastSeen  Count  From                   SubObjectPath  Type     Reason                   Message
  ---------  --------  -----  ----                   -------------  ----     ------                   -------
  51m        51m       1      kubelet, ubuntu-k8s-3                 Normal   NodeNotSchedulable       Node ubuntu-k8s-3 status is now: NodeNotSchedulable
  9d         51m       49428  kubelet, ubuntu-k8s-3                 Warning  EvictionThresholdMet     Attempting to reclaim nodefs
  5m         5m        1      kubelet, ubuntu-k8s-3                 Normal   Starting                 Starting kubelet.
  5m         5m        2      kubelet, ubuntu-k8s-3                 Normal   NodeHasSufficientDisk    Node ubuntu-k8s-3 status is now: NodeHasSufficientDisk
  5m         5m        2      kubelet, ubuntu-k8s-3                 Normal   NodeHasSufficientMemory  Node ubuntu-k8s-3 status is now: NodeHasSufficientMemory
  5m         5m        2      kubelet, ubuntu-k8s-3                 Normal   NodeHasNoDiskPressure    Node ubuntu-k8s-3 status is now: NodeHasNoDiskPressure
  5m         5m        1      kubelet, ubuntu-k8s-3                 Normal   NodeAllocatableEnforced  Updated Node Allocatable limit across pods
  5m         5m        1      kubelet, ubuntu-k8s-3                 Normal   NodeHasDiskPressure      Node ubuntu-k8s-3 status is now: NodeHasDiskPressure
  5m         14s              kubelet, ubuntu-k8s-3                 Warning  EvictionThresholdMet     Attempting to reclaim nodefs
Two items are worth noting:

1. Node ubuntu-k8s-3 status is now: NodeHasDiskPressure
2. Warning: "EvictionThresholdMet: Attempting to reclaim nodefs"
From the above we can roughly conclude that Node3 is short on disk space, and that the kubelet daemon on the node has decided its eviction threshold has been met and is trying to reclaim disk space (by killing some pods, I guess).
Now that the kubelet is involved, let's take a look at this daemon's logs:
# journalctl -u kubelet -f
10月 16 09:50:55 ubuntu-k8s-3 kubelet[17144]: W1016 09:50:55.056703   17144 eviction_manager.go:331] eviction manager: attempting to reclaim nodefs
10月 16 09:50:55 ubuntu-k8s-3 kubelet[17144]: I1016 09:50:55.057322   17144 eviction_manager.go:345] eviction manager: must evict pod(s) to reclaim nodefs
10月 16 09:50:55 ubuntu-k8s-3 kubelet[17144]: E1016 09:50:55.058307   17144 eviction_manager.go:356] eviction manager: eviction thresholds have been met, but no pods are active to evict
... ...
10月 16 09:54:14 ubuntu-k8s-3 kubelet[12844]: W1016 09:54:14.823152   12844 eviction_manager.go:142] Failed to admit pod weave-net-3svfg_kube-system(e5a5d474-b214-11e7-a98b-0650cc001a5b) - node has conditions: [DiskPressure]
10月 16 09:54:14 ubuntu-k8s-3 kubelet[12844]: W1016 09:54:14.824246   12844 eviction_manager.go:142] Failed to admit pod kube-proxy-d9lk0_kube-system(e5ff8fde-b214-11e7-a98b-0650cc001a5b) - node has conditions: [DiskPressure]
The kubelet log confirms the above: because its disk space is insufficient, Node3 no longer takes part in pod scheduling, but when the kubelet tries to reclaim disk space, it finds no active pods it can evict!
II. Cause analysis
Now that we've mentioned a disk shortage, let's look at the disk usage:
# df -h
Filesystem      Size  Used  Avail  Use%  Mounted on
udev            2.0G     0   2.0G    0%  /dev
tmpfs           396M   46M   350M   12%  /run
/dev/sda1       5.8G  5.1G   448M   92%  /
tmpfs           2.0G  288K   2.0G    1%  /dev/shm
tmpfs           5.0M     0   5.0M    0%  /run/lock
tmpfs           2.0G     0   2.0G    0%  /sys/fs/cgroup
/dev/sdb1        99G  5.2G    89G    6%  /data
tmpfs           396M     0   396M    0%  /run/user/0
... ...
We can see that the root partition is 92% used, with less than 500MB left. The Ubuntu VM template provided by our private cloud is rather rigid (it cannot be customized): each VM gets a root partition of only 6G and not a byte more. That leaves the root partition quite full once the necessary software is installed. For this reason we had deliberately mounted a dedicated disk (/dev/sdb1) to store Docker's images and container runtime data, and migrated the original Docker data to the new location (/data/docker).
Appendix: how to migrate Docker's runtime data (for Docker 1.12.x and later):
a) Create /etc/docker/daemon.json with the following contents:

{
    "graph": "/data/docker",
    "storage-driver": "aufs"
}
b) Stop Docker and migrate the data:

systemctl stop docker
mv /var/lib/docker /data
c) Restart Docker:

systemctl daemon-reload
systemctl restart docker
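To double-check that Docker is indeed using the new location after the restart (a sanity check I would add, not part of the original migration steps), docker info reports the data root:

docker info | grep -i 'root dir'    # expect: Docker Root Dir: /data/docker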
For certain reasons, our portal pod must run on that particular node (it is pinned there via a nodeSelector). Since the root partition cannot be enlarged, the only option is to "squeeze" the node a little harder so the pod can temporarily run again. Our idea, therefore, is to adjust the node's eviction threshold so that the node is considered healthy again.
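For context, pinning a pod to a specific node with a nodeSelector looks roughly like the sketch below; the label key/value dedicated=portal is hypothetical, not our actual manifest:

# label the target node
kubectl label nodes ubuntu-k8s-3 dedicated=portal
# then reference the label in the pod (or deployment pod template) spec:
#   spec:
#     nodeSelector:
#       dedicated: portal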
III. The solution
To solve this problem, we need to read the official Kubernetes documentation on the eviction policy. The gist is: the kubelet on each node periodically collects resource consumption data and compares it against preset threshold values; if a threshold is exceeded, the kubelet tries to kill some pods to reclaim the corresponding resource and protect the node. The resource signals for which the kubelet watches thresholds are roughly the following:
- memory.available
- nodefs.available
- nodefs.inodesFree
- imagefs.available
- imagefs.inodesFree
Each threshold comes in two flavors: eviction-soft and eviction-hard. The difference is that a soft threshold gives the pod a grace period to exit gracefully when it is reached, while a hard threshold is "violent": the pod is killed immediately, with no chance of a graceful exit (a small example of how these thresholds are passed to the kubelet follows the list below). It is also worth clarifying the difference between nodefs and imagefs:
- nodefs: the node's own storage, used for daemon run logs and the like; generally this means the root partition /;
- imagefs: the disk the Docker daemon uses to store images and container writable layers.
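To make the soft/hard distinction concrete, eviction thresholds are handed to the kubelet as command-line flags; the values below are purely illustrative, not what our cluster uses:

--eviction-hard=memory.available<100Mi,nodefs.available<10%
--eviction-soft=nodefs.available<15%
--eviction-soft-grace-period=nodefs.available=1m

Note that a soft threshold must be paired with a matching grace period, otherwise the kubelet will refuse the configuration.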
In our case, imagefs is /dev/sdb1, whose usage is low, while nodefs, the / partition, is heavily used (92%).
Let's restart the kubelet and look at the current values of these thresholds (viewed via journalctl -u kubelet -f):
10月 16 09:54:09 ubuntu-k8s-3 systemd[1]: Started kubelet: The Kubernetes Node Agent.
10月 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.381711   12844 feature_gate.go:144] feature gates: map[]
10月 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.437470   12844 client.go:72] Connecting to docker on unix:///var/run/docker.sock
10月 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.438075   12844 client.go:92] Start docker client with request timeout=2m0s
10月 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.471485   12844 manager.go:143] cAdvisor running in container: "/system.slice/kubelet.service"
... ...
10月 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.615818   12844 container_manager_linux.go:246] container manager verified user specified cgroup-root exists: /
10月 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.616263   12844 container_manager_linux.go:251] Creating Container Manager object based on Node Config: {RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:docker CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>}]} ExperimentalQOSReserved:map[]}
10月 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.617680   12844 kubelet.go:263] Adding manifest file: /etc/kubernetes/manifests
10月 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.618196   12844 kubelet.go:273] Watching apiserver
... ...
Reformatted for readability, the threshold-related information looks like this:
HardEvictionThresholds: [
  {Signal: memory.available   Operator: LessThan  Value: {Quantity: 100Mi  Percentage: 0}     GracePeriod: 0s  MinReclaim: <nil>}
  {Signal: nodefs.available   Operator: LessThan  Value: {Quantity: <nil>  Percentage: 0.1}   GracePeriod: 0s  MinReclaim: <nil>}
  {Signal: nodefs.inodesFree  Operator: LessThan  Value: {Quantity: <nil>  Percentage: 0.05}  GracePeriod: 0s  MinReclaim: <nil>}
]
We can see that, by default, the kubelet sets no soft eviction thresholds, only hard eviction thresholds for memory and nodefs. The one that deserves our attention here is nodefs.available Percentage:0.1: when the free space of nodefs falls below 10%, the kubelet on that node starts evicting. The remaining free space on our root partition is 8%, which obviously meets this condition, hence the problem.
All we have to do is temporarily change this value; for example, we can set it to <5%.
IV. Steps of the solution
We need to reset the kubelet's nodefs.available threshold. How do we do that?
The kubelet is a daemon that runs on every Kubernetes node and is started by systemd at system boot:
root@ubuntu-k8s-3:~# ps -ef|grep kubelet
root      5718  5695  0 16:38 pts/3    00:00:00 grep --color=auto kubelet
root     13640     1  4 10:25 ?        00:17:25 /usr/bin/kubelet --kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true --pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin --cluster-dns=10.96.0.10 --cluster-domain=cluster.local --authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt --cadvisor-port=0
Check the status of the Kubelet service:
root@ubuntu-k8s-3:~# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since 一 2017-10-16 10:25:09 CST; 6h ago
     Docs: http://kubernetes.io/docs/
 Main PID: 13640 (kubelet)
    Tasks: 18
   Memory: 62.0M
      CPU: 18min 15.235s
   CGroup: /system.slice/kubelet.service
           ├─13640 /usr/bin/kubelet --kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true --pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --
           └─13705 journalctl -k -f
... ...
From the status output we can see that two systemd unit files are involved in starting the kubelet service:
- /lib/systemd/system/kubelet.service
- its drop-in: /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
/lib/systemd/system/kubelet.service is relatively simple:
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=http://kubernetes.io/docs/

[Service]
ExecStart=/usr/bin/kubelet
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target
/etc/systemd/system/kubelet.service.d/10-kubeadm.conf is a systemd drop-in file that overrides part of kubelet.service; the kubelet's actual startup configuration lives here:
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_EXTRA_ARGS
When systemd starts the kubelet, the ExecStart in 10-kubeadm.conf overrides the ExecStart in /lib/systemd/system/kubelet.service, which is why we saw that long list of command-line startup parameters on the kubelet process above. All we have to do is append the nodefs.available threshold we want to this line of startup parameters.
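Incidentally, a convenient way to view the effective unit file together with all of its drop-ins in one place (a general systemd convenience, not part of the original session) is:

systemctl cat kubelet.service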
For consistency with the existing configuration style, we define a new environment variable, for example KUBELET_EVICTION_POLICY_ARGS:
Environment="KUBELET_EVICTION_POLICY_ARGS=--eviction-hard=nodefs.available<5%"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_EXTRA_ARGS $KUBELET_EVICTION_POLICY_ARGS
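After editing the drop-in, the change has to be loaded and the service restarted; the standard systemd commands for that are:

systemctl daemon-reload
systemctl restart kubelet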
After restarting the kubelet, let's look at the log to see whether the new threshold value has taken effect:
10月 16 16:56:10 ubuntu-k8s-3 kubelet[7394]: I1016 16:56:10.840914 7394 container_manager_linux.go:251] Creating Container Manager object based on Node Config: {RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:docker CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>}]} ExperimentalQOSReserved:map[]}
We see the following line indicating that the new configuration is in effect:
Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.05}
Checking the pod statuses again, the three pods that had been stuck in Pending were now Running. Problem solved.
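As a final check (the exact commands here are my own habit rather than a quote from the terminal session), the node condition and pod placement can be confirmed with:

kubectl describe node ubuntu-k8s-3 | grep -i pressure            # DiskPressure should now report False
kubectl get pods --all-namespaces -o wide | grep ubuntu-k8s-3    # all three pods should be Running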
V. References
1) "Handling out of Resource Errors"
2) "Configure out of Resource handling"
3) "Systemd Introductory Tutorial: The actual combat chapter"
4) "System bootup process"
5) "Systemd for upstart Users-ubuntu wiki"
Weibo: @tonybai_cn
WeChat public account: iamtonybai
GitHub: https://github.com/bigwhite
© Bigwhite. All rights reserved.
Related Posts:
- Exploring a Kubernetes cluster installed with kubeadm
- Step-by-step: creating a highly available Kubernetes cluster based on kubeadm - Part I
- Step-by-step: creating a highly available Kubernetes cluster based on kubeadm - Part II
- Installing Kubernetes with kubeadm - Part 2
- Troubleshooting: the Kubernetes 1.6.4 dashboard cannot be accessed