Handling a Kubernetes Node in the Resource-Exhausted State


As soon as I got in this morning, I received a "complaint" from a colleague: one node in the Kubernetes cluster on our private cloud seemed to have stopped working, because the application deployed specifically on that node had gone down and had not recovered for a long time. The company's private-cloud Kubernetes cluster runs v1.7.5 and was deployed before the holidays. Kubernetes development seems to have sped up noticeably of late, with releases coming one after another; at the time of writing, the latest release is v1.8.1. The cluster had been running fairly stably, so what caused today's anomaly? I opened a terminal and started investigating.

I. The Symptoms

This small cluster has three Kubernetes nodes in total. I first looked at the state of all pods in the cluster and found that the pods on node1 and node2 were normal (Running), but the three pods on node3 were all Pending: weave-net-rh6r4, kube-proxy-v4d1p and portal-3613605798-txq4l, of which portal-3613605798-txq4l is our application pod. Since even Kubernetes's own component kube-proxy was affected, the problem clearly lay with node3 itself. If you try to inspect these pods at this point, you will most likely fail: the pods are being recreated constantly, and each newly created pod is killed within 1-2 seconds, so there is no time to view its status.
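The check described above (list the pods, then pick out the ones on one node that are not Running) can be sketched in Python. This is only an illustration: it assumes the input is JSON shaped like the output of `kubectl get pods --all-namespaces -o json`, and the function name, the sample data and the pod name `kube-proxy-abcde` are hypothetical.

```python
import json
from typing import List, Tuple

def pods_not_running(pod_list_json: str, node_name: str) -> List[Tuple[str, str]]:
    """Return (pod name, phase) for pods bound to node_name that are not Running."""
    items = json.loads(pod_list_json).get("items", [])
    result = []
    for pod in items:
        # Pods that were never bound to a node have no spec.nodeName and are skipped.
        if pod.get("spec", {}).get("nodeName") != node_name:
            continue
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase != "Running":
            result.append((pod["metadata"]["name"], phase))
    return result

# Hypothetical sample in the shape of a Kubernetes PodList.
SAMPLE = json.dumps({"items": [
    {"metadata": {"name": "portal-3613605798-txq4l"},
     "spec": {"nodeName": "ubuntu-k8s-3"}, "status": {"phase": "Pending"}},
    {"metadata": {"name": "kube-proxy-abcde"},
     "spec": {"nodeName": "ubuntu-k8s-1"}, "status": {"phase": "Running"}},
]})

print(pods_not_running(SAMPLE, "ubuntu-k8s-3"))
```

Because the filter relies on `spec.nodeName`, a pod the scheduler has not yet placed on any node will not show up; such pods have to be found by their Pending phase alone.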

I then looked directly at the state of node3 and, sure enough, found some warning events:

# kubectl describe node ubuntu-k8s-3
... ...
Events:
  FirstSeen  LastSeen  Count  From                   SubObjectPath  Type     Reason                   Message
  ---------  --------  -----  ----                   -------------  ----     ------                   -------
  51m        51m       1      kubelet, ubuntu-k8s-3                 Normal   NodeNotSchedulable       Node ubuntu-k8s-3 status is now: NodeNotSchedulable
  9d         51m       49428  kubelet, ubuntu-k8s-3                 Warning  EvictionThresholdMet     Attempting to reclaim nodefs
  5m         5m        1      kubelet, ubuntu-k8s-3                 Normal   Starting                 Starting kubelet.
  5m         5m        2      kubelet, ubuntu-k8s-3                 Normal   NodeHasSufficientDisk    Node ubuntu-k8s-3 status is now: NodeHasSufficientDisk
  5m         5m        2      kubelet, ubuntu-k8s-3                 Normal   NodeHasSufficientMemory  Node ubuntu-k8s-3 status is now: NodeHasSufficientMemory
  5m         5m        2      kubelet, ubuntu-k8s-3                 Normal   NodeHasNoDiskPressure    Node ubuntu-k8s-3 status is now: NodeHasNoDiskPressure
  5m         5m        1      kubelet, ubuntu-k8s-3                 Normal   NodeAllocatableEnforced  Updated Node Allocatable limit across pods
  5m         5m        1      kubelet, ubuntu-k8s-3                 Normal   NodeHasDiskPressure      Node ubuntu-k8s-3 status is now: NodeHasDiskPressure
  5m         14s              kubelet, ubuntu-k8s-3                 Warning  EvictionThresholdMet     Attempting to reclaim nodefs

Two items are worth noting:
1. Node ubuntu-k8s-3 status is now: NodeHasDiskPressure
2. Warning: "EvictionThresholdMet Attempting to reclaim nodefs"

From this we can roughly conclude that node3 is short on disk space, and that the kubelet daemon on the node has determined that an eviction threshold has been met and is trying to reclaim disk space (by killing some pods, I guess).

Since the kubelet is involved, let's take a look at the log of this daemon:

# journalctl -u kubelet -f
Oct 16 09:50:55 ubuntu-k8s-3 kubelet[17144]: W1016 09:50:55.056703   17144 eviction_manager.go:331] eviction manager: attempting to reclaim nodefs
Oct 16 09:50:55 ubuntu-k8s-3 kubelet[17144]: I1016 09:50:55.057322   17144 eviction_manager.go:345] eviction manager: must evict pod(s) to reclaim nodefs
Oct 16 09:50:55 ubuntu-k8s-3 kubelet[17144]: E1016 09:50:55.058307   17144 eviction_manager.go:356] eviction manager: eviction thresholds have been met, but no pods are active to evict
... ...
Oct 16 09:54:14 ubuntu-k8s-3 kubelet[12844]: W1016 09:54:14.823152   12844 eviction_manager.go:142] Failed to admit pod weave-net-3svfg_kube-system(e5a5d474-b214-11e7-a98b-0650cc001a5b) - node has conditions: [DiskPressure]
Oct 16 09:54:14 ubuntu-k8s-3 kubelet[12844]: W1016 09:54:14.824246   12844 eviction_manager.go:142] Failed to admit pod kube-proxy-d9lk0_kube-system(e5ff8fde-b214-11e7-a98b-0650cc001a5b) - node has conditions: [DiskPressure]

The kubelet log confirms the assessment above: node3 is refusing pods because of disk pressure, yet when the kubelet tries to reclaim disk space, it finds that there are no active pods left to kill!
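To pick the eviction manager's messages out of a busy kubelet log, the glog-style header of each entry (level letter, MMDD date, time, process id, source file and line) can be parsed with a regular expression. The following is a minimal sketch; the regex and function names are my own, and the sample lines are modeled on the journal output of this incident.

```python
import re
from typing import Dict, List, Optional

# glog-style header: level letter, MMDD, time, pid, file:line] message
GLOG_RE = re.compile(
    r"(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<file>[\w.]+):(?P<line>\d+)\] (?P<message>.*)"
)

def parse_glog(line: str) -> Optional[Dict[str, str]]:
    """Return the header fields and message of a glog entry, or None if absent."""
    match = GLOG_RE.search(line)
    return match.groupdict() if match else None

def eviction_entries(lines: List[str]) -> List[Dict[str, str]]:
    """Keep only the entries logged from eviction_manager.go."""
    parsed = (parse_glog(line) for line in lines)
    return [e for e in parsed if e and e["file"] == "eviction_manager.go"]

SAMPLE = [
    "Oct 16 09:50:55 ubuntu-k8s-3 kubelet[17144]: W1016 09:50:55.056703   17144 "
    "eviction_manager.go:331] eviction manager: attempting to reclaim nodefs",
    "Oct 16 09:54:09 ubuntu-k8s-3 systemd[1]: Started kubelet: The Kubernetes Node Agent.",
]

for entry in eviction_entries(SAMPLE):
    print(entry["level"], entry["message"])
```

The level letter (I, W, E or F) makes it easy to filter further, for example to keep only warnings and errors.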

II. Cause Analysis

Since disk space appears to be the issue, let's look at disk usage:

# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M   46M  350M  12% /run
/dev/sda1       5.8G  5.1G  448M  92% /
tmpfs           2.0G  288K  2.0G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/sdb1        99G  5.2G   89G   6% /data
tmpfs           396M     0  396M   0% /run/user/0
... ...

We can see that the root partition is 92% full, with less than 500 MB left. The Ubuntu VM templates provided by our private cloud are rather rigid (they cannot be customized): each VM gets a root partition of only about 6 GB and no more. As a result, root partition usage is already high once the necessary software has been installed. Because of this, we had earlier mounted a dedicated disk (/dev/sdb1) to hold Docker's images and container runtime data, and migrated the existing Docker data to the new location (/data/docker).
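The figures that `df` reports can also be read programmatically. The sketch below uses `os.statvfs` (Unix only) to compute the share of space and of inodes still available on a filesystem; the function names are illustrative, and the result only approximates what the kubelet sees, since the kubelet collects its own statistics.

```python
import os

def fraction(part: int, whole: int) -> float:
    """Return part/whole, or 0.0 when the filesystem reports no total."""
    return part / whole if whole else 0.0

def filesystem_headroom(path: str = "/") -> dict:
    """Share of blocks and inodes still available on the filesystem holding path."""
    st = os.statvfs(path)
    return {
        # blocks available to unprivileged processes, as a share of all blocks
        "available": fraction(st.f_bavail, st.f_blocks),
        # inodes available to unprivileged processes, as a share of all inodes
        "inodes_free": fraction(st.f_favail, st.f_files),
    }

print(filesystem_headroom("/"))
```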

Appendix: migrating Docker runtime data (for Docker 1.12.x and later):
a) Create /etc/docker/daemon.json

The contents of the file are as follows:
{
    "graph": "/data/docker",
    "storage-driver": "aufs"
}

b) Stop Docker and migrate the data
systemctl stop docker
Then move the existing Docker data directory (/var/lib/docker by default) to /data/docker.

c) Restart Docker
systemctl daemon-reload
systemctl restart docker
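As a small illustration of step a), the daemon.json content can be generated rather than typed by hand, which guarantees valid JSON. This is a sketch with a hypothetical helper name; the `graph` key is the one this article uses, and since later Docker releases renamed it to `data-root`, check the documentation for your Docker version before relying on it.

```python
import json

def render_daemon_json(data_dir: str, storage_driver: str) -> str:
    """Render a daemon.json body with the two settings used in this article."""
    config = {
        "graph": data_dir,               # Docker data directory, as named in the article
        "storage-driver": storage_driver,
    }
    return json.dumps(config, indent=4)

print(render_daemon_json("/data/docker", "aufs"))
```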

For certain reasons, our portal pod has to run on this particular node (the node is selected via nodeSelector). Since the root partition cannot be enlarged, the only way to get the pod running again for now is to "squeeze" the node a little further. So our idea is to bring the node back to a healthy state by adjusting its eviction threshold.

III. The Solution

To solve this problem, we need to read the official Kubernetes documentation on the eviction policy. The gist is: the kubelet on each node periodically collects resource consumption data and compares it with preset thresholds; if a threshold is crossed, the kubelet tries to kill some pods to reclaim the resource in question and protect the node. The kubelet watches the following eviction signals:

- memory.available
- nodefs.available
- nodefs.inodesFree
- imagefs.available
- imagefs.inodesFree

Each threshold comes in two flavors, eviction-soft and eviction-hard. The difference is that a soft threshold gives pods a grace period to exit gracefully once the threshold is reached, whereas a hard threshold is "violent": pods are killed immediately, with no chance of a graceful exit. The difference between nodefs and imagefs is also worth mentioning:

    • nodefs: the node's own storage, which holds daemon logs and the like; generally this is the root partition /;
    • imagefs: the disk the Docker daemon uses to store images and container writable layers.
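Thresholds are written as `signal<value` pairs separated by commas, which is the form the kubelet's `--eviction-hard` flag takes. A minimal sketch of splitting such a value into its parts follows; the function name is hypothetical and the parsing is deliberately simpler than what the kubelet itself does.

```python
from typing import Dict

def parse_eviction_thresholds(flag_value: str) -> Dict[str, str]:
    """Split 'signal<value,signal<value' into a {signal: value} mapping."""
    thresholds = {}
    for item in flag_value.split(","):
        item = item.strip()
        if not item:
            continue
        signal, sep, value = item.partition("<")
        if not sep or not signal or not value:
            raise ValueError(f"invalid threshold: {item!r}")
        thresholds[signal] = value
    return thresholds

print(parse_eviction_thresholds(
    "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%"
))
```

A value is either an absolute quantity such as 100Mi or a percentage of capacity such as 10%.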

In our case, imagefs is /dev/sdb1, whose usage is low, while nodefs is the / partition, whose usage is high (92%).

Let's restart the kubelet and look at the current values of these thresholds (viewed with journalctl -u kubelet -f):

Oct 16 09:54:09 ubuntu-k8s-3 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Oct 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.381711   12844 feature_gate.go:144] feature gates: map[]
Oct 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.437470   12844 client.go:72] Connecting to docker on unix:///var/run/docker.sock
Oct 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.438075   12844 client.go:92] Start docker client with request timeout=2m0s
Oct 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.471485   12844 manager.go:143] cAdvisor running in container: "/system.slice/kubelet.service"
... ...
Oct 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.615818   12844 container_manager_linux.go:246] container manager verified user specified cgroup-root exists: /
Oct 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.616263   12844 container_manager_linux.go:251] Creating Container Manager object based on Node Config: {RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:docker CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>}]} ExperimentalQOSReserved:map[]}
Oct 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.617680   12844 kubelet.go:263] Adding manifest file: /etc/kubernetes/manifests
Oct 16 09:54:09 ubuntu-k8s-3 kubelet[12844]: I1016 09:54:09.618196   12844 kubelet.go:273] Watching apiserver
... ...

Reformatted, the threshold-related part of the output looks like this:

    HardEvictionThresholds: [
        {
            Signal: memory.available
            Operator: LessThan
            Value: {
                Quantity: 100Mi
                Percentage: 0
            }
            GracePeriod: 0s
            MinReclaim: <nil>
        }
        {
            Signal: nodefs.available
            Operator: LessThan
            Value: {
                Quantity: <nil>
                Percentage: 0.1
            }
            GracePeriod: 0s
            MinReclaim: <nil>
        }
        {
            Signal: nodefs.inodesFree
            Operator: LessThan
            Value: {
                Quantity: <nil>
                Percentage: 0.05
            }
            GracePeriod: 0s
            MinReclaim: <nil>
        }
    ]
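Picking the thresholds out of the long "Node Config" log entry by hand is tedious, so here is a sketch that extracts them with a regular expression. The pattern is an assumption based solely on the log format shown in this article and may need adjusting for other kubelet versions.

```python
import re
from typing import Dict, List

# Matches one threshold entry as printed in the kubelet's Node Config log line.
THRESHOLD_RE = re.compile(
    r"\{Signal:(?P<signal>[\w.]+) Operator:(?P<operator>\w+) "
    r"Value:\{Quantity:(?P<quantity>\S+) Percentage:(?P<percentage>[\d.]+)\}"
)

def extract_thresholds(log_line: str) -> List[Dict[str, str]]:
    """Return signal, operator, quantity and percentage for each threshold found."""
    return [m.groupdict() for m in THRESHOLD_RE.finditer(log_line)]

SAMPLE = (
    "HardEvictionThresholds:[{Signal:memory.available Operator:LessThan "
    "Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:<nil>} "
    "{Signal:nodefs.available Operator:LessThan "
    "Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>}]"
)

for t in extract_thresholds(SAMPLE):
    print(t["signal"], t["operator"], t["quantity"], t["percentage"])
```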

We can see that the kubelet initially has no soft eviction thresholds configured, only hard eviction thresholds for memory and nodefs. The one that deserves our attention is nodefs.available with Percentage:0.1: when the free space on nodefs drops below 10%, the kubelet on that node starts evicting. Our root partition has only about 8% free, which clearly meets this condition, and that is why the problem occurs.
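The arithmetic behind this conclusion can be checked with a few lines of Python. The sketch below plugs in the root partition figures from the df output (5.8G capacity, 448M available) and compares the available share with a 10% and a 5% threshold; the function name is illustrative, and the comparison is a simplification of what the kubelet does.

```python
def threshold_met(capacity: float, available: float, min_fraction: float) -> bool:
    """True when the available share of capacity is below the threshold."""
    return available / capacity < min_fraction

CAPACITY_MB = 5.8 * 1024   # root partition size reported by df (5.8G)
AVAILABLE_MB = 448         # space still available on the root partition

print(f"available: {AVAILABLE_MB / CAPACITY_MB:.1%}")
print("10% threshold met:", threshold_met(CAPACITY_MB, AVAILABLE_MB, 0.10))
print("5% threshold met:", threshold_met(CAPACITY_MB, AVAILABLE_MB, 0.05))
```

With roughly 7.5% of the partition available, the 10% threshold is met while a 5% threshold would not be, which is exactly the margin the fix relies on.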

All we need to do is lower this threshold temporarily, for example to <5%.

IV. Solution Steps

We need to set a new nodefs.available threshold for the kubelet. How do we do that?

The kubelet is a daemon that runs on every Kubernetes node and is started by systemd at system boot:

root@ubuntu-k8s-3:~# ps -ef|grep kubelet
root      5718  5695  0 16:38 pts/3    00:00:00 grep --color=auto kubelet
root     13640     1  4 10:25 ?        00:17:25 /usr/bin/kubelet --kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true --pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin --cluster-dns= --cluster-domain=cluster.local --authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt --cadvisor-port=0

Check the status of the kubelet service:

root@ubuntu-k8s-3:~# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Mon 2017-10-16 10:25:09 CST; 6h ago
     Docs:
 Main PID: 13640 (kubelet)
    Tasks: 18
   Memory: 62.0M
      CPU: 18min 15.235s
   CGroup: /system.slice/kubelet.service
           ├─13640 /usr/bin/kubelet --kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true --pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --
           └─13705 journalctl -k -f
... ...

From the status output we can see that two systemd configuration files are involved in starting the kubelet service:

- /lib/systemd/system/kubelet.service
Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf

/lib/systemd/system/kubelet.service is relatively simple:

[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=

[Service]
ExecStart=/usr/bin/kubelet
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]

/etc/systemd/system/kubelet.service.d/10-kubeadm.conf is a systemd drop-in file that overrides part of the configuration in kubelet.service. The kubelet's startup configuration lives here:

[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns= --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_EXTRA_ARGS

When systemd starts the kubelet, the ExecStart in 10-kubeadm.conf overrides the ExecStart in /lib/systemd/system/kubelet.service, which is why we saw the long list of command-line arguments after kubelet above. All we need to do is append the nodefs.available threshold we want to the end of these startup arguments.
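The override works because, in systemd, assigning an empty value to ExecStart= clears the commands collected so far, and the next ExecStart= line then sets a new one. The following Python sketch simulates just that rule to show why the drop-in's command wins; it is a simplified model written for illustration, not how systemd is actually implemented.

```python
from typing import List

def effective_exec_start(unit_lines: List[str], dropin_lines: List[str]) -> List[str]:
    """Apply ExecStart= lines in order; an empty assignment resets the list."""
    commands: List[str] = []
    for line in unit_lines + dropin_lines:   # drop-ins are applied after the main unit
        line = line.strip()
        if not line.startswith("ExecStart="):
            continue
        value = line[len("ExecStart="):]
        if value == "":
            commands = []                    # "ExecStart=" clears earlier commands
        else:
            commands.append(value)
    return commands

UNIT = ["[Service]", "ExecStart=/usr/bin/kubelet", "Restart=always"]
DROPIN = ["[Service]", "ExecStart=", "ExecStart=/usr/bin/kubelet $KUBELET_EXTRA_ARGS"]

print(effective_exec_start(UNIT, DROPIN))
```

Without the empty `ExecStart=` line, the model ends up with two commands, which is why the drop-in resets the list before setting the new command.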

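To stay consistent with the existing configuration style, we define a new environment variable in 10-kubeadm.conf, for example KUBELET_EVICTION_POLICY_ARGS, which carries the new hard eviction threshold (--eviction-hard=nodefs.available<5%), and append $KUBELET_EVICTION_POLICY_ARGS to the end of the kubelet ExecStart line.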
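The exact lines added to the drop-in are not reproduced in this text, so the following is a sketch under assumptions: it renders an Environment= line for the variable named above from a mapping of thresholds, using the kubelet's `--eviction-hard` flag. The helper name is hypothetical. It writes the percent sign as `%%` because `%` introduces a specifier in systemd unit files; whether a single `%` is tolerated depends on the systemd version.

```python
from typing import Dict

def render_eviction_env(var_name: str, thresholds: Dict[str, str]) -> str:
    """Render a systemd Environment= line carrying an --eviction-hard flag."""
    pairs = ",".join(f"{signal}<{value}" for signal, value in thresholds.items())
    flag = f"--eviction-hard={pairs}"
    # In unit files "%" starts a specifier, so a literal percent sign is written "%%".
    escaped = flag.replace("%", "%%")
    return f'Environment="{var_name}={escaped}"'

print(render_eviction_env("KUBELET_EVICTION_POLICY_ARGS", {"nodefs.available": "5%"}))
```

After editing the drop-in, systemd has to re-read it (systemctl daemon-reload) before the kubelet is restarted (systemctl restart kubelet). Note that in the log shown after the restart, only nodefs.available is listed under HardEvictionThresholds, which suggests that setting the hard thresholds explicitly replaces the defaults rather than adding to them; if the memory and inode thresholds should stay in force, they would need to be listed as well.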


After restarting the kubelet, let's check the log to see whether the new threshold has taken effect:

Oct 16 16:56:10 ubuntu-k8s-3 kubelet[7394]: I1016 16:56:10.840914    7394 container_manager_linux.go:251] Creating Container Manager object based on Node Config: {RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:docker CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>}]} ExperimentalQOSReserved:map[]}

We see the following line indicating that the new configuration is in effect:

Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.05}

Checking the pod status again, the three pods that were previously Pending are now Running. The problem is solved.

V. References

1) "Handling Out of Resource Errors"
2) "Configure Out of Resource Handling"
3) "Systemd Introductory Tutorial: Practical Part"
4) "System bootup process"
5) "Systemd for Upstart Users - Ubuntu Wiki"

Weibo: @tonybai_cn
WeChat official account: iamtonybai

Bigwhite. All rights reserved.
