kubernetes: Container runtime is down,PLEG is not healthy

What happened:
I have an AKS cluster with two nodes. Each node runs about 6-7 pods, with two containers per pod: one is my own Docker image and the other is the sidecar injected by Istio for its service mesh. After about 10 hours the nodes become NotReady, and the node describe output shows two errors:

1. container runtime is down, PLEG is not healthy: pleg was last seen active 1h32m35.942907195s ago; threshold is 3m0s
2. rpc error: code = DeadlineExceeded desc = context deadline exceeded, Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

When I restart the node it works fine, but it goes back to NotReady after a while. I started seeing this issue after adding Istio, but could not find any documentation relating the two. My next step is to try upgrading Kubernetes.
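For reference, this is roughly how the stuck state can be inspected while a node is NotReady (a sketch only; the node name is taken from the describe output below, and SSH access to the AKS agent VM plus the standard docker/kubelet systemd units are assumed):

```sh
# From a machine with cluster access: confirm which node is NotReady and why
kubectl get nodes
kubectl describe node aks-agentpool-22124581-0

# On the affected node itself: check whether dockerd and the kubelet are responsive
sudo systemctl status docker kubelet
sudo journalctl -u docker --since "3 hours ago" --no-pager | tail -n 100
sudo docker info   # hangs or returns an error if the daemon is wedged
```

If docker info hangs, restarting just the Docker service on the node may bring it back to Ready temporarily, the same way a full node restart does.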

What you expected to happen: Containers to keep running smoothly and the nodes to stay Ready.

How to reproduce it (as minimally and precisely as possible): It is not clear what triggers the issue, so I cannot reproduce it precisely. The pods run fine for some time until the issue is encountered again; after that the node stays in NotReady state and does not recover on its own.

Anything else we need to know?: Installed istio.

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:17:28Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration: AKS

  • OS (e.g. from /etc/os-release): Ubuntu 16.04.5 LTS

  • Kernel (e.g. uname -a): Linux cc-70bc1fbb-7c659cfcbf-fqrbp 4.15.0-1035-azure #36~16.04.1-Ubuntu SMP Fri Nov 30 15:25:49 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools: AKS (managed)

  • Others: Istio installed on top of kubernetes

Deployment File:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  creationTimestamp: null
  name: emailgistics-pod
spec:
  minReadySeconds: 10
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        sidecar.istio.io/status: '{"version":"ebf16d3ea0236e4b5cb4d3fc0f01da62e2e6265d005e58f8f6bd43a4fb672fdd","initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-certs"],"imagePullSecrets":null}'
      creationTimestamp: null
      labels:
        app: emailgistics-pod
    spec:
      containers:
      - image: xxxxxxxxxxxxxxxxxxxxx/emailgistics_pod:xxxxxx
        imagePullPolicy: Always
        name: emailgistics-pod
        ports:
        - containerPort: 80
        resources: {}
      - args:
        - proxy
        - sidecar
        - --configPath
        - /etc/istio/proxy
        - --binaryPath
        - /usr/local/bin/envoy
        - --serviceCluster
        - emailgistics-pod
        - --drainDuration
        - 45s
        - --parentShutdownDuration
        - 1m0s
        - --discoveryAddress
        - istio-pilot.istio-system:15005
        - --discoveryRefreshDelay
        - 1s
        - --zipkinAddress
        - zipkin.istio-system:9411
        - --connectTimeout
        - 10s
        - --proxyAdminPort
        - "15000"
        - --controlPlaneAuthPolicy
        - MUTUAL_TLS
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: INSTANCE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: ISTIO_META_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: ISTIO_META_INTERCEPTION_MODE
          value: REDIRECT
        - name: ISTIO_METAJSON_LABELS
          value: |
            {"app":"emailgistics-pod"}
        image: docker.io/istio/proxyv2:1.0.4
        imagePullPolicy: IfNotPresent
        name: istio-proxy
        ports:
        - containerPort: 15090
          name: http-envoy-prom
          protocol: TCP
        resources:
          requests:
            cpu: 10m
        securityContext:
          readOnlyRootFilesystem: true
          runAsUser: 1337
        volumeMounts:
        - mountPath: /etc/istio/proxy
          name: istio-envoy
        - mountPath: /etc/certs/
          name: istio-certs
          readOnly: true
      imagePullSecrets:
      - name: ga.secretname
      initContainers:
      - args:
        - -p
        - "15001"
        - -u
        - "1337"
        - -m
        - REDIRECT
        - -i
        - '*'
        - -x
        - ""
        - -b
        - "80"
        - -d
        - ""
        image: docker.io/istio/proxy_init:1.0.4
        imagePullPolicy: IfNotPresent
        name: istio-init
        resources: {}
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
          privileged: true
      volumes:
      - emptyDir:
          medium: Memory
        name: istio-envoy
      - name: istio-certs
        secret:
          optional: true
          secretName: istio.default
status: {}
```

Error log (kubectl describe node):

```
Name:               aks-agentpool-22124581-0
Roles:              agent
Labels:             agentpool=agentpool
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_B2s
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=eastus
                    failure-domain.beta.kubernetes.io/zone=1
                    kubernetes.azure.com/cluster=MC_XXXXXXXXX
                    kubernetes.io/hostname=aks-XXXXXXXXX
                    kubernetes.io/role=agent
                    node-role.kubernetes.io/agent=
                    storageprofile=managed
                    storagetier=Premium_LRS
Annotations:        aks.microsoft.com/remediated=3
                    node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Thu, 25 Oct 2018 14:46:53 +0000
Taints:             <none>
Unschedulable:      false
Conditions:
  Type                Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable  False   Thu, 25 Oct 2018 14:49:06 +0000   Thu, 25 Oct 2018 14:49:06 +0000   RouteCreated                 RouteController created a route
  OutOfDisk           False   Wed, 19 Dec 2018 19:28:55 +0000   Wed, 19 Dec 2018 19:27:24 +0000   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure      False   Wed, 19 Dec 2018 19:28:55 +0000   Wed, 19 Dec 2018 19:27:24 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure        False   Wed, 19 Dec 2018 19:28:55 +0000   Wed, 19 Dec 2018 19:27:24 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure         False   Wed, 19 Dec 2018 19:28:55 +0000   Thu, 25 Oct 2018 14:46:53 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready               False   Wed, 19 Dec 2018 19:28:55 +0000   Wed, 19 Dec 2018 19:27:24 +0000   KubeletNotReady              container runtime is down,PLEG is not healthy: pleg was lastseen active 1h32m35.942907195s ago; threshold is 3m0s
Addresses:
  Hostname:  aks-XXXXXXXXX
Capacity:
  cpu:                2
  ephemeral-storage:  30428648Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             4040536Ki
  pods:               110
Allocatable:
  cpu:                1940m
  ephemeral-storage:  28043041951
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3099480Ki
  pods:               110
System Info:
  Machine ID:                 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  System UUID:                XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  Boot ID:                    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  Kernel Version:             4.15.0-1035-azure
  OS Image:                   Ubuntu 16.04.5 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://Unknown
  Kubelet Version:            v1.11.3
  Kube-Proxy Version:         v1.11.3
PodCIDR:     10.244.0.0/24
ProviderID:  azure:///subscriptions/9XXXXXXXXXXX/resourceGroups/MC_XXXXXXXXXXXXXXXXXXXXXXXXXXXX/providers/Microsoft.Compute/virtualMachines/aks-XXXXXXXXXXXX
Non-terminated Pods:  (42 in total)
  Namespace         Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------         ----                                                               ------------  ----------  ---------------  -------------
  default           emailgistics-graph-monitor-6477568564-q98p2                        10m (0%)      0 (0%)      0 (0%)           0 (0%)
  default           emailgistics-message-handler-7df4566b6f-mh255                      10m (0%)      0 (0%)      0 (0%)           0 (0%)
  default           emailgistics-reports-aggregator-5fd96b94cb-b5vbn                   10m (0%)      0 (0%)      0 (0%)           0 (0%)
  default           emailgistics-rules-844b77f46-5lrkw                                 10m (0%)      0 (0%)      0 (0%)           0 (0%)
  default           emailgistics-scheduler-754884b566-mwgvp                            10m (0%)      0 (0%)      0 (0%)           0 (0%)
  default           emailgistics-subscription-token-manager-7974558985-f2t49           10m (0%)      0 (0%)      0 (0%)           0 (0%)
  default           mollified-kiwi-cert-manager-665c5d9c8c-2ld59                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  istio-system      grafana-59b787b9b-dzdtc                                            10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-citadel-5d8956cc6-x55vk                                      10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-egressgateway-f48fc7fbb-szpwp                                10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-galley-6975b6bd45-g7lsc                                      10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-ingressgateway-c6c4bcdbf-bbgcw                               10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-pilot-d9b5b9b7c-ln75n                                        510m (26%)    0 (0%)      2Gi (67%)        0 (0%)
  istio-system      istio-policy-6b465cd4bf-92l57                                      20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-policy-6b465cd4bf-b2z85                                      20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-policy-6b465cd4bf-j59r4                                      20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-policy-6b465cd4bf-s9pdm                                      20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-sidecar-injector-575597f5cf-npkcz                            10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-telemetry-6944cd768-9794j                                    20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-telemetry-6944cd768-g7gh5                                    20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-telemetry-6944cd768-gd88n                                    20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-telemetry-6944cd768-px8qb                                    20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-telemetry-6944cd768-xzslh                                    20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-tracing-7596597bd7-hjtq2                                     10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      prometheus-76db5fddd5-d6dxs                                        10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      servicegraph-758f96bf5b-c9sqk                                      10m (0%)      0 (0%)      0 (0%)           0 (0%)
  kube-system       addon-http-application-routing-default-http-backend-5ccb95zgfm8    10m (0%)      10m (0%)    20Mi (0%)        20Mi (0%)
  kube-system       addon-http-application-routing-external-dns-59d8698886-h8xds       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system       addon-http-application-routing-nginx-ingress-controller-ff49qc7    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system       heapster-5d6f9b846c-m4kfp                                           130m (6%)     130m (6%)   230Mi (7%)       230Mi (7%)
  kube-system       kube-dns-v20-7c7d7d4c66-qqkfm                                       120m (6%)     0 (0%)      140Mi (4%)       220Mi (7%)
  kube-system       kube-dns-v20-7c7d7d4c66-wrxjm                                       120m (6%)     0 (0%)      140Mi (4%)       220Mi (7%)
  kube-system       kube-proxy-2tb68                                                    100m (5%)     0 (0%)      0 (0%)           0 (0%)
  kube-system       kube-svc-redirect-d6gqm                                             10m (0%)      0 (0%)      34Mi (1%)        0 (0%)
  kube-system       kubernetes-dashboard-68f468887f-l9x46                               100m (5%)     100m (5%)   50Mi (1%)        300Mi (9%)
  kube-system       metrics-server-5cbc77f79f-x55cs                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system       omsagent-mhrqm                                                      50m (2%)      150m (7%)   150Mi (4%)       300Mi (9%)
  kube-system       omsagent-rs-d688cdf68-pjpmj                                         50m (2%)      150m (7%)   100Mi (3%)       500Mi (16%)
  kube-system       tiller-deploy-7f4974b9c8-flkjm                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system       tunnelfront-7f766dd857-kgqps                                        10m (0%)      0 (0%)      64Mi (2%)        0 (0%)
  kube-systems-dev  nginx-ingress-dev-controller-7f78f6c8f9-csct4                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-systems-dev  nginx-ingress-dev-default-backend-95fbc75b7-lq9tw                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       1540m (79%)   540m (27%)
  memory    2976Mi (98%)  1790Mi (59%)
Events:
  Type     Reason             Age                 From                               Message
  ----     ------             ----                ----                               -------
  Warning  ContainerGCFailed  48m (x43 over 19h)  kubelet, aks-agentpool-22124581-0  rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  ImageGCFailed      29m (x57 over 18h)  kubelet, aks-agentpool-22124581-0  failed to get image stats: rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
  Warning  ContainerGCFailed  2m (x237 over 18h)  kubelet, aks-agentpool-22124581-0  rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
```

/kind bug

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 24 (9 by maintainers)

Most upvoted comments

This sounds like a memory leak to me. This would explain why the hang occurs 10 hours in. And many of your pods don’t have the same resource requests and resource limits, with istio-pilot-d9b5b9b7c-ln75n seemingly having the biggest discrepancy; it could be leaking memory without getting killed.
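One quick way to see that discrepancy is to print requests and limits side by side. A rough sketch (memory only, one value per container, and pods with nothing set print nothing):

```sh
# Compare memory requests vs. limits for every pod; istio-pilot shows a 2Gi request with no limit
kubectl get pods --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'
```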

I would suggest monitoring memory usage over time and storing it externally to check if that’s what’s happening. You could probably mitigate this by making resource requests and resource limits identical.
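A minimal sketch of both suggestions, assuming metrics-server is reachable (it appears in the pod list above); the request/limit values below are placeholders, not recommendations:

```sh
# Snapshot per-container usage periodically (e.g. from cron) and store the output off-cluster
kubectl top pod --all-namespaces --containers

# Pin requests equal to limits on the app container so a leaking pod gets OOM-killed
# instead of slowly starving the node
kubectl set resources deployment emailgistics-pod -c emailgistics-pod \
  --requests=cpu=100m,memory=256Mi --limits=cpu=100m,memory=256Mi
```

For the Istio control-plane components the same change would typically be made through the Helm values used to install Istio rather than by editing the deployments directly.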

Since this is running on AKS, looping in @kubernetes/sig-azure-misc for additional advice.