kubernetes: Container runtime is down,PLEG is not healthy

What happened:
I have an AKS cluster with two nodes. Each node runs about 6-7 pods, with two containers per pod: one is my own Docker image and the other is the sidecar injected by Istio for its service mesh. After about 10 hours the nodes become NotReady, and the node describe output shows two errors:

1. container runtime is down, PLEG is not healthy: pleg was last seen active 1h32m35.942907195s ago; threshold is 3m0s
2. rpc error: code = DeadlineExceeded desc = context deadline exceeded, Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

When I restart the node it works fine, but it goes back to NotReady after a while. I started seeing this issue after adding Istio, but could not find any documentation relating the two. My next step is to try upgrading Kubernetes.
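For reference, this is roughly how the stuck state can be inspected while a node is NotReady (a sketch only; the node name is taken from the describe output below, and SSH access to the AKS agent VM plus the standard docker/kubelet systemd units are assumed):

```sh
# From a machine with cluster access: confirm which node is NotReady and why
kubectl get nodes
kubectl describe node aks-agentpool-22124581-0

# On the affected node itself: check whether dockerd and the kubelet are responsive
sudo systemctl status docker kubelet
sudo journalctl -u docker --since "3 hours ago" --no-pager | tail -n 100
sudo docker info   # hangs or returns an error if the daemon is wedged
```

If docker info hangs, restarting just the Docker service on the node may bring it back to Ready temporarily, the same way a full node restart does.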

What you expected to happen: Containers to keep running smoothly and the nodes to stay Ready.

How to reproduce it (as minimally and precisely as possible): It is not clear what triggers the issue, so I cannot reproduce it precisely. The pods run fine for some time until the issue is encountered again; after that the node stays in NotReady state and does not recover on its own.

Anything else we need to know?: Installed istio.

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:17:28Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration: AKS

  • OS (e.g. from /etc/os-release): Ubuntu 16.04.5 LTS

  • Kernel (e.g. uname -a): Linux cc-70bc1fbb-7c659cfcbf-fqrbp 4.15.0-1035-azure #36~16.04.1-Ubuntu SMP Fri Nov 30 15:25:49 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools: AKS (managed)

  • Others: Istio installed on top of kubernetes

Deployment File:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  creationTimestamp: null
  name: emailgistics-pod
spec:
  minReadySeconds: 10
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        sidecar.istio.io/status: '{"version":"ebf16d3ea0236e4b5cb4d3fc0f01da62e2e6265d005e58f8f6bd43a4fb672fdd","initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-certs"],"imagePullSecrets":null}'
      creationTimestamp: null
      labels:
        app: emailgistics-pod
    spec:
      containers:
      - image: xxxxxxxxxxxxxxxxxxxxx/emailgistics_pod:xxxxxx
        imagePullPolicy: Always
        name: emailgistics-pod
        ports:
        - containerPort: 80
        resources: {}
      - args:
        - proxy
        - sidecar
        - --configPath
        - /etc/istio/proxy
        - --binaryPath
        - /usr/local/bin/envoy
        - --serviceCluster
        - emailgistics-pod
        - --drainDuration
        - 45s
        - --parentShutdownDuration
        - 1m0s
        - --discoveryAddress
        - istio-pilot.istio-system:15005
        - --discoveryRefreshDelay
        - 1s
        - --zipkinAddress
        - zipkin.istio-system:9411
        - --connectTimeout
        - 10s
        - --proxyAdminPort
        - "15000"
        - --controlPlaneAuthPolicy
        - MUTUAL_TLS
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: INSTANCE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: ISTIO_META_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: ISTIO_META_INTERCEPTION_MODE
          value: REDIRECT
        - name: ISTIO_METAJSON_LABELS
          value: |
            {"app":"emailgistics-pod"}
        image: docker.io/istio/proxyv2:1.0.4
        imagePullPolicy: IfNotPresent
        name: istio-proxy
        ports:
        - containerPort: 15090
          name: http-envoy-prom
          protocol: TCP
        resources:
          requests:
            cpu: 10m
        securityContext:
          readOnlyRootFilesystem: true
          runAsUser: 1337
        volumeMounts:
        - mountPath: /etc/istio/proxy
          name: istio-envoy
        - mountPath: /etc/certs/
          name: istio-certs
          readOnly: true
      imagePullSecrets:
      - name: ga.secretname
      initContainers:
      - args:
        - -p
        - "15001"
        - -u
        - "1337"
        - -m
        - REDIRECT
        - -i
        - '*'
        - -x
        - ""
        - -b
        - "80"
        - -d
        - ""
        image: docker.io/istio/proxy_init:1.0.4
        imagePullPolicy: IfNotPresent
        name: istio-init
        resources: {}
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
          privileged: true
      volumes:
      - emptyDir:
          medium: Memory
        name: istio-envoy
      - name: istio-certs
        secret:
          optional: true
          secretName: istio.default
status: {}
```

Error log (kubectl describe node):

```
Name:               aks-agentpool-22124581-0
Roles:              agent
Labels:             agentpool=agentpool
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_B2s
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=eastus
                    failure-domain.beta.kubernetes.io/zone=1
                    kubernetes.azure.com/cluster=MC_XXXXXXXXX
                    kubernetes.io/hostname=aks-XXXXXXXXX
                    kubernetes.io/role=agent
                    node-role.kubernetes.io/agent=
                    storageprofile=managed
                    storagetier=Premium_LRS
Annotations:        aks.microsoft.com/remediated=3
                    node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Thu, 25 Oct 2018 14:46:53 +0000
Taints:             <none>
Unschedulable:      false
Conditions:
  Type                Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable  False   Thu, 25 Oct 2018 14:49:06 +0000   Thu, 25 Oct 2018 14:49:06 +0000   RouteCreated                 RouteController created a route
  OutOfDisk           False   Wed, 19 Dec 2018 19:28:55 +0000   Wed, 19 Dec 2018 19:27:24 +0000   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure      False   Wed, 19 Dec 2018 19:28:55 +0000   Wed, 19 Dec 2018 19:27:24 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure        False   Wed, 19 Dec 2018 19:28:55 +0000   Wed, 19 Dec 2018 19:27:24 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure         False   Wed, 19 Dec 2018 19:28:55 +0000   Thu, 25 Oct 2018 14:46:53 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready               False   Wed, 19 Dec 2018 19:28:55 +0000   Wed, 19 Dec 2018 19:27:24 +0000   KubeletNotReady              container runtime is down,PLEG is not healthy: pleg was lastseen active 1h32m35.942907195s ago; threshold is 3m0s
Addresses:
  Hostname:  aks-XXXXXXXXX
Capacity:
  cpu:                2
  ephemeral-storage:  30428648Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             4040536Ki
  pods:               110
Allocatable:
  cpu:                1940m
  ephemeral-storage:  28043041951
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3099480Ki
  pods:               110
System Info:
  Machine ID:                 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  System UUID:                XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  Boot ID:                    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  Kernel Version:             4.15.0-1035-azure
  OS Image:                   Ubuntu 16.04.5 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://Unknown
  Kubelet Version:            v1.11.3
  Kube-Proxy Version:         v1.11.3
PodCIDR:     10.244.0.0/24
ProviderID:  azure:///subscriptions/9XXXXXXXXXXX/resourceGroups/MC_XXXXXXXXXXXXXXXXXXXXXXXXXXXX/providers/Microsoft.Compute/virtualMachines/aks-XXXXXXXXXXXX
Non-terminated Pods:  (42 in total)
  Namespace         Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------         ----                                                               ------------  ----------  ---------------  -------------
  default           emailgistics-graph-monitor-6477568564-q98p2                        10m (0%)      0 (0%)      0 (0%)           0 (0%)
  default           emailgistics-message-handler-7df4566b6f-mh255                      10m (0%)      0 (0%)      0 (0%)           0 (0%)
  default           emailgistics-reports-aggregator-5fd96b94cb-b5vbn                   10m (0%)      0 (0%)      0 (0%)           0 (0%)
  default           emailgistics-rules-844b77f46-5lrkw                                 10m (0%)      0 (0%)      0 (0%)           0 (0%)
  default           emailgistics-scheduler-754884b566-mwgvp                            10m (0%)      0 (0%)      0 (0%)           0 (0%)
  default           emailgistics-subscription-token-manager-7974558985-f2t49           10m (0%)      0 (0%)      0 (0%)           0 (0%)
  default           mollified-kiwi-cert-manager-665c5d9c8c-2ld59                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  istio-system      grafana-59b787b9b-dzdtc                                            10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-citadel-5d8956cc6-x55vk                                      10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-egressgateway-f48fc7fbb-szpwp                                10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-galley-6975b6bd45-g7lsc                                      10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-ingressgateway-c6c4bcdbf-bbgcw                               10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-pilot-d9b5b9b7c-ln75n                                        510m (26%)    0 (0%)      2Gi (67%)        0 (0%)
  istio-system      istio-policy-6b465cd4bf-92l57                                      20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-policy-6b465cd4bf-b2z85                                      20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-policy-6b465cd4bf-j59r4                                      20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-policy-6b465cd4bf-s9pdm                                      20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-sidecar-injector-575597f5cf-npkcz                            10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-telemetry-6944cd768-9794j                                    20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-telemetry-6944cd768-g7gh5                                    20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-telemetry-6944cd768-gd88n                                    20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-telemetry-6944cd768-px8qb                                    20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-telemetry-6944cd768-xzslh                                    20m (1%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      istio-tracing-7596597bd7-hjtq2                                     10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      prometheus-76db5fddd5-d6dxs                                        10m (0%)      0 (0%)      0 (0%)           0 (0%)
  istio-system      servicegraph-758f96bf5b-c9sqk                                      10m (0%)      0 (0%)      0 (0%)           0 (0%)
  kube-system       addon-http-application-routing-default-http-backend-5ccb95zgfm8    10m (0%)      10m (0%)    20Mi (0%)        20Mi (0%)
  kube-system       addon-http-application-routing-external-dns-59d8698886-h8xds       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system       addon-http-application-routing-nginx-ingress-controller-ff49qc7    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system       heapster-5d6f9b846c-m4kfp                                           130m (6%)     130m (6%)   230Mi (7%)       230Mi (7%)
  kube-system       kube-dns-v20-7c7d7d4c66-qqkfm                                       120m (6%)     0 (0%)      140Mi (4%)       220Mi (7%)
  kube-system       kube-dns-v20-7c7d7d4c66-wrxjm                                       120m (6%)     0 (0%)      140Mi (4%)       220Mi (7%)
  kube-system       kube-proxy-2tb68                                                    100m (5%)     0 (0%)      0 (0%)           0 (0%)
  kube-system       kube-svc-redirect-d6gqm                                             10m (0%)      0 (0%)      34Mi (1%)        0 (0%)
  kube-system       kubernetes-dashboard-68f468887f-l9x46                               100m (5%)     100m (5%)   50Mi (1%)        300Mi (9%)
  kube-system       metrics-server-5cbc77f79f-x55cs                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system       omsagent-mhrqm                                                      50m (2%)      150m (7%)   150Mi (4%)       300Mi (9%)
  kube-system       omsagent-rs-d688cdf68-pjpmj                                         50m (2%)      150m (7%)   100Mi (3%)       500Mi (16%)
  kube-system       tiller-deploy-7f4974b9c8-flkjm                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system       tunnelfront-7f766dd857-kgqps                                        10m (0%)      0 (0%)      64Mi (2%)        0 (0%)
  kube-systems-dev  nginx-ingress-dev-controller-7f78f6c8f9-csct4                       0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-systems-dev  nginx-ingress-dev-default-backend-95fbc75b7-lq9tw                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       1540m (79%)   540m (27%)
  memory    2976Mi (98%)  1790Mi (59%)
Events:
  Type     Reason             Age                 From                               Message
  ----     ------             ----                ----                               -------
  Warning  ContainerGCFailed  48m (x43 over 19h)  kubelet, aks-agentpool-22124581-0  rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  ImageGCFailed      29m (x57 over 18h)  kubelet, aks-agentpool-22124581-0  failed to get image stats: rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
  Warning  ContainerGCFailed  2m (x237 over 18h)  kubelet, aks-agentpool-22124581-0  rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
```

/kind bug

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 24 (9 by maintainers)

Most upvoted comments

This sounds like a memory leak to me. This would explain why the hang occurs 10 hours in. And many of your pods don’t have the same resource requests and resource limits, with istio-pilot-d9b5b9b7c-ln75n seemingly having the biggest discrepancy; it could be leaking memory without getting killed.
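One quick way to see that discrepancy is to print requests and limits side by side. A rough sketch (memory only, one value per container, and pods with nothing set print nothing):

```sh
# Compare memory requests vs. limits for every pod; istio-pilot shows a 2Gi request with no limit
kubectl get pods --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'
```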

I would suggest monitoring memory usage over time and storing it externally to check if that’s what’s happening. You could probably mitigate this by making resource requests and resource limits identical.
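A minimal sketch of both suggestions, assuming metrics-server is reachable (it appears in the pod list above); the request/limit values below are placeholders, not recommendations:

```sh
# Snapshot per-container usage periodically (e.g. from cron) and store the output off-cluster
kubectl top pod --all-namespaces --containers

# Pin requests equal to limits on the app container so a leaking pod gets OOM-killed
# instead of slowly starving the node
kubectl set resources deployment emailgistics-pod -c emailgistics-pod \
  --requests=cpu=100m,memory=256Mi --limits=cpu=100m,memory=256Mi
```

For the Istio control-plane components the same change would typically be made through the Helm values used to install Istio rather than by editing the deployments directly.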

Since this is running on AKS, looping in @kubernetes/sig-azure-misc for additional advice.