kubernetes: Pods stuck on terminating

What happened:

Pods are stuck in the Terminating state.

What you expected to happen:

Pods to be terminated and recreated after failing the readiness and liveness probes.

How to reproduce it (as minimally and precisely as possible):

Create a Deployment (manifest below).
kubelet fails to delete and recreate the pod after terminationGracePeriodSeconds: 300 has elapsed.
We had to force-delete the pod after it got stuck in the Terminating state.

Anything else we need to know?:

Below is the deployment.yaml (I have redacted confidential info):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ABC
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ABC
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        checksum/config: xxxxxxxxxxxxxxx
      labels:
        app: ABC
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ABC
              namespaces: ["default"]
              topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
      - name: abc
        image: abc:latest
        workingDir: /work
        resources:
          requests:
            cpu: 0
            memory: 1Gi
          limits:
            cpu: 2
            memory: 1Gi
        readinessProbe:
          tcpSocket:
            port: 6565
          initialDelaySeconds: 10
          timeoutSeconds: 5
        livenessProbe:
          tcpSocket:
            port: 6565
          initialDelaySeconds: 30
          periodSeconds: 5
        volumeMounts:
        - name: config
          mountPath: /abc
          subPath: config.conf
          readOnly: true
        - name: config
          mountPath: /abc
          subPath: config.json
          readOnly: true
        - name: archive
          mountPath: /archive
          subPath: /abc
      - name: telegraf
        image: telegraf:latest
        resources:
          requests:
            cpu: 0
            memory: 96Mi
          limits:
            cpu: 1
            memory: 96Mi
      terminationGracePeriodSeconds: 300
      volumes:
      - name: config
        configMap:
          name: ABC
      - name: archive
        nfs:
          server: "archive-server"
          path: /
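
For reference, here is a rough Python sketch of the timing this manifest implies. It assumes the Kubernetes default failureThreshold of 3 (the manifest does not set one) and that, after the liveness probe fails, the container is given the pod's terminationGracePeriodSeconds before it is force-killed. The function name is made up for illustration, and the result is an estimate rather than a guarantee of kubelet behaviour.

```python
def estimate_seconds_to_sigkill(period_seconds, grace_period_seconds, failure_threshold=3):
    """Estimate the time from the first failed liveness probe to SIGKILL.

    failure_threshold defaults to 3, the Kubernetes default (an assumption
    here, because the manifest does not set it explicitly).
    """
    # Roughly one probe period per consecutive failure before kubelet acts.
    detection_seconds = failure_threshold * period_seconds
    # kubelet then sends SIGTERM and waits up to the grace period.
    return detection_seconds + grace_period_seconds


# Values from the manifest: livenessProbe.periodSeconds=5,
# terminationGracePeriodSeconds=300.
print(estimate_seconds_to_sigkill(period_seconds=5, grace_period_seconds=300))
```

With the manifest's values this comes to roughly 315 seconds, so a pod that is still Terminating hours later is far outside the expected window.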

Environment:

Kubernetes version (use kubectl version):
kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Docker version:
docker version
Client: Docker Engine - Community
 Version: 19.03.14
 API version: 1.40
 Go version: go1.13.15
 Git commit: 5eb3275d40
 Built: Tue Dec 1 19:20:42 2020
 OS/Arch: linux/amd64
 Experimental: false

Server: Docker Engine - Community
 Engine:
  Version: 19.03.14
  API version: 1.40 (minimum version 1.12)
  Go version: go1.13.15
  Git commit: 5eb3275d40
  Built: Tue Dec 1 19:19:17 2020
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.4.3
  GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version: 1.0.0-rc92
  GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version: 0.18.0
  GitCommit: fec3683

Containerd version:
containerd --version
containerd containerd.io 1.4.3 269548fa27e0089a8b8278fc4fc781d7f65a939b

Cloud provider or hardware configuration:
AWS / m5d.8xlarge
OS (e.g: cat /etc/os-release):
CentOS Linux release 7.9.2009 (Core)
Kernel (e.g. uname -a):
Linux SERVERNAME 4.4.245-1.el7.elrepo.x86_64 #1 SMP Fri Nov 20 09:39:52 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Network plugin and version (if this is a network-related bug):
Others: Here are the logs

kubelet.log snippet as it was trying to delete the container:

E0328 00:05:15.236267 17532 pod_workers.go:191] Error syncing pod 75ba2b6a-b39c-4745-b6a6-e2bf4d02afda ("XXXXXXXXX(75ba2b6a-b39c-4745-b6a6-e2bf4d02afda)"), skipping: failed to "KillContainer" for "ABC" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
E0328 00:10:16.234249 17532 pod_workers.go:191] Error syncing pod 75ba2b6a-b39c-4745-b6a6-e2bf4d02afda ("XXXXXXXXX(75ba2b6a-b39c-4745-b6a6-e2bf4d02afda)"), skipping: failed to "KillContainer" for "ABC" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"

STUCK FOR 16 HOURS 43 MINUTES

E0328 16:43:16.796493 17532 kubelet.go:1576] error killing pod: failed to "KillContainer" for "ABC" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
E0328 16:43:16.796513 17532 pod_workers.go:191] Error syncing pod 75ba2b6a-b39c-4745-b6a6-e2bf4d02afda ("XXXXXXXXXXXXXXXX(75ba2b6a-b39c-4745-b6a6-e2bf4d02afda)"), skipping: error killing pod: failed to "KillContainer" for "ABC" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
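
To see how long the failures span, the kubelet entries can be scanned for KillContainer errors. The following is a minimal Python sketch, not part of kubelet or any Kubernetes tooling. The function name and the abridged sample text are illustrative, and the year is passed in as an assumption because klog headers do not include one (2021 is taken from the dockerd timestamps in this issue).

```python
import re
from datetime import datetime

# klog header: severity letter, MMDD, time with microseconds, PID, file:line.
KLOG_HEADER = re.compile(
    r'\b(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) '
    r'(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<pid>\d+) '
    r'(?P<src>[\w.]+:\d+)\] '
)


def kill_container_failures(log_text, year):
    """Return the timestamp of every entry that mentions a KillContainer failure.

    Entries are split on klog headers rather than newlines, so the function
    also works on text where several entries were pasted onto one line.
    """
    headers = list(KLOG_HEADER.finditer(log_text))
    stamps = []
    for i, header in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(log_text)
        message = log_text[header.end():end]
        if "KillContainer" in message:
            stamps.append(datetime.strptime(
                f"{year}{header['month']}{header['day']} {header['time']}",
                "%Y%m%d %H:%M:%S.%f",
            ))
    return stamps


# Headers copied from the kubelet entries in this issue; messages are abridged.
SAMPLE = (
    'E0328 00:05:15.236267 17532 pod_workers.go:191] Error syncing pod: failed to "KillContainer" for "ABC" '
    'E0328 00:10:16.234249 17532 pod_workers.go:191] Error syncing pod: failed to "KillContainer" for "ABC" '
    'E0328 16:43:16.796493 17532 kubelet.go:1576] error killing pod: failed to "KillContainer" for "ABC" '
    'E0328 16:43:16.796513 17532 pod_workers.go:191] Error syncing pod: failed to "KillContainer" for "ABC"'
)

stamps = kill_container_failures(SAMPLE, year=2021)
print(len(stamps), "failures spanning", stamps[-1] - stamps[0])
```

The span covers only the entries quoted here, so it is a lower bound on how long the pod was actually stuck.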

Docker daemon logs from the next day, when the pod was long gone (force-deleted by us):

Mar 29 00:00:03 SERVERNAME dockerd[15521]: time="2021-03-29T00:00:03.402804589Z" level=error msg="Handler for GET /containers/92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Mar 29 00:00:03 SERVERNAME dockerd[15521]: time="2021-03-29T00:00:03.403058811Z" level=error msg="Handler for GET /containers/92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Mar 29 00:00:03 SERVERNAME dockerd[15521]: time="2021-03-29T00:00:03.403283589Z" level=error msg="Handler for GET /containers/92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Mar 29 00:00:03 SERVERNAME dockerd[15521]: time="2021-03-29T00:00:03.403525118Z" level=error msg="Handler for GET /containers/92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"

containerd log: containerd deleted the container when requested, but as seen above dockerd was still querying its status the next day, and kubelet was stuck because it was waiting for dockerd to report that the deletion had succeeded.

/var/log/messages-20210328:Mar 28 00:00:04 SERVERNAME containerd: time="2021-03-28T00:00:04.927077043Z" level=info msg="shim reaped" id=92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99
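
The correlation described above can be checked mechanically by matching container IDs across the two logs. A minimal Python sketch follows. The function name is hypothetical, the regular expressions are assumptions based only on the log lines quoted in this issue, and the code compares IDs without looking at timestamps.

```python
import re

# Patterns derived from the dockerd and containerd lines quoted in this issue.
DOCKERD_INSPECT_ERROR = re.compile(
    r'Handler for GET /containers/(?P<id>[0-9a-f]{64})/json returned error'
)
CONTAINERD_SHIM_REAPED = re.compile(r'msg="shim reaped" id=(?P<id>[0-9a-f]{64})')


def stale_inspect_targets(dockerd_log, containerd_log):
    """Count failed dockerd inspect calls per container whose shim was already reaped."""
    reaped = {m['id'] for m in CONTAINERD_SHIM_REAPED.finditer(containerd_log)}
    counts = {}
    for m in DOCKERD_INSPECT_ERROR.finditer(dockerd_log):
        if m['id'] in reaped:
            counts[m['id']] = counts.get(m['id'], 0) + 1
    return counts


CONTAINER_ID = "92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99"

# The four dockerd entries from this issue differ only in their timestamps.
dockerd_log = "\n".join(
    f'Mar 29 00:00:03 SERVERNAME dockerd[15521]: time="2021-03-29T00:00:03.{frac}Z" level=error '
    f'msg="Handler for GET /containers/{CONTAINER_ID}/json returned error: '
    'write unix /var/run/docker.sock->@: write: broken pipe"'
    for frac in ("402804589", "403058811", "403283589", "403525118")
)
containerd_log = (
    '/var/log/messages-20210328:Mar 28 00:00:04 SERVERNAME containerd: '
    'time="2021-03-28T00:00:04.927077043Z" level=info '
    f'msg="shim reaped" id={CONTAINER_ID}'
)

for container_id, failures in stale_inspect_targets(dockerd_log, containerd_log).items():
    print(container_id[:12], "reaped by containerd, still inspected by dockerd:", failures, "times")
```

Any ID this reports is one that containerd considers gone while dockerd is still being asked about it, which is the mismatch the logs in this issue show.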

About this issue

State: open
Created 3 years ago
Reactions: 3
Comments: 19 (4 by maintainers)

Most upvoted comments

This is a duplicate issue; see https://github.com/kubernetes/kubernetes/pull/98507

wzshiming on Apr 1, 2021

Below are the liveness and readiness probe failures (both probes target TCP port 6565). If you look closely, the container restarted successfully on Mar 25, 26, and 27, but failed to restart on Mar 28. See the relevant logs in the attached kubelet.log.

Hitesh-Agrawal on Mar 31, 2021