kubernetes: liveness/readiness probe is executed and fails while pod is terminating

What happened: the liveness/readiness probe fails while the pod is being terminated, and it happens only once per termination. The issue started after upgrading from v1.6.X to v1.7.

How to reproduce it (as minimally and precisely as possible): execute kubectl delete pod nginx-A1 to delete the pod. The status of nginx-A1 changes to Terminating, and right after that the liveness and readiness probes appear to be executed and fail, but only once. An Nginx reverse proxy is running in the pod, so I just use the httpGet method for the liveness and readiness probes.

Here is my Deployment config.

   ...
   spec:
      terminationGracePeriodSeconds: 60
        ...
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          timeoutSeconds: 3
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          timeoutSeconds: 3

Here is the Events log from kubectl describe pod nginx-A1:

Events:
  FirstSeen	LastSeen	Count	From					SubObjectPath			Type		Reason		Message
  ---------	--------	-----	----					-------------			--------	------		-------
  14s		14s		1	kubelet, *****	spec.containers{dnsmasq}	Normal		Killing		Killing container with id docker://dnsmasq:Need to kill Pod
  9s		9s		1	kubelet, *****	spec.containers{nginx}		Warning		Unhealthy	Liveness probe failed: Get http://100.*.*.*:8080/healthz: dial tcp 100.*.*.*:8080: getsockopt: connection refused
  9s		9s		1	kubelet, *****	spec.containers{nginx}		Warning		Unhealthy	Readiness probe failed: Get http://100.*.*.*:8080/healthz: dial tcp 100.*.*.*:8080: getsockopt: connection refused

Environment:

  • Kubernetes version: 1.7.2

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 23
  • Comments: 41 (16 by maintainers)

Most upvoted comments

@matthyx we’re running 1.16 and are being hit by this continuously when some of our Elixir apps shut down, so cherry-picking it into 1.18 would at least put it closer in our update path

https://github.com/kubernetes/kubernetes/pull/100525 https://github.com/kubernetes/kubernetes/pull/100526 https://github.com/kubernetes/kubernetes/pull/100527

/reopen
/remove-lifecycle rotten

We see this consistently with all pods that define a liveness or readiness probe. Whenever we roll out a new deployment, the pods being terminated emit a failed liveness/readiness probe AFTER they have been terminated. We have considered adding a preStop hook that just sleeps for 2-3 seconds (sketched below), but it seems like a band-aid for something that should not happen in the first place.

Is this an impossible-to-solve race condition between kubernetes moving parts?
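
For anyone considering that workaround, here is a minimal sketch of the preStop-sleep band-aid, assuming the nginx container and port from the Deployment above; the container name, sleep duration, and the presence of a shell in the image are assumptions on my part, not from the original report.

    # Band-aid only: delay SIGTERM so the container keeps answering /healthz
    # for a few seconds after the kubelet starts terminating the pod.
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: nginx                      # assumed container name
          lifecycle:
            preStop:
              exec:
                # requires a shell and the sleep binary inside the image
                command: ["sh", "-c", "sleep 5"]
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080

The sleep just keeps the container serving /healthz during the window in which a probe may still fire after termination begins; it does not address the underlying race.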

Should I consider a cherry-pick for 1.20 and 1.19? (maybe 1.18 too?)

Since the upgrade to 1.7, it seems our deployment rollouts have a higher failure rate. Occasionally a pod comes up but its readiness probe never gets started, and it stays in that state, blocking the entire rollout. I usually have to delete the pod so it is rescheduled and a fresh readiness probe is fired.

I wonder if these issues are related.

I am also facing this issue. We have pods that have a lot of cleanup to do during shutdown; it can take up to 5 minutes for them to terminate gracefully. During this time the livenessProbe detects failures and restarts the pod, which is not what we want. I am unable to keep the service that handles the liveness check running while the cleanup is happening. It would be better if the pod were immediately removed from the Service and the probes stopped while the shutdown is performed. Basically, this means k8s is never actually able to terminate the pod.
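
Not a fix for the probe race, but for the long-cleanup case above one common mitigation is to raise the grace period past the worst-case cleanup time and relax the liveness probe so it cannot trip during shutdown. A rough sketch with made-up numbers and a hypothetical container name (none of this is from the original comment):

    # Illustrative values only.
    spec:
      terminationGracePeriodSeconds: 360   # > worst-case ~5 min cleanup
      containers:
        - name: app                        # hypothetical container name
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 30
            failureThreshold: 12           # 30s x 12 = 6 min of failures before a restart

With failureThreshold × periodSeconds larger than the shutdown window, the kubelet never accumulates enough consecutive failures to restart the container while it is still cleaning up.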

kubernetes_version 1.19.3. Same issue.