serving: Initial Revisions (at least) with crash-looping pods take a long time to terminate/clean up Pods

What version of Knative?

v1.0 at least

Expected Behavior

When creating a Revision whose Pod exits immediately, the Revision should (fairly quickly) report Ready as False and terminate the Pods.

Actual Behavior

The Pods stick around in CrashLoopBackOff through many restarts, and the Revision remains in Unknown status for many minutes before eventually timing out.

Steps to Reproduce the Problem

In one shell:

kn service create crasher --image nicolaka/netshoot  # or even a "bash" image

Watch this stall out and check on the Pods with kubectl get po, etc.

In a second shell:

kn service update crasher --image projects.registry.vmware.com/tanzu_serverless/hello-yeti

The first kn service create will complete, and the service will be ready to serve!

BUT

The first Revision will still be in Unknown status, and the Pod will still be present in CrashLoopBackOff, even many minutes after the failure.

After approximately 10 minutes, the Pod will finally be cleaned up, but the reported status.desiredScale for the KPA resource is still -1 at the end of that time.
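The timings here are consistent with two defaults (this is an inference, not something stated in the issue): the kubelet retries a crashing container with an exponentially increasing backoff (10s initial delay, doubling per restart, capped at 5m), and Knative's progress deadline defaults to 600s, which lines up with the ~10-minute cleanup. A small sketch of the assumed backoff schedule, just arithmetic rather than anything from Knative:

```shell
# Sketch of the kubelet's CrashLoopBackOff schedule, assuming the
# defaults of a 10s initial delay, doubling per restart, capped at 5m.
# The 600s cutoff below is only the observed ~10-minute window.
delay=10
total=0
restart=1
while [ "$total" -lt 600 ]; do
  echo "restart $restart: backoff ${delay}s (elapsed ~${total}s)"
  total=$((total + delay))
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then
    delay=300   # kubelet caps the crash backoff at 5 minutes
  fi
  restart=$((restart + 1))
done
```

Under these assumed defaults the Pod gets about six restarts inside the 10-minute window, which matches the "many restarts" observed above.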

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 37 (23 by maintainers)

Most upvoted comments

Hey @andrew-delph, I retested the repro steps mentioned in the first comment, and I believe this was fixed in the latest release via the PR https://github.com/knative/serving/pull/14309

It’s resolved because we now propagate reachability appropriately and no longer do odd things like reading the Active conditions. As a result, older Revisions that aren’t referenced by a Route scale down correctly.

I’m going to close this one out, but if you’re testing a similar scenario that you’re trying to address, please create a new issue.

Also, a related issue that someone is working on: https://github.com/knative/serving/issues/13677

That one is about incoming requests that time out not triggering scale-down as quickly as they could.

Related: I didn’t realize that maxUnavailable affects the Available status condition:

https://github.com/kubernetes/kubernetes/issues/106697#issuecomment-1369672284
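For context on that last point, a Deployment's Available condition becomes True once availableReplicas >= replicas - maxUnavailable, so the rolling-update maxUnavailable setting feeds directly into when Available flips. A hypothetical fragment (names and image are illustrative, not from this issue) showing where that setting lives:

```yaml
# Hypothetical Deployment fragment; the values shown are the
# Kubernetes rolling-update defaults (25% maxUnavailable / maxSurge).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%   # 25% of 4 rounds down to 1, so 3 available replicas suffice
      maxSurge: 25%
  selector:
    matchLabels: {app: example}
  template:
    metadata:
      labels: {app: example}
    spec:
      containers:
        - name: app
          image: registry.example.com/app   # placeholder image
```

With replicas: 4 and maxUnavailable: 25%, the Deployment reports Available as soon as 3 replicas are available, even while one is crash-looping.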