serving: Initial Revisions (at least) with crash-looping pods take a long time to terminate/clean up Pods
What version of Knative?
v1.0 at least
Expected Behavior
When creating a Revision with a pod which exits immediately, the Revision should (fairly quickly) report that Ready is False and terminate the Pods.
Actual Behavior
The pods stick around in CrashLoopBackOff for many restarts, and the Revision remains in “unknown” status for many minutes and eventually times out.
Steps to Reproduce the Problem
In one shell:
kn service create crasher --image nicolaka/netshoot # or even a "bash" image
Watch this stall out, check on the pods with kubectl get po, etc.
In a second shell:
kn service update crasher --image projects.registry.vmware.com/tanzu_serverless/hello-yeti
The first kn service create will complete, and the service will be ready to serve!
BUT
The first Revision will still be in unknown status, and the Pod will still be present in CrashLoopBackOff, even many minutes after the failure.
After approximately 10 minutes, the Pod will finally be cleaned up, but the reported status.desiredScale for the KPA resource is still -1 at the end of that time.
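For anyone reproducing this, a few commands that can help watch the stuck state. The resource names below are illustrative assumptions: kn typically generates Revision names like crasher-00001, the KPA (PodAutoscaler) is named after its Revision, and kpa is the short name Knative registers for that CRD.
kubectl get pods          # the crashing pod sits in CrashLoopBackOff
kubectl get revisions     # the first Revision stays Ready=Unknown
kubectl get kpa crasher-00001 -o jsonpath='{.status.desiredScale}'   # keeps reporting -1 during this window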
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 37 (23 by maintainers)
Hey @andrew-delph, I retested the repro steps mentioned in the first comment, and I believe this was fixed in the latest release via PR https://github.com/knative/serving/pull/14309
It’s resolved since we now propagate reachability appropriately and don’t do weird things like reading the Active condition. As a result, older Revisions that aren’t pointed to by a Route will scale down correctly. I’m going to close this one out, but if you’re testing a similar scenario that you’re trying to address, please create a new issue.
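If you want to check that behavior on a release containing that PR, a rough sketch (again, the Route/Revision/KPA names here are illustrative, not taken from this issue):
kubectl get route crasher -o jsonpath='{.status.traffic[*].revisionName}'   # which Revision(s) the Route currently targets
kubectl get kpa crasher-00001 -o jsonpath='{.status.desiredScale}'          # the unreferenced Revision's KPA should now go to 0 instead of staying at -1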
Also a related issue that someone is working on is here - https://github.com/knative/serving/issues/13677
That one is about how requests that come in and time out don’t trigger scale-down as quickly as they could.
/assign @keshavcodex /unassign @itsdarshankumar
related: I didn’t realize maxUnavailable affects the Available status condition: https://github.com/kubernetes/kubernetes/issues/106697#issuecomment-1369672284
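For context on that last point, one way to inspect the condition on the Deployment Knative creates for a Revision (the Deployment name is an illustrative assumption; Knative typically names it <revision>-deployment):
kubectl get deployment crasher-00001-deployment -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'   # whether this reports True with a crash-looping pod depends on replicas and maxUnavailable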