serving: Readiness probe fails if the container restarts after a liveness probe failure

What version of Knative?

v1.2

Expected Behavior

After a liveness probe failure, the container should restart and ideally start serving traffic again, just as it would for a plain Kubernetes Deployment.

Actual Behavior

After the restart caused by the liveness probe failure, the pod starts failing its readiness probe and serves no traffic.

Steps to Reproduce the Problem

ksvc:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: healthprobes-go
spec:
  template:
    spec:
      containers:
        - image: shashankft/healthprobes@sha256:cc1cb4323e4d1cdee62e3ae7623ff6a213db72e89c17ad4772c39fe841b98fb9
          env:
            - name: TARGET
              value: "Go Sample v1"
          livenessProbe:
            httpGet:
              path: /healthz/liveness
              port: 0
            periodSeconds: 15
            failureThreshold: 1

The image used here is built from this code: https://github.com/knative/serving/blob/db4c85b641703f94a700ada8cc074d28b423b5eb/test/test_images/healthprobes/health_probes.go

Hit the /start-failing path so that the liveness endpoint starts returning failures.
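
For reference, here is a minimal sketch of a handler in the same spirit as the linked test image (this is not the actual health_probes.go source; the port and response bodies are illustrative): /healthz/liveness succeeds until /start-failing is hit, after which it returns 500 until the process restarts.

package main

import (
	"net/http"
	"sync/atomic"
)

func main() {
	var failing atomic.Bool

	// Flip the liveness endpoint into a failing state.
	http.HandleFunc("/start-failing", func(w http.ResponseWriter, r *http.Request) {
		failing.Store(true)
		w.WriteHeader(http.StatusOK)
	})

	// Liveness probe target: 200 normally, 500 once /start-failing was hit.
	http.HandleFunc("/healthz/liveness", func(w http.ResponseWriter, r *http.Request) {
		if failing.Load() {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	// Regular traffic.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("Hello\n"))
	})

	http.ListenAndServe(":8080", nil)
}

Because the failing flag lives only in process memory, a restarted container passes its liveness probe again, which is why a plain Deployment recovers after the restart.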

relevant events of the ksvc pod:

  Normal   Created    2m                 kubelet            Created container queue-proxy
  Normal   Started    2m                 kubelet            Started container queue-proxy
  Normal   Killing    60s                kubelet            Container user-container failed liveness probe, will be restarted
  Warning  Unhealthy  60s (x3 over 90s)  kubelet            Liveness probe failed: HTTP probe failed with statuscode: 500
  Normal   Started    9s (x2 over 2m)    kubelet            Started container user-container
  Normal   Created    9s (x2 over 2m)    kubelet            Created container user-container
  Normal   Pulled     9s                 kubelet            Container image "shashankft/healthprobes@sha256:cc1cb4323e4d1cdee62e3ae7623ff6a213db72e89c17ad4772c39fe841b98fb9" already present on machine
  Warning  Unhealthy  0s (x6 over 50s)   kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503

I tried running a plain Kubernetes Deployment with the same image and liveness probe configuration, and it behaved as I hoped: whenever I explicitly hit the /start-failing path, the liveness probe reported a failure within a few seconds, the container restarted, and it started serving traffic again.

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments: 18 (12 by maintainers)

Most upvoted comments

Thanks @dprotaso! Until then, if anyone knows a workaround, it would be very welcome. Currently we have found only one “workaround”: convincing developers to build services whose liveness probe never fails … 😅

@pramenn afaik @Shashankft9 was planning to open a PR that resets the timer, as a candidate fix.

The quiet period is overridden on the queue-proxy side:

knative.dev/pkg sets this to 45 seconds by default here: https://github.com/knative/serving/blob/52f07b02254b42f18fa0f71dbb0462410a5bc7b1/vendor/knative.dev/pkg/network/handlers/drain.go#L123

However, queue main sets it to 30s: https://github.com/knative/serving/blob/52f07b02254b42f18fa0f71dbb0462410a5bc7b1/cmd/queue/main.go#L64
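
For illustration, here is a minimal sketch of how a Drainer from knative.dev/pkg/network/handlers can be wired with an explicit QuietPeriod (this is not the actual queue-proxy wiring; the server setup and signal handling are assumptions): when QuietPeriod is non-zero it takes precedence over the 45s default that drain.go would otherwise apply.

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"knative.dev/pkg/network/handlers"
)

func main() {
	// User traffic sits behind the drainer.
	app := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("OK\n"))
	})

	// An explicit QuietPeriod (30s, matching the queue main override above)
	// wins over the 45s default applied when the field is left zero.
	drainer := &handlers.Drainer{
		Inner:       app,
		QuietPeriod: 30 * time.Second,
	}

	srv := &http.Server{Addr: ":8080", Handler: drainer}

	go func() {
		sigs := make(chan os.Signal, 1)
		signal.Notify(sigs, syscall.SIGTERM)
		<-sigs
		// Drain() starts failing probe requests and blocks until no user
		// requests have been seen for QuietPeriod.
		drainer.Drain()
		srv.Shutdown(context.Background())
	}()

	srv.ListenAndServe()
}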

Some more context:

The drain call here is what we experience: https://github.com/knative/serving/blob/52f07b02254b42f18fa0f71dbb0462410a5bc7b1/cmd/queue/main.go#L354 (see line 16 in the queue-proxy logs: https://gist.github.com/skonto/749b5a34b464a786d2d3f473da0453d2). This drain is called because we set preStop handlers on the user container: https://github.com/knative/serving/blob/9fb7372faad456c23274e0a62fc9d15e382c801f/pkg/reconciler/revision/resources/deploy.go#L80. The user container was then restarted.

Kubernetes sends the preStop event immediately before the container is terminated, and Kubernetes’ management of the container blocks until the preStop handler completes, unless the Pod’s grace period expires. So we have a blocking situation: the preStop handler does return at some point (as described next), but it neither clears the draining condition nor sets the timer back to nil.

Now, when we drain for the first time we set a timer, and that timer is continuously reset whenever a user request arrives in the meantime. This is the key idea, from the Drainer’s doc comment: "When the Drainer is told to Drain, it will immediately start to fail probes with a “500 shutting down”, and the call will block until no requests have been received for QuietPeriod (defaults to network.DefaultDrainTimeout)."
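
To make those mechanics concrete, here is a deliberately simplified sketch of the behavior described above (this is not the real drain.go; the drainer type, its fields, and the isProbe helper are illustrative): once Drain is called, probe requests fail, each user request pushes the quiet deadline out again, and nothing ever clears the drain state afterwards.

// Package drainsketch is a simplified illustration, not the real drainer.
package drainsketch

import (
	"net/http"
	"strings"
	"sync"
	"time"
)

type drainer struct {
	mu          sync.Mutex
	inner       http.Handler
	quietPeriod time.Duration

	draining bool          // set once Drain is called; never cleared here
	drained  bool          // set once the quiet period has elapsed
	timer    *time.Timer   // pushed out on every user request while draining
	done     chan struct{} // closed when the quiet period elapses
}

func (d *drainer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	d.mu.Lock()
	if d.draining {
		if isProbe(r) {
			d.mu.Unlock()
			// Probes keep failing for as long as the drain state is set.
			http.Error(w, "shutting down", http.StatusInternalServerError)
			return
		}
		if !d.drained {
			// A user request arrived mid-drain: push the quiet deadline out.
			d.timer.Reset(d.quietPeriod)
		}
	}
	d.mu.Unlock()
	d.inner.ServeHTTP(w, r)
}

// Drain blocks until no user request has been seen for quietPeriod.
func (d *drainer) Drain() {
	d.mu.Lock()
	if !d.draining {
		d.draining = true
		d.done = make(chan struct{})
		d.timer = time.AfterFunc(d.quietPeriod, func() {
			d.mu.Lock()
			defer d.mu.Unlock()
			if !d.drained {
				d.drained = true
				close(d.done)
			}
		})
	}
	done := d.done
	d.mu.Unlock()
	<-done
}

func isProbe(r *http.Request) bool {
	// Sketch only: the real drainer identifies probe requests more carefully.
	return strings.HasPrefix(r.UserAgent(), "kube-probe/")
}

Because the draining flag is never set back to false in this flow, probe requests keep getting rejected even after the drain has finished and the user container has been restarted, which is consistent with the readiness failures in the events above.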

Now, if we set that timer to nil after draining is done, I suspect we will be able to move on with the readiness probes. The invariant is that pods should finish their in-flight requests before shutting down; in our case that means draining all requests before the restart and not accepting any new ones before the container is ready again. I am wondering if there is a design doc for this behavior.
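
A candidate fix along the lines described above (clear the drain state and set the timer to nil once draining has finished) could look like the following addition to the sketch drainer; the actual change to knative.dev/pkg may well look different.

// Reset is a hypothetical addition to the sketch drainer above: it clears
// the drain state so that probe requests are served normally again once
// draining has completed and the container has come back up.
func (d *drainer) Reset() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Stop()
		d.timer = nil
	}
	d.draining = false
	d.drained = false
	d.done = nil
}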