client-go: Informers do not surface API server request failures to callers
When I call cache.NewInformer to create an Informer and then call cache.Controller.Run on the returned value, the controller periodically contacts the API server to list the objects of the given type. At some point the API server rejects these requests, causing the controller to log messages like the following:
Failed to list *v1beta1.Ingress: the server has asked for the client to provide credentials (get ingresses.extensions)
Failed to list *v1.Service: the server has asked for the client to provide credentials (get services)
Failed to list *v1.Endpoints: the server has asked for the client to provide credentials (get endpoints)
These failures repeat periodically, swelling the log file, and it can take many days before we notice that our controller, while ostensibly still running, is effectively inert; it can't get its job done without talking to the API server. Sometimes the server starts fulfilling the requests again on its own, but these failure periods can persist for days.
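For context, the call site looks roughly like the following sketch (assuming an in-cluster config and Services as the watched type; the handler bodies are placeholders). Run blocks and retries internally, so failures like the ones above are only logged; nothing reaches the caller:

```go
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// List and watch Services in all namespaces.
	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "services", v1.NamespaceAll, fields.Everything())

	_, controller := cache.NewInformer(lw, &v1.Service{}, 30*time.Second,
		cache.ResourceEventHandlerFuncs{
			AddFunc:    func(obj interface{}) { /* react to a new Service */ },
			DeleteFunc: func(obj interface{}) { /* react to a deleted Service */ },
		})

	// Run blocks until the stop channel closes. List/watch failures like the
	// ones quoted above are logged by the reflector and retried internally;
	// they are never surfaced through this call.
	stop := make(chan struct{})
	controller.Run(stop)
}
```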
A caller of cache.Controller.Run should have some way of detecting that these failures are occurring in order to declare the process unhealthy. Retrying automatically to smooth over intermittent network trouble is a nice feature, but having neither the ability to control the retries nor a way to detect their ongoing failure makes them dangerous.
I would be happy with either of the following two improvements:
- Accept a callback (perhaps via a new sibling method of Controller.Run) that tells the caller when these request failures arrive. Alternatively, accept a caller-provided channel and push errors into it as they arise, dropping errors that can't be delivered synchronously.
- Provide a way to integrate a controller into a “healthz” handler. That leaves the health criteria opaque to callers, and probably calls for some way to configure the thresholds, but it still allows a calling process to report that it's in dire shape. (A rough sketch of such wiring follows this list.)
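To make the first idea concrete, here is a hedged sketch using the SetWatchErrorHandler hook that later landed in client-go's shared informers (via the commits and PR referenced further down); the healthz wiring, the healthy flag, and the port are illustrative and not part of client-go:

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	informer := factory.Core().V1().Services().Informer()

	// Illustrative health flag: 1 = healthy, 0 = the informer reported a
	// list/watch failure. There is no success hook, so a real controller
	// would need its own policy for clearing the flag.
	var healthy int32 = 1

	// The handler must be registered before the informer starts.
	if err := informer.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
		atomic.StoreInt32(&healthy, 0)
		// Keep the default logging/backoff behaviour as well.
		cache.DefaultWatchErrorHandler(r, err)
	}); err != nil {
		panic(err)
	}

	http.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		if atomic.LoadInt32(&healthy) == 1 {
			fmt.Fprintln(w, "ok")
			return
		}
		http.Error(w, "informer cannot reach the API server", http.StatusServiceUnavailable)
	})

	stop := make(chan struct{})
	factory.Start(stop)
	panic(http.ListenAndServe(":8080", nil))
}
```

Note that SetWatchErrorHandler returns an error if the informer has already started, and it is only available on shared informers in newer client-go releases.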
We discussed this gap in the “kubernetes-dev” channel in the “Kubernetes” Slack team.
About this issue
- State: closed
- Created 7 years ago
- Reactions: 17
- Comments: 27 (4 by maintainers)
Commits related to this issue
- cache: add error handling to informers When creating an informer, this adds a way to add custom error handling, so that Kubernetes tooling can properly surface the errors to the end user. Fixes http... — committed to tilt-dev/kubernetes by nicks 4 years ago
- cache: add error handling to informers When creating an informer, this adds a way to add custom error handling, so that Kubernetes tooling can properly surface the errors to the end user. Fixes http... — committed to kubernetes/client-go by nicks 4 years ago
- Remove the WaitCacheSync block on informers The WaitForCacheSync waits forever and never returns in the case a persistent error occurs. On the other hand, it looks like there is no way in the current ... — committed to 3scale-ops/marin3r by roivaz 4 years ago
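The last commit above touches a related sharp edge: cache.WaitForCacheSync returns only once every cache has synced or its stop channel closes, so a persistent failure keeps a caller blocked indefinitely. Below is a minimal sketch of one way to bound that wait, assuming a SharedInformerFactory; the package and helper name (waitForSyncOrTimeout) are made up for illustration:

```go
package controller

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

// waitForSyncOrTimeout starts the factory's informers and waits for the given
// cache to sync, but gives up after the supplied timeout instead of blocking
// forever when the API server keeps rejecting the list requests.
func waitForSyncOrTimeout(factory informers.SharedInformerFactory, stop <-chan struct{},
	synced cache.InformerSynced, timeout time.Duration) bool {

	factory.Start(stop)

	// Wait on a separate channel so that giving up on the sync does not also
	// stop the informers themselves.
	giveUp := make(chan struct{})
	timer := time.AfterFunc(timeout, func() { close(giveUp) })
	defer timer.Stop()

	return cache.WaitForCacheSync(giveUp, synced)
}
```

A caller would pass the factory, its long-lived stop channel, informer.HasSynced, and a deadline; a false return means the deadline passed before the caches synced, at which point the process can flag itself unhealthy instead of hanging for days.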
This issue has been causing a lot of trouble for us. Informers get into an access-denied error loop and spew an overwhelming amount of error logs. For details, see https://github.com/windmilleng/tilt/issues/2702
I proposed a PR upstream that should address our problem, but I don't know whether it addresses some of the other ideas on this thread around healthz: https://github.com/kubernetes/kubernetes/pull/87329