client-go: Informers do not surface API server request failures to callers

When I call cache.NewInformer to create an Informer and then call cache.Controller.Run on the returned value, the controller periodically contacts the API server to list the objects of the given type. At some point the API server rejects these requests, causing the controller to log messages like the following:

Failed to list *v1beta1.Ingress: the server has asked for the client to provide credentials (get ingresses.extensions)

Failed to list *v1.Service: the server has asked for the client to provide credentials (get services)

Failed to list *v1.Endpoints: the server has asked for the client to provide credentials (get endpoints)

These failures repeat periodically, swelling the log file, and it can take many days before we notice that our controller, while ostensibly still running, is doing nothing useful; it cannot do its job without talking to the API server. Sometimes the server starts fulfilling the requests again on its own, but these failure periods can persist for days.
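
For context, here is a minimal sketch of the setup described above, assuming an in-cluster config and a watch on Services purely for illustration; the point is that controller.Run blocks and retries forever, and list failures like the ones above are only logged, never surfaced to the caller:

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Assumes the process runs in-cluster; outside a cluster, load a kubeconfig instead.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List/watch Services in all namespaces (Services chosen only for illustration).
	lw := cache.NewListWatchFromClient(
		clientset.CoreV1().RESTClient(), "services", metav1.NamespaceAll, fields.Everything())

	_, controller := cache.NewInformer(lw, &corev1.Service{}, 30*time.Second,
		cache.ResourceEventHandlerFuncs{
			AddFunc:    func(obj interface{}) { /* react to added Services */ },
			UpdateFunc: func(oldObj, newObj interface{}) { /* react to updates */ },
			DeleteFunc: func(obj interface{}) { /* react to deletions */ },
		})

	stop := make(chan struct{})
	defer close(stop)

	// Run blocks until stop is closed. List/watch failures like the ones above
	// are retried and logged internally; the caller never sees them.
	controller.Run(stop)
}
```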

A caller of cache.Controller.Run should have some way of detecting that these failures are occurring, so that it can declare the process unhealthy. Retrying automatically to smooth over intermittent network trouble is a nice feature, but having neither a way to control the retries nor a way to detect their ongoing failure makes it dangerous.

I would be happy with either of the following two improvements:

  • Accept a callback (perhaps via a new sibling method of Controller.Run) that notifies the caller when these request failures occur.
    Alternatively, it could accept a caller-provided channel and push errors into it as they arise, dropping any errors that can’t be delivered synchronously (see the sketch after this list).
  • Provide a way to integrate a controller into a “healthz” handler.
    That leaves the health criteria opaque to callers, and probably calls for some way to configure the thresholds, but it would still let the calling process signal that it’s in dire shape.
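
To make the first option concrete, here is a purely hypothetical sketch; nothing below exists in client-go, and the names (ErrorReportingController, RunWithErrors, trackHealth) are invented for illustration only.

```go
// Purely hypothetical sketch of the first option; none of these names exist
// in client-go today. They only illustrate the shape of the proposal.
package sketch

// ErrorReportingController would behave like cache.Controller, but its run
// method also delivers every list/watch failure to a caller-provided channel,
// dropping errors that cannot be delivered synchronously so the controller
// never blocks on a slow consumer.
type ErrorReportingController interface {
	Run(stopCh <-chan struct{})
	RunWithErrors(stopCh <-chan struct{}, errCh chan<- error)
}

// The caller could then fold those errors into its own health signal.
func trackHealth(errCh <-chan error, markUnhealthy func(error)) {
	for err := range errCh {
		// A real implementation would likely apply a threshold or a time
		// window before flipping the health bit.
		markUnhealthy(err)
	}
}
```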

We discussed this gap in the “kubernetes-dev” channel in the “Kubernetes” Slack team.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 17
  • Comments: 27 (4 by maintainers)

Most upvoted comments

This has been causing a lot of problems for us. Informers get into an access-denied error loop and spew out an overwhelming amount of error logs. For details, see https://github.com/windmilleng/tilt/issues/2702

I proposed a PR upstream that will address our problems, but I don’t know if it addresses some of the other ideas on this thread around healthz: https://github.com/kubernetes/kubernetes/pull/87329
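
If the approach in that PR lands (a caller-settable handler for list/watch errors on a shared informer), the wiring might look roughly like the sketch below; the method name SetWatchErrorHandler and its signature are assumptions to be verified against whatever client-go release ships the change:

```go
package main

import (
	"log"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	svcInformer := factory.Core().V1().Services().Informer()

	// Replace the default "log and retry forever" behavior with a handler the
	// caller controls, e.g. to record failures for a /healthz endpoint.
	if err := svcInformer.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
		log.Printf("list/watch of Services failed: %v", err)
	}); err != nil {
		log.Fatal(err) // the handler must be set before the informer starts
	}

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}
```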