kubernetes-client: Service watch silently stops receiving events

I have yet to track down a root cause.

Starting a watch in a straightforward way:

kubernetesClient.services()
        .inAnyNamespace()
        .withLabels(Map.of("some", "label"))
        .watch(new ListOptionsBuilder()
                .withWatch(Boolean.TRUE)
                .withResourceVersion(lastResourceVersion)
                .withTimeoutSeconds(null)
                .build(), watcher);
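For reference, watcher here is just a plain logging Watcher, roughly along these lines (a sketch assuming a 5.x client where onClose takes a WatcherException; the class name and log output are illustrative):

import io.fabric8.kubernetes.api.model.Service;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

class LoggingServiceWatcher implements Watcher<Service> {
    @Override
    public void eventReceived(Action action, Service service) {
        // Events arrive normally here for a while, then silently stop.
        System.out.printf("%s %s/%s%n", action,
                service.getMetadata().getNamespace(), service.getMetadata().getName());
    }

    @Override
    public void onClose(WatcherException cause) {
        // Never invoked with an abnormal cause when the events stop.
        System.out.println("watch closed: " + cause);
    }
}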

There are no abnormal onClose events, yet we simply stop receiving Service events; in the latest occurrence the events stopped after approximately 110 minutes. We have seen this behavior in the Watches used by informers as well.

I cannot find any existing bug reference for this behavior, and it does seem specific to Service events - other watches continue to function normally.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 16 (15 by maintainers)

Most upvoted comments

I guess the issue will affect any other client “implementation”

Yes, any client using a websocket for a Service watch.

I don’t know how we could “capture this as a known issue”, do you mean a FAQ/troubleshooting entry?

Exactly. Unless it’s a simple task to update the ResourceHandler logic, that’s all I would do at this point.

It appears from the upstream issue that this is a regression in Kubernetes 1.20, specifically for Service websocket watches, so the workaround can be pretty narrow if desired.

I don’t know how we could “capture this as a known issue”, do you mean a FAQ/troubleshooting entry?

I assumed that a document like that existed, but I don’t see anything. I’ll just close this issue - between this entry and the upstream issue a user should be able to find more on the root cause and see that the only workaround is to restart the watches / informers.
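For anyone landing here, a minimal sketch of that workaround - periodically closing and re-opening the watch from the client side - assuming you track the resourceVersion yourself; the class name, interval, and label selector are illustrative:

import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

import io.fabric8.kubernetes.api.model.ListOptionsBuilder;
import io.fabric8.kubernetes.api.model.Service;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;

class PeriodicRewatch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicReference<Watch> current = new AtomicReference<>();

    void start(KubernetesClient client, Watcher<Service> watcher, String lastResourceVersion) {
        // Re-establish the watch on a schedule that stays well inside the window
        // where events have been observed to stop (~110 minutes in this report).
        scheduler.scheduleAtFixedRate(() -> {
            Watch previous = current.getAndSet(
                    client.services().inAnyNamespace()
                            .withLabels(Map.of("some", "label"))
                            .watch(new ListOptionsBuilder()
                                    .withResourceVersion(lastResourceVersion)
                                    .build(), watcher));
            if (previous != null) {
                previous.close();
            }
        }, 0, 30, TimeUnit.MINUTES);
    }
}

In a real version the resourceVersion would be refreshed from the events the watcher receives (or by re-listing, as informers do on restart) so that each new watch resumes where the previous one left off.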

Have you tried this in different Kubernetes versions?

No, effectively just in 1.20.

Trying to fix this by scheduling automatic reconnects on some fixed period doesn’t sound reasonable to me.

That is effectively what the API server is supposed to do now - it’s just that we’re relying on the server-side default / handling rather than client-side handling. The only issue with introducing our own default is that it would need to be guaranteed shorter than the server-side one.
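If a client-side default were introduced, the narrowest form would be to stop passing a null timeoutSeconds and pin an explicit value instead - a sketch, with an arbitrary 30-minute value that would have to stay below whatever the API server enforces:

import io.fabric8.kubernetes.api.model.ListOptions;
import io.fabric8.kubernetes.api.model.ListOptionsBuilder;

// Hypothetical variant of the original options: pin a client-chosen timeout
// instead of relying on the server-side default.
ListOptions options = new ListOptionsBuilder()
        .withWatch(Boolean.TRUE)
        .withResourceVersion(lastResourceVersion)
        .withTimeoutSeconds(30L * 60L)
        .build();

That only bounds how long any single watch connection lives; it does not by itself decide who is responsible for re-establishing the next one.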

There must be an underlying cause that should get fixed.

I’ll wait a bit to get more feedback on the upstream issue, so that the scope of the problem is better understood, before adding a PR for a workaround here.