k8s-sidecar: Watching stops working after 10 minutes

When watching a set of ConfigMaps, the sidecar stops being notified of new changes after 10 minutes without changes.

## Repro steps

1. Install the following ConfigMap into a cluster:

        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: grafana-test-dashboard
          labels:
            grafana_dashboard: "1"
        data:
          cm-test.json: "{}"

2. Install the sidecar into the cluster:

        apiVersion: v1
        kind: Pod
        metadata:
          name: test
        spec:
          containers:
            - env:
                - name: METHOD
                - name: LABEL
                  value: grafana_dashboard
                - name: FOLDER
                  value: /tmp/dashboards
                - name: RESOURCE
                  value: both
              image: kiwigrid/k8s-sidecar:0.1.178
              imagePullPolicy: IfNotPresent
              name: grafana-sc-dashboard

3. Wait 10 minutes.

4. Make a change to the ConfigMap and update it in the cluster (one way to trigger this is sketched below).
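For step 4, any edit to the ConfigMap's data should do. The sketch below uses the Python kubernetes client to patch the test ConfigMap from step 1 (assuming it lives in the `default` namespace and a kubeconfig is available; `kubectl edit`/`kubectl apply` works just as well):

    # Hypothetical helper to trigger a MODIFIED event on the test ConfigMap above.
    from kubernetes import client, config

    config.load_kube_config()  # outside the cluster; use load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    # Patch only the data key; the sidecar should see this as a MODIFIED event.
    v1.patch_namespaced_config_map(
        name="grafana-test-dashboard",
        namespace="default",
        body={"data": {"cm-test.json": '{"changed": true}'}},
    )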

## Expected Behaviour

The sidecar is notified of the modification and picks up the updated ConfigMap.

## Actual Behaviour

Nothing happens; the change is not detected.

Reproduced on AKS with Kubernetes version 1.16.10.

## About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 14
  • Comments: 44 (17 by maintainers)

## Most upvoted comments

Cross posting from kubernetes/kube-state-metrics#694:

I have started to think this issue mainly happens on hosted Kubernetes services like AKS and EKS. There are many people here using it without any issue, and another set of users who are repeatedly running into it.

Also see this weird issue where watching suddenly stops for some set of users of a Python implementation: kiwigrid/k8s-sidecar#85

Maybe the issue lies in the way the apiserver is hosted behind a managed load balancer.

Actually, let's do the same quick poll here among AKS/EKS users. Please vote on this entry with:

  • upvote (+1) if you are experiencing this issue on AKS/EKS
  • downvote (-1) if you are not experiencing this issue on AKS/EKS

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I’m also seeing this issue. Running quay.io/kiwigrid/k8s-sidecar:1.12.3 on AKS. Did you all work around this issue somehow or should the issue be reopened?

On our side we have been using a new sidecar: https://github.com/OmegaVVeapon/kopf-k8s-sidecar

It works like a charm so far; the only notable difference is the added RBAC needed to allow the sidecar to patch the resources.

Could you share your workaround/solution, please?

The default behaviour is to watch resources. This can be changed to SLEEP, which is polling; see README.md, env var METHOD.

Drawback of SLEEP: deletions are not recognized.
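To illustrate why deletions slip through, here is a minimal sketch of a SLEEP-style polling loop. This is an assumption about the general shape, not the sidecar's actual code; `LABEL` and `FOLDER` are the env vars from the repro above, while `SLEEP_TIME` is assumed here for the poll interval.

    import os
    import time

    from kubernetes import client, config

    config.load_incluster_config()  # the sidecar runs inside the pod
    v1 = client.CoreV1Api()

    label = os.environ["LABEL"]                         # e.g. "grafana_dashboard"
    folder = os.environ["FOLDER"]                       # e.g. "/tmp/dashboards"
    sleep_time = int(os.environ.get("SLEEP_TIME", 60))  # assumed env var name

    while True:
        # Poll: list everything carrying the label and (re)write its files.
        for cm in v1.list_config_map_for_all_namespaces(label_selector=label).items:
            for filename, content in (cm.data or {}).items():
                with open(os.path.join(folder, filename), "w") as f:
                    f.write(content)
        # Nothing ever removes files for ConfigMaps that have disappeared,
        # which is why deletions go unnoticed in this mode.
        time.sleep(sleep_time)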

@djsly We noticed exactly the same thing about dashboards not getting deleted.

With a specific use of this sidecar (Grafana) earlier today, this behaviour under SLEEP mode caused a denial-of-service on one of our Grafana environments when a ConfigMap was effectively renamed (same content, one ConfigMap deleted, another one added). The dashboard with the old name not being deleted, combined with the new one being detected as not having a unique uid, led to https://github.com/grafana/grafana/issues/14674. Apart from hundreds of thousands of logged Grafana errors, this seemingly caused the dashboard version to churn in the Grafana DB between two different versions, bloating our DB and eventually running us out of space for the Grafana DB on our PVs. Cool. 😃

So we went back to WATCH mode; and upgraded the sidecar to 1.3.0 which includes the kube client bump.

In our case, as far as we noticed, WATCH mode only occasionally lost connectivity/stopped watching, rather than every 10 minutes/few hours as some people have observed here. Since we run on AWS EKS, one theory was that the watches get terminated during control plane/master upgrades by AWS and are not re-established reliably, but that was just a theory given how infrequently we had experienced issues with WATCH. We'll see how we go.

No; as you can see, there was no feedback on whether it works with image 0.1.209.

I have tried it with image 0.1.209, and it doesn't work.

I have the same issue with AKS 1.16.10 and sidecar 0.1.151. When the sidecar started up it ran OK; however, the loop failed after 30 minutes with this error: [2020-08-27 15:25:47] ProtocolError when calling kubernetes: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

Ah, I always thought this repo was in Go. Now I have checked the code for the first time with @vsliouniaev's https://github.com/kubernetes-client/python/issues/1148#issuecomment-626184613 comment in mind, and I see that we can test this out pretty easily.

Apparently, details about these settings are now covered in https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md. The server-side timeout has a default (a random value between 1800 and 3600 seconds) in the Python library, but the client-side timeout seems to default to None.

We can give the below change a try here.

Update this part: https://github.com/kiwigrid/k8s-sidecar/blob/cbb48df6e75d75efea9b94b65405596419ce12ed/sidecar/resources.py#L193-L199

As this:

    additional_args = {
        'label_selector': label_selector,

        # Tune default timeouts as outlined in
        # https://github.com/kubernetes-client/python/issues/1148#issuecomment-626184613
        # https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md
        # I picked 60 and 66 due to https://github.com/nolar/kopf/issues/847#issuecomment-971651446

        # 60 is a polite request to the server, asking it to cleanly close the connection after that.
        # If you have a network outage, this does nothing.
        # You can set this number much higher, maybe to 3600 seconds (1h).
        'timeout_seconds': int(os.environ.get('WATCH_SERVER_TIMEOUT', 60)),

        # 66 is a client-side timeout, configuring your local socket.
        # If you have a network outage dropping all packets with no RST/FIN,
        # this is how long your client waits before realizing & dropping the connection.
        # You can keep this number low, maybe 60 seconds.
        '_request_timeout': int(os.environ.get('WATCH_CLIENT_TIMEOUT', 66)),
    }
    ...
    stream = watch.Watch().stream(getattr(v1, _list_namespace[namespace][resource]), **additional_args)

This is also effectively what the alternative kopf-based implementation does in https://github.com/OmegaVVeapon/kopf-k8s-sidecar/blob/main/app/sidecar_settings.py#L58-L70; also see https://github.com/nolar/kopf/issues/585 for the historical context of these settings.

And, ironically, the kopf project (which OmegaVVeapon/kopf-k8s-sidecar is based on) currently has https://github.com/nolar/kopf/issues/847 open, which seems related, but I guess that's another edge case. We have been using kopf in our AKS clusters without regular issues, while this particular kiwigrid/k8s-sidecar issue is quite frequent.

I’ll try to give my suggestion above a chance if I can reserve some time, but given that we already have a workaround in place (https://github.com/grafana/helm-charts/issues/18#issuecomment-776160566), it won’t likely be soon.

I don't think it should matter how the API server is hosted. We're running this sidecar as part of Grafana in 1000+ clusters, all of which were set up by us (not a cloud provider), and we were seeing intermittent failures until we just switched to polling every 1m. I can't say whether this was after 10m or longer, as reloading was mostly irrelevant in our use case.

I believe the cause is an intermittent network connection error where a watch hasn't timed out but has simply dropped, and the client does not attempt a reconnect. I believe the fix is described here: https://github.com/kubernetes-client/python/issues/1148#issuecomment-626184613
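To make that failure mode and the proposed fix concrete, here is a minimal sketch along the lines of that comment (an illustration, not the sidecar's actual code): a server-side `timeout_seconds` so the apiserver closes the stream cleanly, a client-side `_request_timeout` so a silently dropped connection is eventually noticed, and an outer loop that simply re-establishes the watch.

    import urllib3
    from kubernetes import client, config, watch

    config.load_incluster_config()
    v1 = client.CoreV1Api()

    while True:
        try:
            stream = watch.Watch().stream(
                v1.list_config_map_for_all_namespaces,
                label_selector="grafana_dashboard",
                timeout_seconds=60,    # ask the server to end the stream cleanly after 60s
                _request_timeout=66,   # client-side socket timeout, catches silent drops
            )
            for event in stream:
                print(event["type"], event["object"].metadata.name)
        except (urllib3.exceptions.ProtocolError,
                urllib3.exceptions.ReadTimeoutError):
            # e.g. "Connection reset by peer" as reported above; just reconnect
            pass
        # When the stream ends (server or client timeout), loop and watch again.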

Thanks for merging. I updated the deployment yesterday to the Docker image tag 0.1.259, and this morning, 15 hours later, it still detects modifications on ConfigMaps 👍 There is also a change in the log: about every 30 to 60 minutes there's an entry like:

[2020-10-28 07:29:07] ApiException when calling kubernetes: (410) Reason: Expired: too old resource version: 194195784 (199007602)

And so the resource watcher gets restarted.
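That 410 is expected once a watch is resumed with a resource version the apiserver no longer retains; the usual pattern (a sketch of the general idea, not necessarily what the sidecar does internally) is to catch the ApiException, drop the stale resource version, and start a fresh watch:

    from kubernetes import client, config, watch
    from kubernetes.client.rest import ApiException

    config.load_incluster_config()
    v1 = client.CoreV1Api()

    resource_version = None
    while True:
        kwargs = {"label_selector": "grafana_dashboard", "timeout_seconds": 60}
        if resource_version:
            kwargs["resource_version"] = resource_version
        try:
            for event in watch.Watch().stream(v1.list_config_map_for_all_namespaces, **kwargs):
                # Remember where we got to so the next watch can resume from here.
                resource_version = event["object"].metadata.resource_version
                print(event["type"], event["object"].metadata.name)
        except ApiException as e:
            if e.status == 410:
                # "too old resource version": forget it and re-list from scratch
                resource_version = None
            else:
                raise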

BTW, the tag 1.2.0 had a build error; that's why I used 0.1.259 from the CI build.

@monotek Could you please take care of it?