# k8s-sidecar: Watching stops working after 10 minutes
When watching a set of ConfigMaps, the sidecar stops being notified of new changes after roughly 10 minutes without any changes.
## Repro steps
- Install the following ConfigMap into the cluster:

  ```yaml
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: grafana-test-dashboard
    labels:
      grafana_dashboard: "1"
  data:
    cm-test.json: "{}"
  ```
- Install the sidecar into the cluster:

  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    name: test
  spec:
    containers:
      - env:
          - name: METHOD
          - name: LABEL
            value: grafana_dashboard
          - name: FOLDER
            value: /tmp/dashboards
          - name: RESOURCE
            value: both
        image: kiwigrid/k8s-sidecar:0.1.178
        imagePullPolicy: IfNotPresent
        name: grafana-sc-dashboard
  ```
- Wait 10 minutes
- Make a change to the ConfigMap and apply the update to the cluster
## Expected Behaviour
A modification event should be observed by the sidecar.

## Actual Behaviour
Nothing happens; the change is never picked up.

Reproduced on AKS with Kubernetes version 1.16.10.
## About this issue
- State: closed
- Created 4 years ago
- Reactions: 14
- Comments: 44 (17 by maintainers)
## Commits related to this issue
- fixes #85 #minor Add proposed timeout configuration parameters. — committed to kiwigrid/k8s-sidecar by jekkel 3 years ago
- Merge pull request #150 from kiwigrid/timeout-issue85 fixes #85 #minor — committed to kiwigrid/k8s-sidecar by jekkel 3 years ago
## Comments

Cross-posting from kubernetes/kube-state-metrics#694:
Actually, let's do the same quick poll here among AKS/EKS users; can you vote on this entry with:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'm also seeing this issue, running quay.io/kiwigrid/k8s-sidecar:1.12.3 on AKS. Did you all work around this issue somehow, or should the issue be reopened?

On our side we have been using a new sidecar: https://github.com/OmegaVVeapon/kopf-k8s-sidecar
Works like a charm so far; the only main difference is the added RBAC to allow the sidecar to patch the resources.
The default behaviour is to watch resources. This can be changed to `SLEEP`, which is polling; see the `METHOD` entry under "Env Vars" in the README.md. Drawback of `SLEEP`: deletions are not recognized.

@djsly We noticed exactly the same thing about dashboards not getting deleted.
With a specific use of this sidecar (Grafana) earlier today, this behaviour under `SLEEP` mode caused a denial-of-service on one of our Grafana environments when a ConfigMap was effectively renamed (same content, one ConfigMap deleted, another one added). The dashboard with the old name not being deleted, and the new one then being detected as not having a unique uid, led to https://github.com/grafana/grafana/issues/14674. Apart from hundreds of thousands of logged Grafana errors, this seemingly caused the dashboard version to churn in the Grafana DB between two different versions, bloating our DB and eventually running out of space for the Grafana DB on our PVs. Cool. 😃

So we went back to `WATCH` mode and upgraded the sidecar to 1.3.0, which includes the kube client bump. In our case `WATCH` mode was only occasionally losing connectivity/stopping watching, as far as we had noticed, rather than every 10 minutes/few hours as some people have observed here. Since we run on AWS EKS, one theory was that the watches might get terminated during control plane/master upgrades by AWS and not re-established reliably, but that was just a theory given how infrequently we had experienced the issues with `WATCH`. Will see how we go.

I have tried it with image 0.1.209, and it doesn't work.
I have the same issue with AKS 1.16.10 and sidecar 0.1.151. When the sidecar starts up it runs OK; however, the loop failed after 30 minutes with this error: [2020-08-27 15:25:47] ProtocolError when calling kubernetes: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
Ah, I always thought this repo was in Go. Now I have checked the code for the first time with @vsliouniaev's https://github.com/kubernetes-client/python/issues/1148#issuecomment-626184613 comment in mind, and I see that we can test this out pretty easily.
Apparently, details about these are now covered in https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md. The server-side timeout has a default in the Python library (a random value between 1800 and 3600 seconds), but the client-side timeout seems to default to None. We can give the change below a try here.
Update this part to pass explicit timeouts to the watch call: https://github.com/kiwigrid/k8s-sidecar/blob/cbb48df6e75d75efea9b94b65405596419ce12ed/sidecar/resources.py#L193-L199
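A minimal sketch of that kind of change, assuming the kubernetes Python client used by the sidecar (the label selector and timeout values here are illustrative, not necessarily what was merged):

```python
# Sketch: add a server-side and a client-side timeout to the watch call.
from kubernetes import client, config, watch

config.load_incluster_config()
v1 = client.CoreV1Api()
w = watch.Watch()

stream = w.stream(
    v1.list_config_map_for_all_namespaces,
    label_selector="grafana_dashboard",
    # Server-side: ask the API server to end the watch after at most this many
    # seconds, so the connection is re-established regularly.
    timeout_seconds=60,
    # Client-side: (connect, read) timeouts in seconds, passed through to urllib3.
    # Without a read timeout, a silently dropped connection can block forever.
    _request_timeout=(5, 66),
)

for event in stream:
    print(event["type"], event["object"].metadata.name)
```

Setting both timeouts means the server cycles the watch regularly and the client notices a dead connection instead of blocking on it indefinitely.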
This is also effectively what the alternative kopf-based implementation does (https://github.com/OmegaVVeapon/kopf-k8s-sidecar/blob/main/app/sidecar_settings.py#L58-L70); see https://github.com/nolar/kopf/issues/585 for the historical context on these settings.
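For illustration, kopf exposes the equivalent knobs through its operator settings; a rough sketch (setting names per kopf's documentation, values illustrative rather than copied from kopf-k8s-sidecar):

```python
# Sketch of kopf's watch timeout settings; the concrete values are illustrative.
import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    settings.watching.server_timeout = 60   # server-side watch timeout
    settings.watching.client_timeout = 66   # client-side overall timeout
    settings.watching.connect_timeout = 5   # client-side connect timeout
```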
And, ironically, the kopf project (which OmegaVVeapon/kopf-k8s-sidecar is based on) currently has https://github.com/nolar/kopf/issues/847 open, which seems to be related, but I guess that's another edge case. We have been using kopf in our AKS clusters without regular issues, whereas this particular kiwigrid/k8s-sidecar issue is quite frequent.
I'll try to give my suggestion above a chance if I can reserve some time, but given that we already have a workaround in place (https://github.com/grafana/helm-charts/issues/18#issuecomment-776160566), it likely won't be soon.
I don't think it should matter how the API server is hosted. We're running this sidecar as part of Grafana in 1000+ clusters, all of which were set up by us (not a cloud provider), and we were seeing intermittent failures until we just switched to polling every 1m. I can't say whether this was after 10m or longer, as reloading was mostly irrelevant in our use case.
I believe the cause is an intermittent network connection error where a watch hasn't timed out but has simply dropped, and the client does not attempt a reconnect. I believe the fix is described here: https://github.com/kubernetes-client/python/issues/1148#issuecomment-626184613
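A minimal sketch of that kind of fix, assuming the kubernetes Python client (the selector, timeout values, and exception handling are illustrative and simplified):

```python
# Sketch of a self-healing watch loop: bounded timeouts plus reconnection.
import urllib3
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

config.load_incluster_config()
v1 = client.CoreV1Api()

resource_version = None
while True:
    w = watch.Watch()
    try:
        for event in w.stream(
            v1.list_config_map_for_all_namespaces,
            label_selector="grafana_dashboard",
            resource_version=resource_version,
            timeout_seconds=60,          # server ends the watch regularly
            _request_timeout=(5, 66),    # client notices dead connections
        ):
            resource_version = event["object"].metadata.resource_version
            print(event["type"], event["object"].metadata.name)
    except (urllib3.exceptions.ProtocolError,
            urllib3.exceptions.ReadTimeoutError):
        # Connection dropped or read timed out: loop around and re-watch.
        continue
    except ApiException as e:
        if e.status == 410:  # resource version too old; start from scratch
            resource_version = None
            continue
        raise
```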
Thanks for merging. I updated the deployment yesterday to the Docker image tag 0.1.259, and this morning, 15 hours later, it still detects modifications on ConfigMaps 👍 There is also a change in the log: about every 30 to 60 minutes there's an entry like:
And so the resource watcher gets restarted.
BTW, the tag 1.2.0 had a build error; that's why I used 0.1.259 from the CI build.
@monotek Could you please take care of it?