image-automation-controller: Controller stops reconciling, needs restart
Reported here: https://github.com/fluxcd/flux2/discussions/2219
I have an automation that should reconcile every 7 minutes:
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: flux-system
  namespace: flux-system
spec:
  git:
    checkout:
      ref:
        branch: master
    commit:
      author:
        email: me@example.com
        name: me
      messageTemplate: '{{range .Updated.Images}}{{println .}}{{end}}'
    push:
      branch: master
  interval: 7m0s
  sourceRef:
    kind: GitRepository
    name: flux-system
  update:
    path: ./staging
    strategy: Setters
The reconciliation stopped two days ago for unknown reasons:
$ date
Fri Dec 17 16:25:48 EET 2021
$ flux get image update
NAME READY MESSAGE LAST RUN SUSPENDED
flux-system True no updates made; last commit 8574614 at 2021-12-14T22:47:08Z 2021-12-15T08:15:01-07:00 False
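A quick way to confirm the controller has genuinely stalled (rather than failing with errors) is to check whether it is still emitting reconcile log lines and compare that with the last run reported by the CLI. A minimal sketch, assuming the default flux-system namespace and deployment name:
# look for recent reconcile activity or errors in the controller logs
$ kubectl -n flux-system logs deployment/image-automation-controller --since=1h
# compare the reported last run against the configured 7m interval
$ flux get image update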
About this issue
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 38 (14 by maintainers)
Closing this for lack of activity. Similarly reported issues have been confirmed to be fixed.
Now with Managed Transport enforcing timeouts for Git operations, this should be resolved.
If it reoccurs, given the sheer amount of changes to the Git implementation in the last 6 months, we are better off creating a new issue and linking back to this one.
@maxbrunet thank you for the quick response. Would you be able to collect a profile and share it either here or on Slack, please?
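For a hang like this, a goroutine dump is usually the most useful profile. Below is a minimal sketch of collecting one; it assumes the controller serves the standard Go net/http/pprof endpoints on its health port, and both the port number (9440) and the availability of pprof are assumptions to verify against your controller version and flags:
# forward the controller's health/pprof port locally
$ kubectl -n flux-system port-forward deployment/image-automation-controller 9440:9440
# in another terminal, dump all goroutine stacks
$ curl -s "http://localhost:9440/debug/pprof/goroutine?debug=2" > goroutines.txt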
The image-automation-controller version v0.21.0 introduces an experimental transport that fixes the issue in which the controller stops working in some specific scenarios.
The experimental transport is opt-in: set the environment variable EXPERIMENTAL_GIT_TRANSPORT to true in the controller’s Deployment. Once this feature has been tested extensively, it may later become enabled by default.
Due to changes in other Flux components, it is recommended that all components are deployed at their latest versions. The recommended approach is via flux bootstrap using the Flux CLI version v0.28.0, which will be released tomorrow.
It would be great if users experiencing this issue could test again with the experimental transport enabled and let us know whether the issue persists.
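On a bootstrapped cluster, editing the Deployment by hand gets reverted on the next flux-system sync, so the environment variable is normally added through a patch in the flux-system kustomization. A minimal sketch under those assumptions (default bootstrap layout; the container name manager matches the upstream Flux manifests, but verify it against your Deployment):
# flux-system/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: image-automation-controller
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: image-automation-controller
        namespace: flux-system
      spec:
        template:
          spec:
            containers:
              - name: manager
                env:
                  - name: EXPERIMENTAL_GIT_TRANSPORT
                    value: "true"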
@ahisette yes, the libgit2 timeout callback could be the reason; please try out the image from #297 and see if the problem goes away.
I’ve gone to some lengths to try to reproduce this issue. I ran image-automation-controller against a larger-than-average Git repository (stuffed with several mp4 video files), ramped up unfavorable network conditions (packet loss, latency) with Chaos Mesh, and reconfigured the liveness checks so that image-automation-controller wouldn’t be restarted for network reasons (which was tricky, because it actually needs the network in order to perform leader election).
With webhooks configured as receivers for image and Git events, so that everything happened quickly after each commit/image release, I ran this for several hours with updates every 45 seconds, and I wasn’t able to get image-automation-controller into any stuck or hanging state. Heavy packet loss could make it stop working, but nothing I did seemed to induce any sort of hanging behavior; when the unfavorable conditions abated, the controller always recovered and went back to committing and pushing changes for me.
If anyone knows what type of network issue or abnormal response from GitHub triggers the condition, then I can surely reproduce it and make progress on this issue, but right now I have not made significant progress.
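For anyone who wants to repeat this kind of experiment, here is a rough sketch of the fault injection described above. It assumes Chaos Mesh is installed and that the controller pods carry the app: image-automation-controller label used by the upstream manifests; the loss percentage and duration are arbitrary:
# NetworkChaos sketch: inject packet loss against the controller pods
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: iac-packet-loss
  namespace: flux-system
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - flux-system
    labelSelectors:
      app: image-automation-controller
  loss:
    loss: "40"        # drop roughly 40% of packets
    correlation: "25"
  duration: 10m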
Hello,
In my case, on the stuck controller, /tmp contains a directory named after the GitRepository source of the frozen ImageUpdateAutomation.
A simple restart of the image-automation-controller is enough to unblock the frozen ImageUpdateAutomation.
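A sketch of checking for that leftover clone and applying the restart workaround mentioned throughout this thread, assuming the default flux-system namespace and that the controller image ships basic shell utilities:
# look for a leftover working directory named after the GitRepository
$ kubectl -n flux-system exec deploy/image-automation-controller -- ls -la /tmp
# workaround: restart the controller (and source-controller if it is still stuck)
$ kubectl -n flux-system rollout restart deployment image-automation-controller
$ kubectl -n flux-system rollout restart deployment source-controller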
You mean checking the directory contents inside the “stuck” controller pod, don’t you? (I checked it in a “working OK” pod and it was empty.)
I can’t be 100% sure, as I couldn’t get to any logs or metrics confirming what was actually happening.
In the first two cases, restarting image-automation-controller was enough and new images were applied to the cluster just seconds after the restart. But then, on the third occurrence, we restarted image-automation-controller and nothing happened for over 10 minutes. It was a blind shot by my colleague to also restart source-controller, after which everything started working.
The situation repeated exactly like the above one more time.
Hi, while waiting for a permanent fix, is there any advice on how to detect the “stuck” image-automation-controller case? Are there any metrics or logs we should monitor? We have been suffering from the issue (a couple of times per week) for some time, and the only alert comes from users who pushed their images to the container registry and didn’t see a cluster deployment for quite some time.
Another observation is that in most cases restarting image-automation-controller is sufficient, but there were two times when we also needed to restart source-controller.
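One way to detect the stalled state is to alert when the controller stops recording reconciliations altogether. The sketch below assumes the Prometheus Operator is in use and that the gotk_reconcile_duration_seconds histogram exported by the Flux controllers of that era is being scraped; the 30m window is arbitrary but should sit well above the configured interval (7m in this report):
# PrometheusRule sketch: fire when no ImageUpdateAutomation reconcile has been recorded for 30m
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flux-image-automation-stalled
  namespace: flux-system
spec:
  groups:
    - name: flux-image-automation
      rules:
        - alert: ImageUpdateAutomationStalled
          expr: increase(gotk_reconcile_duration_seconds_count{kind="ImageUpdateAutomation"}[30m]) == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: image-automation-controller appears to have stopped reconciling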