image-automation-controller: image-automation-controller not reconnecting after operation timed out
Describe the bug
image-automation-controller doesn’t reconnect to GitHub after an "Operation timed out" error. I have to delete the pod to restart it. Below is the log from image-automation-controller.
{"level":"error","ts":"2021-07-22T14:57:58.859Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","error":"unable to clone 'https://github.com/CoverGo/k8s-fleet', error: failed to connect to github.com: Operation timed out"}
{"level":"error","ts":"2021-07-22T15:03:04.528Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","error":"unable to clone 'https://github.com/CoverGo/k8s-fleet', error: failed to connect to github.com: Operation timed out"}
Steps to reproduce
I don’t know how to reproduce it, because the operation timeout can happen at any time.
Expected behavior
image-automation-controller can reconnect automatically.
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
0.16.1
Flux check
► checking prerequisites
✔ kubectl 1.21.0 >=1.18.0-0
✔ Kubernetes 1.18.8-aliyun.1 >=1.16.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.11.1
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.14.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.11.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.13.2
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.15.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.15.3
✔ all checks passed
Git provider
github
Container Registry provider
No response
Additional context
No response
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: closed
- Created 3 years ago
- Comments: 36 (15 by maintainers)
We have a new release candidate that further improves the controller:
ghcr.io/fluxcd/image-automation-controller:rc-48bcca59
Two important changes: a) Managed Transport is enabled by default, and b) context timeouts are now enforced.
The image-automation controller version v0.21.0 introduces an experimental transport that fixes the issue in which the controller stops working in some specific scenarios.
The experimental transport needs to be opted-in by setting the environment variable EXPERIMENTAL_GIT_TRANSPORT to true in the controller’s Deployment.
This will require a redeploy of all components so I would recommend doing so via flux bootstrap using the flux cli version v0.28.0 which will be released tomorrow.
Can you test it again with the experimental transport enabled and let us know how you get on please?
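For anyone following along, one way to set that environment variable so it survives a re-bootstrap is a Kustomize patch in the bootstrap repository. The sketch below is only an illustration under a few assumptions that are not stated in this thread: the file is the flux-system/kustomization.yaml of a default bootstrap layout, the Deployment lives alongside gotk-components.yaml, and the controller container is named manager. Check your own manifests before applying it.

```yaml
# flux-system/kustomization.yaml
# Assumption: default layout generated by flux bootstrap.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: image-automation-controller
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: image-automation-controller
      spec:
        template:
          spec:
            containers:
              # Assumption: the controller container is named "manager".
              - name: manager
                env:
                  # Opt in to the experimental Git transport (v0.21.0 and later).
                  - name: EXPERIMENTAL_GIT_TRANSPORT
                    value: "true"
```

The value is quoted because Kubernetes expects environment variable values to be strings. Once the change is committed and reconciled, the controller pod should restart with the variable set.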
It is worth keeping an eye out for #326, which if all goes according to plan, will be out next week.
@demisx we have now released the official version, and followed it up with a few patches. Please give it a try against version v0.28.4 or newer.
That’s correct!
@demisx thanks for providing us with all the details thus far. Would you be able to provide a trace profile as well next time you notice the freeze (and before killing the container)? This can be done by running the following: