image-automation-controller: image-automation-controller not reconnecting after operation timed out

Describe the bug

image-automation-controller doesn’t reconnect to GitHub after an "Operation timed out" error. I have to delete the pod to get it working again. Below is the log from image-automation-controller.

{"level":"error","ts":"2021-07-22T14:57:58.859Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","error":"unable to clone 'https://github.com/CoverGo/k8s-fleet', error: failed to connect to github.com: Operation timed out"}
{"level":"error","ts":"2021-07-22T15:03:04.528Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","error":"unable to clone 'https://github.com/CoverGo/k8s-fleet', error: failed to connect to github.com: Operation timed out"}

Steps to reproduce

I don’t know how to reproduce this, because the timeout can happen at any time.

Expected behavior

image-automation-controller should reconnect automatically after a timeout.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

0.16.1

Flux check

► checking prerequisites
✔ kubectl 1.21.0 >=1.18.0-0
✔ Kubernetes 1.18.8-aliyun.1 >=1.16.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.11.1
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.14.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.11.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.13.2
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.15.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.15.3
✔ all checks passed

Git provider

github

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 36 (15 by maintainers)

Most upvoted comments

We have a new release candidate that further improves the controller: ghcr.io/fluxcd/image-automation-controller:rc-48bcca59

Two important changes: a) Managed Transport is enabled by default, and b) context timeouts are now enforced.

The image-automation controller version v0.21.0 introduces an experimental transport that fixes the issue in which the controller stops working in some specific scenarios.

The experimental transport is opt-in: set the environment variable EXPERIMENTAL_GIT_TRANSPORT to true in the controller’s Deployment.

This will require a redeploy of all components, so I would recommend doing it via flux bootstrap using the flux CLI version v0.28.0, which will be released tomorrow.
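For a quick test outside of bootstrap, the environment variable described above could be set directly with kubectl. This is only a minimal sketch, not the maintainers' recommended procedure: the Deployment name image-automation-controller and the flux-system namespace are assumptions based on a default install, and the function name and the KUBECTL override are hypothetical helpers added so the command can be previewed before running it.

```shell
# Sketch: enable the experimental Git transport on the controller's Deployment.
# Assumptions (not confirmed in the thread): the Deployment is named
# image-automation-controller and lives in the flux-system namespace.
# KUBECTL is a hypothetical override (e.g. KUBECTL=echo) to preview the command.
enable_experimental_transport() {
  namespace="${1:-flux-system}"
  "${KUBECTL:-kubectl}" set env deployment/image-automation-controller \
    -n "$namespace" EXPERIMENTAL_GIT_TRANSPORT=true
}
```

Running `KUBECTL=echo enable_experimental_transport` prints the kubectl arguments without touching the cluster; running `enable_experimental_transport` applies the change. If the Deployment is itself managed by Flux, a manual change like this may be reverted on a later reconciliation, which is one reason to prefer the bootstrap route.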

Can you test it again with the experimental transport enabled and let us know how you get on please?

It is worth keeping an eye out for #326, which, if all goes according to plan, will be out next week.

@demisx we have now released the official version, and followed it up with a few patches. Please give it a try against version v0.28.4 or newer.

That’s correct!

@demisx thanks for providing us with all the details thus far. Would you be able to provide a trace profile as well next time you notice the freeze (and before killing the container)? This can be done by running the following:

$ kubectl port-forward -n <namespace> deploy/<component> 8080
$ curl -Sk -v "http://localhost:8080/debug/pprof/trace?seconds=10" > trace.out