image-automation-controller: Controller stops reconciling, needs restart
Reported here: https://github.com/fluxcd/flux2/discussions/2219
I have an automation that should reconcile every 7 minutes:
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: flux-system
  namespace: flux-system
spec:
  git:
    checkout:
      ref:
        branch: master
    commit:
      author:
        email: me@example.com
        name: me
      messageTemplate: '{{range .Updated.Images}}{{println .}}{{end}}'
    push:
      branch: master
  interval: 7m0s
  sourceRef:
    kind: GitRepository
    name: flux-system
  update:
    path: ./staging
    strategy: Setters
The reconciliation stopped two days ago for unknown reasons:
$ date
Fri Dec 17 16:25:48 EET 2021
$ flux get image update
NAME READY MESSAGE LAST RUN SUSPENDED
flux-system True no updates made; last commit 8574614 at 2021-12-14T22:47:08Z 2021-12-15T08:15:01-07:00 False
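A quick way to confirm the controller has genuinely stalled (rather than failing with errors) is to check whether it is still emitting reconcile log lines and compare that with the last run reported by the CLI. A minimal sketch, assuming the default flux-system namespace and deployment name:
# look for recent reconcile activity or errors in the controller logs
$ kubectl -n flux-system logs deployment/image-automation-controller --since=1h
# compare the reported last run against the configured 7m interval
$ flux get image update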
About this issue
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 38 (14 by maintainers)
Closing this for lack of activity. Similarly reported issues have been confirmed to be fixed.
Now with Managed Transport enforcing timeouts for Git operations, this should be resolved.
If it reoccurs, given the sheer amount of changes to the Git implementation in the last 6 months, we are better off creating a new issue and linking back to this one.
@maxbrunet thank you for the quick response. Would you be able to collect a profile and share it either here or on Slack, please?
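For a hang like this, a goroutine dump is usually the most useful profile. Below is a minimal sketch of collecting one; it assumes the controller serves the standard Go net/http/pprof endpoints on its health port, and both the port number (9440) and the availability of pprof are assumptions to verify against your controller version and flags:
# forward the controller's health/pprof port locally
$ kubectl -n flux-system port-forward deployment/image-automation-controller 9440:9440
# in another terminal, dump all goroutine stacks
$ curl -s "http://localhost:9440/debug/pprof/goroutine?debug=2" > goroutines.txt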
The image-automation-controller version v0.21.0 introduces an experimental transport that fixes the issue in which the controller stops working in some specific scenarios.
The experimental transport is opt-in: set the environment variable EXPERIMENTAL_GIT_TRANSPORT to true in the controller’s Deployment. Once this feature has been tested extensively, it may later become enabled by default.
Due to changes in other Flux components, it is recommended that all components are deployed at their latest versions. The recommended approach is via flux bootstrap using the Flux CLI version v0.28.0, which will be released tomorrow.
It would be great if users experiencing this issue could test again with the experimental transport enabled and let us know whether the issue persists.
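On a bootstrapped cluster, editing the Deployment by hand gets reverted on the next flux-system sync, so the environment variable is normally added through a patch in the flux-system kustomization. A minimal sketch under those assumptions (default bootstrap layout; the container name manager matches the upstream Flux manifests, but verify it against your Deployment):
# flux-system/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: image-automation-controller
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: image-automation-controller
        namespace: flux-system
      spec:
        template:
          spec:
            containers:
              - name: manager
                env:
                  - name: EXPERIMENTAL_GIT_TRANSPORT
                    value: "true"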
@ahisette yes, the libgit2 timeout callback could be the reason; please try out the image from #297 and see if the problem goes away.
I’ve gone to some lengths to try to reproduce this issue. I ran image-automation-controller against a larger-than-average Git repository (stuffed with several mp4 video files), ramped up unfavorable network conditions (packet loss, latency) with Chaos Mesh, and reconfigured the liveness checks so that image-automation-controller wouldn’t be restarted for network reasons (which was tricky, because it actually needs the network in order to perform leader election).
With webhooks configured as receivers for image and Git events, so that everything happened quickly after each commit/image release, I ran this for several hours with updates every 45 seconds, and I wasn’t able to get image-automation-controller into any stuck or hanging state. Heavy packet loss could make it stop working, but nothing I did seemed to induce any sort of hanging behavior; when the unfavorable conditions abated, the controller always recovered and went back to committing and pushing changes for me.
If anyone knows what type of network issue or abnormal response from GitHub triggers the condition, then I can surely reproduce it and make progress on this issue, but right now I have not made significant progress.
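For anyone who wants to repeat this kind of experiment, here is a rough sketch of the fault injection described above. It assumes Chaos Mesh is installed and that the controller pods carry the app: image-automation-controller label used by the upstream manifests; the loss percentage and duration are arbitrary:
# NetworkChaos sketch: inject packet loss against the controller pods
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: iac-packet-loss
  namespace: flux-system
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - flux-system
    labelSelectors:
      app: image-automation-controller
  loss:
    loss: "40"        # drop roughly 40% of packets
    correlation: "25"
  duration: 10m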
Hello,
In my case, on the stuck controller, /tmp contains a directory named after the GitRepository source of the frozen ImageUpdateAutomation.
A simple restart of the image-automation-controller is enough to unblock the frozen ImageUpdateAutomation.
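A sketch of checking for that leftover clone and applying the restart workaround mentioned throughout this thread, assuming the default flux-system namespace and that the controller image ships basic shell utilities:
# look for a leftover working directory named after the GitRepository
$ kubectl -n flux-system exec deploy/image-automation-controller -- ls -la /tmp
# workaround: restart the controller (and source-controller if it is still stuck)
$ kubectl -n flux-system rollout restart deployment image-automation-controller
$ kubectl -n flux-system rollout restart deployment source-controller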
You mean checking the directory contents inside the “stuck” controller pod, don’t you? (I checked it in a “working OK” pod and it was empty.)
I can’t be 100% sure, as I couldn’t get to any logs or metrics confirming what was actually happening.
In the first two cases, restarting image-automation-controller was enough and new images were applied to the cluster just seconds after the restart. But then, on the third occurrence, we restarted image-automation-controller and nothing happened for over 10 minutes. It was a blind shot by my colleague to also restart source-controller, after which everything started working.
The situation repeated exactly like the above one more time.
Hi, while waiting for a permanent fix, is there any advice on how to detect the “stuck” image-automation-controller case? Are there any metrics or logs we should monitor? We have been suffering from the issue (a couple of times per week) for some time, and the only alert comes from users who pushed their images to the container registry and didn’t see a cluster deployment for quite some time.
Another observation is that in most cases restarting image-automation-controller is sufficient, but there were two times when we also needed to restart source-controller.
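One way to detect the stalled state is to alert when the controller stops recording reconciliations altogether. The sketch below assumes the Prometheus Operator is in use and that the gotk_reconcile_duration_seconds histogram exported by the Flux controllers of that era is being scraped; the 30m window is arbitrary but should sit well above the configured interval (7m in this report):
# PrometheusRule sketch: fire when no ImageUpdateAutomation reconcile has been recorded for 30m
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flux-image-automation-stalled
  namespace: flux-system
spec:
  groups:
    - name: flux-image-automation
      rules:
        - alert: ImageUpdateAutomationStalled
          expr: increase(gotk_reconcile_duration_seconds_count{kind="ImageUpdateAutomation"}[30m]) == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: image-automation-controller appears to have stopped reconciling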