source-controller: Azure DevOps: Source controller getting stuck
Hello.
We have 3 AKS clusters, all running the exact same versions of flux (0.16.1) in two different Azure regions (North Europe and East US).
The source-controller version is 0.15.3.
❯ k describe deploy source-controller -n flux-system --context aks-stag-eun | grep -i image
Image: ghcr.io/fluxcd/source-controller:v0.15.3
❯ k describe deploy source-controller -n flux-system --context aks-stag-ue | grep -i image
Image: ghcr.io/fluxcd/source-controller:v0.15.3
Both clusters are syncing with the same Azure DevOps git repositories (gitImplementation: libgit2).
Everything is working great on the East US clusters, but in North Europe source-controller gets stuck multiple times a day, and only killing it seems to make the sources reconcile again (we’ve created a cronjob to restart source-controller every half an hour).
Even when restarting every half an hour, we’re still getting a lot of gaps where there’s no source reconciliation.
In this state, any manual reconciliation also gets stuck and never finishes:
> flux reconcile source git core -n core --context aks-stag-eun
► annotating GitRepository core in core namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
There are no logs from source-controller when it’s in this locked state …
I’m pretty sure it’s a connectivity problem to Azure DevOps or something not directly related to source-controller, but maybe it should recover or time out from whatever it’s trying to do (?)
I’ve also increased concurrent from the default 2 to 6, but it doesn’t seem to make any difference.
Thanks!
About this issue
- State: closed
- Created 3 years ago
- Reactions: 5
- Comments: 47 (16 by maintainers)
@pjbgf after two days, we got 0 restarts and still reconciling with no issues.
@mfamador as you mentioned, the restarts are orthogonal to the initially reported issue so I created a new issue for that one, whilst I will be closing this one.
Thank you so much for all the help getting this resolved.
@mfamador that’s great news, thank you for helping us through this. 🙇
We will release a new patch with this fix later on this week.
@mfamador after some additional changes I think we have an RC that may also mitigate the restarting issue. The changes improve the connection management and resolve a leak that we were experiencing in specific scenarios. Would you mind giving it a try please?
ghcr.io/fluxcd/source-controller:rc-a00d0edc
@mfamador that’s absolutely fine, thank you for all the information. We will be releasing a minor patch today including a potential fix for this under version v0.22.5.
I forgot to apply the patch with the new env var with the experimental managed transport. Until adding it, the version v0.22.4 actually worked pretty nicely and didn’t block for almost 4 hours. After applying the patch with the EXPERIMENTAL_GIT_TRANSPORT env var it’s breaking now:
The source-controller deployment:
@kingdonb I have now restarted source-controller and it started to pull the latest version of the code. FYI, we run flux from a single repo, using path separation for multiple environments/clusters. I think this occurred because we had a git revert on the path specific to this cluster, after which it never pulled. Restarting source-controller fixed it. Thanks for the heads up, I will upgrade to the latest Flux.