source-controller: Azure DevOps: Source controller getting stuck

Hello.

We have 3 AKS clusters, all running the exact same versions of flux (0.16.1) in two different Azure regions (North Europe and East US).

The source-controller version is 0.15.3.

❯ k describe deploy source-controller -n flux-system --context aks-stag-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:v0.15.3
❯ k describe deploy source-controller -n flux-system --context aks-stag-ue | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:v0.15.3

Both clusters are syncing with the same Azure DevOps git repositories (gitImplementation: libgit2).

Everything works great on the East US clusters, but in North Europe source-controller gets stuck multiple times a day, and only killing it seems to make the sources reconcile again (we’ve created a cronjob to restart source-controller every half hour).

Even with a restart every half hour, we’re still getting a lot of gaps where there’s no source reconciliation.

In this state, any manual reconciliation also gets stuck and never finishes:

❯ flux reconcile source git core -n core --context aks-stag-eun

► annotating GitRepository core in core namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation

There are no logs from source-controller when it’s in this locked state.

I’m pretty sure it’s a connectivity problem with Azure DevOps or something not directly related to source-controller, but maybe it should recover or time out from whatever it’s trying to do?

I’ve also increased --concurrent from the default 2 to 6, but it doesn’t seem to make any difference.
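To illustrate the "recover or time out" behaviour asked for above, here is a minimal Go sketch of bounding a blocking operation with a deadline. It is not source-controller code: fetchFunc, fetchWithTimeout, hangingFetch, quickFetch, timedOut and demoTimeout are all hypothetical names made up for this example, and the hang is simulated with a sleep rather than a real Git fetch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchFunc stands in for a blocking Git operation (hypothetical type,
// not part of any source-controller API).
type fetchFunc func(ctx context.Context) error

// demoTimeout is an arbitrary short deadline used only for this demo.
const demoTimeout = 50 * time.Millisecond

// fetchWithTimeout runs fetch in its own goroutine and stops waiting once
// the timeout elapses, even if fetch itself never looks at the context.
func fetchWithTimeout(timeout time.Duration, fetch fetchFunc) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// Buffered so the goroutine can still deliver its result and exit
	// after the caller has given up.
	done := make(chan error, 1)
	go func() { done <- fetch(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("fetch abandoned: %w", ctx.Err())
	}
}

// hangingFetch simulates a connection that does not answer in time and
// ignores cancellation.
func hangingFetch(_ context.Context) error {
	time.Sleep(5 * time.Second)
	return nil
}

// quickFetch simulates a healthy fetch.
func quickFetch(_ context.Context) error { return nil }

// timedOut reports whether err was caused by the deadline expiring.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	err := fetchWithTimeout(demoTimeout, hangingFetch)
	fmt.Println("hanging fetch timed out:", timedOut(err))
	fmt.Println("quick fetch error:", fetchWithTimeout(demoTimeout, quickFetch))
}
```

Note that this only frees the caller. The abandoned goroutine keeps running until the underlying call returns, so a real fix would also need the transport itself to honour deadlines or cancellation.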

Thanks!

About this issue

State: closed
Created 3 years ago
Reactions: 5
Comments: 47 (16 by maintainers)

Most upvoted comments

@pjbgf after two days, we got 0 restarts and still reconciling with no issues.

❯ k describe deploy source-controller -n flux-system --context stag-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:rc-a00d0edc
❯ k describe deploy source-controller -n flux-system --context prod-eun | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:rc-a00d0edc

❯ k get pod -n flux-system -l app=source-controller --context stag-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-lz9zq   1/1     Running   0          2d20h
❯ k get pod -n flux-system -l app=source-controller --context prod-eun
NAME                                 READY   STATUS    RESTARTS   AGE
source-controller-57bff99765-pjt4q   1/1     Running   0          2d20h

❯ flux reconcile source git data -n data --context stag-eun
► annotating GitRepository data in data namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ fetched revision master/fd49c50dadae2a2ca451b12b98600651903b8e7e
❯ flux reconcile source git data -n data --context prod-eun
► annotating GitRepository data in data namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ fetched revision master/fd49c50dadae2a2ca451b12b98600651903b8e7e

mfamador on Jun 13, 2022

@mfamador as you mentioned, the restarts are orthogonal to the initially reported issue so I created a new issue for that one, whilst I will be closing this one.

Thank you so much for all the help getting this resolved.

pjbgf on Jul 1, 2022

@mfamador that’s great news, thank you for helping us through this. 🙇

We will release a new patch with this fix later on this week.

pjbgf on Jun 13, 2022

@mfamador after some additional changes I think we have an RC that may also mitigate the restarting issue. The changes improve connection management and resolve a leak that we were experiencing in specific scenarios. Would you mind giving it a try please?

ghcr.io/fluxcd/source-controller:rc-a00d0edc

pjbgf on Jun 9, 2022

@mfamador that’s absolutely fine, thank you for all the information. We will be releasing a minor patch today including a potential fix for this under version v0.22.5.

pjbgf on Mar 30, 2022

I forgot to apply the patch with the new env var for the experimental managed transport. Until I added it, version v0.22.4 actually worked pretty well and didn’t block for almost 4 hours. After applying the patch with the EXPERIMENTAL_GIT_TRANSPORT env var, it’s breaking now:

❯ k describe deploy source-controller -n flux-system | grep -i image
    Image:       ghcr.io/fluxcd/source-controller:v0.22.4
❯ stern source-controller -n flux-system --tail 0

+ source-controller-7776dc897b-npppk › manager
source-controller-7776dc897b-npppk manager panic: runtime error: invalid memory address or nil pointer dereference
source-controller-7776dc897b-npppk manager [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a7cf53]
source-controller-7776dc897b-npppk manager
source-controller-7776dc897b-npppk manager goroutine 534 [running]:
source-controller-7776dc897b-npppk manager github.com/fluxcd/source-controller/pkg/git/libgit2/managed.(*sshSmartSubtransport).Close(0xc00efd3900)
source-controller-7776dc897b-npppk manager 	github.com/fluxcd/source-controller/pkg/git/libgit2/managed/ssh.go:268 +0x93
source-controller-7776dc897b-npppk manager github.com/libgit2/git2go/v33.smartSubtransportCloseCallback(0x404e06, 0xc000603ba0)
source-controller-7776dc897b-npppk manager 	github.com/libgit2/git2go/v33@v33.0.9/transport.go:409 +0x6f
source-controller-7776dc897b-npppk manager github.com/libgit2/git2go/v33._Cfunc_git_clone(0xc000a5edb8, 0x7fd51cfefbb0, 0x7fd51cfefc00, 0xc005d79520)
source-controller-7776dc897b-npppk manager 	_cgo_gotypes.go:3244 +0x4c
source-controller-7776dc897b-npppk manager github.com/libgit2/git2go/v33.Clone.func3(0xc005d79520, 0xc001b3ac60, 0xc0192420c0, 0x1b4db45)
source-controller-7776dc897b-npppk manager 	github.com/libgit2/git2go/v33@v33.0.9/clone.go:43 +0x91
source-controller-7776dc897b-npppk manager github.com/libgit2/git2go/v33.Clone({0xc0008b50c0, 0xc009690a80}, {0xc002322d40, 0x3d}, 0xc001b3ac60)
source-controller-7776dc897b-npppk manager 	github.com/libgit2/git2go/v33@v33.0.9/clone.go:43 +0x19e
source-controller-7776dc897b-npppk manager github.com/fluxcd/source-controller/pkg/git/libgit2.(*CheckoutTag).Checkout(0xc019242060, {0x27ca660, 0xc009690a80}, {0xc002322d40, 0x3d}, {0xc0008b50c0, 0x3e}, 0x0)
source-controller-7776dc897b-npppk manager 	github.com/fluxcd/source-controller/pkg/git/libgit2/checkout.go:97 +0x1e5
source-controller-7776dc897b-npppk manager github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).reconcileSource(0xc0006c0d70, {0x27ca698, 0xc00272c750}, 0xc001a15200, 0xc001ff95f0, 0x18, {0xc002322d40, 0x3d})
source-controller-7776dc897b-npppk manager 	github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:404 +0x99f
source-controller-7776dc897b-npppk manager github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).reconcile(0x2834958, {0x27ca698, 0xc00272c750}, 0xc001a15200, {0xc001225be8, 0x4, 0x40e494})
source-controller-7776dc897b-npppk manager 	github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:244 +0x3d5
source-controller-7776dc897b-npppk manager github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).Reconcile(0xc0006c0d70, {0x27ca698, 0xc00272c750}, {{{0xc000a5fb97, 0x2384b60}, {0xc000a20d08, 0x30}}})
source-controller-7776dc897b-npppk manager 	github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:205 +0x4bb
source-controller-7776dc897b-npppk manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00062a2c0, {0x27ca698, 0xc00272c180}, {{{0xc000a5fb97, 0x2384b60}, {0xc000a20d08, 0x415034}}})
source-controller-7776dc897b-npppk manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114 +0x26f
source-controller-7776dc897b-npppk manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00062a2c0, {0x27ca5f0, 0xc000434600}, {0x2226280, 0xc014bb97a0})
source-controller-7776dc897b-npppk manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311 +0x33e
source-controller-7776dc897b-npppk manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00062a2c0, {0x27ca5f0, 0xc000434600})
source-controller-7776dc897b-npppk manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266 +0x205
source-controller-7776dc897b-npppk manager sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
source-controller-7776dc897b-npppk manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227 +0x85
source-controller-7776dc897b-npppk manager created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
source-controller-7776dc897b-npppk manager 	sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:223 +0x357
- source-controller-7776dc897b-npppk › manager
+ source-controller-7776dc897b-npppk › manager

The source-controller deployment:

❯ k get deploy source-controller -oyaml | grep -C10 EXP
      - args:
        - --concurrent=6
        - --events-addr=http://notification-controller.flux-system.svc.cluster.local./
        - --watch-all-namespaces=true
        - --log-level=info
        - --log-encoding=json
        - --enable-leader-election
        - --storage-path=/data
        - --storage-adv-addr=source-controller.$(RUNTIME_NAMESPACE).svc.cluster.local.
        env:
        - name: EXPERIMENTAL_GIT_TRANSPORT
          value: "true"
        - name: RUNTIME_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: ghcr.io/fluxcd/source-controller:v0.22.4
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3

mfamador on Mar 29, 2022

@kingdonb I have now restarted source-controller and it started to pull the latest version of the code. FYI, we run Flux from a single repo, using path separation for multiple environments/clusters. I think this occurred because we had a git revert on the path specific to this cluster, after which it never pulled. Restarting source-controller fixed it. Thanks for the heads-up; I will upgrade to the latest Flux.

natarajmb on Mar 1, 2022