image-automation-controller: Libgit2 Intermittent Crash

The controller intermittently crashes with a segmentation fault during a libgit2 Fetch operation:

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x636365207d pc=0x1a2372e]

runtime stack:
runtime.throw({0x1e28bd7?, 0x7f5f8b005348?})
	runtime/panic.go:992 +0x71
runtime.sigpanic()
	runtime/signal_unix.go:802 +0x3a9

goroutine 608 [syscall, locked to thread]:
runtime.cgocall(0x16b6b70, 0xc00061c7c8)
	runtime/cgocall.go:157 +0x5c fp=0xc00061c7a0 sp=0xc00061c768 pc=0x404b7c
github.com/libgit2/git2go/v33._Cfunc_git_remote_fetch(0x7f5f8bedd370, 0xc00131bc70, 0xc000ae1380, 0x0)
	_cgo_gotypes.go:7121 +0x4c fp=0xc00061c7c8 sp=0xc00061c7a0 pc=0x14affec
github.com/libgit2/git2go/v33.(*Remote).Fetch.func2(0xc000ae1380?, 0xc00136af30?, 0x0?, 0x90?)
	github.com/libgit2/git2go/v33@v33.0.9/remote.go:1044 +0xa7 fp=0xc00061c820 sp=0xc00061c7c8 pc=0x14eabc7
github.com/libgit2/git2go/v33.(*Remote).Fetch(0xc0011032c0, {0xc00061ccb0, 0x1, 0x1}, 0x0?, {0x0?, 0x0})
	github.com/libgit2/git2go/v33@v33.0.9/remote.go:1044 +0x1f2 fp=0xc00061c8d0 sp=0xc00061c820 pc=0x14ea892
github.com/fluxcd/source-controller/pkg/git/libgit2.(*CheckoutBranch).Checkout(0xc0010fcaa0, {0x2088f68, 0xc001103200}, {0xc000c8d740, 0x26}, {0xc00084d710, 0x2e}, 0xc001083680)
	github.com/fluxcd/source-controller@v0.25.9/pkg/git/libgit2/checkout.go:142 +0x9f3 fp=0xc00061ce28 sp=0xc00061c8d0 pc=0x15534d3
github.com/fluxcd/image-automation-controller/controllers.cloneInto({0x2088f68, 0xc001103200}, {0xc001083680?, {0xc00084d710?, 0x3?}}, 0xc0011031a0?, {0xc000c8d740, 0x26})
	github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:575 +0x188 fp=0xc00061cf20 sp=0xc00061ce28 pc=0x16aac08
github.com/fluxcd/image-automation-controller/controllers.(*ImageUpdateAutomationReconciler).Reconcile(0xc000bb3640, {0x2088fa0, 0xc0011d6270}, {{{0xc000cda110?, 0x1d1c660?}, {0xc000cde090?, 0x30?}}})
	github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:294 +0x1a12 fp=0xc00061dc98 sp=0xc00061cf20 pc=0x16a6fd2
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc0000fe370, {0x2088fa0, 0xc0011d61e0}, {{{0xc000cda110?, 0x1d1c660?}, {0xc000cde090?, 0x404314?}}})
	sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:114 +0x27e fp=0xc00061dd78 sp=0xc00061dc98 pc=0x128d95e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000fe370, {0x2088ef8, 0xc000bb3580}, {0x1c2d2c0?, 0xc0005cf700?})
	sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:311 +0x349 fp=0xc00061dee0 sp=0xc00061dd78 pc=0x128f929
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000fe370, {0x2088ef8, 0xc000bb3580})
	sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266 +0x1d9 fp=0xc00061df80 sp=0xc00061dee0 pc=0x128f159
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227 +0x85 fp=0xc00061dfe0 sp=0xc00061df80 pc=0x128eba5
runtime.goexit()
	runtime/asm_amd64.s:1571 +0x1 fp=0xc00061dfe8 sp=0xc00061dfe0 pc=0x46c7e1
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:223 +0x31c

For the full log, refer to https://github.com/fluxcd/image-automation-controller/issues/339#issuecomment-1187026366.
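A side note on the trace above, offered as a debugging aid rather than a confirmed diagnosis: the faulting address in the SIGSEGV line (addr=0x636365207d) is made up entirely of printable ASCII bytes. When a pointer looks like text, it can indicate that string data overwrote the pointer on the C side (for example through a use-after-free or buffer overrun), although nothing in this report confirms that. The sketch below decodes such an address in little-endian order, which is the in-memory layout on amd64 (the trace references runtime/asm_amd64.s). The helper names addrBytes and isPrintableASCII are hypothetical and exist only for this illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// addrBytes returns the bytes of addr in little-endian order (the in-memory
// layout on amd64), with trailing zero bytes trimmed.
// Hypothetical helper, not part of any Flux or libgit2 API.
func addrBytes(addr uint64) []byte {
	buf := make([]byte, 8)
	binary.LittleEndian.PutUint64(buf, addr)
	n := len(buf)
	for n > 0 && buf[n-1] == 0 {
		n--
	}
	return buf[:n]
}

// isPrintableASCII reports whether b is non-empty and every byte falls in
// the printable ASCII range (0x20 through 0x7e).
func isPrintableASCII(b []byte) bool {
	if len(b) == 0 {
		return false
	}
	for _, c := range b {
		if c < 0x20 || c > 0x7e {
			return false
		}
	}
	return true
}

func main() {
	// The addr= value copied from the SIGSEGV line of the crash report.
	const faultAddr = 0x636365207d
	b := addrBytes(faultAddr)
	fmt.Printf("bytes=% x text=%q printable=%v\n", b, b, isPrintableASCII(b))
	// prints: bytes=7d 20 65 63 63 text="} ecc" printable=true
}
```

By contrast, an ordinary heap pointer from the same trace, such as 0x7f5f8bedd370, contains non-printable bytes and does not pass the check. A printable fault address is only a heuristic hint and does not by itself identify the faulty code path.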

About this issue

State: closed
Created 2 years ago
Reactions: 3
Comments: 20 (11 by maintainers)

Most upvoted comments

Thanks @aryan9600, that works! Are there any plans to release the changes from the rc-0d41741b image?

fabidick22 on Nov 29, 2022

After extensive testing, the best approach to mitigating the panic was to move away from libgit2 as the gitImplementation and use go-git instead, which is the default behaviour as of v0.27.0.

For more information, please refer to the v0.27.0 changelog.

I am closing this for the time being. If users continue to experience intermittent crashes after migrating to the latest version, we will reopen this issue.

After upgrading to Flux v0.37.0 (image-automation-controller:v0.27.0) with the Terraform provider, we started getting OOMKilled events in this controller. The controller seems to work fine right after the update, but after a few minutes it starts getting OOMKilled and then enters a CrashLoopBackOff state. After the first restart, it seems to use more resources. I had to roll back to Flux v0.36.0 (image-automation-controller:v0.26.1), which seems to be more stable.

Note: there are no logs related to this error.

I have 27 ImageRepository resources with an interval of 1 minute. Could this be the cause of the OOM events?

fabidick22 on Nov 29, 2022

The improved resource management seems to help, though on some clusters it is harder to see a difference. I might look into adjusting some of it and will keep an eye on the issue.

paha on Oct 14, 2022

@paha, we have recently released an RC version with the latest changes to the image automation controller. Could you please use the image below and check whether it fixes the problem you are experiencing?

ghcr.io/fluxcd/image-automation-controller:rc-8f7a773c

It would be great if you could report back any issues or improvements you observe.

pjbgf on Oct 4, 2022