source-controller: OOMKilled for a period of time, after which it magically resolves without changes
source-controller just got OOMKilled out of the blue. It started about an hour ago, and I can’t see any prior activity that triggered it: no recently added sources, nothing. It started with rc3. I upgraded to rc4, but the behaviour is the same. Memory usage goes through the roof and the cluster kills the pod.
❯❯❯ flux stats
RECONCILERS            RUNNING  FAILING  SUSPENDED  STORAGE
GitRepository          7        0        0          1.9 MiB
OCIRepository          0        0        0          -
HelmRepository         0        0        0          -
HelmChart              0        0        0          -
Bucket                 0        0        0          -
Kustomization          3        0        0          -
HelmRelease            0        0        0          -
Alert                  0        0        0          -
Provider               0        0        0          -
Receiver               0        0        0          -
ImageUpdateAutomation  6        0        0          -
ImagePolicy            20       0        0          -
ImageRepository        20       0        0          -
❯❯❯ flux check
► checking prerequisites
✔ Kubernetes 1.22.17-eks-0a21954 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.34.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.34.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.28.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.0.0-rc.4
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.0.0-rc.4
flux check gets stuck at this point as the source controller is not responding.
About this issue
- State: open
- Created a year ago
- Reactions: 1
- Comments: 29 (9 by maintainers)
The same thing is happening with some of our clusters as well.
Unless this happens again and we get a proper heap snapshot taken while it is happening, I fear this will be very much like looking for a needle in a haystack.
@cwrau did this start with RC.3 as well?
In addition to this, did you run RC.2 or RC.1 before without issues?
Based on the heap profile shared, I can’t tell what is happening, as it was taken before the actual issue seems to occur. What may help is temporarily increasing the memory limits so that a proper snapshot can be taken while the problem is occurring, without the Pod getting killed.
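For anyone wanting to try this, here is a rough sketch of what raising the limit and grabbing a heap profile could look like. It assumes the controller runs as `deployment/source-controller` in the `flux-system` namespace and that the pprof endpoints are served on port 8080. The `4Gi` limit, the `heap.out` file name, and the helper functions (`run`, `raise_limit`, `capture_heap`) are made up for illustration, so adjust them to your setup.

```shell
#!/usr/bin/env bash
# Sketch only: raise the source-controller memory limit, then pull a heap profile.
set -euo pipefail

NAMESPACE="${NAMESPACE:-flux-system}"   # assumption: default Flux namespace
MEMORY_LIMIT="${MEMORY_LIMIT:-4Gi}"     # hypothetical value, pick what your nodes can spare
PPROF_PORT="${PPROF_PORT:-8080}"        # assumption: port that serves /debug/pprof

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Temporarily raise the memory limit so the Pod survives long enough to be profiled.
raise_limit() {
  run kubectl -n "$NAMESPACE" set resources deployment/source-controller \
    --limits="memory=$MEMORY_LIMIT"
}

# Port-forward to the controller and save the heap profile to a local file.
capture_heap() {
  local out="${1:-heap.out}"
  run kubectl -n "$NAMESPACE" port-forward deployment/source-controller "$PPROF_PORT" &
  local pf_pid=$!
  run sleep 2   # give the port-forward a moment to come up
  local status=0
  run curl -sS -o "$out" "http://localhost:$PPROF_PORT/debug/pprof/heap" || status=$?
  kill "$pf_pid" 2>/dev/null || true   # always stop the port-forward
  return "$status"
}
```

Call `raise_limit` once, wait for memory to climb, then run `capture_heap heap.out`. Setting `DRY_RUN=1` only prints the commands. Note that if the Flux components are themselves managed by a Kustomization, the manual limit change will likely be reverted on the next reconciliation unless that Kustomization is suspended or the limit is patched in Git instead.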