flux2: Random failure of helm-controller to get last release revision
Describe the bug
Hi guys,
We run 20+ k8s clusters with workloads managed by Flux. Recently I observed that on three environments, starting at different dates and times, all the Helm releases got stuck upgrading and Flux started to throw the following alert for each Helm release:
helmrelease/<hr-name>.flux-system
reconciliation failed: failed to get last release revision: query: failed to query with labels: Unauthorized
The quick way to fix that was to bounce the helm-controller: kubectl rollout restart deployment -n flux-system helm-controller. I had to fix all environments quickly as they were production ones.
Have you observed this problem before, or do you have any ideas why this happens and, more importantly, how to prevent it from happening?
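For anyone applying the same workaround, here is a minimal sketch that wraps the restart in a small helper and waits for the new pod to become ready. The function name restart_helm_controller and the 120s timeout are illustrative assumptions, not something from the Flux docs; the namespace defaults to flux-system as in the report.

```shell
# Hypothetical helper: restart helm-controller and wait for the rollout.
# Assumes Flux runs in the flux-system namespace unless another one is passed.
restart_helm_controller() {
  ns="${1:-flux-system}"
  # Trigger a rolling restart so the pod is recreated.
  kubectl rollout restart deployment/helm-controller -n "$ns" &&
    # Block until the new pod is ready (timeout value is an arbitrary choice).
    kubectl rollout status deployment/helm-controller -n "$ns" --timeout=120s
}
```

Calling restart_helm_controller with no argument targets flux-system. Note that this only clears the symptom; it does not address why the controller's requests were rejected as Unauthorized.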
Steps to reproduce
N/A
Expected behavior
N/A
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
13.3
Flux check
N/A
Git provider
No response
Container Registry provider
No response
Additional context
No response
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 16 (5 by maintainers)
Same for me, helm-controller pod restart fixed the problem.
@migspedroso which version of Flux are you using? We fixed the stale token issue for helm-controller in v0.31.
Page 540: https://docs.aws.amazon.com/eks/latest/userguide/eks-ug.pdf
Helm controller’s pod was 91 days old when this problem happened. Restarting the pod and refreshing the service account’s token did bring it back to normal.
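Since the pod's age seems to matter here, a rough sketch for computing a pod's age in days from its creationTimestamp may help spot long-running controllers before they fail. The helper name pod_age_days is hypothetical, it assumes GNU date (for the -d flag), and any threshold you compare against (for example 90 days) is an assumption based on the pod age reported above, not a limit confirmed in this thread.

```shell
# Hypothetical helper: print the age in whole days of a pod, given its
# creationTimestamp (RFC 3339, e.g. 2023-01-01T00:00:00Z).
# Assumes GNU date. The second argument (current time as a Unix epoch)
# is optional and mainly useful for testing.
pod_age_days() {
  created_epoch="$(date -u -d "$1" +%s)" || return 1
  now_epoch="${2:-$(date -u +%s)}"
  echo $(( (now_epoch - created_epoch) / 86400 ))
}

# Example usage (requires cluster access), listing name and timestamp per pod:
#   kubectl get pods -n flux-system \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.creationTimestamp}{"\n"}{end}' |
#   while read -r name ts; do echo "$name $(pod_age_days "$ts") days"; done
```

Upgrading to a Flux release that includes the stale token fix mentioned above is likely the more durable remedy; a check like this is only a stopgap.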
Same here, fixed by restart