flux2: Random failure of helm-controller to get last release revision

Describe the bug

Hi guys,

We run 20+ k8s clusters with workloads managed by Flux. Recently I observed that, on three environments and starting at different dates and times, all the Helm releases got stuck upgrading and Flux started to throw the following alert for every HelmRelease:

helmrelease/<hr-name>.flux-system
reconciliation failed: failed to get last release revision: query: failed to query with labels: Unauthorized

The quick way to fix it was to bounce the helm-controller: k rollout restart deployment -n flux-system helm-controller. I had to fix all the environments quickly, as they are production clusters.
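For anyone hitting the same thing, a minimal sketch of that workaround, assuming the default flux-system namespace and deployment name, is:

  # restart the helm-controller so it starts with a fresh service account token
  kubectl rollout restart deployment -n flux-system helm-controller
  # wait for the new pod to become ready
  kubectl rollout status deployment -n flux-system helm-controller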

Have you observed this problem before? Any ideas why it happens and, more importantly, how to prevent it from happening?

Steps to reproduce

N/A

Expected behavior

N/A

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

13.3

Flux check

N/A

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 16 (5 by maintainers)

Most upvoted comments

Same for me; restarting the helm-controller pod fixed the problem.

@migspedroso which version of Flux are you using? We fixed the stale token issue for helm-controller in v0.31
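A quick way to check which helm-controller version a cluster is running (a sketch; flux check reports the installed component versions, and the image-tag lookup assumes the standard flux-system install):

  # report the versions of the Flux components installed in the cluster
  flux check
  # or inspect the helm-controller image tag directly
  kubectl get deployment -n flux-system helm-controller \
    -o jsonpath='{.spec.template.spec.containers[0].image}'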

At first sight this looks like the helm-controller Pod lost access rights to some API resources.

It seems that Helm can’t list Secrets to find the release storage, as if the helm-controller service account had lost its privileges. But if that were the case, all the other API queries should have failed before it reached the Helm function.

Maybe these HelmReleases have spec.ServiceAccountName specified?
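One way to check that across a cluster (a sketch; the manifest field is spec.serviceAccountName, and the column is empty when it is not set):

  # list every HelmRelease with its configured service account, if any
  kubectl get helmreleases -A \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,SA:.spec.serviceAccountName'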

Page 540: https://docs.aws.amazon.com/eks/latest/userguide/eks-ug.pdf

You see these errors if your service account token has expired on a 1.21 or later cluster.

As mentioned in the Kubernetes 1.21 (p. 69) and 1.22 (p. 67) release notes, the BoundServiceAccount token feature that graduated to beta in 1.21 improves the security of service account tokens by allowing workloads running on Kubernetes to request JSON web tokens that are audience, time, and key bound. Service account tokens now have an expiration of one hour. To enable a smooth migration of clients to the newer time-bound service account tokens, Kubernetes adds an extended expiry period to the service account token over the default one hour. For Amazon EKS clusters, the extended expiry period is 90 days. Your Amazon EKS cluster’s Kubernetes API server rejects requests with tokens older than 90 days.

The helm-controller pod was 91 days old when this problem happened. Restarting the pod, which refreshed the service account token, brought it back to normal.
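To see whether a cluster is approaching the same 90-day cutoff, the helm-controller pod’s age can be checked with (a sketch, assuming the standard app=helm-controller label used by the Flux manifests):

  # show the age of the helm-controller pod; per the EKS note above, its token is rejected after 90 days
  kubectl get pods -n flux-system -l app=helm-controller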

Same here, fixed by restart