helm-controller: helmrelease "upgrade retries exhausted" regression
Describe the bug
When a HelmRelease is stuck in `reconciliation failed: upgrade retries exhausted`, only flux CLI v0.16.1 can trigger a successful reconciliation.
Steps to reproduce
When a HelmRelease is stuck in helm-controller with `reconciliation failed: upgrade retries exhausted`, this can normally be fixed by running `./flux reconcile helmrelease` from the command line, but only up to flux CLI v0.16.1.
Expected behavior
`flux reconcile` should trigger a Helm upgrade when the HelmRelease is stuck in `upgrade retries exhausted`.
Screenshots and recordings
This time I upgraded the kube-prometheus-stack HelmRelease.
I tried different flux CLI versions (v0.17.2, v0.16.2), but only v0.16.1 triggered a successful Helm upgrade.
❯ flux -v
flux version 0.17.2
❯ flux reconcile helmrelease -n monitoring infra --with-source
► annotating HelmRepository prometheus-community in flux-system namespace
✔ HelmRepository annotated
◎ waiting for HelmRepository reconciliation
✔ HelmRepository reconciliation completed
✔ fetched revision 6b8293a6fda62b3318b3bbe18e9e4654b07b3c80
► annotating HelmRelease infra in monitoring namespace
✔ HelmRelease annotated
◎ waiting for HelmRelease reconciliation
✔ HelmRelease reconciliation completed
✗ HelmRelease reconciliation failed
❯ ./flux -v
flux version 0.16.2
❯ ./flux reconcile helmrelease -n monitoring infra --with-source
► annotating HelmRepository prometheus-community in flux-system namespace
✔ HelmRepository annotated
◎ waiting for HelmRepository reconciliation
✔ HelmRepository reconciliation completed
✔ fetched revision 6b8293a6fda62b3318b3bbe18e9e4654b07b3c80
► annotating HelmRelease infra in monitoring namespace
✔ HelmRelease annotated
◎ waiting for HelmRelease reconciliation
✔ HelmRelease reconciliation completed
✗ HelmRelease reconciliation failed
❯ ./flux -v
flux version 0.16.1
❯ ./flux reconcile helmrelease -n monitoring infra --with-source
► annotating HelmRepository prometheus-community in flux-system namespace
✔ HelmRepository annotated
◎ waiting for HelmRepository reconciliation
✔ HelmRepository reconciliation completed
✔ fetched revision 6b8293a6fda62b3318b3bbe18e9e4654b07b3c80
► annotating HelmRelease infra in monitoring namespace
✔ HelmRelease annotated
◎ waiting for HelmRelease reconciliation
✔ HelmRelease reconciliation completed
✔ applied revision 18.1.1
❯ k describe hr -n monitoring infra
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal info 19m (x403 over 8d) helm-controller HelmChart 'flux-system/monitoring-infra' is not ready
Normal info 18m (x9 over 8d) helm-controller Helm upgrade has started
Normal error 18m helm-controller Helm upgrade failed: cannot patch "infra-kube-prometheus-stac-kube-apiserver-burnrate.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": EOF && cannot patch "infra-kube-prometheus-stac-kube-apiserver-histogram.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": dial tcp 10.240.28.152:443: connect: connection refused
Last Helm logs:
Looks like there are no changes for Service "infra-prometheus-node-exporter"
Looks like there are no changes for DaemonSet "infra-prometheus-node-exporter"
error updating the resource "infra-kube-prometheus-stac-kube-apiserver-burnrate.rules":
cannot patch "infra-kube-prometheus-stac-kube-apiserver-burnrate.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": EOF
error updating the resource "infra-kube-prometheus-stac-kube-apiserver-histogram.rules":
cannot patch "infra-kube-prometheus-stac-kube-apiserver-histogram.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": dial tcp 10.240.28.152:443: connect: connection refused
warning: Upgrade "infra" failed: cannot patch "infra-kube-prometheus-stac-kube-apiserver-burnrate.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": EOF && cannot patch "infra-kube-prometheus-stac-kube-apiserver-histogram.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": dial tcp 10.240.28.152:443: connect: connection refused
Normal error 18m helm-controller reconciliation failed: Helm upgrade failed: cannot patch "infra-kube-prometheus-stac-kube-apiserver-burnrate.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": EOF && cannot patch "infra-kube-prometheus-stac-kube-apiserver-histogram.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": dial tcp 10.240.28.152:443: connect: connection refused
Normal error 17m (x6 over 8d) helm-controller reconciliation failed: Operation cannot be fulfilled on helmreleases.helm.toolkit.fluxcd.io "infra": the object has been modified; please apply your changes to the latest version and try again
Normal error 14m (x386 over 5d16h) helm-controller reconciliation failed: upgrade retries exhausted
Normal error 7m5s (x19 over 12m) helm-controller reconciliation failed: upgrade retries exhausted
Normal info 2m28s (x2 over 4m11s) helm-controller Helm upgrade has started
Normal info 118s (x2 over 3m39s) helm-controller Helm upgrade succeeded
helm-controller logs:
❯ klf -n flux-system helm-controller-dc6ffd55b-rg6qk
{"level":"info","ts":"2021-09-30T08:01:18.602Z","logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2021-09-30T08:01:18.603Z","logger":"setup","msg":"starting manager"}
I0930 08:01:18.603524 6 leaderelection.go:243] attempting to acquire leader lease flux-system/helm-controller-leader-election...
{"level":"info","ts":"2021-09-30T08:01:18.603Z","msg":"starting metrics server","path":"/metrics"}
I0930 08:01:18.630310 6 leaderelection.go:253] successfully acquired lease flux-system/helm-controller-leader-election
{"level":"info","ts":"2021-09-30T08:01:18.704Z","logger":"controller.helmrelease","msg":"Starting EventSource","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","source":"kind source: /, Kind="}
{"level":"info","ts":"2021-09-30T08:01:18.704Z","logger":"controller.helmrelease","msg":"Starting EventSource","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","source":"kind source: /, Kind="}
{"level":"info","ts":"2021-09-30T08:01:18.704Z","logger":"controller.helmrelease","msg":"Starting Controller","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease"}
{"level":"info","ts":"2021-09-30T08:01:18.805Z","logger":"controller.helmrelease","msg":"Starting workers","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","worker count":4}
{"level":"info","ts":"2021-09-30T08:01:21.488Z","logger":"controller.helmrelease","msg":"reconcilation finished in 2.458244966s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:21.488Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:23.376Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.882877287s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:23.376Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:25.090Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.703418398s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:25.090Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:26.792Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.680926502s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:26.792Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:28.450Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.618160674s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:28.450Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:30.191Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.660457185s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:30.191Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:32.050Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.697716689s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:32.050Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:34.059Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.687846249s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:34.059Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:36.377Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.677888858s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:36.377Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"error","ts":"2021-09-30T08:01:39.382Z","logger":"controller.helmrelease","msg":"unable to update status after reconciliation","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"Operation cannot be fulfilled on helmreleases.helm.toolkit.fluxcd.io \"infra\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"error","ts":"2021-09-30T08:01:39.382Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"Operation cannot be fulfilled on helmreleases.helm.toolkit.fluxcd.io \"infra\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":"2021-09-30T08:01:41.124Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.742011934s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:41.124Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:43.644Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.702195992s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:43.644Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:55.508Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.622736012s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:55.508Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:02:17.669Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.680522601s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:02:17.669Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:03:00.294Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.664480571s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:03:00.294Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:04:23.880Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.665473894s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:04:23.880Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:04:46.197Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.637545948s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:04:46.197Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:05:52.746Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.681377434s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:05:52.746Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:07:09.465Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.74306092s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:07:09.465Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:10:35.591Z","logger":"controller.helmrelease","msg":"reconcilation finished in 34.25555514s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
helmrelease spec:
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: infra
  namespace: monitoring
spec:
  interval: 5m
  chart:
    spec:
      # renovate: registryUrl=https://prometheus-community.github.io/helm-charts
      chart: kube-prometheus-stack
      version: 19.0.1
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
      interval: 1m
  install:
    crds: Create
  upgrade:
    crds: CreateReplace
  # valuesFrom:
  #   - kind: Secret
  #     name: kube-prometheus-values
  #     # valuesKey: values.yaml
  values:
    ...
OS / Distro
Ubuntu 20.04
Flux version
0.17.2
Flux check
❯ flux check
► checking prerequisites
✔ kubectl 1.20.11 >=1.18.0-0
✔ Kubernetes 1.19.15 >=1.16.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.11.2
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.14.1
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.16.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.15.4
✔ all checks passed
Git provider
No response
Container Registry provider
No response
Additional context
Maybe we can collect some kind of documentation on how to get out of this “upgrade exhausted” situation?
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 60
- Comments: 51 (12 by maintainers)
Hello!
I can confirm this issue is still relevant for the latest version of the helm-controller. The workaround for now is this:
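Presumably this is the suspend/resume sequence other commenters describe later in the thread; a sketch using the release from this report:
❯ flux suspend helmrelease -n monitoring infra
❯ flux resume helmrelease -n monitoring infra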
It will reconcile broken release states such as “exhausted” and “another rollback/release is in progress”. Works for me.
Hopefully this helps other people facing the same issue.
This issue still being open means it still exists, otherwise we would have closed it (hopefully) 😃.
But! This has been actively worked on since https://github.com/fluxcd/helm-controller/issues/454#issuecomment-1120945094. Latest update was https://github.com/fluxcd/helm-controller/pull/503 (merged two days ago), which is the foundation for the new solution. I will continue to focus on finishing this in the following weeks, including release candidates at some point.
Thank you for your patience 🙇
Hey folks,
Just wanted to echo the same here as I did in https://github.com/fluxcd/helm-controller/issues/149#issuecomment-1111860111. That message is from a little more than a week ago, and I am now, after #477, at the point of starting to rewrite the release logic. While doing this, I will take this long-standing issue into account and ensure it’s covered with a regression test.
@MKruger777 There’s already great progress, but it’s behind a feature flag and you need to pay some attention to monitoring in order to set it up properly and reap the benefits that you’re looking for. The tl;dr is, if you are monitoring HelmReleases appropriately, then you can set:
and Flux will not give up, it will retry indefinitely, and you should not see “upgrade retries exhausted” anymore – (but by itself this answer leaves quite a great many gaps, and it also misses or glosses over some of the quite important developments in progress/already delivered/scheduled for Q3 and “Flux 2.1”, … altogether that may explain why this answer is so long…)
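The setting being referred to is likely the remediation retries field of the v2beta1 HelmRelease API, where a negative value means unlimited retries; a minimal sketch:
spec:
  install:
    remediation:
      retries: -1   # negative value: never give up retrying a failed install
  upgrade:
    remediation:
      retries: -1   # negative value: never give up retrying a failed upgrade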
The key issue here is somewhat multi-faceted and is going to be hard to “fix” per se: Helm is natively an imperative process, at least due to the way the Helm CLI is normally expected to be used, but also due to the concept of lifecycle hooks… Helm is an imperative process which uses a large amount of resources instantaneously during its install/upgrade attempts, so repeated install/upgrade attempts are definitely a thing to minimize, especially when you have many going concurrently. You can imagine a situation where a simple error that a human operator could easily see “can never fix itself, not worth retrying” unfortunately triggers some failure loop that crashes repeatedly, and this process that consumes so many resources ends up taking down the entire cluster… so it should be clear why the behavior listed above is not the default.
Also, adding to this, Helm counts upgrades in a way that facilitates some manner of “release accounting” with Helm’s built-in secrets. These secrets are used to tell `helm history` and `helm rollback` how to work, which Flux abides by and can work alongside. So to prevent trampling over the history, Helm Controller only upgrades when it’s absolutely necessary.
Now this is all configurable as well, and in the default behavior of HelmRelease, it currently does this minimal upgrade in two ways: by having a configurable number of retries (by default it’s 0 – it doesn’t retry) and, additionally, by tracking inputs – the HelmRelease spec, the chart itself (values, template), and any external secrets or configmaps that the HelmRelease refers to.
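For example, a bounded retry configuration might look like this (a sketch against the v2beta1 API; the values shown are illustrative):
spec:
  upgrade:
    remediation:
      retries: 3           # attempt the upgrade up to 3 more times before giving up
      strategy: rollback   # roll back between attempts; 'uninstall' is the alternative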
When “upgrade retries exhausted” appears, if you haven’t done any of this configuration, it just means an upgrade failed or timed out. If you are monitoring for these conditions, it might be perfectly acceptable to retry indefinitely until it succeeds (so that nobody needs to “kick the release” once the condition leading up to the failure is resolved. That’s `flux suspend helmrelease` and `flux resume helmrelease` – this doesn’t always work, as it often depends on unknown external factors).
By default, on each reconcile, HelmRelease will only trigger an upgrade if some of these inputs have actually changed – that is, up until this PR:
Now, in the feature flagged behavior, it also tracks for drift. You need to enable this feature flag for now, since drift detection has the same foot-guns as infinite retries mentioned above
https://fluxcd.io/flux/cheatsheets/bootstrap/#enable-helm-drift-detection
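A sketch of what enabling this can look like, assuming the `--feature-gates=DetectDrift=true` controller flag described in that cheatsheet, applied as a kustomize patch on the helm-controller Deployment of a bootstrap repository:
# flux-system/kustomization.yaml (excerpt; file layout assumed from a standard bootstrap)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --feature-gates=DetectDrift=true
    target:
      kind: Deployment
      name: helm-controller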
Hopefully it’s clear why this drift correction is part of the solution – we need Helm to retry an “upgrade” whenever some resource has changed from the template in the Helm chart, too. This closes the loop, making Helm Controller behave in all ways how GitOps practitioners expect declarative appliers to work, correcting any drift that gets introduced unexpectedly, and not ever landing in a “stuck” or stalled condition like this (unless the inputs are actually invalid!)
I think successful adoption of this feature flag by enough users will help us better understand how to provide this feature in a way that “solves the issue” so we can turn the feature on for all users. But at present, it’s a complicated issue and the users need to know some of these details in order to solve it.
For now, in order to enable drift detection safely, you need to have Flux’s Alert and Provider resources configured to tell you about HelmRelease events. This will ensure you have a way to know when Flux is “trapped in an upgrade loop”, which is something you can see happening a bit more often while drift detection is enabled, because of features like lifecycle hooks and other things like Kubernetes operators installed via Helm that may occasionally write back updates to some of the resources the Helm template installed. That’s when you need a (human) operator to be notified and intervene ASAP.
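A sketch of such an alerting setup, assuming the notification-controller's v1beta1 Provider/Alert kinds and a Slack provider; the provider name, channel, and secret name are illustrative:
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: flux-alerts          # illustrative channel name
  secretRef:
    name: slack-webhook-url     # secret holding the webhook address
---
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Alert
metadata:
  name: helmrelease-alerts
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: error
  eventSources:
    - kind: HelmRelease
      name: '*'
      namespace: monitoring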
So, since we’re now “correcting drift” in Helm templates and Helm Controller sees a number of perfectly normal things as drift, we need to mark some drift as “allowable” – this is all covered in the cheatsheet link above, which links out to this doc:
https://fluxcd.io/flux/components/helm/helmreleases/#drift-detection
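Individual objects can also be excluded from drift correction; a sketch, assuming the `helm.toolkit.fluxcd.io/driftDetection: disabled` annotation that doc describes (the Deployment shown is illustrative):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: operator-managed-workload   # illustrative: an object another controller mutates on purpose
  annotations:
    helm.toolkit.fluxcd.io/driftDetection: disabled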
tl;dr: This issue should be “Solved” in Flux Helm GA scheduled for next quarter, which is the next entry on the Roadmap after Flux GitOps GA – I think that the intention is that it should be solved in Flux Helm GA, but caveat that: it actually remains to be seen how much of this problem can really be successfully abstracted away from users, and how much will change from the solution that’s already available in existing Flux releases.
I might recommend a simple solution, which would be to add a `flux reset-retries` option and have it just wipe the retries counter wherever it’s set.
When you run `flux reconcile`, it annotates the Flux resource with a patch to set the `ReconcileRequestAnnotation` and the current time, so that Flux triggers the periodic reconciliation ahead of the regularly scheduled interval.
When you run `flux suspend`, it also annotates the Flux resource, patching `suspend: true` into the spec. This stops reconciliation altogether until it’s reversed by `flux resume`. I’m not sure what mechanism causes this, but an `upgrade` is always triggered on resume. This is why the drift gets reverted then / upgrade retries gets a new starting count.
We have the Flux-E2E guide, which doesn’t cover this very well and could probably stand to get an update soon, as it still describes HelmRepository as “the preferred source” for HelmReleases (hopefully we can say that OCI sources are preferred soon enough… but that’s a digression here) – to update the text right now in detail would be less than advisable because of the rewrite in progress; important details are being worked out and will be changing soon to accommodate this:
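As a side note, the same reconcile trigger can be sent without the CLI by patching that annotation directly; a sketch using the `reconcile.fluxcd.io/requestedAt` key and the release from this report:
❯ kubectl annotate --overwrite helmrelease/infra -n monitoring reconcile.fluxcd.io/requestedAt="$(date +%s)"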
That’s what we’re all here for 👍 it’s coming, please be patient, this is rocket surgery as far as I’m concerned 🚀⎈🏋️♂️ I mean this is the subject of the issue, if I understand the whole thread.
The general behavior of Flux is to always correct and control drift of any kind. The Helm Controller behavior is different today because of the mechanics of Helm: a secret which tracks every “upgrade” that should only be incremented/duplicated once for every time a change is made (or, in the case of `suspend`/`resume`, at least today, when an upgrade has been done, regardless of whether any changes happened).
It’s a priority to resolve this, stay tuned.
Also happening with Flux CLI version 0.24.1: `flux reconcile hr <name>` --> `HelmRelease reconciliation failed: install retries exhausted`. The workaround suggested above works: `flux suspend hr <name>` followed by `flux resume hr <name>`.
The most predictable way I have addressed this is `kubectl delete secret -l owner=helm,name=[release name],status=pending-upgrade && flux reconcile hr -n [release namespace] [release name]`.
Our organization is still encountering this bug regularly, and it’s becoming more disruptive to our GitOps setup. Just curious if there are any plans for addressing this at some point.
I’ve had success working around this by doing a suspend followed by a resume.
This is another suggestion; although I don’t like it as much, it may also work for you:
When you need to trigger a new HelmRelease reconciliation after “upgrade retries exhausted” and you aren’t in a position to run `helm rollback` or `helm uninstall`, try editing `spec.values` – this is one place where an untyped `values` comes in handy: you can invent a new value that doesn’t mean anything, say `spec.values.nonce`, and just update it.
Helm does not type `values.yaml`, so it has no way of knowing that a change to `nonce` doesn’t actually update anything when it is substituted into the templates, and it cannot know because there’s no mechanism in Helm to detect what types of changes are made by any post-install or post-upgrade hooks there might have been in any given Helm chart. (Any hooks might care about the value of `nonce`, as they can be running processes that manipulate the state of the release in post.)
Helm will be forced to run the upgrade again each time you update the nonce value. Hope this helps as well!
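A sketch of that trick on the HelmRelease from this report; the `nonce` key is arbitrary and means nothing to the chart, so bumping its value forces a new upgrade:
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: infra
  namespace: monitoring
spec:
  values:
    nonce: "2021-09-30-1"   # change this string whenever you want Helm to run the upgrade again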
Facing the same issue. Really causing some headaches :( Any idea on some sort of timeline for when this will be tackled? Many thanks!
I’m fairly sure that’s correct.
`--timeout` is a global option on the flux CLI; it does not pass through to HelmRelease.
I’ve run into situations where `helm uninstall ...` or `flux delete hr ...` was the only way to resolve this issue as well; suspend/resume had no effect. Next time it happens I’ll try to have more information. Seems like Flux gets stuck on trying to install or upgrade and only a fresh install of the Helm release fixes it.
Hi @kingdonb, all right, thanks a lot for providing your explanation. You’re right, fixing errors with just uninstall + install is a long time ago 😉 In our case we sometimes have services based on Helm charts in dev environments, which aren’t important and haven’t been used for a longer time. That’s when we ignore installation errors, because in 99% of cases they are just occurring because of outdated Helm chart versions (e.g. the image isn’t available any more). Just knowing that suspending and resuming does the same as reconcile did in the past is OK for me. On important environments where the services have to run all the time, we initially get informed by alerting, and an “install retries exhausted” error could hardly happen.
@snukone The Flux 0.26.1 release out this week has lots of Helm updates that will make Helm fail less often, according to reports we’ve received.
I have heard mixed reports about whether suspend/resume will actually retry a failed HelmRelease that exhausted retries or not; it may depend on how it failed. I’d be surprised if `install retries exhausted` was solved that way in fact, since a failed install leaves a secret behind, and I think the secret records it as failed? I guess I’m in the minority here if this doesn’t work for me.
In any case I think you’d have to configure `remediationStrategy` settings for your preferred number of retries and/or remediation method. It sounds like the days are long gone when your best option was running `helm uninstall` and trying again. 👍