helm-controller: Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
Sometimes helm releases are not installed because of this error:
{"level":"info","ts":"2020-11-19T15:41:11.273Z","logger":"controllers.HelmRelease","msg":"reconcilation finished in 50.12655ms, next run in 9m0s","controller":"helmrelease","request":"traefik/traefik"}
{"level":"error","ts":"2020-11-19T15:41:11.274Z","logger":"controller","msg":"Reconciler error","reconcilerGroup":"helm.toolkit.fluxcd.io","reconcilerKind":"HelmRelease","controller":"helmrelease","name":"traefik","namespace":"traefik","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}
{"level":"info","ts":"2020-11-19T15:43:19.310Z","logger":"controllers.HelmRelease","msg":"reconcilation finished in 69.439664ms, next run in 9m0s","controller":"helmrelease","request":"traefik/traefik"}
{"level":"error","ts":"2020-11-19T15:43:19.310Z","logger":"controller","msg":"Reconciler error","reconcilerGroup":"helm.toolkit.fluxcd.io","reconcilerKind":"HelmRelease","controller":"helmrelease","name":"traefik","namespace":"traefik","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}
{"level":"info","ts":"2020-11-19T15:52:42.524Z","logger":"controllers.HelmRelease","msg":"reconcilation finished in 69.944579ms, next run in 9m0s","controller":"helmrelease","request":"traefik/traefik"}
{"level":"error","ts":"2020-11-19T15:52:42.525Z","logger":"controller","msg":"Reconciler error","reconcilerGroup":"helm.toolkit.fluxcd.io","reconcilerKind":"HelmRelease","controller":"helmrelease","name":"traefik","namespace":"traefik","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}
In this case the Helm release is stuck in a pending status.
We have not found any corresponding log entry for the actual installation. Is this some concurrency bug?
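For reference (an editorial sketch, not part of the original report): the stuck state described above can usually be confirmed from the CLI; the release name and namespace below are taken from the logs.

```sh
# What helm-controller thinks of the release
flux get helmreleases --all-namespaces

# What Helm itself has recorded; a stuck release shows a
# pending-install / pending-upgrade / pending-rollback status
helm status traefik -n traefik
helm history traefik -n traefik
```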
About this issue
- State: closed
- Created 4 years ago
- Reactions: 37
- Comments: 78 (28 by maintainers)
Commits related to this issue
- Update flux components helm-controller: v0.11.1 source-controller: v0.15.3 This brings in an increase in the default leader election deadlines, in order to hopefully reduce the impact of any cluster... — committed to airshipit/airshipctl by seaneagan 3 years ago
- Update Helm to v3.7.0 This pulls in Kubernetes dependencies at `v0.22.1`, but should include improvements that would help resolve https://github.com/fluxcd/helm-controller/issues/149 Signed-off-by: ... — committed to fluxcd/helm-controller by hiddeco 3 years ago
- feat(flux): reduce concurrent workers This was recommended as one of the suggestions in a bug report with the similar issue https://github.com/fluxcd/helm-controller/issues/149. — committed to qlonik/musical-parakeet by qlonik a year ago
- feat(flux/config): reduce concurrent workers This was recommended as one of the suggestions in a bug report with the similar issue https://github.com/fluxcd/helm-controller/issues/149. — committed to qlonik/musical-parakeet by qlonik a year ago
I got the same issue. Please check the following first. I was not even able to list the release with the usual command; it returned an empty result, which is odd behaviour from Helm.
Make sure your context is set to the correct Kubernetes cluster.
Then, as the next step, try applying a rollback to the release found by the command above.
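The commands referenced in this comment did not survive the copy; a plausible reconstruction (namespace, release and revision are placeholders) looks like this:

```sh
# List releases in the namespace, including the ones stuck in a pending state
helm list --all -n traefik

# Find the last revision that actually reached "deployed"
helm history traefik -n traefik

# Roll back to that revision (replace 3 with the revision number from above)
helm rollback traefik 3 -n traefik
```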
Yes.
The helm-controller is scheduled to see the same refactoring round as the source-controller recently did, in which reconciliation logic in the broadest sense will be improved and long-standing issues will be taken care of. I expect to start on this at the beginning of next week.
Use the following command to see charts in all namespaces, including the ones where an installation is in progress.
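The command itself was lost in the quoted comment; it was presumably along the lines of:

```sh
# --all includes pending/failed releases, --all-namespaces spans the whole cluster
helm list --all --all-namespaces
```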
This problem still persists. It is very annoying that we cannot rely on GitOps to eventually converge a cluster to the expected state, as it gets stuck on affected HelmRelease objects with `Helm upgrade failed: another operation (install/upgrade/rollback) is in progress`. Or, to put it another way, a FluxCD-based GitOps setup with HelmReleases is not self-healing and needs a lot of (unpredictable) manual interventions running `helm rollback ... && flux reconcile hr ...` commands in order to fix things. Is there anything that prevents us from adding a new feature to the helm-controller to detect stuck (locked) HelmReleases and automatically fix them by rolling them back, immediately followed by a reconciliation?
I’m adding this comment to restate/clarify the remaining problem so it will hopefully be easier to identify and resolve.
Helm obtains a mutex lock on the chart install, so any HelmRelease resources under reconciliation at the moment the Helm Controller crashes (or runs out of memory, or the node on which it is running crashes) will get stuck in a deadlocked PENDING state, as nothing will subsequently remove the lock. When the next Helm Controller pod starts, it attempts to reconcile the HelmRelease for the deadlocked chart and encounters the `Helm upgrade failed: another operation (install/upgrade/rollback) is in progress` error. See https://github.com/helm/helm/issues/4558 and https://github.com/helm/helm/pull/9180 for more details about the Helm mutex lock issue.
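As background for the lock mechanics (an editorial sketch, not part of the original comment): Helm 3 stores every release revision in a Secret labelled with `owner=helm`, `name=<release>` and `status=<state>`, and the recorded `pending-*` status is what acts as the lock. The stuck revision can therefore be located with something like:

```sh
# All revisions Helm has recorded for the release, with their status labels
kubectl get secrets -n traefik -l owner=helm,name=traefik --show-labels

# Only the revision(s) left in a pending state after the crash
kubectl get secrets -n traefik -l owner=helm,name=traefik,status=pending-upgrade
```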
From @hiddeco's earlier comment, assuming that it will be some time before this is fixed in Helm itself, the likely workaround within the Helm Controller would be:
@stefanprodan's PR #239 mitigates a crash condition that would leave releases stuck in the PENDING state, but does not provide a resolution path for those releases.
@sharkztex this is the same problem I commonly see. The workarounds I know of are:
Alternatively, you can use `helm rollback`:
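The exact command was not captured here; a typical sequence (release, revision and namespace are placeholders) would be:

```sh
# Roll back to a known-good revision, then trigger a fresh reconciliation
helm rollback traefik 3 -n traefik
flux reconcile helmrelease traefik -n traefik
```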
Same here, and we constantly manually apply `helm rollback ... && flux reconcile ...` to fix it. What about adding a flag to `HelmRelease` to opt in to a self-healing approach, where the helm-controller would recognise HelmReleases in this state and automatically apply a rollback to them?
This saved my Christmas Eve dinner, thank you so much!
@marcocaberletti we fixed the OOM issues, we’ll do a flux release today. Please see: https://github.com/fluxcd/helm-controller/blob/main/CHANGELOG.md#0122
I’ve re-set up my whole fluxcd repo, inspired by the example repo (https://github.com/fluxcd/flux2-kustomize-helm-example).
I now have several Kustomizations for infra tools, Google Config Connector, cert-manager and the monitoring charts, which depend on each other and also make use of health checks (roughly as sketched below). My monitoring-namespace charts have some Helm dependencies, so the other monitoring charts are installed after the kube-prometheus-stack chart, hoping this would lower the pressure on the Kubernetes API.
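A minimal sketch of such a dependency chain, assuming the Flux Kustomization API of that era; names and paths are illustrative, not taken from the reporter's repository:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: monitoring
  namespace: flux-system
spec:
  interval: 10m
  path: ./monitoring
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  # Reconcile only after the infrastructure layer is ready
  dependsOn:
    - name: infrastructure
  # Block dependents until kube-prometheus-stack reports Ready
  healthChecks:
    - apiVersion: helm.toolkit.fluxcd.io/v2beta1
      kind: HelmRelease
      name: kube-prometheus-stack
      namespace: monitoring
```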
Nevertheless, the Kubernetes API dies shortly after kube-prometheus-stack is installed, the kustomize-controller is restarted, and the helmrelease stays forever in this state:
Therefore the Kustomization health check, testing for the kube-prometheus-stack helmrelease to become ready, is also stuck.
So again it does not look like a kustomize-controller problem, but it would still be nice if we could recover from that without doing something by hand.
Deleting the helmrelease and running `flux reconcile kustomization monitoring` solved it.
@niclarkin thank you so much for testing this. After running my own tests last night on several clusters, I have bumped the default deadline to 30 seconds; my tests were focused on CNI upgrades and API network failures. I’ve also changed the flag names. I’ll add these flags to all toolkit controllers, and the new defaults will be available in the next flux release.
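For anyone looking for the corresponding knobs (an illustration, not from the original comments): both the concurrency and the leader-election settings are plain controller flags, so they can be tuned by patching the helm-controller Deployment in the flux-system manifests. The flag names below match later helm-controller releases and should be treated as assumptions; check the controller's `--help` output for your version.

```yaml
# Kustomize strategic-merge patch for the helm-controller Deployment
# (flag names are assumptions; verify against your controller version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helm-controller
  namespace: flux-system
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --watch-all-namespaces=true
            - --log-level=info
            - --enable-leader-election
            # Fewer concurrent reconciles lowers pressure on the API server
            - --concurrent=2
            # More generous deadline so brief API outages don't kill the leader
            - --leader-election-renew-deadline=30s
```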
We’ve disabled kube-state-metrics on the kube-prometheus-stack chart, but got the same result.
Helm itself places a lock when it starts an upgrade; if you kill Helm while it is doing so, it leaves the lock in place, preventing any further upgrade operations. Doing a rollback is very expensive and can have grave consequences for charts with statefulsets, or charts that contain hooks which perform DB migrations and other state-altering operations. We’ll need to find a way to remove the lock without affecting the deployed workloads.
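For completeness (a manual workaround discussed in the linked Helm issues, not the controller's eventual fix): since the lock is only the latest release Secret sitting in a `pending-*` status, some users clear it by deleting that single Secret so Helm falls back to the last deployed revision; the running workloads are untouched, but the history entry for the pending revision is lost. Release name, namespace and revision number are placeholders:

```sh
# Identify the revision stuck in a pending state
kubectl get secrets -n traefik -l owner=helm,name=traefik,status=pending-upgrade

# Delete only that revision's Secret (sh.helm.release.v1.<release>.v<revision>)
kubectl delete secret sh.helm.release.v1.traefik.v4 -n traefik
```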
Same issue for me after the upgrade to flux v0.21.0, on 11/02 in the plot below. I notice a huge increase in memory usage for the helm-controller, so the pod is often OOM-killed and Helm releases stay in `pending-upgrade` status.
Started to run into this with flux 0.19.1. The only thing that seems to work as a fix is manually deleting the helmrelease. My setup is: Kustomization (interval: 2m0s) creates -> HelmRelease (interval: 5m0s, infinite retries).
@mfamador @florinherbert note that kube-prometheus-stack comes with an exporter called kube-state-metrics that runs tons of queries against the API. My guess is that it DDoSes the Kubernetes API in such a way that both AKS and kubeadm control planes are crashing 🙃
Not sure if it’s of interest, and sorry for the spam, but it might give a clue about what is happening.
Updated the cluster from 1.18.14 to 1.19.7, added a new node to have more resources, and killed all pods in kube-system so that `tunnelfront` was also restarted (it had already been restarted when updating the cluster). I was able to install the helm release, but the controllers crashed all the same, leaving it in `pending-install` again. Installing it manually with `helm install manual-prometheus -n monitoring prometheus-community/kube-prometheus-stack` worked just fine.
For the others in this issue:
The problem you are all running into has been around in Helm for a while, and is most of the time related to Helm not properly restoring/updating the release state when certain timeouts happen during the rollout of a release.
The reason you are seeing this more frequently compared to earlier versions of Helm is the introduction of https://github.com/helm/helm/pull/7322 in v3.4.x.
There are three options that would eventually resolve this for you all:
Until we have opted for one of those options (likely to be option 2), and provided your issue isn’t due to the controller(s) crashing, you may want to set your timeout to a more generous value as already suggested in https://github.com/fluxcd/helm-controller/issues/149#issuecomment-796782509. That should minimize the chances of running into Helm#4558.
For all the folks who do not experience helm-controller crashes… could you try adding a bigger timeout to the HelmRelease, i.e. `timeout: 30m`?
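A minimal sketch of where that field lives, assuming the `helm.toolkit.fluxcd.io/v2beta1` HelmRelease API in use at the time; names are placeholders:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: traefik
  namespace: traefik
spec:
  interval: 9m
  # Give Helm more time before the release is left in a pending/failed state
  timeout: 30m
  chart:
    spec:
      chart: traefik
      sourceRef:
        kind: HelmRepository
        name: traefik
```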
Same here; from what I could see the flux controllers are crashing while reconciling the helmreleases and the charts stay in pending status. And the helm releases:
After deleting a helmrelease so that it can be recreated again, the kustomize-controller is crashing:
Running `helm uninstall` for the `pending-install` releases seems to solve the problem some of the time, but most of the time the controllers are still crashing:
Try `k describe helmreleases <therelease>` and look at the events. In my case I believe it was caused by:
I did a helm upgrade by hand, and then it reconciled in flux too.