helm-controller: Helm upgrade failed: another operation (install/upgrade/rollback) is in progress

Sometimes Helm releases are not installed because of this error:

{"level":"info","ts":"2020-11-19T15:41:11.273Z","logger":"controllers.HelmRelease","msg":"reconcilation finished in 50.12655ms, next run in 9m0s","controller":"helmrelease","request":"traefik/traefik"}
{"level":"error","ts":"2020-11-19T15:41:11.274Z","logger":"controller","msg":"Reconciler error","reconcilerGroup":"helm.toolkit.fluxcd.io","reconcilerKind":"HelmRelease","controller":"helmrelease","name":"traefik","namespace":"traefik","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}
{"level":"info","ts":"2020-11-19T15:43:19.310Z","logger":"controllers.HelmRelease","msg":"reconcilation finished in 69.439664ms, next run in 9m0s","controller":"helmrelease","request":"traefik/traefik"}
{"level":"error","ts":"2020-11-19T15:43:19.310Z","logger":"controller","msg":"Reconciler error","reconcilerGroup":"helm.toolkit.fluxcd.io","reconcilerKind":"HelmRelease","controller":"helmrelease","name":"traefik","namespace":"traefik","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}
{"level":"info","ts":"2020-11-19T15:52:42.524Z","logger":"controllers.HelmRelease","msg":"reconcilation finished in 69.944579ms, next run in 9m0s","controller":"helmrelease","request":"traefik/traefik"}
{"level":"error","ts":"2020-11-19T15:52:42.525Z","logger":"controller","msg":"Reconciler error","reconcilerGroup":"helm.toolkit.fluxcd.io","reconcilerKind":"HelmRelease","controller":"helmrelease","name":"traefik","namespace":"traefik","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}

In this case the Helm release is stuck in a pending status.

We have not found any corresponding log entry for the actual installation. Is this some concurrency bug?

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 37
  • Comments: 78 (28 by maintainers)

Most upvoted comments

I got the same issue. Please check the following first. I was not even able to list the release with the usual command:

helm list -n <namespace>

This returned an empty list, which is odd behaviour from Helm. Run:

kubectl config get-contexts

and make sure your context is set to the correct Kubernetes cluster.

The next step is to check the release history:

helm history <release> -n <namespace> --kube-context <kube-context-name>

Then roll back to a healthy revision from that history:

helm rollback <release> <revision> -n <namespace> --kube-context <kube-context-name>

Yes.

The helm-controller is scheduled for the same refactoring round the source-controller recently went through, in which reconciliation logic in the broadest sense will be improved and long-standing issues will be taken care of. I expect to start on this at the beginning of next week.

Use the following command to see releases in all namespaces, including those where an installation is still in progress:

helm list -Aa

This problem still persists. It is very annoying that we cannot rely on GitOps to eventually converge a cluster to the expected state, because it gets stuck on affected HelmRelease objects, stuck in Helm upgrade failed: another operation (install/upgrade/rollback) is in progress. To phrase it in other words, a Flux-based GitOps setup with HelmReleases is not self-healing and needs a lot of (unpredictable) manual intervention, running helm rollback ... && flux reconcile hr ... commands (sketched below) in order to fix things.

Is there anything that prevents us from adding a new feature to the helm-controller to detect stuck (locked) HelmReleases and automatically fix them by rolling them back immediately followed by a reconciliation?
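
For reference, the manual intervention mentioned above typically looks like the following sketch, with hypothetical release and namespace names (my-release, my-namespace) standing in for the real ones:

# Find the last healthy revision of the stuck release
helm history my-release -n my-namespace

# Roll back to that revision to clear the pending state, then trigger Flux again
helm rollback my-release <revision> -n my-namespace && \
  flux reconcile hr my-release -n my-namespace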

I’m adding this comment to restate/clarify the remaining problem so it will hopefully be easier to identify and resolve.

Helm obtains a mutex lock on the chart install, so any HelmRelease resources under reconciliation at the moment the Helm Controller crashes (or runs out of memory, or the node it is running on crashes) will get stuck in a deadlocked PENDING state, as nothing will subsequently remove the lock. When the next Helm Controller pod starts, it attempts to resolve the HelmRelease for the deadlocked chart and encounters the Helm upgrade failed: another operation (install/upgrade/rollback) is in progress error.

See https://github.com/helm/helm/issues/4558 and https://github.com/helm/helm/pull/9180 for more details about the Helm mutex lock issue.
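
A quick way to see which releases are currently holding such a stale lock is to filter for pending releases (standard helm list flags, nothing specific to the controller):

# List releases stuck in pending-install/pending-upgrade/pending-rollback across all namespaces;
# these are the ones blocked by the stale lock described above.
helm list -A --pending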

From @hiddeco's earlier comment: assuming that it will be some time before this is fixed in Helm itself, the likely workaround within the Helm Controller would be:

Detect the pending state in the controller, assume that we are the sole actors over the release (and can safely ignore the pessimistic lock), and fall back to the configured remediation strategy for the release to attempt to perform an automatic rollback (or uninstall).

@stefanprodan's PR #239 mitigates a crash condition that would leave releases stuck in the PENDING state, but does not provide a resolution path for those releases.
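
The "configured remediation strategy" mentioned above is something each HelmRelease declares in its spec. A minimal sketch of what that looks like, assuming the helm.toolkit.fluxcd.io/v2beta1 API already used elsewhere in this thread (field names per that API version; my-release is a hypothetical name):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-release              # hypothetical name
spec:
  # ...chart, interval, values, etc. omitted...
  install:
    remediation:
      retries: 3                # retry a failed install up to 3 times
  upgrade:
    remediation:
      retries: 3                # retry a failed upgrade up to 3 times
      remediateLastFailure: true  # also remediate when the final retry fails
      strategy: rollback        # fall back to a rollback (the alternative is: uninstall)

Note that this alone did not resolve the stuck-lock case described in this issue; the point of the proposed change is to make the controller fall back to this strategy when it detects the pending state.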

We’ve experienced issues where some of our releases get stuck on:

"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"

@sharkztex this is the same problem I commonly see. The workarounds I know of are:

# example w/ kiali
HR_NAME=kiali
HR_NAMESPACE=kiali
kubectl get secrets -n ${HR_NAMESPACE} | grep ${HR_NAME}
# example output:
sh.helm.release.v1.kiali.v1                                       helm.sh/release.v1                    1      18h
sh.helm.release.v1.kiali.v2                                       helm.sh/release.v1                    1      17h
sh.helm.release.v1.kiali.v3                                       helm.sh/release.v1                    1      17m
# Delete the most recent one (the pending release record that holds the lock):
kubectl delete secret -n ${HR_NAMESPACE} sh.helm.release.v1.${HR_NAME}.v3

# suspend/resume the hr
flux suspend hr -n ${HR_NAMESPACE} ${HR_NAME}
flux resume hr -n ${HR_NAMESPACE} ${HR_NAME}

Alternatively you can use helm rollback:

HR_NAME=kiali
HR_NAMESPACE=kiali

# Run a helm history command to get the latest release before the issue (should show deployed)
helm history ${HR_NAME} -n ${HR_NAMESPACE} 
# Use that revision in this command
helm rollback ${HR_NAME} <revision> -n ${HR_NAMESPACE} 
flux reconcile hr ${HR_NAME} -n ${HR_NAMESPACE}

Same here; we constantly have to apply helm rollback ... && flux reconcile ... manually to fix it. What about adding a flag to HelmRelease to opt in to a self-healing approach, where the helm-controller would recognise HelmReleases in this state and automatically apply a rollback to them?

The rollback steps above saved my Christmas Eve dinner, thank you so much!

@marcocaberletti we fixed the OOM issues; we’ll do a Flux release today. Please see: https://github.com/fluxcd/helm-controller/blob/main/CHANGELOG.md#0122

I’ve re-set up my whole Flux repo, inspired by the Helm example repo (https://github.com/fluxcd/flux2-kustomize-helm-example).

I now have several Kustomizations for infra tools, Google Config Connector, cert-manager and monitoring charts, which depend on each other and also make use of health checks. My monitoring-namespace charts have dependencies on each other, so the other monitoring charts are installed after the kube-prometheus-stack chart, hoping this would lower the pressure on the Kubernetes API.

Nevertheless, the Kubernetes API dies shortly after kube-prometheus-stack is installed, the kustomize-controller is restarted, and the HelmRelease stays forever in this state:

k -n monitoring get helmreleases.helm.toolkit.fluxcd.io 
NAME                           READY   STATUS                                                                             AGE
kube-prometheus-stack          False   Helm upgrade failed: another operation (install/upgrade/rollback) is in progress   106m
prometheus-blackbox-exporter   False   dependency 'monitoring/kube-prometheus-stack' is not ready                         106m
prometheus-mysql-exporter      False   HelmChart 'infra/monitoring-prometheus-mysql-exporter' is not ready                106m
prometheus-postgres-exporter   False   dependency 'monitoring/kube-prometheus-stack' is not ready                         106m

Therefore the Kustomization health check, which tests for the kube-prometheus-stack HelmRelease to be ready, is also stuck.

 k get kustomizations.kustomize.toolkit.fluxcd.io 
NAME           READY     STATUS                                                            AGE
cert-manager   True      Applied revision: main/1e59bc6d254a87ecb3b9f1e273840054603b8bd9   120m
cnrm-system    True      Applied revision: main/1e59bc6d254a87ecb3b9f1e273840054603b8bd9   120m
flux-system    True      Applied revision: main/1e59bc6d254a87ecb3b9f1e273840054603b8bd9   120m
infra          True      Applied revision: main/1e59bc6d254a87ecb3b9f1e273840054603b8bd9   120m
monitoring     Unknown   reconciliation in progress                                        120m
remote-apps    False     dependency 'flux-system/monitoring' is not ready                  120m
tls-cert       True      Applied revision: main/1e59bc6d254a87ecb3b9f1e273840054603b8bd9   120m

So again it does not look like it’s a kustomize-controller problem, but it would still be nice if we could recover from this without manual intervention.

Deleting the HelmRelease and running flux reconcile kustomization monitoring solved it.
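
For reference, the manual recovery described in the previous line would look roughly like this (the names match the output above, but treat it as a sketch rather than a recommended procedure):

# Delete the stuck HelmRelease; the owning Kustomization will recreate it
kubectl -n monitoring delete helmrelease kube-prometheus-stack

# Trigger the Kustomization so the HelmRelease is recreated and reconciled
flux reconcile kustomization monitoring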

@niclarkin thank you so much for testing this. After running my own tests last night on several clusters, I have bumped the default deadline to 30 seconds; my tests were focused on CNI upgrades and API network failures. I’ve also changed the flag names. I’ll add these flags to all toolkit controllers, and the new defaults will be available in the next Flux release.

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
spec:
  releaseName: kube-prometheus-stack
  chart:
    spec:
      chart: kube-prometheus-stack
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
      version: "14.0.1"
  interval: 1h0m0s
  timeout: 30m
  install:
    remediation:
      retries: 3
  values:
    kubeStateMetrics:
      enabled: false

We’ve disabled kube-state-metrics in the kube-prometheus-stack chart, but got the same result.

Helm itself places a lock when it starts an upgrade; if you kill Helm mid-operation, it leaves the lock in place, preventing any further upgrade operations. Doing a rollback is very expensive and can have grave consequences for charts with StatefulSets or charts that contain hooks which perform DB migrations and other state-altering operations. We’ll need to find a way to remove the lock without affecting the deployed workloads.
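
For context, the "lock" is simply the status recorded in the newest Helm release secret. A sketch of how to inspect it, assuming the standard Helm 3 storage format (double base64-encoded, gzip-compressed JSON) and placeholder names:

# Read the status stored in the newest release secret (placeholder release/namespace/revision)
kubectl -n my-namespace get secret sh.helm.release.v1.my-release.v3 \
  -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip | jq -r '.info.status'
# A value such as "pending-upgrade" is what blocks further operations; deleting that newest
# secret (as shown in an earlier comment) removes the lock without touching the deployed workloads.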

Same issue for me after the upgrade to Flux v0.21.0, on 11/02 in the plot below. I noticed a huge increase in memory usage for the helm-controller, so the pod is often OOM-killed and Helm releases stay in pending-upgrade status.

[Screenshot: helm-controller memory usage, 2021-11-11]

Started to run into this with Flux 0.19.1. The only thing that seems to work as a fix is manually deleting the HelmRelease. My setup is: Kustomization (interval: 2m0s) creates -> HelmRelease (interval: 5m0s, infinite retries).

@mfamador @florinherbert note that kube-prometheus-stack comes with an exporter called kube-state-metrics that runs tons of queries against the API. My guess is that it DDoSes the Kubernetes API in such a way that both AKS and kubeadm control planes are crashing 🙃

Not sure if it’s of interest, and sorry for the spam, but it might give a clue about what is happening.

I updated the cluster from 1.18.14 to 1.19.7, added a new node to have more resources, and killed all pods in kube-system so that tunnelfront was also restarted (it had already been restarted when updating the cluster).

I was able to install the Helm release, but the controllers crashed all the same, leaving it in pending-install again. Installing it manually with helm install manual-prometheus -n monitoring prometheus-community/kube-prometheus-stack worked just fine.

❯ helm list -Aa
NAME                   	NAMESPACE   	REVISION	UPDATED                                	STATUS         	CHART                       	APP VERSION
manual-prometheus      	monitoring  	1       	2021-03-12 09:53:43.33732 +0000 UTC    	deployed       	kube-prometheus-stack-14.0.1	0.46.0
kube-prometheus-stack  	monitoring  	1       	2021-03-12 09:32:23.673133881 +0000 UTC	pending-install	kube-prometheus-stack-14.0.1	0.46.0

For the others in this issue:

The problem you are all running into has been around in Helm for a while, and is most of the time related to Helm not properly restoring/updating the release state after timeouts that may happen during the rollout of a release.

The reason you are seeing this more frequently compared to earlier versions of Helm is due to the introduction of https://github.com/helm/helm/pull/7322 in v3.4.x.

There are three options that would eventually resolve this for you all:

  1. Rely on it being fixed in the Helm core, an attempt is being made in https://github.com/helm/helm/pull/9180, but it will likely take some time before there is consensus there about what the actual fix would look like.
  2. Detect the pending state in the controller, assume that we are the sole actors over the release (and can safely ignore the pessimistic lock), and fall back to the configured remediation strategy for the release to attempt to perform an automatic rollback (or uninstall).
  3. Patch the Helm core, as others have done in e.g. https://github.com/werf/helm/commit/ea7631bd21e6aeed05515e594fdd6b029fc0bf23, so that it is suited to our needs. I am however not a big fan of maintaining forks, and much more in favor of helping fix it upstream.

Until we have opted for one of those options (likely option 2), and provided your issue isn’t due to the controller(s) crashing, you may want to set your timeout to a more generous value, as already suggested in https://github.com/fluxcd/helm-controller/issues/149#issuecomment-796782509. That should minimize the chances of running into Helm#4558.

For all folks who do not experience helm-controller crashes: could you try adding a bigger timeout to the HelmRelease, e.g. timeout: 30m?
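
For clarity, that field sits at the top level of the HelmRelease spec; a minimal sketch with hypothetical names (the kube-prometheus-stack example earlier in the thread sets the same value):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-release   # hypothetical name
spec:
  interval: 10m
  timeout: 30m       # time to wait for Kubernetes operations (e.g. hooks) during a Helm action
  # ...chart, values, etc. as usual...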

Same here. From what I could see, the Flux controllers are crashing while reconciling the HelmReleases and the charts stay in pending status.

[Screenshot 2021-03-10 21:09:48]

❯ helm list -Aa
NAME              	NAMESPACE   	REVISION	UPDATED                                	STATUS         	CHART                       	APP VERSION
flagger           	istio-system	1       	2021-03-10 20:53:41.632527436 +0000 UTC	deployed       	flagger-1.6.4               	1.6.4
flagger-loadtester	istio-system	1       	2021-03-10 20:53:41.523101293 +0000 UTC	deployed       	loadtester-0.18.0           	0.18.0
istio-operator    	istio-system	1       	2021-03-10 20:54:52.180338043 +0000 UTC	deployed       	istio-operator-1.7.0
loki              	monitoring  	1       	2021-03-10 20:53:42.29377712 +0000 UTC 	pending-install	loki-distributed-0.26.0     	2.1.0
prometheus-adapter	monitoring  	1       	2021-03-10 20:53:50.218395164 +0000 UTC	pending-install	prometheus-adapter-2.12.1   	v0.8.3
prometheus-stack  	monitoring  	1       	2021-03-10 21:08:35.889548922 +0000 UTC	pending-install	kube-prometheus-stack-14.0.1	0.46.0
tempo             	monitoring  	1       	2021-03-10 20:53:42.279556436 +0000 UTC	pending-install	tempo-distributed-0.8.5     	0.6.0

And the HelmReleases:

Every 5.0s: kubectl get helmrelease -n monitoring                                                                                                                                    tardis.Home: Wed Mar 10 21:14:39 2021

NAME                 READY   STATUS                                                                             AGE
loki                 False   Helm upgrade failed: another operation (install/upgrade/rollback) is in progress   20m
prometheus-adapter   False   Helm upgrade failed: another operation (install/upgrade/rollback) is in progress   20m
prometheus-stack     False   Helm upgrade failed: another operation (install/upgrade/rollback) is in progress   16m
tempo                False   Helm upgrade failed: another operation (install/upgrade/rollback) is in progress   20m

After deleting a HelmRelease so that it can be recreated, the kustomize-controller crashes:

kustomize-controller-689774778b-rqhsq manager E0310 21:17:29.520573       6 leaderelection.go:361] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/7593cc5d.fluxcd.io": context deadline exceeded
kustomize-controller-689774778b-rqhsq manager I0310 21:17:29.520663       6 leaderelection.go:278] failed to renew lease flux-system/7593cc5d.fluxcd.io: timed out waiting for the condition
kustomize-controller-689774778b-rqhsq manager {"level":"error","ts":"2021-03-10T21:17:29.520Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}

Running helm uninstall for the pending-install releases sometimes seems to solve the problem, but most of the time the controllers are still crashing:

helm-controller-75bcfd86db-4mj8s manager E0310 22:20:31.375402       6 leaderelection.go:361] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/5b6ca942.fluxcd.io": context deadline exceeded
helm-controller-75bcfd86db-4mj8s manager I0310 22:20:31.375495       6 leaderelection.go:278] failed to renew lease flux-system/5b6ca942.fluxcd.io: timed out waiting for the condition
helm-controller-75bcfd86db-4mj8s manager {"level":"error","ts":"2021-03-10T22:20:31.375Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}
- helm-controller-75bcfd86db-4mj8s › manager
+ helm-controller-75bcfd86db-4mj8s › manager
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.976Z","logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"setup","msg":"starting manager"}
helm-controller-75bcfd86db-4mj8s manager I0310 22:20:41.977697       7 leaderelection.go:243] attempting to acquire leader lease flux-system/5b6ca942.fluxcd.io...
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","msg":"starting metrics server","path":"/metrics"}
helm-controller-75bcfd86db-4mj8s manager I0310 22:21:12.049163       7 leaderelection.go:253] successfully acquired lease flux-system/5b6ca942.

Try k describe helmreleases <therelease> and look at the events. In my case I believe it was caused by:

Events:
  Type    Reason  Age                  From             Message
  ----    ------  ----                 ----             -------
  Normal  info    47m (x3 over 47m)    helm-controller  HelmChart 'flux-system/postgres-operator-postgres-operator' is not ready
  Normal  error   26m (x4 over 42m)    helm-controller  reconciliation failed: Helm upgrade failed: timed out waiting for the condition

I did a helm upgrade by hand, and then it reconciled in Flux too.
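
For completeness, the by-hand fix in a case like this is roughly the following sketch, with placeholder names; the chart and values have to match what the HelmRelease points at:

# Upgrade the stuck release manually, then let Flux take over again
helm upgrade <release> <repo>/<chart> -n <namespace>

# Once the release is back in a "deployed" state, reconcile the HelmRelease
flux reconcile hr <release> -n <namespace>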