helm: helm upgrade > timeout on pre-upgrade hook > revision stuck in `PENDING_UPGRADE`, and multiple `DEPLOYED` revisions soon follow

Reproduction and symptom

  1. helm upgrade with a helm pre-upgrade hook that times out.
  2. Error: UPGRADE FAILED: timed out waiting for the condition.
  3. helm history my-release-name
    # the last line...
    22      	Wed Aug 29 17:59:48 2018	PENDING_UPGRADE	jupyterhub-0.7-04ccf1a 	Preparing upgrade
    

Expected outcome

The revision should end up in FAILED rather than PENDING_UPGRADE, right?

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 71
  • Comments: 60 (13 by maintainers)

Most upvoted comments

This happened to me when I SIGTERM’d an upgrade. I solved it by deleting the helm secret associated with this release, e.g.

$ k get secrets
NAME                                 TYPE                                  DATA   AGE
sh.helm.release.v1.app.v1            helm.sh/release.v1                    1      366d
sh.helm.release.v1.app.v2            helm.sh/release.v1                    1      331d
sh.helm.release.v1.app.v3            helm.sh/release.v1                    1      247d
sh.helm.release.v1.app.v4            helm.sh/release.v1                    1      77d
sh.helm.release.v1.app.v5            helm.sh/release.v1                    1      77d
sh.helm.release.v1.app.v6            helm.sh/release.v1                    1      15m
sh.helm.release.v1.app.v7            helm.sh/release.v1                    1      66s

$ k delete secret sh.helm.release.v1.app.v7

If you are deleting the secret in a pipeline, you could use the following before deploying:

kubectl -n NS delete secret -l name=release-name,status=pending-upgrade

That way you don’t need to query for the version when deleting the secret.
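If it helps, here is a rough sketch of how that can look as a pre-deploy cleanup step in a pipeline ($NAMESPACE, $RELEASE_NAME and the chart path are placeholders, not taken from this thread):

# remove any release record stuck in a pending state, then deploy as usual
kubectl -n "$NAMESPACE" delete secret -l "name=$RELEASE_NAME,status in (pending-install, pending-upgrade)" || true
helm upgrade "$RELEASE_NAME" ./chart -n "$NAMESPACE" --install --wait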

We have the same problem in our GitLab pipelines. The workaround (running rollback) is not a good solution for prod CI/CD pipelines.

Is there a workaround for this? Is upgrading to helm3 a solution?

I’ve just run into this issue and worked around it by performing a helm rollback to a previous release as follows:

problem:

26      	Mon Jun 15 14:13:24 2020	superseded     	elasticsearch-7.5.1	7.5.1      	Upgrade complete
27      	Mon Jun 15 17:52:09 2020	pending-upgrade	elasticsearch-7.5.1	7.5.1      	Preparing upgrade

fix:

$ helm rollback elasticsearch-release 26
Rollback was a success! Happy Helming!
$ helm history elasticsearch-release
26      	Mon Jun 15 14:13:24 2020	superseded     	elasticsearch-7.5.1	7.5.1      	Upgrade complete
27      	Mon Jun 15 17:52:09 2020	pending-upgrade	elasticsearch-7.5.1	7.5.1      	Preparing upgrade
28      	Tue Jun 23 14:51:11 2020	deployed       	elasticsearch-7.5.1	7.5.1      	Rollback to 26

Can we add a parameter to helm to control whether to continue execution or return an error message when the pending-upgrade state appears?

The best “manual” solution:

kubectl --namespace $NAMESPACE get secrets -l owner=helm
# could get really specific with owner=helm,status=pending-upgrade
helm --namespace NS history RELEASE

It should be the latest release that’s blocking. You can use these to match up and check release versions. Delete the secret or rollback. The fastest is to just delete the secret and run a fresh build on your pipeline.

kubectl --namespace $NAMESPACE delete secret sh.helm.release.v1.$RELEASE.v$OFFENDING_REVISION
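If you want to script the matching step instead of eyeballing helm history, a rough sketch (assuming jq is available, exactly one pending revision, and placeholder $RELEASE / $NAMESPACE variables):

# find the revision stuck in pending-upgrade and delete its release secret
PENDING_REV=$(helm --namespace "$NAMESPACE" history "$RELEASE" -o json | jq -r '.[] | select(.status == "pending-upgrade") | .revision')
kubectl --namespace "$NAMESPACE" delete secret "sh.helm.release.v1.$RELEASE.v$PENDING_REV"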

We are running into this same issue with helm 3. The pipeline gets canceled and the helm operation is stuck in pending-upgrade. The current workaround of running a rollback does work, but it isn’t great for an automated pipeline unless we add a pre-deploy check that rolls back first, roughly like the sketch below.

Is there any way to just bypass the “pending-upgrade” status on a new deploy without running a rollback?
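One way such a pre-deploy guard could look (only a sketch, assuming jq is available and $RELEASE / $NAMESPACE / the chart path are placeholders; helm status -o json exposes the release state):

# if the last operation is stuck in pending-upgrade, roll back to the previous revision first
STATUS=$(helm status "$RELEASE" -n "$NAMESPACE" -o json | jq -r '.info.status')
if [ "$STATUS" = "pending-upgrade" ]; then
  helm rollback "$RELEASE" -n "$NAMESPACE"
fi
helm upgrade "$RELEASE" ./chart -n "$NAMESPACE" --install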

Same problem, coming here searching for a reason/fix 👍

Same problem here. Our pipeline gets canceled while a new version is rolling out, and afterwards we can’t deploy anymore because of Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress.

How about recording the timeout flag to the release data (if it isn’t there already)? That way, if a release

  • has status pending
  • and has timeout of N minutes
  • but started over N minutes ago

then we could treat it as failed, not pending. This behavior could be optional behind a flag.

In my case, the issue was related to the lack of permissions for the role that was performing helm upgrade. I resolved it by adding the “update” verb to all resources that have to be changed during the deployment, example:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ci-cd-cluster-role
rules:
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["create","get","list","update","watch"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create","delete","get","list","patch","update","watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get","create","patch","update"]

How to get an undeployable Deployment deployed

I tried to create a minimal reproduction and ended up with something slightly different, but I bet this is related.

The following chart’s Deployment should never be deployed, right? Because it has a hook that should keep running for eternity. But it will be deployed if you run two upgrades in succession while a hook resource with the same name already exists and is about to terminate.

Chart.yaml:

apiVersion: v1
appVersion: "1.0"
description: A Helm chart for Kubernetes
name: issue-4558
version: 0.1.0

templates/deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: never-to-be-installed-deployment
spec:
  selector:
    matchLabels:
      dummy: dummy
  template:
    metadata:
      labels:
        dummy: dummy
    spec:
      containers:
        - name: never-to-be-installed-deployment
          image: "gcr.io/google_containers/pause:3.1"

templates/job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: never-finishing-job
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: never-finishing-job
          image: "gcr.io/google_containers/pause:3.1"

Reproduction commands:

helm upgrade issue . --install --namespace issue
# abort
helm upgrade issue . --install --namespace issue

(animated capture: messy-helm-upgrading)
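To inspect the resulting state, something like the following should show the Deployment created even though the pre-upgrade hook never finished, plus the release record stuck in a pending state (just the obvious inspection commands; nothing assumed beyond the names above):

helm history issue --namespace issue
kubectl get deployments,jobs --namespace issue
kubectl get secrets --namespace issue -l owner=helm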

Use this before any upgrade/install (maybe already posted in this issue):

# delete failed previous deployment if any (that would else require a helm delete)
kubectl -n $NAMESPACE delete secret -l "name=$APP_NAME,status in (pending-install, pending-upgrade)" || true

I initially found it here: https://github.com/helm/helm/issues/5595#issuecomment-700742563

Deleting the secret removes the release record that is in status=pending-upgrade, which means future executions of the Helm command will start the 3-way merge process with the previous release.

Any side-effects?

In the use case you are reporting, it will have a nasty side effect: it deletes a release record that a helm command is still manipulating.

But I think that is the same side effect as the one you are proposing. So let’s check this scenario: Alice starts an upgrade of release foo with the wait flag. Meanwhile, Bob starts another upgrade of release foo with your switch to continue execution. That means Bob’s release will overwrite Alice’s upgrade. So what happens to Alice’s upgrade? Will it fail? And for Bob’s release, will the 3-way merge use the previous successful release or the pending one?

To clarify because others have asked: yes, this is a bug. PRs are welcome to help fix/mitigate this issue. It appears https://github.com/helm/helm/pull/9180 has stalled out, so if someone wants to help out @Moser-ss with a fix, please feel free.

For now, helm rollback will help you return back to a known working state before re-attempting an upgrade.

You can recreate this. I’m using helm v3.5.4 to do this.

  1. Start the install, upgrade, etc in one terminal.
  2. After the pending status appears in helm history, run this in another terminal: ps -ax | grep [h]elm | cut -d' ' -f1 | xargs kill -9

Why is this an issue? Because any additional install, upgrade, etc. will then fail, while helm history shows:

737     Thu Jun 17 08:14:33 2021    pending-upgrade    pagerinc-11.15.2    11.15.2    Preparing upgrade

What makes this difficult for developers? We have everything in CI tools and when a build or run is canceled, punted, times out, etc, it puts helm in this state. The issue is that it now requires manual intervention to correct. Is there a solution that we can build into our pipelines to bypass or correct this? Something along the lines of ignoring a pending build that is more than a configurable time like 20 minutes.

Previously, I was able to reproduce the reported behavior by:

  1. starting a helm upgrade with --wait
  2. killing the process from another terminal

The above resulted in a pending-upgrade status when checked with helm history. Using the same steps with helm v3.8.0 results in a failed status when checked with helm history, based on about a dozen tests since Friday, and I’m yet to see the pending-upgrade status.

There is no way to stop releases getting stuck in pending upgrade state. The reason is that the local helm client is responsible for updating the progress of a chart upgrade by writing to the k8s API. If the network connection drops / client exits unexpectedly / k8s API stops responding etc. then the “pending upgrade” status simply cannot be updated to “failed”, because there is nothing to do the update.

https://github.com/helm/helm/issues/4558#issuecomment-1004477657 seems like a reasonable way of handling this situation IMO, but there is no way to stop it from happening in the first place.

How about recording the timeout flag to the release data (if it isn’t there already)? That way, if a release

  • has status pending
  • and has timeout of N minutes
  • but started over N minutes ago

then we could treat it as failed, not pending. This behavior could be optional behind a flag.

@Artemkulish yes, however that seems to be more related to the discussion in #7139. This ticket discusses issues when a timeout occurs during an upgrade. #7139 discusses issues around improper role-based access controls in place.

We are running on Helm 3.4.1 and are running into the same issue as here from time to time. Worth mentioning that the previous version 3.3.x had no such trouble with the deployments… Can someone from the Helm team take a look at this and give an update or something?

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

If you know the previous timeout, it is better to delete only the pending secrets that have exceeded that timeout, to avoid an unnecessary race condition.

something like

kubectl -n $NAMESPACE get secrets -o=custom-columns=NAME:.metadata.name,AGE:.metadata.creationTimestamp --no-headers=true --sort-by=.metadata.creationTimestamp -l "name=${RELEASE_NAME},status in (pending-install, pending-upgrade)" | awk '$2 < "'`date -d "${TIMEOUT_IN_MINS} minutes ago" -Ins --utc | sed 's/+0000/Z/'`'" { print $1 }' | xargs --no-run-if-empty kubectl delete secret -n $NAMESPACE 

provided the clocks are in sync

If you deploy the same module in production multiple times at the exact same time, you have bigger problems than this one, my friend. For other environments, just deploy again.

Before avoiding problems that occur once in a million, there are other everyday problems to solve, generally speaking 😉

Use this before any upgrade/install (maybe already posted in this issue):

# delete failed previous deployment if any (that would else require a helm delete)
kubectl -n $NAMESPACE delete secret -l "name=$APP_NAME,status in (pending-install, pending-upgrade)" || true

I initially found it here: #5595 (comment)

Brilliant, this is exactly what I was looking for.

2+ really required

1+ really required

Yes, this issue still exists with the new version. We got the same with v3.8.2: version.BuildInfo{Version:"v3.8.2", GitCommit:"6e3701edea09e5d55a8ca2aae03a68917630e91b", GitTreeState:"clean", GoVersion:"go1.17.5"}

It didn’t fix it for me; I cancelled a deployment using v3.8.2 and it still got stuck in pending-upgrade.

Can we add a parameter to helm to control whether to continue execution or return an error message when the pending-upgrade state appears?

What do you mean by that? A flag to force an upgrade when the pending-upgrade state appears? Isn’t that dangerous? The error appears because helm identifies another helm instance that is executing an upgrade. Basically, it is a mechanism to avoid data corruption.