helm: helm upgrade > timeout on pre-upgrade hook > revision stuck in `PENDING_UPGRADE` and multiple `DEPLOYED` revisions arise soon
Reproduction and symptom
Run `helm upgrade` with a Helm pre-upgrade hook that times out:
`Error: UPGRADE FAILED: timed out waiting for the condition`
Then check `helm history my-release-name`. The last line reads:
`22 Wed Aug 29 17:59:48 2018 PENDING_UPGRADE jupyterhub-0.7-04ccf1a Preparing upgrade`
Expected outcome
The revision should end up in `FAILED` rather than `PENDING_UPGRADE`, right?
About this issue
- State: open
- Created 6 years ago
- Reactions: 71
- Comments: 60 (13 by maintainers)
Commits related to this issue
- Do not exit if one component fails Helm releases would stay in a "pending" change if the installer exits early. Maybe related to https://github.com/helm/helm/issues/4558 — committed to epinio/installer by manno 3 years ago
- AppFwk: Recover apply from helm operation in progress It is observed that when a helm release is in pending state, another helm release can't be started by FluxCD. FluxCD will not try to do steps to ... — committed to starlingx/config by deleted user 2 years ago
This happened to me when I SIGTERMd an upgrade. I solved it by deleting the helm secret associated with this release, e.g.
If you are deleting the secret in a pipeline, you could use the following before deploying:
`kubectl -n NS delete secret -l name=release-name,status=pending-upgrade`
That way you don’t need to query for the version when deleting the secret.
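For a pipeline step, the command above could be wrapped roughly as follows. This is only a sketch: the function name `clear_pending_upgrade` is made up for illustration, and the extra `owner=helm` label is an assumption about how Helm 3 labels its release secrets, added so that unrelated secrets sharing the `name` label are not touched.

```shell
# Remove only the Helm release secrets stuck in pending-upgrade.
# Arguments: namespace, release name (both supplied by the caller).
clear_pending_upgrade() {
  ns="$1"
  release="$2"
  # Assumption: Helm 3 release secrets carry owner=helm in addition to name/status.
  selector="owner=helm,name=${release},status=pending-upgrade"

  # Print what matches first so the pipeline log records what was removed.
  kubectl -n "$ns" get secret -l "$selector" -o name
  kubectl -n "$ns" delete secret -l "$selector"
}
```

Usage would be something like `clear_pending_upgrade NS release-name` right before the `helm upgrade` step. If nothing is pending, the selector matches nothing and no secret is deleted.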
We have the same problem in our GitLab pipelines. The workaround (running rollback) is not a good solution for prod CI/CD pipelines.
I’ve just run into this issue and worked around it by performing a `helm rollback` to a previous release as follows:
problem:
fix:
Can we add a parameter to helm to control whether to continue execution or return an error message when the pending-upgrade state appears?
The best “manual” solution:
It should be the latest release that’s blocking. You can use these to match up and check release versions. Delete the secret or rollback. The fastest is to just delete the secret and run a fresh build on your pipeline.
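To match up revisions before deciding between deleting the secret and rolling back, something along these lines may help. It is a sketch, not the commands from the comment above: `show_release_revisions` is a hypothetical helper, and it assumes Helm 3 with the default secrets storage, where each revision is stored as a secret labelled with `owner`, `name`, `status` and `version`.

```shell
# Show Helm's view of a release next to the secrets that back it.
# Arguments: namespace, release name.
show_release_revisions() {
  ns="$1"
  release="$2"

  # What Helm reports for the most recent revisions.
  helm history "$release" -n "$ns" --max 5

  # The backing secrets, oldest first, with status and version labels as columns.
  kubectl -n "$ns" get secret -l "owner=helm,name=${release}" \
    -L status,version --sort-by=.metadata.creationTimestamp
}
```

The last row of both listings should be the blocking revision; if its status column shows `pending-upgrade`, that is the secret to delete (or the revision to roll back from).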
We are running into this same issue with helm 3. The pipeline gets canceled and the helm operation is stuck in pending-upgrade. The current workaround for running a rollback does work but it isn’t that great for an automated pipeline unless we add a check before to make sure to “rollback” before deploy.
Is there any way to just bypass the “pending-upgrade” status on a new deploy without running a rollback?
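The "check before deploy" idea could look roughly like this in a pipeline. It is a sketch under assumptions: `rollback_if_pending` is a made-up name, it relies on `helm list --pending` reporting releases in any pending state, and it relies on `helm rollback` without a revision going back to the previous revision, which is assumed here to be a good one.

```shell
# Roll back a release only if Helm currently reports it as pending.
# Arguments: namespace, release name.
rollback_if_pending() {
  ns="$1"
  release="$2"

  # -q prints release names only; --filter takes a regex, anchored to this release.
  pending="$(helm list -n "$ns" --pending -q --filter "^${release}\$")"

  if [ -n "$pending" ]; then
    echo "release ${release} is pending; rolling back before deploy" >&2
    helm rollback "$release" -n "$ns"
  fi
}
```

Calling `rollback_if_pending NS release-name` before `helm upgrade` leaves a healthy release alone. Note that it cannot tell a stuck operation from one that another job is still running, so it has the same race discussed elsewhere in this thread.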
Same problem, coming here searching for a reason/fix 👍
Same problem here. Our pipeline gets canceled when there is a new version running and afterwards we can’t deploy anymore because of
`Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress.`
How about recording the timeout flag to the release data (if it isn’t there already)? That way, if a release has a timeout of `N` minutes and was started more than `N` minutes ago, then we could treat it as failed, not pending. This behavior could be optional behind a flag.
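The rule proposed here is simple arithmetic, sketched below. This is not something Helm does today; `is_stale_pending` is a hypothetical helper, and the caller has to supply both the start time of the pending revision and the timeout the upgrade was run with.

```shell
# Succeeds (exit 0) when a pending operation has outlived its timeout.
#   $1 = epoch seconds at which the pending revision was recorded
#   $2 = timeout, in seconds, that the upgrade was started with
#   $3 = current epoch seconds (optional, defaults to the local clock)
is_stale_pending() {
  started="$1"
  timeout="$2"
  now="${3:-$(date +%s)}"
  [ $((now - started)) -gt "$timeout" ]
}
```

For example, `is_stale_pending "$started" 1200 && echo "treat as failed"` would flag a revision that has been pending for more than 20 minutes. The comparison is only meaningful if the clock that recorded the start time and the clock running the check agree.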
In my case, the issue was related to the lack of permissions for the role that was performing helm upgrade. I resolved it by adding the “update” verb to all resources that have to be changed during the deployment, example:
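One way to check for this situation before deploying is `kubectl auth can-i`. The following is a sketch only: `check_update_permission` is a hypothetical helper, the subject and resource names are placeholders, and using `--as` assumes the caller is itself allowed to impersonate the deploying identity.

```shell
# Report which resources the deploying identity may not "update".
# Arguments: namespace, subject to impersonate, then one or more resource types.
check_update_permission() {
  ns="$1"
  subject="$2"
  shift 2

  missing=0
  for resource in "$@"; do
    # kubectl auth can-i exits non-zero when the answer is "no".
    if ! kubectl auth can-i update "$resource" -n "$ns" --as "$subject" >/dev/null 2>&1; then
      echo "missing: update on ${resource}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A call such as `check_update_permission NS system:serviceaccount:NS:deployer deployments configmaps` (with `deployer` standing in for whatever account runs Helm) fails if any of the listed resources lacks the verb.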
How to get an undeployable Deployment deployed
I tried to create a minimalistic reproduction and ended up with something slightly different but I bet that this is related.
The following chart’s Deployment should never be deployed, right? Because it has a hook that should keep running forever. But it will be deployed if you run two upgrades in succession and a hook resource with the same name already exists and is about to terminate.
Chart.yaml:
templates/deployment.yaml:
templates/job.yaml:
Reproduction commands:
Use this before any upgrade/install (maybe already posted in this issue) :
I initially found it here : https://github.com/helm/helm/issues/5595#issuecomment-700742563
Deleting the secret removes the release in status=pending-upgrade, which means future executions of the Helm command will start the 3-way merge process from the previous release.
In the use case you are reporting, it will have the nasty side effect of deleting a release that a helm command is still manipulating.
But I think that is the same side effect as in what you are proposing. So let’s check this scenario:
- Alice starts an upgrade of release foo with the wait flag.
- Meanwhile, Bob starts another upgrade of release foo with your switch to continue execution.
That means Bob’s release will overwrite Alice’s upgrade. So what happens to Alice’s upgrade? Will it fail? And for Bob’s release, will the 3-way merge use the previous successful release or the pending one?
To clarify because others have asked: yes, this is a bug. PRs are welcome to help fix/mitigate this issue. It appears https://github.com/helm/helm/pull/9180 has stalled out, so if someone wants to help out @Moser-ss with a fix, please feel free.
For now, `helm rollback` will help you return to a known working state before re-attempting an upgrade.
You can recreate this. I’m using helm `v3.5.4` to do this.
`ps -ax | grep [h]elm|cut -d' ' -f1|xargs kill -9`
Why is this an issue? Because any additional install, upgrade, etc. will fail with this error.
`737 Thu Jun 17 08:14:33 2021 pending-upgrade pagerinc-11.15.2 11.15.2 Preparing upgrade`
What makes this difficult for developers? We have everything in CI tools, and when a build or run is canceled, punted, times out, etc., it puts helm in this state. The issue is that it now requires manual intervention to correct. Is there a solution that we can build into our pipelines to bypass or correct this? Something along the lines of ignoring a pending build that is older than a configurable time, like 20 minutes.
previously I was able to reproduce the reported behavior by:
The above resulted in `pending-upgrade` status when checked with `helm history`. Using the same steps with helm `v3.8.0` results in status `failed` when checked with `helm history`, based on about a dozen tests since Friday, and I’m yet to experience the `pending-upgrade` status.
There is no way to stop releases getting stuck in pending upgrade state. The reason is that the local helm client is responsible for updating the progress of a chart upgrade by writing to the k8s API. If the network connection drops, the client exits unexpectedly, the k8s API stops responding, etc., then the “pending upgrade” status simply cannot be updated to “failed”, because there is nothing left to do the update.
https://github.com/helm/helm/issues/4558#issuecomment-1004477657 seems like a reasonable way of handling this situation IMO, but there is no way to stop it from happening in the first place.
@Artemkulish yes, however that seems to be more related to the discussion in #7139. This ticket discusses issues when a timeout occurs during an upgrade. #7139 discusses issues around improper role-based access controls in place.
We are running on Helm 3.4.1 and are running into the same issue as here from time to time. Worth mentioning that the previous version 3.3.x had no such trouble with the deployments… Can someone from the Helm team take a look at this and give an update or something?
This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.
If you know the previous timeout, it is better to delete only the secrets that are pending and have exceeded that timeout, to avoid an unnecessary race condition.
something like
provided the clocks are in sync
If you deploy the same module in production multiple times at the exact same time, you have bigger problems than this one, my friend. For other environments, just deploy again.
Before avoiding problems that occur once in a million, there are other everyday problems to solve, generally speaking 😉
Brilliant, this is exactly what I was looking for.
2+ really required
1+ really required
Yes, this issue still exists with the new version. We got the same with `v3.8.2`:
`version.BuildInfo{Version:"v3.8.2", GitCommit:"6e3701edea09e5d55a8ca2aae03a68917630e91b", GitTreeState:"clean", GoVersion:"go1.17.5"}`
It didn’t fix it for me. I’ve cancelled a deployment using `v3.8.2` and it still got stuck on `pending-upgrade`.