helm: Helm delete/upgrade command hangs

The issue started all of a sudden. In case it means anything, the app on which helm delete hung for the first time had a StatefulSet in it.

The following are the version details:

Client: &version.Version{SemVer:"v2.4.2", GitCommit:"82d8e9498d96535cc6787a6a9194a76161d29b4c", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.4.2", GitCommit:"82d8e9498d96535cc6787a6a9194a76161d29b4c", GitTreeState:"clean"}

Let me know if there are any other details I can provide.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 27
  • Comments: 52 (22 by maintainers)

Most upvoted comments

I fixed the issue by deleting the tiller-deploy pod and letting it create a new one on its own. Everything works again.

Some more information:

  • Installing a new chart with install works

  • Then upgrading the exact same chart with the same information hangs, just like everything else.

  • Installing a new chart with upgrade --install also works

  • Then upgrading the exact same chart with the same information hangs, just like everything else.

  • deletes don't work at all

@technosophos It would be great if this could make it into the next point release, but I'm not sure if we can get others to test this before then.

Would it be also possible to get this relabeled as a bug?

There is a race condition with the Storage.releaseLocksLock.Lock(). If you execute helm delete <releaseName> --purge, hit Ctrl-C, and then immediately issue another delete command before the first one returns, you can get into a deadlock where subsequent requests cannot acquire a lock.

To reproduce with an example chart named “testchart” (can be any chart)

helm install . --name testchart
helm delete testchart --purge
<Ctrl-C>
helm delete testchart --purge

The delete commands will hang and after this you will not be able to delete any charts until Tiller is restarted.

I've been looking into it heavily, and after asking around the response was that our mutex protecting a map of mutexes might be a bad idea in the first place. I'm still baffled because I cannot yet reproduce this in an extracted example, so any help is appreciated.

The only errors I can see in the tiller-deploy logs are these (snippets):

....
2017/08/06 20:14:37 client.go:398: Looks like there are no changes for PersistentVolumeClaim "log-1"
2017/08/06 20:14:37 client.go:386: generating strategic merge patch for *unstructured.Unstructured
2017/08/06 20:14:37 client.go:251: error updating the resource "dns":
         Patch https://10.96.0.1:443/api/v1/namespaces/infra/services/dns: unexpected EOF
2017/08/06 20:14:37 release_server.go:329: warning: Upgrade "infra" failed: Could not get information about the resource: err: Get https://10.96.0.1:443/apis/extensions/v1beta1/namespaces/infra/deployments/dns: read tcp 10.36.0.2:45962->10.96.0.1:443: read: connection reset by peer
2017/08/06 20:14:37 storage.go:59: Updating "infra" (v81) in storage
2017/08/06 20:14:37 cfgmaps.go:322: configmaps: update: failed to update: Put https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/infra.v81: read tcp 10.36.0.2:46052->10.96.0.1:443: read: connection reset by peer
2017/08/06 20:14:37 release_server.go:855: warning: Failed to update release "infra": Put https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/infra.v81: read tcp 10.36.0.2:46052->10.96.0.1:443: read: connection reset by peer
2017/08/06 20:14:37 storage.go:51: Create release "infra" (v82) in storage
2017/08/06 20:14:37 cfgmaps.go:322: configmaps: create: failed to create: Post https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps: read tcp 10.36.0.2:46062->10.96.0.1:443: read: connection reset by peer
2017/08/06 20:14:37 release_server.go:858: warning: Failed to record release "infra": Post https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps: read tcp 10.36.0.2:46062->10.96.0.1:443: read: connection reset by peer
2017/08/06 20:14:37 grpc: Server.processUnaryRPC failed to write status: stream error: code = Canceled desc = "context canceled"
2017/08/06 20:14:46 storage.go:139: Getting release history for 'infra'
....

I noticed that the Kubernetes API pod did complain about a duplicated NodePort at the same time. This has been OK for a while, but it showed up as an error. The port is duplicated because one is TCP and one is UDP, so it shouldn't have problems binding them. I commented out the UDP port anyhow to see if that makes it better.

Hey, sorry for interjecting here, but I created a smaller PR that seems to fix the issue while retaining the locking feature: https://github.com/kubernetes/helm/pull/2712

The approach I described above (removing locks from Tiller's storage) seems to be working. Concurrent requests leave the release system in the expected state, but the failure messages for the subsequent concurrent requests could be better. For example, running two concurrent upgrade commands will yield the successful response for the first and the following for the second:

Error: UPGRADE FAILED: release: "test" already exists

The above response is confusing because release “test” should exist since this is an upgrade. If we include the version by using the key instead of Release.Name, we could at least show

Error: UPGRADE FAILED: release: "test.v7" already exists

which would at least give some indication that it might be a version specific update problem.

Another minor hiccup is that two concurrent delete requests both report success, when in reality the first causes the deletion and the second merely picks up on the same status.

release "test" deleted

Two concurrent deletes with --purge will result in the second failing since the release has already been sent for deletion.

# for second request 
Error: release: "test.v3" not found

The resulting state from all ops seems to be fine. I will post PR and do some more strenuous concurrent testing.

I'm convinced the hangs are the result of a mutex locking a map of mutexes in Store. It's simply not safe for concurrent access, as subsequent requests hit locks at different times and cause deadlocks. It's something like this, but it varies:

reqA->wants to lock a release
reqA->LockLocks()
reqA->LockRelease()
reqA->UnlockLocks()
reqB->wants to lock a release
reqB->LockLocks() 
reqB->LockRelease() # Release mutex already locked. Hangs forever.
reqA->wants to unlock a release
reqA->LockLocks() # reqB still has LockLocks locked. Hangs forever.
reqC->wants to lock a release
reqC->LockLocks() # reqB still has LockLocks locked. Hangs forever.
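
A minimal, self-contained Go sketch of that pattern deadlocks in exactly this way (the type and method names are illustrative, not Tiller's actual code):

package main

import (
	"fmt"
	"sync"
	"time"
)

// storage mimics the pattern above: one mutex guarding a map of per-release
// mutexes. Names are illustrative, not Tiller's actual fields.
type storage struct {
	locksLock sync.Mutex
	locks     map[string]*sync.Mutex
}

// lockRelease takes the outer lock, then blocks on the per-release lock
// while still holding the outer lock if someone else owns the release.
func (s *storage) lockRelease(name string) {
	s.locksLock.Lock()
	l, ok := s.locks[name]
	if !ok {
		l = &sync.Mutex{}
		s.locks[name] = l
	}
	l.Lock() // reqB parks here while still holding locksLock
	s.locksLock.Unlock()
}

// unlockRelease also needs the outer lock first, so it can never run while
// another goroutine is parked inside lockRelease.
func (s *storage) unlockRelease(name string) {
	s.locksLock.Lock()
	s.locks[name].Unlock()
	s.locksLock.Unlock()
}

func main() {
	s := &storage{locks: map[string]*sync.Mutex{}}

	s.lockRelease("testchart") // reqA: holds the release lock

	go func() {
		s.lockRelease("testchart") // reqB: takes locksLock, waits on the release lock
	}()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("reqA trying to unlock the release...")
	s.unlockRelease("testchart") // reqA: blocks on locksLock held by reqB -> deadlock
	fmt.Println("never reached")
}

Running this exits with Go's "fatal error: all goroutines are asleep - deadlock!". In a real server there are always other live goroutines (the gRPC listener, for instance), so the runtime never detects it and the requests simply hang, which matches what the CLI sees.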

On first examination (and due to my initial lack of understanding of sync) it seems like it should all shake out, so I spent a lot of time trying to fix the current system, but I am now convinced it cannot be fixed as designed.

Essentially the upgrade, delete, and rollback funcs all take a lock on the map of release mutexes and then lock the target release mutex. During this time they attempt to do WAY TOO MANY THINGS, including requests to k8s, etc., but that isn't the actual problem. The existing usage of locks attempts to block subsequent operations on the same release, but in the end it results in a deadlocked table-level lock or a deadlocked row-level lock.

As far as I can tell, the only state Tiller permanently stores is Release ConfigMaps in k8s. K8s already has some resource update controls in place: https://kubernetes.io/docs/api-reference/v1.7/#resource-operations

“Replace - For read-then-write operations this is safe because an optimistic lock failure will occur if the resource was modified between the read and write.”

“Patches - Patches will never cause optimistic locking failures, and the last write will win.”

I'm still looking into how Tiller actually conducts updates to k8s, but my current intuition is to just make sure we take advantage of optimistic locking in k8s, send requests without Tiller locking anything, and return failures when they occur. If that's not sufficient, we will need another design.
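
To make that concrete, here is a rough sketch of what leaning on the apiserver's optimistic concurrency could look like with client-go's RetryOnConflict helper, with no Tiller-side locking at all. The function and its parameters are hypothetical, and it assumes a recent client-go where Get/Update take a context and options; Tiller's actual update path differs:

package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateReleaseConfigMap is a hypothetical helper: it re-reads the release
// ConfigMap and retries on resourceVersion conflicts instead of serializing
// requests behind an in-memory mutex. Not Tiller's actual code.
func updateReleaseConfigMap(client kubernetes.Interface, namespace, name string,
	mutate func(data map[string]string)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		cm, err := client.CoreV1().ConfigMaps(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if cm.Data == nil {
			cm.Data = map[string]string{}
		}
		mutate(cm.Data)
		// The apiserver rejects this update with a Conflict error if the
		// resourceVersion changed between the Get and the Update; in that
		// case RetryOnConflict re-runs the whole closure.
		_, err = client.CoreV1().ConfigMaps(namespace).Update(context.TODO(), cm, metav1.UpdateOptions{})
		return err
	})
}

Concurrent writers that lose the race get a Conflict back from the apiserver and either retry or surface the failure, instead of queueing behind an in-memory mutex.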

I ran into the same issue here as well.

Client: &version.Version{SemVer:"v2.5.0", GitCommit:"012cb0ac1a1b2f888144ef5a67b8dab6c2d45be6", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.5.0", GitCommit:"012cb0ac1a1b2f888144ef5a67b8dab6c2d45be6", GitTreeState:"clean"}

The only specific step I did differently from the others here is that I tried to delete my deployment within a minute of running the deploy command.

Here is the exact install command.

helm install -f values.yaml -n consul --namespace kube-system stable/consul

Then I tried to delete it using helm del consul, which never completed. I had to ^C out of it and retry, which is when I started running into the issue.

We are also experiencing this issue with the following version of helm.

Client: &version.Version{SemVer:"v2.5.0", GitCommit:"012cb0ac1a1b2f888144ef5a67b8dab6c2d45be6", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.5.0", GitCommit:"012cb0ac1a1b2f888144ef5a67b8dab6c2d45be6", GitTreeState:"clean"}

We have been able to move past this by deleting the tiller pod and letting the deployment rekick it, but there is no indication of it failing. Initially we thought this might be an issue with the resource utilization of the tiller pod. I have attached a graph of resource utilization on the tiller right up until the crash.

[screenshot: tiller pod resource utilization graph, 2017-07-11 5:09 PM]

There is also nothing in the tiller logs to indicate an issue.

I was also able to resolve the issue by deleting the tiller deployment on kubernetes and then running helm init.

# kubectl delete deployment -n=kube-system tiller-deploy
deployment "tiller-deploy" deleted

# helm init --upgrade
$HELM_HOME has been configured at /root/.helm.

Tiller (the helm server side component) has been upgraded to the current version.
Happy Helming!

# helm delete somethingNotThere --purge
Error: Unable to lock release somethingNotThere: release not found