helm: Helm delete/upgrade command hangs
The issue started all of a sudden. In case it means anything, the app on which `helm delete` hung for the first time had a StatefulSet in it.
The following are the version details:
Client: &version.Version{SemVer:"v2.4.2", GitCommit:"82d8e9498d96535cc6787a6a9194a76161d29b4c", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.4.2", GitCommit:"82d8e9498d96535cc6787a6a9194a76161d29b4c", GitTreeState:"clean"}
Let me know if there are any other details I can provide.
About this issue
- State: closed
- Created 7 years ago
- Reactions: 27
- Comments: 52 (22 by maintainers)
Commits related to this issue
- fix(tiller): remove locking system from storage and rely on backend controls Tiller currently hangs indefinitely when deadlocks arise from certain concurrent operations. This commit removes the neste... — committed to jascott1/helm by jascott1 7 years ago
- fix(tiller): remove locking system from storage and rely on backend controls Tiller currently hangs indefinitely when deadlocks arise from certain concurrent operations. This commit removes the neste... — committed to helm/helm by jascott1 7 years ago
I fixed the issue by deleting the `tiller-deploy` pod and letting it create a new one on its own. Everything works again.

Some more information:
- Installing a new chart with `install` works.
- Then `upgrade` of the exact same chart with the same information hangs, just like everything else.
- Installing a new chart with `upgrade --install` also works.
- Then `upgrade` of the exact same chart with the same information hangs, just like everything else.
- `delete`s don't work at all.
@technosophos It would be great if this could make it into the next point release, but I'm not sure if we can get others to test this before then.
Would it also be possible to get this relabeled as a bug?
There is a race condition with the `Storage.releaseLocksLock.Lock()`. If you execute `helm delete <releaseName> --purge`, hit `ctrl-c`, and then immediately issue another delete command before the first one returns, you can get into a deadlock where subsequent requests cannot get a lock.

To reproduce with an example chart named “testchart” (can be any chart):
The delete commands will hang and after this you will not be able to delete any charts until Tiller is restarted.
I've been looking into it heavily, and after asking around, the response was that our `mutex protecting a map of mutexes` might be a bad idea in the first place. I'm still baffled because I cannot yet reproduce this in an extracted example, so any help is appreciated.

It's going out to lunch here: https://github.com/kubernetes/helm/blob/master/pkg/storage/storage.go#L163
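For readers trying to picture the pattern, here is a minimal sketch of the general shape under discussion (illustrative names only, not the actual storage.go code): a mutex guarding a map of per-release mutexes, where holding the outer lock while waiting on an inner one can wedge everything.

```go
// Minimal sketch, not Tiller's code: a mutex guarding a map of per-release
// mutexes. If the outer lock is held while waiting on a per-release lock that
// an earlier, interrupted operation never released, every subsequent request,
// for any release, queues behind the outer lock and Tiller appears to hang.
package locksketch

import "sync"

type releaseLocks struct {
	mu    sync.Mutex             // "table level" lock guarding the map
	locks map[string]*sync.Mutex // one "row level" lock per release
}

func newReleaseLocks() *releaseLocks {
	return &releaseLocks{locks: map[string]*sync.Mutex{}}
}

func (r *releaseLocks) lock(name string) {
	r.mu.Lock()
	defer r.mu.Unlock() // outer lock held for the whole call

	l, ok := r.locks[name]
	if !ok {
		l = &sync.Mutex{}
		r.locks[name] = l
	}

	// If this per-release lock is still held (e.g. by a delete that was
	// interrupted with ctrl-c before it could unlock), we block here while
	// still holding r.mu, so operations on unrelated releases block too.
	l.Lock()
}
```

With this shape, one stuck release is enough to wedge the whole storage layer, which matches the symptom of every command hanging until Tiller is restarted.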
The only errors I can see in the `tiller-deploy` logs are these (snippets):

I noticed that the kubernetes api pod did complain about a duplicated `NodePort` at the same time. This has been OK for a while, but it showed up as an error. The port is duplicated because one is `TCP` and one is `UDP`, so it shouldn't have problems binding them. I commented out the `UDP` port anyhow to see if that makes it better.

Hey, sorry for interjecting here, but I created a smaller PR that seems to fix the issue while retaining the locking feature: https://github.com/kubernetes/helm/pull/2712
The approach I described above (removing locks from tiller’s storage) seems to be working. Concurrent requests do what is expected to the state of the release system but the failure messages could be better for the subsequent concurrent requests. For example, running two concurrent upgrade commands will yield the successful response for the first and the following for the second:
The above response is confusing because release “test” should exist since this is an upgrade. If we include the version by using the key instead of `Release.Name`, we could at least show
which would at least give some indication that it might be a version-specific update problem.
Another minor hiccup is that two concurrent delete requests both report success when in reality the first causes the deletion and the second merely picks up on the same status.
Two concurrent deletes with `--purge` will result in the second failing, since the release has already been sent for deletion.

The resulting state from all ops seems to be fine. I will post a PR and do some more strenuous concurrent testing.
I'm convinced the hangs are the result of the `mutex locking a map of mutexes` in Store. It's simply not safe for concurrent access, as subsequent requests hit locks at different times and cause deadlocks. It's something like this, but varies:

On first examination (and due to my initial lack of understanding of sync) it seemed it should all shake out, so I spent a lot of time trying to fix the current system, but I am now convinced it cannot be fixed as designed.

Essentially the upgrade, delete and rollback funcs all take a lock on the map of release mutexes, and then lock the target release mutex. During this time they attempt to do WAY TOO MANY THINGS, including requests to k8s etc., but that isn't the actual problem. The existing usage of locks attempts to block subsequent operations on the same release, but in the end results in a deadlocked table-level lock or a deadlocked row-level lock.

As far as I can tell, the only state Tiller permanently stores is Release ConfigMaps in k8s. K8s already has some resource update controls in place: https://kubernetes.io/docs/api-reference/v1.7/#resource-operations
“Replace - For read-then-write operations this is safe because an optimistic lock failure will occur if the resource was modified between the read and write.”
“Patches - Patches will never cause optimistic locking failures, and the last write will win.”
I'm still looking into how Tiller is actually conducting updates to k8s, but my current intuition is to just ensure we take advantage of optimistic locking in k8s, send requests without Tiller locking anything, and return failures when they occur. If that's not sufficient we will need another design.
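To make that idea concrete, here is a rough sketch (using today's client-go API rather than Tiller's internal storage driver) of what leaning on the API server's optimistic locking could look like; the function name, namespace handling and STATUS label are assumptions for illustration, not Helm's actual schema:

```go
// Rough sketch only: rely on the API server's optimistic locking
// (resourceVersion) instead of Tiller-side mutexes, and surface a conflict
// as an error rather than blocking.
package releasestore

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// updateReleaseStatus is illustrative; the name and the STATUS label are
// assumptions made for this example.
func updateReleaseStatus(ctx context.Context, cs kubernetes.Interface, ns, cmName, status string) error {
	// Read the release record (a ConfigMap) and remember its resourceVersion.
	cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, cmName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if cm.Labels == nil {
		cm.Labels = map[string]string{}
	}
	cm.Labels["STATUS"] = status

	// The Update call carries cm.ResourceVersion; if another writer modified
	// the ConfigMap in the meantime, the API server answers 409 Conflict
	// instead of letting us silently overwrite it.
	if _, err := cs.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
		if apierrors.IsConflict(err) {
			return fmt.Errorf("release %q was modified concurrently, please retry: %w", cmName, err)
		}
		return err
	}
	return nil
}
```

One could retry on conflict instead (client-go ships `retry.RetryOnConflict` for that), but the point here is just that the failure mode becomes an error the client sees rather than a hang.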
I ran into the same issue here as well.
The only thing I did differently from the others here is that I tried to delete my deployment within a minute of running the deploy command.
Here is the exact install command.
Then I tried to delete it using `helm del consul`, which never completed. I had to ^C out of it and retry, which is when I started running into the issue.

We are also experiencing this issue with the following version of helm.
We have been able to move past this by deleting the tiller pod and letting the deployment recreate it, but there is no indication of it failing. Initially we thought this might be an issue with the resource utilization of the tiller pod. I have attached a graph of resource utilization on the tiller right up until the crash.
There is also nothing in the tiller logs to indicate an issue.
I was also able to resolve the issue by deleting the tiller deployment on kubernetes and then running helm init.