kubernetes: Hanging in postStart hook can cause pod to get stuck in ContainerCreating state with no logs/event info
We’re continuing to observe 5-10% failure rates when creating pods, with them hanging in the ContainerCreating state. These are 2-container pods with a postStart lifecycle hook, but I don’t believe that’s implicated here (no problems in kubelet.log). One other detail: pods stuck in ContainerCreating do not respond to a default delete command, but do terminate with the --now flag or when a 0-second grace period is specified.
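For reference, the deletion behavior looks roughly like this (the pod name is hypothetical; these are just the standard kubectl delete variants, run against a live cluster):

```shell
# Hangs indefinitely for a pod stuck in ContainerCreating:
kubectl delete pod my-stuck-pod

# Either of these terminates the pod immediately (hard kill):
kubectl delete pod my-stuck-pod --now
kubectl delete pod my-stuck-pod --grace-period=0
```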
We’ve run into this issue across various 1.3.x versions, including 1.3.4.
The only observable issue in kubelet.log (this taken from an install using kube-aws and k8s 1.3.4) is repeated remounting of the secrets volume (for, I believe, the DNS pod):
Aug 04 00:26:43 ip-172-20-0-7.ec2.internal kubelet-wrapper[1469]: I0804 00:26:43.157398 1469 reconciler.go:254] MountVolume operation started for volume "kubernetes.io/secret/dc220eac-59d4-11e6-823f-12cddf49f625-default-token-0bpst" (spec.Name: "default-token-0bpst") to pod "dc220eac-59d4-11e6-823f-12cddf49f625" (UID: "dc220eac-59d4-11e6-823f-12cddf49f625"). Volume is already mounted to pod, but remount was requested.
Aug 04 00:26:43 ip-172-20-0-7.ec2.internal kubelet-wrapper[1469]: I0804 00:26:43.159847 1469 operation_executor.go:740] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/dc220eac-59d4-11e6-823f-12cddf49f625-default-token-0bpst" (spec.Name: "default-token-0bpst") pod "dc220eac-59d4-11e6-823f-12cddf49f625" (UID: "dc220eac-59d4-11e6-823f-12cddf49f625").
Aug 04 00:27:11 ip-172-20-0-7.ec2.internal kubelet-wrapper[1469]: I0804 00:27:11.218422 1469 reconciler.go:254] MountVolume operation started for volume "kubernetes.io/secret/dc2248b4-59d4-11e6-823f-12cddf49f625-default-token-0bpst" (spec.Name: "default-token-0bpst") to pod "dc2248b4-59d4-11e6-823f-12cddf49f625" (UID: "dc2248b4-59d4-11e6-823f-12cddf49f625"). Volume is already mounted to pod, but remount was requested.
Aug 04 00:27:11 ip-172-20-0-7.ec2.internal kubelet-wrapper[1469]: I0804 00:27:11.221004 1469 operation_executor.go:740] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/dc2248b4-59d4-11e6-823f-12cddf49f625-default-token-0bpst" (spec.Name: "default-token-0bpst") pod "dc2248b4-59d4-11e6-823f-12cddf49f625" (UID: "dc2248b4-59d4-11e6-823f-12cddf49f625").
Any suggestions for further debugging here? This is likely related to https://github.com/kubernetes/kubernetes/issues/29059 and the other issues referenced there.
For our use case (spinning up containers on demand for users), this is quite a nasty bug: pod creation occasionally fails to complete in a reasonable timeframe, and cleanup then requires a hard kill.
About this issue
- State: closed
- Created 8 years ago
- Comments: 17 (8 by maintainers)
So what’s the solution? Should I change the apiVersion?
On further investigating this issue, it does appear that our containers are hanging due to the postStart lifecycle hook as of 1.3.4 (though on prior 1.3.x versions we had this problem w/o postStart hooks). Investigating (1) how to make our postStart hook more robust; and (2) where to find logging information for it…
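On point (1), one sketch of a more robust hook: wrap the hook body so its runtime is bounded and its output lands somewhere inspectable. The script path, timeout value, and log location below are assumptions for illustration, not from the original report:

```shell
#!/bin/sh
# Hypothetical postStart wrapper. Bounding the hook's runtime with
# `timeout` prevents a hang from blocking the container indefinitely,
# and redirecting output gives a place to look for hook logs, since
# postStart output is not captured in the container's own logs.
# `|| true` keeps a hook failure/timeout from killing the container.
timeout 30 /opt/setup.sh >/tmp/poststart.log 2>&1 || true
```

The wrapper would then be referenced from the pod spec's lifecycle.postStart.exec command in place of the raw setup script.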