kubernetes: Hanging in postStart hook can cause pod to get stuck in ContainerCreating state with no logs/event info
We’re continuing to observe 5-10% failure rates when creating pods, with them hanging in the ContainerCreating state. These are 2-container pods with a postStart lifecycle hook, but I don’t believe that’s implicated here (no problems in kubelet.log). One other detail: pods stuck in ContainerCreating do not respond to a default delete command, but do terminate with the --now flag or when a 0-second grace period is specified.
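For reference, the deletion behavior looks roughly like this (the pod name is hypothetical; these are just the standard kubectl delete variants, run against a live cluster):

```shell
# Hangs indefinitely for a pod stuck in ContainerCreating:
kubectl delete pod my-stuck-pod

# Either of these terminates the pod immediately (hard kill):
kubectl delete pod my-stuck-pod --now
kubectl delete pod my-stuck-pod --grace-period=0
```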
We’ve run into this issue across various 1.3.x versions, including 1.3.4.
The only observable issue in kubelet.log (this taken from an install using kube-aws and k8s 1.3.4) is repeated remounting of the secrets volume (for, I believe, the DNS pod):
Aug 04 00:26:43 ip-172-20-0-7.ec2.internal kubelet-wrapper[1469]: I0804 00:26:43.157398 1469 reconciler.go:254] MountVolume operation started for volume "kubernetes.io/secret/dc220eac-59d4-11e6-823f-12cddf49f625-default-token-0bpst" (spec.Name: "default-token-0bpst") to pod "dc220eac-59d4-11e6-823f-12cddf49f625" (UID: "dc220eac-59d4-11e6-823f-12cddf49f625"). Volume is already mounted to pod, but remount was requested.
Aug 04 00:26:43 ip-172-20-0-7.ec2.internal kubelet-wrapper[1469]: I0804 00:26:43.159847 1469 operation_executor.go:740] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/dc220eac-59d4-11e6-823f-12cddf49f625-default-token-0bpst" (spec.Name: "default-token-0bpst") pod "dc220eac-59d4-11e6-823f-12cddf49f625" (UID: "dc220eac-59d4-11e6-823f-12cddf49f625").
Aug 04 00:27:11 ip-172-20-0-7.ec2.internal kubelet-wrapper[1469]: I0804 00:27:11.218422 1469 reconciler.go:254] MountVolume operation started for volume "kubernetes.io/secret/dc2248b4-59d4-11e6-823f-12cddf49f625-default-token-0bpst" (spec.Name: "default-token-0bpst") to pod "dc2248b4-59d4-11e6-823f-12cddf49f625" (UID: "dc2248b4-59d4-11e6-823f-12cddf49f625"). Volume is already mounted to pod, but remount was requested.
Aug 04 00:27:11 ip-172-20-0-7.ec2.internal kubelet-wrapper[1469]: I0804 00:27:11.221004 1469 operation_executor.go:740] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/dc2248b4-59d4-11e6-823f-12cddf49f625-default-token-0bpst" (spec.Name: "default-token-0bpst") pod "dc2248b4-59d4-11e6-823f-12cddf49f625" (UID: "dc2248b4-59d4-11e6-823f-12cddf49f625").
Any suggestions for further debugging here? This is likely related to https://github.com/kubernetes/kubernetes/issues/29059 and the other issues referenced there.
For our use case (spinning up containers on demand for users), this is quite a nasty bug: pod creation occasionally fails to complete in a reasonable timeframe, and cleanup then requires a hard kill.
About this issue
- State: closed
- Created 8 years ago
- Comments: 17 (8 by maintainers)
So what’s the solution? Should I change the apiVersion?
On further investigating this issue, it does appear that our containers are hanging due to the postStart lifecycle hook as of 1.3.4 (though on prior 1.3.x versions we had this problem w/o postStart hooks). Investigating (1) how to make our postStart hook more robust; and (2) where to find logging information for it…
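On point (1), one sketch of a more robust hook: wrap the hook body so its runtime is bounded and its output lands somewhere inspectable. The script path, timeout value, and log location below are assumptions for illustration, not from the original report:

```shell
#!/bin/sh
# Hypothetical postStart wrapper. Bounding the hook's runtime with
# `timeout` prevents a hang from blocking the container indefinitely,
# and redirecting output gives a place to look for hook logs, since
# postStart output is not captured in the container's own logs.
# `|| true` keeps a hook failure/timeout from killing the container.
timeout 30 /opt/setup.sh >/tmp/poststart.log 2>&1 || true
```

The wrapper would then be referenced from the pod spec's lifecycle.postStart.exec command in place of the raw setup script.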