kubernetes: Stateless services get stuck in ContainerCreating forever

What happened: A number of our crons are getting stuck in ContainerCreating. Most of them are stateless, or use a ConfigMap or a secret or two.

What you expected to happen: For the container to start, or for the pod to be rescheduled on another node.

How to reproduce it (as minimally and precisely as possible): We’re not certain how to reproduce this. It’s sporadic and hits about half of our crons at a time. We have a few machines with different specs, and it seems to happen more often on our lower-specced machines; a quick way to check the correlation is sketched below.
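A rough check of which nodes the stuck pods land on (the awk column index assumes the default kubectl get pods --all-namespaces -o wide output, where NODE is the 8th column):

# List every pod stuck in ContainerCreating together with the node it landed on
$ kubectl get pods --all-namespaces -o wide | grep ContainerCreating
# Count stuck pods per node
$ kubectl get pods --all-namespaces -o wide | grep ContainerCreating | awk '{print $8}' | sort | uniq -c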

Anything else we need to know?: The crons are mostly stateless. The container images are usually pretty big (0.5Gi), but each pod has maybe one ConfigMap or Secret, and those are always present.

Environment:

  • Kubernetes version (use kubectl version):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:53:57Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.2", GitCommit:"cff46ab41ff0bb44d8584413b598ad8360ec1def", GitTreeState:"clean", BuildDate:"2019-01-10T23:28:14Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: On-prem; a few nodes with 384Gi memory / Xeon Gold, one node with 60Gi memory / Intel® Xeon E5620, and several more with 65Gi memory / Intel® Xeon® CPU E5-2620
  • OS (e.g: cat /etc/os-release):
k8s-node $ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1911.4.0
VERSION_ID=1911.4.0
BUILD_ID=2018-11-26-1924
PRETTY_NAME="Container Linux by CoreOS 1911.4.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
  • Kernel (e.g. uname -a):
Linux k8s-node 4.14.81-coreos #1 SMP Mon Nov 26 18:51:57 UTC 2018 x86_64 Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz GenuineIntel GNU/Linux
  • Install tools: ansible

  • Network plugin and version (if this is a network-related bug): Flannel and CoreDNS (probably not the issue, but included anyway)

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 4
  • Comments: 32 (7 by maintainers)

Most upvoted comments

@DarrienG Yeah, as I expected, the problem here is that calls to docker time out while docker still tries to execute the operation and eventually succeeds. Kubernetes doesn’t know about this and retries, leading to the conflict. I’m afraid there is no easy fix for this: you’d need docker API semantics that guarantee that if a call times out, the operation is actually not executed. Your best options are simply avoiding such high load (which is what we did by upgrading to faster disks) or trying cri-o instead, which might help. Even if it doesn’t help, it’s more likely to get a proper solution there than with something Docker-based.
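A possible mitigation in the meantime (not confirmed by the comment above, and not a fix for the non-idempotent docker behaviour it describes) is giving the runtime more time before the kubelet gives up. The kubelet flag --runtime-request-timeout (default 2m) is real; the drop-in path, the KUBELET_EXTRA_ARGS variable, and the 10m value below are only illustrative assumptions for a systemd-managed kubelet:

# On an affected node; file name and environment variable are assumptions,
# adjust to however your ansible playbooks wire flags into the kubelet unit
$ cat /etc/systemd/system/kubelet.service.d/20-runtime-timeout.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--runtime-request-timeout=10m"
$ sudo systemctl daemon-reload && sudo systemctl restart kubelet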

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

@DarrienG Do you see container creation timeouts for that pod in the events or kubelet logs? Is it possible that you have very high IO load on the affected nodes?
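For reference, the checks being asked for look roughly like this (pod/namespace names are placeholders; this assumes the kubelet runs as a systemd unit, and iostat may need to be run from a toolbox container on Container Linux):

# Events for one stuck pod, where container creation timeouts usually show up
$ kubectl describe pod <stuck-pod> -n <namespace>
# On the node: kubelet log lines hinting at runtime call timeouts
$ journalctl -u kubelet --since "1 hour ago" | grep -iE 'deadline exceeded|timeout'
# Rough view of disk saturation while pods are stuck (extended stats, 5s intervals)
$ iostat -x 5 3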

I had the same issue due to containers with heavy write load on the container filesystem. This load caused docker’s create_container API calls to time out but eventually finish. Kubernetes appears to have assumed that the operation failed and tried to re-create the pod, leading to the error “Conflict. The name ... is already in use by container ...”, from which it didn’t recover.
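For pods already wedged on that error, a manual cleanup along these lines usually lets the kubelet recover; the container name and IDs below are placeholders taken from the error message:

# On the node named in the pod's events: find the half-created container
$ docker ps -a --filter "name=<name-from-error>" --format '{{.ID}} {{.Names}} {{.Status}}'
# Remove it so the kubelet's next creation attempt no longer conflicts
$ docker rm -f <container-id>
# Or delete the stuck pod and let the CronJob controller create a fresh one
$ kubectl delete pod <stuck-pod> -n <namespace>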