kops: Pod sandbox issues and stuck at ContainerCreating
- What kops version are you running? The command kops version will display this information.
Version 1.8.0 (git-5099bc5)
- What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.5", GitCommit:"cce11c6a185279d037023e02ac5249e14daa22bf", GitTreeState:"clean", BuildDate:"2017-12-07T16:16:03Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.5", GitCommit:"cce11c6a185279d037023e02ac5249e14daa22bf", GitTreeState:"clean", BuildDate:"2017-12-07T16:05:18Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
- What cloud provider are you using?
AWS
- What commands did you run? What is the simplest way to reproduce this issue?
Applying any configuration that creates a new container leaves the pod stuck in ContainerCreating status, followed by sandbox failures (see logs).
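For reference, a minimal way to trigger it, assuming the nginx manifest at the end of this report is saved as nginx-deployment.yaml (the file name is just an example):
kubectl apply -f nginx-deployment.yaml
kubectl get pods -l app=nginx -w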
- What happened after the commands executed?
First the pod stays in ContainerCreating for a long time; afterwards there are several sandbox errors (see logs).
- What did you expect to happen?
A normal deployment, without the hang.
- Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-09-07T12:12:21Z
  name: kubernetes.xxx.xxx
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    alwaysAllow: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://xxx-kops/kubernetes.xxx.xxx
  dnsZone: Z16C10VSQ4D9E
  docker:
    logDriver: ""
    storage: overlay2
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-central-1a
      name: a
    - instanceGroup: master-eu-central-1b
      name: b
    - instanceGroup: master-eu-central-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-eu-central-1a
      name: a
    - instanceGroup: master-eu-central-1b
      name: b
    - instanceGroup: master-eu-central-1c
      name: c
    name: events
  iam:
    legacy: true
  kubeAPIServer:
    runtimeConfig:
      batch/v2alpha1: "true"
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.8.5
  masterInternalName: api.internal.kubernetes.xxx.xxx
  masterPublicName: api.kubernetes.xxx.xxx
  networkCIDR: 172.20.0.0/16
  networking:
    weave:
      mtu: 8912
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: eu-central-1a
    type: Private
    zone: eu-central-1a
  - cidr: 172.20.64.0/19
    name: eu-central-1b
    type: Private
    zone: eu-central-1b
  - cidr: 172.20.96.0/19
    name: eu-central-1c
    type: Private
    zone: eu-central-1c
  - cidr: 172.20.0.0/22
    name: utility-eu-central-1a
    type: Utility
    zone: eu-central-1a
  - cidr: 172.20.4.0/22
    name: utility-eu-central-1b
    type: Utility
    zone: eu-central-1b
  - cidr: 172.20.8.0/22
    name: utility-eu-central-1c
    type: Utility
    zone: eu-central-1c
  topology:
    bastion:
      bastionPublicName: bastion.kubernetes.xxx.xxx
    dns:
      type: Public
    masters: private
    nodes: private
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-09-07T12:12:21Z
  labels:
    kops.k8s.io/cluster: kubernetes.xxx.xxx
  name: bastions
spec:
  image: kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-07-28
  machineType: t2.micro
  maxSize: 1
  minSize: 1
  role: Bastion
  subnets:
  - utility-eu-central-1a
  - utility-eu-central-1b
  - utility-eu-central-1c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-09-07T12:12:21Z
  labels:
    kops.k8s.io/cluster: kubernetes.xxx.xxx
  name: master-eu-central-1a
spec:
  image: kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-07-28
  machineType: m4.large
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-central-1a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-09-07T12:12:21Z
  labels:
    kops.k8s.io/cluster: kubernetes.xxx.xxx
  name: master-eu-central-1b
spec:
  image: kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-07-28
  machineType: m4.large
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-central-1b
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-09-07T12:12:21Z
  labels:
    kops.k8s.io/cluster: kubernetes.xxx.xxx
  name: master-eu-central-1c
spec:
  image: kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-07-28
  machineType: m4.large
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-central-1c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-12-13T20:46:55Z
  labels:
    beta.kubernetes.io/fluentd-ds-ready: "true"
    kops.k8s.io/cluster: kubernetes.xxx.xxx
  name: nodes-base
spec:
  image: kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-12-02
  machineType: m4.2xlarge
  maxSize: 4
  minSize: 4
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-base
  role: Node
  subnets:
  - eu-central-1a
  - eu-central-1b
  - eu-central-1c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-22T10:26:14Z
  labels:
    beta.kubernetes.io/fluentd-ds-ready: "true"
    kops.k8s.io/cluster: kubernetes.xxx.xxx
  name: nodes-general-purpose
spec:
  image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-01-05
  machineType: m4.2xlarge
  maxSize: 0
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-general-purpose
  role: Node
  subnets:
  - eu-central-1a
  - eu-central-1b
  - eu-central-1c
- Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
Jan 23 14:20:07 ip-172-20-96-21 kubelet[9704]: I0123 14:20:07.640211 9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:20:12 ip-172-20-96-21 kubelet[9704]: E0123 14:20:12.259174 9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:20:12 ip-172-20-96-21 kubelet[9704]: E0123 14:20:12.259200 9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:20:17 ip-172-20-96-21 kubelet[9704]: I0123 14:20:17.667982 9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:20:20 ip-172-20-96-21 kubelet[9704]: I0123 14:20:20.553308 9704 qos_container_manager_linux.go:320] [ContainerManager]: Updated QoS cgroup configuration
Jan 23 14:20:22 ip-172-20-96-21 kubelet[9704]: E0123 14:20:22.352318 9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:20:22 ip-172-20-96-21 kubelet[9704]: E0123 14:20:22.352346 9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:20:22 ip-172-20-96-21 kubelet[9704]: I0123 14:20:22.512209 9704 server.go:779] GET /metrics: (32.208317ms) 200 [[Prometheus/1.8.1] 172.20.91.21:34458]
Jan 23 14:20:27 ip-172-20-96-21 kubelet[9704]: I0123 14:20:27.687652 9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:20:30 ip-172-20-96-21 kubelet[9704]: E0123 14:20:30.431621 9704 remote_runtime.go:115] StopPodSandbox "4799a6a8c867bc324480b64df7221f13d6b83e8171a14c527b9c0559cf4b6426" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 23 14:20:30 ip-172-20-96-21 kubelet[9704]: E0123 14:20:30.431669 9704 kuberuntime_manager.go:781] Failed to stop sandbox {"docker" "4799a6a8c867bc324480b64df7221f13d6b83e8171a14c527b9c0559cf4b6426"}
Jan 23 14:20:30 ip-172-20-96-21 kubelet[9704]: E0123 14:20:30.431708 9704 kubelet_pods.go:1063] Failed killing the pod "nginx-deployment-569477d6d8-jcbjz": failed to "KillPodSandbox" for "406a218c-0048-11e8-b572-026c39b367e0" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Jan 23 14:20:32 ip-172-20-96-21 kubelet[9704]: E0123 14:20:32.431681 9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:20:32 ip-172-20-96-21 kubelet[9704]: E0123 14:20:32.431728 9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:20:37 ip-172-20-96-21 kubelet[9704]: I0123 14:20:37.712114 9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:20:42 ip-172-20-96-21 kubelet[9704]: E0123 14:20:42.513956 9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:20:42 ip-172-20-96-21 kubelet[9704]: E0123 14:20:42.513986 9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:20:47 ip-172-20-96-21 kubelet[9704]: I0123 14:20:47.734079 9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:20:52 ip-172-20-96-21 kubelet[9704]: E0123 14:20:52.743703 9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:20:52 ip-172-20-96-21 kubelet[9704]: E0123 14:20:52.743728 9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:20:57 ip-172-20-96-21 kubelet[9704]: I0123 14:20:57.761509 9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:21:02 ip-172-20-96-21 kubelet[9704]: E0123 14:21:02.826386 9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:02 ip-172-20-96-21 kubelet[9704]: E0123 14:21:02.826413 9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:21:05 ip-172-20-96-21 kubelet[9704]: E0123 14:21:05.083079 9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:05 ip-172-20-96-21 kubelet[9704]: E0123 14:21:05.083105 9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:21:05 ip-172-20-96-21 kubelet[9704]: I0123 14:21:05.089787 9704 server.go:779] GET /stats/summary/: (61.282407ms) 200 [[Go-http-client/1.1] 172.20.66.110:33458]
Jan 23 14:21:07 ip-172-20-96-21 kubelet[9704]: I0123 14:21:07.780547 9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:21:12 ip-172-20-96-21 kubelet[9704]: E0123 14:21:12.904377 9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:12 ip-172-20-96-21 kubelet[9704]: E0123 14:21:12.904407 9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:21:17 ip-172-20-96-21 kubelet[9704]: I0123 14:21:17.800646 9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:21:20 ip-172-20-96-21 kubelet[9704]: I0123 14:21:20.554402 9704 qos_container_manager_linux.go:320] [ContainerManager]: Updated QoS cgroup configuration
Jan 23 14:21:22 ip-172-20-96-21 kubelet[9704]: I0123 14:21:22.502214 9704 server.go:779] GET /metrics: (9.704924ms) 200 [[Prometheus/1.8.1] 172.20.91.21:34458]
Jan 23 14:21:22 ip-172-20-96-21 kubelet[9704]: E0123 14:21:22.995951 9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:22 ip-172-20-96-21 kubelet[9704]: E0123 14:21:22.995978 9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:21:27 ip-172-20-96-21 kubelet[9704]: I0123 14:21:27.823773 9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:21:33 ip-172-20-96-21 kubelet[9704]: E0123 14:21:33.062525 9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:33 ip-172-20-96-21 kubelet[9704]: E0123 14:21:33.062556 9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:21:43 ip-172-20-96-21 kubelet[9704]: E0123 14:21:43.159664 9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:43 ip-172-20-96-21 kubelet[9704]: E0123 14:21:43.159715 9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:21:47 ip-172-20-96-21 kubelet[9704]: I0123 14:21:47.881168 9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:21:49 ip-172-20-96-21 kubelet[9704]: E0123 14:21:49.208647 9704 remote_runtime.go:115] StopPodSandbox "9dfd449d99efe66115045c5557efba54d57cab1b3617fb67fb412fc11487d266" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 23 14:21:49 ip-172-20-96-21 kubelet[9704]: E0123 14:21:49.208684 9704 kuberuntime_gc.go:152] Failed to stop sandbox "9dfd449d99efe66115045c5557efba54d57cab1b3617fb67fb412fc11487d266" before removing: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 23 14:21:53 ip-172-20-96-21 kubelet[9704]: E0123 14:21:53.238500 9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:53 ip-172-20-96-21 kubelet[9704]: E0123 14:21:53.238527 9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
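For completeness, this is roughly how the stuck sandboxes can be inspected directly on the affected node; the container ID is the sandbox ID taken from the StopPodSandbox error above:
docker ps -a | grep -i pause    # lists the pod sandbox ("pause") containers
docker inspect 4799a6a8c867bc324480b64df7221f13d6b83e8171a14c527b9c0559cf4b6426
journalctl -u docker.service --since "1 hour ago"    # Docker daemon logs around the failure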
- Anything else we need to know?
This started to happen suddenly a week ago. At first it manifested as slow deployments with intermittent sandbox failures. Three days later, deployments wouldn't finish anymore, always ending in sandbox errors. From what I've searched this is probably related to CNI, but all the issues I found are marked as fixed in 1.8.5, yet I still hit this problem.
I'm also using Weave Net 2.0.1.
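To double-check which Weave Net version is actually deployed and what its IPAM state looks like, something like this works (weave-net is the DaemonSet name kops uses; the pod name is a placeholder):
kubectl -n kube-system get daemonset weave-net -o wide
kubectl -n kube-system exec <weave-net-pod> -c weave -- /home/weave/weave --local status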
The deployment I used for this test is:
apiVersion: apps/v1beta2 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 3 # tells deployment to run 3 pods matching the template
  template: # create pods using pod definition in this template
    metadata:
      # unlike pod-nginx.yaml, the name is not included in the metadata as a unique name is
      # generated from the deployment name
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
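Once it is applied, the stuck replica's events show the sandbox failures; the pod name below is the one from the kubelet log further up and will differ on every run:
kubectl describe pod nginx-deployment-569477d6d8-jcbjz | tail -n 20
kubectl get events --sort-by=.lastTimestamp | grep -i sandbox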
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 15 (6 by maintainers)
See these two links: https://github.com/weaveworks/weave/issues/2797 https://github.com/weaveworks/weave/issues/2797
At the company I work for we just ran into this yesterday 😃
Basically weave does not reclaim unused IP addresses after nodes have been removed from the cluster. This was fixed in weave 2.1.1 but kops release 1.8.0 ships weave 2.0.5. The master branch in kops has weave 2.3.1, so we’re waiting for a new release. Meanwhile we’re removing unused peers.
kubectl exec -n kube-system {MASTER_NODE_WEAVE_POD_ID} -c weave -- /home/weave/weave --local status ipam
kubectl exec -n kube-system {MASTER_NODE_WEAVE_POD_ID} -c weave -- /home/weave/weave --local rmpeer {MAC_OF_UNREACHABLE_NODE}
The node on which you run rmpeer will claim the unused addresses, so we're running the command across the master nodes.
It might be as simple as enabling the CNI plugin ports. I am using weave-net, so the relevant ports are 6783/tcp, 6783/udp and 6784/udp on the master node(s) in your firewall.
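A quick way to sanity-check those ports between nodes (the IP is a placeholder; the UDP probes are best-effort, since netcat cannot reliably confirm an open UDP port):
nc -vz -w 2 <other-node-ip> 6783     # TCP control channel
nc -vzu -w 2 <other-node-ip> 6783    # UDP
nc -vzu -w 2 <other-node-ip> 6784    # UDP (fast datapath)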
Same issue with version 2.4.1
@yoz2326 Your steps fixed the issue for me. I had the issue after testing what would happen when I rebooted my nodes in a cluster. Note I’m actually using kubicorn with digitalocean, but I thought I’d post here to thank you and maybe help someone else who has the same issue 🙂
You mentioned though that this is fixed in Weave 2.1.1, but from what I can see this is still an issue when using:
For me, the output of the command was as follows:
I followed your second command and removed the two unreachable nodes (they are the same node, but after the reboot it appears to have got a different MAC address).
As soon as this happened the cluster sprang back into life.