azure-container-networking: Azure CNI IP addresses exhausted, old pod IPs not being reclaimed?
I have a Kubernetes 1.7.5 cluster which has run out of IPs to allocate, even though I only have about 15 pods running. Scheduling a new deployment fails. The events on the pod to be scheduled are:
default 2017-09-28 03:57:02 -0400 EDT 2017-09-28 03:57:02 -0400 EDT 1 hello-4059723819-8s35v Pod spec.containers{hello} Normal Pulled kubelet, k8s-agentpool1-18117938-2 Successfully pulled image "myregistry.azurecr.io/mybiz/hello"
default 2017-09-28 03:57:02 -0400 EDT 2017-09-28 03:57:02 -0400 EDT 1 hello-4059723819-8s35v Pod spec.containers{hello} Normal Created kubelet, k8s-agentpool1-18117938-2 Created container
default 2017-09-28 03:57:03 -0400 EDT 2017-09-28 03:57:03 -0400 EDT 1 hello-4059723819-8s35v Pod spec.containers{hello} Normal Started kubelet, k8s-agentpool1-18117938-2 Started container
default 2017-09-28 03:57:13 -0400 EDT 2017-09-28 03:57:01 -0400 EDT 2 hello-4059723819-tj043 Pod Warning FailedSync kubelet, k8s-agentpool1-18117938-3 Error syncing pod
default 2017-09-28 03:57:13 -0400 EDT 2017-09-28 03:57:02 -0400 EDT 2 hello-4059723819-tj043 Pod Normal SandboxChanged kubelet, k8s-agentpool1-18117938-3 Pod sandbox changed, it will be killed and re-created.
default 2017-09-28 03:57:24 -0400 EDT 2017-09-28 03:57:01 -0400 EDT 3 hello-4059723819-tj043 Pod Warning FailedSync kubelet, k8s-agentpool1-18117938-3 Error syncing pod
default 2017-09-28 03:57:25 -0400 EDT 2017-09-28 03:57:02 -0400 EDT 3 hello-4059723819-tj043 Pod Normal SandboxChanged kubelet, k8s-agentpool1-18117938-3 Pod sandbox changed, it will be killed and re-created.
[...]
The dashboard looks like this (screenshot not included in this copy):
Eventually, the dashboard shows the error:
Error: failed to start container "hello": Error response from daemon: {"message":"cannot join network of a non running container: 7e95918c6b546714ae20f12349efcc6b4b5b9c1e84b5505cf907807efd57525c"}
The kubelet log on the node shows that an IP address cannot be allocated due to “No available addresses”:
E0928 20:54:01.733682 1750 pod_workers.go:182] Error syncing pod 65127a94-a425-11e7-8d64-000d3af4357e ("hello-4059723819-xx16n_default(65127a94-a425-11e7-8d64-000d3af4357e)"), skipping: failed to "CreatePodSandbox" for "hello-4059723819-xx16n_default(65127a94-a425-11e7-8d64-000d3af4357e)" with CreatePodSandboxError: "CreatePodSandbox for pod \"hello-4059723819-xx16n_default(65127a94-a425-11e7-8d64-000d3af4357e)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"hello-4059723819-xx16n_default\" network: Failed to allocate address: Failed to delegate: Failed to allocate address: No available addresses"
I did have a cron job running on this cluster every minute, which would have created (and destroyed) a lot of pods in a short amount of time.
Is this perhaps a bug in Azure CNI not reclaiming IP addresses from terminated containers?
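One way to check for leaked allocations is to inspect the Azure CNI IPAM state file on the affected node (`/var/run/azure-vnet-ipam.json`) and count how many addresses are still marked in use versus how many pods are actually running. The sketch below is a rough illustration only: the nested `IPAM → AddressSpaces → Pools → Addresses` layout and the per-address `InUse` flag are assumptions about the file format and may differ between plugin versions, and the sample JSON is fabricated for the example.

```python
import json

# Fabricated sample mimicking the assumed layout of
# /var/run/azure-vnet-ipam.json; on a real node you would instead do:
#   state = json.load(open("/var/run/azure-vnet-ipam.json"))
sample = """
{
  "IPAM": {
    "AddressSpaces": {
      "local": {
        "Pools": {
          "10.240.0.0/16": {
            "Addresses": {
              "10.240.0.10": {"InUse": true},
              "10.240.0.11": {"InUse": true},
              "10.240.0.12": {"InUse": false}
            }
          }
        }
      }
    }
  }
}
"""

state = json.loads(sample)
in_use = total = 0
# Walk every address space and pool, tallying addresses marked InUse.
for space in state["IPAM"]["AddressSpaces"].values():
    for pool in space["Pools"].values():
        for addr, info in pool["Addresses"].items():
            total += 1
            in_use += bool(info.get("InUse"))

print(f"{in_use}/{total} addresses marked InUse")
```

If the in-use count far exceeds the number of running pods on the node, that would be consistent with addresses not being reclaimed.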
How to reproduce it (as minimally and precisely as possible):
I believe this can be reproduced simply by starting and stopping many containers, for example by enabling the batch/v2alpha1 API and running a CronJob every few seconds.
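The reproduction described above could be sketched as a manifest like the following (a hypothetical example: the name, schedule, and `busybox` image are placeholders, and on Kubernetes 1.7 the apiserver must be started with `--runtime-config=batch/v2alpha1=true` for this API group to be available):

```yaml
# Hypothetical repro manifest: churns a short-lived pod every minute,
# consuming and (supposedly) releasing a CNI-allocated IP each run.
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: ip-churn
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: hello
            image: busybox
            command: ["true"]
```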
Anything else we need to know?: No
Environment:
- Kubernetes version (use `kubectl version`):
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.6", GitCommit:"4bc5e7f9a6c25dc4c03d4d656f2cefd21540e28c", GitTreeState:"clean", BuildDate:"2017-09-14T06:55:55Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T08:56:23Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: Azure with the Azure CNI networking plugin
- OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.2 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
- Kernel (e.g. `uname -a`): Linux k8s-master-18117938-0 4.4.0-96-generic #119-Ubuntu SMP Tue Sep 12 14:59:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: ACS-Engine
- Others: Restarting the master has no effect
This is essentially a copy of upstream bug https://github.com/kubernetes/kubernetes/issues/53226.
About this issue
- State: closed
- Created 7 years ago
- Reactions: 2
- Comments: 23 (10 by maintainers)
@sharmasushant Following up on this, thanks!