azure-container-networking: Azure CNI IP addresses exhausted, old pod IPs not being reclaimed?

I have a Kubernetes 1.7.5 cluster that has run out of IPs to allocate, even though only about 15 pods are running. Scheduling a new deployment fails. The events for the pod being scheduled are:

default   2017-09-28 03:57:02 -0400 EDT   2017-09-28 03:57:02 -0400 EDT   1         hello-4059723819-8s35v   Pod       spec.containers{hello}   Normal    Pulled    kubelet, k8s-agentpool1-18117938-2   Successfully pulled image "myregistry.azurecr.io/mybiz/hello"
default   2017-09-28 03:57:02 -0400 EDT   2017-09-28 03:57:02 -0400 EDT   1         hello-4059723819-8s35v   Pod       spec.containers{hello}   Normal    Created   kubelet, k8s-agentpool1-18117938-2   Created container
default   2017-09-28 03:57:03 -0400 EDT   2017-09-28 03:57:03 -0400 EDT   1         hello-4059723819-8s35v   Pod       spec.containers{hello}   Normal    Started   kubelet, k8s-agentpool1-18117938-2   Started container
default   2017-09-28 03:57:13 -0400 EDT   2017-09-28 03:57:01 -0400 EDT   2         hello-4059723819-tj043   Pod                 Warning   FailedSync   kubelet, k8s-agentpool1-18117938-3   Error syncing pod
default   2017-09-28 03:57:13 -0400 EDT   2017-09-28 03:57:02 -0400 EDT   2         hello-4059723819-tj043   Pod                 Normal    SandboxChanged   kubelet, k8s-agentpool1-18117938-3   Pod sandbox changed, it will be killed and re-created.
default   2017-09-28 03:57:24 -0400 EDT   2017-09-28 03:57:01 -0400 EDT   3         hello-4059723819-tj043   Pod                 Warning   FailedSync   kubelet, k8s-agentpool1-18117938-3   Error syncing pod
default   2017-09-28 03:57:25 -0400 EDT   2017-09-28 03:57:02 -0400 EDT   3         hello-4059723819-tj043   Pod                 Normal    SandboxChanged   kubelet, k8s-agentpool1-18117938-3   Pod sandbox changed, it will be killed and re-created.
[...]

The dashboard looks like this:

[dashboard screenshot]

Eventually, the dashboard shows the error:

Error: failed to start container "hello": Error response from daemon: {"message":"cannot join network of a non running container: 7e95918c6b546714ae20f12349efcc6b4b5b9c1e84b5505cf907807efd57525c"}
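The "cannot join network of a non running container" message suggests the new container is being attached to the network namespace of a sandbox (pause) container that has already exited. If it helps with diagnosis, here is a rough check on the affected node; this is only a sketch, and it assumes Docker is the container runtime and that you can SSH to the agent node:

# State of the sandbox container referenced in the error message.
docker inspect -f '{{.State.Status}} {{.Name}}' 7e95918c6b546714ae20f12349efcc6b4b5b9c1e84b5505cf907807efd57525c

# Exited pause/sandbox containers that may still be associated with network resources.
docker ps -a --filter "status=exited" | grep -i pause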

The kubelet log on the node shows that an IP address cannot be allocated due to “No available addresses”:

E0928 20:54:01.733682    1750 pod_workers.go:182] Error syncing pod 65127a94-a425-11e7-8d64-000d3af4357e ("hello-4059723819-xx16n_default(65127a94-a425-11e7-8d64-000d3af4357e)"), skipping: failed to "CreatePodSandbox" for "hello-4059723819-xx16n_default(65127a94-a425-11e7-8d64-000d3af4357e)" with CreatePodSandboxError: "CreatePodSandbox for pod \"hello-4059723819-xx16n_default(65127a94-a425-11e7-8d64-000d3af4357e)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"hello-4059723819-xx16n_default\" network: Failed to allocate address: Failed to delegate: Failed to allocate address: No available addresses"
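To see what the Azure CNI plugin itself thinks is allocated, it may be useful to dump its state files on the affected node. This is a sketch: the paths below are where I understand the azure-vnet and azure-vnet-ipam plugins persist their state, so treat them as assumptions and adjust if your build stores them elsewhere.

# Pretty-print the IPAM store (address pools and per-address allocation records).
sudo cat /var/run/azure-vnet-ipam.json | python3 -m json.tool | less

# Pretty-print the network/endpoint store (one endpoint per pod network attachment).
sudo cat /var/run/azure-vnet.json | python3 -m json.tool | less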

I did have a cron job running on this cluster every minute, which would have created (and destroyed) a lot of pods in a short amount of time.

Could this be a bug where Azure CNI does not reclaim IP addresses from terminated containers?
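One way to test that hypothesis is to compare how many addresses the IPAM store still marks as in use with how many pods are actually running on the node; a large gap would point at leaked allocations. A rough sketch, with the same caveat that the state-file path and the "InUse" field name are assumptions about the on-disk format:

# Pods actually scheduled on the suspect node (run wherever kubectl is configured).
kubectl get pods --all-namespaces -o wide | grep k8s-agentpool1-18117938-3 | wc -l

# Addresses the Azure CNI IPAM store still marks as in use (run on the node itself).
sudo grep -o '"InUse": *true' /var/run/azure-vnet-ipam.json | wc -l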

How to reproduce it (as minimally and precisely as possible):

I believe this can be reproduced simply by starting and stopping many containers in quick succession, for example by enabling the batch/v2alpha1 API and running a CronJob every minute.
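A minimal reproduction along those lines might look like the following. This is a sketch: the CronJob name and image are placeholders, and it assumes the apiserver has batch/v2alpha1 enabled (e.g. via --runtime-config=batch/v2alpha1=true).

# Start a short-lived pod every minute; each run allocates and releases a pod IP,
# which should eventually exhaust the node's address pool if released IPs are never reclaimed.
cat <<EOF | kubectl create -f -
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: ip-churn
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: churn
            image: busybox
            command: ["sh", "-c", "echo hello"]
          restartPolicy: OnFailure
EOF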

Anything else we need to know?: No

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.6", GitCommit:"4bc5e7f9a6c25dc4c03d4d656f2cefd21540e28c", GitTreeState:"clean", BuildDate:"2017-09-14T06:55:55Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T08:56:23Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Azure with the Azure CNI networking plugin
  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.2 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
  • Kernel (e.g. uname -a): Linux k8s-master-18117938-0 4.4.0-96-generic #119-Ubuntu SMP Tue Sep 12 14:59:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: ACS-Engine
  • Others: Restarting the master has no effect

This is essentially a copy of upstream bug https://github.com/kubernetes/kubernetes/issues/53226.

Most upvoted comments

@sharmasushant Following up on this, thanks!