moby: Docker stack fails to allocate IP on an overlay network, and tasks get stuck in the `NEW` state
Description: I have a Docker Swarm cluster with 5 managers and 4 worker nodes. A while ago, it was 3 managers and 4 worker nodes. We use immutable infrastructure for the hosts.
EDIT: I managed to get reproducible tests in https://github.com/moby/moby/issues/37338#issuecomment-437558916
The majority of our containers connect to a specific overlay network with a /24 CIDR:

    my_network:
      driver: overlay
      driver_opts:
        encrypted: ""
      ipam:
        driver: default
        config:
          - subnet: 10.100.2.0/24
Docker stack deployments happen all the time in this specific cluster.
Occasionally (and I cannot understand what causes it), the swarm is unable to allocate an IP for a new task: "Failed allocation for service <service>" error="could not find an available IP while allocating VIP"
So I assumed we had run out of IPs in the CIDR. But when I counted the number of tasks currently attached to the network, there were fewer than 40 running tasks. I also went and counted all the stopped/historical tasks, and counted all the IPs on that network; still, there were fewer than 120 IPs, a lot less than the 200-and-something I'd expect.
I tried to restrict the task history size, but that by itself didn't make any difference. I deleted almost all stacks, and some containers were able to get a new IP, but the problem manifested itself again as soon as everything was redeployed.
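For reference, restricting the task history (presumably the setting reflected as "Task History Retention Limit: 1" in the docker info output below) can be done with something like:

```
# Keep only 1 historical task per slot instead of the default 5
docker swarm update --task-history-limit 1
```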
I also looked at the NetworkDB stats while the problem was happening, and it was all lines like: NetworkDB stats <leader host>(<node>) - netID:<my network> leaving:false netPeers:8 entries:14 Queue qLen:0 netMsg/s:0
After we ‘recycled’ all the managers (including the leader), the problem appears to be resolved. All the tasks which were stuck then received a new IP.
NetworkDB stats <host>(<node>) - netID:<my network> leaving:false netPeers:7 entries:49 Queue qLen:0 netMsg/s:0
It appears that somehow some IPs are not returned to the pool, but I'm not even sure where to look for more information. Can anyone help me with how to investigate this problem?
My problem appears similar to what was described here: https://github.com/docker/for-aws/issues/104
Steps to reproduce the issue:
- Create a docker stack that connects to the /24 overlay network
- docker stack deploy -c file.yaml my-stack
- docker stack ps my-stack
Describe the results you received: Tasks get stuck in the ‘NEW’ state.
Describe the results you expected: If we have fewer than 200 containers attached to the /24 network, I'd expect the task to be running.
Additional information you deem important (e.g. issue happens only occasionally): We've seen this problem before.
The problem apparently persists for days. Eventually, after a few hours of waiting, some of the containers receive an IP and start. I've seen containers stuck in that state for more than a day.
Output of docker version:
$ docker version
Server:
Engine:
Version: 18.03.1-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:23:03 2018
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
Containers: 3
Running: 2
Paused: 0
Stopped: 1
Images: 3
Server Version: 18.03.1-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: okljpo50f39c74me2qzem67qw
Is Manager: true
ClusterID: 2ufszb0kyswdcmi7nzxfqjb47
Managers: 5
Nodes: 9
Orchestration:
Task History Retention Limit: 1
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Manager Addresses:
...
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.107-linuxkit
Operating System: Alpine Linux v3.7
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.951GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.): AWS.
About this issue
- Original URL
- State: open
- Created 6 years ago
- Reactions: 42
- Comments: 61
This is still an issue.
It's not about the ingress network itself, but about long-lived networks. Increasing the subnet size is just a workaround for insufficient garbage collection of IP pools.
This issue has been open for 3 years and nothing has changed, even though there are plenty of people having this issue.
@thaJeztah do you know who can help us look into this issue?
We just had the problem AGAIN! We were just resigning ourselves to rebuilding the cluster from scratch YET AGAIN, but this time "kill -9" of the affected manager recovered the managers (we run three dedicated managers). In our (long!) experience, though, we just kicked the can down the road! Now we will not be able to reliably bring up any new containers until we drain the entire cluster, restart Docker Swarm from scratch, and re-add all containers. And that's ABSURD!
FIVE YEARS this has been a problem, and the Docker team simply won’t fix it. Numerous people (including several on this very thread) have provided straightforward methods to reliably replicate the issue, and we all know what the underlying issue is (garbage cleanup of unused IPs). And the Docker team simply won’t fix it.
The promise of Docker Swarm is superb, way easier to deal with than Kubernetes. But we have rebuilt our production cluster from scratch countless times due to NOTHING working to recover it when Docker blows up in our faces. Demoting the manager leader, letting it clear out, and then repromoting it sometimes (rarely) solves the problem (temporarily, just kicking the inevitable can down the road). Sometimes "kill -9" of the affected manager process temporarily solves the problem (just kicking the inevitable can down the road, as we just did this time). But absolutely nothing reliably SOLVES the problem, and over the years we have REPEATEDLY had to rebuild the entire cluster from scratch to get our containers all back up. UNACCEPTABLE for production!
You simply cannot reliably stop and start containers or scale up additional containers as long as this CORE bug persists in Docker Swarm. Thus, Docker Swarm is NOT a reasonable choice for production environments.
It’s a real pain to switch to Kubernetes, but we can no longer endure spending an entire night getting our cluster back up after Docker Swarm blows up in our faces. And after FIVE YEARS of knowing about this (and how to replicate the problem!), the Docker team should be ashamed of itself for allowing this to persist with no fix.
We migrated the entire company to Kubernetes just because of this issue. It took about one month. Since this is a 3-year-old issue and nobody has the knowledge to fix it, I think a warning in the documentation is necessary, something like "we don't recommend using Docker in production with auto-spawning or auto-scaling, since its ingress network can only handle up to 256 containers, and removed or stopped ones still count" or something like that.
Any updates? 😦
I see the same behavior in a Docker Swarm cluster with Traefik, where every few weeks containers get stuck in the New state. Restarting the master node solves the issue temporarily. More frequent deployments seem to lead to this state more often. When this happens, the Docker IP Utilization Check Script doesn't report any IP address exhaustion on any network, though.
In my environment, we solved this by creating more networks and linking them to Traefik, so we could use another 254 available addresses for each network created. Example:
traefik-docker-compose.yml
my api.yml 1:
my front.yml after 254 services:
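The compose files themselves are not reproduced here; as a rough CLI-only sketch of the same idea (the network, service, and image names below are made up):

```
# Create an additional overlay network to get a fresh /24 pool of addresses
docker network create --driver overlay traefik-net-2

# Attach the existing Traefik service to the new network as well
docker service update --network-add traefik-net-2 traefik

# Deploy services beyond the first ~254 onto the new network instead
docker service create --name my-front-255 --network traefik-net-2 my-front-image
```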
This issue occurred for me after running a swarm cluster for a couple of months. We redeploy services frequently. I am not sure how the pool of IPs rotates in the swarm.
@kirk-wgt I am baffled that you and some others here still haven’t given up hope. We, too, started out with Docker Swarm and quickly saw all our production clusters crashing every few days because of this bug.
This was well over two years (!) ago. This was when I made the hard decision to migrate to K3s. We never looked back. K3s is the perfect replacement for Docker Swarm due to a similar deployment model and the integrated Klipper-LB that behaves very similarly to the routing mesh of Docker Swarm.
Just accept it: Docker Swarm is dead. Do not use it for anything else than simple throwaway clusters where you can count the number of containers on one hand.
Agreed, the issue itself is the IPs not being released after a service is no longer running, but the ingress network supports as many containers as you design it to support. You can easily remove the ingress network and change the default subnet mask from /24 to /16, increasing the number of containers to 65534. Perhaps the reason why this issue hasn't been solved yet is the burden of trying to replicate it quickly. Currently, with the default ingress network, you would need to up/down a container 256 times (or create a service with 256 replicas). Or, to encounter it even faster, you could configure the swarm with a smaller subnet, as mentioned here.
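A rough sketch of that reproduction idea (the image name, iteration count, and timing are placeholders; publishing a port is what attaches each service to the ingress network):

```
# Repeatedly create and remove a published service so that ingress IPs
# are allocated and (ideally) released on every iteration.
for i in $(seq 1 300); do
  docker service create --detach --quiet --name churn-$i \
    --publish $((8000 + i)):80 nginx:alpine
  sleep 5
  docker service rm churn-$i
done

# Once the pool is exhausted, new tasks hang in the NEW state and the managers
# log "could not find an available IP while allocating VIP".
```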
I have been running proxied services in dnsrr mode without problems for quite a while now. It does not fix the "new" state problem; it just reduces its occurrence, because far fewer IP addresses are consumed in each deployment: you save one VIP address for each service deployed behind Traefik. But there are also reasons to run a stack behind Traefik with a VIP, for example if you need HTTP(S) plus some node port mappings.
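For reference, a sketch of what dnsrr mode looks like per service (the service and image names are placeholders); note that dnsrr cannot be combined with ingress-mode published ports, which is why it pairs well with a proxy like Traefik:

```
# No VIP is allocated for this service; Traefik reaches the tasks via
# DNS round-robin on the shared overlay network instead.
docker service create \
  --name my-api \
  --network my_network \
  --endpoint-mode dnsrr \
  my-api-image
```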
Same issue here. We now have this problem once a week. We deploy about 5-7 new containers every hour and stop/remove the old ones. Only a restart of the Docker daemon helps.
Docker 19.03.6
When investigating the problem, we came up with this Python script (note: it uses TLS auth):
It will show all tasks which are supposedly still running but are assigned to nodes which do not exist. Unfortunately there's no fix. We tend to undeploy those stacks, and usually the orphan tasks go away.
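The script itself is not included above; a rough CLI-only equivalent of the described check (listing tasks that should be running but are bound to node IDs no longer in the cluster) might look like this:

```
#!/bin/sh
# Node IDs currently known to the swarm (run this on a manager)
known_nodes=$(docker node ls -q)

for svc in $(docker service ls -q); do
  for task in $(docker service ps -q --filter desired-state=running "$svc"); do
    node_id=$(docker inspect --format '{{.NodeID}}' "$task")
    # Flag tasks whose node ID is no longer part of the cluster
    if [ -n "$node_id" ] && ! printf '%s\n' "$known_nodes" | grep -qx "$node_id"; then
      echo "orphan task $task (service $svc) on missing node $node_id"
    fi
  done
done
```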
As you are surely finding, Celtech, and will continue to discover, there is no magic bullet.
Even at this late date, the Docker team can’t seem to track down and kill this core bug (or cascade of bugs) that make Swarm unreliable. Sometimes demotion/promotion of the leader will recover the cluster, sometimes a “kill -9” of the docker daemon will work, sometimes killing every container that can’t start (for whatever ridiculous reason) will work, sometimes “reinitializing” the cluster in-place will work, and many/most times nothing will work. Once you’re bitten by the “Docker Bug,” as we’ve come to call it, all bets are off. You may get your cluster back, and you likely will not. That is completely unacceptable for a production cluster.
We recently entirely gave up on Docker Swarm. Our new cluster runs on Kubernetes, and we’ve written scripts and templates for ourselves to reduce the network-stack management complexities to a manageable level for us.
In our opinion, Docker Swarm is not a production-ready containerization environment and never will be. You are on the right track, in our opinion, to cite “zombie tasks holding these IP’s hostage,” although no such tasks show up using PS. Our belief is that Docker doesn’t engage in robust and rapid garbage collection, and it doesn’t correctly honor the specified subnet value at initialization. But years of waiting and hoping have proved fruitless, and we finally had to go to something reliable (albeit harder to deal with).
I sincerely wish you all the best and good luck in your efforts with Docker Swarm! We were forced to abandon it.
@kirk-wgt I am not a k3s salesperson, so I won't try to persuade you to use one product over the other 😄 But I feel there is a misconception that we had at first, too.
K3s is a full-fledged Kubernetes distribution with many bells and whistles attached (like an integrated Traefik ingress controller). If anything, Docker Swarm is the lightweight solution when comparing the two. K3s is just as "beefy" as any other Kubernetes distribution (and many would argue that all Kubernetes distributions are too beefy anyway).
Yes, K3s markets itself for "the edge", but only because of its single-binary, zero-dependencies deployment model. There is nothing lightweight about it when compared to other k8s distributions (except maybe for the removed (non-CSI) storage drivers, which have all long been deprecated anyway).
I am sorry for being off-topic here and for triggering notifications for far too many people. I just cannot help but continue following this thread and grab a bag of popcorn whenever someone falls into the same pitfalls that we encountered two years ago.
Edit: Before someone accuses me of being partial to K3s (which I am, but not for commercial reasons): of course there are other k8s distributions that could be used in favor of Docker Swarm. Since we are talking Docker Swarm here, I should especially mention k0s, which is a Kubernetes distribution by Mirantis. But we made the conscious decision against using k0s after seeing how Mirantis handled (i.e., let die) Docker Swarm. That said, I've heard many good things about k0s.
What will it take for this glaring defect to get any attention from the Docker developers? Is there a docker PM who I can nudge?
I am now in the unenviable position of trying to deploy our production-ready system atop Docker Swarm, which is unreliable for production. I just burned a couple of weeks trying various hacks that didn't work. Too late to switch to k8s right now.
Ran into this issue as well and was able to fix it temporarily with a restart. At the end of the day, ~~the problem lies~~ one of the other problems lies with the ingress network, since the ingress network has a default subnet of 10.0.0.0/24, which according to this subnet mask table gives a maximum of 256 addresses (254 usable) for services connected to the ingress network.
After your ingress is created, try removing it and creating your own ingress network with a subnet that provides room for more services. For example:
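The original example is not shown here; a sketch of what it might look like, following the approach in the Docker docs (the subnet values are just an illustration):

```
# Remove the default ingress network first (no service may be publishing
# ports through it while you do this)
docker network rm ingress

# Recreate it with a /16 so far more addresses are available
docker network create \
  --driver overlay \
  --ingress \
  --subnet 10.11.0.0/16 \
  --gateway 10.11.0.1 \
  ingress
```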
EDIT: Yes, admittedly, the real issue is the garbage collection, but if people at least know how to extend the number of services, they can reduce the occurrence.
Didn’t help or did not help for long:
It helped to recreate the swarm with the "--default-addr-pool-mask-length 16" parameter.
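For reference, a sketch of that re-initialisation (the base address pool shown is just the default made explicit):

```
# New swarm whose automatically created networks get /16 subnets
# carved out of 10.0.0.0/8 instead of the default /24
docker swarm init \
  --default-addr-pool 10.0.0.0/8 \
  --default-addr-pool-mask-length 16
```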
Most likely, it will help to recreate the ingress network according to this instruction: https://docs.docker.com/network/overlay/#customize-the-default-ingress-network
To keep an eye on the situation, I made a trigger that counts tasks which should be running but have no node assigned:
docker service ls -q | xargs -L1 docker service ps --filter desired-state=running --format '{{if .Node}}true{{else}}false{{end}}' | grep false -c