moby: Docker stack fails to allocate IP on an overlay network, and tasks get stuck in the `NEW` state
Description: I have a Docker Swarm cluster with 5 managers and 4 worker nodes. A while ago, it was 3 managers and 4 worker nodes. We use immutable infrastructure for the hosts.
EDIT: I managed to get reproducible tests in https://github.com/moby/moby/issues/37338#issuecomment-437558916
The majority of our containers connect to a specific overlay network with a /24 CIDR:

    my_network:
      driver: overlay
      driver_opts:
        encrypted: ""
      ipam:
        driver: default
        config:
          - subnet: 10.100.2.0/24
Docker stack deployments happen all the time in this specific cluster.
Occasionally (and I cannot understand what causes it), the swarm is unable to allocate an IP for a new task: "Failed allocation for service <service>" error="could not find an available IP while allocating VIP"
So I assumed we had run out of IPs in the CIDR. But when I counted the number of tasks currently attached to the network, there were fewer than 40 running tasks. I also went and counted all the stopped/historical tasks, and counted all the IPs on that network; still, there were fewer than 120 IPs, a lot less than the 200-and-something I'd expect.
I tried to restrict the task history size, but that by itself didn't make any difference. I deleted almost all stacks, and some containers were able to get a new IP, but the problem manifested itself again as soon as everything was redeployed.
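For reference, restricting the task history (presumably the setting reflected as "Task History Retention Limit: 1" in the docker info output below) can be done with something like:

```
# Keep only 1 historical task per slot instead of the default 5
docker swarm update --task-history-limit 1
```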
I also looked at the NetworkDB stats while the problem was happening, and it was all lines like: NetworkDB stats <leader host>(<node>) - netID:<my network> leaving:false netPeers:8 entries:14 Queue qLen:0 netMsg/s:0
After we ‘recycled’ all the managers (including the leader), the problem appears to be resolved. All the tasks which were stuck then received a new IP.
NetworkDB stats <host>(<node>) - netID:<my network> leaving:false netPeers:7 entries:49 Queue qLen:0 netMsg/s:0
It appears that somehow some IPs are not returned to the pool, but I'm not even sure where to look for more information. Can anyone help me with how to investigate this problem?
My problem appears similar to what was described here: https://github.com/docker/for-aws/issues/104
Steps to reproduce the issue:
- Create a docker stack that connects to the /24 overlay network
- docker stack deploy -c file.yaml my-stack
- docker stack ps my-stack
Describe the results you received: Tasks get stuck in the ‘NEW’ state.
Describe the results you expected: If we have fewer than 200 containers attached to the /24 network, I'd expect the task to be running.
Additional information you deem important (e.g. issue happens only occasionally): We've seen this problem before.
The problem apparently persists for days. Eventually, after a few hours of waiting, some of the containers receive an IP and start. I've seen containers stuck in that state for more than a day.
Output of docker version:
$ docker version
Server:
Engine:
Version: 18.03.1-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:23:03 2018
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
Containers: 3
Running: 2
Paused: 0
Stopped: 1
Images: 3
Server Version: 18.03.1-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: okljpo50f39c74me2qzem67qw
Is Manager: true
ClusterID: 2ufszb0kyswdcmi7nzxfqjb47
Managers: 5
Nodes: 9
Orchestration:
Task History Retention Limit: 1
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Manager Addresses:
...
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.107-linuxkit
Operating System: Alpine Linux v3.7
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.951GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.): AWS.
About this issue
- Original URL
- State: open
- Created 6 years ago
- Reactions: 42
- Comments: 61
This is still an issue.
It's not about the ingress network itself, but about long-lived networks. Increasing the subnet size is just a workaround for insufficient garbage collection of IP pools.
This issue has been open for 3 years and nothing has changed, even though there are plenty of people having this issue.
@thaJeztah do you know who can help us look into this issue?
We just had the problem AGAIN! We were just resigning ourselves to rebuilding the cluster from scratch YET AGAIN, but this time "kill -9" of the affected manager recovered the managers (we run three dedicated managers). In our (long!) experience, though, we just kicked the can down the road! Now we will not be able to reliably bring up any new containers until we drain the entire cluster, restart Docker Swarm from scratch, and re-add all containers. And that's ABSURD!
FIVE YEARS this has been a problem, and the Docker team simply won’t fix it. Numerous people (including several on this very thread) have provided straightforward methods to reliably replicate the issue, and we all know what the underlying issue is (garbage cleanup of unused IPs). And the Docker team simply won’t fix it.
The promise of Docker Swarm is superb, way easier to deal with than Kubernetes. But we have rebuilt our production cluster from scratch countless times due to NOTHING working to recover it when Docker blows up in our faces. Demoting the manager leader, letting it clear out, and then repromoting it sometimes (rarely) solves the problem (temporarily, just kicking the inevitable can down the road). Sometimes "kill -9" of the affected manager process temporarily solves the problem (just kicking the inevitable can down the road, as we just did this time). But absolutely nothing reliably SOLVES the problem, and over the years we have REPEATEDLY had to rebuild the entire cluster from scratch to get our containers all back up. UNACCEPTABLE for production!
You simply cannot reliably stop and start containers or scale up additional containers as long as this CORE bug persists in Docker Swarm. Thus, Docker Swarm is NOT a reasonable choice for production environments.
It’s a real pain to switch to Kubernetes, but we can no longer endure spending an entire night getting our cluster back up after Docker Swarm blows up in our faces. And after FIVE YEARS of knowing about this (and how to replicate the problem!), the Docker team should be ashamed of itself for allowing this to persist with no fix.
We migrated the entire company to Kubernetes just because of this issue. It took about one month. Since this is a 3-year-old issue and nobody has the knowledge to fix it, I think a warning in the documentation is necessary, something like "we don't recommend using Docker in production with auto-spawning or auto-scaling, since its ingress network can only handle up to 256 containers, and removed or stopped ones still count" or something like that.
Any updates? 😦
I see the same behavior in a Docker Swarm cluster with Traefik, where every few weeks containers get stuck in the New state. Restarting the master node solves the issue temporarily. More frequent deployments seem to lead to this state more often. When this happens, the Docker IP Utilization Check Script doesn't report any IP address exhaustion on any network, though.
In my environment, we solved this by creating more networks and linking them to Traefik, so we could use another 254 available addresses for each network created. Example:
traefik-docker-compose.yml
my api.yml 1:
my front.yml after 254 services:
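The compose files themselves are not reproduced here; as a rough CLI-only sketch of the same idea (the network, service, and image names below are made up):

```
# Create an additional overlay network to get a fresh /24 pool of addresses
docker network create --driver overlay traefik-net-2

# Attach the existing Traefik service to the new network as well
docker service update --network-add traefik-net-2 traefik

# Deploy services beyond the first ~254 onto the new network instead
docker service create --name my-front-255 --network traefik-net-2 my-front-image
```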
This issue occurred for me after running a swarm cluster for a couple of months. We redeploy services frequently. I am not sure how the pool of IPs rotates in the swarm.
@kirk-wgt I am baffled that you and some others here still haven’t given up hope. We, too, started out with Docker Swarm and quickly saw all our production clusters crashing every few days because of this bug.
This was well over two years (!) ago. This was when I made the hard decision to migrate to K3s. We never looked back. K3s is the perfect replacement for Docker Swarm due to a similar deployment model and the integrated Klipper-LB that behaves very similarly to the routing mesh of Docker Swarm.
Just accept it: Docker Swarm is dead. Do not use it for anything else than simple throwaway clusters where you can count the number of containers on one hand.
Agreed, the issue itself is the IPs not being released after a service is no longer running, but the ingress network supports as many containers as you design it to support. You can easily remove the ingress network and change the default subnet mask from /24 to /16, increasing the number of containers to 65534. Perhaps the reason why this issue hasn't been solved yet is the burden of trying to replicate it quickly. Currently, with the default ingress network, you would need to up/down a container 256 times (or create a service with 256 replicas). Or, to encounter it even faster, you could configure the swarm with a smaller subnet, as mentioned here.
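A rough sketch of that reproduction idea (the image name, iteration count, and timing are placeholders; publishing a port is what attaches each service to the ingress network):

```
# Repeatedly create and remove a published service so that ingress IPs
# are allocated and (ideally) released on every iteration.
for i in $(seq 1 300); do
  docker service create --detach --quiet --name churn-$i \
    --publish $((8000 + i)):80 nginx:alpine
  sleep 5
  docker service rm churn-$i
done

# Once the pool is exhausted, new tasks hang in the NEW state and the managers
# log "could not find an available IP while allocating VIP".
```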
I have been running proxied services in dnsrr mode without problems for quite a while now. It does not fix the "new" state problem; it just reduces its occurrence, because far fewer IP addresses are consumed in each deployment: you save one VIP address for each service deployed behind Traefik. But there are also reasons to run a stack behind Traefik with a VIP, for example if you need HTTP(S) plus some node port mappings.
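For reference, a sketch of what dnsrr mode looks like per service (the service and image names are placeholders); note that dnsrr cannot be combined with ingress-mode published ports, which is why it pairs well with a proxy like Traefik:

```
# No VIP is allocated for this service; Traefik reaches the tasks via
# DNS round-robin on the shared overlay network instead.
docker service create \
  --name my-api \
  --network my_network \
  --endpoint-mode dnsrr \
  my-api-image
```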
Same issue here. We now have this problem once a week. We deploy about 5-7 new containers every hour and stop/remove the old ones. Only a restart of the Docker daemon helps.
Docker 19.03.6
When investigating the problem, we came up with this Python script (note: it uses TLS auth):
It will show all tasks which are supposedly still running but are assigned to nodes which do not exist. Unfortunately there's no fix. We tend to undeploy those stacks, and usually the orphan tasks go away.
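The script itself is not included above; a rough CLI-only equivalent of the described check (listing tasks that should be running but are bound to node IDs no longer in the cluster) might look like this:

```
#!/bin/sh
# Node IDs currently known to the swarm (run this on a manager)
known_nodes=$(docker node ls -q)

for svc in $(docker service ls -q); do
  for task in $(docker service ps -q --filter desired-state=running "$svc"); do
    node_id=$(docker inspect --format '{{.NodeID}}' "$task")
    # Flag tasks whose node ID is no longer part of the cluster
    if [ -n "$node_id" ] && ! printf '%s\n' "$known_nodes" | grep -qx "$node_id"; then
      echo "orphan task $task (service $svc) on missing node $node_id"
    fi
  done
done
```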
As you are surely finding, Celtech, and will continue to discover, there is no magic bullet.
Even at this late date, the Docker team can’t seem to track down and kill this core bug (or cascade of bugs) that make Swarm unreliable. Sometimes demotion/promotion of the leader will recover the cluster, sometimes a “kill -9” of the docker daemon will work, sometimes killing every container that can’t start (for whatever ridiculous reason) will work, sometimes “reinitializing” the cluster in-place will work, and many/most times nothing will work. Once you’re bitten by the “Docker Bug,” as we’ve come to call it, all bets are off. You may get your cluster back, and you likely will not. That is completely unacceptable for a production cluster.
We recently entirely gave up on Docker Swarm. Our new cluster runs on Kubernetes, and we’ve written scripts and templates for ourselves to reduce the network-stack management complexities to a manageable level for us.
In our opinion, Docker Swarm is not a production-ready containerization environment and never will be. You are on the right track, in our opinion, to cite “zombie tasks holding these IP’s hostage,” although no such tasks show up using PS. Our belief is that Docker doesn’t engage in robust and rapid garbage collection, and it doesn’t correctly honor the specified subnet value at initialization. But years of waiting and hoping have proved fruitless, and we finally had to go to something reliable (albeit harder to deal with).
I sincerely wish you all the best and good luck in your efforts with Docker Swarm! We were forced to abandon it.
@kirk-wgt I am not a k3s salesperson, so I won't try to persuade you to use one product over the other 😄 But I feel there is a misconception that we had at first, too.
K3s is a full-fledged Kubernetes distribution with many bells and whistles attached (like an integrated Traefik ingress controller). If anything, Docker Swarm is the lightweight solution when comparing the two. K3s is just as "beefy" as any other Kubernetes distribution (and many would argue that all Kubernetes distributions are too beefy anyway).
Yes, K3s markets itself for "the edge", but only because of its single-binary, zero-dependencies deployment model. There is nothing lightweight about it when compared to other k8s distributions (except maybe for the removed (non-CSI) storage drivers, which have all long been deprecated anyway).
I am sorry for being off-topic here and for triggering notifications for far too many people. I just cannot help but continue following this thread and grab a bag of popcorn whenever someone falls into the same pitfalls that we encountered two years ago.
Edit: Before someone accuses me of being partial to K3s (which I am, but not for commercial reasons): of course there are other k8s distributions that could be used in favor of Docker Swarm. Since we are talking Docker Swarm here, I should especially mention k0s, which is a Kubernetes distribution by Mirantis. But we made the conscious decision against using k0s after seeing how Mirantis handled (i.e., let die) Docker Swarm. That said, I've heard many good things about k0s.
What will it take for this glaring defect to get any attention from the Docker developers? Is there a docker PM who I can nudge?
I am now in the unenviable position of trying to deploy our production-ready system atop Docker Swarm, which is unreliable for production. I just burned a couple of weeks trying various hacks that didn't work. Too late to switch to k8s right now.
Ran into this issue as well and was able to fix it temporarily with a restart. At the end of the day, ~~the problem lies~~ one of the other problems lies with the ingress network, since the ingress network has a default subnet of 10.0.0.0/24, which according to this subnet mask table gives a maximum of 256 addresses (254 usable) for services connected to the ingress network.
After your ingress is created, try removing it and creating your own ingress network with a subnet that provides room for more services. For example:
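The original example is not shown here; a sketch of what it might look like, following the approach in the Docker docs (the subnet values are just an illustration):

```
# Remove the default ingress network first (no service may be publishing
# ports through it while you do this)
docker network rm ingress

# Recreate it with a /16 so far more addresses are available
docker network create \
  --driver overlay \
  --ingress \
  --subnet 10.11.0.0/16 \
  --gateway 10.11.0.1 \
  ingress
```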
EDIT: Yes, admittedly, the real issue is the garbage collection, but if people at least know how to extend the number of services, they can reduce the occurrence.
Didn’t help or did not help for long:
It helped to recreate the swarm with the "--default-addr-pool-mask-length 16" parameter.
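For reference, a sketch of that re-initialisation (the base address pool shown is just the default made explicit):

```
# New swarm whose automatically created networks get /16 subnets
# carved out of 10.0.0.0/8 instead of the default /24
docker swarm init \
  --default-addr-pool 10.0.0.0/8 \
  --default-addr-pool-mask-length 16
```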
Most likely, it will help to recreate the ingress network according to this instruction: https://docs.docker.com/network/overlay/#customize-the-default-ingress-network
To keep an eye on the situation, I made a trigger that counts tasks which should be running but have no node assigned:
docker service ls -q | xargs -L1 docker service ps --filter desired-state=running --format '{{if .Node}}true{{else}}false{{end}}' | grep false -c