moby: Starting container failed: Address already in use when deploying service

After updating one of my stacks, I started to receive the following errors for one of the services (1 container):

Mar  9 16:46:59 swarm-node-2 dockerd[19990]: time="2017-03-09T16:46:59.241934249Z" level=error msg="failed to deactivate service binding for container rita-latest_felix-on-swarm.1.6lekyq04rvlywlxkamgoam3hs" error="network sandbox does not exist for container rita-latest_felix-on-swarm.1.6lekyq04rvlywlxkamgoam3hs" module="node/agent"
Mar  9 16:47:03 swarm-node-2 dockerd[19990]: time="2017-03-09T16:47:03.735473042Z" level=error msg="Could not delete service state for endpoint rita-latest_felix-on-swarm.1.9efhqytkij33mishx0g26acwc from cluster: cannot delete entry as the entry in table endpoint_table with network id v5bbmsf7fj618ghako1jyuk1x and key 0f21c22d4bf497ba654763e85770458bd390897bfb54137e3d5c46e5bc7bd97d does not exist"
Mar  9 16:47:04 swarm-node-2 dockerd[19990]: time="2017-03-09T16:47:04.606619285Z" level=error msg="failed to deactivate service binding for container rita-latest_felix-on-swarm.1.v9pt846f1kj97u1j2rvdxkjtg" error="network sandbox does not exist for container rita-latest_felix-on-swarm.1.v9pt846f1kj97u1j2rvdxkjtg" module="node/agent"
Mar  9 16:47:09 swarm-node-2 dockerd[19990]: time="2017-03-09T16:47:09.865946709Z" level=error msg="failed to deactivate service binding for container rita-latest_felix-on-swarm.1.ypem47u5bu8wwn9j026qx6ppp" error="network sandbox does not exist for container rita-latest_felix-on-swarm.1.ypem47u5bu8wwn9j026qx6ppp" module="node/agent"

When inspecting the failed container, I see the error "starting container failed: Address already in use".

I performed a docker stack rm and redeployed. I also deleted the volumes and networks manually. I still get the issue for this particular stack, as if its name is cached somewhere.
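
For reference, the kind of cleanup I'm describing looks roughly like this (the stack name and compose file below are placeholders, not the exact commands I ran):

# remove the stack, clean up leftover volumes/networks, then redeploy
docker stack rm <stack_name>
# individual volumes/networks can also be removed by name with
# "docker volume rm" / "docker network rm" instead of pruning
docker volume prune -f
docker network prune -f
docker stack deploy -c docker-compose.yml <stack_name>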

What could be the issue?

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 8
  • Comments: 103 (26 by maintainers)

Most upvoted comments

After a round of preliminary debugging, I got it to work without restarting the daemon/node. My guess is that Swarm fails to deregister/remove the network endpoints from the overlay and ingress networks after the containers are reaped.

A docker stack rm <stack_name> followed by docker network inspect ingress reveals a bunch of stale network endpoints. So does the docker network inspect <overlay_network_name>.

A very hacky script that force-removes these stale endpoints:

network_name="ingress"
for endpoint_map in $(docker network inspect -f '{{range $container_id, $container_def := .Containers}} {{$container_id}}^{{index $container_def "Name"}} {{end}}' $network_name)
do
  container_id=$(echo $endpoint_map | cut -d ^ -f1)
  container_name=$(echo $endpoint_map | cut -d ^ -f2)

  if [ $container_id != "ingress-sbox" ]
  then
    docker inspect --format "{{.State.Status}}" $container_id &>/dev/null
    if [ $? -ne 0 ]
    then
      echo "Removing $container_name"
      docker network disconnect -f $network_name $container_name
    else
      echo "Letting $container_name stay"
    fi
  fi
done

A subsequent docker stack deploy ... works without any hiccups. Maybe I'll spend the weekend digging deeper, but this is what I could grok at the moment. YMMV.

I think I'm facing the same issue, with Docker 17.06.0-ce on Ubuntu 16.04, on either a single-node or a multi-node swarm. Edit: I see this issue when I create a service right after removing a previous service, while the previous service's container has not been killed yet. I guess the new service was assigned the same IP?

We upgraded to 17.10.0 and are still seeing a similar issue. This time it's not "address already in use", but the endpoint_table error saying that no entry with that network ID and key exists.

deleteServiceInfoFromCluster NetworkDB DeleteEntry failed for 59943759fe38b7ff7fcddb722ea740089303f698555a7f48163f32bb70fcbb1f adwz1i3k94yb9d076a4qdss9e err:cannot delete entry as the entry in table endpoint_table with network id adwz1i3k94yb9d076a4qdss9e and key 59943759fe38b7ff7fcddb722ea740089303f698555a7f48163f32bb70fcbb1f does not exist
Dec 29 16:58:12 kafka0 dockerd[1097]: time="2017-12-29T16:58:12.489914938Z" level=error msg="fatal task error" error="starting container failed: No such network: kafka_default" module=node/agent/taskmanager node.id=ecnbx5evaqnipyklmee3ramc8 service.id=4b3est31dqbzza6jkw146orb4 task.id=ygfbav2fetlt8jru34cul0lji

Output from journalctl

Dec 29 16:58:09 kafka0 dockerd[1097]: time="2017-12-29T16:58:09.051766020Z" level=info msg="Node join event for e4ce31a6d326/172.31.100.143"
Dec 29 16:58:12 kafka0 dockerd[1097]: time="2017-12-29T16:58:12.403667722Z" level=warning msg="deleteServiceInfoFromCluster NetworkDB DeleteEntry failed for 59943759fe38b7ff7fcddb722ea740089303f698555a7f48163f32bb70fcbb1f adwz1i3k94yb9d076a4qdss9e err:cannot delete entry as the entry in table endpoint_table with network id adwz1i3k94yb9d076a4qdss9e and key 59943759fe38b7ff7fcddb722ea740089303f698555a7f48163f32bb70fcbb1f does not exist"
Dec 29 16:58:12 kafka0 dockerd[1097]: time="2017-12-29T16:58:12.403712495Z" level=warning msg="rmServiceBinding deleteServiceInfoFromCluster kafka_kafka-1 59943759fe38b7ff7fcddb722ea740089303f698555a7f48163f32bb70fcbb1f aborted c.serviceBindings[skey] !ok"
Dec 29 16:58:12 kafka0 dockerd[1097]: time="2017-12-29T16:58:12.489914938Z" level=error msg="fatal task error" error="starting container failed: No such network: kafka_default" module=node/agent/taskmanager node.id=ecnbx5evaqnipyklmee3ramc8 service.id=4b3est31dqbzza6jkw146orb4 task.id=ygfbav2fetlt8jru34cul0lji
Dec 29 16:58:12 kafka0 dockerd[1097]: time="2017-12-29T16:58:12.842323231Z" level=warning msg="rmServiceBinding handleEpTableEvent kafka_kafka-2 3c6136f784f4636e3a8e9e66a5c17a04999772883012aa7ccc7cb138924534ca aborted c.serviceBindings[skey] !ok"
Dec 29 16:58:12 kafka0 dockerd[1097]: time="2017-12-29T16:58:12.842405469Z" level=warning msg="rmServiceBinding handleEpTableEvent kafka_kafka-2 40a7c8d1fa96a493f07f09a21ece8b6d017478efc590d87bc18b4935089c550b aborted c.serviceBindings[skey] !ok"
Dec 29 16:58:12 kafka0 dockerd[1097]: time="2017-12-29T16:58:12.874109224Z" level=warning msg="rmServiceBinding handleEpTableEvent kafka_kafka-2 00523277e680880b8c37b809925d5f8c9b5b29d1f13c87760de867d29e8efcf5 aborted c.serviceBindings[skey] !ok"
Dec 29 16:58:12 kafka0 dockerd[1097]: time="2017-12-29T16:58:12.875150026Z" level=warning msg="rmServiceBinding handleEpTableEvent kafka_kafka-2 3c6136f784f4636e3a8e9e66a5c17a04999772883012aa7ccc7cb138924534ca aborted c.serviceBindings[skey] !ok"
Dec 29 16:58:12 kafka0 dockerd[1097]: time="2017-12-29T16:58:12.875386070Z" level=warning msg="rmServiceBinding handleEpTableEvent kafka_kafka-2 40a7c8d1fa96a493f07f09a21ece8b6d017478efc590d87bc18b4935089c550b aborted c.serviceBindings[skey] !ok"
Dec 29 16:58:12 kafka0 dockerd[1097]: time="2017-12-29T16:58:12.875979208Z" level=warning msg="rmServiceBinding handleEpTableEvent kafka_kafka-2 614765452a142cff786619fd8e5abeafe283703ebee01d6cf658cd5ebe536cc0 aborted c.serviceBindings[skey] !ok"
Dec 29 16:58:12 kafka0 dockerd[1097]: time="2017-12-29T16:58:12.877931833Z" level=warning msg="rmServiceBinding handleEpTableEvent kafka_kafka-2 cd76ecd09516b2b4e8981dee7cb123b1bc9ac1a9945a31c2e57aa26da79687f1 aborted c.serviceBindings[skey] !ok"

The strange thing is that this network ID adwz1i3k94yb9d076a4qdss9e is our ingress network, and it does exist. So does the kafka_default network, because docker network ls on that particular node outputs:

ubuntu@kafka0:~$ sudo docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
c4a8def3d7d8        bridge              bridge              local
e54b62f8cfd0        docker_gwbridge     bridge              local
1a577598008e        host                host                local
adwz1i3k94yb        ingress             overlay             swarm
38x8evaixyyp        kafka_default       overlay             swarm
a0555d01cda4        none                null                local

This is definitely somewhat related to https://github.com/docker/libnetwork/issues/2015

Thank you for submitting this one, because we see the same behavior, especially when the containers of a service are flapping. I guess this produces so much noise on the raft that the network information is no longer "in sync" across the cluster. Therefore many thanks to @thaJeztah for pointing out --update-failure-action! And to @fcrisciani for the information about 17.10, because we will update soon!
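
For anyone else landing here, a minimal sketch of setting that flag on an existing service (the service name is a placeholder):

# pause the rolling update when an updated task fails to start, instead of
# continuing to churn through the remaining tasks ("continue" is the other value)
docker service update --update-failure-action pause my_service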

@gaui: to reduce this occurrence, as of 17.10 the IP allocation is sequential, so there will not be immediate IP reuse if your subnet has enough space.

For anyone googling this: in our specific case, this error was related to --stop-grace-period=120s.

Lowering it to 3 seconds or less was enough to remove the error message.

We can only guess that between two docker service update runs, one has to wait 120 seconds (or whatever the stop grace period is) to let the swarm update its network settings.

Hope this helps!
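
A sketch of lowering it on an existing service (my_service is a placeholder; the same can be set per service with the stop_grace_period key in a v3 compose file):

# give the old task only 3s between SIGTERM and SIGKILL when it is replaced
docker service update --stop-grace-period 3s my_service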

OK, I found the reason for my issue, at least. Quite a stupid mistake: I had removed the CMD instruction from the Dockerfile. Swarm services require such a CMD to exist, so of course the service just started and died immediately. As stupid as this error was on my part, it still took me days of hard work trying a lot of different approaches, none of which were fruitful. So a note to the Docker maintainers: please add a proper error when a service is started from an image that is missing its CMD! 👍
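
A quick sanity check that would have saved me the time (the image name is a placeholder): inspect what the image actually runs before deploying it.

# if both Cmd and Entrypoint come back empty, tasks will exit immediately and keep restarting
docker image inspect --format 'CMD={{.Config.Cmd}} ENTRYPOINT={{.Config.Entrypoint}}' myorg/myimage:latest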

I've fixed this by re-joining the node to the swarm.
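
Roughly, that re-join sequence looks like this (the node name is a placeholder):

# on the affected node (add --force if it is a manager)
docker swarm leave
# on a manager: remove the stale node entry (--force if it is not yet Down) and print a join token
docker node rm <node_name>
docker swarm join-token worker
# back on the affected node, run the "docker swarm join --token ..." command the manager printed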

Just had this error when upgrading from 18.03 to 18.06; removing the stacks and networks, pruning, and recreating everything "fixed" it.

I ran into this problem too. My Docker version is 17.06.

I found that:

[root@docker40 ~]# docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
724ef4769682        bridge              bridge              local
a051b378203f        docker_gwbridge     bridge              local
lexdcg0de00j        es19200_esnet4      overlay             swarm
17b59d578279        host                host                local
1aiyumggom6k        ingress             overlay             swarm
c842691e98b3        none                null                local

This node (docker40) had already left the swarm and rejoined, yet the network still exists.

[root@docker40 ~]# docker network rm es19200_esnet4
Error response from daemon: network es19200_esnet4 id lexdcg0de00jwm54abwopqegk has active endpoints

Then I tried to disconnect it:

[root@docker40 ~]# docker network disconnect es19200_esnet4 es19200_es_data.mllz63dw9q12jtkwp2o1w3azi.58cvya349yda6ewltt15y1479
Error response from daemon: No such container: es19200_es_data.mllz63dw9q12jtkwp2o1w3azi.58cvya349yda6ewltt15y1479

Using force worked well:

docker network disconnect -f es19200_esnet4 es19200_es_data.mllz63dw9q12jtkwp2o1w3azi.58cvya349yda6ewltt15y1479

Then remove the dirty network:

[root@docker40 ~]# docker network rm es19200_esnet4
es19200_esnet4

Then redeploy your service, and all goes well.

@fcrisciani I haven’t experienced it yet, but I will report.

Found a really tricky workaround that does not require restarting the daemon.

Scale the service up to more replicas, e.g. 2 in my case. The swarm could start myservice.2, but myservice.1 always failed to start with the same error. Then scale the service back down to 1; the newly created myservice.2 is the one that is left.
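
Concretely, the CLI steps look roughly like this (myservice is a placeholder):

# start a second replica; the new task gets a fresh endpoint
docker service scale myservice=2
# once it is running, scale back down; in my case the stuck task was the one removed
docker service scale myservice=1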

Got a situation here as well; I'll share the info so it might help you guys find the issue!

We are running a swarm cluster with 7 machines (3 masters, 4 slaves), Docker version 1.13.0, build 49bf474.

The cluster has been live for 5-6 months; overall it's running well and smoothly in production on good machines. On pre-prod, on virtual machines, we have some stability issues, but that's another story.

The issue we had: in production, it happened twice in the last few months that one of the Docker nodes suddenly went completely unstable. All containers became unhealthy.

I ran docker inspect on each container, and they all had the same error message in the healthcheck logs:

rpc error: code = 13 desc = transport is closing, followed by two other failures: rpc error: code = 14 desc = grpc: the connection is unavailable

Solution: we drained the node, rebooted it, made it available again, and everything was fixed.
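
For completeness, draining and re-activating a node looks roughly like this (the node name is a placeholder):

# on a manager: move all tasks off the bad node
docker node update --availability drain <node_name>
# reboot the node, then put it back into rotation
docker node update --availability active <node_name>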

The weird part during troubleshooting: while the containers were unhealthy, docker service ls on the master was showing all replicas as up and running (which kind of freaked me out; the cluster was considered healthy while one node was going completely crazy).

Also, once we drained the node, one of the services wasn't able to reschedule itself on another node (it was running with only 1 replica, and unfortunately it was on the problematic node). Since we drained the node, at least it was showing 0/1 in docker service ls.

I tried to scale it to 1 (to see if the swarm would move it to another server, but it didn't work). When I ran docker service ps <the service>, the error column showed "Network sandbox does not exist". So the service was really stuck there.

I removed the stack and redeployed it so it could be scheduled on another node... problem fixed!

So that's our story, lol. Hope you can find some hints in it!

Edited: by the way, the problematic node doesn't show any hints in /var/log/messages or dmesg.

Same issue here. Service is up for around 30 seconds and suddenly quits without any message.

/var/log/syslog says:

dockerd[625]: time="2017-03-30T16:35:15.798055005+02:00" level=warning msg="failed to deactivate service binding for container lamp_lamp.1.k11p9pykwyj4ztzckn46v46gf" error="network sandbox does not exist for container lamp_lamp.1.k11p9pykwyj4ztzckn46v46gf" module="node/agent"

Rebooting the node helps. After 10 to 20 service updates the node fails again.

docker info:

Containers: 16
 Running: 3
 Paused: 0
 Stopped: 13
Images: 25
Server Version: 17.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: nmbxvlyhdug84yyqephkezfg0
 Is Manager: true
 ClusterID: crvpu6ihmir5bh4ptyrmpoend
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: xxxxxxxxx
 Manager Addresses:
  xxxxxxxxxx:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-71-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 3.788 GiB
Name: apple
ID: EBHH:A4TE:YSZ5:5LMB:UPQZ:OIU4:WSEF:WZSE:4QVS:OQMH:N6L4:TK4Y
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Eventually an entire node will become unusable and all services will fail, but will remain stuck on that node. Rebooting the node seems to help.