moby: Starting container failed: Address already in use when deploying service
After updating one of my stacks, I started to receive the following error for one of the services (1 container):
Mar 9 16:46:59 swarm-node-2 dockerd[19990]: time="2017-03-09T16:46:59.241934249Z" level=error msg="failed to deactivate service binding for container rita-latest_felix-on-swarm.1.6lekyq04rvlywlxkamgoam3hs" error="network sandbox does not exist for container rita-latest_felix-on-swarm.1.6lekyq04rvlywlxkamgoam3hs" module="node/agent"
Mar 9 16:47:03 swarm-node-2 dockerd[19990]: time="2017-03-09T16:47:03.735473042Z" level=error msg="Could not delete service state for endpoint rita-latest_felix-on-swarm.1.9efhqytkij33mishx0g26acwc from cluster: cannot delete entry as the entry in table endpoint_table with network id v5bbmsf7fj618ghako1jyuk1x and key 0f21c22d4bf497ba654763e85770458bd390897bfb54137e3d5c46e5bc7bd97d does not exist"
Mar 9 16:47:04 swarm-node-2 dockerd[19990]: time="2017-03-09T16:47:04.606619285Z" level=error msg="failed to deactivate service binding for container rita-latest_felix-on-swarm.1.v9pt846f1kj97u1j2rvdxkjtg" error="network sandbox does not exist for container rita-latest_felix-on-swarm.1.v9pt846f1kj97u1j2rvdxkjtg" module="node/agent"
Mar 9 16:47:09 swarm-node-2 dockerd[19990]: time="2017-03-09T16:47:09.865946709Z" level=error msg="failed to deactivate service binding for container rita-latest_felix-on-swarm.1.ypem47u5bu8wwn9j026qx6ppp" error="network sandbox does not exist for container rita-latest_felix-on-swarm.1.ypem47u5bu8wwn9j026qx6ppp" module="node/agent"
When inspecting the failed container I see the error "Starting container failed: Address already in use".
I performed a stack rm and redeployed. I deleted the volumes and networks manually. I still get the issue for this particular stack, as if its name were cached somewhere.
What could be the issue?
After a round of preliminary debugging, got it to work without restarting the daemon/node. I guess the problem is that Swarm fails to deregister/remove the network endpoints from the overlay and the ingress networks after the containers are reaped.
A docker stack rm <stack_name> followed by docker network inspect ingress reveals a bunch of stale network endpoints, as does docker network inspect <overlay_network_name>. A very hacky script force-removes these stale endpoints; a rough sketch of the idea follows.
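The commenter's actual script was never posted; this is a minimal sketch of the idea, assuming the stale entries show up under Containers in the inspect output and that docker network disconnect -f accepts endpoint names whose containers no longer exist (as later comments in this thread suggest):

```sh
#!/bin/sh
# Hypothetical cleanup sketch: force-disconnect every endpoint still
# listed on a swarm network after `docker stack rm`.
# Usage: ./cleanup-endpoints.sh <network_name>
NET="${1:?usage: $0 <network_name>}"

# `docker network inspect` lists attached endpoints under .Containers;
# on an affected node, stale entries point at containers that are gone.
for ep in $(docker network inspect -f '{{range .Containers}}{{.Name}} {{end}}' "$NET"); do
  echo "force-disconnecting ${ep} from ${NET}"
  docker network disconnect -f "$NET" "$ep"
done
```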
A subsequent docker stack deploy ... works without any hiccups. Maybe I'll spend the weekend digging deeper, but this is what I could grok at the moment. YMMV.

I think I face the same issue, with Docker 17.06.0-ce on Ubuntu 16.04, either with a single-node or a multi-node swarm. Edit: I see this issue when I create a service right after having removed a previous service, while the previous service's container is not killed yet. I guess the new service was assigned the same IP?
We upgraded to 17.10.0 and are still seeing a similar issue. This time it's not "address already in use", but the endpoint_table error saying it can't find the entry with that ID and key. Output from journalctl:

The strange thing is that this network ID adwz1i3k94yb9d076a4qdss9e is our ingress network, and it does exist. So does the kafka_default network, because network ls on that particular node outputs…

This is definitely somewhat related to https://github.com/docker/libnetwork/issues/2015
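For anyone hunting the same log lines: assuming dockerd runs as the systemd unit docker.service (an assumption, not stated in the thread), something like this filters the relevant errors out of journalctl:

```sh
# Assumption: the daemon runs as the systemd unit "docker.service".
journalctl -u docker.service --since "10 minutes ago" --no-pager \
  | grep -E 'endpoint_table|failed to deactivate service binding'
```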
Thank you for submitting this one, because we have the same behavior, especially if the containers of a service are flapping. I guess this produces so much noise on the raft log that the network information is not "in sync" across the cluster. Therefore many thanks to @thaJeztah for pointing out --update-failure-action! And to @fcrisciani for the information about 17.10, because we will update soon!

@gaui: to reduce this occurrence, from 17.10 on the IP allocation is sequential, so there would be no immediate IP reuse if your subnet has enough space.
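For illustration, a hedged sketch of applying that flag to an existing service (myservice is a placeholder name, not from the thread):

```sh
# Roll back a failing update instead of leaving tasks flapping; accepted
# values for --update-failure-action are pause, continue, and rollback.
docker service update --update-failure-action rollback myservice
```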
For anyone googling this: in our specific case, this error was related to --stop-grace-period=120s.
Lowering it to 3 seconds or less was enough to remove the error message.
We can only guess that between two docker service update runs, one has to wait 120s (or whatever the stop grace period is) to let the swarm update the network settings, etc.
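A hedged sketch of that change (myservice is a placeholder):

```sh
# Shorten the stop grace period so old tasks (and their endpoint
# registrations) are torn down quickly between service updates.
docker service update --stop-grace-period 3s myservice
```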
Hope this helps!
OK, I found the reason for my issue, at least. Quite a stupid mistake, namely that I removed the CMD instruction from the Dockerfile. Swarm services require such a CMD to exist, so of course the service just started and died immediately. As stupid as this error was on my part, it still took me some days of hard work trying a lot of different approaches to find the problem, and of course none was fruitful. So, a note to the Docker maintainers: please add a proper error when a service is started and its image is missing a CMD! 👍
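To make the failure mode concrete, a minimal sketch (not from the thread):

```dockerfile
FROM alpine:3.12
# Without a CMD (or ENTRYPOINT), the container exits right after starting,
# so the swarm task dies immediately and is rescheduled in a loop.
CMD ["sleep", "infinity"]
```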
I've fixed this by re-joining the node to the swarm.
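Roughly, as a hedged sketch (the token and manager address are placeholders):

```sh
# On the affected node: leave the swarm, discarding its swarm-scoped
# network state (add --force if the node is a manager).
docker swarm leave
# On a manager: print the join command with the current worker token.
docker swarm join-token worker
# Back on the node: re-join using the printed token and manager address.
docker swarm join --token <worker-token> <manager-ip>:2377
```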
Just had this error when upgrading from 18.03 to 18.06; removing the stacks and networks, pruning, and recreating everything "fixed" it.
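Spelled out as a hedged sketch (stack and compose file names are placeholders):

```sh
docker stack rm mystack                             # remove the stack's services
docker network prune -f                             # drop unused (stale) networks
docker system prune -f                              # clear remaining unused objects
docker stack deploy -c docker-compose.yml mystack   # recreate everything
```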
I met this problem too. My Docker version is 17.06. I found that:

[root@docker40 ~]# docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
724ef4769682        bridge              bridge              local
a051b378203f        docker_gwbridge     bridge              local
lexdcg0de00j        es19200_esnet4      overlay             swarm
17b59d578279        host                host                local
1aiyumggom6k        ingress             overlay             swarm
c842691e98b3        none                null                local

This node docker40 had already left the swarm and rejoined, yet the network still exists.

[root@docker40 ~]# docker network rm es19200_esnet4
Error response from daemon: network es19200_esnet4 id lexdcg0de00jwm54abwopqegk has active endpoints

Then I tried to disconnect:

[root@docker40 ~]# docker network disconnect es19200_esnet4 es19200_es_data.mllz63dw9q12jtkwp2o1w3azi.58cvya349yda6ewltt15y1479
Error response from daemon: No such container: es19200_es_data.mllz63dw9q12jtkwp2o1w3azi.58cvya349yda6ewltt15y1479

Using force worked well:

[root@docker40 ~]# docker network disconnect -f es19200_esnet4 es19200_es_data.mllz63dw9q12jtkwp2o1w3azi.58cvya349yda6ewltt15y1479

Then remove the dirty network:

[root@docker40 ~]# docker network rm es19200_esnet4
es19200_esnet4

Then redeploy your service, and all goes well.
@fcrisciani I haven’t experienced it yet, but I will report.
Found a really tricky workaround without restarting the daemon:

Scale the service to more replicas, e.g. 2 in my case. The swarm could start myservice.2, but myservice.1 always failed to start due to the same error. Then scale the service down to 1; the newly created myservice.2 was the one left running.
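As a hedged sketch, using the commenter's myservice naming:

```sh
docker service scale myservice=2   # the fresh task comes up with a clean endpoint
docker service scale myservice=1   # scale back down; the stuck task is removed
```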
Got a situation here as well; I'll shoot up the info so it might help you guys find the issue!

We are running a swarm cluster with 7 machines (3 masters, 4 slaves), Docker version 1.13.0, build 49bf474.

The cluster has been live for 5-6 months; overall it's all running well and smoothly in production on good machines. On preprod, on virtual machines, we have some kind of stability issues, but that's another story.

The issue we had: in production, it has happened twice in the last few months that one of the Docker nodes suddenly goes completely unstable. All containers become unhealthy.

I ran docker inspect on each container, and they all have the same error message in the healthcheck logs:

rpc error: code = 13 desc = transport is closing

followed by two other failures:

rpc error: code = 14 desc = grpc: the connection is unavailable
Solution: we drain the node, reboot it, put it back to available, and it's all good and fixed…
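The drain/reactivate cycle, as a hedged sketch (the node name is a placeholder):

```sh
docker node update --availability drain swarm-node-2    # evict all tasks from the node
# ... reboot the node ...
docker node update --availability active swarm-node-2   # make it schedulable again
```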
The weird part during troubleshooting: while the containers were unhealthy, docker service ls on the master showed all container replicas as up and running (it kind of freaked me out; the cluster was considered healthy while a node was going all crazy…).

Also, once we drained the node, one of the services wasn't able to replicate itself onto another node (it was running with only 1 replica, and unfortunately it was on the problematic node). After we drained the node, at least it showed 0/1 in docker service ls…

I tried to scale it to 1 (to see if the swarm would move it to another server, but it didn't work). When I ran docker service ps <the service>, the error column showed: "Network sandbox does not exist". So the service was really stuck there.

I removed the stack and redeployed it so it could be released on another node… problem fixed!

So that's our story, lol. Hope you can find some hints in it!

Edit: by the way, the problematic node doesn't show any hints in /var/log/messages or dmesg.
Same issue here. The service is up for around 30 seconds and then suddenly quits without any message.

/var/log/syslog says:

Rebooting the node helps. After 10 to 20 service updates the node fails again.

docker info:

Eventually an entire node becomes unusable and all services fail, but remain stuck on that node. Rebooting the node seems to help.