moby: Cannot remove network due to active endpoint, but cannot stop/remove containers
Output of docker version:
Client:
Version: 1.11.1
API version: 1.23
Go version: go1.5.4
Git commit: 5604cbe
Built: Tue Apr 26 23:30:23 2016
OS/Arch: linux/amd64
Server:
Version: 1.11.1
API version: 1.23
Go version: go1.5.4
Git commit: 5604cbe
Built: Tue Apr 26 23:30:23 2016
OS/Arch: linux/amd64
Output of docker info:
Containers: 15
Running: 13
Paused: 0
Stopped: 2
Images: 215
Server Version: 1.11.1
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 248
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge null host overlay
Kernel Version: 4.4.0-22-generic
Operating System: Ubuntu 16.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.686 GiB
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Cluster store: consul://xxx
Cluster advertise: yyy
I am trying to delete a network with docker network rm <network>, but it complains with:
Error response from daemon: network xxx_default has active endpoints
Indeed, when I run docker inspect xxx_default I get:
"Containers": {
"ep-3dd9d8a572c1bfa877da875f3f0640dba9fe0bdf7ff6090a2171dcbebc926b55": {
"Name": "release_diyaserver_1",
"EndpointID": "3dd9d8a572c1bfa877da875f3f0640dba9fe0bdf7ff6090a2171dcbebc926b55",
"MacAddress": "02:42:0a:00:03:04",
"IPv4Address": "10.0.3.4/24",
"IPv6Address": ""
},
"ep-da1587e9a9fed7d767d79e1ff724a6f6afe56126dae097d9967a9196022ad103": {
"Name": "release_server-postgresql_1",
"EndpointID": "da1587e9a9fed7d767d79e1ff724a6f6afe56126dae097d9967a9196022ad103",
"MacAddress": "02:42:0a:00:03:03",
"IPv4Address": "10.0.3.3/24",
"IPv6Address": ""
}
}
But when I try to docker stop/rm either of these two containers (by name or ID), I get:
Error response from daemon: No such container: release_diyaserver_1
So basically I’m stuck with a useless network that I can’t rm, and this is a real problem because I need to recreate containers with those same names, but it complains when I try to recreate them.
Is there a way I can get out of this?
These are overlay networks, and I run Consul as the KV store. There is only one Consul node, on the same host (because I don’t need multi-host networking right now).
Thanks in advance.
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 33
- Comments: 66 (15 by maintainers)
Can you try using --force to disconnect the container?

If I wanted to fix issues by restarting things, I’d have stayed on Microsoft Windows.

When no other command works, run sudo service docker restart and your problem will be solved.

@michaellenhart use docker network disconnect -f [network] [container name|id] to disconnect a non-existent container from the network.

Restarting the docker daemon seemed to work for me as well, and I’m on 17.04.0-ce-rc1.
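Putting the force-disconnect suggestion above into a concrete sequence; this is only a minimal sketch, reusing the network and container names from the original report:

```
# Show which endpoints the network still believes are attached
docker network inspect xxx_default

# Force-disconnect the stale endpoints even though the containers no longer exist
docker network disconnect -f xxx_default release_diyaserver_1
docker network disconnect -f xxx_default release_server-postgresql_1

# With no endpoints left, the network can finally be removed
docker network rm xxx_default
```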
Here, docker network prune did not remove the network because it still thought there were containers attached, due to the remaining endpoints; however there weren’t any, and these endpoints are completely invalid, stale leftovers.

Like I said here -> https://github.com/moby/moby/issues/17217 it still happens on Docker version 18.06.1-ce, build e68fc7a. I think one of these tickets might be a duplicate of the other (or was that already decided?)

It’s as above - it needs a network disconnect with force, using the name from the docker inspect call of the network - IDs won’t be found.

I had the same issue with the swarm running on top of etcd, but was able to recover from it. The issue might have been caused by a reboot of the CoreOS box. The recovery procedure was to find the troublesome network endpoint in etcd and delete it. Here is what I did: run docker network inspect <network> and find the troublesome endpoint and its ID, then run etcdctl ls --recursive /docker/network to find the endpoint in etcd, and delete it:
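As a rough sketch of that recovery (assuming the etcd v2 API that the docker cluster store uses; the exact key layout and the IDs below are placeholders, not taken from this thread):

```
# Find the stale endpoint and its ID on the docker side
docker network inspect <network>

# List the libnetwork keys stored in etcd and locate the matching endpoint
etcdctl ls --recursive /docker/network/v1.0/endpoint

# Delete the stale endpoint key, then retry removing the network
etcdctl rm /docker/network/v1.0/endpoint/<network-id>/<endpoint-id>
docker network rm <network>
```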
Hi again, I have some “news”. I keep getting this error, and it has become a real problem: one of our production servers was shut down and restarted, and all (docker) systems were unable to restart because of this error.
I have dug deeper into the problem and I think I have found a way to reproduce the error (at least I have been doing this three or four times and it failed every time, so I suppose this counts as reproducible!).
NOTE: I previously thought this was a problem due to the fact that I was running a single-node consul cluster and that node was itself a Docker container on the host I was crashing. But I have successfully ruled that out: I have created the consul cluster on a remote server, and the consul node that I run on my docker host is only a consul client that connects to the remote consul server.
So I’m going to show the steps to reproduce the error with a remote server, but I’m confident it will be the same if you don’t have a remote server: you just have to create the consul server on the same host (this is how I had it before).
So, first versions and stuff:
I’m running on Ubuntu 16.04.
Steps to reproduce the error
# On my remote server (same docker version), I create a consul server inside a docker container. Here is the docker-compose.yml file:

So nothing fancy: I simply create a consul server node, which bootstraps itself (because only 1 server is needed). Just in case you’re wondering about 192.168.0.1, it’s because I’ve set up an OpenVPN tunnel between the remote server and my computer, and that is its interface.

# On my local computer, I start a consul client that connects to this remote server. The compose file is:

Very similar: simply a consul client.
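The compose files themselves did not survive in this copy of the thread, so here is a rough docker run equivalent of what is being described; the image, flags, and addresses are my assumptions, not the author’s actual configuration:

```
# On the remote server: a single Consul server that bootstraps itself,
# bound to the OpenVPN interface 192.168.0.1 (assumes the official consul image)
docker run -d --name consul-server --net host consul \
  agent -server -bootstrap-expect=1 -bind=192.168.0.1 -client=0.0.0.0

# On the local docker host: a Consul client that joins the remote server and
# exposes the HTTP API on 127.0.0.1:8500 for the docker daemon to use
# (192.168.0.2 is a placeholder for the local end of the VPN tunnel)
docker run -d --name consul-client --net host consul \
  agent -retry-join=192.168.0.1 -bind=192.168.0.2 -client=127.0.0.1
```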
# My docker host daemon is configured with the following systemd service file:

Nothing fancy: I set TasksMax to infinity because I usually need to create a large number of stacks and it quickly reaches the default maximum. The interesting line is --cluster-store=consul://127.0.0.1:8500, which instructs the docker daemon to contact the consul cluster. This is the address of our dockerized consul client. Note that we have After=openvpn.service to make sure docker waits for the VPN tunnel to be up before trying to start and reach the consul server.
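The systemd unit itself is not reproduced here either; as a hedged illustration, the relevant daemon invocation would look roughly like this (on 1.11 the binary is invoked as docker daemon, later as dockerd; the --cluster-advertise value is a placeholder):

```
# Relevant part of the ExecStart line described above (sketch, not the actual unit);
# TasksMax=infinity and After=openvpn.service live in the [Service]/[Unit] sections.
docker daemon -H fd:// \
  --cluster-store=consul://127.0.0.1:8500 \
  --cluster-advertise=eth0:2376
```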
# Now the containers stuff

First, create some overlay networks:
Check that they have been created (but we did not get any error, so…)
Second, create containers using the overlay network:
Just two containers, each one using an overlay network. The -t option and the bash command keep them running. Note the use of --restart=always.
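The exact commands were lost in this copy of the thread, so here is a minimal reconstruction of these two steps (the image and the network names net1/net2 are assumptions; cont1/cont2 match the rest of the comment):

```
# First: create the overlay networks and check that they exist
docker network create -d overlay net1
docker network create -d overlay net2
docker network ls

# Second: one container per overlay network, kept alive by -t + bash,
# with --restart=always so they should come back after a reboot
docker run -d -t --name cont1 --net net1 --restart=always ubuntu bash
docker run -d -t --name cont2 --net net2 --restart=always ubuntu bash
```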
Last, issue a machine reboot: reboot.

Upon machine reboot, my consul container restarted correctly, but not my two containers cont1 and cont2. When I query with the -a option, both of my containers have exited with return code 128 (I’m not sure what this means, btw).

When trying to see the logs, nothing seems abnormal:
It seems to have exited gracefully on host restart. But I get the errors that I mentioned in my first post: I cannot start the container because I get “network already has endpoint with name cont1”; then I try deleting the container and disconnecting it from the network, but then I get the error saying there is no such container.
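To make the broken state concrete, this is roughly the sequence that fails after the reboot, reconstructed from the errors quoted above (not a verbatim log):

```
docker start cont1
# fails: the network already has an endpoint with name cont1 (per the error above)

docker rm cont1
docker network disconnect net1 cont1
# both fail: "No such container" for cont1
```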
The only solution here is to log on to the remote server running the consul server (which of course has not crashed), run docker-compose down -v to delete the persistent volume, and restart consul.

But there are many things that I don’t understand: why won’t the containers automatically restart upon reboot? They were created with --restart=always, and besides, the container running the consul client does restart! Why not cont1 and cont2?

And then, why is there this broken state? I previously thought it was because the docker daemon was shut down before it could commit the changes to its consul server, since the consul server was itself a container. But now the server is outside, so it seems it still doesn’t have a chance to commit the changes?
Does this mean that running overlay networks makes docker installations unreliable, especially with respect to host crashes?
What also worries me is that this was a graceful shutdown, run with reboot; what will happen when the kernel panics, or the machine crashes so hard that it has to restart?

Reading the messages again, I see @aboch suggested that it might be the --cluster-advertise parameter. I am re-running my tests right now and will keep you up to date, but with this new intel, do you guys see anything that might be the problem?

Result of docker version:

We also experienced this problem. When we tried to remove our network with docker network rm <network_id> we got this:

When we inspect with docker network inspect <network_id> we get:

So, according to the network, there is still a container out there. First, we look for the container with docker ps -a | grep myapp and it does not exist:

So we attempt to stop/remove this container and we get:
We were stuck with this network we couldn’t remove. This broke our CI/CD process. Deploying the stack failed because it could not create the myapp-prod_myapp-prod network as defined in the stack file, since one already existed.

We were eventually able to remove the network after disconnecting the zombie container with docker network disconnect --force myapp-prod_myapp-prod myapp-prod_app-extranet.1.r4elcwsxnlpu4kscrbn0h3zxw. Thanks @thaJeztah for that tip! After that, the network was easily removed with docker network rm <id>.

While we do have a manual workaround, this is still an irritating issue for us.
I see the same issue on Docker 1.13.1 using a 2-host overlay network with a 3-node Consul 0.7.4 cluster. I can reproduce the issue by forcibly shutting down one of the Docker hosts.
The result is that after I start the host (and the container) I can see the endpoint with docker network inspect; however, it shows the old container’s ID. docker network disconnect -f doesn’t remove the container; it gives an error message that the endpoint doesn’t exist (it is using the new container ID, I assume).
It would be great if the container ID could be used in network disconnect without being validated against the containers on the host.
You need to run docker system prune -a -f
Hi, I have the same issue with these versions:
and, on a second node: calico/node:v1.1.3, etcdctl version 3.1.7
Steps to reproduce:
The main problem, as stated in a previous comment, is that docker doesn’t remove the etcd entries, so my workaround is to delete them by hand, removing the stopped containers and finally, the created network:
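The commands themselves did not make it into this copy of the thread; here is a hedged sketch of that cleanup, assuming the etcd v2 API and the standard libnetwork key prefix (paths and IDs are placeholders, and the exact key layout may differ):

```
# Remove the stopped containers first
docker rm $(docker ps -aq --filter status=exited)

# Inspect the libnetwork entries that docker left behind in etcd,
# then delete the ones belonging to the stuck network
etcdctl ls --recursive /docker/network/v1.0
etcdctl rm --recursive /docker/network/v1.0/endpoint/<network-id>
etcdctl rm /docker/network/v1.0/network/<network-id>

# Finally, remove the network that could not be deleted before
docker network rm <network>
```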
I hope it helps someone… 😃
Hello there, I’m sorry for responding so late. Well, I have changed the docker version to:
Client:
Version: 17.03.1-ce
API version: 1.27
Go version: go1.7.5
Git commit: c6d412e
Built: Mon Mar 27 17:14:09 2017
OS/Arch: linux/amd64

Server:
Version: 17.03.1-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.7.5
Git commit: c6d412e
Built: Mon Mar 27 17:14:09 2017
OS/Arch: linux/amd64
Experimental: false
Sometimes I get the same error message “Error response from daemon: network lager has active endpoints” when removing a network with docker network rm 8bde, but I can disconnect the active endpoints even if the containers are not “existent” anymore. I use the command docker network disconnect -f <network> <container name>.
Example:
docker network inspect lager
This shows all active endpoints even if the containers are not available anymore; I think they’re called zombie containers 😉
If I want to disconnect the Service deploy_lager-client_1 I use
docker network disconnect -f 8bde deploy_lager-client_1
Or you can use
docker network disconnect -f lager deploy_lager-client_1
You cannot use the container ID, you must instead use the container name. This: docker network disconnect -f lager 252cd is NOT working.

If you have removed all active endpoints, you can delete the network.
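If there are many zombie endpoints, the same idea can be scripted; a small sketch that disconnects, by name, every endpoint still listed on the lager network and then removes it (relies on the Go-template output of docker network inspect):

```
# Disconnect every endpoint still listed on the network, by name
for name in $(docker network inspect lager \
    --format '{{range .Containers}}{{.Name}} {{end}}'); do
  docker network disconnect -f lager "$name"
done

# Then the network can be deleted
docker network rm lager
```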
@thaJeztah Hi again, it was not long ^^
The issue happened again:
And neither container exists when I do docker ps -a.

Then I tried this:
Then network disconnect:

and with the -f:

Note: I have tried every command above, replacing the container name with the container ID; I got the same result.
Weird thing: the container ID doesn’t start with ep- this time.

Note that this is an overlay network, and I have only had this problem with overlay networks.
Any idea? Solution?
Thanks @mavenugo for coming here 😃
I’ve noticed the ep- prefix of the container and suspected that it was indeed something like that.

However, I can confirm that docker network disconnect <network> ep-xxx did not solve the problem, because the daemon responded with no such container: ep-xxx.

And there were, in fact, no other nodes that are part of the overlay: this is a single-host overlay network (I don’t need multi-host yet, but I do need a high number of subnets, which bridge cannot give me yet, see #21776).

For sanity, next time it happens I will re-check with the --force option to docker network disconnect, but I’m 90% sure I tried it and it failed 😕

Thanks for your support!
Typically, when you see containers in docker network inspect output with an ep- prefix, it can be either of 2 cases; docker network disconnect should help.

We faced a similar issue, but with a bridge network that we use as part of our docker compose setup. We tried docker-compose down and then docker-compose up, which failed and gave this error:

This is what ironman-composenet looked like:
I ran the following:
docker container rename backend backend2
docker network disconnect -f ironman-composenet backend
docker container rename backend2 backend
Which all ran fine. We then ran docker-compose down and docker-compose up and it worked as normal.

Be aware there is a different flavour of this issue, where no container is actually attached to the network at all and the network still cannot be deleted (#42119) - so people in here, if not watching closely, could run into the other case too.
So I was not seeing this issue until I recently updated to 17.12.0-ce.
We are deploying stacks onto docker swarms, running series of tests against containers in the stack, removing the stack and then immediately deploying a 2nd stack and running set of tests. Then removing and then deploying a 3rd stack. Somewhere (and it is random) between the remove and deploy of stacks we see this issue.
The process is randomly (but frequently) failing to remove the stack (5 out of 6 containers are removed from the stack). Which container fails to be removed is random.
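For reference, a minimal sketch of the cycle described above (the stack name test matches the error below; the compose file name is an assumption):

```
# Deploy a stack, test it, tear it down, then immediately deploy the next one
docker stack deploy -c docker-compose.yml test
# ... run the test suite against the stack's containers ...
docker stack rm test

# The next deploy then intermittently fails because the previous
# test_default network still has "active endpoints"
docker stack deploy -c docker-compose.yml test
```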
When I try to manually remove the last container, I get the:
Failed to remove network vjbo7hqulyrf1p0uk0ka2nstk: Error response from daemon: network test_default id vjbo7hqulyrf1p0uk0ka2nstk has active endpoints
Failed to remove some resources from stack: test
I then try to remove the test_default network manually - with the same message regarding “has active endpoints”.
I tried restarting the docker daemon - which hung. I was forced to reboot the system.
This is definitely an issue for us - it is regularly breaking our CI/CD process.
@BSWANG As already stated, the --force flag doesn’t work anymore. In my case it is version 1.12, but it worked with 1.10 and (though I’m not sure) 1.11.

Update:

To remove the network in the case that docker network disconnect -f <network> <container> results in an error, you have to pick the network ID (e.g. zp1hd8cmb9h5i1fkiylvsifag for the mentioned gefahr network). Then go into the consul K/V store (assuming you use consul and not etcd), navigate to kv/docker/network/v1.0/network/ and kv/docker/network/v1.0/overlay/, and remove the entry with the found ID from both directories. After this, the network should not be listed anymore. I’ve not observed any side effects, but can’t ensure this.
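A hedged sketch of that Consul cleanup with the consul CLI (the key paths come from the comment above, but the exact layout may differ between versions; the network ID is the example one):

```
# Confirm the ID of the stuck network
docker network ls

# Delete the libnetwork entries for that ID from Consul's K/V store
consul kv delete -recurse docker/network/v1.0/network/zp1hd8cmb9h5i1fkiylvsifag
consul kv delete -recurse docker/network/v1.0/overlay/zp1hd8cmb9h5i1fkiylvsifag

# The network should no longer be listed afterwards
docker network ls
```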