libnetwork: Error creating vxlan interface: file exists
Previous related threads:
- https://github.com/docker/libnetwork/issues/562
- https://github.com/docker/libnetwork/issues/751
- https://github.com/docker/libnetwork/issues/945
- https://github.com/moby/moby/issues/21482
- https://github.com/moby/moby/issues/28559
Comment at the current tail-end of #945 recommends opening a new ticket. I couldn’t find one opened by the original poster, so here we go.
I’ve been using swarm for the past couple of months, and frequently hit this problem. I have a modest swarm (~8-9 nodes), all running Ubuntu 16.04, now with Docker 17.05-ce. There is not a great amount of container churn, but I do use a stack yaml file to deploy ~20 services across ~20 encrypted overlay networks.
I tend to find that after a couple of stack deploy / stack rm cycles, my containers get killed at startup with the “Error creating vxlan: file exists” error. This prevents the containers coming up on a host and forces them to attempt to relocate, which may / may not work.
I have noted in the above issues that the problem has, several times over, been thought to be rectified, yet it always creeps back in for various users.
To rectify the issue, I have tried rebooting the node, restarting iptables, removing the stack and re-creating, all of which work to varying degrees but are most definitely workarounds and not solutions.
I cannot think how I can attempt to reproduce this error, but if anyone wants to suggest ways to debug, I am at your service.
About this issue
- State: open
- Created 7 years ago
- Reactions: 13
- Comments: 57 (3 by maintainers)
Next time, can you check whether you have any “vx-” interfaces on the host: ip link show | grep vx
If so, delete them; that worked for me: ip link delete vx-xxxx
The correction I propose comes from reading the code; I do not have an environment to test it. If some good soul has a test environment, could they test my proposed correction?
You can find full information and “easy” resolution on docker.
In brief:
1. Check each node for any vx-* interfaces in /sys/class/net:
   $ ls -l /sys/class/net/ | grep vx
2. Once we have the interface IDs, pull more details:
   $ udevadm info /sys/class/net/<vxlanid>
3. If these interfaces exist, we should be able to safely remove them. Replace vx-000000-xxxxx with the interface ID from Step 2:
   $ sudo ip -d link show vx-000000-xxxxx
   $ sudo ip link delete vx-000000-xxxxx
   etc.
4. Redeploy the service.
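The check-and-remove steps above can be sketched as a dry run that only prints the delete commands, so they can be reviewed before anything is removed. This is a minimal sketch, not an official tool: the function names `list_vx_ifaces` and `print_delete_cmds` are made up for illustration, and the optional directory argument exists only so the logic can be exercised against a scratch directory instead of /sys/class/net.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of Docker or iproute2).

# list_vx_ifaces [DIR]: print the names of entries in DIR that start with "vx-".
# DIR defaults to /sys/class/net, where the kernel lists network interfaces.
list_vx_ifaces() {
    dir="${1:-/sys/class/net}"
    for path in "$dir"/vx-*; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$path" ] || [ -L "$path" ] || continue
        basename "$path"
    done
}

# print_delete_cmds [DIR]: dry run. Prints one "ip link delete" command per
# vx-* interface instead of executing it.
print_delete_cmds() {
    list_vx_ifaces "$1" | while read -r ifname; do
        echo "ip link delete $ifname"
    done
}
```

Running `print_delete_cmds` with no argument on an affected node prints the commands for that host; they still need to be run with root privileges, and only after confirming that the listed interfaces really are orphaned.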
Found a workaround for this issue, without the need of rebooting or restarting docker daemon. As @sanimej mentioned
So once you know which vxlan id fails to be created (I did a strace of the docker daemon process, which is overkill for sure, but I was in a hurry):
4993 15:01:04.640588 recvfrom(30, "\254\0\0\0\2\0\0\0\267\273\0\0\212\265\372\377\357\377\377\377\230\0\0\0\20\0\5\6\267\273\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\24\0\3\0vx-000105-1158f\0\10\0\r\0\0\0\0\0\\\0\22\0\t\0\1\0vxlan\0\0\0L\0\2\0\10\0\1\0\5\1\0\0\5\0\5\0\0\0\0\0\5\0\6\0\0\0\0\0\5\0\7\0\1\0\0\0\5\0\v\0\1\0\0\0\5\0\f\0\0\0\0\0\5\0\r\0\1\0\0\0\5\0\16\0\1\0\0\0\6\0\17\0\22\265\0\0", 4096, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 172
So 000105-1158f aka 0x105 aka vxlan id 261 in my case.
Build a list of active network namespaces and their vxlans on the failing host. For example:
# for ns in /var/run/docker/netns/*; do echo ":::: $ns"; nsenter -m -t <PID of docker daemon> nsenter --net="$ns" ip -d link show; done >> ip.link.show
Now that you know the affected network namespace, double nsenter into it:
# nsenter -m -t <PID of docker daemon> bash
# nsenter --net=/var/run/docker/netns/<affected namespace> bash
# ip link delete vxlan1
After that, the error is gone. Pretty sure Docker Inc. knows about that workaround; why they don’t share it is up to the imagination of the reader. Hope this helps.
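The hex-to-decimal step in that comment (000105 is 0x105, which is vxlan id 261) can be done without mental arithmetic. A small sketch, assuming the interface name has the vx-<hex VNI>-<short id> shape seen in the strace output; the function name `vni_from_ifname` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the decimal VNI encoded in a vx-<hex>-<id> name.
vni_from_ifname() {
    name="$1"
    case "$name" in
        vx-*-*) ;;
        *) echo "not a vx-<hex>-<id> name: $name" >&2; return 1 ;;
    esac
    rest="${name#vx-}"   # e.g. 000105-1158f
    hex="${rest%%-*}"    # e.g. 000105
    # Reject anything that is not purely hexadecimal.
    case "$hex" in
        ''|*[!0-9a-fA-F]*) echo "bad hex field in: $name" >&2; return 1 ;;
    esac
    # printf accepts C-style 0x constants for %d.
    printf '%d\n' "0x$hex"
}
```

For example, `vni_from_ifname vx-000105-1158f` prints 261, matching the id worked out by hand above.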
@dang3r @dcrystalj @discotroy If you are still having this issue, can you check whether your host has any udev rules that might rename interfaces whose names start with vx?
For overlay networks, the docker daemon creates a vxlan device with a name like vx-001001-a12eme, where 001001 is the VNI id in hex, followed by a shortened network id. This device then gets moved to an overlay-network-specific namespace. When the overlay network is deleted, the device is moved back to the host namespace before it is deleted. If there is a udev rule that renames these interfaces, and the rename happens before the docker daemon can delete the device, the host will end up with an orphaned interface with that VNI id. So subsequent attempts to create that interface will fail.
This fixed the problem overall, but it may be dangerous if the removed network is shared, i.e. serves as a traefik proxy… How can I check which service uses which interface?
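One possible way to approach the "which network is this interface?" question is to match the trailing part of the interface name against the full network IDs known to Docker. This is only a sketch under an assumption: the thread says the suffix is a "shortened network id", and the code below assumes that means a prefix of the full ID. The function name `match_network` is made up; `docker network ls --no-trunc --format` is the standard CLI for listing full IDs.

```shell
#!/bin/sh
# Hypothetical helper: read "ID NAME" lines on stdin and print those whose ID
# starts with the suffix of the given vx-<hex>-<id> interface name.
# ASSUMPTION: the suffix is a prefix of the full network ID.
match_network() {
    suffix="${1##*-}"    # e.g. a12eme from vx-001001-a12eme
    [ -n "$suffix" ] || return 1
    while read -r id name; do
        case "$id" in
            "$suffix"*) echo "$id $name" ;;
        esac
    done
}

# Usage on a swarm node (not executed here):
#   docker network ls --no-trunc --format '{{.ID}} {{.Name}}' \
#       | match_network vx-001001-a12eme
```

If a network name comes back, `docker network inspect <name>` on that node lists the containers attached to it, which gives a hint about which services would be affected before the interface is deleted.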
This worked for me. Thanks, @fendo64 !
That resolved it for me on docker stack deploy on Docker 18.06.1-ce Swarm.
Removing IP links does fix the problem; however, please fix this permanently.
If it’s helpful to anybody else I can confirm that this solution also worked for me - I iterated through the list of devices and did:
We were able to bring the cluster back to a happy state once this had been applied - thank you very much for sharing the solution, it solved a big headache at the end of a very stressful day.
Happened to me on a single node swarm on Ubuntu 16.04.6 LTS host / 4.4.0-169-generic, tried with Docker 18.09.1, 18.09.9 and 19.03.7.
@fendo64 trick worked for me (i.e. ip link delete vx-xxx)
@beckyjmcdabq essentially, if everything is correct, ip link show | grep vx is empty. Only when I got the error this issue is all about did I ever see a result on any of my machines (double digits). When deleting the network with ip link delete, the problem was solved. Other than doing this, a full restart of the node (not just docker, the machine) solved the problem as well, but of course that takes longer and might have other side effects.
I assume that the deletion of those networks is side-effect free, as they do not exist if the problem is not there.
you could probably go all willy-nilly by running the command with xargs I guess, but do so at your own risk:
# use at your own risk (lists interface names only, so xargs gets one name per call):
ls /sys/class/net | grep '^vx-' | xargs -rn1 ip link delete
Same happens to our environment:
# docker -v
# cat /proc/version
Ran into the same issue. Docker version 18.03.0-ce, build 0520e24
ip link delete vx-xxxx resolved it.
As per https://github.com/docker/libnetwork/issues/562
You can correct this by running:
sudo umount /var/run/docker/netns/*
sudo rm /var/run/docker/netns/*
Not sure if this is a long term solution.