libnetwork: Error creating vxlan interface: file exists

Previous related threads:

Comment at the current tail-end of #945 recommends opening a new ticket. I couldn’t find one opened by the original poster, so here we go.

I’ve been using swarm for the past couple of months and frequently hit this problem. I have a modest swarm (~8-9 nodes), all running Ubuntu 16.04, now on Docker 17.05-ce. There is not a great amount of container churn, but I do use a stack YAML file to deploy ~20 services across ~20 encrypted overlay networks.

I tend to find that after a couple of stack deploy / stack rm cycles, my containers get killed at startup with the “Error creating vxlan: file exists” error. This prevents the containers coming up on a host and forces them to attempt to relocate, which may / may not work.

I have noted in the above issues that the problem has several times been thought rectified, yet it always creeps back in for various users.

To rectify the issue, I have tried rebooting the node, restarting iptables, removing the stack and re-creating, all of which work to varying degrees but are most definitely workarounds and not solutions.

I cannot think how I can attempt to reproduce this error, but if anyone wants to suggest ways to debug, I am at your service.

About this issue

State: open
Created 7 years ago
Reactions: 13
Comments: 57 (3 by maintainers)

Most upvoted comments

Same issue here. The

    sudo umount /var/run/docker/netns/*
    sudo rm /var/run/docker/netns/*

fix did not work. Removing the stack and re-adding it seems to have worked (for one stack it worked directly; for the other stack I had to redo the steps).

Next time, can you check if you have “vx-” interface on host: ip link show | grep vx

If so, delete them, it worked for me: ip link delete vx-xxxx

The correction I propose comes from reading the code; I do not have an environment to test it. If some good soul has a test environment, could they test my proposed correction?

+65

fendo64 on Feb 15, 2019

You can find full information and “easy” resolution on docker.

In brief:

1. Check each node for any vx-* interfaces in /sys/class/net:
    $ ls -l /sys/class/net/ | grep vx
2. Once we have the interface IDs, pull more details:
    $ udevadm info /sys/class/net/<vxlanid>
3. If these interfaces exist, we should be able to safely remove them. Replace vx-000000-xxxxx with the interface ID from Step 2, and so on for each interface:
    $ sudo ip -d link show vx-000000-xxxxx
    $ sudo ip link delete vx-000000-xxxxx
4. Redeploy the service.
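The first step above can be wrapped in a small helper. This is only a sketch: `list_vx_ifaces` is a hypothetical name, and the optional directory argument exists purely so the function can be tried against a scratch directory instead of the real /sys/class/net.

```shell
#!/bin/sh
# Print the names of all vx-* entries found in a sysfs-style directory.
# Assumption: the directory defaults to /sys/class/net, as in the steps above.
list_vx_ifaces() {
    dir=${1:-/sys/class/net}
    for path in "$dir"/vx-*; do
        # If the glob matched nothing, "$path" is the literal pattern; skip it.
        [ -e "$path" ] || [ -L "$path" ] || continue
        basename "$path"
    done
}
```

A possible use is `list_vx_ifaces | while read -r dev; do sudo ip -d link show "$dev"; done`, which only inspects the devices and leaves deletion as a separate, deliberate step.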

+16

reachworld on Jun 17, 2020

Found a workaround for this issue, without the need of rebooting or restarting docker daemon. As @sanimej mentioned

For overlay networks, the docker daemon creates a vxlan device with a name like vx-001001-a12eme, where 001001 is the VNI in hex, followed by the shortened network id. This device then gets moved to an overlay-network-specific namespace. When the overlay network is deleted, the device is moved back to the host namespace before it is deleted.

So once you know which vxlan id fails to be created (did a strace of the docker daemon process, which is overkill for sure, but I was in a hurry):

4993 15:01:04.640588 recvfrom(30, "\254\0\0\0\2\0\0\0\267\273\0\0\212\265\372\377\357\377\377\377\230\0\0\0\20\0\5\6\267\273\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\24\0\3\0vx-000105-1158f\0\10\0\r\0\0\0\0\0\\\0\22\0\t\0\1\0vxlan\0\0\0L\0\2\0\10\0\1\0\5\1\0\0\5\0\5\0\0\0\0\0\5\0\6\0\0\0\0\0\5\0\7\0\1\0\0\0\5\0\v\0\1\0\0\0\5\0\f\0\0\0\0\0\5\0\r\0\1\0\0\0\5\0\16\0\1\0\0\0\6\0\17\0\22\265\0\0", 4096, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 172

So 000105-1158f aka 0x105 aka vxlan id 261 in my case.
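The conversion from device name to decimal vxlan id can be done mechanically. The sketch below assumes the vx-<hex VNI>-<short network id> naming described above; `vni_from_ifname` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the decimal VNI encoded in a device name such as vx-000105-1158f.
# Assumption: the name follows the vx-<hex VNI>-<short network id> pattern.
vni_from_ifname() {
    name=$1
    case $name in
        vx-*-*) ;;
        *) echo "not a vx- device name: $name" >&2; return 1 ;;
    esac
    rest=${name#vx-}     # e.g. "000105-1158f"
    hex=${rest%%-*}      # e.g. "000105"
    case $hex in
        ''|*[!0-9a-fA-F]*) echo "no hex VNI in: $name" >&2; return 1 ;;
    esac
    echo "$((0x$hex))"   # shell arithmetic accepts 0x-prefixed constants
}
```

For the name in the strace output, `vni_from_ifname vx-000105-1158f` prints 261, matching the value worked out by hand.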

Build a list of active network namespaces and their vxlans on the failing host. For example:

    # for ns in /var/run/docker/netns/*; do
        echo ":::: $ns"
        nsenter -m -t <PID of docker daemon> nsenter --net="$ns" ip -d link show
      done >> ip.link.show

Now that you know the affected network namespace, double nsenter into it:

    # nsenter -m -t <PID of docker daemon> bash
    # nsenter --net=/var/run/docker/netns/<affected namespace> bash
    # ip link delete vxlan1
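Finding the affected namespace in the collected ip.link.show file can also be scripted. The sketch below assumes each namespace section starts with a ":::: <path>" marker line, as in the loop above, and that `ip -d link show` reports the device with a "vxlan id <N>" detail line; `find_ns_for_vni` is a hypothetical helper name.

```shell
#!/bin/sh
# Read the collected listing on stdin and print every namespace path whose
# section contains a vxlan device with the given decimal id.
find_ns_for_vni() {
    awk -v vni="$1" '
        /^:::: / { ns = $2; next }          # remember the current namespace
        {
            for (i = 1; i < NF; i++)
                if ($i == "vxlan" && $(i + 1) == "id" && $(i + 2) == vni)
                    print ns
        }
    '
}
```

With the id from this comment, that would be `find_ns_for_vni 261 < ip.link.show`. Matching on the "vxlan id" field rather than on the vx- name matters because the device carries a different name (vxlan1 in this case) once it is inside the namespace.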

After that, the error is gone. Pretty sure Docker Inc. knows about that workaround, why they don’t share it is up to the imagination of the reader. Hope this helps.

+14

gitbensons on Sep 15, 2017

@dang3r @dcrystalj @discotroy If you are still having this issue, can you check whether your host has any udev rules that might rename interfaces whose names start with vx-?

For overlay networks, the docker daemon creates a vxlan device with a name like vx-001001-a12eme, where 001001 is the VNI in hex, followed by the shortened network id. This device then gets moved to an overlay-network-specific namespace. When the overlay network is deleted, the device is moved back to the host namespace before it is deleted. If there is a udev rule that could rename these interfaces, and the rename happens before the docker daemon can delete the device, the host will end up with an orphaned interface with that VNI. Subsequent attempts to create that interface will then fail.
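One rough way to look for such rules is to list every udev rule line that assigns NAME. The sketch below is an assumption-laden starting point rather than a definitive check: `find_rename_rules` is a hypothetical helper, the regular expression only recognises NAME assignments (=, := or +=) and deliberately skips NAME== matches, and each hit still has to be read by hand to see whether it could apply to vx-* devices.

```shell
#!/bin/sh
# Print rule lines (file:line:text) that assign NAME in the given directories.
# Directories that do not exist are skipped silently.
find_rename_rules() {
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        grep -rnE '(^|[^A-Za-z_])NAME[[:space:]]*[:+]?=[[:space:]]*"' "$dir"
    done
    return 0
}
```

A typical call would be `find_rename_rules /etc/udev/rules.d /run/udev/rules.d /lib/udev/rules.d /usr/lib/udev/rules.d`; which of these directories exist varies between distributions.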

sanimej on Jun 14, 2017

You can find full information and “easy” resolution on docker.

In brief:

1. Check each node for any vx-* interfaces in /sys/class/net:
    $ ls -l /sys/class/net/ | grep vx
2. Once we have the interface IDs, pull more details:
    $ udevadm info /sys/class/net/<vxlanid>
3. If these interfaces exist, we should be able to safely remove them. Replace vx-000000-xxxxx with the interface ID from Step 2, and so on for each interface:
    $ sudo ip -d link show vx-000000-xxxxx
    $ sudo ip link delete vx-000000-xxxxx
4. Redeploy the service.

Overall this fixed the problem, but it may be dangerous if the removed network is shared, i.e. serves a Traefik proxy… How can I check which service uses which interface?
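One possible way to approach that question, sketched under the assumption (taken from the naming description earlier in the thread) that the last part of the device name is the beginning of the Docker network ID: match that suffix against the full network IDs. `network_for_iface` is a hypothetical helper that reads "ID NAME" pairs on stdin.

```shell
#!/bin/sh
# Print the "ID NAME" lines whose network ID starts with the short id taken
# from the end of a vx-* device name (e.g. "1158f" in vx-000105-1158f).
network_for_iface() {
    suffix=${1##*-}
    [ -n "$suffix" ] || return 1
    while read -r id name; do
        case $id in
            "$suffix"*) printf '%s %s\n' "$id" "$name" ;;
        esac
    done
}
```

It could be fed with `docker network ls --no-trunc --format '{{.ID}} {{.Name}}' | network_for_iface vx-000105-1158f`. Running `docker network inspect` on the resulting network name then lists the containers attached to it on that node, which gives a hint about the services involved.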

ProteanCode on Nov 26, 2020

Next time, can you check if you have “vx-” interface on host: ip link show | grep vx

If so, delete them, it worked for me: ip link delete vx-xxxx

This worked for me. Thanks, @fendo64 !

hexmode on Feb 22, 2020

Same issue here. The

    sudo umount /var/run/docker/netns/*
    sudo rm /var/run/docker/netns/*

fix did not work. Removing the stack and re-adding it seems to have worked (for one stack it worked directly; for the other stack I had to redo the steps).

Next time, can you check if you have “vx-” interface on host: ip link show | grep vx

If so, delete them, it worked for me: ip link delete vx-xxxx

The correction I propose comes from reading the code; I do not have an environment to test it. If some good soul has a test environment, could they test my proposed correction?

That resolved it for me on docker stack deploy on Docker 18.06.1-ce Swarm

leojonathanoh on May 1, 2019

Removing the IP links does fix the problem; however, please fix this permanently.

albertvveld on Jun 23, 2022

If it’s helpful to anybody else I can confirm that this solution also worked for me - I iterated through the list of devices and did:

ip -d link show "${device}" && ip link delete "${device}"
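A cautious variant of that iteration prints the delete commands for review instead of running them. This is a sketch, not the commenter's script: `vx_delete_commands` is a hypothetical helper that expects the one-line-per-interface output of `ip -o link show` on stdin.

```shell
#!/bin/sh
# Turn `ip -o link show` output into one "ip link delete <name>" line per
# vx-* device. Nothing is deleted here; the commands are only printed.
vx_delete_commands() {
    awk -F': ' '$2 ~ /^vx-/ {
        sub(/@.*/, "", $2)          # drop any "@peer" suffix from the name
        print "ip link delete " $2
    }'
}
```

Running `ip -o link show | vx_delete_commands` shows what would be removed; once the list looks right, the same output can be piped to `sudo sh`.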

We were able to bring the cluster back to a happy state once this had been applied - thank you very much for sharing the solution, it solved a big headache at the end of a very stressful day.

bobf on Apr 14, 2021

Happened to me on a single node swarm on Ubuntu 16.04.6 LTS host / 4.4.0-169-generic, tried with Docker 18.09.1, 18.09.9 and 19.03.7.

@fendo64 trick worked for me (i.e. ip link delete vx-xxx)

elthariel on Mar 12, 2020

@beckyjmcdabq essentially, if everything is correct, ip link show | grep vx is empty.

Only when I got the error this issue is all about did I ever see a result on any of my machines (double digits). When deleting the network with ip link delete, the problem was solved. Other than that, a full restart of the node (not just Docker, the machine) solved the problem as well, but of course that takes longer and might have other side effects.

I assume that the deletion of those networks is side-effect free, as they do not exist if the problem is not there.

You could probably go all willy-nilly by running the command with xargs, I guess, but do so at your own risk:

    # use at your own risk; -o prints one line per interface so awk can isolate the name
    ip -o link show | awk -F': ' '$2 ~ /^vx-/ {print $2}' | xargs -rn1 ip link delete

wolfgangpfnuer on Feb 27, 2020

The same happens in our environment:

# docker -v
Docker version 18.09.6, build 481bc77156

# cat /proc/version
Linux version 3.10.0-957.21.2.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Wed Jun 5 14:26:44 UTC 2019

sderungs on Jun 12, 2019

Ran into the same issue. Docker version 18.03.0-ce, build 0520e24

ip link delete vx-xxxx resolved it.

hannseman on Apr 1, 2019

As per https://github.com/docker/libnetwork/issues/562

You can correct this by running:

    sudo umount /var/run/docker/netns/*
    sudo rm /var/run/docker/netns/*

Not sure if this is a long term solution.

jgeyser on May 31, 2017