moby: Docker creates firewall rules in nat table that forward packets to wrong container IPs
At Spotify we run dockerd with --bridge to specify a network bridge. Sometimes we see Docker create firewall rules in the nat table that forward packets to the wrong container IPs.
We don't use the default docker0 bridge but our own, called mybridge0.
dxia@myhost.com:~$ sudo ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 02:5e:ce:a0:af:cf brd ff:ff:ff:ff:ff:ff
3: mybridge0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 12:1c:9a:de:0a:70 brd ff:ff:ff:ff:ff:ff
5: veth6d40f6e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master mybridge0 state UP mode DEFAULT group default
link/ether 12:1c:9a:de:0a:70 brd ff:ff:ff:ff:ff:ff
7: veth3337cc9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master mybridge0 state UP mode DEFAULT group default
link/ether 4a:20:a7:a8:ae:ae brd ff:ff:ff:ff:ff:ff
It has the address 10.99.0.1/24 (subnet 10.99.0.0/24).
dxia@myhost.com:~$ ip addr show mybridge0
3: mybridge0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 12:1c:9a:de:0a:70 brd ff:ff:ff:ff:ff:ff
inet 10.99.0.1/24 brd 10.99.0.255 scope global mybridge0
valid_lft forever preferred_lft forever
inet6 fe80::e83d:47ff:fe39:e3b6/64 scope link
valid_lft forever preferred_lft forever
We run dockerd like so.
dxia@myhost.com:~$ ps aux | grep dockerd
root 1662 0.0 0.0 1036284 47496 ? Ssl Mar17 7:21 /usr/bin/dockerd -H=unix:///var/run/docker.sock -H=tcp://127.0.0.1:2375 -H=tcp://10.99.0.1:2375 -b=mybridge0 --dns=10.99.0.1 --log-level=debug --storage-driver=aufs --raw-logs
dxia 9720 0.0 0.0 11752 2196 pts/15 S+ 16:24 0:00 grep dockerd
Docker creates these firewall rules in the nat table.
dxia@myhost.com:~$ sudo /sbin/iptables --table nat --list-rules DOCKER
-N DOCKER
-A DOCKER -i mybridge0 -j RETURN
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 29103 -j DNAT --to-destination 10.99.0.2:20001
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 27494 -j DNAT --to-destination 10.99.0.2:20000
These are the two running containers and their port mappings.
dxia@myhost.com:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3623a04950a6 some/image:20161005T143810-e1beebd "/bin/bash -c 'exec /" 5 days ago Up 5 days 0.0.0.0:27494->20000/tcp, 0.0.0.0:29103->20001/tcp 2C11A25D5A11EDB19AABC4C2D363DE777B26AF8E
0b89cbd2635f some/other-image:0.14.0-SNAPSHOT-395d65b "/myscript.sh" 5 days ago Up 5 days 0.0.0.0:4567->4567/tcp, 0.0.0.0:5700->5700/tcp, 0.0.0.0:8080->8080/tcp, 0.0.0.0:9010->9010/tcp, 0.0.0.0:9110->9110/tcp 2C11A25D5A11EDB19AABC4C2D363DE777B26AF8E
The containers’ IPs are:
dxia@myhost.com:~$ docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' 3623a04950a6
10.99.0.3
dxia@myhost.com:~$ docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' 0b89cbd2635f
10.99.0.2
Notice that the two firewall rules forward packets to the wrong IP, 10.99.0.2; they should forward to 10.99.0.3, the IP of container 3623a04950a6. I'm also wondering whether there should be nat rules for container 0b89cbd2635f, since it publishes several ports but has no DNAT rules at all.
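A quick way to cross-check the DNAT destinations against the live container IPs (a rough sketch that just combines the two commands already shown above):

# DNAT destinations programmed in the DOCKER chain
sudo /sbin/iptables --table nat --list-rules DOCKER | grep -- '--to-destination' | sed 's/.*--to-destination //'
# IPs of the running containers
docker ps --quiet | xargs docker inspect --format='{{.Id}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}'

On this host the first command prints 10.99.0.2:20001 and 10.99.0.2:20000, while the second shows that 10.99.0.2 is the IP of 0b89cbd2635f, not of 3623a04950a6.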
Steps to reproduce the issue:
- Create a bridge interface
- Run dockerd using the bridge interface
- Start containers with the port mappings shown above
- Restart the docker daemon a bunch of times, I think (see the sketch below)
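A rough sketch of those steps (the bridge name, subnet, and host ports are taken from this report; the image names, the dockerd invocation, and the restart loop are simplified placeholders, since we have not found a deterministic trigger):

# 1. Create the bridge interface (hypothetical commands; ours is provisioned by config management).
sudo ip link add mybridge0 type bridge
sudo ip addr add 10.99.0.1/24 dev mybridge0
sudo ip link set mybridge0 up

# 2. Run dockerd against that bridge, in a separate shell or under the init system
#    (the remaining flags are in the ps output above).
sudo dockerd -b=mybridge0 --dns=10.99.0.1

# 3. Start containers with the port mappings shown above (image names are placeholders).
docker run -d -p 27494:20000 -p 29103:20001 some/image
docker run -d -p 4567:4567 -p 5700:5700 -p 8080:8080 -p 9010:9010 -p 9110:9110 some/other-image

# 4. Restart the daemon a number of times, re-creating the containers after each
#    restart, and inspect the DOCKER chain each round:
sudo /sbin/iptables --table nat --list-rules DOCKER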
Describe the results you received:
dxia@myhost.com:~$ sudo /sbin/iptables --table nat --list-rules DOCKER
-N DOCKER
-A DOCKER -i mybridge0 -j RETURN
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 29103 -j DNAT --to-destination 10.99.0.2:20001
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 27494 -j DNAT --to-destination 10.99.0.2:20000
netcat results from another host:
nc myhost.com 27494 -vz
nc: connectx to myhost.com port 27494 (tcp) failed: Connection refused
nc myhost.com 29103 -vz
nc: connectx to myhost.com port 29103 (tcp) failed: Connection refused
nc myhost.com 4567 -vz
[hangs]
nc myhost.com 5700 -vz
found 0 associations
found 1 connections:
1: flags=82<CONNECTED,PREFERRED>
outif en3
src 10.22.33.180 port 50834
dst 172.16.97.82 port 5700
rank info not available
TCP aux info available
Connection to myhost.com port 5700 [tcp/*] succeeded!
nc myhost.com 8080 -vz
found 0 associations
found 1 connections:
1: flags=82<CONNECTED,PREFERRED>
outif en3
src 10.22.33.180 port 50837
dst 172.16.97.82 port 8080
rank info not available
TCP aux info available
Connection to myhost.com port 8080 [tcp/http-alt] succeeded!
nc myhost.com 9010 -vz
found 0 associations
found 1 connections:
1: flags=82<CONNECTED,PREFERRED>
outif en3
src 10.22.33.180 port 50841
dst 172.16.97.82 port 9010
rank info not available
TCP aux info available
Connection to myhost.com port 9010 [tcp/*] succeeded!
nc myhost.com 9110 -vz
found 0 associations
found 1 connections:
1: flags=82<CONNECTED,PREFERRED>
outif en3
src 10.22.33.180 port 50842
dst 172.16.97.82 port 9110
rank info not available
TCP aux info available
Connection to myhost.com port 9110 [tcp/*] succeeded!
Describe the results you expected:
These two firewall rules
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 29103 -j DNAT --to-destination 10.99.0.2:20001
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 27494 -j DNAT --to-destination 10.99.0.2:20000
should forward packets to 10.99.0.3 instead of 10.99.0.2 (see the IP addresses of each container above), i.e.:
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 29103 -j DNAT --to-destination 10.99.0.3:20001
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 27494 -j DNAT --to-destination 10.99.0.3:20000
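As a stopgap on an affected host, the stale rules can be swapped out by hand. This is only a sketch: verify the exact rule text with --list-rules first, and matching rules in the filter table's DOCKER chain or in the nat POSTROUTING chain may need the same treatment.

# Hypothetical manual fix: delete the stale DNAT rules and re-add them pointing
# at the correct container IP.
sudo /sbin/iptables -t nat -D DOCKER ! -i mybridge0 -p tcp -m tcp --dport 29103 -j DNAT --to-destination 10.99.0.2:20001
sudo /sbin/iptables -t nat -D DOCKER ! -i mybridge0 -p tcp -m tcp --dport 27494 -j DNAT --to-destination 10.99.0.2:20000
sudo /sbin/iptables -t nat -A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 29103 -j DNAT --to-destination 10.99.0.3:20001
sudo /sbin/iptables -t nat -A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 27494 -j DNAT --to-destination 10.99.0.3:20000

Restarting the affected container may also regenerate its rules, but we have not verified that on a broken host.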
Additional information you deem important (e.g. issue happens only occasionally):
We are running dockerd on thousands of instances and restart dockerd once a week on each instance. Right now only ~20 instances have this issue, so it's not common.
Output of docker info:
dxia@myhost.com:~$ docker info
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 2
Server Version: 1.12.3
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 32
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: host null bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 3.16.0-45-generic
Operating System: Ubuntu 14.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 67.01 GiB
Name: myhost.com
ID: QWVL:OXUC:TXZ4:MF27:U6WC:XPSG:7LH7:AKOY:VYLO:BEVU:75BT:BUNX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8
Output of docker version:
dxia@myhost.com:~$ docker version
Client:
Version: 1.12.3
API version: 1.24
Go version: go1.6.3
Git commit: 6b644ec
Built: Wed Oct 26 21:44:32 2016
OS/Arch: linux/amd64
Server:
Version: 1.12.3
API version: 1.24
Go version: go1.6.3
Git commit: 6b644ec
Built: Wed Oct 26 21:44:32 2016
OS/Arch: linux/amd64
Additional environment details (AWS, VirtualBox, physical, etc.):
This happens on AWS instances, Google Compute Engine instances, and on physical hardware.
About this issue
- State: open
- Created 7 years ago
- Reactions: 2
- Comments: 21 (14 by maintainers)
It’s definitely not consistent, but we see it intermittently across our fleet (on the order of thousands of running daemons).
IMO we should address this at container start (i.e. block until the correct DNAT rules are set up), as that will make it easier to ensure consistency and to catch problems than changing the delete behavior to block on the removal. That would at least mitigate the impact if something like this were to happen.
@arkodg @thaJeztah WRT more details and reproducing this: as @mnewswanger mentioned, we do not observe this behavior consistently. Our best guess is that across a fleet of thousands of instances (and more in production), the iptables DELETE call that removes the rule will inevitably time out or fail on some host. I did not look at the iptables code, but I would expect it to report an error when an action fails and to return a non-zero exit code. In fact, we have seen errors in the docker logs about iptables failing to delete DNAT rules, just not for exactly the rules that ended up with duplicate ports. I am wondering whether that error propagates to the docker daemon's container removal action, and whether retrying the iptables command on error could help here.

Other than the above, I think a very good first step would be to make this situation obvious and observable when it happens. For example, log an error during container creation/port binding that indicates a potential port conflict. That would at least make the problem visible, which is part of the issue: currently there is no detection of this at all, aside from inspecting the iptables output.

A better fix could be to validate the iptables rules in the docker chain at container start and surface an error when a conflicting DNAT rule is already present. That would make the behavior obvious and let users decide how to handle the error: block until the iptables DELETE succeeds, try a different port, and so on.
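Until something along those lines lands in the daemon, a rough external check can at least make affected hosts visible. The sketch below is not from this thread: it assumes the containers sit on a bridge network, that docker and iptables are on the PATH, and that it runs as root. For every published port of every running container it checks that the DOCKER nat chain has a DNAT rule sending that host port to the container's IP, and prints a line for any mapping where such a rule is missing or points elsewhere.

#!/bin/sh
# Sketch: verify that each published port's DNAT rule targets the right container.
# Run as root so iptables can read the nat table.
set -eu

rules=$(iptables -t nat -S DOCKER)

for id in $(docker ps -q); do
    ip=$(docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$id")
    # One line per published mapping: "<containerPort>/<proto> <hostPort>"
    docker inspect --format \
        '{{range $p, $b := .NetworkSettings.Ports}}{{if $b}}{{$p}} {{(index $b 0).HostPort}}{{printf "\n"}}{{end}}{{end}}' \
        "$id" | while read -r cport hport; do
        [ -n "$hport" ] || continue
        cport=${cport%%/*}   # strip the "/tcp" or "/udp" suffix
        if ! printf '%s\n' "$rules" | grep -q -- "--dport $hport .*--to-destination $ip:$cport"; then
            echo "container $id: host port $hport should DNAT to $ip:$cport but no such rule exists"
        fi
    done
done

On the host described above, this would print two lines for container 3623a04950a6 (host ports 27494 and 29103), since their DNAT rules point at 10.99.0.2 instead of 10.99.0.3.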