moby: Docker creates firewall rules in nat table that forward packets to wrong container IPs

At Spotify we run dockerd with --bridge to specify a custom network bridge. Sometimes we see that Docker creates firewall rules in the nat table that forward packets to the wrong container IPs.

We don’t use the default docker0 network bridge but our own, called mybridge0.

dxia@myhost.com:~$ sudo ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 02:5e:ce:a0:af:cf brd ff:ff:ff:ff:ff:ff
3: mybridge0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 12:1c:9a:de:0a:70 brd ff:ff:ff:ff:ff:ff
5: veth6d40f6e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master mybridge0 state UP mode DEFAULT group default
    link/ether 12:1c:9a:de:0a:70 brd ff:ff:ff:ff:ff:ff
7: veth3337cc9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master mybridge0 state UP mode DEFAULT group default
    link/ether 4a:20:a7:a8:ae:ae brd ff:ff:ff:ff:ff:ff

It has the address 10.99.0.1/24 (the 10.99.0.0/24 subnet).

dxia@myhost.com:~$ ip addr show mybridge0
3: mybridge0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 12:1c:9a:de:0a:70 brd ff:ff:ff:ff:ff:ff
    inet 10.99.0.1/24 brd 10.99.0.255 scope global mybridge0
       valid_lft forever preferred_lft forever
    inet6 fe80::e83d:47ff:fe39:e3b6/64 scope link
       valid_lft forever preferred_lft forever
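
For context, the bridge is created outside of Docker before dockerd starts. A minimal iproute2 sketch of an equivalent setup (our actual provisioning may differ):

# Create the bridge and assign it the gateway address shown above
# (assumed equivalent of our provisioning, not copied from the host).
sudo ip link add mybridge0 type bridge
sudo ip addr add 10.99.0.1/24 dev mybridge0
sudo ip link set mybridge0 up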

We run dockerd like so.

dxia@myhost.com:~$ ps aux | grep dockerd
root      1662  0.0  0.0 1036284 47496 ?       Ssl  Mar17   7:21 /usr/bin/dockerd -H=unix:///var/run/docker.sock -H=tcp://127.0.0.1:2375 -H=tcp://10.99.0.1:2375 -b=mybridge0 --dns=10.99.0.1 --log-level=debug --storage-driver=aufs --raw-logs
dxia      9720  0.0  0.0  11752  2196 pts/15   S+   16:24   0:00 grep dockerd

Docker creates these firewall rules in the nat table.

dxia@myhost.com:~$ sudo /sbin/iptables --table nat --list-rules DOCKER
-N DOCKER
-A DOCKER -i mybridge0 -j RETURN
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 29103 -j DNAT --to-destination 10.99.0.2:20001
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 27494 -j DNAT --to-destination 10.99.0.2:20000

These are the two running containers and their port mappings.

dxia@myhost.com:~$ docker ps
CONTAINER ID        IMAGE                                                                 COMMAND                  CREATED             STATUS              PORTS                                                                                                                    NAMES
3623a04950a6        some/image:20161005T143810-e1beebd     "/bin/bash -c 'exec /"   5 days ago          Up 5 days           0.0.0.0:27494->20000/tcp, 0.0.0.0:29103->20001/tcp                                                                       2C11A25D5A11EDB19AABC4C2D363DE777B26AF8E
0b89cbd2635f        some/other-image:0.14.0-SNAPSHOT-395d65b   "/myscript.sh"     5 days ago          Up 5 days           0.0.0.0:4567->4567/tcp, 0.0.0.0:5700->5700/tcp, 0.0.0.0:8080->8080/tcp, 0.0.0.0:9010->9010/tcp, 0.0.0.0:9110->9110/tcp   2C11A25D5A11EDB19AABC4C2D363DE777B26AF8E

The containers’ IPs are:

dxia@myhost.com:~$ docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' 3623a04950a6
10.99.0.3

dxia@myhost.com:~$ docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' 0b89cbd2635f
10.99.0.2

Notice that both firewall rules forward packets to the wrong IP, 10.99.0.2. They should forward to 10.99.0.3, the IP of container 3623a04950a6, which is the one publishing host ports 27494 and 29103. I’m also wondering whether there should be nat rules for container 0b89cbd2635f, since it publishes ports as well.
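
A quick way to spot the mismatch is to compare the DNAT destinations against what Docker reports for each container. This is our own diagnostic sketch, not something Docker provides:

# List the DNAT destinations in the DOCKER chain...
sudo iptables -t nat -S DOCKER | grep -- '--to-destination'
# ...and print each running container's name, IP, and port map for comparison.
for c in $(docker ps -q); do
  docker inspect --format '{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}} {{.NetworkSettings.Ports}}' "$c"
done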

Steps to reproduce the issue:

  1. Create a bridge interface
  2. Run dockerd using the bridge interface
  3. Start containers with port mappings shown above
  4. Restart the Docker daemon several times (we believe repeated restarts are what trigger it; a rough command sketch follows this list)
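
A rough sketch of those steps as shell commands. The restart count is arbitrary, and we have not found a sequence that reproduces the bug deterministically:

# 1. Create the bridge as in the iproute2 sketch further up.
# 2. dockerd is started by init with -b=mybridge0, as in the ps output above.
# 3. Start containers with the port mappings shown above
#    (image tags omitted; the images are internal).
docker run -d -p 27494:20000 -p 29103:20001 some/image
docker run -d -p 4567:4567 -p 5700:5700 -p 8080:8080 -p 9010:9010 -p 9110:9110 some/other-image
# 4. Restart the daemon repeatedly (Ubuntu 14.04 uses upstart, hence "service").
for i in $(seq 1 10); do sudo service docker restart; sleep 10; done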

Describe the results you received:

dxia@myhost.com:~$ sudo /sbin/iptables --table nat --list-rules DOCKER
-N DOCKER
-A DOCKER -i mybridge0 -j RETURN
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 29103 -j DNAT --to-destination 10.99.0.2:20001
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 27494 -j DNAT --to-destination 10.99.0.2:20000

netcat results from another host:

nc myhost.com 27494 -vz
nc: connectx to myhost.com port 27494 (tcp) failed: Connection refused

nc myhost.com 29103 -vz
nc: connectx to myhost.com port 29103 (tcp) failed: Connection refused

nc myhost.com 4567 -vz
[hangs]

nc myhost.com 5700 -vz
found 0 associations
found 1 connections:
     1:	flags=82<CONNECTED,PREFERRED>
	outif en3
	src 10.22.33.180 port 50834
	dst 172.16.97.82 port 5700
	rank info not available
	TCP aux info available

Connection to myhost.com port 5700 [tcp/*] succeeded!

nc myhost.com 8080 -vz
found 0 associations
found 1 connections:
     1:	flags=82<CONNECTED,PREFERRED>
	outif en3
	src 10.22.33.180 port 50837
	dst 172.16.97.82 port 8080
	rank info not available
	TCP aux info available

Connection to myhost.com port 8080 [tcp/http-alt] succeeded!

nc myhost.com 9010 -vz
found 0 associations
found 1 connections:
     1:	flags=82<CONNECTED,PREFERRED>
	outif en3
	src 10.22.33.180 port 50841
	dst 172.16.97.82 port 9010
	rank info not available
	TCP aux info available

Connection to myhost.com port 9010 [tcp/*] succeeded!

nc myhost.com 9110 -vz
found 0 associations
found 1 connections:
     1:	flags=82<CONNECTED,PREFERRED>
	outif en3
	src 10.22.33.180 port 50842
	dst 172.16.97.82 port 9110
	rank info not available
	TCP aux info available

Connection to myhost.com port 9110 [tcp/*] succeeded!

Describe the results you expected:

These two firewall rules

-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 29103 -j DNAT --to-destination 10.99.0.2:20001
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 27494 -j DNAT --to-destination 10.99.0.2:20000

should forward packets to 10.99.0.3 instead of 10.99.0.2 (see the IP addresses of each container above), i.e.:

-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 29103 -j DNAT --to-destination 10.99.0.3:20001
-A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 27494 -j DNAT --to-destination 10.99.0.3:20000

Additional information you deem important (e.g. issue happens only occasionally):

We are running dockerd on thousands of instances and restart dockerd once a week on each instance. Right now only ~20 instances have this issue, so it’s not common.
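
On an affected host, one possible manual workaround (our assumption, not something Docker does for you) is to replace the stale DNAT rules by hand using the rule specs from the output above. Restarting the affected container is probably the cleaner fix, since Docker then recreates its own rules:

# Delete the stale rules pointing at 10.99.0.2 ...
sudo iptables -t nat -D DOCKER ! -i mybridge0 -p tcp -m tcp --dport 29103 -j DNAT --to-destination 10.99.0.2:20001
sudo iptables -t nat -D DOCKER ! -i mybridge0 -p tcp -m tcp --dport 27494 -j DNAT --to-destination 10.99.0.2:20000
# ... and re-add them pointing at the correct container IP.
sudo iptables -t nat -A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 29103 -j DNAT --to-destination 10.99.0.3:20001
sudo iptables -t nat -A DOCKER ! -i mybridge0 -p tcp -m tcp --dport 27494 -j DNAT --to-destination 10.99.0.3:20000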

Output of docker info:

dxia@myhost.com:~$ docker info
Containers: 2
 Running: 2
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 1.12.3
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 32
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host null bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 3.16.0-45-generic
Operating System: Ubuntu 14.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 67.01 GiB
Name: myhost.com
ID: QWVL:OXUC:TXZ4:MF27:U6WC:XPSG:7LH7:AKOY:VYLO:BEVU:75BT:BUNX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Output of docker version:

dxia@myhost.com:~$ docker version
Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 21:44:32 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 21:44:32 2016
 OS/Arch:      linux/amd64

Additional environment details (AWS, VirtualBox, physical, etc.):

This happens on AWS instances, Google Compute Engine instances, and physical hardware.

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 2
  • Comments: 21 (14 by maintainers)

Most upvoted comments

It’s definitely not consistent, but we see it intermittently across our fleet (on the order of thousands of running daemons).

IMO we should address this at container start (i.e., block on having the correct DNAT rules set up), since that will be easier to keep consistent and to catch than changing the delete behavior to block on removal. That would at least mitigate the impact if something like this were to happen.

@arkodg @thaJeztah With regard to more details and reproducing this: as @mnewswanger mentioned, we do not observe this behavior consistently. Our best guess is that across a fleet of thousands of instances (and more in production), the DELETE call for iptables to remove the rule inevitably times out or fails. I have not looked at the iptables code, but I would expect it to report an error when an action fails, and probably return a non-zero exit code. In fact, we have seen some errors in the Docker logs about iptables failing to delete DNAT rules, just not for exactly the rules that ended up with duplicate ports. I am wondering whether this error propagates to the daemon’s container removal action, and whether retrying the iptables command on error could help here.
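
To make the retry idea concrete, something like the following, as a shell sketch only; the daemon would do the equivalent internally:

# Retry the delete a few times; iptables exits non-zero when the delete fails.
# Passing -w to wait for the xtables lock may also matter if concurrent
# iptables invocations are part of the problem (this is speculation).
rule='! -i mybridge0 -p tcp -m tcp --dport 29103 -j DNAT --to-destination 10.99.0.2:20001'
for attempt in 1 2 3; do
  sudo iptables -w -t nat -D DOCKER $rule && break
  sleep 1
done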

Other than the above, I think a very good first step would be to make this situation obvious and observable when it happens. For example, log an error during container creation/port binding that indicates a potential port conflict. That would at least make the issue visible; part of the problem is that there is currently no detection of it at all, aside from checking the iptables output.

A better fix could be something like this:

  • upon container startup, check that for each port binding there are no existing iptables rules in the DOCKER chain
  • for any port that does have an existing rule, check that nothing is listening on it; it should be safe to assume the rule can only have been Docker’s, since it is in the DOCKER chain
  • if nothing is listening, attempt to remove the rule to enforce consistency
  • rule removal succeeds: proceed with the next port and the rest of the flow
  • rule removal fails: fail to start the container; IMO this is a legitimate error in setting up port bindings

With the above, the behavior would be obvious and users could decide how to handle the error: block until the iptables DELETE succeeds, try a different port, and so on.
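
A rough shell approximation of that flow for a single host port, only to make it concrete; the real check would live in the daemon, and the commands here are illustrative:

port=27494
# Look for an existing DNAT rule for this host port in the DOCKER chain.
existing=$(sudo iptables -t nat -S DOCKER | grep -- "--dport $port " | head -n 1)
if [ -n "$existing" ]; then
  # If something is still listening on the port, treat it as a real conflict.
  if sudo ss -ltn "sport = :$port" | grep -q LISTEN; then
    echo "port $port is still in use; refusing to start the container" >&2
    exit 1
  fi
  # Nothing is listening: try to remove the stale rule; fail hard if we cannot.
  sudo iptables -t nat -D DOCKER ${existing#"-A DOCKER "} || {
    echo "failed to remove stale DNAT rule for port $port" >&2
    exit 1
  }
fi
# ...otherwise proceed with the normal port binding for $port.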