moby: containers in docker 1.11 do not get the same MTU as the host
This issue is the same/similar to the issues documented in #22028 and #12565, and is being opened under a new issue at the request of @thaJeztah.
BUG REPORT INFORMATION
Output of docker version:
Client:
Version: 1.11.0
API version: 1.23
Go version: go1.5.4
Git commit: 4dc5990
Built: Wed Apr 13 18:40:36 2016
OS/Arch: linux/amd64
Server:
Version: 1.11.0
API version: 1.23
Go version: go1.5.4
Git commit: 4dc5990
Built: Wed Apr 13 18:40:36 2016
OS/Arch: linux/amd64
Output of docker info:
Containers: 8
Running: 8
Paused: 0
Stopped: 0
Images: 8
Server Version: 1.11.0
Storage Driver: devicemapper
Pool Name: docker-253:1-260061060-pool
Pool Blocksize: 65.54 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 2.418 GB
Data Space Total: 107.4 GB
Data Space Available: 79.62 GB
Metadata Space Used: 6.398 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.141 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.107-RHEL7 (2015-12-01)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: null host bridge
Kernel Version: 3.10.0-327.13.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.64 GiB
Name: summit-training-gse.novalocal
ID: JOSZ:IGGB:P4V5:ADQH:HSZM:XFFE:TSNR:LUWM:FQAC:N6FC:TZOM:GNKC
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Additional environment details (AWS, VirtualBox, physical, etc.):
Docker running in a CentOS 7.2 VM on RedHat RDO (Liberty) OpenStack cloud.
Steps to reproduce the issue:
- Create a docker container
- Compare MTU on container with MTU on host
- Try a command such as apt-get update inside the container, which typically produces packets large enough to need fragmentation (example commands below)
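For example (image names and the target host are illustrative; the ping test assumes an image that ships iputils, such as centos:7):
# on the host
ip link show eth0
# inside a container
docker run --rm busybox ip link show eth0
# demonstrate the black hole: 1400 bytes of payload plus 28 bytes of ICMP/IP headers exceeds the
# 1400-byte path MTU, so the ping times out silently instead of fragmenting or returning an error
docker run --rm centos:7 ping -M do -s 1400 -c 3 <external_host>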
Describe the results you received:
Host interface info:
eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> **mtu 1400** qdisc pfifo_fast state UP qlen 1000
Container interface info:
eth0@if32: <BROADCAST,MULTICAST,UP,LOWER_UP> **mtu 1500** qdisc noqueue state UP group default
Requests originating from the container with packets larger than 1400 bytes are dropped
Describe the results you expected:
I would expect functionality on par with pre-1.10 Docker, where users could expect networking to work without user intervention. Intervention here means setting the MTU on the daemon (since there is no sysconfig or other environment-based configuration mechanism, this literally means editing the service script), editing the Docker-related iptables rules, or adjusting the MTU inside the container.
Additional information you deem important (e.g. issue happens only occasionally):
The other tickets referenced identify a couple of workarounds, including setting the --mtu flag on the docker daemon. That did not work for us: after removing the container and image, adjusting the daemon arguments, and starting the container again, the MTU in the container remained at 1500 while the host's was 1400.
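For reference, on a systemd host the flag is typically passed through a drop-in unit rather than by editing the packaged service script; the path and ExecStart line below are assumptions based on the stock docker-engine 1.11 unit and should be matched to your installed one:
# /etc/systemd/system/docker.service.d/mtu.conf
[Service]
ExecStart=
ExecStart=/usr/bin/docker daemon -H fd:// --mtu=1400
Then run systemctl daemon-reload && systemctl restart docker. Per the comments further down, this at best affects the default docker0 bridge, not networks created separately by compose.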
Our workaround involves inserting an iptables rule to mangle the packets in transit between the host and container:
iptables -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
(see: https://www.frozentux.net/iptables-tutorial/chunkyhtml/x4721.html)
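The rule rewrites the MSS option in TCP SYN packets so that neither end ever sends segments larger than the path can carry. An equivalent form pins the MSS to a fixed value instead of deriving it from the path MTU (1360 is an assumption: the 1400-byte MTU minus 40 bytes of IPv4 and TCP headers):
iptables -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360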
I wouldn’t consider any of the workarounds referenced to be a fix for this issue. In my opinion, the fix for the issue is to have the container MTU match host interface MTU upon container creation without user intervention.
About this issue
- State: closed
- Created 8 years ago
- Reactions: 7
- Comments: 16 (7 by maintainers)
Simple workaround for OpenStack and compose format 2 (I will use an MTU of 1450 in this example; works in docker 1.12.x and probably also 1.11.x):
Make sure to pass the correct --mtu=1450 to the daemon so the host network (docker0) gets the right MTU. The bridge networks created by compose will still get the default 1500 MTU. EDIT: When using compose format 2, the docker0 network will remain at 1500 MTU (unlike when using format 1). This is perfectly fine; I'm guessing that since no container is attached directly to that network, the MTU change does not apply. We can fix the additional networks by overriding the default network in the compose file (version 2), as sketched below.
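A minimal sketch of that override, assuming a version 2 compose file (the service name and image are illustrative; 1450 matches the example MTU above):
version: "2"
services:
  web:
    image: debian:jessie
networks:
  default:
    driver: bridge
    driver_opts:
      com.docker.network.driver.mtu: "1450"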
You probably have to manually delete the old network created by compose (docker network rm <network_name>), as you will likely see a message like: ERROR: Network "<network_name>" needs to be recreated - options have changed. This needs to be done every time you change anything about the network. (Optionally, as a quick fix if you have a lot of hosts to work with, you can get compose to use a different network name so you are not obstructed by this; use com.docker.network.bridge.name for that.) This is equivalent to running docker network create -o "com.docker.network.driver.mtu"="1450" <network_name> (which creates a bridge unless you specify otherwise), so if you prefer to manage your networks manually, this is what you do. Just overriding the default network to point at your manually created external network is also easy.
When creating bridge networks manually, do not get confused by the initial mtu set on the interface. It will report an mtu of 1500, but as soon as you run containers, the values will adjust.
The more confusing part was that the engine reference docs list the wrong parameter name for specifying the bridge MTU (found this related issue: #24921): https://docs.docker.com/engine/reference/commandline/network_create/#/bridge-driver-options
It took a fair amount of digging to finally get this working. I'm sure this can be translated to other ways of configuring networks; I just used the default bridge network for simplicity. As long as you find the right values for the driver you are using, you should be fine.
NOTE: This test was done on Ubuntu Trusty. There might be some underlying issues related to the network configuration that need to be solved. All I know is that the instance gets its MTU of 1450 through DHCP, and that's about it.
I've seen the exact same situation with Docker on OpenStack. Performance suffers; building an image with apt-get upgrade can take hours. We were setting the --mtu flag manually and restarting the docker daemon, but with Docker 1.12 and docker-machine it has become problematic.
Yes, I have a system demonstrating this available to me now.
The hypervisor host has MTU 1500. Instance traffic is VXLAN encapsulated which means the instances get a lower MTU set, i.e. the 1400 shown here. (This is similar to the IPSEC encapsulation scenario at the start of this issue.)
Docker containers set the MTU to 1500, and large transfers hang after the initial handshake.
tcpdump shows no ICMP fragmentation-needed packets on any of the interfaces. net.ipv4.ip_no_pmtu_disc = 0 on the instance and the hypervisor host.
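For reference, a capture filter that would have shown those packets had they been generated (type 3, code 4 is ICMP destination unreachable / fragmentation needed; the interface name is an assumption):
tcpdump -ni eth0 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'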
If I set --mtu=1400 or do docker run --net=host then the problem does not appear. It's due to Docker using a Linux bridge with a too-large MTU. Looks like this issue too: https://github.com/docker/docker/issues/12565#issuecomment-95226447. Let me know if there's anything else I can provide to help.
docker run --rm debian:jessie sh -c "ip a | grep mtu" does give me the expected MTU value, as configured with the --mtu flag, in docker 1.11.x and 1.12.x… but when doing a run or up through a compose setup (using the default network_mode and no custom network config, docker-compose==1.8.0, format 2), the networks created (br-xxxxxxx) use an MTU of 1500 regardless of the --mtu flag. docker-compose run --rm myservice sh -c "ip a | grep mtu" always returns 1500.
If I run everything in the host network, everything works perfectly fine… but that is painful.
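One way to see the mismatch from the host is to compare the daemon's default bridge with the compose-created ones (the br- prefix is as reported above; note the earlier caveat that a bridge may report 1500 until containers are attached):
ip link show docker0
ip -o link show | grep 'br-'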
I’m not sure if docker-compose is doing something wrong or dockerd itself, or maybe we are supposed to configure these additional networks ourselves in detail.
EDIT: This is explained in the post below
(Running in OpenStack, where the MTU is 1450)
Setting the container MTU to the host MTU is a hack, and we shouldn't be implementing hacks to solve network issues. What if your box has a second route with an MTU smaller than the default route's? Your containers will still have issues using that alternate route. The proper solution is to figure out why PMTU discovery is apparently not working.
Edited: s/bigger/smaller/