moby: containers in docker 1.11 do not get the same MTU as the host
This issue is the same/similar to the issues documented in #22028 and #12565, and is being opened under a new issue at the request of @thaJeztah.
BUG REPORT INFORMATION
Output of docker version:
Client:
Version: 1.11.0
API version: 1.23
Go version: go1.5.4
Git commit: 4dc5990
Built: Wed Apr 13 18:40:36 2016
OS/Arch: linux/amd64
Server:
Version: 1.11.0
API version: 1.23
Go version: go1.5.4
Git commit: 4dc5990
Built: Wed Apr 13 18:40:36 2016
OS/Arch: linux/amd64
Output of docker info:
Containers: 8
Running: 8
Paused: 0
Stopped: 0
Images: 8
Server Version: 1.11.0
Storage Driver: devicemapper
Pool Name: docker-253:1-260061060-pool
Pool Blocksize: 65.54 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 2.418 GB
Data Space Total: 107.4 GB
Data Space Available: 79.62 GB
Metadata Space Used: 6.398 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.141 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.107-RHEL7 (2015-12-01)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: null host bridge
Kernel Version: 3.10.0-327.13.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.64 GiB
Name: summit-training-gse.novalocal
ID: JOSZ:IGGB:P4V5:ADQH:HSZM:XFFE:TSNR:LUWM:FQAC:N6FC:TZOM:GNKC
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Additional environment details (AWS, VirtualBox, physical, etc.):
Docker running in a CentOS 7.2 VM on RedHat RDO (Liberty) OpenStack cloud.
Steps to reproduce the issue:
- Create a docker container
- Compare MTU on container with MTU on host
- Try a command such as apt-get update inside the container, which typically produces packets large enough to need fragmentation (example commands below)
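For example (image names and the target host are illustrative; the ping test assumes an image that ships iputils, such as centos:7):
# on the host
ip link show eth0
# inside a container
docker run --rm busybox ip link show eth0
# demonstrate the black hole: 1400 bytes of payload plus 28 bytes of ICMP/IP headers exceeds the
# 1400-byte path MTU, so the ping times out silently instead of fragmenting or returning an error
docker run --rm centos:7 ping -M do -s 1400 -c 3 <external_host>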
Describe the results you received:
Host interface info:
eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> **mtu 1400** qdisc pfifo_fast state UP qlen 1000
Container interface info:
eth0@if32: <BROADCAST,MULTICAST,UP,LOWER_UP> **mtu 1500** qdisc noqueue state UP group default
Requests originating from the container with packets larger than 1400 bytes are dropped
Describe the results you expected:
I would expect functionality on par with pre-1.10 Docker, where users could expect networking to work without user intervention. Intervention here means setting the MTU on the daemon (since there is no sysconfig or other environment-based configuration mechanism, this literally means editing the service script), editing the Docker-related iptables rules, or adjusting the MTU inside the container.
Additional information you deem important (e.g. issue happens only occasionally):
The other tickets referenced identify a couple of workarounds, including setting the --mtu flag on the docker daemon. That did not work for us: after removing the container and image, adjusting the daemon arguments, and starting the container again, the MTU in the container remained at 1500 while the host's was 1400.
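For reference, on a systemd host the flag is typically passed through a drop-in unit rather than by editing the packaged service script; the path and ExecStart line below are assumptions based on the stock docker-engine 1.11 unit and should be matched to your installed one:
# /etc/systemd/system/docker.service.d/mtu.conf
[Service]
ExecStart=
ExecStart=/usr/bin/docker daemon -H fd:// --mtu=1400
Then run systemctl daemon-reload && systemctl restart docker. Per the comments further down, this at best affects the default docker0 bridge, not networks created separately by compose.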
Our workaround involves inserting an iptables rule to mangle the packets in transit between the host and container:
iptables -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
(see: https://www.frozentux.net/iptables-tutorial/chunkyhtml/x4721.html)
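The rule rewrites the MSS option in TCP SYN packets so that neither end ever sends segments larger than the path can carry. An equivalent form pins the MSS to a fixed value instead of deriving it from the path MTU (1360 is an assumption: the 1400-byte MTU minus 40 bytes of IPv4 and TCP headers):
iptables -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360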
I wouldn’t consider any of the workarounds referenced to be a fix for this issue. In my opinion, the fix for the issue is to have the container MTU match host interface MTU upon container creation without user intervention.
About this issue
- State: closed
- Created 8 years ago
- Reactions: 7
- Comments: 16 (7 by maintainers)
Simple workaround for OpenStack and compose format 2 (I will use an MTU of 1450 in this example; works in docker 1.12.x and probably also 1.11.x):
Make sure to pass the correct --mtu=1450 to the daemon so the host network (docker0) gets the right MTU. The bridge networks created by compose will still get the default 1500 MTU. EDIT: When using compose format 2, the docker0 network will remain at 1500 MTU (unlike when using format 1). This is perfectly fine; I'm guessing that since no container is attached directly to that network, the MTU change does not apply. We can fix the additional networks by overriding the default network in the compose file (version 2), as sketched below.
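A minimal sketch of that override, assuming a version 2 compose file (the service name and image are illustrative; 1450 matches the example MTU above):
version: "2"
services:
  web:
    image: debian:jessie
networks:
  default:
    driver: bridge
    driver_opts:
      com.docker.network.driver.mtu: "1450"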
You probably have to manually delete the old network created by compose (docker network rm <network_name>), as you will likely see a message like: ERROR: Network "<network_name>" needs to be recreated - options have changed. This needs to be done every time you change anything about the network. (Optionally, as a quick fix if you have a lot of hosts to work with, you can get compose to use a different network name so you are not obstructed by this; use com.docker.network.bridge.name for that.) This is equivalent to running docker network create -o "com.docker.network.driver.mtu"="1450" <network_name> (which creates a bridge unless you specify otherwise), so if you prefer to manage your networks manually, this is what you do. Just overriding the default network to point at your manually created external network is also easy.
When creating bridge networks manually, do not get confused by the initial mtu set on the interface. It will report an mtu of 1500, but as soon as you run containers, the values will adjust.
The more confusing part was that the engine reference docs list the wrong parameter name for specifying the bridge MTU (found this related issue: #24921): https://docs.docker.com/engine/reference/commandline/network_create/#/bridge-driver-options
It took a fair amount of digging to finally get this working. I'm sure this can be translated to other ways of configuring networks; I just used the default bridge network for simplicity. As long as you find the right values for the driver you are using, you should be fine.
NOTE: This test was done on Ubuntu Trusty. There might be some underlying issues related to the network configuration that need to be solved. All I know is that the instance gets its MTU of 1450 through DHCP, and that's about it.
I've seen the exact same situation with Docker on OpenStack. Performance suffers; building an image with apt-get upgrade can take hours. We were setting the --mtu flag manually and restarting the docker daemon, but with Docker 1.12 and docker-machine it has become problematic.
Yes, I have a system demonstrating this available to me now.
The hypervisor host has MTU 1500. Instance traffic is VXLAN encapsulated which means the instances get a lower MTU set, i.e. the 1400 shown here. (This is similar to the IPSEC encapsulation scenario at the start of this issue.)
Docker containers set the MTU to 1500, and large transfers hang after the initial handshake.
tcpdump shows no ICMP fragmentation-needed packets on any of the interfaces. net.ipv4.ip_no_pmtu_disc = 0 on the instance and the hypervisor host.
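For reference, a capture filter that would have shown those packets had they been generated (type 3, code 4 is ICMP destination unreachable / fragmentation needed; the interface name is an assumption):
tcpdump -ni eth0 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'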
If I set --mtu=1400 or do docker run --net=host then the problem does not appear. It's due to Docker using a Linux bridge with a too-large MTU. Looks like this issue too: https://github.com/docker/docker/issues/12565#issuecomment-95226447. Let me know if there's anything else I can provide to help.
docker run --rm debian:jessie sh -c "ip a | grep mtu" does give me the expected MTU value, as configured with the --mtu flag, in docker 1.11.x and 1.12.x… but when doing a run or up through a compose setup (using the default network_mode and no custom network config, docker-compose==1.8.0, format 2), the networks created (br-xxxxxxx) use an MTU of 1500 regardless of the --mtu flag. docker-compose run --rm myservice sh -c "ip a | grep mtu" always returns 1500.
If I run everything in the host network, everything works perfectly fine… but that is painful.
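One way to see the mismatch from the host is to compare the daemon's default bridge with the compose-created ones (the br- prefix is as reported above; note the earlier caveat that a bridge may report 1500 until containers are attached):
ip link show docker0
ip -o link show | grep 'br-'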
I’m not sure if docker-compose is doing something wrong or dockerd itself, or maybe we are supposed to configure these additional networks ourselves in detail.
EDIT: This is explained in the post below
(Running in OpenStack, where the MTU is 1450)
Setting the container MTU to the host MTU is a hack, and we shouldn't be implementing hacks to solve network issues. What if your box has a second route with an MTU smaller than the default route's? Your containers will still have issues using that alternate route. The proper solution is to figure out why PMTU discovery is apparently not working.
Edited: s/bigger/smaller/