moby: Bridge interface with Docker breaks Path MTU Discovery when host is using IPSEC
Description of problem: When the host system is using IPSEC (libreswan) to encrypt communications, applications running within Docker have trouble serving files larger than (MTU - IPSEC overhead): requests for such files time out. The same applications (e.g. nginx) running outside of Docker do not have this problem. Files smaller than (MTU - IPSEC overhead), i.e. <8920 bytes in our case, are served fine.
I suspect PMTUD under Docker is simply broken, full stop, but it only becomes an issue under IPSEC, where the host can't just fragment the traffic.
docker version:
Client version: 1.6.0
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 4749651
OS/Arch (client): linux/amd64
Server version: 1.6.0
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 4749651
OS/Arch (server): linux/amd64
docker info:
Containers: 10
Images: 28
Storage Driver: aufs
Root Dir: /mounts/xvdf/appdata/docker/aufs
Backing Filesystem: extfs
Dirs: 48
Dirperm1 Supported: false
Execution Driver: native-0.2
Kernel Version: 3.13.0-46-generic
Operating System: Ubuntu precise (12.04.5 LTS)
CPUs: 1
Total Memory: 3.676 GiB
Name: ipsectest01-uw2a
ID: TBAE:WRIE:X3XH:HXZH:CNTC:VFLV:L2AS:IWQX:VPKO:6TYI:ASJ5:IYNX
WARNING: No swap limit support
uname -a:
Linux ipsectest01-uw2a 3.13.0-46-generic #75~precise1-Ubuntu SMP Wed Feb 11 19:21:25 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Environment details (AWS, VirtualBox, physical, etc.):
AWS, on m3.medium instances. eth0 interface configured with default MTU (9001 bytes)
$ /sbin/ifconfig
docker0 Link encap:Ethernet HWaddr 56:84:7a:fe:97:99
inet addr:172.17.42.1 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::5484:7aff:fefe:9799/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9001 Metric:1
RX packets:5276 errors:0 dropped:0 overruns:0 frame:0
TX packets:12079 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1713691 (1.7 MB) TX bytes:26215400 (26.2 MB)
eth0 Link encap:Ethernet HWaddr 06:39:78:8b:30:e5
inet addr:172.31.16.237 Bcast:172.31.31.255 Mask:255.255.240.0
inet6 addr: fe80::439:78ff:fe8b:30e5/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9001 Metric:1
RX packets:3558290 errors:0 dropped:0 overruns:0 frame:0
TX packets:2956460 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2542373946 (2.5 GB) TX bytes:667374951 (667.3 MB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:1212019 errors:0 dropped:0 overruns:0 frame:0
TX packets:1212019 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:269564051 (269.5 MB) TX bytes:269564051 (269.5 MB)
How reproducible: Always
Steps to Reproduce:
- Install libreswan on two host systems, and configure a host-to-host connection between them - https://libreswan.org/wiki/Host_to_host_VPN
- Ensure the IPSEC tunnels are up
sudo ipsec status |grep established
- Create a file larger than the MTU on both servers:
dd if=/dev/urandom of=/var/tmp/nginx/testfile bs=512 count=24
- Install nginx directly on one server, and configure it to serve up the freshly created testfile:
sudo apt-get install nginx; sudo start nginx; sudo cp /var/tmp/nginx/testfile /usr/share/nginx/html
- Bring up a docker container on the other server with nginx: https://registry.hub.docker.com/_/nginx/
sudo docker run -p 80:80 -v /var/tmp/nginx:/var/www/html dockerfile/nginx
- Retrieve the file from the server running nginx natively (with IPSEC configured):
curl -o /dev/null http://onlynginx/testfile
- Attempt to retrieve the file from the server running nginx under Docker:
curl -o /dev/null http://nginxdocker/testfile
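Because curl can sit on a stalled transfer for a very long time (17 minutes 30 seconds in the run under Actual Results), a client-side check with a read timeout gives a quicker pass/fail when reproducing this. A minimal sketch; fetch_size is a hypothetical helper, not part of any tool mentioned in this report, and the 10-second default is an arbitrary choice.

```python
import http.client
import urllib.request
from typing import Optional


def fetch_size(url: str, timeout: float = 10.0) -> Optional[int]:
    """Download `url` and return how many body bytes arrived.

    Returns None if the connection fails or the transfer stalls. `timeout`
    applies to each blocking socket operation, not to the whole download,
    so a slow but steady transfer still succeeds.
    """
    # Ignore http_proxy/https_proxy so the request goes straight to the server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return len(resp.read())
    except (OSError, http.client.HTTPException):
        # OSError covers URLError and socket timeouts; HTTPException covers
        # truncated bodies such as IncompleteRead.
        return None
```

Run from the client, fetch_size("http://nginxdocker/testfile") would be expected to return 12288 for the native server and None for the containerised one while the problem is present. If you only need the shell, curl's own --max-time option serves the same purpose.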
Actual Results:
Curl against the native nginx works fine, and the file downloads very quickly:
~$ curl -o /dev/null http://onlynginx/testfile
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12288 100 12288 0 0 2002k 0 --:--:-- --:--:-- --:--:-- 2400k
Curl against nginx under Docker never completes:
$ curl -o /dev/null http://nginxdocker/testfile
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:17:30 --:--:-- 0
tcpdump on the server's docker0 interface shows it attempting to send packets of 9001 bytes, with a TCP maximum segment size of 8961 bytes:
$ sudo tcpdump -nvv -i docker0
tcpdump: listening on docker0, link-type EN10MB (Ethernet), capture size 65535 bytes
19:44:46.082719 IP (tos 0x0, ttl 63, id 15325, offset 0, flags [DF], proto TCP (6), length 60)
172.31.15.88.43265 > 172.17.0.1.80: Flags [S], cksum 0x90ed (correct), seq 2992186550, win 26883, options [mss 8961,sackOK,TS val 694951787 ecr 0,nop,wscale 7], length 0
19:44:46.082758 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.1.80 > 172.31.15.88.43265: Flags [S.], cksum 0x67b8 (incorrect -> 0xcb5f), seq 1826792915, ack 2992186551, win 26847, options [mss 8961,sackOK,TS val 197175082 ecr 694951787,nop,wscale 7], length 0
19:44:46.083976 IP (tos 0x0, ttl 63, id 15326, offset 0, flags [DF], proto TCP (6), length 52)
172.31.15.88.43265 > 172.17.0.1.80: Flags [.], cksum 0x7f84 (correct), seq 1, ack 1, win 211, options [nop,nop,TS val 694951788 ecr 197175082], length 0
19:44:46.090379 IP (tos 0x0, ttl 63, id 15327, offset 0, flags [DF], proto TCP (6), length 228)
172.31.15.88.43265 > 172.17.0.1.80: Flags [P.], cksum 0xd22f (correct), seq 1:177, ack 1, win 211, options [nop,nop,TS val 694951789 ecr 197175082], length 176
19:44:46.090395 IP (tos 0x0, ttl 64, id 766, offset 0, flags [DF], proto TCP (6), length 52)
172.17.0.1.80 > 172.31.15.88.43265: Flags [.], cksum 0x67b0 (incorrect -> 0x7ec9), seq 1, ack 177, win 219, options [nop,nop,TS val 197175084 ecr 694951789], length 0
19:44:46.090514 IP (tos 0x0, ttl 64, id 767, offset 0, flags [DF], proto TCP (6), length 9001)
172.17.0.1.80 > 172.31.15.88.43265: Flags [.], cksum 0x8aa5 (incorrect -> 0x7024), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197175084 ecr 694951789], length 8949
19:44:46.090599 IP (tos 0x0, ttl 64, id 768, offset 0, flags [DF], proto TCP (6), length 3646)
172.17.0.1.80 > 172.31.15.88.43265: Flags [P.], cksum 0x75ba (incorrect -> 0xc625), seq 8950:12544, ack 177, win 219, options [nop,nop,TS val 197175084 ecr 694951789], length 3594
19:44:46.091831 IP (tos 0x0, ttl 63, id 15328, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.43265 > 172.17.0.1.80: Flags [.], cksum 0xcecb (correct), seq 177, ack 1, win 350, options [nop,nop,TS val 694951790 ecr 197175084,nop,nop,sack 1 {8950:12544}], length 0
19:44:46.093634 IP (tos 0x0, ttl 64, id 769, offset 0, flags [DF], proto TCP (6), length 9001)
172.17.0.1.80 > 172.31.15.88.43265: Flags [.], cksum 0x8aa5 (incorrect -> 0x7022), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197175085 ecr 694951790], length 8949
19:44:46.297635 IP (tos 0x0, ttl 64, id 770, offset 0, flags [DF], proto TCP (6), length 9001)
172.17.0.1.80 > 172.31.15.88.43265: Flags [.], cksum 0x8aa5 (incorrect -> 0x6fef), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197175136 ecr 694951790], length 8949
19:44:46.705639 IP (tos 0x0, ttl 64, id 771, offset 0, flags [DF], proto TCP (6), length 9001)
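The sizes in this capture follow from simple header arithmetic, which also shows why the full-sized packets cannot get through: every data packet carries the DF flag, so it cannot be fragmented once IPSEC makes it larger than the link MTU. A sketch, assuming IPv4 without IP options, the 12-byte TCP timestamp option visible in the capture, and the 40-byte transport-mode IPSEC overhead quoted later in this report (real ESP overhead varies with cipher and padding). The helper names are illustrative only.

```python
IPV4_HEADER = 20      # bytes, no IP options
TCP_HEADER = 20       # bytes, no TCP options
TCP_TIMESTAMPS = 12   # NOP, NOP + 10-byte timestamp option, as seen in the capture


def mss_for_mtu(mtu: int) -> int:
    """MSS advertised for an interface MTU (IPv4, option-free headers)."""
    return mtu - IPV4_HEADER - TCP_HEADER


def max_payload(mtu: int) -> int:
    """Data bytes per full-sized segment once TCP timestamps are in use."""
    return mss_for_mtu(mtu) - TCP_TIMESTAMPS


def ip_packet_len(payload: int) -> int:
    """IP total length (tcpdump's "length" in the header line) for a payload."""
    return payload + IPV4_HEADER + TCP_HEADER + TCP_TIMESTAMPS


def fits_after_ipsec(ip_len: int, link_mtu: int, overhead: int = 40) -> bool:
    """True if a DF-marked packet still fits the link once IPSEC adds `overhead`."""
    return ip_len + overhead <= link_mtu


print(mss_for_mtu(9001))                 # 8961, the "mss 8961" in the SYNs
print(max_payload(9001))                 # 8949, "length 8949" on the wire
print(ip_packet_len(max_payload(9001)))  # 9001
print(fits_after_ipsec(9001, 9001))      # False: 9041 > 9001
print(fits_after_ipsec(ip_packet_len(max_payload(8960)), 9001))  # True: 9000 <= 9001
print(ip_packet_len(512))                # 564, the packet size seen with MTU probing
```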
Expected Results:
nginx (or any network server) should work the same under docker as it does when running directly on a host.
There are some workarounds, but they should not be necessary, seeing as they are not required when running natively.
Additional info:
Some workarounds:
- Partial: severely degraded performance, and does not always work.
Set sysctl net.ipv4.tcp_mtu_probing=1
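For reference, the kernel's ip-sysctl documentation defines three values for this setting. Mode 1 only starts probing once the kernel decides the path is an ICMP black hole, which fits the behaviour described below: several full-size retransmissions, then much smaller segments. The 512-byte payloads are consistent with the kernel falling back to net.ipv4.tcp_base_mss (512 by default on older kernels such as this one). A small sketch for checking the setting on a host; the helper names are hypothetical, and reading /proc/sys assumes Linux.

```python
from pathlib import Path

# Value meanings for net.ipv4.tcp_mtu_probing, paraphrasing the Linux
# kernel's ip-sysctl documentation.
MTU_PROBING_MODES = {
    0: "disabled",
    1: "disabled by default, enabled when an ICMP black hole is detected",
    2: "always enabled, starting from the initial MSS tcp_base_mss",
}


def describe_mtu_probing(value: int) -> str:
    """Human-readable meaning of a tcp_mtu_probing value."""
    return MTU_PROBING_MODES.get(value, f"unknown value {value}")


def read_mtu_probing(proc_sys: str = "/proc/sys") -> int:
    """Current setting, as sysctl -n net.ipv4.tcp_mtu_probing reports it (Linux only)."""
    path = Path(proc_sys) / "net" / "ipv4" / "tcp_mtu_probing"
    return int(path.read_text().strip())
```

On the affected host, describe_mtu_probing(read_mtu_probing()) would be expected to print the mode-1 description after the sysctl above has been applied.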
$ curl -o /dev/null http://nginxdocker/testfile
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12288 100 12288 0 0 3948 0 0:00:03 0:00:03 --:--:-- 3952
Note that performance is very poor with this: the file took 3 seconds to download, whereas it was otherwise served in under a second.
Additionally, whilst this workaround works with nginx, it doesn't seem to work with other services (e.g. Java-based web applications).
You can see from the tcpdump output that it initially tries, as before, to send packets of 9001 bytes with a TCP maximum segment size of 8961 bytes.
After a few retries, it gives up and sends the data in small 564-byte packets (512 bytes of payload each):
20:04:55.280669 IP (tos 0x0, ttl 63, id 28674, offset 0, flags [DF], proto TCP (6), length 60)
172.31.15.88.45202 > 172.17.0.1.80: Flags [S], cksum 0x8396 (correct), seq 1940900933, win 26883, options [mss 8961,sackOK,TS val 695254087 ecr 0,nop,wscale 7], length 0
20:04:55.280708 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.1.80 > 172.31.15.88.45202: Flags [S.], cksum 0x67b8 (incorrect -> 0x7c2c), seq 72659806, ack 1940900934, win 26847, options [mss 8961,sackOK,TS val 197477381 ecr 695254087,nop,wscale 7], length 0
20:04:55.282082 IP (tos 0x0, ttl 63, id 28675, offset 0, flags [DF], proto TCP (6), length 52)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x3052 (correct), seq 1, ack 1, win 211, options [nop,nop,TS val 695254087 ecr 197477381], length 0
20:04:55.299373 IP (tos 0x0, ttl 63, id 28676, offset 0, flags [DF], proto TCP (6), length 228)
172.31.15.88.45202 > 172.17.0.1.80: Flags [P.], cksum 0x82fa (correct), seq 1:177, ack 1, win 211, options [nop,nop,TS val 695254091 ecr 197477381], length 176
20:04:55.299388 IP (tos 0x0, ttl 64, id 63321, offset 0, flags [DF], proto TCP (6), length 52)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x67b0 (incorrect -> 0x2f91), seq 1, ack 177, win 219, options [nop,nop,TS val 197477386 ecr 695254091], length 0
20:04:55.299506 IP (tos 0x0, ttl 64, id 63322, offset 0, flags [DF], proto TCP (6), length 9001)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x8aa5 (incorrect -> 0x2eea), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197477386 ecr 695254091], length 8949
20:04:55.299593 IP (tos 0x0, ttl 64, id 63323, offset 0, flags [DF], proto TCP (6), length 3646)
172.17.0.1.80 > 172.31.15.88.45202: Flags [P.], cksum 0x75ba (incorrect -> 0x76ed), seq 8950:12544, ack 177, win 219, options [nop,nop,TS val 197477386 ecr 695254091], length 3594
20:04:55.300861 IP (tos 0x0, ttl 63, id 28677, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x359a (correct), seq 177, ack 1, win 350, options [nop,nop,TS val 695254092 ecr 197477386,nop,nop,sack 1 {8950:12544}], length 0
20:04:55.301634 IP (tos 0x0, ttl 64, id 63324, offset 0, flags [DF], proto TCP (6), length 9001)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x8aa5 (incorrect -> 0x2ee8), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197477387 ecr 695254092], length 8949
20:04:55.505633 IP (tos 0x0, ttl 64, id 63325, offset 0, flags [DF], proto TCP (6), length 9001)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x8aa5 (incorrect -> 0x2eb5), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197477438 ecr 695254092], length 8949
20:04:55.913648 IP (tos 0x0, ttl 64, id 63326, offset 0, flags [DF], proto TCP (6), length 9001)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x8aa5 (incorrect -> 0x2e4f), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197477540 ecr 695254092], length 8949
20:04:56.729645 IP (tos 0x0, ttl 64, id 63327, offset 0, flags [DF], proto TCP (6), length 9001)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x8aa5 (incorrect -> 0x2d83), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197477744 ecr 695254092], length 8949
20:04:58.365655 IP (tos 0x0, ttl 64, id 63328, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x1374), seq 1:513, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254092], length 512
20:04:58.366950 IP (tos 0x0, ttl 63, id 28678, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x2d94 (correct), seq 177, ack 513, win 359, options [nop,nop,TS val 695254858 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.366974 IP (tos 0x0, ttl 64, id 63329, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0xf426), seq 513:1025, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254858], length 512
20:04:58.366976 IP (tos 0x0, ttl 64, id 63330, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x5619), seq 1025:1537, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254858], length 512
20:04:58.368187 IP (tos 0x0, ttl 63, id 28679, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x2b8b (correct), seq 177, ack 1025, win 367, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.368200 IP (tos 0x0, ttl 64, id 63331, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x5a24), seq 1537:2049, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.368202 IP (tos 0x0, ttl 64, id 63332, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0xce4b), seq 2049:2561, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.368238 IP (tos 0x0, ttl 63, id 28680, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x2983 (correct), seq 177, ack 1537, win 375, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.368247 IP (tos 0x0, ttl 64, id 63333, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x9184), seq 2561:3073, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.368248 IP (tos 0x0, ttl 64, id 63334, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x1ed3), seq 3073:3585, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369430 IP (tos 0x0, ttl 63, id 28681, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x277b (correct), seq 177, ack 2049, win 383, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.369442 IP (tos 0x0, ttl 64, id 63335, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x6a4c), seq 3585:4097, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369444 IP (tos 0x0, ttl 64, id 63336, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0xa2ad), seq 4097:4609, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369484 IP (tos 0x0, ttl 63, id 28682, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x2573 (correct), seq 177, ack 2561, win 391, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.369492 IP (tos 0x0, ttl 64, id 63337, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x72fc), seq 4609:5121, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369493 IP (tos 0x0, ttl 64, id 63338, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0xd764), seq 5121:5633, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369525 IP (tos 0x0, ttl 63, id 28683, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x236b (correct), seq 177, ack 3073, win 399, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.369531 IP (tos 0x0, ttl 63, id 28684, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x2163 (correct), seq 177, ack 3585, win 407, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.369538 IP (tos 0x0, ttl 64, id 63339, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x9ad5), seq 5633:6145, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369540 IP (tos 0x0, ttl 64, id 63340, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x42fa), seq 6145:6657, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369544 IP (tos 0x0, ttl 64, id 63341, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x02b3), seq 6657:7169, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369546 IP (tos 0x0, ttl 64, id 63342, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0xcf93), seq 7169:7681, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.370746 IP (tos 0x0, ttl 63, id 28685, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x1f5b (correct), seq 177, ack 4097, win 415, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370759 IP (tos 0x0, ttl 64, id 63343, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x6d86), seq 7681:8193, ack 177, win 219, options [nop,nop,TS val 197478154 ecr 695254859], length 512
20:04:58.370831 IP (tos 0x0, ttl 63, id 28686, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x1d53 (correct), seq 177, ack 4609, win 423, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370839 IP (tos 0x0, ttl 63, id 28687, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x1b4b (correct), seq 177, ack 5121, win 431, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370865 IP (tos 0x0, ttl 63, id 28688, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x1943 (correct), seq 177, ack 5633, win 439, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370872 IP (tos 0x0, ttl 63, id 28689, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x173b (correct), seq 177, ack 6145, win 447, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370854 IP (tos 0x0, ttl 64, id 63344, offset 0, flags [DF], proto TCP (6), length 564)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0xd9e7), seq 8193:8705, ack 177, win 219, options [nop,nop,TS val 197478154 ecr 695254859], length 512
20:04:58.370858 IP (tos 0x0, ttl 64, id 63345, offset 0, flags [DF], proto TCP (6), length 297)
172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x68a5 (incorrect -> 0x3740), seq 8705:8950, ack 177, win 219, options [nop,nop,TS val 197478154 ecr 695254859], length 245
20:04:58.370916 IP (tos 0x0, ttl 63, id 28690, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x1533 (correct), seq 177, ack 6657, win 455, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370922 IP (tos 0x0, ttl 63, id 28691, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x132b (correct), seq 177, ack 7169, win 463, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370937 IP (tos 0x0, ttl 63, id 28692, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x1123 (correct), seq 177, ack 7681, win 471, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.371910 IP (tos 0x0, ttl 63, id 28693, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x0f19 (correct), seq 177, ack 8193, win 479, options [nop,nop,TS val 695254860 ecr 197478154,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.372054 IP (tos 0x0, ttl 63, id 28694, offset 0, flags [DF], proto TCP (6), length 64)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x0d11 (correct), seq 177, ack 8705, win 487, options [nop,nop,TS val 695254860 ecr 197478154,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.372114 IP (tos 0x0, ttl 63, id 28695, offset 0, flags [DF], proto TCP (6), length 52)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0xf77c (correct), seq 177, ack 12544, win 495, options [nop,nop,TS val 695254860 ecr 197478154], length 0
20:04:58.372204 IP (tos 0x0, ttl 63, id 28696, offset 0, flags [DF], proto TCP (6), length 52)
172.31.15.88.45202 > 172.17.0.1.80: Flags [F.], cksum 0xf77b (correct), seq 177, ack 12544, win 495, options [nop,nop,TS val 695254860 ecr 197478154], length 0
20:04:58.372287 IP (tos 0x0, ttl 64, id 63346, offset 0, flags [DF], proto TCP (6), length 52)
172.17.0.1.80 > 172.31.15.88.45202: Flags [F.], cksum 0x67b0 (incorrect -> 0xf88e), seq 12544, ack 178, win 219, options [nop,nop,TS val 197478154 ecr 695254860], length 0
20:04:58.373422 IP (tos 0x0, ttl 63, id 28697, offset 0, flags [DF], proto TCP (6), length 52)
172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0xf77a (correct), seq 178, ack 12545, win 495, options [nop,nop,TS val 695254860 ecr 197478154], length 0
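To make the retransmit-then-shrink pattern easier to see, a hypothetical helper like the one below can pull the timestamp and IP length out of each tcpdump -v header line. Run over the capture above, the 9001-byte packet is retransmitted almost immediately after the client's SACK, then at gaps of about 0.2 s, 0.4 s and 0.8 s (the retransmission timer doubling each time), before the first 564-byte segment appears. The regex assumes the exact header layout shown in this capture.

```python
import re
from typing import Iterable, List, Tuple

# Matches tcpdump -v header lines such as:
# 20:04:55.301634 IP (tos 0x0, ttl 64, id 63324, ..., proto TCP (6), length 9001)
HEADER_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}\.\d+) IP \(.*length (\d+)\)$")


def parse_headers(lines: Iterable[str]) -> List[Tuple[float, int]]:
    """Return (seconds since midnight, IP total length) for each header line."""
    packets = []
    for line in lines:
        m = HEADER_RE.match(line.strip())
        if m:  # detail lines ("172.17.0.1.80 > ...") do not match
            hours, minutes, seconds, length = m.groups()
            t = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
            packets.append((t, int(length)))
    return packets


def retransmit_gaps(packets: List[Tuple[float, int]], size: int) -> List[float]:
    """Seconds between consecutive packets whose IP length is exactly `size`."""
    times = [t for t, length in packets if length == size]
    return [round(b - a, 3) for a, b in zip(times, times[1:])]
```

For a saved tcpdump -nvv capture, something like retransmit_gaps(parse_headers(open("capture.txt")), 9001) would list the gaps (the file name is hypothetical).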
- Full, but requires manual configuration. Provides native performance.
Adjust the MTU of the docker0 interface accordingly, allowing 40 bytes for IPSEC overhead (we're using transport mode). If using tunnel mode, allow 60 bytes.
In our case, we set the MTU to 8960:
DOCKER_OPTS=" --host=unix:///var/run/docker.sock --mtu=8960 --storage-driver=aufs -g $(readlink -f /data/appdata/docker)"
Re-run the curl:
$ curl -o /dev/null http://nginxdocker/testfile
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12288 100 12288 0 0 2242k 0 --:--:-- --:--:-- --:--:-- 3000k
This gives the same performance as running outside of Docker.
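The MTU choice above is a subtraction from the host interface MTU. A sketch, assuming the 40-byte (transport) and 60-byte (tunnel) allowances quoted in this report; the helper names are hypothetical, and reading /sys/class/net assumes Linux.

```python
from pathlib import Path

# Per-packet IPSEC overhead allowances quoted in this report (assumed values;
# actual ESP overhead depends on cipher, integrity algorithm and padding).
IPSEC_OVERHEAD = {"transport": 40, "tunnel": 60}


def docker_mtu(host_mtu: int, mode: str = "transport") -> int:
    """Largest docker0 MTU whose packets still fit the host link after IPSEC."""
    return host_mtu - IPSEC_OVERHEAD[mode]


def read_host_mtu(iface: str = "eth0", sys_root: str = "/sys/class/net") -> int:
    """Read an interface MTU from sysfs (Linux only)."""
    return int((Path(sys_root) / iface / "mtu").read_text().strip())


def docker_mtu_flag(host_mtu: int, mode: str = "transport") -> str:
    """Render the value as the --mtu option used in DOCKER_OPTS."""
    return f"--mtu={docker_mtu(host_mtu, mode)}"
```

On the 9001-byte eth0 here this gives --mtu=8961, one byte above the 8960 chosen above; both fit within 9001 after 40 bytes of overhead. For a 1500-byte interface it gives --mtu=1460.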
About this issue
- State: open
- Created 9 years ago
- Reactions: 6
- Comments: 15 (2 by maintainers)
Is there any chance this issue can be moved up to a new milestone? On PaaS setups such as App Engine, we cannot set the MTU for the host, and setting the MTU per container is patchwork at best.
Further to my last comment, setting MTU works within the environment where everything is on the same network, and all nodes have the same MTU.
However, if I had inter-region traffic passing through devices with a 1500-byte MTU between the two servers, this would break again, and yet it would work if the app were running outside of Docker.
Hence my conclusion that Path MTU discovery is broken in docker.
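The argument above comes down to the path MTU being the smallest link MTU along the route. PMTUD is meant to discover that minimum at runtime from ICMP "fragmentation needed" messages (type 3, code 4) instead of relying on a preset value. A sketch with made-up hop MTUs (the 1500-byte middle hop stands in for the hypothetical inter-region device) shows why a fixed --mtu tuned for the local network stops being safe once a smaller link appears. The 40-byte overhead is the figure used earlier in this report.

```python
from typing import Sequence


def path_mtu(hop_mtus: Sequence[int]) -> int:
    """The path MTU is the smallest link MTU along the route."""
    return min(hop_mtus)


def fixed_mtu_fits(container_mtu: int, hop_mtus: Sequence[int],
                   ipsec_overhead: int = 40) -> bool:
    """True if full-sized container packets, plus IPSEC overhead, fit every hop."""
    return container_mtu + ipsec_overhead <= path_mtu(hop_mtus)


# Hypothetical routes: hop MTUs in bytes, in order from sender to receiver.
same_vpc = [9001, 9001]
inter_region = [9001, 1500, 9001]

print(fixed_mtu_fits(8960, same_vpc))      # True:  9000 <= 9001
print(fixed_mtu_fits(8960, inter_region))  # False: 9000 >  1500
```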
Ok, so I’ve gone and re-read your other tickets as well to get more of an understanding on what’s going on.
So it seems you're running on DigitalOcean, which, judging from your pastes, doesn't support jumbo frames, so telling Docker to set the MTU to 8960 was never going to work (it would have to be less than the MTU of your normal interface, i.e. 1500).
Therefore, in your case, if you pass
--mtu 1460
to Docker as a parameter, it would work as desired. (To calculate this, take the MTU of your normal network interface, in your case eth1 at 1500, and subtract 40 bytes for IPsec overhead. That's how I got to 1460.)
I set up a test two-node ES cluster with host-to-host IPsec just now to test a few things. It's in EC2, but I disabled jumbo frames on the interfaces (set the MTU to 1500 for eth0 on each node) to mimic your setup, and ran through a few scenarios.
Steps used each time: Elasticsearch was run with the following parameters:
Host1:
-Des.network.host=0.0.0.0 -Des.cluster.name=bob -Des.discovery.zen.ping.unicast.hosts=172.16.3.73,172.16.3.41 -Des.network.publish_host=172.16.3.43 -Des.discovery.zen.minimum_master_nodes=2
Host2:
-Des.network.host=0.0.0.0 -Des.cluster.name=bob -Des.discovery.zen.ping.unicast.hosts=172.16.3.73,172.16.3.41 -Des.network.publish_host=172.16.3.73 -Des.discovery.zen.minimum_master_nodes=2
Data load done via:
curl -XPOST '172.16.3.73:9200/bank/account/_bulk?pretty' --data-binary "@accounts.json"
Verification that the documents were replicated was done via:
curl '172.16.3.41:9200/_cat/indices?v'
--mtu 1460
iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -o eth1 -j TCPMSS --set-mss 1460
iptables -t mangle -A FORWARD -o docker0 -p tcp -m tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1460:9001 -j TCPMSS --set-mss 1460
Hope this is of help for you. FWIW, I'd recommend option 2 (--mtu 1460), as MSS clamping only works for TCP, so you may potentially see other issues.
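When choosing a --set-mss value, note that the MSS counts only TCP payload, so for IPv4 with option-free headers it sits 40 bytes below the MTU (the iptables-extensions documentation describes --clamp-mss-to-pmtu as clamping to the path MTU minus 40 for IPv4). A minimal sketch, assuming the 40-byte IPsec overhead figure used in this thread: on a 1500-byte link, an MSS of 1460 yields 1500-byte packets before encryption, while 1420 is the largest MSS that leaves room for that overhead. The helper names are hypothetical.

```python
IPV4_HEADER = 20  # bytes, no IP options
TCP_HEADER = 20   # bytes, no TCP options


def mss_for_mtu(mtu: int) -> int:
    """IPv4 MSS for an MTU; the same MTU - 40 rule --clamp-mss-to-pmtu applies."""
    return mtu - IPV4_HEADER - TCP_HEADER


def mss_after_ipsec(link_mtu: int, ipsec_overhead: int = 40) -> int:
    """Largest MSS whose full-sized packets still fit the link after IPsec."""
    return mss_for_mtu(link_mtu - ipsec_overhead)


print(mss_for_mtu(1500))      # 1460
print(mss_after_ipsec(1500))  # 1420
```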