moby: [Swarm-mode] Encrypted overlay traffic fails to transit NAT (AWS/GCE) on Debian kernel > 4.4
Description
In swarm-mode, we’ve found that we’re unable to send any data traffic through an encrypted overlay to an endpoint behind AWS/GCE 1-1 NAT on Debian 9, Ubuntu 18.04, or anything with a kernel > 4.4. Running without 1-1 NAT (public IPs mounted directly on the VM) works fine with all kernels.
All ports are open, including protocol 50, and we can see the ESP traffic on the receiving node. However, we’re only seeing ESP packets being exchanged in one direction (inbound to the node) with no return.
Similar results are described in https://github.com/moby/moby/issues/30727 and quite possibly https://github.com/moby/moby/issues/33133
The only modern Debian-family variant that works is Ubuntu 16.04.x with the 4.4 kernel. If both sides are 16.04.x, the encrypted traffic is able to transit through the NAT and arrive at the container.
If only one side is 16.04.x and the other runs a kernel > 4.4, it fails. It feels like there was a change to the IPsec handling after 4.4?
Steps to reproduce the issue:
Simple repro: `node-1` on DigitalOcean, `node-2` on GCE/AWS.
- Spin up a Debian Stretch or Ubuntu Bionic Beaver host (4.9 / 4.15 kernels) on both DigitalOcean and GCE or AWS with the latest stable or edge Docker (18.03.1 or 18.05.0)
- Initialize the swarm on DigitalOcean `node-1` using `--advertise-addr external_ip`
- By default, DigitalOcean has no firewall, so `node-1` is wide open. Open TCP 2377, TCP/UDP 7946, UDP 4789, and protocol 50 (ESP) to `node-2` in the AWS Security Group/GCP VPC Firewall Rules and join the swarm as a worker using `--advertise-addr node-2-external-ip`
- `docker network create --attachable --driver overlay --opt encrypted encryption_test`
- `docker run -ti --rm --network encryption_test debian bash`
If you do this with a non-encrypted overlay, traffic flows with no issues. We’re able to ping between containers, run iperf3; all is well. But on the encrypted overlay, the traffic simply won’t transmit.
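A minimal way to exercise the overlay from both sides might look like this (the container name `probe` and the exact invocations are illustrative, not from the original report):

```shell
# On node-1: start a long-running container on the encrypted overlay
docker run -d --rm --name probe --network encryption_test debian sleep infinity
# Look up its overlay IP address
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' probe
# On node-2: ping that IP from a second container on the same network
docker run -ti --rm --network encryption_test debian ping -c 3 <probe-overlay-ip>
```

With `--opt encrypted` on the network, the pings never get a reply; without it, they do.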
Describe the results you received:
`ip xfrm state` gives:
src 172.31.38.243 dst 207.106.235.21
proto esp spi 0x3e7a19d6 reqid 13681891 mode transport
replay-window 0
aead rfc4106(gcm(aes)) 0xdcb885a138afc1d801f86a6b379dd22e3e7a19d6 64
anti-replay context: seq 0x0, oseq 0xa, bitmap 0x00000000
sel src 0.0.0.0/0 dst 0.0.0.0/0
src 207.106.235.21 dst 172.31.38.243
proto esp spi 0x617bc0ba reqid 13681891 mode transport
replay-window 0
aead rfc4106(gcm(aes)) 0xdcb885a138afc1d801f86a6b379dd22e617bc0ba 64
anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
sel src 0.0.0.0/0 dst 0.0.0.0/0
We can see ESP traffic on the remote host (not the one doing the pinging) but there is no return:
18:31:24.652640 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x1), length 140
18:31:25.653615 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x2), length 140
18:31:26.654775 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x3), length 140
18:31:27.655940 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x4), length 140
18:31:28.657156 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x5), length 140
No data traffic is able to flow over the encrypted overlay.
Describe the results you expected:
Data would reach both endpoints, and we’d expect the ESP packets to be bidirectional, like so:
18:31:08.427964 IP hostname.mydomain.com > docker-18.03.1-minimal-c-2-4gib-sfo2-01: ESP(spi=0x1eccb96e,seq=0x11), length 140
18:31:09.428954 IP docker-18.03.1-minimal-c-2-4gib-sfo2-01 > hostname.mydomain.com: ESP(spi=0xe0184462,seq=0x12), length 140
18:31:09.429103 IP hostname.mydomain.com > docker-18.03.1-minimal-c-2-4gib-sfo2-01: ESP(spi=0x1eccb96e,seq=0x12), length 140
18:31:10.429918 IP docker-18.03.1-minimal-c-2-4gib-sfo2-01 > hostname.mydomain.com: ESP(spi=0xe0184462,seq=0x13), length 140
Additional information you deem important (e.g. issue happens only occasionally):
It seems like this issue with encrypted overlays would be extremely common and yet we don’t see much talk of it online.
Output of `docker version`:
Client:
Version: 18.05.0-ce
API version: 1.37
Go version: go1.9.5
Git commit: f150324
Built: Wed May 9 22:16:13 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm
Server:
Engine:
Version: 18.05.0-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.5
Git commit: f150324
Built: Wed May 9 22:14:23 2018
OS/Arch: linux/amd64
Experimental: false
Output of `docker info`:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 1
Server Version: 18.05.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: kwufy0z71h25c1g95jlbdhmtu
Is Manager: true
ClusterID: n35zkjqsag1kkzpb793u4s646
Managers: 1
Nodes: 3
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 207.106.235.21
Manager Addresses:
207.106.235.21:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.15.0-20-generic
Operating System: Ubuntu 18.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.947GiB
Name: ubuntu-s-2vcpu-2gb-sfo2-01
ID: PUMQ:FAA7:LEAF:5NXM:W7CU:TXAJ:6LJA:GXU6:4VTQ:SX4D:EHYQ:MCLD
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Additional environment details (AWS, VirtualBox, physical, etc.):
Tested between GCE, AWS, DigitalOcean, and our own private cloud.
About this issue
- Original URL
- State: open
- Created 6 years ago
- Reactions: 7
- Comments: 18 (4 by maintainers)
Commits related to this issue
- Remove encryption of LIGO overlay network The encryption of the LIGO (and other) overlay network has a bug that is causing issues for one service to reach another over the overlay network. See https:... — committed to cilogon/comanage-registry-docker by skoranda 4 years ago
TL;DR: UDP checksum is wrong because ESP packets go through NAT. Configuring the VXLAN interfaces to not set the UDP checksum solves the issue.
My previous post on this issue contains a messy workaround, while this post contains a real solution. It consists of configuring the VXLAN interfaces on the different nodes to not calculate the UDP checksum when transmitting packets. According to the ip-link man page, such configuration is accomplished with the `noudpcsum` parameter. There are two minor obstacles to carrying out this configuration: first, the VXLAN interfaces exist in a namespace automatically created by Docker, which we need to find, and second, the `noudpcsum` parameter cannot be changed, so it is necessary to delete and recreate the interface.
1. Find the correct namespace
We will start by finding the namespace that contains the VXLAN interface constituting the overlay network. For this we inspect the ingress network like so:
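The inspection command was not preserved in this copy of the post; it was presumably something like:

```shell
# Inspect the ingress network and note its "Id" field
docker network inspect ingress
# Or extract just the Id directly
docker network inspect -f '{{.Id}}' ingress
```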
and note the “Id” parameter. Now we go to `/var/run/docker/netns` and look for the namespace that matches the “Id” of the ingress. On my nodes the ingress network has an “Id” starting with “de0d3f6935”. There is a corresponding namespace with filename `/var/run/docker/netns/1-de0d3f6935`, which we take note of.
2. Make the namespace accessible to ip netns
If you run `ip netns list` you will find that the namespaces created by Docker do not appear in the output. This is because they are not in the path where `ip netns` expects the namespaces, which is `/var/run/netns`. To solve this we create the `/var/run/netns` directory if necessary and create a soft link in that directory to the namespace found in step 1.
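A sketch of those two operations, assuming the namespace file found in step 1 is `1-de0d3f6935`:

```shell
# Create the directory ip netns expects, if it does not exist yet
mkdir -p /var/run/netns
# Link Docker's namespace file into it so ip netns can see it
ln -s /var/run/docker/netns/1-de0d3f6935 /var/run/netns/1-de0d3f6935
```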
Now `ip netns list` will show that namespace and, more importantly, we will be able to run commands inside that namespace with `ip netns exec 1-de0d3f6935`.
3. Gather information about the VXLAN interface
It seems that it is not possible to set the `noudpcsum` parameter on an already created interface, so we must remove the VXLAN interface and recreate it. First we need to gather some information shown by the command:
Specifically we need the following:
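The gathering command itself is missing from this copy of the post; it was presumably along these lines (the interface name `vxlan0` is an assumption, as is the namespace from step 1):

```shell
# Show detailed attributes of the VXLAN interface inside Docker's namespace:
# VNI (id), local port, proxy/l2miss/l3miss flags, and the bridge it is enslaved to
ip netns exec 1-de0d3f6935 ip -d link show vxlan0
```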
4. Remove the VXLAN interface
Now that we have the necessary information we delete the interface with:
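Assuming the interface is named `vxlan0` in the namespace from step 1, the deletion would be:

```shell
# Delete the original VXLAN interface inside Docker's namespace
ip netns exec 1-de0d3f6935 ip link del vxlan0
```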
5. Recreate the VXLAN interface
An important point here is that the interface must be created in the default namespace because the socket on which the driver listens must be in that namespace. We will move the interface to the correct namespace on the next step.
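A sketch of the recreation in the default namespace; the `<VNI>` placeholder and the interface name are assumptions to be replaced with the values gathered in step 3:

```shell
# Recreate the interface with the UDP checksum disabled (noudpcsum),
# using the VNI gathered in step 3 and the explicit VXLAN port 4789
ip link add vxlan0 type vxlan id <VNI> port 4789 noudpcsum proxy l2miss l3miss
```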
Note that we used the values that we gathered in step 3, and also that here we disable the UDP checksum. We must specify port 4789, even though it is the default IANA port for VXLAN, because Linux has a different default. Also note that we need the `proxy`, `l2miss`, and `l3miss` arguments because that is how the original interface was configured.
6. Move the VXLAN interface to the correct namespace
In this step we put the interface where it belongs. The socket listening on port 4789 will remain in the default namespace.
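Assuming the interface is `vxlan0` and the namespace is `1-de0d3f6935`, the move would be:

```shell
# Move the new interface into Docker's namespace;
# the UDP socket listening on 4789 stays in the default namespace
ip link set vxlan0 netns 1-de0d3f6935
```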
7. Enslave the VXLAN interface to the bridge
The original interface was enslaved to a bridge so we do that with our new interface as well.
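Assuming the bridge inside the namespace is named `br0` (check the output gathered in step 3), this would be:

```shell
# Attach the recreated VXLAN interface to the overlay bridge inside the namespace
ip netns exec 1-de0d3f6935 ip link set vxlan0 master br0
```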
8. Bring up the interface
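Assuming the interface is `vxlan0` in namespace `1-de0d3f6935`:

```shell
# Bring the recreated interface up inside the namespace
ip netns exec 1-de0d3f6935 ip link set vxlan0 up
```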
Once we do this in all the nodes everything will be ready but I found that I needed to restart the services which use the ingress network for things to work.
When creating an overlay network with encryption, Docker should disable the UDP checksum on the VXLAN interfaces, as the integrity of the data is already guaranteed by ESP.
TL;DR: UDP checksum is wrong because ESP packets go through NAT. Recalculating the checksum solves the problem.
The overlay network is based on VXLAN, which works on top of UDP. When UDP datagrams traverse a NAT, the source or destination IP address is changed, which means that the device providing the NAT must recalculate the UDP checksum (the checksum covers a pseudo-header that includes both addresses). When the overlay network is configured to use encryption, the VXLAN packets are protected with ESP in transport mode. In this case the NAT device can change the source or destination IP address but cannot update the UDP checksum. When these packets are decrypted they will have the original (i.e. before NAT) UDP checksum, which will be recognized as a bad checksum by the destination. The UDP checksum must be recalculated at the destination before the packets reach the VXLAN socket, otherwise they will be dropped. Another option, at least in theory, would be to somehow configure the VXLAN driver to not use the UDP checksum, that is, to always send the VXLAN packets with a UDP checksum of zero.
I was unable to correct the UDP checksum using iptables but I succeeded by using the tc-csum action. As far as I can tell, the tc actions must be used in conjunction with an interface. This means that I had to route the VXLAN packets through an interface onto which the tc-csum action was applied. I used a configuration similar to the following:
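The author's exact commands are not preserved here; a sketch with the tc-csum action might look like the following. The interface choice and the plain port match are assumptions: as noted, the packets may need to be routed through a dedicated interface for the action to see the decrypted VXLAN datagrams, and matching the VNI (as Docker's own iptables rules do) would be more precise.

```shell
# Recalculate the UDP checksum of incoming VXLAN datagrams (UDP/4789)
# before they reach the VXLAN socket, using the tc csum action
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip u32 \
    match ip protocol 17 0xff \
    match ip dport 4789 0xffff \
    action csum udp
```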
If you take a look at the iptables rules that Docker configures for the VXLAN traffic, you’ll see that they match the VNI (by using the u32 match option). It would probably be a good idea to do this too, but I left that out for brevity. Note that even if you have NAT on only one node, you will need to correct the checksum on all the nodes, because the NAT causes the checksum to be wrong in both directions.
@bijeebuss It cannot. IPsec NAT-T is still not supported.