moby: [Swarm-mode] Encrypted overlay traffic fails to transit NAT (AWS/GCE) on Debian kernel > 4.4

Description

In swarm-mode, we’ve found that we’re unable to send any data traffic through an encrypted overlay to an endpoint behind AWS/GCE 1-1 NAT on Debian 9, Ubuntu 18.04, or anything with a kernel > 4.4. Running without 1-1 NAT (public IPs mounted directly on the VM) works fine with all kernels.

All ports are open, including protocol 50, and we can see the ESP traffic on the receiving node. However, we’re only seeing ESP packets being exchanged in one direction (inbound to the node) with no return.
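For reference, the one-way ESP flow can be observed with a plain capture on each node (eth0 is a placeholder for the instance's external interface):

# IP protocol 50 is ESP
tcpdump -ni eth0 'ip proto 50'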

Similar results are described in https://github.com/moby/moby/issues/30727 and quite possibly https://github.com/moby/moby/issues/33133

The only modern Debian variant that works is Ubuntu 16.04.x with the 4.4 kernel. If both sides are 16.04.x, the encrypted traffic is able to transit though the NAT and arrive at the container.

If only one side is 16.04.x and the other has a kernel > 4.4, it fails. It feels like there was a change to the IPsec handling after 4.4?

Steps to reproduce the issue:

Simple repro: node-1 on DigitalOcean, node-2 on GCE/AWS.

  1. Spin up a Debian Stretch or Ubuntu Bionic Beaver host (4.9 / 4.15 kernels) on both DigitalOcean and GCE or AWS with the latest stable or edge (18.03.1 or 18.05.0)

  2. Initialize the swarm on DigitalOcean node-1 using --advertise-addr external_ip

  3. By default, DigitalOcean has no firewall so node-1 is wide open. Open TCP 2377, TCP/UDP 7946, UDP 4789, Protocol 50 on node-2 in the AWS Security Group/GCP VPC Firewall Rules and join the swarm as a worker using --advertise-addr node-2-external-ip

  4. docker network create --attachable --driver overlay --opt encrypted encryption_test

  5. docker run -ti --rm --network encryption_test debian bash

If you do this with a non-encrypted overlay, traffic flows with no issues. We’re able to ping between containers, run iperf3; all is well. But on the encrypted overlay, the traffic simply won’t transmit.
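For completeness, the connectivity checks between containers were along these lines (10.0.1.5 is a placeholder for the peer container's overlay address, with iperf3 -s running on the peer):

# run from inside the test container on node-1
ping -c 4 10.0.1.5
iperf3 -c 10.0.1.5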

Describe the results you received:

ip xfrm state gives:

src 172.31.38.243 dst 207.106.235.21
	proto esp spi 0x3e7a19d6 reqid 13681891 mode transport
	replay-window 0
	aead rfc4106(gcm(aes)) 0xdcb885a138afc1d801f86a6b379dd22e3e7a19d6 64
	anti-replay context: seq 0x0, oseq 0xa, bitmap 0x00000000
	sel src 0.0.0.0/0 dst 0.0.0.0/0
src 207.106.235.21 dst 172.31.38.243
	proto esp spi 0x617bc0ba reqid 13681891 mode transport
	replay-window 0
	aead rfc4106(gcm(aes)) 0xdcb885a138afc1d801f86a6b379dd22e617bc0ba 64
	anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
	sel src 0.0.0.0/0 dst 0.0.0.0/0

We can see ESP traffic on the remote host (not the one doing the pinging) but there is no return:

18:31:24.652640 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x1), length 140
18:31:25.653615 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x2), length 140
18:31:26.654775 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x3), length 140
18:31:27.655940 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x4), length 140
18:31:28.657156 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x5), length 140

No data traffic is able to flow over the encrypted overlay.

Describe the results you expected:

Data would reach both endpoints, and we’d expect the ESP packets to be bidirectional like so:

18:31:08.427964 IP hostname.mydomain.com > docker-18.03.1-minimal-c-2-4gib-sfo2-01: ESP(spi=0x1eccb96e,seq=0x11), length 140
18:31:09.428954 IP docker-18.03.1-minimal-c-2-4gib-sfo2-01 > hostname.mydomain.com: ESP(spi=0xe0184462,seq=0x12), length 140
18:31:09.429103 IP hostname.mydomain.com > docker-18.03.1-minimal-c-2-4gib-sfo2-01: ESP(spi=0x1eccb96e,seq=0x12), length 140
18:31:10.429918 IP docker-18.03.1-minimal-c-2-4gib-sfo2-01 > hostname.mydomain.com: ESP(spi=0xe0184462,seq=0x13), length 140

Additional information you deem important (e.g. issue happens only occasionally):

It seems like this issue with encrypted overlays would be extremely common and yet we don’t see much talk of it online.

Output of docker version:

Client:
 Version:      18.05.0-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   f150324
 Built:        Wed May  9 22:16:13 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.05.0-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   f150324
  Built:        Wed May  9 22:14:23 2018
  OS/Arch:      linux/amd64
  Experimental: false

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 18.05.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: kwufy0z71h25c1g95jlbdhmtu
 Is Manager: true
 ClusterID: n35zkjqsag1kkzpb793u4s646
 Managers: 1
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 207.106.235.21
 Manager Addresses:
  207.106.235.21:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-20-generic
Operating System: Ubuntu 18.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.947GiB
Name: ubuntu-s-2vcpu-2gb-sfo2-01
ID: PUMQ:FAA7:LEAF:5NXM:W7CU:TXAJ:6LJA:GXU6:4VTQ:SX4D:EHYQ:MCLD
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

Tested between GCE, AWS, DigitalOcean, and our own private cloud.

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 7
  • Comments: 18 (4 by maintainers)

Most upvoted comments

TL;DR: UDP checksum is wrong because ESP packets go through NAT. Configuring the VXLAN interfaces to not set the UDP checksum solves the issue.

My previous post on this issue contains a messy workaround, while this post contains a real solution: configure the VXLAN interfaces on the different nodes not to calculate the UDP checksum when transmitting packets. According to the ip-link man page, this is accomplished with the noudpcsum parameter. There are two minor obstacles to carrying out this configuration: first, the VXLAN interfaces live in a namespace automatically created by Docker, which we need to find, and second, the noudpcsum parameter cannot be changed on an existing interface, so it is necessary to delete and recreate the interface.

1. Find the correct namespace

We will start by finding the namespace which contains the VXLAN interface that constitutes the overlay network. For this we inspect the ingress network like so:

docker network inspect ingress

and note the “Id” parameter. Now we go to /var/run/docker/netns and look for the namespace that matches the “Id” of the ingress. On my nodes the ingress network has an “Id” starting with “de0d3f6935”. There is a corresponding namespace with filename /var/run/docker/netns/1-de0d3f6935, which we take note of.
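If it helps, the lookup can be scripted roughly like this (the "1-" prefix and the 10-character truncation are simply what I observe on my nodes, so treat them as assumptions):

NET_ID=$(docker network inspect --format '{{.Id}}' ingress)
ls /var/run/docker/netns/ | grep "$(echo $NET_ID | cut -c1-10)"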

2. Make the namespace accessible to ip netns

If you run ip netns list you will find that the namespaces created by Docker do not appear in the output. This is because they are not in the path where ip netns expects the namespaces, which is /var/run/netns. To solve this we create the /var/run/netns directory if necessary and create a soft link in that directory to the namespace found in step 1.

mkdir -p /var/run/netns
ln -s /var/run/docker/netns/1-de0d3f6935 /var/run/netns/

Now ip netns list will show that namespace and, more importantly, we will be able to run commands inside that namespace with ip netns exec 1-de0d3f6935.

3. Gather information about the VXLAN interface

It seems that it is not possible to set the noudpcsum parameter on an already created interface, so we must remove the VXLAN interface and recreate it. First we need to gather some information from the output of:

ip netns exec 1-de0d3f6935 ip -d link show

Specifically we need the following:

  1. The name of the interface [vxlan0]
  2. The MTU [1424]
  3. The VNI [4109]
  4. The bridge to which the interface is enslaved [br0]
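
To make those four values easier to pick out, the listing can be restricted to VXLAN devices (iproute2 accepts a type filter on ip link show):

ip netns exec 1-de0d3f6935 ip -d link show type vxlan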

4. Remove the VXLAN interface

Now that we have the necessary information we delete the interface with:

ip netns exec 1-de0d3f6935 ip link delete vxlan0

5. Recreate the VXLAN interface

An important point here is that the interface must be created in the default namespace, because the socket on which the driver listens must be in that namespace. We will move the interface to the correct namespace in the next step.

ip link add vxlan0 mtu 1424 type vxlan vni 4109 dstport 4789 noudpcsum proxy l2miss l3miss

Note that we used the values gathered in step 3, and that this is where we disable the UDP checksum. We must specify port 4789 explicitly because, even though it is the IANA-assigned port for VXLAN, Linux uses a different default. Also note that we need the proxy, l2miss, and l3miss arguments because that is how the original interface was configured.
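
As a quick sanity check (assuming your iproute2 shows the checksum flags in detailed output), the new interface should now list noudpcsum rather than udpcsum:

ip -d link show vxlan0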

6. Move the VXLAN interface to the correct namespace

In this step we put the interface where it belongs. The socket listening on port 4789 will remain in the default namespace.

ip link set dev vxlan0 netns 1-de0d3f6935

7. Enslave the VXLAN interface to the bridge

The original interface was enslaved to a bridge so we do that with our new interface as well.

ip netns exec 1-de0d3f6935 ip link set dev vxlan0 master br0

8. Bring up the interface

ip netns exec 1-de0d3f6935 ip link set dev vxlan0 up

Once we do this on all the nodes everything will be ready, but I found that I needed to restart the services which use the ingress network for things to work.
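
A forced update is one way to bounce a service without otherwise changing it (my_service is a placeholder):

docker service update --force my_service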

When creating an overlay network with encryption, Docker should disable the UDP checksum on the VXLAN interfaces, as the integrity of the data is already guaranteed by ESP.

TL;DR: UDP checksum is wrong because ESP packets go through NAT. Recalculating the checksum solves the problem.

The overlay network is based on VXLAN, which works on top of UDP. When UDP datagrams traverse a NAT the source or destination IP address is changed, which means that the device providing the NAT must recalculate the UDP checksum. When the overlay network is configured to use encryption, the VXLAN packets are protected with ESP in transport mode. In this case the NAT device can change the source or destination IP address but cannot update the UDP checksum. When these packets are decrypted they will have the original (i.e. pre-NAT) UDP checksum, which will be recognized as a bad checksum by the destination. The UDP checksum must be recalculated at the destination before the packets reach the VXLAN socket, otherwise they will be dropped. Another option, at least in theory, would be to somehow configure the VXLAN driver not to use the UDP checksum, that is, to always send the VXLAN packets with a UDP checksum of zero.
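
As a concrete check on the receiving node (eth0 is a placeholder for the external interface), and assuming the decrypted VXLAN datagrams are visible there, tcpdump can flag the bad checksums directly:

# -vv makes tcpdump verify UDP checksums and print "bad udp cksum ..."
tcpdump -ni eth0 -vv udp port 4789
# the UDP InCsumErrors counter (newer kernels) is another signal
grep '^Udp:' /proc/net/snmp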

I was unable to correct the UDP checksum using iptables but I succeeded by using the tc-csum action. As far as I can tell, the tc actions must be used in conjunction with an interface. This means that I had to route the VXLAN packets through an interface onto which the tc-csum action was applied. I used a configuration similar to the following:

# Create a network namespace.
ip netns add csum

# Create two pairs of veth interfaces each with one member on 
# the default namespace and the other on the csum namespace.
ip link add csum_in type veth peer name in netns csum
ip link add csum_out type veth peer name out netns csum

# Assign IP addresses to the two pairs of veth interfaces
# [csum_in, 192.168.117.1] <-> [in, 192.168.117.2]
# [csum_out, 192.168.118.2] <-> [out, 192.168.118.1]
# Remember that csum_in and csum_out belong to the default
# namespace while in and out belong to the csum namespace.
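# (Filling in the assignment explicitly; the /30 prefixes are an
# assumption, any prefixes that keep each pair on its own subnet work.)
ip addr add 192.168.117.1/30 dev csum_in
ip addr add 192.168.118.2/30 dev csum_out
ip link set csum_in up
ip link set csum_out up
ip netns exec csum ip addr add 192.168.117.2/30 dev in
ip netns exec csum ip addr add 192.168.118.1/30 dev out
ip netns exec csum ip link set in up
ip netns exec csum ip link set out up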

# Make it so that packets with bad checksums are considered
# for connection tracking.  This is necessary for the NAT rule
# that we'll configure next.
sysctl net.netfilter.nf_conntrack_checksum=0

# We want to route the VXLAN packets to a specific interface
# but they are addressed to this host, so we must change the
# destination address for now.  After we correct the checksum
# we'll restore the original destination address.  For this example
# lets say that the original IP address is 172.21.0.1 and the 
# temporary address is 172.31.0.1.
iptables -t nat -I PREROUTING 1 -d 172.21.0.1 -p udp --dport 4789 -j DNAT --to-destination 172.31.0.1
iptables -t nat -I PREROUTING 1 -d 172.31.0.1 -p udp --dport 4789 -j DNAT --to-destination 172.21.0.1

# We need to route the packets with the temporary address 
# so that they trigger the tc-csum action.
ip route add 172.31.0.1 via 192.168.117.2

# We need to be able to route the packets from the csum 
# namespace back to the default namespace.
ip netns exec csum ip route add 172.31.0.1 via 192.168.118.2

# If the FORWARD chain on the filter table has a DROP
# policy don't forget to ACCEPT the traffic on its way to
# the csum namespace.
iptables -t filter -A FORWARD -d 172.31.0.1 -p udp --dport 4789 -j ACCEPT

# Docker configures a rule in the INPUT chain of the filter
# table that DROPs VXLAN traffic that does not come via
# IPsec.  Either delete that rule or ACCEPT the traffic
# before it is dropped.
iptables -t filter -I INPUT <number> -d 172.21.0.1 -p udp --dport 4789 -j ACCEPT

# Be aware that some distributions have rp_filter=1.
# sysctl net.ipv4.conf.all.rp_filter=0
# sysctl net.ipv4.conf.csum_out.rp_filter=0
# ip netns exec csum sysctl net.ipv4.conf.all.rp_filter=0
# ip netns exec csum sysctl net.ipv4.conf.in.rp_filter=0

# This will recalculate the checksum of the traffic
# ingressing the csum namespace by the "in" interface.
ip netns exec csum tc qdisc add dev in ingress handle ffff:
ip netns exec csum tc filter add dev in prio 1 protocol ip parent ffff: u32 match ip src 172.31.0.1/32 flowid :1 action csum ip and udp

If you take a look at the iptables rules that Docker configures for the VXLAN traffic you’ll see that they match the VNI (by using the u32 match option). It would probably be a good idea to do this as well, but I left it out for brevity. Note that even if you have NAT only on one node you will need to correct the checksum on all the nodes, because the NAT makes the checksum wrong in both directions.

@bijeebuss It cannot. IPsec NAT-T is still not supported.