Flatcar: Upgrade to systemd 243+ breaks pod networking with AWS CNI due to veth MAC Address getting overwritten

Description

Our Flatcar image was auto-updated from 2512.4.0 to 2605.5.0, which broke the ability of the node to talk to pods running on it.

Impact

Pods on worker nodes are not able to communicate with the API server pods on the master nodes.

Environment and steps to reproduce

  1. Set-up:
  • Kubernetes Client Version: version.Info Major:"1", Minor:"19", GitVersion:"v1.19.1"
  • Kubernetes Server Version: version.Info Major:"1", Minor:"16", GitVersion:"v1.16.13"
  • Running on AWS instances using Flatcar 2605.5.0 (also tested with 2605.7.0)
  • Cilium v1.7.5 (also tested with Cilium v1.8.5)
  • AWS VPC CNI (v1.6.3)
  2. Task: Reach a pod running on the node

  3. Action(s): a. Upgrade from Flatcar 2512.4.0 to 2605.5.0

  4. Error: The node cannot reach the pod running on it.

Node (ip-10-64-52-104.eu-west-1.compute.internal) to pod (10.64.36.243) on Master-newFC (ip-10-64-32-253.eu-west-1.compute.internal):

tracepath 10.64.36.243
 1?: [LOCALHOST]                                         pmtu 9001
 1:  ip-10-64-32-253.eu-west-1.compute.internal            0.503ms 
 1:  ip-10-64-32-253.eu-west-1.compute.internal            0.464ms 
 2:  no reply
 3:  no reply
 4:  no reply
 5:  no reply
 6:  no reply
...
30:  no reply
     Too many hops: pmtu 9001
     Resume: pmtu 9001 

Expected behavior

Node (ip-10-64-52-104.eu-west-1.compute.internal) to pod (10.64.33.129) on Master-oldFC (ip-10-64-34-191.eu-west-1.compute.internal):

tracepath 10.64.33.129
 1?: [LOCALHOST]                                         pmtu 9001
 1:  ip-10-64-34-191.eu-west-1.compute.internal            0.538ms 
 1:  ip-10-64-34-191.eu-west-1.compute.internal            0.460ms 
 2:  ip-10-64-33-129.eu-west-1.compute.internal            0.475ms reached
     Resume: pmtu 9001 hops 2 back 2 

Additional information

cilium-monitor output when running tracepath from the node to a pod running on it:

level=info msg="Initializing dissection cache..." subsys=monitor
-> endpoint 1077 flow 0xd4db6b68 identity 1->66927 state new ifindex 0 orig-ip 10.64.32.253: 10.64.32.253:36282 -> 10.64.39.43:44444 udp
-> stack flow 0xa466c6d3 identity 66927->1 state related ifindex 0 orig-ip 0.0.0.0: 10.64.39.43 -> 10.64.32.253 DestinationUnreachable(Port)

tcpdump on the node while trying to reach a pod running on it:

15:18:00.676152 IP ip-10-64-32-253.eu-west-1.compute.internal.58914 > ip-10-64-52-104.eu-west-1.compute.internal.4240: Flags [.], ack 548860955, win 491, options [nop,nop,TS val 3987550058 ecr 3030925508], length 0
15:18:00.676520 IP ip-10-64-52-104.eu-west-1.compute.internal.4240 > ip-10-64-32-253.eu-west-1.compute.internal.58914: Flags [.], ack 1, win 489, options [nop,nop,TS val 3030955756 ecr 3987534941], length 0
15:18:00.919448 IP ip-10-64-52-104.eu-west-1.compute.internal.4240 > ip-10-64-32-253.eu-west-1.compute.internal.58914: Flags [.], ack 1, win 489, options [nop,nop,TS val 3030955999 ecr 3987534941], length 0
15:18:00.919497 IP ip-10-64-32-253.eu-west-1.compute.internal.58914 > ip-10-64-52-104.eu-west-1.compute.internal.4240: Flags [.], ack 1, win 491, options [nop,nop,TS val 3987550301 ecr 3030955756], length 0
15:18:01.465589 IP ip-10-64-52-104.eu-west-1.compute.internal.34294 > ip-10-64-36-243.eu-west-1.compute.internal.44448: UDP, length 8973
15:18:01.465630 IP ip-10-64-52-104.eu-west-1.compute.internal.34294 > ip-10-64-36-243.eu-west-1.compute.internal.44448: UDP, length 8973
15:18:01.465647 IP ip-10-64-36-243.eu-west-1.compute.internal > ip-10-64-52-104.eu-west-1.compute.internal: ICMP ip-10-64-36-243.eu-west-1.compute.internal udp port 44448 unreachable, length 556

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 23 (12 by maintainers)

Most upvoted comments

I’ve applied the fix to all Flatcar branches. I’ve verified that the test case provided by @dvulpe passes with this fix applied.

This fix will be included in the next set of Flatcar releases.

I’ve tested this and verified that the MACAddressPolicy is indeed the likely culprit. On a broken node, I added:

core@ip-192-168-4-227 ~ $ cat /etc/systemd/network/50-veth.link 
[Match]
Driver=veth

[Link]
MACAddressPolicy=none

After rebooting, new virtual interfaces work correctly. We dealt with a similar issue for flannel here: https://github.com/kinvolk/coreos-overlay/pull/282. In my tests, matching by name (eni*) instead of by driver also solved the problem.
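
For anyone who wants to confirm the override is actually being picked up, here is a rough check (the interface name is just an example taken from the ip monitor output below; udev records which .link file it applied to each interface):

# Ask udev which .link file was applied to the host-side veth.
# With the override in place this should report 50-veth.link
# rather than systemd's default 99-default.link.
udevadm info /sys/class/net/eni4bd1086a8e2 | grep ID_NET_LINK_FILE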

As in the linked flannel issue, the problem is visible when running ip monitor all and catching the interface being created:

[LINK]4: eni4bd1086a8e2@if3: <BROADCAST,MULTICAST> mtu 9001 qdisc noop state DOWN group default 
    link/ether fe:f3:b1:0e:64:f5 brd ff:ff:ff:ff:ff:ff link-netnsid 0
[LINK]4: eni4bd1086a8e2@if3: <BROADCAST,MULTICAST> mtu 9001 qdisc noop state DOWN group default 
    link/ether 0e:33:4a:64:18:29 brd ff:ff:ff:ff:ff:ff link-netnsid 0
[LINK]4: eni4bd1086a8e2@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default 
    link/ether 0e:33:4a:64:18:29 brd ff:ff:ff:ff:ff:ff link-netnsid 0
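
For context, this rewrite comes from the catch-all default link policy shipped with the newer systemd in the 2605 series; on an affected node it can be inspected like this (a quick check, assuming the stock Flatcar path used in the listing further down):

# The shipped default matches all interfaces; the relevant line is
# MACAddressPolicy=persistent, which makes udev overwrite the MAC that
# the CNI plugin assigned to the host-side veth.
cat /usr/lib64/systemd/network/99-default.link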

We have a few files that instruct systemd not to manage links for various CNIs. We do this by matching on interface names in the .network files, so we need to keep listing more names with each new CNI. Also, at least in the case of flannel, that was not enough: to stop the MAC address from being managed we had to add an explicit .link file with the MACAddressPolicy=none setting.

These are the .network files that currently mark links as unmanaged:

core@ip-192-168-4-227 /usr/lib64/systemd/network $ grep -ir -b1 unmanaged .
./ipsec-vti.network-69-[Link]
./ipsec-vti.network:76:Unmanaged=yes
--
./yy-azure-sriov-coreos.network-556-[Link]
./yy-azure-sriov-coreos.network:563:Unmanaged=yes
--
./yy-azure-sriov.network-557-[Link]
./yy-azure-sriov.network:564:Unmanaged=yes
--
./calico.network-79-[Link]
./calico.network:86:Unmanaged=yes
--
./50-flannel.network-23-[Link]
./50-flannel.network:30:Unmanaged=yes
--
./cni.network-74-[Link]
./cni.network:81:Unmanaged=yes

I think we should add a file like the one I showed above (matching on the veth driver) to our default setup. Alternatively, if we think that’s too broad, we could also match by name.
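
For the name-based alternative, a minimal sketch (the file name and the eni* pattern are just examples matching the AWS CNI; other CNIs would need their own patterns) could be something like /etc/systemd/network/50-eni.link:

[Match]
OriginalName=eni*

[Link]
MACAddressPolicy=none

Matching on Driver=veth as shown above covers all veth-based CNIs at once, which is the reason for proposing it as the default despite being broader.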

Thanks @marga-kinvolk, I deployed the latest image today and the tests I conducted before are now passing!

Hi! First of all, thanks Dan for the reproduction case. I’ve used it and was able to verify that this does indeed break when switching from 2512 to 2605. BTW, the repro case uses the AWS CNI, not Cilium, so it already confirms what Greg commented.

I spent quite a few hours trying to figure out what exactly the problem is, but I wasn’t able to find the root cause. In the repro case, when using 2605 the coredns pod is unable to send or receive packets. They get dropped, but I couldn’t find what’s causing the drop.

I tried comparing the sysctl values across both versions and overriding some of those that were different, but that had no effect. The generated firewall rules have a few differences, including a comment that says that the Cluster IP is not reachable, but it’s unclear whether the differences are cause or effect of the problem.

I can also confirm that the 2605 series has broken the AWS CNI. Pods with host networking (e.g. kube-proxy or the CNI daemonset itself) can communicate outside the node fine, but pods without host networking fail to connect to anything, as if the packets are being dropped, whether the destination is in-cluster, out-of-cluster, or even on the same node. This includes calls to the kubernetes.default service that exposes the control plane in the cluster.

Interestingly, after a reboot this works, but as the pod gets rescheduled (on the same node) with a different IP it stops working again.
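
If anyone wants to confirm the MAC rewrite on such a node, a rough diagnostic could be the following (the interface name and <pod-pid> are placeholders; as far as I understand, the AWS CNI records the host-side veth MAC in the pod's neighbour table when it sets the pod up, so the two should normally match):

# Current MAC of the host-side veth (possibly rewritten by udev after
# the CNI configured the pod).
ip link show eni4bd1086a8e2

# Neighbour table inside the pod's network namespace; a gateway entry
# still pointing at the old MAC would explain the dropped packets.
nsenter -t <pod-pid> -n ip neigh show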