cilium: Cilium is using the wrong interface for MTU detection

Bug report

We’ve deployed Cilium 1.9.1 via Helm in a bare-metal Kubernetes 1.19.4 cluster running MetalLB as the load balancer. It’s a 4-node cluster (one control-plane node plus 3 workers); each node runs Ubuntu 20.04.1 and has two primary interfaces, one for external connectivity and one to talk to our storage backend. The nodes are provisioned with Ubuntu MAAS, and the interfaces are renamed via netplan for consistency: external and storage respectively.
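For illustration, a netplan layout along these lines produces the two renamed interfaces (the MAC addresses, addressing, and per-interface details below are placeholders, not our exact configuration):

network:
  version: 2
  ethernets:
    external:
      match:
        macaddress: "aa:bb:cc:dd:ee:01"   # placeholder
      set-name: external
      mtu: 1500
      dhcp4: true
    storage:
      match:
        macaddress: "aa:bb:cc:dd:ee:02"   # placeholder
      set-name: storage
      mtu: 9000
      addresses: ["192.0.2.10/24"]        # placeholder storage subnet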

What we’re noticing is that Cilium appears to be defaulting to the storage interface rather than the external interface. The tell is that the storage interface has an MTU of 9000 versus the default 1500 on the external interface; once the cluster is stood up and Cilium and MetalLB are running, any newly created pod has an MTU of 9000 on its primary interface:

~ > kubectl exec -it alpine-deployment-5dbfb45ff5-jpbtn -- sh
/ # ifconfig
eth0      Link encap:Ethernet  HWaddr FE:DA:85:63:64:91
          inet addr:10.0.3.87  Bcast:0.0.0.0  Mask:255.255.255.255
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:13 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1006 (1006.0 B)  TX bytes:0 (0.0 B)

This suggests that Cilium is not respecting the Helm setting devices: "external".

General Information

  • Cilium version (run cilium version)
Client: 1.9.1 975b66772 2020-12-04T18:16:09+01:00 go version go1.15.5 linux/amd64
Daemon: 1.9.1 975b66772 2020-12-04T18:16:09+01:00 go version go1.15.5 linux/amd64
  • Kernel version (run uname -a) Linux k8s01 5.4.0-54-generic #60-Ubuntu SMP Fri Nov 6 10:37:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

  • Orchestration system version in use (e.g. kubectl version, Mesos, …)

ubuntu@k8s01:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-11T13:17:17Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-11T13:09:17Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
  • Link to relevant artifacts (policies, deployments scripts, …)

values.yaml:

containerRuntime:
  integration: "auto"
policyEnforcementMode: "never"
devices: "external"
etcd:
  enabled: true
  ssl: true
  endpoints:
  - "https://172.18.73.31:2379"
hubble:
  enabled: true
  listenAddress: ":4244"
  metrics:
    enabled:
    - dns
    - drop
    - tcp
    - flow
    - port-distribution
    - icmp
    - http
  relay:
    enabled: true
  ui:
    enabled: true
  • Generate and upload a system zip:
curl -sLO https://git.io/cilium-sysdump-latest.zip && python cilium-sysdump-latest.zip

https://novacoast-my.sharepoint.com/:u:/p/sboynton/Ea_SbyQeXKBKrLlb-JKOknEBtPHScZxG54JTEhO229wAlw?e=oCHRbG

How to reproduce the issue

  1. Build cluster nodes with two interfaces, one configured for jumbo frames (e.g. an MTU of 9000)
  2. Standardize interface names using e.g. netplan (see the sketch above)
  3. Provision the nodes as Kubernetes nodes
  4. Install Cilium via Helm with devices: "external"
  5. Check the MTU of any new pods to see whether they inherited it from the external or the storage interface

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 48 (37 by maintainers)

Most upvoted comments

Do you have an example configuration of a node that you’re concerned about here where we may do the “wrong” thing? That could help us to reason through whether the above is a reasonable assumption to make or not.

For example, I may have an interface with a small MTU, say 500. So by default Cilium will pick 500 as the MTU for pods, which I don’t think is the right thing to do by default.

MTU 500 doesn’t make any sense. RFC 791 (IPv4 protocol) states in clear terms that “All hosts must be prepared to accept datagrams of up to 576 octets”. For IPv6, the minimum MTU is 1280.

I was hoping to solicit, with practical examples, any nuance that I may be missing. If we don’t have practical examples (i.e. ip link output with context about how the links should be used), then sure, we can continue the debate on theory alone.

The main point I think you’re making here is that selecting the lowest MTU from the available networks is not guaranteed to optimize throughput. What I’m advocating for is to take care of networking first, then provide a way for administrators to address any remaining performance gaps. One way to look at this is Greg Ferro’s Hierarchy of Networking Need, image inline below:

Hierarchy of Networking Need

With the above in mind, establishing base connectivity with the most reliable initial configuration (even if suboptimal) satisfies the baseline “connectivity” need at the bottom of the hierarchy. On top of this, we should provide the observability and configurability so that more advanced users can optimize performance. Depending on its business importance to the administrator, this might fall under the “Required System” (Performance) or “Dependence” (High Speed) categories further up the hierarchy.

This could look something like providing visibility via cilium status to display the MTU configuration and how Cilium decided which MTU to use. For example,

# cilium status
...
MTU configuration: 1450 (cilium_vxlan)
# cilium status --verbose
...
MTU configuration:
- [x] 1450 (cilium_vxlan)
- [ ] 1500 (eth0, eth1)

Based on this, we could easily identify that Cilium selected 1450 to be safe, but if you know what you’re doing, you could modify the configuration to better optimize throughput. Then, sure, we can debate about the best way to present those configuration options to the administrator.

I believe today most users don’t specify devices at all

➕, and I don’t think we should require users to configure this by default.

and changing default behavior too much is risky.

The default behaviour is already picking the wrong thing in some cases today. The best bet is to fully think through how we can provide connectivity to all available networks and solve that problem properly, rather than implementing a partial solution for the Nth time and kicking the can down the road further.

I think the right behavior is to just use the k8s IP device or the default-route device (based on auto-detection logic) in the default case and not try to be too smart.

Regarding the K8s IP device, the reason I disagree is that it is simply more complicated than scanning all devices and picking an MTU from them. Coordinating with k8s requires understanding which k8s IP is configured (which in turn requires establishing connectivity with the apiserver before allowing any kind of node configuration), which kind of address we should care about (InternalIP / ExternalIP), which interface on the node that IP corresponds to, and then what the MTU of that device is. All the while, the actual MTU is not dictated by Kubernetes; the nodes that join and the actual state of their network devices are the ultimate source of truth.
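Even sketched roughly in shell, that lookup chains together apiserver access, an address-type choice, and a device lookup (the node name and field paths below are illustrative only, not how Cilium would implement it):

# 1. Ask the apiserver for the node's InternalIP.
NODE_IP=$(kubectl get node k8s01 \
  -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
# 2. Find the local device carrying that address and read its MTU.
ip -j addr show | jq -r --arg ip "$NODE_IP" \
  '.[] | select(any(.addr_info[]?; .local == $ip)) | "\(.ifname) mtu \(.mtu)"'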

Regarding the default-route device, my main concern is that if you deploy Cilium on a system with two networks and the default route is on the device with the higher MTU, then Cilium will configure pods in such a way that connectivity via the secondary network is simply broken. Why should Cilium, by default, break connectivity on some paths unless the user is fully aware of the MTU configuration across multiple networks and knows how to specially configure Cilium’s MTU handling? This doesn’t make sense to me: the first priority should be whether connectivity works; once connectivity is working, the next question is how to optimize it and make it more efficient.

There are also degenerate cases for the default-route approach: for example, it is possible to configure multiple default routes, and Linux will pick one arbitrarily, depending in part on the health of the L2 neighbour entry for the route.

Overall, I think that “select the minimum MTU from the available devices” (maximum for MRU) is exactly the “don’t try to be too smart” approach.
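As a rough illustration of what that selection boils down to on a node (purely a sketch using iproute2’s JSON output and jq, not Cilium’s implementation):

# Lowest MTU among devices that are up and not loopback. A real implementation
# would likely also skip dummy/link-local-only devices (e.g. nodelocaldns),
# as discussed in later comments.
ip -j link show up | jq '[.[] | select(.link_type != "loopback") | .mtu] | min'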

I have a very similar issue, and I’m happy to create a new issue if mine turns out to be different enough:

I’m using Cilium through RKE2 (from Rancher), and their switch from Cilium 1.9 to 1.10 broke my cluster.

All larger Kubernetes API requests failed with: "Error: unable to build kubernetes objects from release manifest: Get "https://192.168.0.1:443/openapi/v2?timeout=32s": http2: client connection lost + exit"

It took me ages to track down, but essentially my dedicated worker is connected through a VLAN. The VLAN interface has an MTU of 1400, while the same interface without the VLAN attachment has an MTU of 1500.

The Cilium agent reports that it detects an MTU of 1500. Setting the MTU to 1300 resulted in both interfaces having an MTU of 1300, and Cilium detected it properly after I deleted the Cilium pod. The Kubernetes API timeouts then disappeared.

Cilium’s connectivity test did not catch this, though.

@joestringer it’s an interesting idea. I was thinking we could ignore link-local addresses (which could help with the node-local-dns case). For kube-ipvs0, I think the interface should have the MTU of the VM interface (but that is out of Cilium’s scope, of course).

A quick additional comment here: we have the same problem on a very simple setup. We run https://github.com/kubernetes/dns/tree/master/cmd/node-cache on all our nodes (a local CoreDNS cache, very likely the same thing as @dsexton’s node-local-dns), and it creates a dummy interface (which defaults to an MTU of 1500). The agent regularly ends up inheriting its MTU from this interface:

level=info msg="Inheriting MTU from external network interface" device=nodelocaldns ipAddr=169.254.20.10 mtu=1500 subsys=mtu

We will likely just end up using --mtu to control the MTU used by the agent, but I figured it might be of interest since we are not the only ones using this DNS cache. (We are also making sure the dummy interface has a 9001 MTU so everything is consistent.)
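(For anyone else hitting this: pinning the MTU via Helm would look roughly like the snippet below. Treat the MTU key name as an assumption and check helm show values cilium/cilium for your chart version; the underlying agent flag is --mtu.)

# Hypothetical values.yaml override: pin the MTU explicitly so the agent does
# not inherit it from an unrelated interface. 9001 matches the jumbo-frame
# setup above; adjust for your environment.
MTU: 9001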

@joestringer yes, this MTU is much lower than that of the host’s external network interface. When we restart the Cilium agent, it switches the external interface it selects from the kube-proxy interface to the EC2 ENI interface:

level=info msg="Inheriting MTU from external network interface" device=ens5 ipAddr=10.0.0.1 mtu=9001 subsys=mtu

So now existing nodes have an MTU of 9001 while new nodes use 1500. One of the biggest issues we are seeing is that large DNS responses from CoreDNS to node-local-dns are being dropped, requiring us to manually restart the agents. We are currently rolling out a change to manually set the MTU to 9001.

The fix looks OK, but it doesn’t address one case. When running with kube-proxy replacement or the host firewall, users are free to select which netdevs to run BPF programs on (via --devices; that was done in the description of this issue). If they have enabled either of those features but haven’t selected netdevs, then cilium-agent performs device auto-detection (https://github.com/cilium/cilium/blob/master/daemon/cmd/kube_proxy_replacement.go#L374).

The selected or detected devices give us a good idea of which netdevs Cilium will use, so we can always choose the device with the lowest MTU among them.

@brb fixing --devices is not what we want. If you think this issue is about that, we can file a new issue to track the change to use the k8s node IP to pick the MTU. We can’t expect every node to share the same interface name.

Thanks @pchaigno and @brb. We verified this morning that traffic is indeed going over the desired interface and that only the MTU is being derived from the wrong one. After setting the MTU manually, everything looks good.