kubernetes: DualStack: IP family of status.podIP can be wrong depending on CNI-plugin

What happened:

The status.podIP is taken from status.podIPs[0] without regard for the main IP family of the cluster. No warning or error is given if status.podIP gets the wrong family.

What you expected to happen:

That status.podIP is taken from the “main” IP family, or at the very least that the pod is not started and an error is given if it’s not.

How to reproduce it (as minimally and precisely as possible):

Use the bridge CNI-plugin and host-local ipam. Use a single-node cluster to avoid host-local assigning the same address to pods on different nodes. This may be CRI-plugin dependent; I use cri-o/1.18.3.

$ cat /etc/cni/net.d/10-bridge.conf 
{
  "cniVersion": "0.4.0",
  "name": "cni-x",
  "type": "bridge",
  "bridge": "cbr0",
  "isDefaultGateway": true,
  "hairpinMode": true,
  "ipam": {
    "type": "host-local",
    "ranges": [
      [ { "subnet": "1100::100/120" } ],
      [ { "subnet": "11.0.1.0/24" } ]
    ]
  }
}
$ ls /opt/cni/bin/
bridge*      host-local*    loopback*

The pods are assigned dual addresses, which has been supported since v1.9. Note the order: IPv6 first.

In a dual-stack or a single-stack cluster whose main family is IPv4, status.podIP will nevertheless get an IPv6 address:

$ kubectl get pod alpine-daemonset-4t6w5 -o json | jq .status.podIP
"1100::102"

Anything else we need to know?:

First, note that this is not a dual-stack problem. status.podIP may get the wrong family in a single-stack cluster if the pods have dual addresses. The bug may however have been introduced with the dual-stack support; before status.podIPs existed, K8s may have selected the correct family, but I have not checked that.

This is reported and discussed in https://github.com/kubernetes/kubernetes/issues/94505. There it is considered a CNI-plugin problem: the CNI-plugin must present addresses in the order K8s wants them. The problem is that CNI-plugins don’t know this.

The addresses take the path:

CNI-plugin -> CRI-plugin -> kubelet -> API-server

The problem can be addressed in either of these places.

The current situation is that the full responsibility lies on the CNI-plugin: it must send the addresses in the “correct order”. This is not the best place. CNI-plugins are not a K8s-only thing, and requiring that a “preferred-family” or something similar be supported by CNI-plugins to be “Kubernetes compliant” should be avoided.

The CRI-plugin is a K8s-only thing, and a “preferred-family” could be introduced so the CRI-plugin can sort the address array before sending it to kubelet. This is however undesirable since it adds configuration complexity: it puts the responsibility on the user or installation tool to configure the “preferred-family” for all CRI-plugins, now and in the future.

The best option is to handle this in K8s itself, where the “main” family is known.

    // IP addresses allocated to the pod. This list
    // is inclusive, i.e. it includes the default IP address stored in the
    // "PodIP" field, and this default IP address must be recorded in the
    // 0th entry (PodIPs[0]) of the slice. The list is empty if no IPs have
    // been allocated yet.
    PodIPs []PodIP `json:"podIPs,omitempty" protobuf:"bytes,6,opt,name=podIPs"`

Is this really necessary? All communication works even if podIPs[0] is not of the main family. EndpointSlices seem to get this right regardless of order. The Endpoints object, on the other hand, seems to take the (old) podIP. So perhaps:

The "PodIP" field is set to the first address in podIPs that matches the main IP family.

is sufficient?

If podIP really must be podIPs[0], then K8s should sort the array so podIPs[0] belongs to the main family.
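For illustration, both suggestions could be sketched roughly as below. This is a minimal standalone sketch, not kubelet code; the function names (`firstIPOfFamily`, `sortByFamily`) are hypothetical:

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// isIPv4 reports whether the address string is an IPv4 address.
func isIPv4(ip string) bool {
	parsed := net.ParseIP(ip)
	return parsed != nil && parsed.To4() != nil
}

// firstIPOfFamily returns the first address matching the cluster's
// main family (wantIPv4), falling back to podIPs[0] if none matches.
// This corresponds to the "podIP = first match" suggestion.
func firstIPOfFamily(podIPs []string, wantIPv4 bool) string {
	if len(podIPs) == 0 {
		return ""
	}
	for _, ip := range podIPs {
		if isIPv4(ip) == wantIPv4 {
			return ip
		}
	}
	return podIPs[0]
}

// sortByFamily stably reorders podIPs so addresses of the main family
// come first, making podIPs[0] the "correct" podIP. This corresponds
// to the "K8s should sort the array" suggestion.
func sortByFamily(podIPs []string, wantIPv4 bool) {
	sort.SliceStable(podIPs, func(i, j int) bool {
		return isIPv4(podIPs[i]) == wantIPv4 && isIPv4(podIPs[j]) != wantIPv4
	})
}

func main() {
	// IPv6 first, as delivered by the bridge plugin in the repro above.
	ips := []string{"1100::102", "11.0.1.2"}
	fmt.Println(firstIPOfFamily(ips, true)) // prints "11.0.1.2"
	sortByFamily(ips, true)
	fmt.Println(ips) // prints "[11.0.1.2 1100::102]"
}
```

Either variant keeps podIP consistent with the main family without requiring the CNI-plugin to know anything about it.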

The problem exists on all supported K8s versions and on “master”.

Current situation for dual-stack supporting CNI-plugins

If installed in the default way the situation is;

  • Calico always sends IPv4 first
  • Cilium always sends IPv6 first

I don’t think you can control the order, but I have not asked.

Environment:

  • Kubernetes versions: v1.17.12, v1.18.9, v1.19.2, v1.20.0-alpha.1, master v1.20.0-alpha.1.257+112dbd55860e60
  • Cloud provider or hardware configuration: None
  • OS (e.g: cat /etc/os-release): xcluster
  • Kernel: linux-5.8.1
  • Install tools: None
  • Network plugin and version: bridge, host-local v0.8.7
  • Others: CRI-plugin: cri-o 1.18.3

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 36 (34 by maintainers)

Most upvoted comments

So, to sum up:

  • Kubelet will always use the first IP returned by the CNI plugin as pod.Spec.PodIP, regardless of any other cluster configuration.
  • Some CNI plugins, when configured to do dual stack, do not give you a way to configure which IP comes first, meaning kubelet may pick a pod.Spec.PodIP that is not the IP family the admin considers the cluster’s primary IP family
  • In dual-stack clusters, this does not appear to cause any actual problems, other than that the “default” pod IPs shown in “kubectl get” may not be the ones the administrator wants
  • Nothing in the current kubelet configuration is supposed to indicate whether kubelet should prefer IPv4 or IPv6 pod IPs. It would be possible to add a new config option indicating this, or we could just declare that --node-ip should also affect the sorting of pod IPs, because really why wouldn’t you want that?
  • In single-stack clusters, it has been suggested that the right answer is “don’t do that then”; ie, the administrator should configure the CNI plugin to only return a single IP, and if they can’t do that, then that’s the CNI plugin’s problem, not Kubernetes’s

oh, and:

  • if the CNI plugin returns multiple IPs which are not a dual-stack pair (eg, if it returns 2 IPv4 IPs, or 1 IPv4 IP and 2 IPv6 IPs), then kubelet will fail to update the pod at all, because it assumes it can just copy the IPs returned from CNI into podIPs, but the apiserver will only accept podIPs if it is either a single IP or a dual-stack pair.
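The apiserver rule described in that last bullet can be approximated like this. A simplified sketch, not the actual apiserver validation code; `validatePodIPs` is a hypothetical name:

```go
package main

import (
	"fmt"
	"net"
)

// validatePodIPs mimics the rule described above: podIPs must be
// either a single IP or a dual-stack pair (one IPv4 + one IPv6).
// Anything else is rejected, which is why kubelet fails to update
// the pod when CNI returns e.g. two IPv4 addresses.
func validatePodIPs(podIPs []string) error {
	switch len(podIPs) {
	case 0, 1:
		return nil
	case 2:
		a, b := net.ParseIP(podIPs[0]), net.ParseIP(podIPs[1])
		if a == nil || b == nil {
			return fmt.Errorf("invalid IP in podIPs")
		}
		if (a.To4() != nil) == (b.To4() != nil) {
			return fmt.Errorf("podIPs must be a dual-stack pair, got two addresses of the same family")
		}
		return nil
	default:
		return fmt.Errorf("at most 2 podIPs allowed, got %d", len(podIPs))
	}
}

func main() {
	fmt.Println(validatePodIPs([]string{"1100::102", "11.0.1.2"})) // prints "<nil>" (accepted)
	fmt.Println(validatePodIPs([]string{"11.0.1.2", "11.0.1.3"}))  // rejected: same family
}
```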

The requirement that podIP itself exists is for backwards compatibility with clients that do not read podIPs. But since they don’t read podIPs, the value of podIPs[0] should not matter to them, or?

The reason podIP must match podIPs[0] is related to updates to an existing object by an old client. See https://github.com/kubernetes/kubernetes/pull/88505 for a related problem when this was not done.

  • An old client updated an existing object, setting only the podIP field (which is the only one it was aware of).
  • The API server needed to detect that and populate the podIPs field correctly for new clients to read.
  • The detection is based on the mismatch between podIP and podIPs[0], and the server responds by populating podIPs=[podIP]. That means that new clients (aware of both fields) must set podIP and podIPs[0] to match.
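That compatibility logic can be sketched as below. A simplified illustration of the behaviour described in the bullets above, not the actual apiserver code; `normalizePodIPs` is a hypothetical name:

```go
package main

import "fmt"

// normalizePodIPs mimics the server-side compatibility fixup described
// above: when an old client updates only podIP (so it no longer matches
// podIPs[0]), the server detects the mismatch and repopulates podIPs
// from podIP. New clients must therefore keep podIP == podIPs[0].
func normalizePodIPs(podIP string, podIPs []string) []string {
	if podIP == "" {
		return podIPs
	}
	if len(podIPs) == 0 || podIPs[0] != podIP {
		// Mismatch: assume an old, single-IP-aware client wrote podIP.
		return []string{podIP}
	}
	return podIPs
}

func main() {
	// New client: podIP matches podIPs[0], the list is kept as-is.
	fmt.Println(normalizePodIPs("11.0.1.2", []string{"11.0.1.2", "1100::102"})) // prints "[11.0.1.2 1100::102]"
	// Old client set only podIP: podIPs is reset to [podIP].
	fmt.Println(normalizePodIPs("11.0.1.3", []string{"11.0.1.2", "1100::102"})) // prints "[11.0.1.3]"
}
```

This is why simply redefining podIP as "the first address of the main family" (without reordering podIPs) would break the mismatch detection.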