cilium: CiliumEndpoint IP does not match Pod IP
Bug report
There is a mismatch between the CiliumEndpoint IP and the Pod IP for one of the kube-dns Pods. This results in dropped traffic, because Cilium considers the target IP unmanaged and therefore not covered by the network policies that allow the traffic.
kubectl -n kube-system get po kube-dns-7c976ddbdb-77p6w -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-dns-7c976ddbdb-77p6w 4/4 Running 0 5d 10.12.3.7 gke-dev-dev-39261880-f5rb <none> <none>
kubectl -n kube-system get cep kube-dns-7c976ddbdb-77p6w
NAME ENDPOINT ID IDENTITY ID INGRESS ENFORCEMENT EGRESS ENFORCEMENT ENDPOINT STATE IPV4 IPV6
kube-dns-7c976ddbdb-77p6w 1645 16773543 ready 10.12.3.93
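The drops themselves can be confirmed from the Cilium agent on the affected node; for example (the agent pod name here is a placeholder):

```bash
# Watch drop events from the Cilium agent running on the node.
# "cilium-xxxxx" is a placeholder for the actual agent pod name.
kubectl -n cilium exec -it cilium-xxxxx -- cilium monitor --type drop
```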
We have only seen this happen for one of the kube-dns Pods (although the phenomenon is present in multiple clusters), but it may affect other Pods without us noticing yet. A sketch for scanning a whole cluster for such mismatches follows below.
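A minimal sketch of that scan, assuming kubectl and jq are available and that each CiliumEndpoint is named after its Pod (the default):

```bash
#!/usr/bin/env bash
# Sketch: report every Pod whose IP differs from its CiliumEndpoint IPv4.
# Assumes kubectl and jq, and that CiliumEndpoints share their Pod's name.
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  kubectl -n "$ns" get pods -o json |
    jq -r '.items[] | select(.status.podIP != null) |
           "\(.metadata.name) \(.status.podIP)"' |
    while read -r pod pod_ip; do
      cep_ip=$(kubectl -n "$ns" get cep "$pod" \
        -o jsonpath='{.status.networking.addressing[0].ipv4}' 2>/dev/null)
      if [ -n "$cep_ip" ] && [ "$cep_ip" != "$pod_ip" ]; then
        echo "MISMATCH: $ns/$pod pod=$pod_ip cep=$cep_ip"
      fi
    done
done
```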
General Information
- Cilium version (run cilium version):
  Client: 1.8.1 5ce2bc7b3 2020-07-02T20:04:47+02:00 go version go1.14.4 linux/amd64
  Daemon: 1.8.1 5ce2bc7b3 2020-07-02T20:04:47+02:00 go version go1.14.4 linux/amd64
- Kernel version (run uname -a):
  Linux gke-dev-dev-39261880-f5rb 4.19.112+ #1 SMP Thu May 21 12:32:38 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux
- Orchestration system version in use (e.g. kubectl version, Mesos, …):
  Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-gke.1", GitCommit:"688c6543aa4b285355723f100302d80431e411cc", GitTreeState:"clean", BuildDate:"2020-07-21T02:37:26Z", GoVersion:"go1.13.9b4", Compiler:"gc", Platform:"linux/amd64"}
- Link to relevant artifacts (policies, deployments scripts, …)
- Generate and upload a system zip:
How to reproduce the issue
- Create a GKE cluster
- Install Cilium (following https://docs.cilium.io/en/stable/gettingstarted/k8s-install-gke/). Cilium is installed in the cilium namespace. The only differences from the guide are nodeinit.restartPods=true and global.kubeProxyReplacement=disabled (a sketch of the resulting install command follows after this list).
- Check the kube-dns Pod IPs and their respective CiliumEndpoint IPs:
  kubectl -n kube-system get po -l k8s-app=kube-dns -owide
  kubectl -n kube-system get cep | grep kube-dns
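For reference, a sketch of that install, reconstructed from the Cilium 1.8 GKE guide rather than copied from the report; the nativeRoutingCIDR value is cluster-specific and the exact flag set may differ:

```bash
# Reconstructed from the 1.8 GKE guide with the two deviations noted above.
# NATIVE_CIDR is cluster-specific; flags are not verbatim from this report.
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --version 1.8.1 \
  --namespace cilium \
  --set global.nodeinit.enabled=true \
  --set nodeinit.restartPods=true \
  --set global.kubeProxyReplacement=disabled \
  --set global.cni.binPath=/home/kubernetes/bin \
  --set global.gke.enabled=true \
  --set global.nativeRoutingCIDR="$NATIVE_CIDR"
```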
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 18 (7 by maintainers)
Commits related to this issue
- node-init: Fixing pods restart on COS containerd The startup script for the node-init was not properly working for nodes running containerd on COS for two reasons: 1. crictl is not available in the ... — committed to fallard84/cilium by fallard84 3 years ago
- node-init: Fixing pods restart on COS containerd The startup script for the node-init was not properly working for nodes running containerd on COS for two reasons: 1. crictl is not available in the ... — committed to cilium/cilium by fallard84 3 years ago
- node-init: Fixing pods restart on COS containerd [ upstream commit c29525560c15bd7b6f0e7fcb0b6b3c9c71b6c3ec ] The startup script for the node-init was not properly working for nodes running containe... — committed to christarazi/cilium by fallard84 3 years ago
- node-init: Fixing pods restart on COS containerd [ upstream commit c29525560c15bd7b6f0e7fcb0b6b3c9c71b6c3ec ] The startup script for the node-init was not properly working for nodes running containe... — committed to cilium/cilium by fallard84 3 years ago
@aanm
Upgraded to 1.8.4 yesterday in a cluster with these issues. A day later there is no mismatch 👍 Will apply it to some more clusters and wait a few days before closing. But it looks promising!
Upon investigating further, only pods whose network is created before the kubelet gets restarted with Cilium (through the node-init DaemonSet) end up out of sync. nodeinit.restartPods was not working on GKE with cos_containerd, so I had turned it off, but now I see why it is mandatory to have it on. I will look into opening some PRs to fix nodeinit.restartPods for GKE with cos_containerd, as I have previously identified the problem.

@aanm Yes, no problem!
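For anyone hitting this before the fix lands, a minimal manual stand-in for nodeinit.restartPods, assuming kubectl access; the real node-init script works against the container runtime on the node instead, and NODE is a placeholder:

```bash
# Hypothetical manual equivalent of nodeinit.restartPods: delete every
# non-host-network Pod on a node so it is re-created through the Cilium CNI.
NODE=gke-dev-dev-39261880-f5rb   # placeholder: the affected node
kubectl get pods --all-namespaces --field-selector spec.nodeName="$NODE" -o json |
  jq -r '.items[] | select(.spec.hostNetwork != true) |
         "\(.metadata.namespace) \(.metadata.name)"' |
  while read -r ns pod; do
    kubectl -n "$ns" delete pod "$pod" --wait=false
  done
```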
Tried to reproduce this some more times. It does not always happen immediately, but it seems to always happen eventually.
For example, we had this issue in another cluster yesterday. A kubectl -n kube-system rollout restart deploy/kube-dns got us new kube-dns pods whose IPs matched their respective CiliumEndpoints. Today when checking, there is a mismatch for one of the kube-dns pods:

My initial thought was that this happens when Kubernetes nodes are added or removed (we use some preemptible nodes), thus moving the kube-dns workloads. But no nodes have been added or removed since yesterday (when there was no mismatch). So either the CiliumEndpoint IP or the kube-dns pod IP has changed since then. Unfortunately I didn't keep track of the exact IPs; a sketch for logging them over time follows at the end of this thread.

My bad. We actually have kubeProxyReplacement=disabled and Cilium is running in the cilium namespace. Will update the description.
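To pin down when the drift happens next time, the two IPs could be logged periodically; a minimal sketch, assuming kubectl and jq (the interval and log file are arbitrary choices):

```bash
# Sketch: periodically log each kube-dns Pod IP next to its CiliumEndpoint
# IPv4, so the moment of divergence is captured. Assumes kubectl and jq.
while true; do
  kubectl -n kube-system get po -l k8s-app=kube-dns -o json |
    jq -r '.items[] | "\(.metadata.name) \(.status.podIP)"' |
    while read -r pod pod_ip; do
      cep_ip=$(kubectl -n kube-system get cep "$pod" \
        -o jsonpath='{.status.networking.addressing[0].ipv4}' 2>/dev/null)
      echo "$(date -Is) $pod pod=$pod_ip cep=$cep_ip"
    done >> kube-dns-ip.log
  sleep 300
done
```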