cilium: Endpoints unreachable k8s 1.19.0
Bug report
We wanted to upgrade our Kubernetes cluster from v1.18.8 to v1.19.0. We updated components in the following order: API server -> controller manager -> scheduler -> kubelet. After updating the kubelet to v1.19.0 on a worker node (prod-k8s-worker-8da7da49), readiness probes started failing and Pods on that node could not reach the outside world.
Downgrading the worker nodes brings back a healthy state. We kept only one node on the newer version for debugging purposes.
Let's check what I can see from the Pod running on the new kubelet:

```
$ kubectl -n kube-system exec -ti cilium-hqr9x -- cilium status
KVStore:                Ok   Disabled
Kubernetes:             Ok   1.19 (v1.19.0) [linux/amd64]
Kubernetes APIs:        ["CustomResourceDefinition", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1beta1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:   Probe   [eth0 (DR)] [NodePort (SNAT, 30000-32767, XDP: DISABLED), HostPort, ExternalIPs, HostReachableServices (TCP, UDP), SessionAffinity]
Cilium:                 Ok   OK
NodeMonitor:            Listening for events on 8 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok
IPAM:                   IPv4: 7/255 allocated from 10.0.1.0/24,
Masquerading:           BPF   [eth0]   10.0.1.0/24
Controller Status:      39/39 healthy
Proxy Status:           OK, ip 10.0.1.177, 0 redirects active on ports 10000-20000
Hubble:                 Ok   Current/Max Flows: 3218/4096 (78.56%), Flows/s: 6.50   Metrics: Ok
Cluster health:         3/8 reachable   (2020-09-04T15:32:29Z)
  Name                        IP              Node        Endpoints
  prod-k8s-master-1           192.249.66.180  reachable   unreachable
  prod-k8s-master-2           192.249.66.181  reachable   unreachable
  prod-k8s-worker-8ddbacc3    192.249.66.188  reachable   unreachable
  prod-k8s-worker-f08f772d    192.249.66.187  reachable   unreachable
  prod-k8s-worker-fc988125    192.249.66.185  reachable   unreachable
```
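To dig deeper than the cached health matrix above, a fresh probe can be forced from inside the affected agent pod. This is a minimal sketch; the pod name `cilium-hqr9x` comes from this report (substitute the agent pod on your node), and it assumes the `cilium-health status --probe` flag available in Cilium 1.8:

```shell
# Force a fresh health probe from inside the affected cilium agent pod,
# instead of reading the cached "Cluster health" matrix.
POD="${POD:-cilium-hqr9x}"   # assumption: the agent pod on the broken node
if command -v kubectl >/dev/null 2>&1; then
    # --probe triggers a new ICMP/HTTP probe run across all nodes.
    kubectl -n kube-system exec "$POD" -- cilium-health status --probe \
        || echo "probe failed for pod $POD"
    RESULT="attempted"
else
    echo "kubectl not found; run this from a host with cluster access"
    RESULT="skipped"
fi
echo "result=$RESULT"
```

Per-node output distinguishes whether the node IP or only the health endpoint inside the Pod CIDR is unreachable, which is exactly the split shown in the table above.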
Let's check what I can see from a Pod running on the old kubelet:

```
$ kubectl -n kube-system exec -ti cilium-qtsf7 -- cilium status
KVStore:                Ok   Disabled
Kubernetes:             Ok   1.19 (v1.19.0) [linux/amd64]
Kubernetes APIs:        ["CustomResourceDefinition", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1beta1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:   Probe   [eth0 (DR)] [NodePort (SNAT, 30000-32767, XDP: DISABLED), HostPort, ExternalIPs, HostReachableServices (TCP, UDP), SessionAffinity]
Cilium:                 Ok   OK
NodeMonitor:            Listening for events on 8 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok
IPAM:                   IPv4: 19/255 allocated from 10.0.4.0/24,
Masquerading:           BPF   [eth0]   10.0.4.0/24
Controller Status:      102/102 healthy
Proxy Status:           OK, ip 10.0.4.117, 0 redirects active on ports 10000-20000
Hubble:                 Ok   Current/Max Flows: 4096/4096 (100.00%), Flows/s: 55.16   Metrics: Ok
Cluster health:         7/8 reachable   (2020-09-04T15:41:31Z)
  Name                        IP              Node        Endpoints
  prod-k8s-worker-8da7da49    192.249.66.186  reachable   unreachable
```
General Information
- Cilium version (run `cilium version`):

  ```
  Client: 1.8.3 54cf3810d 2020-09-04T14:01:53+02:00 go version go1.14.7 linux/amd64
  Daemon: 1.8.3 54cf3810d 2020-09-04T14:01:53+02:00 go version go1.14.7 linux/amd64
  ```
- Kernel version (run `uname -a`):

  ```
  Linux prod-k8s-worker-8da7da49 5.8.5-talos #1 SMP Tue Sep 1 19:36:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  ```
- Orchestration system version in use (e.g. `kubectl version`, Mesos, …):

  ```
  Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:23:04Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
  ```
- Link to relevant artifacts (policies, deployment scripts, …): N/A; cilium-sysdump-20200904-173209.zip attached
- Generate and upload a system zip:

  ```
  curl -sLO https://git.io/cilium-sysdump-latest.zip && python cilium-sysdump-latest.zip
  ```
How to reproduce the issue

This is tricky because our playground cluster does not reproduce this symptom - I believe because it does not have the same load, Pods, network policies, etc.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 19 (14 by maintainers)
I did a lot of investigation and found the root cause. Let me share the results.
VMware ESXi 7.0 and later introduced a new revision of the VMXNET3 network adapter.

- `6.5.0 build 9298722` with kernel `v5.5.15` => OK
- `6.5.0 build 9298722` with kernel `v5.8.10` => OK
- `7.0.1 build 16850804` with kernel `v5.5.15` => OK
- `7.0.1 build 16850804` with kernel `v5.8.10` / `v5.8.15` => NOK

Until a potential fix from VMware, we have to use E1000E as the network adapter. I am pretty confident that this is not related to Cilium.
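For anyone hitting this, a quick way to tell whether a node is backed by VMXNET3 or E1000E is to ask the kernel which driver owns the uplink NIC. A minimal sketch, assuming the interface is named `eth0` (set `IFACE` otherwise):

```shell
# Identify the driver behind the node's uplink NIC (e.g. vmxnet3 vs e1000e).
IFACE="${IFACE:-eth0}"   # assumption: eth0 is the uplink; override via IFACE
if command -v ethtool >/dev/null 2>&1 && [ -e "/sys/class/net/$IFACE" ]; then
    # ethtool -i reports driver name and version as seen by the kernel.
    DRIVER="$(ethtool -i "$IFACE" | awk '/^driver:/ {print $2}')"
else
    # Fallback: read the bound driver straight from sysfs, no ethtool needed.
    DRIVER="$(basename "$(readlink "/sys/class/net/$IFACE/device/driver" 2>/dev/null)" 2>/dev/null || echo unknown)"
fi
: "${DRIVER:=unknown}"
echo "driver=$DRIVER"
```

On the affected nodes this would report `vmxnet3`; after switching the adapter type in vSphere it should report `e1000e`.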
Thank you for your help
@alex1989hu there isn't an ETA for `1.8.5` at the moment.