cilium: Cilium Helm install does not finish properly when BGP CP is enabled.
Is there an existing issue for this?
- I have searched the existing issues
What happened?
Cilium Helm install does not finish properly when BGP CP is enabled.
Run helm install cilium cilium/cilium --version 1.14.4 -f values.yaml -n kube-system.
Values:
cluster:
  name: "{{ cluster_name }}"
  id: 0
tunnel: "geneve"
encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true
bgpControlPlane:
  enabled: true
ipam:
  mode: "kubernetes"
envoy:
  # -- Enable Envoy Proxy in standalone DaemonSet.
  enabled: true
# Hubble
hubble:
  metrics:
    enabled:
      - dns:query;ignoreAAAA
      - drop
      - tcp
      - flow
      - icmp
      - http
  peerService:
    clusterDomain: "{{ cluster_name }}"
  relay:
    enabled: true
    tls:
      server:
        enabled: true
    prometheus:
      enabled: true
  ui:
    enabled: true
prometheus:
  enabled: true
The installation ends with the Cilium agents not coming up; the kubelet reports a failing startup probe:
Warning Unhealthy 3m34s (x4 over 3m40s) kubelet Startup probe failed: Get "http://127.0.0.1:9879/healthz": dial tcp 127.0.0.1:9879: connect: connection refused
The agent pod log shows errors about a missing CRD:
level=info msg="Start hook executed" duration="2.475µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Service].Start" subsys=hive
level=info msg="Start hook executed" duration=101.003996ms function="*manager.diffStore[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Service].Start" subsys=hive
level=info msg="Start hook executed" duration="4.157µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s.Endpoints].Start" subsys=hive
level=info msg="Using discoveryv1.EndpointSlice" subsys=k8s
level=info msg="Start hook executed" duration=100.727737ms function="*manager.diffStore[*github.com/cilium/cilium/pkg/k8s.Endpoints].Start" subsys=hive
level=info msg="Start hook executed" duration="4.99µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2alpha1.CiliumBGPPeeringPolicy].Start" subsys=hive
level=info msg="Start hook executed" duration="2.502µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.CiliumNode].Start" subsys=hive
level=info msg="Start hook executed" duration="14.413µs" function="*agent.kubernetesNodeSpecer.Start" subsys=hive
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
My guess is that the agent tries to list the BGP CRDs, but the CRDs are created later by the operator once it is up and running, and the operator can’t run until the network (agent) is ready. Is this a race condition?
Once BGP CP is disabled:
bgpControlPlane:
  enabled: false
Cilium is deployed properly.
Cilium Version
1.14.3, 1.14.4
Kernel Version
6.2.0-37-generic
Kubernetes Version
v1.27.6
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- Original URL
- State: closed
- Created 7 months ago
- Comments: 15 (6 by maintainers)
@stelucz Thanks, I managed to reproduce the issue (setting the Helm value operator.tolerations to [] ensures that the operator does not start before the CNI config is installed on the node, which triggers this issue).
The root cause seems to be that BGP CP creates a resource.Store for the necessary resources in the Start hive hook, which blocks if the CRD is not yet installed: https://github.com/cilium/cilium/blob/61b9a21175844fdbc4d963e1f9846358d83510c6/pkg/bgpv1/manager/store/diffstore.go#L77-L83
As a result, another Start hive hook, which is supposed to install the Cilium CNI config files, is never called, because it seems to run after the BGP hooks: https://github.com/cilium/cilium/blob/61b9a21175844fdbc4d963e1f9846358d83510c6/daemon/cmd/cni/config.go#L171-L178
Will look into how we can solve this.
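The reproduction described above can be sketched as a Helm values fragment. Only the two keys named in this thread are shown; the rest of the values file is assumed to be unchanged, and the comments are an interpretation of the maintainer's description rather than a confirmed minimal reproducer.

```yaml
# Sketch of the reproduction values (assumption: all other values stay as in the report).
bgpControlPlane:
  enabled: true
operator:
  # An empty list drops the chart's default wildcard toleration, so the operator
  # is not scheduled until the CNI config has been installed on the node.
  tolerations: []
```

With these values the agent waits for the CiliumBGPPeeringPolicy CRD in its Start hook, while the operator waits for the node to become ready, which matches the deadlock described in this comment.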
@rastislavs Yes, not just in the CRD installation, as pointed out here https://github.com/cilium/cilium/issues/29371#issuecomment-1829720624, but also in the agent's expectation that the CRDs are already present in the cluster.
Do you need any other info from me now? I am happy to help.
@stelucz
Yeah, there seems to be some inconsistency in how and when CRDs are being installed. Partially it is expected, as cilium-agent is modular and each component can do things slightly differently, but maybe we should double-check the BGP CRD installation logic in comparison to other components.
@stelucz I believe your issue is somehow related to the tolerations set on the cilium-operator. By default it is deployed with [{"operator":"Exists"}]. From the sysdump it seems yours is deployed with the following tolerations (I guess you set them via Helm values, or it is done by Kubespray?):
That might be the reason why it normally works for other users.
Note that NetworkPluginNotReady should normally not prevent cilium-operator from starting, as it runs with hostNetwork: true (so it does not really require CNI to be set up) and it has the “wildcard” toleration [{"operator":"Exists"}], so normally the operator can start before the cilium-agents are ready.
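For comparison, the default wildcard toleration mentioned above would look roughly like this when written out in a Helm values file. This is a sketch that assumes the values file (or a tool such as Kubespray) currently overrides operator.tolerations; whether restoring it resolves the reporter's setup is not confirmed in this thread.

```yaml
operator:
  tolerations:
    # Wildcard toleration: with no key and operator "Exists", it matches every taint,
    # so the operator can be scheduled on nodes that are still marked not ready.
    - operator: Exists
```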