cilium: Cilium Helm install does not finish properly when BGP CP is enabled.

What happened?

Cilium Helm install does not finish properly when BGP CP is enabled.

Run helm install cilium cilium/cilium --version 1.14.4 -f values.yaml -n kube-system.

Values:

cluster:
  name: "{{ cluster_name }}"
  id: 0
tunnel: "geneve"
encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true
bgpControlPlane:
  enabled: true
ipam:
  mode: "kubernetes"
envoy:
  # -- Enable Envoy Proxy in standalone DaemonSet.
  enabled: true
# Hubble
hubble:
  metrics:
    enabled:
    - dns:query;ignoreAAAA
    - drop
    - tcp
    - flow
    - icmp
    - http
  peerService:
    clusterDomain: "{{ cluster_name }}"
  relay:
    enabled: true
    tls:
      server:
        enabled: true
    prometheus:
      enabled: true
  ui:
    enabled: true
prometheus:
  enabled: true

The installation ends with the Cilium agents failing to come up; the kubelet reports a failing startup probe:

Warning  Unhealthy  3m34s (x4 over 3m40s)  kubelet            Startup probe failed: Get "http://127.0.0.1:9879/healthz": dial tcp 127.0.0.1:9879: connect: connection refused

The agent pod log shows errors about a missing CRD:

level=info msg="Start hook executed" duration="2.475µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Service].Start" subsys=hive
level=info msg="Start hook executed" duration=101.003996ms function="*manager.diffStore[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Service].Start" subsys=hive
level=info msg="Start hook executed" duration="4.157µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s.Endpoints].Start" subsys=hive
level=info msg="Using discoveryv1.EndpointSlice" subsys=k8s
level=info msg="Start hook executed" duration=100.727737ms function="*manager.diffStore[*github.com/cilium/cilium/pkg/k8s.Endpoints].Start" subsys=hive
level=info msg="Start hook executed" duration="4.99µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2alpha1.CiliumBGPPeeringPolicy].Start" subsys=hive
level=info msg="Start hook executed" duration="2.502µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.CiliumNode].Start" subsys=hive
level=info msg="Start hook executed" duration="14.413µs" function="*agent.kubernetesNodeSpecer.Start" subsys=hive
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s
level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:305: Failed to watch *v2alpha1.CiliumBGPPeeringPolicy: failed to list *v2alpha1.CiliumBGPPeeringPolicy: the server could not find the requested resource (get ciliumbgppeeringpolicies.cilium.io)" subsys=k8s

My guess is that the agent tries to list the BGP CRDs, but the CRDs are only created later by the operator once it is up and running, and the operator cannot run until the network (the agent) is ready. Is this a race condition?

Once BGP CP is disabled:

bgpControlPlane:
  enabled: false

Cilium is deployed properly.

Cilium Version

1.14.3, 1.14.4

Kernel Version

6.2.0-37-generic

Kubernetes Version

v1.27.6

Sysdump

No response

About this issue

  • State: closed
  • Created 7 months ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

@stelucz Thanks, I managed to reproduce the issue (setting the Helm value operator.tolerations to [] ensures that the operator does not start before CNI config is installed on the node and triggers this issue).

The root cause seems to be that BGP CP creates a resource.Store for necessary resources in the Start hive hook, which blocks if the CRD is not yet installed: https://github.com/cilium/cilium/blob/61b9a21175844fdbc4d963e1f9846358d83510c6/pkg/bgpv1/manager/store/diffstore.go#L77-L83

As a result, another Start hive hook, which is supposed to install the Cilium CNI config files, is never called, because it appears to run after the BGP hooks:

https://github.com/cilium/cilium/blob/61b9a21175844fdbc4d963e1f9846358d83510c6/daemon/cmd/cni/config.go#L171-L178

Will look into how we can solve this.

@rastislavs Yes, the inconsistency is not just in CRD installation, as pointed out here https://github.com/cilium/cilium/issues/29371#issuecomment-1829720624, but also in how the agent evaluates the CRDs and expects them to be present in the cluster.

Do you need any other info from me now? I am happy to help.

@stelucz

This means there is some inconsistency in startup process between having BGP CP disabled and enabled in the agent.

Yeah, there seems to be some inconsistency in how and when CRDs are installed. This is partially expected, as cilium-agent is modular and each component can do things slightly differently, but maybe we should double-check the BGP CRD installation logic against the other components.

@stelucz I believe your issue is related to the tolerations set on the cilium-operator. By default it is deployed with [{"operator":"Exists"}]. From the sysdump, it seems yours is deployed with the following tolerations (I guess you set them via Helm values, or is it done by Kubespray?):

    tolerations:
    - effect: NoSchedule
      key: nodeType
      operator: Equal
      value: infrastructure
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300

That might be the reason why it normally works for other users.

Note that NetworkPluginNotReady should normally not prevent cilium-operator from starting, as it runs with hostNetwork: true (so it does not really require CNI to be set up) and it has the “wildcard” toleration [{"operator":"Exists"}], so normally the operator can start before the cilium-agents are ready.
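
To make the toleration difference concrete, here is a small Go sketch that re-implements the Kubernetes toleration matching rule (an empty key with operator Exists matches every taint). It is a simplified model, not the scheduler's code: the `taint` and `toleration` types and the `schedulable` helper are hypothetical, and it assumes the not-yet-ready node carries a `node.kubernetes.io/not-ready` taint with effect NoSchedule.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Kubernetes Taint and Toleration types.
type taint struct{ key, value, effect string }
type toleration struct{ key, operator, value, effect string }

// tolerates follows the Kubernetes matching rule: an empty effect or an empty
// key on the toleration acts as a wildcard, "Exists" ignores the value, and
// "Equal" (the default operator) compares values.
func tolerates(tol toleration, t taint) bool {
	if tol.effect != "" && tol.effect != t.effect {
		return false
	}
	if tol.key != "" && tol.key != t.key {
		return false
	}
	switch tol.operator {
	case "", "Equal":
		return tol.value == t.value
	case "Exists":
		return true
	default:
		return false
	}
}

// schedulable reports whether every taint is tolerated by at least one toleration.
func schedulable(tols []toleration, taints []taint) bool {
	for _, t := range taints {
		ok := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	// Assumed taint on a node whose CNI is not yet configured.
	notReady := []taint{{key: "node.kubernetes.io/not-ready", effect: "NoSchedule"}}

	// Default cilium-operator toleration, as quoted above.
	wildcard := []toleration{{operator: "Exists"}}

	// Tolerations from the sysdump, as quoted above.
	custom := []toleration{
		{key: "nodeType", operator: "Equal", value: "infrastructure", effect: "NoSchedule"},
		{key: "node.kubernetes.io/not-ready", operator: "Exists", effect: "NoExecute"},
		{key: "node.kubernetes.io/unreachable", operator: "Exists", effect: "NoExecute"},
	}

	fmt.Println(schedulable(wildcard, notReady)) // true
	fmt.Println(schedulable(custom, notReady))   // false
}
```

Under that assumption, the custom list tolerates not-ready only for the NoExecute effect, so the operator would not be scheduled onto a node that is still waiting for its CNI config, while the wildcard toleration lets it start first.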