cilium: CI: FAIL: TestControlPlane/CiliumNodes/v1.24

CI: FAIL: TestControlPlane/CiliumNodes/v1.24 [TravisCI]

--- FAIL: TestControlPlane (20.56s)

    --- PASS: TestControlPlane/CNPStatusNodesGC (1.53s)
    --- PASS: TestControlPlane/LocalNodeStore (0.43s)
    --- PASS: TestControlPlane/NodeHandler (0.48s)
    --- FAIL: TestControlPlane/CiliumNodes (11.75s)
        --- FAIL: TestControlPlane/CiliumNodes/v1.24 (10.40s)
        --- PASS: TestControlPlane/CiliumNodes/v1.25 (0.67s)
        --- PASS: TestControlPlane/CiliumNodes/v1.26 (0.68s)
    --- PASS: TestControlPlane/Services/DualStack (1.44s)
        --- PASS: TestControlPlane/Services/DualStack/v1.24 (0.52s)
        --- PASS: TestControlPlane/Services/DualStack/v1.25 (0.41s)
        --- PASS: TestControlPlane/Services/DualStack/v1.26 (0.52s)
    --- PASS: TestControlPlane/Services/GracefulTermination (0.62s)
    --- PASS: TestControlPlane/Services/NodePort (4.32s)
        --- PASS: TestControlPlane/Services/NodePort/v1.24/nodeport-control-plane (0.51s)
        --- PASS: TestControlPlane/Services/NodePort/v1.24/nodeport-worker (0.52s)
        --- PASS: TestControlPlane/Services/NodePort/v1.24/nodeport-worker2 (0.41s)
        --- PASS: TestControlPlane/Services/NodePort/v1.25/nodeport-control-plane (0.50s)
        --- PASS: TestControlPlane/Services/NodePort/v1.25/nodeport-worker (0.50s)
        --- PASS: TestControlPlane/Services/NodePort/v1.25/nodeport-worker2 (0.42s)
        --- PASS: TestControlPlane/Services/NodePort/v1.26/nodeport-control-plane (0.52s)
        --- PASS: TestControlPlane/Services/NodePort/v1.26/nodeport-worker (0.44s)
        --- PASS: TestControlPlane/Services/NodePort/v1.26/nodeport-worker2 (0.50s)
FAIL
	github.com/cilium/cilium/test/controlplane	coverage: 100.0% of statements
FAIL	github.com/cilium/cilium/test/controlplane	20.811s

--- FAIL: TestControlPlane (20.56s)

    --- FAIL: TestControlPlane/CiliumNodes (11.75s)

        --- FAIL: TestControlPlane/CiliumNodes/v1.24 (10.40s)

level=warning msg="running cilium with \"kvstore\"=\"\" requires identity allocation via crds. changing identity-allocation-mode to \"crd\"" subsys=config
level=info msg=invoked duration="984.666µs" function="metrics.newregistry (registry.go:65)" subsys=hive
level=info msg=invoked duration=129.470369ms function="cmd.glob..func4 (daemon_main.go:1574)" subsys=hive
level=info msg="spire delegate api client is disabled as no socket path is configured" subsys=spire-delegate
level=info msg="mutual authentication handler is disabled as no port is configured" subsys=auth
level=info msg=invoked duration="668.229µs" function="auth.newmanager (cell.go:87)" subsys=hive
level=info msg=invoked duration="5.062µs" function="suite.startciliumagent.func8 (agent.go:95)" subsys=hive
level=info msg=starting subsys=hive
level=info msg="start hook executed" duration="2.076µs" function="metrics.newregistry.func1 (registry.go:86)" subsys=hive
level=info msg="start hook executed" duration="2.962µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.node].start" subsys=hive
level=info msg="using autogenerated ipv4 allocation range" subsys=node v4prefix=10.148.0.0/16
level=info msg="serving prometheus metrics on :9962" subsys=metrics
level=info msg="start hook executed" duration="512.899µs" function="node.newlocalnodestore.func1 (local_node_store.go:77)" subsys=hive
level=info msg="start hook executed" duration="1.625µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.service].start" subsys=hive
level=info msg="start hook executed" duration=100.182304ms function="*manager.diffstore[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.service].start" subsys=hive
level=info msg="start hook executed" duration="1.527µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s.endpoints].start" subsys=hive
level=info msg="using discoveryv1.endpointslice" subsys=k8s
level=info msg="start hook executed" duration=100.669833ms function="*manager.diffstore[*github.com/cilium/cilium/pkg/k8s.endpoints].start" subsys=hive
level=info msg="start hook executed" duration="2.551µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumnode].start" subsys=hive
level=info msg="start hook executed" duration=952ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.pod].start" subsys=hive
level=info msg="start hook executed" duration="1.034µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.namespace].start" subsys=hive
level=info msg="start hook executed" duration=952ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumnetworkpolicy].start" subsys=hive
level=info msg="start hook executed" duration="1.305µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumclusterwidenetworkpolicy].start" subsys=hive
level=info msg="start hook executed" duration=878ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2alpha1.ciliumcidrgroup].start" subsys=hive
level=info msg="start hook executed" duration="18.945µs" function="endpointmanager.newdefaultendpointmanager.func1 (cell.go:185)" subsys=hive
level=info msg="start hook executed" duration="10.675µs" function="cmd.newpolicytrifecta.func1 (policy.go:127)" subsys=hive
level=info msg="start hook executed" duration="102.284µs" function="*manager.manager.start" subsys=hive
level=info msg="no eventsmap: monitor works only for agent events." subsys=monitor-agent
level=info msg="start hook executed" duration="5.67µs" function="agent.newmonitoragent.func1 (cell.go:62)" subsys=hive
level=info msg="start hook executed" duration=714ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2alpha1.ciliuml2announcementpolicy].start" subsys=hive
level=info msg="start hook executed" duration="5.99µs" function="*job.group.start" subsys=hive
level=info msg="auto-enabling \"enable-node-port\", \"enable-external-ips\", \"bpf-lb-sock\", \"enable-host-port\", \"enable-session-affinity\" features" subsys=daemon
level=warning msg="unable to automatically detect mtu" error="no interface contains the provided ip: 172.18.0.3" subsys=mtu
level=info msg="l7 proxies are disabled" subsys=daemon
level=warning msg="no valid cgroup base path found: socket load-balancing tracing with hubble will not work.see the kubeproxy-free guide for more details." subsys=cgroup-manager
level=info msg="reading old endpoints..." subsys=daemon
level=info msg="no old endpoints found." subsys=daemon
level=info msg="retrieved node information from kubernetes node" nodename=cilium-nodes-worker subsys=k8s
level=info msg="received own node information from api server" ipaddr.ipv4=172.18.0.3 ipaddr.ipv6="<nil>" k8snodeip=172.18.0.3 labels="map[cilium.io/ci-node:k8s1 kubernetes.io/arch:amd64 kubernetes.io/hostname:cilium-nodes-worker kubernetes.io/os:linux]" nodename=cilium-nodes-worker subsys=k8s v4prefix=10.244.1.0/24 v6prefix="<nil>"
level=info msg="k8s mode: allowing localhost to reach local endpoints" subsys=daemon
level=info msg="detected devices" devices="[ens5]" subsys=linux-datapath
level=info msg="enabling k8s event listener" subsys=k8s-watcher
level=info msg="removing stale endpoint interfaces" subsys=daemon
level=info msg="skipping kvstore configuration" subsys=daemon
level=info msg="initializing node addressing" subsys=daemon
level=info msg="initializing kubernetes ipam" subsys=ipam v4prefix=1.1.1.0/24 v6prefix="cafe::/96"
level=info msg="restoring endpoints..." subsys=daemon
level=info msg="endpoints restored" failed=0 restored=0 subsys=daemon
level=info msg="addressing information:" subsys=daemon
level=info msg="  cluster-name: default" subsys=daemon
level=info msg="  cluster-id: 0" subsys=daemon
level=info msg="  local node-name: cilium-nodes-worker" subsys=daemon
level=info msg="  node-ipv6: <nil>" subsys=daemon
level=info msg="  external-node ipv4: 172.18.0.3" subsys=daemon
level=info msg="  internal-node ipv4: 1.1.1.2" subsys=daemon
level=info msg="  ipv4 allocation prefix: 10.244.1.0/24" subsys=daemon
level=info msg="  loopback ipv4: 169.254.42.1" subsys=daemon
level=info msg="  local ipv4 addresses:" subsys=daemon
level=info msg="  - 10.0.0.2" subsys=daemon
level=info msg="  - 10.0.0.3" subsys=daemon
level=info msg="  - 10.0.0.4" subsys=daemon
level=info msg="creating or updating ciliumnode resource" node=cilium-nodes-worker subsys=nodediscovery
level=info msg="initializing identity allocator" subsys=identity-cache
level=info msg="allocating identities between range" cluster-id=0 max=65535 min=256 subsys=identity-cache
level=info msg="start hook executed" duration=2.992882ms function="cmd.newdaemonpromise.func1 (daemon_main.go:1626)" subsys=hive
level=info msg="start hook executed" duration="3.102µs" function="signal.providesignalmanager.func1 (cell.go:26)" subsys=hive
level=info msg="start hook executed" duration="1.469µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumidentity].start" subsys=hive
level=info msg="start hook executed" duration=935ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumnode].start" subsys=hive
level=info msg="start hook executed" duration="3.651µs" function="auth.newmanager.func1 (cell.go:92)" subsys=hive
level=info msg="start hook executed" duration="20.078µs" function="*job.group.start" subsys=hive
level=info msg="waiting until local node addressing before starting watchers depending on it" subsys=k8s-watcher
level=info msg="waiting until all pre-existing resources have been received" subsys=k8s-watcher
level=info msg="adding local node to cluster" node="{cilium-nodes-worker default [{internalip 172.18.0.3} {ciliuminternalip 1.1.1.2}] 10.244.1.0/24 [] <nil> [] 1.1.1.205 <nil> <nil> <nil> 0 local 0 map[cilium.io/ci-node:k8s1 kubernetes.io/arch:amd64 kubernetes.io/hostname:cilium-nodes-worker kubernetes.io/os:linux] map[] 1 }" subsys=nodediscovery
level=info msg="datapath signal listener running" subsys=signal
    testcase.go:302: ciliumnode labels mismatch, expected: map[cilium.io/ci-node:k8s1 kubernetes.io/arch:amd64 kubernetes.io/hostname:cilium-nodes-worker kubernetes.io/os:linux], found: map[cilium.io/ci-node:k8s1 kubernetes.io/arch:amd64 kubernetes.io/hostname:cilium-nodes-worker kubernetes.io/os:linux test-label:test-value]
  • Copy-paste the link of the CI build where that test failure has happened: unfortunately, the test was re-run, so the link is no longer available.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

I’ve tried to reproduce this locally with no luck, so I tried to pin down the culprit by comparing the logs from a successful and a failed run.

In every log (both successful and unsuccessful) an initial occurrence of the following log entry can be seen:

level=info msg="Creating or updating CiliumNode resource" node=cilium-nodes-worker subsys=nodediscovery

This initial CiliumNode resource upsert is performed here:

https://github.com/cilium/cilium/blob/e0931df324592358d4645a9f9a31ca87aeddaf70/daemon/cmd/daemon.go#L874

Since that call is blocking, this initial step should not be related to the test failure, as it must complete before newDaemon returns.

But continuing with the analysis of the logs, in a successful run other occurrences like that one appear:

level=error msg="Host endpoint not found" subsys=endpoint-manager
level=info msg="Creating or updating CiliumNode resource" node=cilium-nodes-worker subsys=nodediscovery

While in a failed run the Host endpoint not found log is not printed (and consequently neither is the other one).

The first log entry comes from the (*endpointManager).OnUpdateNode callback that is registered just before the (*ciliumNodeUpdater).OnUpdateNode one:

https://github.com/cilium/cilium/blob/e0931df324592358d4645a9f9a31ca87aeddaf70/daemon/cmd/daemon.go#L670-L684

and the callbacks are executed in the same order in which they are registered. So, if the endpointManager callback blocks or slows down, as the missing Host endpoint not found log suggests, the ciliumNodeUpdater callback cannot run, ultimately leading to the failure.
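
To illustrate the ordering point, here is a minimal, self-contained sketch (made-up names, not the actual cilium code) of a subscriber chain in which a slow first callback delays the one registered after it:

```go
package main

import (
	"fmt"
	"time"
)

// nodeSubscriber mimics a node event handler registered on a chain.
type nodeSubscriber interface {
	OnUpdateNode(name string)
}

// nodeChain invokes its subscribers sequentially, in registration order.
type nodeChain struct {
	subs []nodeSubscriber
}

func (c *nodeChain) Register(s nodeSubscriber) { c.subs = append(c.subs, s) }

func (c *nodeChain) Notify(name string) {
	for _, s := range c.subs {
		s.OnUpdateNode(name) // a slow subscriber blocks all later ones
	}
}

// slowEndpointManager stands in for the endpointManager callback.
type slowEndpointManager struct{}

func (slowEndpointManager) OnUpdateNode(name string) {
	time.Sleep(2 * time.Second) // e.g. waiting on a contended lock
	fmt.Println("endpointManager handled update for", name)
}

// nodeUpdater stands in for the ciliumNodeUpdater callback.
type nodeUpdater struct{}

func (nodeUpdater) OnUpdateNode(name string) {
	fmt.Println("ciliumNodeUpdater upserts the CiliumNode for", name)
}

func main() {
	chain := &nodeChain{}
	chain.Register(slowEndpointManager{}) // registered first, runs first
	chain.Register(nodeUpdater{})         // cannot run until the first returns
	chain.Notify("cilium-nodes-worker")
}
```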

So I tried to see if there is a sequence of events that could possibly slow down the execution of the endpointManager callback. This lock here might be related:

https://github.com/cilium/cilium/blob/c1dffab254d7864edcb75aa5db159c06c7fa3381/pkg/endpointmanager/host.go#L16

Since it is held by the periodic endpoint gc that is enabled in the controlplane tests:

https://github.com/cilium/cilium/blob/6decd65279aea25a909ee418b7251b7bcacd994d/pkg/endpointmanager/gc.go#L30

Though there should be no endpoints in the endpoint manager map during controlplane testing (and so each iteration of the GC should be very fast), the unpredictability of the CI environment may have led to the flake.
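
A minimal sketch of the suspected interaction (purely illustrative, with a plain mutex standing in for the endpoint manager lock linked above): a periodic GC that acquires the same lock as the node callback can delay that callback whenever their timings collide:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var mu sync.Mutex // stands in for the lock shared by the GC and the callback

// periodicGC models the endpoint GC controller: on every tick it takes the
// lock and holds it for the duration of one iteration.
func periodicGC(interval, holdFor time.Duration) {
	for range time.Tick(interval) {
		mu.Lock()
		time.Sleep(holdFor)
		mu.Unlock()
	}
}

// onUpdateNode models the endpointManager callback, which needs the same
// lock to look up the host endpoint.
func onUpdateNode() {
	start := time.Now()
	mu.Lock()
	defer mu.Unlock()
	fmt.Printf("OnUpdateNode waited %v for the lock\n", time.Since(start))
}

func main() {
	go periodicGC(10*time.Millisecond, 50*time.Millisecond)
	time.Sleep(20 * time.Millisecond) // let the GC start ticking
	onUpdateNode()                    // may be delayed by a full GC iteration
}
```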

Since the endpoint gc is not useful in the controlplane tests, I think it should be disabled to (hopefully) mitigate this flake; we should then verify that the current timeout set in the test is still adequate.

Marcel Zięba, I experimented again with the test and I’d like to recap what is happening, to be 100% sure that I’m getting it right.

Yup, you are 100% right.

Regarding the solution, it seems quite simple: we just have to change ciliumNodeUpdater.OnAddNode to call u.updateCiliumNode(newNode), just like we already do in ciliumNodeUpdater.OnUpdateNode.

Sounds good. I would probably also take a look at the other subscribers to NodeChain to make sure they don’t mind missing events. For example, endpointManager also ignores the Add event and might be missing label changes…
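
For reference, the change proposed above would roughly take the shape below. This is only a sketch: the receiver, the subscriber signatures (assumed here to take a *slim_corev1.Node and a *lock.StoppableWaitGroup), and the updateCiliumNode helper are inferred from the discussion and may not match the tree exactly.

```go
// Sketch of the proposed fix: stop ignoring the Add event and upsert the
// CiliumNode from it, mirroring what OnUpdateNode already does.
func (u *ciliumNodeUpdater) OnAddNode(newNode *slim_corev1.Node, swg *lock.StoppableWaitGroup) error {
	// Previously a no-op: if the Add is the only event ever delivered (for
	// example because Add and Update were coalesced), the labels are lost.
	u.updateCiliumNode(newNode)
	return nil
}

// Existing behaviour, shown for comparison.
func (u *ciliumNodeUpdater) OnUpdateNode(oldNode, newNode *slim_corev1.Node, swg *lock.StoppableWaitGroup) error {
	u.updateCiliumNode(newNode)
	return nil
}
```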

The good news is that I was able to reproduce this issue locally and reliably; the bad news is that it’s a true race condition in the agent.

To simulate it locally, add time.Sleep(time.Second * 1) after this line: https://github.com/cilium/cilium/blob/dd06feb35e95b6ba7f9053998ecf3f78fc5eed29/pkg/k8s/resource/resource.go#L479

What is happening: the informer store is updated asynchronously.

For the initial update of the CiliumNode, we get the node from the informer store here:
https://github.com/cilium/cilium/blob/12ea4bbfe14ca5b00f5124b5782c97a7d9fe20f4/pkg/nodediscovery/nodediscovery.go#L352

The NodeChain, in turn, gets its first version of the Node from the first Upsert event here:
https://github.com/cilium/cilium/blob/fcc7532fa694465227821e87bc614f91658b58de/pkg/k8s/watchers/node.go#L93

That version of the node comes from another call to the informer store here (but the informer store is updated asynchronously!):
https://github.com/cilium/cilium/blob/dd06feb35e95b6ba7f9053998ecf3f78fc5eed29/pkg/k8s/resource/resource.go#L480

So the initial update of the CiliumNode and this control loop can have different views of what the initial version of the node is, and we can miss some updates.
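
A toy illustration of that race (not cilium code; a map guarded by a mutex stands in for the informer store, and the two reads stand in for the initial CiliumNode upsert and the node watcher):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// store is a stand-in for the informer store, which is updated asynchronously.
type store struct {
	mu     sync.Mutex
	labels map[string]string
}

func (s *store) get() map[string]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]string, len(s.labels))
	for k, v := range s.labels {
		out[k] = v
	}
	return out
}

func (s *store) set(labels map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.labels = labels
}

func main() {
	s := &store{labels: map[string]string{"kubernetes.io/hostname": "cilium-nodes-worker"}}

	// The initial CiliumNode upsert reads the node early.
	initial := s.get()

	// Meanwhile the informer applies a label update asynchronously.
	go func() {
		time.Sleep(10 * time.Millisecond)
		s.set(map[string]string{
			"kubernetes.io/hostname": "cilium-nodes-worker",
			"test-label":             "test-value",
		})
	}()

	// The watcher reads the store only when it emits its first event.
	// Depending on timing it may already see the new labels; in that case it
	// never emits a separate update for them, and any subscriber that only
	// reacts to updates misses the label change.
	time.Sleep(20 * time.Millisecond)
	firstEvent := s.get()

	fmt.Println("initial upsert saw:", initial)
	fmt.Println("first event saw:   ", firstEvent)
}
```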

My guess of what might be happening here is:

  • We do not see Host endpoint not found in the logs of failed runs; this most likely indicates that the OnUpdateNode event never happened, IMHO.
  • Both endpointManager and NewCiliumNodeUpdater ignore NodeAdd events and only handle NodeUpdate events.
  • It’s worth mentioning that with the new Resource[T] abstraction we might be joining multiple events into one: for example, NodeAdd -> NodeUpdate might become a single NodeAdd event (sketched below).

This makes me think it might be a true issue and not just a test flake.
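
To make the coalescing point concrete, here is a toy sketch (not the actual Resource[T] implementation) of how keyed event coalescing can turn an Add followed by an Update into a single Add that already carries the final state:

```go
package main

import "fmt"

// event is a stand-in for a keyed work-queue entry.
type event struct {
	kind string // "Add" or "Update"
	obj  string // object state at delivery time
}

// coalesce merges all pending events for one key into a single event: the
// kind of the first undelivered event is kept, but the latest state wins.
func coalesce(pending []event) event {
	out := pending[0]
	for _, e := range pending[1:] {
		out.obj = e.obj
	}
	return out
}

func main() {
	// The node is added and then relabelled before the subscriber is served.
	pending := []event{
		{kind: "Add", obj: "node without test-label"},
		{kind: "Update", obj: "node with test-label:test-value"},
	}
	fmt.Printf("subscriber sees a single event: %+v\n", coalesce(pending))
	// A subscriber that ignores Add events never learns about the new label.
}
```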