CI: FAIL: TestControlPlane/CiliumNodes/v1.24
-- FAIL: TestControlPlane (20.56s)
--- pass: testcontrolplane/cnpstatusnodesgc (1.53s)
--- pass: testcontrolplane/localnodestore (0.43s)
--- pass: testcontrolplane/nodehandler (0.48s)
--- fail: testcontrolplane/ciliumnodes (11.75s)
--- fail: testcontrolplane/ciliumnodes/v1.24 (10.40s)
--- pass: testcontrolplane/ciliumnodes/v1.25 (0.67s)
--- pass: testcontrolplane/ciliumnodes/v1.26 (0.68s)
--- pass: testcontrolplane/services/dualstack (1.44s)
--- pass: testcontrolplane/services/dualstack/v1.24 (0.52s)
--- pass: testcontrolplane/services/dualstack/v1.25 (0.41s)
--- pass: testcontrolplane/services/dualstack/v1.26 (0.52s)
--- pass: testcontrolplane/services/gracefultermination (0.62s)
--- pass: testcontrolplane/services/nodeport (4.32s)
--- pass: testcontrolplane/services/nodeport/v1.24/nodeport-control-plane (0.51s)
--- pass: testcontrolplane/services/nodeport/v1.24/nodeport-worker (0.52s)
--- pass: testcontrolplane/services/nodeport/v1.24/nodeport-worker2 (0.41s)
--- pass: testcontrolplane/services/nodeport/v1.25/nodeport-control-plane (0.50s)
--- pass: testcontrolplane/services/nodeport/v1.25/nodeport-worker (0.50s)
--- pass: testcontrolplane/services/nodeport/v1.25/nodeport-worker2 (0.42s)
--- pass: testcontrolplane/services/nodeport/v1.26/nodeport-control-plane (0.52s)
--- pass: testcontrolplane/services/nodeport/v1.26/nodeport-worker (0.44s)
--- pass: testcontrolplane/services/nodeport/v1.26/nodeport-worker2 (0.50s)
fail
github.com/cilium/cilium/test/controlplane coverage: 100.0% of statements
fail github.com/cilium/cilium/test/controlplane 20.811s
--- fail: testcontrolplane (20.56s)
--- fail: testcontrolplane/ciliumnodes (11.75s)
--- fail: testcontrolplane/ciliumnodes/v1.24 (10.40s)
level=warning msg="running cilium with \"kvstore\"=\"\" requires identity allocation via crds. changing identity-allocation-mode to \"crd\"" subsys=config
level=info msg=invoked duration="984.666µs" function="metrics.newregistry (registry.go:65)" subsys=hive
level=info msg=invoked duration=129.470369ms function="cmd.glob..func4 (daemon_main.go:1574)" subsys=hive
level=info msg="spire delegate api client is disabled as no socket path is configured" subsys=spire-delegate
level=info msg="mutual authentication handler is disabled as no port is configured" subsys=auth
level=info msg=invoked duration="668.229µs" function="auth.newmanager (cell.go:87)" subsys=hive
level=info msg=invoked duration="5.062µs" function="suite.startciliumagent.func8 (agent.go:95)" subsys=hive
level=info msg=starting subsys=hive
level=info msg="start hook executed" duration="2.076µs" function="metrics.newregistry.func1 (registry.go:86)" subsys=hive
level=info msg="start hook executed" duration="2.962µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.node].start" subsys=hive
level=info msg="using autogenerated ipv4 allocation range" subsys=node v4prefix=10.148.0.0/16
level=info msg="serving prometheus metrics on :9962" subsys=metrics
level=info msg="start hook executed" duration="512.899µs" function="node.newlocalnodestore.func1 (local_node_store.go:77)" subsys=hive
level=info msg="start hook executed" duration="1.625µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.service].start" subsys=hive
level=info msg="start hook executed" duration=100.182304ms function="*manager.diffstore[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.service].start" subsys=hive
level=info msg="start hook executed" duration="1.527µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s.endpoints].start" subsys=hive
level=info msg="using discoveryv1.endpointslice" subsys=k8s
level=info msg="start hook executed" duration=100.669833ms function="*manager.diffstore[*github.com/cilium/cilium/pkg/k8s.endpoints].start" subsys=hive
level=info msg="start hook executed" duration="2.551µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumnode].start" subsys=hive
level=info msg="start hook executed" duration=952ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.pod].start" subsys=hive
level=info msg="start hook executed" duration="1.034µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.namespace].start" subsys=hive
level=info msg="start hook executed" duration=952ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumnetworkpolicy].start" subsys=hive
level=info msg="start hook executed" duration="1.305µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumclusterwidenetworkpolicy].start" subsys=hive
level=info msg="start hook executed" duration=878ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2alpha1.ciliumcidrgroup].start" subsys=hive
level=info msg="start hook executed" duration="18.945µs" function="endpointmanager.newdefaultendpointmanager.func1 (cell.go:185)" subsys=hive
level=info msg="start hook executed" duration="10.675µs" function="cmd.newpolicytrifecta.func1 (policy.go:127)" subsys=hive
level=info msg="start hook executed" duration="102.284µs" function="*manager.manager.start" subsys=hive
level=info msg="no eventsmap: monitor works only for agent events." subsys=monitor-agent
level=info msg="start hook executed" duration="5.67µs" function="agent.newmonitoragent.func1 (cell.go:62)" subsys=hive
level=info msg="start hook executed" duration=714ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2alpha1.ciliuml2announcementpolicy].start" subsys=hive
level=info msg="start hook executed" duration="5.99µs" function="*job.group.start" subsys=hive
level=info msg="auto-enabling \"enable-node-port\", \"enable-external-ips\", \"bpf-lb-sock\", \"enable-host-port\", \"enable-session-affinity\" features" subsys=daemon
level=warning msg="unable to automatically detect mtu" error="no interface contains the provided ip: 172.18.0.3" subsys=mtu
level=info msg="l7 proxies are disabled" subsys=daemon
level=warning msg="no valid cgroup base path found: socket load-balancing tracing with hubble will not work.see the kubeproxy-free guide for more details." subsys=cgroup-manager
level=info msg="reading old endpoints..." subsys=daemon
level=info msg="no old endpoints found." subsys=daemon
level=info msg="retrieved node information from kubernetes node" nodename=cilium-nodes-worker subsys=k8s
level=info msg="received own node information from api server" ipaddr.ipv4=172.18.0.3 ipaddr.ipv6="<nil>" k8snodeip=172.18.0.3 labels="map[cilium.io/ci-node:k8s1 kubernetes.io/arch:amd64 kubernetes.io/hostname:cilium-nodes-worker kubernetes.io/os:linux]" nodename=cilium-nodes-worker subsys=k8s v4prefix=10.244.1.0/24 v6prefix="<nil>"
level=info msg="k8s mode: allowing localhost to reach local endpoints" subsys=daemon
level=info msg="detected devices" devices="[ens5]" subsys=linux-datapath
level=info msg="enabling k8s event listener" subsys=k8s-watcher
level=info msg="removing stale endpoint interfaces" subsys=daemon
level=info msg="skipping kvstore configuration" subsys=daemon
level=info msg="initializing node addressing" subsys=daemon
level=info msg="initializing kubernetes ipam" subsys=ipam v4prefix=1.1.1.0/24 v6prefix="cafe::/96"
level=info msg="restoring endpoints..." subsys=daemon
level=info msg="endpoints restored" failed=0 restored=0 subsys=daemon
level=info msg="addressing information:" subsys=daemon
level=info msg=" cluster-name: default" subsys=daemon
level=info msg=" cluster-id: 0" subsys=daemon
level=info msg=" local node-name: cilium-nodes-worker" subsys=daemon
level=info msg=" node-ipv6: <nil>" subsys=daemon
level=info msg=" external-node ipv4: 172.18.0.3" subsys=daemon
level=info msg=" internal-node ipv4: 1.1.1.2" subsys=daemon
level=info msg=" ipv4 allocation prefix: 10.244.1.0/24" subsys=daemon
level=info msg=" loopback ipv4: 169.254.42.1" subsys=daemon
level=info msg=" local ipv4 addresses:" subsys=daemon
level=info msg=" - 10.0.0.2" subsys=daemon
level=info msg=" - 10.0.0.3" subsys=daemon
level=info msg=" - 10.0.0.4" subsys=daemon
level=info msg="creating or updating ciliumnode resource" node=cilium-nodes-worker subsys=nodediscovery
level=info msg="initializing identity allocator" subsys=identity-cache
level=info msg="allocating identities between range" cluster-id=0 max=65535 min=256 subsys=identity-cache
level=info msg="start hook executed" duration=2.992882ms function="cmd.newdaemonpromise.func1 (daemon_main.go:1626)" subsys=hive
level=info msg="start hook executed" duration="3.102µs" function="signal.providesignalmanager.func1 (cell.go:26)" subsys=hive
level=info msg="start hook executed" duration="1.469µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumidentity].start" subsys=hive
level=info msg="start hook executed" duration=935ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumnode].start" subsys=hive
level=info msg="start hook executed" duration="3.651µs" function="auth.newmanager.func1 (cell.go:92)" subsys=hive
level=info msg="start hook executed" duration="20.078µs" function="*job.group.start" subsys=hive
level=info msg="waiting until local node addressing before starting watchers depending on it" subsys=k8s-watcher
level=info msg="waiting until all pre-existing resources have been received" subsys=k8s-watcher
level=info msg="adding local node to cluster" node="{cilium-nodes-worker default [{internalip 172.18.0.3} {ciliuminternalip 1.1.1.2}] 10.244.1.0/24 [] <nil> [] 1.1.1.205 <nil> <nil> <nil> 0 local 0 map[cilium.io/ci-node:k8s1 kubernetes.io/arch:amd64 kubernetes.io/hostname:cilium-nodes-worker kubernetes.io/os:linux] map[] 1 }" subsys=nodediscovery
level=info msg="datapath signal listener running" subsys=signal
testcase.go:302: ciliumnode labels mismatch, expected: map[cilium.io/ci-node:k8s1 kubernetes.io/arch:amd64 kubernetes.io/hostname:cilium-nodes-worker kubernetes.io/os:linux], found: map[cilium.io/ci-node:k8s1 kubernetes.io/arch:amd64 kubernetes.io/hostname:cilium-nodes-worker kubernetes.io/os:linux test-label:test-value]
- Copy-paste the link of the CI build where that test failure happened.
Unfortunately, the test was re-run, so the link is no longer available.
I’ve tried to reproduce this locally with no luck, so I tried to understand the culprit by comparing the logs from a successful and a failed run.
In every log (both successful and unsuccessful) an initial occurrence of the following log entry can be seen:
level=info msg="Creating or updating CiliumNode resource" node=cilium-nodes-worker subsys=nodediscoveryThis initial CiliumNode resource upsert is performed here:
https://github.com/cilium/cilium/blob/e0931df324592358d4645a9f9a31ca87aeddaf70/daemon/cmd/daemon.go#L874
Since that call is blocking, this initial step should not be related to the test failure, as it must complete before returning from `newDaemon`.

However, continuing with the analysis of the logs, in a successful run other occurrences like that one appear:
While for a failed run the `Host endpoint not found` log is not printed (and so neither is the other one).

The first log entry comes from the `(*endpointManager).OnUpdateNode` callback, which is registered just before the `(*ciliumNodeUpdater).OnUpdateNode` one: https://github.com/cilium/cilium/blob/e0931df324592358d4645a9f9a31ca87aeddaf70/daemon/cmd/daemon.go#L670-L684
and the callbacks are executed in the same order in which they are registered. So, if the `endpointManager` callback blocks or slows down, as the missing `Host endpoint not found` log suggests, the `ciliumNodeUpdater` one cannot run, ultimately leading to the failure.

So I tried to see whether there is a sequence of events that could possibly slow down the execution of the `endpointManager` callback. This lock might be related:
https://github.com/cilium/cilium/blob/c1dffab254d7864edcb75aa5db159c06c7fa3381/pkg/endpointmanager/host.go#L16
since it is held by the periodic endpoint GC that is enabled in the controlplane tests:
https://github.com/cilium/cilium/blob/6decd65279aea25a909ee418b7251b7bcacd994d/pkg/endpointmanager/gc.go#L30
Though there should be no endpoints in the endpoint manager map during controlplane testing (and so each iteration of the GC should be very fast), the unpredictability of the CI environment may have led to the flake.
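To make the suspected interaction more concrete, here is a minimal, self-contained Go sketch (the names and timings are made up for illustration; this is not Cilium code) of how a periodic GC iteration that holds a shared mutex can delay a node-update callback needing the same mutex, and with it every subscriber registered after that callback:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hostEndpointMu stands in for the lock in pkg/endpointmanager/host.go:
// both the periodic GC and the node-update callback need it.
var hostEndpointMu sync.Mutex

// periodicGC models the endpoint GC loop: on every tick it takes the lock,
// performs its (normally very fast) sweep, and releases it.
func periodicGC(interval, sweep time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			hostEndpointMu.Lock()
			time.Sleep(sweep) // a slow CI node can make this non-negligible
			hostEndpointMu.Unlock()
		}
	}
}

// onUpdateNode models the endpointManager callback: it must take the same
// lock before the next subscriber (the CiliumNode updater) gets to run.
func onUpdateNode() {
	hostEndpointMu.Lock()
	defer hostEndpointMu.Unlock()
	fmt.Println("host endpoint handled, next subscriber may run")
}

func main() {
	stop := make(chan struct{})
	go periodicGC(10*time.Millisecond, 50*time.Millisecond, stop)
	time.Sleep(20 * time.Millisecond) // let a GC iteration grab the lock first

	start := time.Now()
	onUpdateNode() // blocked until the GC iteration releases the lock
	fmt.Printf("node callback delayed by %v\n", time.Since(start))
	close(stop)
}
```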
Since the endpoint GC is not useful in the controlplane tests, I think it should be disabled to (hopefully) mitigate this flake, and then we should verify that the current timeout set in the test is still adequate.
Yup, you are 100% right.
Sounds good. I would probably also take a look at other subscribers to NodeChain to make sure they don’t mind missing events. For example, `endpointManager` also ignores the Add event and might be missing label changes…

The good news is I was able to reproduce this issue locally and reliably; the bad news is that it’s a true race condition in the agent.
To simulate it locally, add `time.Sleep(time.Second * 1)` after this line: https://github.com/cilium/cilium/blob/dd06feb35e95b6ba7f9053998ecf3f78fc5eed29/pkg/k8s/resource/resource.go#L479

What is happening: the informer store is updated asynchronously. For the initial update of the CiliumNode, we get it from the informer store here: https://github.com/cilium/cilium/blob/12ea4bbfe14ca5b00f5124b5782c97a7d9fe20f4/pkg/nodediscovery/nodediscovery.go#L352

The NodeChain is implemented in a way that it first gets the initial version of the Node from the first Upsert event here: https://github.com/cilium/cilium/blob/fcc7532fa694465227821e87bc614f91658b58de/pkg/k8s/watchers/node.go#L93

That version of the node comes from another call to the informer store here (but the informer store is updated asynchronously!): https://github.com/cilium/cilium/blob/dd06feb35e95b6ba7f9053998ecf3f78fc5eed29/pkg/k8s/resource/resource.go#L480

So basically the initial update of the CiliumNode and this control loop can have different views of what the initial version of the node is, and we can miss some updates.
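Here is a minimal, self-contained Go sketch of that race (the store and consumer names are made up for illustration; this is not the actual resource/informer code): two consumers reading an asynchronously updated store can end up with different views of the "initial" object, so the change between those two views is never replayed as an update:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// store is a stand-in for the informer's local store, which is filled
// asynchronously by the watch/reflector goroutine.
type store struct {
	mu     sync.RWMutex
	labels map[string]string
}

func (s *store) get() map[string]string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.labels
}

func (s *store) set(l map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.labels = l
}

func main() {
	s := &store{labels: map[string]string{"kubernetes.io/hostname": "worker"}}

	// The "reflector": the label update lands in the store slightly later,
	// mimicking the asynchronous store update (or the injected time.Sleep).
	go func() {
		time.Sleep(50 * time.Millisecond)
		s.set(map[string]string{
			"kubernetes.io/hostname": "worker",
			"test-label":             "test-value",
		})
	}()

	// Consumer 1: the initial CiliumNode upsert reads the store right away.
	initial := s.get()

	// Consumer 2: the control loop reads the store again when handling its
	// "first" event, but only after the async update has landed, so it takes
	// the already-updated object as the baseline and never emits an Update
	// for the label change.
	time.Sleep(100 * time.Millisecond)
	first := s.get()

	fmt.Println("initial upsert saw:      ", initial)
	fmt.Println("control loop's baseline: ", first)
}
```

With different interleavings the two reads can agree or disagree, which is exactly the kind of nondeterminism that shows up as a flake.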
My guess of what might be happening here is:
- There is no `Host endpoint not found` in the logs of failed runs - this most likely indicates that the `OnUpdateNode` event might have never happened, IMHO.
- `endpointManager` and `NewCiliumNodeUpdater` ignore `NodeAdd` events and only handle `NodeUpdate` events.

This makes me think it might be a true issue and not just a test flake.
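To illustrate the second point, here is a minimal, self-contained Go sketch (the types and names are invented for illustration, not Cilium's actual handler interfaces) of how a subscriber that ignores Add events and only reacts to Update events misses a label change that is already folded into the object delivered with the Add:

```go
package main

import "fmt"

// Node is a tiny stand-in for the node object carried by the handler chain.
type Node struct {
	Name   string
	Labels map[string]string
}

// subscriber mirrors the shape of a node-event handler with separate hooks.
type subscriber interface {
	OnAddNode(n Node)
	OnUpdateNode(oldNode, newNode Node)
}

// labelMirror only reacts to updates, like the subscribers mentioned above
// that ignore NodeAdd events.
type labelMirror struct{ seen map[string]string }

var _ subscriber = (*labelMirror)(nil)

func (l *labelMirror) OnAddNode(n Node) {} // Add is ignored on purpose

func (l *labelMirror) OnUpdateNode(oldNode, newNode Node) {
	l.seen = newNode.Labels
}

func main() {
	m := &labelMirror{}

	// If the label change is already folded into the object delivered with
	// the Add event, and no further Update ever arrives...
	m.OnAddNode(Node{
		Name:   "cilium-nodes-worker",
		Labels: map[string]string{"test-label": "test-value"},
	})

	// ...an update-only subscriber never observes the labels at all.
	fmt.Printf("labels seen by update-only subscriber: %v\n", m.seen)
}
```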