CI: FAIL: TestControlPlane/CiliumNodes/v1.24
-- FAIL: TestControlPlane (20.56s)
--- pass: testcontrolplane/cnpstatusnodesgc (1.53s)
--- pass: testcontrolplane/localnodestore (0.43s)
--- pass: testcontrolplane/nodehandler (0.48s)
--- fail: testcontrolplane/ciliumnodes (11.75s)
--- fail: testcontrolplane/ciliumnodes/v1.24 (10.40s)
--- pass: testcontrolplane/ciliumnodes/v1.25 (0.67s)
--- pass: testcontrolplane/ciliumnodes/v1.26 (0.68s)
--- pass: testcontrolplane/services/dualstack (1.44s)
--- pass: testcontrolplane/services/dualstack/v1.24 (0.52s)
--- pass: testcontrolplane/services/dualstack/v1.25 (0.41s)
--- pass: testcontrolplane/services/dualstack/v1.26 (0.52s)
--- pass: testcontrolplane/services/gracefultermination (0.62s)
--- pass: testcontrolplane/services/nodeport (4.32s)
--- pass: testcontrolplane/services/nodeport/v1.24/nodeport-control-plane (0.51s)
--- pass: testcontrolplane/services/nodeport/v1.24/nodeport-worker (0.52s)
--- pass: testcontrolplane/services/nodeport/v1.24/nodeport-worker2 (0.41s)
--- pass: testcontrolplane/services/nodeport/v1.25/nodeport-control-plane (0.50s)
--- pass: testcontrolplane/services/nodeport/v1.25/nodeport-worker (0.50s)
--- pass: testcontrolplane/services/nodeport/v1.25/nodeport-worker2 (0.42s)
--- pass: testcontrolplane/services/nodeport/v1.26/nodeport-control-plane (0.52s)
--- pass: testcontrolplane/services/nodeport/v1.26/nodeport-worker (0.44s)
--- pass: testcontrolplane/services/nodeport/v1.26/nodeport-worker2 (0.50s)
fail
github.com/cilium/cilium/test/controlplane coverage: 100.0% of statements
fail github.com/cilium/cilium/test/controlplane 20.811s
--- fail: testcontrolplane (20.56s)
--- fail: testcontrolplane/ciliumnodes (11.75s)
--- fail: testcontrolplane/ciliumnodes/v1.24 (10.40s)
level=warning msg="running cilium with \"kvstore\"=\"\" requires identity allocation via crds. changing identity-allocation-mode to \"crd\"" subsys=config
level=info msg=invoked duration="984.666µs" function="metrics.newregistry (registry.go:65)" subsys=hive
level=info msg=invoked duration=129.470369ms function="cmd.glob..func4 (daemon_main.go:1574)" subsys=hive
level=info msg="spire delegate api client is disabled as no socket path is configured" subsys=spire-delegate
level=info msg="mutual authentication handler is disabled as no port is configured" subsys=auth
level=info msg=invoked duration="668.229µs" function="auth.newmanager (cell.go:87)" subsys=hive
level=info msg=invoked duration="5.062µs" function="suite.startciliumagent.func8 (agent.go:95)" subsys=hive
level=info msg=starting subsys=hive
level=info msg="start hook executed" duration="2.076µs" function="metrics.newregistry.func1 (registry.go:86)" subsys=hive
level=info msg="start hook executed" duration="2.962µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.node].start" subsys=hive
level=info msg="using autogenerated ipv4 allocation range" subsys=node v4prefix=10.148.0.0/16
level=info msg="serving prometheus metrics on :9962" subsys=metrics
level=info msg="start hook executed" duration="512.899µs" function="node.newlocalnodestore.func1 (local_node_store.go:77)" subsys=hive
level=info msg="start hook executed" duration="1.625µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.service].start" subsys=hive
level=info msg="start hook executed" duration=100.182304ms function="*manager.diffstore[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.service].start" subsys=hive
level=info msg="start hook executed" duration="1.527µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s.endpoints].start" subsys=hive
level=info msg="using discoveryv1.endpointslice" subsys=k8s
level=info msg="start hook executed" duration=100.669833ms function="*manager.diffstore[*github.com/cilium/cilium/pkg/k8s.endpoints].start" subsys=hive
level=info msg="start hook executed" duration="2.551µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumnode].start" subsys=hive
level=info msg="start hook executed" duration=952ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.pod].start" subsys=hive
level=info msg="start hook executed" duration="1.034µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.namespace].start" subsys=hive
level=info msg="start hook executed" duration=952ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumnetworkpolicy].start" subsys=hive
level=info msg="start hook executed" duration="1.305µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumclusterwidenetworkpolicy].start" subsys=hive
level=info msg="start hook executed" duration=878ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2alpha1.ciliumcidrgroup].start" subsys=hive
level=info msg="start hook executed" duration="18.945µs" function="endpointmanager.newdefaultendpointmanager.func1 (cell.go:185)" subsys=hive
level=info msg="start hook executed" duration="10.675µs" function="cmd.newpolicytrifecta.func1 (policy.go:127)" subsys=hive
level=info msg="start hook executed" duration="102.284µs" function="*manager.manager.start" subsys=hive
level=info msg="no eventsmap: monitor works only for agent events." subsys=monitor-agent
level=info msg="start hook executed" duration="5.67µs" function="agent.newmonitoragent.func1 (cell.go:62)" subsys=hive
level=info msg="start hook executed" duration=714ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2alpha1.ciliuml2announcementpolicy].start" subsys=hive
level=info msg="start hook executed" duration="5.99µs" function="*job.group.start" subsys=hive
level=info msg="auto-enabling \"enable-node-port\", \"enable-external-ips\", \"bpf-lb-sock\", \"enable-host-port\", \"enable-session-affinity\" features" subsys=daemon
level=warning msg="unable to automatically detect mtu" error="no interface contains the provided ip: 172.18.0.3" subsys=mtu
level=info msg="l7 proxies are disabled" subsys=daemon
level=warning msg="no valid cgroup base path found: socket load-balancing tracing with hubble will not work.see the kubeproxy-free guide for more details." subsys=cgroup-manager
level=info msg="reading old endpoints..." subsys=daemon
level=info msg="no old endpoints found." subsys=daemon
level=info msg="retrieved node information from kubernetes node" nodename=cilium-nodes-worker subsys=k8s
level=info msg="received own node information from api server" ipaddr.ipv4=172.18.0.3 ipaddr.ipv6="<nil>" k8snodeip=172.18.0.3 labels="map[cilium.io/ci-node:k8s1 kubernetes.io/arch:amd64 kubernetes.io/hostname:cilium-nodes-worker kubernetes.io/os:linux]" nodename=cilium-nodes-worker subsys=k8s v4prefix=10.244.1.0/24 v6prefix="<nil>"
level=info msg="k8s mode: allowing localhost to reach local endpoints" subsys=daemon
level=info msg="detected devices" devices="[ens5]" subsys=linux-datapath
level=info msg="enabling k8s event listener" subsys=k8s-watcher
level=info msg="removing stale endpoint interfaces" subsys=daemon
level=info msg="skipping kvstore configuration" subsys=daemon
level=info msg="initializing node addressing" subsys=daemon
level=info msg="initializing kubernetes ipam" subsys=ipam v4prefix=1.1.1.0/24 v6prefix="cafe::/96"
level=info msg="restoring endpoints..." subsys=daemon
level=info msg="endpoints restored" failed=0 restored=0 subsys=daemon
level=info msg="addressing information:" subsys=daemon
level=info msg=" cluster-name: default" subsys=daemon
level=info msg=" cluster-id: 0" subsys=daemon
level=info msg=" local node-name: cilium-nodes-worker" subsys=daemon
level=info msg=" node-ipv6: <nil>" subsys=daemon
level=info msg=" external-node ipv4: 172.18.0.3" subsys=daemon
level=info msg=" internal-node ipv4: 1.1.1.2" subsys=daemon
level=info msg=" ipv4 allocation prefix: 10.244.1.0/24" subsys=daemon
level=info msg=" loopback ipv4: 169.254.42.1" subsys=daemon
level=info msg=" local ipv4 addresses:" subsys=daemon
level=info msg=" - 10.0.0.2" subsys=daemon
level=info msg=" - 10.0.0.3" subsys=daemon
level=info msg=" - 10.0.0.4" subsys=daemon
level=info msg="creating or updating ciliumnode resource" node=cilium-nodes-worker subsys=nodediscovery
level=info msg="initializing identity allocator" subsys=identity-cache
level=info msg="allocating identities between range" cluster-id=0 max=65535 min=256 subsys=identity-cache
level=info msg="start hook executed" duration=2.992882ms function="cmd.newdaemonpromise.func1 (daemon_main.go:1626)" subsys=hive
level=info msg="start hook executed" duration="3.102µs" function="signal.providesignalmanager.func1 (cell.go:26)" subsys=hive
level=info msg="start hook executed" duration="1.469µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumidentity].start" subsys=hive
level=info msg="start hook executed" duration=935ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.ciliumnode].start" subsys=hive
level=info msg="start hook executed" duration="3.651µs" function="auth.newmanager.func1 (cell.go:92)" subsys=hive
level=info msg="start hook executed" duration="20.078µs" function="*job.group.start" subsys=hive
level=info msg="waiting until local node addressing before starting watchers depending on it" subsys=k8s-watcher
level=info msg="waiting until all pre-existing resources have been received" subsys=k8s-watcher
level=info msg="adding local node to cluster" node="{cilium-nodes-worker default [{internalip 172.18.0.3} {ciliuminternalip 1.1.1.2}] 10.244.1.0/24 [] <nil> [] 1.1.1.205 <nil> <nil> <nil> 0 local 0 map[cilium.io/ci-node:k8s1 kubernetes.io/arch:amd64 kubernetes.io/hostname:cilium-nodes-worker kubernetes.io/os:linux] map[] 1 }" subsys=nodediscovery
level=info msg="datapath signal listener running" subsys=signal
testcase.go:302: ciliumnode labels mismatch, expected: map[cilium.io/ci-node:k8s1 kubernetes.io/arch:amd64 kubernetes.io/hostname:cilium-nodes-worker kubernetes.io/os:linux], found: map[cilium.io/ci-node:k8s1 kubernetes.io/arch:amd64 kubernetes.io/hostname:cilium-nodes-worker kubernetes.io/os:linux test-label:test-value]
- Copy-paste the link of the CI build where that test failure happened.
Unfortunately, the test was re-run, so the link is no longer available.
I’ve tried to reproduce this locally with no luck, so I tried to understand the culprit by comparing the logs from a successful and a failed run.
In every log (both successful and unsuccessful) an initial occurrence of the following log entry can be seen:
level=info msg="Creating or updating CiliumNode resource" node=cilium-nodes-worker subsys=nodediscoveryThis initial CiliumNode resource upsert is performed here:
https://github.com/cilium/cilium/blob/e0931df324592358d4645a9f9a31ca87aeddaf70/daemon/cmd/daemon.go#L874
Since that call is blocking, this initial step should not be related to the test failure, as it must complete before returning from `newDaemon`.

However, continuing with the analysis of the logs, in a successful run other occurrences like that one appear:
While for a failed run the `Host endpoint not found` log is not printed (and so neither is the other one).

The first log entry comes from the `(*endpointManager).OnUpdateNode` callback, which is registered just before the `(*ciliumNodeUpdater).OnUpdateNode` one: https://github.com/cilium/cilium/blob/e0931df324592358d4645a9f9a31ca87aeddaf70/daemon/cmd/daemon.go#L670-L684
and the callbacks are executed in the same order in which they are registered. So, if the `endpointManager` callback blocks or slows down, as the missing `Host endpoint not found` log suggests, the `ciliumNodeUpdater` one cannot run, ultimately leading to the failure.

So I tried to see whether there is a sequence of events that could possibly slow down the execution of the `endpointManager` callback. This lock might be related:
https://github.com/cilium/cilium/blob/c1dffab254d7864edcb75aa5db159c06c7fa3381/pkg/endpointmanager/host.go#L16
since it is held by the periodic endpoint GC that is enabled in the controlplane tests:
https://github.com/cilium/cilium/blob/6decd65279aea25a909ee418b7251b7bcacd994d/pkg/endpointmanager/gc.go#L30
Though there should be no endpoints in the endpoint manager map during controlplane testing (and so each iteration of the GC should be very fast), the unpredictability of the CI environment may have led to the flake.
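To make the suspected interaction more concrete, here is a minimal, self-contained Go sketch (the names and timings are made up for illustration; this is not Cilium code) of how a periodic GC iteration that holds a shared mutex can delay a node-update callback needing the same mutex, and with it every subscriber registered after that callback:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hostEndpointMu stands in for the lock in pkg/endpointmanager/host.go:
// both the periodic GC and the node-update callback need it.
var hostEndpointMu sync.Mutex

// periodicGC models the endpoint GC loop: on every tick it takes the lock,
// performs its (normally very fast) sweep, and releases it.
func periodicGC(interval, sweep time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			hostEndpointMu.Lock()
			time.Sleep(sweep) // a slow CI node can make this non-negligible
			hostEndpointMu.Unlock()
		}
	}
}

// onUpdateNode models the endpointManager callback: it must take the same
// lock before the next subscriber (the CiliumNode updater) gets to run.
func onUpdateNode() {
	hostEndpointMu.Lock()
	defer hostEndpointMu.Unlock()
	fmt.Println("host endpoint handled, next subscriber may run")
}

func main() {
	stop := make(chan struct{})
	go periodicGC(10*time.Millisecond, 50*time.Millisecond, stop)
	time.Sleep(20 * time.Millisecond) // let a GC iteration grab the lock first

	start := time.Now()
	onUpdateNode() // blocked until the GC iteration releases the lock
	fmt.Printf("node callback delayed by %v\n", time.Since(start))
	close(stop)
}
```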
Since the endpoint GC is not useful in the controlplane tests, I think it should be disabled to (hopefully) mitigate this flake, and then we should verify that the current timeout set in the test is still adequate.
Yup, you are 100% right.
Sounds good. I would probably also take a look at other subscribers to NodeChain to make sure they don’t mind missing events. For example, `endpointManager` also ignores the Add event and might be missing label changes…

The good news is I was able to reproduce this issue locally and reliably; the bad news is that it’s a true race condition in the agent.
To simulate it locally, add `time.Sleep(time.Second * 1)` after this line: https://github.com/cilium/cilium/blob/dd06feb35e95b6ba7f9053998ecf3f78fc5eed29/pkg/k8s/resource/resource.go#L479

What is happening: the informer store is updated asynchronously. For the initial update of the CiliumNode, we get it from the informer store here: https://github.com/cilium/cilium/blob/12ea4bbfe14ca5b00f5124b5782c97a7d9fe20f4/pkg/nodediscovery/nodediscovery.go#L352

The NodeChain is implemented in a way that it first gets the initial version of the Node from the first Upsert event here: https://github.com/cilium/cilium/blob/fcc7532fa694465227821e87bc614f91658b58de/pkg/k8s/watchers/node.go#L93

That version of the node comes from another call to the informer store here (but the informer store is updated asynchronously!): https://github.com/cilium/cilium/blob/dd06feb35e95b6ba7f9053998ecf3f78fc5eed29/pkg/k8s/resource/resource.go#L480

So basically the initial update of the CiliumNode and this control loop can have different views of what the initial version of the node is, and we can miss some updates.
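Here is a minimal, self-contained Go sketch of that race (the store and consumer names are made up for illustration; this is not the actual resource/informer code): two consumers reading an asynchronously updated store can end up with different views of the "initial" object, so the change between those two views is never replayed as an update:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// store is a stand-in for the informer's local store, which is filled
// asynchronously by the watch/reflector goroutine.
type store struct {
	mu     sync.RWMutex
	labels map[string]string
}

func (s *store) get() map[string]string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.labels
}

func (s *store) set(l map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.labels = l
}

func main() {
	s := &store{labels: map[string]string{"kubernetes.io/hostname": "worker"}}

	// The "reflector": the label update lands in the store slightly later,
	// mimicking the asynchronous store update (or the injected time.Sleep).
	go func() {
		time.Sleep(50 * time.Millisecond)
		s.set(map[string]string{
			"kubernetes.io/hostname": "worker",
			"test-label":             "test-value",
		})
	}()

	// Consumer 1: the initial CiliumNode upsert reads the store right away.
	initial := s.get()

	// Consumer 2: the control loop reads the store again when handling its
	// "first" event, but only after the async update has landed, so it takes
	// the already-updated object as the baseline and never emits an Update
	// for the label change.
	time.Sleep(100 * time.Millisecond)
	first := s.get()

	fmt.Println("initial upsert saw:      ", initial)
	fmt.Println("control loop's baseline: ", first)
}
```

With different interleavings the two reads can agree or disagree, which is exactly the kind of nondeterminism that shows up as a flake.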
My guess of what might be happening here is:
- There is no `Host endpoint not found` in the logs of failed runs - this most likely indicates that the `OnUpdateNode` event might have never happened, IMHO.
- `endpointManager` and `NewCiliumNodeUpdater` ignore `NodeAdd` events and only handle `NodeUpdate` events.

This makes me think it might be a true issue and not just a test flake.
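To illustrate the second point, here is a minimal, self-contained Go sketch (the types and names are invented for illustration, not Cilium's actual handler interfaces) of how a subscriber that ignores Add events and only reacts to Update events misses a label change that is already folded into the object delivered with the Add:

```go
package main

import "fmt"

// Node is a tiny stand-in for the node object carried by the handler chain.
type Node struct {
	Name   string
	Labels map[string]string
}

// subscriber mirrors the shape of a node-event handler with separate hooks.
type subscriber interface {
	OnAddNode(n Node)
	OnUpdateNode(oldNode, newNode Node)
}

// labelMirror only reacts to updates, like the subscribers mentioned above
// that ignore NodeAdd events.
type labelMirror struct{ seen map[string]string }

var _ subscriber = (*labelMirror)(nil)

func (l *labelMirror) OnAddNode(n Node) {} // Add is ignored on purpose

func (l *labelMirror) OnUpdateNode(oldNode, newNode Node) {
	l.seen = newNode.Labels
}

func main() {
	m := &labelMirror{}

	// If the label change is already folded into the object delivered with
	// the Add event, and no further Update ever arrives...
	m.OnAddNode(Node{
		Name:   "cilium-nodes-worker",
		Labels: map[string]string{"test-label": "test-value"},
	})

	// ...an update-only subscriber never observes the labels at all.
	fmt.Printf("labels seen by update-only subscriber: %v\n", m.seen)
}
```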