cilium: CI: ip-identity watcher stopped receiving events, causing egress policy test failures
Looking at the CI run out put here: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/2839/testReport/junit/(root)/runtime/RuntimeValidatedPolicies_L3_L4_Checks/
monitor.log shows (showing only relevant lines):
[2018-04-30 22:10:30] CPU 01: MARK 0xd943dbd1 FROM 17990 DEBUG: Failed to map daddr=10.15.157.163 to identity
[2018-04-30 22:10:31] CPU 01: MARK 0xd1f1055e FROM 17990 DEBUG: Failed to map daddr=10.15.157.163 to identity
[2018-04-30 22:10:33] CPU 01: MARK 0x6def8c8 FROM 17990 DEBUG: Failed to map daddr=10.15.157.163 to identity
[2018-04-30 22:10:37] CPU 01: MARK 0x7243a868 FROM 17990 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:38] CPU 01: MARK 0xb1e679db FROM 17990 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:40] CPU 01: MARK 0xbb1aa16e FROM 17990 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:42] CPU 01: MARK 0xc601188 FROM 17990 DEBUG: Failed to map daddr.p4=[::0:f426] to identity
[2018-04-30 22:10:43] CPU 01: MARK 0x68a2275c FROM 17990 DEBUG: Failed to map daddr.p4=[::0:f426] to identity
[2018-04-30 22:10:45] CPU 01: MARK 0xca8dbcd9 FROM 17990 DEBUG: Failed to map daddr.p4=[::0:f426] to identity
[2018-04-30 22:10:48] CPU 01: MARK 0xe9251566 FROM 17607 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:49] CPU 01: MARK 0x5daf9c85 FROM 17607 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:51] CPU 01: MARK 0x9a98f2ba FROM 17607 DEBUG: Failed to map daddr=10.15.83.26 to identity
Both the IPv4 (10.15.83.26) and IPv6 (::0:f426 should match f00d::a0f:0:0:f426?) belong to the container:id.httpd2 (from test-output.log):
17607 Disabled Enabled 49868 container:id.app3 f00d::a0f:0:0:44c7 10.15.195.104 ready
17990 Disabled Enabled 10211 container:id.app2 f00d::a0f:0:0:4646 10.15.106.144 ready
18544 Enabled Disabled 37841 container:id.httpd1 f00d::a0f:0:0:4870 10.15.157.163 ready
container:id.service1
28441 Disabled Disabled 61150 container:id.app1 f00d::a0f:0:0:6f19 10.15.77.25 ready
29898 Disabled Disabled 4319 reserved:health f00d::a0f:0:0:74ca 10.15.242.54 ready
49896 Disabled Disabled 37971 container:id.httpd3 f00d::a0f:0:0:c2e8 10.15.152.195 ready
container:id.service1
62502 Enabled Disabled 25083 container:id.httpd2 f00d::a0f:0:0:f426 10.15.83.26 ready
container:id.service1
And indeed, bpf ipcache does not have entry for these IPs (from cilium-bpf-ipcache-list.md):
IP PREFIX/ADDRESS IDENTITY
f00d::a0f:0:0:1150 44408
10.15.202.84 44408
fc00::10ca:1 1
10.15.28.238 1
f00d::a0f:0:0:74ca 4319
10.0.2.15 1
10.15.242.54 4319
f00d::a0f:0:0:8ad6 1
But then the bpf ipcache seems to be missing all other endpoints as well.
I did not know how to interpret kvstore_get.txt, but it contains this one line matching the IPv4 address of httpd2 that can’t be resolved to an ID:
cilium/state/ip/v1/default/10.15.83.26 => {"IP":"10.15.83.26","ID":25083,"Metadata":"cilium-global:default:runtime:62502"}
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 20 (20 by maintainers)
@joestringer can we debug this together tomorrow?
Hit this in #3835 too. There is evidence that the kvstore is updated with the relevant IP->ID mapping, however the IP->ID kvstore watcher does not appear to have received events for about five minutes. Monitor shows packets being dropped due to the identity mapping to the cluster ID rather than the endpoint IP’s ID. Suspecting that in some cases, consul is not properly triggering create events.
Note that the PR #3835 adds CIDR-dependent L4, so the ipcache entries are extended with masks. One of the implications is that the “failure to map daddr” in the monitor becomes a success to map the daddr, to ID
3(cluster). This is still wrong, it should map to an actual endpoint identity.Link: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/3029/
test_results_Cilium-PR-Ginkgo-Tests-Validated_3029.zip
Failure:
Logs at time:
Monitor drop:
Cilium log for upsert:
KVstore:
Logs are missing from the daemon for about 12 seconds at a relevant time
gops suggests that there have been no updates from the kvstore:
Also interesting, another goroutine is blocking on kvstore operation:
Unfortunately the consul log appears to be clean, no indication of errors in it.
@ianvernon to prepare a commit to wait for IPs to exist in the BPF datapath as a way to: