cilium: CI: test-conn-disrupt-client failed due to interrupted traffic during upgrade/downgrade
CI failure
Failed job: https://github.com/cilium/cilium/actions/runs/6154793242/job/16700697606
[=] Test [no-interrupted-connections]
[-] Scenario [no-interrupted-connections/no-interrupted-connections]
🟥 Pod test-conn-disrupt-client-cc557fcf9-8sjnb flow was interrupted (restart count does not match 2 != 3)
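The failure message above suggests that the check compares each client pod's restart count before and after the upgrade step, presumably because the client exits when its connection breaks. A minimal sketch of that kind of comparison follows; `interruptedPods` and its map inputs are hypothetical and are not taken from the actual connectivity test code.

```go
package main

import (
	"fmt"
	"sort"
)

// interruptedPods compares the restart counts recorded before and after an
// upgrade step and returns one message per pod whose count changed.
// Hypothetical helper for illustration, not the actual test implementation.
func interruptedPods(before, after map[string]int) []string {
	var msgs []string
	for pod, prev := range before {
		cur, ok := after[pod]
		if !ok {
			msgs = append(msgs, fmt.Sprintf("Pod %s not found after the upgrade", pod))
			continue
		}
		if cur != prev {
			msgs = append(msgs, fmt.Sprintf(
				"Pod %s flow was interrupted (restart count does not match %d != %d)",
				pod, prev, cur))
		}
	}
	sort.Strings(msgs) // map iteration order is random; sort for stable output
	return msgs
}

func main() {
	// Counts taken from the failure message in this issue.
	before := map[string]int{"test-conn-disrupt-client-cc557fcf9-8sjnb": 2}
	after := map[string]int{"test-conn-disrupt-client-cc557fcf9-8sjnb": 3}
	for _, m := range interruptedPods(before, after) {
		fmt.Println(m)
	}
}
```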
This issue is different from "Drop due to missed tail call when doing downgrade from main to v1.13.4" because there is no "Missed tail call" entry in cilium-bugtool/cmd/cilium-metrics-list.md.
This issue is not an IPsec related issue because:
- This failed check didn’t enable IPsec, although the job name is IPsec upgrade. The job given above was triggered from https://github.com/cilium/cilium/pull/28086, which disabled IPsec.
- XFRM counters didn’t increase.
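The issue does not say how the XFRM counters were compared. One way to check such a claim, sketched here as an assumption, is to diff two snapshots of the kernel's /proc/net/xfrm_stat output. The helper names `parseXfrmStat` and `increased` are hypothetical, and the sample values are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseXfrmStat parses text in the "Name value" per-line layout used by
// /proc/net/xfrm_stat. Lines that do not have exactly two fields are skipped.
func parseXfrmStat(s string) (map[string]uint64, error) {
	counters := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(s))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %q: %w", sc.Text(), err)
		}
		counters[fields[0]] = v
	}
	return counters, sc.Err()
}

// increased returns the sorted names of counters that grew between snapshots.
func increased(before, after map[string]uint64) []string {
	var names []string
	for name, cur := range after {
		if cur > before[name] {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Made-up sample snapshots, not data from the failed job.
	before, _ := parseXfrmStat("XfrmInError 0\nXfrmInNoStates 0\nXfrmOutNoStates 0\n")
	after, _ := parseXfrmStat("XfrmInError 0\nXfrmInNoStates 0\nXfrmOutNoStates 0\n")
	fmt.Println("counters that increased:", increased(before, after))
}
```

An empty result would be consistent with the observation that no XFRM errors were recorded during the run.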
About this issue
- State: open
- Created 10 months ago
- Comments: 24 (24 by maintainers)
Commits related to this issue
- datapath: Pin CT tail call buffers This commit pins the per-endpoint CT tail call buffer maps, so that they can be preserved during an endpoint regeneration (e.g., when cilium-agent restarts). Previ... — committed to cilium/cilium by brb 9 months ago
- ci-e2e: Do not check for missed tail calls with host-fw It's a known flake / bug [1]. [1]: https://github.com/cilium/cilium/issues/28088 Signed-off-by: Martynas Pumputis <m@lambda.lt> — committed to cilium/cilium by brb 7 months ago
- ci-e2e: Do not check for missed tail calls with host-fw [ upstream commit 829736636f84481a6f903e5d67471decfc29793b ] It's a known flake / bug [1]. [1]: https://github.com/cilium/cilium/issues/28088... — committed to cilium/cilium by brb 7 months ago
- ci-e2e: Do not check for missed tail calls with host-fw It's a known flake / bug [1]. [1]: https://github.com/cilium/cilium/issues/28088 Signed-off-by: Martynas Pumputis <m@lambda.lt> — committed to pjablonski123/cilium by brb 7 months ago
For the sake of posterity the latest bpftrace script:
@giorio94 mentioned that the drop happens at almost the same time as the related client endpoint gets regenerated. The drop was observed only when both IPv{4,6} and the bpf_lxc per-packet LB were enabled. Anyway, I extended Marco’s reproducer to get more info:
The drop with pwru:
What happened is that the BPF prog attached to cilium_vxlan didn’t do the rev-DNAT. This is most likely due to a missing CT entry. I’m continuing the investigation.
Here is a recent one with a missed tail call: https://github.com/cilium/cilium/actions/runs/7089953274/job/19295849453. It happens on the IPsec workflow, so it’s not the host firewall one (incompatible with IPsec).
Small update. Added bpftrace to the rescue:
The relevant output:
From ^^ I see that the CT map lookup for the reply got a different map value (which probably didn’t have any rev_nat_id set) than before. To be continued.
https://github.com/cilium/cilium/issues/27827 was a duplicate of this issue. Closed to avoid duplicates, but it has a couple more sysdumps, in case that’s helpful.
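To illustrate why a CT value without a rev_nat_id would lead to the missing rev-DNAT described in the comments above, here is a deliberately simplified model. It is not the BPF datapath code: the types `addr`, `ctKey`, `ctEntry` and the function `revDNAT` are hypothetical, and the addresses are made up.

```go
package main

import "fmt"

type addr struct {
	ip   string
	port uint16
}

// ctKey identifies a tracked connection as seen in the reply direction.
type ctKey struct {
	src, dst addr
}

// ctEntry is a toy CT value; revNATID == 0 means no reverse NAT is recorded.
type ctEntry struct {
	revNATID uint16
}

// revDNAT returns the source address the client should see on a reply.
// If the CT entry is missing or carries no reverse NAT ID, the reply keeps
// the backend address, so the client cannot match it to its connection.
func revDNAT(ct map[ctKey]ctEntry, revNAT map[uint16]addr, reply ctKey) (addr, bool) {
	entry, ok := ct[reply]
	if !ok || entry.revNATID == 0 {
		return reply.src, false
	}
	svc, ok := revNAT[entry.revNATID]
	if !ok {
		return reply.src, false
	}
	return svc, true
}

func main() {
	backend := addr{"10.0.1.5", 8000}
	client := addr{"10.0.0.7", 40000}
	service := addr{"10.96.0.10", 80}
	reply := ctKey{src: backend, dst: client}
	revNAT := map[uint16]addr{1: service}

	// Entry with a reverse NAT ID: the reply source becomes the service address.
	good := map[ctKey]ctEntry{reply: {revNATID: 1}}
	src, ok := revDNAT(good, revNAT, reply)
	fmt.Println(src, ok)

	// Entry without a reverse NAT ID: the reply keeps the backend address.
	stale := map[ctKey]ctEntry{reply: {}}
	src, ok = revDNAT(stale, revNAT, reply)
	fmt.Println(src, ok)
}
```

In this model, a lookup that returns a different value than the one written when the connection was established behaves the same as a missing entry: the reply is forwarded untranslated.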
The ci-e2e-upgrade PR is hitting this too (1 out of ~3 times). For example, https://github.com/cilium/cilium/actions/runs/6147703736/job/16679772837.