cilium: CI: Host firewall: unable to upgrade connection
Failed a couple of times in different PRs:
- https://jenkins.cilium.io/job/Cilium-PR-K8s-1.20-kernel-4.9/1068/testReport/junit/Suite-k8s-1/20/K8sDatapathConfig_Host_firewall_With_VXLAN/ – 1cee8a70_K8sDatapathConfig_Host_firewall_With_VXLAN.zip
- https://jenkins.cilium.io/job/Cilium-PR-K8s-1.20-kernel-4.9/1076/testReport/junit/Suite-k8s-1/20/K8sDatapathConfig_Host_firewall_With_VXLAN/ – 39f2ff05_K8sDatapathConfig_Host_firewall_With_VXLAN.zip
- https://jenkins.cilium.io/job/Cilium-PR-K8s-1.13-net-next/1147/testReport/junit/Suite-k8s-1/13/K8sDatapathConfig_Host_firewall_With_native_routing/ – 5cc06922_K8sDatapathConfig_Host_firewall_With_native_routing.zip
Stacktrace
/home/jenkins/workspace/Cilium-PR-K8s-1.20-kernel-4.9/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:518
Failed to reach 192.168.36.11:80 from testclient-vvxhw
Expected command: kubectl exec -n 202103232024k8sdatapathconfighostfirewallwithvxlan testclient-vvxhw -- curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 20 http://192.168.36.11:80/ -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}',Transfer '%{time_starttransfer}', total '%{time_total}'"
To succeed, but it failed:
Exitcode: 1
Err: exit status 1
Stdout:
Stderr:
error: unable to upgrade connection: Authorization error (user=kube-apiserver-kubelet-client, verb=create, resource=nodes, subresource=proxy)
/home/jenkins/workspace/Cilium-PR-K8s-1.20-kernel-4.9/src/github.com/cilium/cilium/test/k8sT/DatapathConfiguration.go:649
Standard Output
Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
No errors/warnings found in logs
Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
No errors/warnings found in logs
Number of "context deadline exceeded" in logs: 2
Number of "level=error" in logs: 0
⚠️ Number of "level=warning" in logs: 24
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
Top 5 errors/warnings:
Unable to update ipcache map entry on pod add
failed to resolve identity
Unable to restore endpoint, ignoring
Unable to serve pprof API
github.com/cilium/cilium/pkg/k8s/watchers/endpoint_slice.go:90: watch of *v1beta1.EndpointSlice ended with: an error on the server (\
Cilium pods: [cilium-cn8kr cilium-v8t6w]
Netpols loaded:
CiliumNetworkPolicies loaded:
Endpoint Policy Enforcement:
Pod Ingress Egress
grafana-d69c97b9b-hlvb6
prometheus-655fb888d7-rqk9s
coredns-867bf6789f-26clv
testclient-vvxhw
testclient-zfhh2
testserver-f7q56
testserver-szc2d
Cilium agent 'cilium-cn8kr': Status: Ok Health: Ok Nodes "" ContinerRuntime: Kubernetes: Ok KVstore: Ok Controllers: Total 42 Failed 0
Cilium agent 'cilium-v8t6w': Status: Ok Health: Ok Nodes "" ContinerRuntime: Kubernetes: Ok KVstore: Ok Controllers: Total 28 Failed 0
Standard Error
20:21:42 STEP: Installing Cilium
20:21:43 STEP: Waiting for Cilium to become ready
20:23:39 STEP: Validating if Kubernetes DNS is deployed
20:23:39 STEP: Checking if deployment is ready
20:23:39 STEP: Checking if kube-dns service is plumbed correctly
20:23:39 STEP: Checking if pods have identity
20:23:39 STEP: Checking if DNS can resolve
20:23:45 STEP: Kubernetes DNS is not ready: 5s timeout expired
20:23:45 STEP: Restarting Kubernetes DNS (-l k8s-app=kube-dns)
20:23:46 STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-cn8kr: unable to find service backend 10.0.0.67:53 in datapath of cilium pod cilium-cn8kr
20:24:01 STEP: Waiting for Kubernetes DNS to become operational
20:24:01 STEP: Checking if deployment is ready
20:24:01 STEP: Checking if kube-dns service is plumbed correctly
20:24:01 STEP: Checking if pods have identity
20:24:01 STEP: Checking if DNS can resolve
20:24:02 STEP: Validating Cilium Installation
20:24:02 STEP: Performing Cilium controllers preflight check
20:24:02 STEP: Performing Cilium status preflight check
20:24:02 STEP: Performing Cilium health check
20:24:03 STEP: Performing Cilium service preflight check
20:24:03 STEP: Performing K8s service preflight check
20:24:05 STEP: Waiting for cilium-operator to be ready
20:24:05 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator")
20:24:05 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator") => <nil>
20:24:05 STEP: Making sure all endpoints are in ready state
20:24:06 STEP: Creating namespace 202103232024k8sdatapathconfighostfirewallwithvxlan
20:24:06 STEP: Deploying demo_hostfw.yaml in namespace 202103232024k8sdatapathconfighostfirewallwithvxlan
20:24:06 STEP: Waiting for 4m0s for 8 pods of deployment demo_hostfw.yaml to become ready
20:24:06 STEP: WaitforNPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="")
20:24:21 STEP: WaitforNPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="") => <nil>
20:24:21 STEP: Applying policies /home/jenkins/workspace/Cilium-PR-K8s-1.20-kernel-4.9/src/github.com/cilium/cilium/test/k8sT/manifests/host-policies.yaml
20:24:33 STEP: Checking host policies on ingress from local pod
20:24:33 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testClient")
20:24:33 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testClient") => <nil>
20:24:33 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testServerHost")
20:24:33 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testServerHost") => <nil>
20:24:38 STEP: Checking host policies on ingress from remote pod
20:24:38 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testClient")
20:24:38 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testClient") => <nil>
20:24:38 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testServerHost")
20:24:38 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testServerHost") => <nil>
FAIL: Failed to reach 192.168.36.11:80 from testclient-vvxhw
Expected command: kubectl exec -n 202103232024k8sdatapathconfighostfirewallwithvxlan testclient-vvxhw -- curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 20 http://192.168.36.11:80/ -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}',Transfer '%{time_starttransfer}', total '%{time_total}'"
To succeed, but it failed:
Exitcode: 1
Err: exit status 1
Stdout:
Stderr:
error: unable to upgrade connection: Authorization error (user=kube-apiserver-kubelet-client, verb=create, resource=nodes, subresource=proxy)
=== Test Finished at 2021-03-23T20:24:47Z====
20:24:47 STEP: Running JustAfterEach block for EntireTestsuite K8sDatapathConfig
===================== TEST FAILED =====================
20:24:47 STEP: Running AfterFailed block for EntireTestsuite K8sDatapathConfig
cmd: kubectl get pods -o wide --all-namespaces
Exitcode: 0
Stdout:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
202103232024k8sdatapathconfighostfirewallwithvxlan testclient-host-g8nnv 1/1 Running 0 44s 192.168.36.11 k8s1 <none> <none>
202103232024k8sdatapathconfighostfirewallwithvxlan testclient-host-wlm48 1/1 Running 0 44s 192.168.36.12 k8s2 <none> <none>
202103232024k8sdatapathconfighostfirewallwithvxlan testclient-vvxhw 1/1 Running 0 44s 10.0.1.77 k8s2 <none> <none>
202103232024k8sdatapathconfighostfirewallwithvxlan testclient-zfhh2 1/1 Running 0 44s 10.0.0.76 k8s1 <none> <none>
202103232024k8sdatapathconfighostfirewallwithvxlan testserver-f7q56 2/2 Running 0 44s 10.0.1.177 k8s2 <none> <none>
202103232024k8sdatapathconfighostfirewallwithvxlan testserver-host-n6rt9 2/2 Running 0 44s 192.168.36.11 k8s1 <none> <none>
202103232024k8sdatapathconfighostfirewallwithvxlan testserver-host-q89jm 2/2 Running 0 44s 192.168.36.12 k8s2 <none> <none>
202103232024k8sdatapathconfighostfirewallwithvxlan testserver-szc2d 2/2 Running 0 44s 10.0.0.86 k8s1 <none> <none>
cilium-monitoring grafana-d69c97b9b-hlvb6 1/1 Running 0 97m 10.0.1.214 k8s2 <none> <none>
cilium-monitoring prometheus-655fb888d7-rqk9s 1/1 Running 0 97m 10.0.1.93 k8s2 <none> <none>
kube-system cilium-cn8kr 1/1 Running 0 3m7s 192.168.36.12 k8s2 <none> <none>
kube-system cilium-operator-5878c8bc5c-22nfr 1/1 Running 0 3m7s 192.168.36.12 k8s2 <none> <none>
kube-system cilium-operator-5878c8bc5c-85m7m 1/1 Running 0 3m7s 192.168.36.11 k8s1 <none> <none>
kube-system cilium-v8t6w 1/1 Running 0 3m7s 192.168.36.11 k8s1 <none> <none>
kube-system coredns-867bf6789f-26clv 1/1 Running 0 66s 10.0.1.249 k8s2 <none> <none>
kube-system etcd-k8s1 1/1 Running 0 99m 192.168.36.11 k8s1 <none> <none>
kube-system kube-apiserver-k8s1 1/1 Running 0 99m 192.168.36.11 k8s1 <none> <none>
kube-system kube-controller-manager-k8s1 1/1 Running 0 99m 192.168.36.11 k8s1 <none> <none>
kube-system kube-proxy-bx8ds 1/1 Running 0 99m 192.168.36.11 k8s1 <none> <none>
kube-system kube-proxy-lbqpl 1/1 Running 0 98m 192.168.36.12 k8s2 <none> <none>
kube-system kube-scheduler-k8s1 1/1 Running 0 99m 192.168.36.11 k8s1 <none> <none>
kube-system log-gatherer-cnn8x 1/1 Running 0 97m 192.168.36.11 k8s1 <none> <none>
kube-system log-gatherer-dvhg5 1/1 Running 0 97m 192.168.36.12 k8s2 <none> <none>
kube-system registry-adder-fbxwg 1/1 Running 0 98m 192.168.36.12 k8s2 <none> <none>
kube-system registry-adder-fc8b8 1/1 Running 0 98m 192.168.36.11 k8s1 <none> <none>
Stderr:
Fetching command output from pods [cilium-cn8kr cilium-v8t6w]
cmd: kubectl exec -n kube-system cilium-cn8kr -- cilium status
Exitcode: 0
Stdout:
KVStore: Ok Disabled
Kubernetes: Ok 1.20 (v1.20.4) [linux/amd64]
Kubernetes APIs: ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1beta1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Disabled
Cilium: Ok 1.9.90 (v.1.9.90-r.a2a7594)
NodeMonitor: Listening for events on 3 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPAM: IPv4: 7/255 allocated from 10.0.1.0/24, IPv6: 7/255 allocated from fd02::100/120
BandwidthManager: Disabled
Host Routing: Legacy
Masquerading: IPTables [IPv4: Enabled, IPv6: Enabled]
Controller Status: 42/42 healthy
Proxy Status: OK, ip 10.0.1.42, 0 redirects active on ports 10000-20000
Hubble: Ok Current/Max Flows: 1193/4095 (29.13%), Flows/s: 14.04 Metrics: Disabled
Cluster health: 2/2 reachable (2021-03-23T20:24:03Z)
Stderr:
cmd: kubectl exec -n kube-system cilium-cn8kr -- cilium endpoint list
Exitcode: 0
Stdout:
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
203 Disabled Disabled 32434 k8s:io.cilium.k8s.policy.cluster=default fd02::12d 10.0.1.249 ready
k8s:io.cilium.k8s.policy.serviceaccount=coredns
k8s:io.kubernetes.pod.namespace=kube-system
k8s:k8s-app=kube-dns
262 Enabled Enabled 1 k8s:cilium.io/ci-node=k8s2 ready
reserved:host
309 Disabled Disabled 4 reserved:health fd02::12c 10.0.1.175 ready
729 Disabled Disabled 26056 k8s:io.cilium.k8s.policy.cluster=default fd02::15a 10.0.1.77 ready
k8s:io.cilium.k8s.policy.serviceaccount=default
k8s:io.kubernetes.pod.namespace=202103232024k8sdatapathconfighostfirewallwithvxlan
k8s:zgroup=testClient
1334 Disabled Disabled 25739 k8s:app=prometheus fd02::1af 10.0.1.93 ready
k8s:io.cilium.k8s.policy.cluster=default
k8s:io.cilium.k8s.policy.serviceaccount=prometheus-k8s
k8s:io.kubernetes.pod.namespace=cilium-monitoring
2226 Disabled Disabled 10726 k8s:io.cilium.k8s.policy.cluster=default fd02::14c 10.0.1.177 ready
k8s:io.cilium.k8s.policy.serviceaccount=default
k8s:io.kubernetes.pod.namespace=202103232024k8sdatapathconfighostfirewallwithvxlan
k8s:zgroup=testServer
2608 Disabled Disabled 3821 k8s:app=grafana fd02::1ed 10.0.1.214 ready
k8s:io.cilium.k8s.policy.cluster=default
k8s:io.cilium.k8s.policy.serviceaccount=default
k8s:io.kubernetes.pod.namespace=cilium-monitoring
Stderr:
cmd: kubectl exec -n kube-system cilium-v8t6w -- cilium status
Exitcode: 0
Stdout:
KVStore: Ok Disabled
Kubernetes: Ok 1.20 (v1.20.4) [linux/amd64]
Kubernetes APIs: ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1beta1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Disabled
Cilium: Ok 1.9.90 (v.1.9.90-r.a2a7594)
NodeMonitor: Listening for events on 3 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPAM: IPv4: 4/255 allocated from 10.0.0.0/24, IPv6: 4/255 allocated from fd02::/120
BandwidthManager: Disabled
Host Routing: Legacy
Masquerading: IPTables [IPv4: Enabled, IPv6: Enabled]
Controller Status: 28/28 healthy
Proxy Status: OK, ip 10.0.0.96, 0 redirects active on ports 10000-20000
Hubble: Ok Current/Max Flows: 1907/4095 (46.57%), Flows/s: 10.04 Metrics: Disabled
Cluster health: 2/2 reachable (2021-03-23T20:24:04Z)
Stderr:
cmd: kubectl exec -n kube-system cilium-v8t6w -- cilium endpoint list
Exitcode: 0
Stdout:
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
494 Enabled Enabled 1 k8s:cilium.io/ci-node=k8s1 ready
k8s:node-role.kubernetes.io/control-plane
k8s:node-role.kubernetes.io/master
reserved:host
787 Disabled Disabled 10726 k8s:io.cilium.k8s.policy.cluster=default fd02::60 10.0.0.86 ready
k8s:io.cilium.k8s.policy.serviceaccount=default
k8s:io.kubernetes.pod.namespace=202103232024k8sdatapathconfighostfirewallwithvxlan
k8s:zgroup=testServer
1051 Disabled Disabled 4 reserved:health fd02::25 10.0.0.18 ready
1661 Disabled Disabled 26056 k8s:io.cilium.k8s.policy.cluster=default fd02::2 10.0.0.76 ready
k8s:io.cilium.k8s.policy.serviceaccount=default
k8s:io.kubernetes.pod.namespace=202103232024k8sdatapathconfighostfirewallwithvxlan
k8s:zgroup=testClient
Stderr:
===================== Exiting AfterFailed =====================
20:25:01 STEP: Running AfterEach for block EntireTestsuite K8sDatapathConfig Host firewall
20:25:02 STEP: Running AfterEach for block EntireTestsuite K8sDatapathConfig
20:25:02 STEP: Deleting deployment demo_hostfw.yaml
20:25:02 STEP: Deleting namespace 202103232024k8sdatapathconfighostfirewallwithvxlan
20:25:02 STEP: Deleting namespace 202103232024k8sdatapathconfighostfirewallwithvxlan
20:25:17 STEP: Running AfterEach for block EntireTestsuite
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 31 (26 by maintainers)
Commits related to this issue
- test/k8s: quarantine test while we validate hostfw ct ack issue fix. Change 439a0a059fdcabe23a33b427b637494bc5a59eda introduced potential workaround to common flake we've been seeing relating to issu... — committed to tommyp1ckles/cilium by tommyp1ckles a year ago
- test/k8s: add host firewall workaround for svc host policy test. Change 439a0a0 introduced workaround to common flake we've been seeing relating to issue #15455. Any test enabling hostfw/host-policy... — committed to tommyp1ckles/cilium by tommyp1ckles a year ago
- test/k8s: add host firewall workaround for svc host policy test. Change 439a0a0 introduced workaround to common flake we've been seeing relating to issue #15455. Any test enabling hostfw/host-policy... — committed to cilium/cilium by tommyp1ckles a year ago
- test/k8s: add host firewall workaround for svc host policy test. [ upstream commit 0cfce97f314b3c7975e62d5e04944184fa63326f ] Change 439a0a0 introduced workaround to common flake we've been seeing r... — committed to cilium/cilium by tommyp1ckles a year ago
- test/k8s: add host firewall workaround for svc host policy test. [ upstream commit 0cfce97f314b3c7975e62d5e04944184fa63326f ] Change 439a0a0 introduced workaround to common flake we've been seeing r... — committed to cilium/cilium by tommyp1ckles a year ago
- test/k8s: add host firewall workaround for svc host policy test. [ upstream commit 0cfce97f314b3c7975e62d5e04944184fa63326f ] Change 439a0a0 introduced workaround to common flake we've been seeing r... — committed to cilium/cilium by tommyp1ckles a year ago
I think we can say we have confirmation the workaround is working:
Let’s close this issue. We can discuss the proper way forward for the host firewall at https://github.com/cilium/cilium/issues/25448.
We should also extend this workaround to the other host firewall test (i.e.,
K8sDatapathServices .* With host policy .*). IMO, it would probably be best to move that test in with the others (thus not requiring any additional workaround) rather than copy the workaround into K8sDatapathServices. cc @brb low-hanging fruit to kill one of our highest-frequency Jenkins flakes right now.

The current hypothesis is that this flake is caused by the drops mentioned in a previous comment:
Origin of the Packet Drops
These drops happen for connections that were established before the host firewall was enabled. Once the host firewall is enabled, conntrack is enabled for host connections. For each established connection, the next packet seen by the conntrack engine is considered the first packet of the connection, and the connection's direction is deduced from that packet. So, if the next packet seen by the conntrack engine is a reply, the connection will be tracked in the wrong direction. Once L4 policies are loaded, this connection will be dropped because its destination port (actually the source port, and thus an ephemeral port) likely won't match any L4 rule.
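The direction-deduction problem above can be illustrated with a toy model (an illustration only, not Cilium's actual datapath code): the tracker records a connection the first time it sees any packet and deduces the direction from that packet alone, so a pre-existing connection whose first observed packet is a reply gets tracked backwards.

```shell
# Toy conntrack model: direction is deduced from the first packet seen.
track() {  # $1 = connection name, $2 = first packet kind (SYN or REPLY)
  eval "dir=\${CT_$1:-}"
  if [ -z "$dir" ]; then
    # First packet this tracker has seen for the connection: deduce direction.
    if [ "$2" = "SYN" ]; then dir=EGRESS; else dir=INGRESS; fi
    eval "CT_$1=$dir"
  fi
  echo "$1 tracked as $dir"
}

# Connection opened after the host firewall came up: first packet is a SYN.
track newconn SYN     # -> newconn tracked as EGRESS (correct)
# Pre-existing connection: the first packet conntrack sees is a reply, so the
# direction is deduced backwards; L4 policy will then be matched against the
# ephemeral port instead of the real destination port, and the packet dropped.
track oldconn REPLY   # -> oldconn tracked as INGRESS (wrong direction)
```

Note that later packets cannot fix the entry: once the direction is recorded, the connection stays tracked the wrong way, which is why flushing conntrack alone does not help.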
In this flake, the packet drops are for connections to the kube-apiserver (TCP/6443) across nodes. So the hypothesis is that these drops may be causing the failure to execute kubectl exec. The drops happen as soon as the host policies are loaded, but it takes a little while for the connections to be terminated (due to retries), so maybe that's why we don't see the kubectl exec failure at the very beginning of the test.

Next Step
We discussed this issue at length with @joestringer and tried to consider how we might validate this hypothesis, but it proved a bit tricky. We have a couple of errors in kube-apiserver logs around the time the failure is happening, but those errors are not consistent across all flake occurrences.
Alternatively, we can attempt a fix and see if it helps. We deduced that the only way to avoid those packet drops is for the connections to be established only after the host firewall is enabled. We can’t simply flush conntrack to reestablish connections because the next packet received would just be considered the first packet of the connection again. Since the connections affected here are kube-apiserver connections, we are considering restarting the kube-apiserver pods after the host firewall is enabled, such that all connections are reestablished.
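A minimal sketch of what such a restart could look like in a CI helper (hypothetical: the label selector and health check are assumptions, not the test suite's actual code). Since kube-apiserver runs as a static pod, deleting its mirror pod makes kubelet recreate it, so all of its connections are reestablished after the host firewall is up:

```shell
# Hypothetical workaround sketch: restart kube-apiserver after enabling the
# host firewall so its connections are established under conntrack from the
# start. The label below is the standard kubeadm component label.
kubectl -n kube-system delete pod -l component=kube-apiserver

# Wait until the apiserver answers again before proceeding with the test.
until kubectl get --raw /healthz >/dev/null 2>&1; do
  sleep 2
done
```

This only addresses kube-apiserver connections; any other long-lived host connection established before the host firewall came up would need the same treatment.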
I don’t think that specific test has the workaround. Only K8sDatapathConfig Host Firewall has it.

The dashboard you want to follow is: https://lookerstudio.google.com/s/vquN3vXM-kM.
I’d say by Monday we will have “official” confirmation, and then we can extend the workaround and consider what we should document.
Just checked all linked cases since merging https://github.com/cilium/cilium/pull/25323, all failures have happened on branches not rebased with the potential fix.
If anyone is reporting this issue make sure your branch has: https://github.com/cilium/cilium/commit/439a0a059fdcabe23a33b427b637494bc5a59eda
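To check whether a branch already contains that commit, git's ancestry test can be used (a small hypothetical helper, not something from the Cilium tree):

```shell
# Succeeds if the given commit is an ancestor of HEAD, i.e. the current
# branch already contains it.
has_commit() {
  git merge-base --is-ancestor "$1" HEAD 2>/dev/null
}

# Usage against the workaround commit from this issue:
# has_commit 439a0a059fdcabe23a33b427b637494bc5a59eda && echo "workaround present"
```

`git merge-base --is-ancestor` exits 0 when the first commit is reachable from the second, which makes it convenient for scripting this kind of "is my branch rebased?" check.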
Happened again at https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/646/testReport/junit/Suite-k8s-1/21/K8sDatapathConfig_Host_firewall_With_VXLAN_and_endpoint_routes/. 😢
I checked recent failures and this flake seems to have disappeared since we merged https://github.com/cilium/cilium/pull/16064 (which parallelized the host firewall tests and made them a lot faster) on May 11th. The list of PRs linked to this issue above also seems to attest to that. I’ll keep this open for a short while longer and we’ll close it once that’s confirmed.