cilium: CI: Host firewall: unable to upgrade connection

Failed a couple of times in different PRs:

Stacktrace

/home/jenkins/workspace/Cilium-PR-K8s-1.20-kernel-4.9/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:518
Failed to reach 192.168.36.11:80 from testclient-vvxhw
Expected command: kubectl exec -n 202103232024k8sdatapathconfighostfirewallwithvxlan testclient-vvxhw -- curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 20 http://192.168.36.11:80/ -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}',Transfer '%{time_starttransfer}', total '%{time_total}'" 
To succeed, but it failed:
Exitcode: 1 
Err: exit status 1
Stdout:
 	 
Stderr:
 	 error: unable to upgrade connection: Authorization error (user=kube-apiserver-kubelet-client, verb=create, resource=nodes, subresource=proxy)
	 

/home/jenkins/workspace/Cilium-PR-K8s-1.20-kernel-4.9/src/github.com/cilium/cilium/test/k8sT/DatapathConfiguration.go:649

Standard Output

Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
No errors/warnings found in logs
Number of "context deadline exceeded" in logs: 0
Number of "level=error" in logs: 0
Number of "level=warning" in logs: 0
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
No errors/warnings found in logs
Number of "context deadline exceeded" in logs: 2
Number of "level=error" in logs: 0
⚠️  Number of "level=warning" in logs: 24
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
Top 5 errors/warnings:
Unable to update ipcache map entry on pod add
failed to resolve identity
Unable to restore endpoint, ignoring
Unable to serve pprof API
github.com/cilium/cilium/pkg/k8s/watchers/endpoint_slice.go:90: watch of *v1beta1.EndpointSlice ended with: an error on the server (\
Cilium pods: [cilium-cn8kr cilium-v8t6w]
Netpols loaded: 
CiliumNetworkPolicies loaded: 
Endpoint Policy Enforcement:
Pod                           Ingress   Egress
grafana-d69c97b9b-hlvb6                 
prometheus-655fb888d7-rqk9s             
coredns-867bf6789f-26clv                
testclient-vvxhw                        
testclient-zfhh2                        
testserver-f7q56                        
testserver-szc2d                        
Cilium agent 'cilium-cn8kr': Status: Ok  Health: Ok Nodes "" ContinerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 42 Failed 0
Cilium agent 'cilium-v8t6w': Status: Ok  Health: Ok Nodes "" ContinerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 28 Failed 0

Standard Error

20:21:42 STEP: Installing Cilium
20:21:43 STEP: Waiting for Cilium to become ready
20:23:39 STEP: Validating if Kubernetes DNS is deployed
20:23:39 STEP: Checking if deployment is ready
20:23:39 STEP: Checking if kube-dns service is plumbed correctly
20:23:39 STEP: Checking if pods have identity
20:23:39 STEP: Checking if DNS can resolve
20:23:45 STEP: Kubernetes DNS is not ready: 5s timeout expired
20:23:45 STEP: Restarting Kubernetes DNS (-l k8s-app=kube-dns)
20:23:46 STEP: Checking service kube-system/kube-dns plumbing in cilium pod cilium-cn8kr: unable to find service backend 10.0.0.67:53 in datapath of cilium pod cilium-cn8kr
20:24:01 STEP: Waiting for Kubernetes DNS to become operational
20:24:01 STEP: Checking if deployment is ready
20:24:01 STEP: Checking if kube-dns service is plumbed correctly
20:24:01 STEP: Checking if pods have identity
20:24:01 STEP: Checking if DNS can resolve
20:24:02 STEP: Validating Cilium Installation
20:24:02 STEP: Performing Cilium controllers preflight check
20:24:02 STEP: Performing Cilium status preflight check
20:24:02 STEP: Performing Cilium health check
20:24:03 STEP: Performing Cilium service preflight check
20:24:03 STEP: Performing K8s service preflight check
20:24:05 STEP: Waiting for cilium-operator to be ready
20:24:05 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator")
20:24:05 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator") => <nil>
20:24:05 STEP: Making sure all endpoints are in ready state
20:24:06 STEP: Creating namespace 202103232024k8sdatapathconfighostfirewallwithvxlan
20:24:06 STEP: Deploying demo_hostfw.yaml in namespace 202103232024k8sdatapathconfighostfirewallwithvxlan
20:24:06 STEP: Waiting for 4m0s for 8 pods of deployment demo_hostfw.yaml to become ready
20:24:06 STEP: WaitforNPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="")
20:24:21 STEP: WaitforNPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="") => <nil>
20:24:21 STEP: Applying policies /home/jenkins/workspace/Cilium-PR-K8s-1.20-kernel-4.9/src/github.com/cilium/cilium/test/k8sT/manifests/host-policies.yaml
20:24:33 STEP: Checking host policies on ingress from local pod
20:24:33 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testClient")
20:24:33 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testClient") => <nil>
20:24:33 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testServerHost")
20:24:33 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testServerHost") => <nil>
20:24:38 STEP: Checking host policies on ingress from remote pod
20:24:38 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testClient")
20:24:38 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testClient") => <nil>
20:24:38 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testServerHost")
20:24:38 STEP: WaitforPods(namespace="202103232024k8sdatapathconfighostfirewallwithvxlan", filter="-l zgroup=testServerHost") => <nil>
FAIL: Failed to reach 192.168.36.11:80 from testclient-vvxhw
Expected command: kubectl exec -n 202103232024k8sdatapathconfighostfirewallwithvxlan testclient-vvxhw -- curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 20 http://192.168.36.11:80/ -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}',Transfer '%{time_starttransfer}', total '%{time_total}'" 
To succeed, but it failed:
Exitcode: 1 
Err: exit status 1
Stdout:
 	 
Stderr:
 	 error: unable to upgrade connection: Authorization error (user=kube-apiserver-kubelet-client, verb=create, resource=nodes, subresource=proxy)
	 

=== Test Finished at 2021-03-23T20:24:47Z====
20:24:47 STEP: Running JustAfterEach block for EntireTestsuite K8sDatapathConfig
===================== TEST FAILED =====================
20:24:47 STEP: Running AfterFailed block for EntireTestsuite K8sDatapathConfig
cmd: kubectl get pods -o wide --all-namespaces
Exitcode: 0 
Stdout:
 	 NAMESPACE                                            NAME                               READY   STATUS    RESTARTS   AGE    IP              NODE   NOMINATED NODE   READINESS GATES
	 202103232024k8sdatapathconfighostfirewallwithvxlan   testclient-host-g8nnv              1/1     Running   0          44s    192.168.36.11   k8s1   <none>           <none>
	 202103232024k8sdatapathconfighostfirewallwithvxlan   testclient-host-wlm48              1/1     Running   0          44s    192.168.36.12   k8s2   <none>           <none>
	 202103232024k8sdatapathconfighostfirewallwithvxlan   testclient-vvxhw                   1/1     Running   0          44s    10.0.1.77       k8s2   <none>           <none>
	 202103232024k8sdatapathconfighostfirewallwithvxlan   testclient-zfhh2                   1/1     Running   0          44s    10.0.0.76       k8s1   <none>           <none>
	 202103232024k8sdatapathconfighostfirewallwithvxlan   testserver-f7q56                   2/2     Running   0          44s    10.0.1.177      k8s2   <none>           <none>
	 202103232024k8sdatapathconfighostfirewallwithvxlan   testserver-host-n6rt9              2/2     Running   0          44s    192.168.36.11   k8s1   <none>           <none>
	 202103232024k8sdatapathconfighostfirewallwithvxlan   testserver-host-q89jm              2/2     Running   0          44s    192.168.36.12   k8s2   <none>           <none>
	 202103232024k8sdatapathconfighostfirewallwithvxlan   testserver-szc2d                   2/2     Running   0          44s    10.0.0.86       k8s1   <none>           <none>
	 cilium-monitoring                                    grafana-d69c97b9b-hlvb6            1/1     Running   0          97m    10.0.1.214      k8s2   <none>           <none>
	 cilium-monitoring                                    prometheus-655fb888d7-rqk9s        1/1     Running   0          97m    10.0.1.93       k8s2   <none>           <none>
	 kube-system                                          cilium-cn8kr                       1/1     Running   0          3m7s   192.168.36.12   k8s2   <none>           <none>
	 kube-system                                          cilium-operator-5878c8bc5c-22nfr   1/1     Running   0          3m7s   192.168.36.12   k8s2   <none>           <none>
	 kube-system                                          cilium-operator-5878c8bc5c-85m7m   1/1     Running   0          3m7s   192.168.36.11   k8s1   <none>           <none>
	 kube-system                                          cilium-v8t6w                       1/1     Running   0          3m7s   192.168.36.11   k8s1   <none>           <none>
	 kube-system                                          coredns-867bf6789f-26clv           1/1     Running   0          66s    10.0.1.249      k8s2   <none>           <none>
	 kube-system                                          etcd-k8s1                          1/1     Running   0          99m    192.168.36.11   k8s1   <none>           <none>
	 kube-system                                          kube-apiserver-k8s1                1/1     Running   0          99m    192.168.36.11   k8s1   <none>           <none>
	 kube-system                                          kube-controller-manager-k8s1       1/1     Running   0          99m    192.168.36.11   k8s1   <none>           <none>
	 kube-system                                          kube-proxy-bx8ds                   1/1     Running   0          99m    192.168.36.11   k8s1   <none>           <none>
	 kube-system                                          kube-proxy-lbqpl                   1/1     Running   0          98m    192.168.36.12   k8s2   <none>           <none>
	 kube-system                                          kube-scheduler-k8s1                1/1     Running   0          99m    192.168.36.11   k8s1   <none>           <none>
	 kube-system                                          log-gatherer-cnn8x                 1/1     Running   0          97m    192.168.36.11   k8s1   <none>           <none>
	 kube-system                                          log-gatherer-dvhg5                 1/1     Running   0          97m    192.168.36.12   k8s2   <none>           <none>
	 kube-system                                          registry-adder-fbxwg               1/1     Running   0          98m    192.168.36.12   k8s2   <none>           <none>
	 kube-system                                          registry-adder-fc8b8               1/1     Running   0          98m    192.168.36.11   k8s1   <none>           <none>
	 
Stderr:
 	 

Fetching command output from pods [cilium-cn8kr cilium-v8t6w]
cmd: kubectl exec -n kube-system cilium-cn8kr -- cilium status
Exitcode: 0 
Stdout:
 	 KVStore:                Ok   Disabled
	 Kubernetes:             Ok   1.20 (v1.20.4) [linux/amd64]
	 Kubernetes APIs:        ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1beta1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
	 KubeProxyReplacement:   Disabled   
	 Cilium:                 Ok   1.9.90 (v.1.9.90-r.a2a7594)
	 NodeMonitor:            Listening for events on 3 CPUs with 64x4096 of shared memory
	 Cilium health daemon:   Ok   
	 IPAM:                   IPv4: 7/255 allocated from 10.0.1.0/24, IPv6: 7/255 allocated from fd02::100/120
	 BandwidthManager:       Disabled
	 Host Routing:           Legacy
	 Masquerading:           IPTables [IPv4: Enabled, IPv6: Enabled]
	 Controller Status:      42/42 healthy
	 Proxy Status:           OK, ip 10.0.1.42, 0 redirects active on ports 10000-20000
	 Hubble:                 Ok              Current/Max Flows: 1193/4095 (29.13%), Flows/s: 14.04   Metrics: Disabled
	 Cluster health:         2/2 reachable   (2021-03-23T20:24:03Z)
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-cn8kr -- cilium endpoint list
Exitcode: 0 
Stdout:
 	 ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                                          IPv6        IPv4         STATUS   
	            ENFORCEMENT        ENFORCEMENT                                                                                                                                
	 203        Disabled           Disabled          32434      k8s:io.cilium.k8s.policy.cluster=default                                             fd02::12d   10.0.1.249   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=coredns                                                                       
	                                                            k8s:io.kubernetes.pod.namespace=kube-system                                                                           
	                                                            k8s:k8s-app=kube-dns                                                                                                  
	 262        Enabled            Enabled           1          k8s:cilium.io/ci-node=k8s2                                                                                    ready   
	                                                            reserved:host                                                                                                         
	 309        Disabled           Disabled          4          reserved:health                                                                      fd02::12c   10.0.1.175   ready   
	 729        Disabled           Disabled          26056      k8s:io.cilium.k8s.policy.cluster=default                                             fd02::15a   10.0.1.77    ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                                                       
	                                                            k8s:io.kubernetes.pod.namespace=202103232024k8sdatapathconfighostfirewallwithvxlan                                    
	                                                            k8s:zgroup=testClient                                                                                                 
	 1334       Disabled           Disabled          25739      k8s:app=prometheus                                                                   fd02::1af   10.0.1.93    ready   
	                                                            k8s:io.cilium.k8s.policy.cluster=default                                                                              
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=prometheus-k8s                                                                
	                                                            k8s:io.kubernetes.pod.namespace=cilium-monitoring                                                                     
	 2226       Disabled           Disabled          10726      k8s:io.cilium.k8s.policy.cluster=default                                             fd02::14c   10.0.1.177   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                                                       
	                                                            k8s:io.kubernetes.pod.namespace=202103232024k8sdatapathconfighostfirewallwithvxlan                                    
	                                                            k8s:zgroup=testServer                                                                                                 
	 2608       Disabled           Disabled          3821       k8s:app=grafana                                                                      fd02::1ed   10.0.1.214   ready   
	                                                            k8s:io.cilium.k8s.policy.cluster=default                                                                              
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                                                       
	                                                            k8s:io.kubernetes.pod.namespace=cilium-monitoring                                                                     
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-v8t6w -- cilium status
Exitcode: 0 
Stdout:
 	 KVStore:                Ok   Disabled
	 Kubernetes:             Ok   1.20 (v1.20.4) [linux/amd64]
	 Kubernetes APIs:        ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1beta1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
	 KubeProxyReplacement:   Disabled   
	 Cilium:                 Ok   1.9.90 (v.1.9.90-r.a2a7594)
	 NodeMonitor:            Listening for events on 3 CPUs with 64x4096 of shared memory
	 Cilium health daemon:   Ok   
	 IPAM:                   IPv4: 4/255 allocated from 10.0.0.0/24, IPv6: 4/255 allocated from fd02::/120
	 BandwidthManager:       Disabled
	 Host Routing:           Legacy
	 Masquerading:           IPTables [IPv4: Enabled, IPv6: Enabled]
	 Controller Status:      28/28 healthy
	 Proxy Status:           OK, ip 10.0.0.96, 0 redirects active on ports 10000-20000
	 Hubble:                 Ok              Current/Max Flows: 1907/4095 (46.57%), Flows/s: 10.04   Metrics: Disabled
	 Cluster health:         2/2 reachable   (2021-03-23T20:24:04Z)
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-v8t6w -- cilium endpoint list
Exitcode: 0 
Stdout:
 	 ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                                          IPv6       IPv4        STATUS   
	            ENFORCEMENT        ENFORCEMENT                                                                                                                              
	 494        Enabled            Enabled           1          k8s:cilium.io/ci-node=k8s1                                                                                  ready   
	                                                            k8s:node-role.kubernetes.io/control-plane                                                                           
	                                                            k8s:node-role.kubernetes.io/master                                                                                  
	                                                            reserved:host                                                                                                       
	 787        Disabled           Disabled          10726      k8s:io.cilium.k8s.policy.cluster=default                                             fd02::60   10.0.0.86   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                                                     
	                                                            k8s:io.kubernetes.pod.namespace=202103232024k8sdatapathconfighostfirewallwithvxlan                                  
	                                                            k8s:zgroup=testServer                                                                                               
	 1051       Disabled           Disabled          4          reserved:health                                                                      fd02::25   10.0.0.18   ready   
	 1661       Disabled           Disabled          26056      k8s:io.cilium.k8s.policy.cluster=default                                             fd02::2    10.0.0.76   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=default                                                                     
	                                                            k8s:io.kubernetes.pod.namespace=202103232024k8sdatapathconfighostfirewallwithvxlan                                  
	                                                            k8s:zgroup=testClient                                                                                               
	 
Stderr:
 	 

===================== Exiting AfterFailed =====================
20:25:01 STEP: Running AfterEach for block EntireTestsuite K8sDatapathConfig Host firewall
20:25:02 STEP: Running AfterEach for block EntireTestsuite K8sDatapathConfig
20:25:02 STEP: Deleting deployment demo_hostfw.yaml
20:25:02 STEP: Deleting namespace 202103232024k8sdatapathconfighostfirewallwithvxlan
20:25:02 STEP: Deleting namespace 202103232024k8sdatapathconfighostfirewallwithvxlan
20:25:17 STEP: Running AfterEach for block EntireTestsuite

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 31 (26 by maintainers)

Most upvoted comments

I think we can say we have confirmation the workaround is working: [dashboard screenshot]

Let’s close this issue. We can discuss the proper way forward for the host firewall at https://github.com/cilium/cilium/issues/25448.

We should also extend this workaround to the other host firewall test (i.e., K8sDatapathServices .* With host policy .*). IMO, it would probably be best to move that test to sit with the others (thus not requiring any additional workaround) rather than copy the workaround into K8sDatapathServices. cc @brb, this is low-hanging fruit to kill one of our most frequent Jenkins flakes right now.

The current hypothesis is that this flake is caused by the drops mentioned in a previous comment:

All other policy drops I found were on connections established in the wrong direction in conntrack, which should be fine to drop.

Origin of the Packet Drops

These drops happen for connections that were established before the host firewall was enabled. Once the host firewall is enabled, conntrack is enabled for host connections. For each pre-existing connection, the next packet seen by the conntrack engine is considered the first packet of the connection, and the connection's direction is deduced from that packet. So, if the next packet seen by the conntrack engine is a reply, the connection will be tracked in the wrong direction. Once L4 policies are loaded, this connection will be dropped because its destination port (actually the connection's source port, and thus an ephemeral port) likely won't match any L4 rule.
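
To make that failure mode concrete, below is a minimal, hypothetical Go sketch of the direction deduction described above. The types, ports, and helper names are illustrative assumptions, not Cilium's actual datapath code.

```go
package main

import "fmt"

// packet is a toy model of what the conntrack engine observes; the fields are
// illustrative assumptions, not Cilium datapath structures.
type packet struct {
	srcPort, dstPort uint16
	syn              bool // set only on the client's initial SYN
}

// ctEntry records which port the engine believes is the connection's
// destination, deduced from the first packet it happens to observe.
type ctEntry struct {
	assumedDstPort uint16
}

// track creates a conntrack entry from the first observed packet. If tracking
// starts mid-connection, that packet may be a server-to-client reply, so the
// client's ephemeral port gets recorded as the destination.
func track(first packet) ctEntry {
	return ctEntry{assumedDstPort: first.dstPort}
}

// allowedByL4Policy mimics an ingress rule that only permits the apiserver port.
func allowedByL4Policy(e ctEntry, servicePort uint16) bool {
	return e.assumedDstPort == servicePort
}

func main() {
	const apiServerPort = 6443

	// Connection established before the host firewall was enabled: the first
	// packet the conntrack engine sees is a reply from :6443 to an ephemeral
	// client port, so the direction is deduced the wrong way around.
	reply := packet{srcPort: apiServerPort, dstPort: 48312}
	preexisting := track(reply)
	fmt.Println("pre-existing connection allowed:", allowedByL4Policy(preexisting, apiServerPort)) // false -> dropped
}
```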

In this flake, the packet drops are for connections to the kube-apiserver (TCP/6443) across nodes. The hypothesis is that these drops cause the failure to execute kubectl exec. The drops start as soon as the host policies are loaded, but it takes a little while for the connections to be terminated (due to retries), which may explain why we don't see the kubectl exec failure at the very beginning of the test.

Next Step

We discussed this issue at length with @joestringer and considered how we might validate this hypothesis, but it proved a bit tricky. There are a couple of errors in the kube-apiserver logs around the time the failure happens, but those errors are not consistent across all flake occurrences.

Alternatively, we can attempt a fix and see if it helps. We deduced that the only way to avoid those packet drops is for the affected connections to be established after the host firewall is enabled. We can't simply flush conntrack, because the next packet received on an existing connection would again be considered the first packet of that connection. Since the connections affected here are kube-apiserver connections, we are considering restarting the kube-apiserver pods after the host firewall is enabled, so that all connections are reestablished.
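
The same toy model (again hypothetical, not the actual workaround) shows why flushing conntrack would not help and why connections re-established after the firewall is enabled end up tracked correctly:

```go
package main

import "fmt"

// Same toy model as the sketch above; purely illustrative, not Cilium code.
type packet struct{ srcPort, dstPort uint16 }
type ctEntry struct{ assumedDstPort uint16 }

func track(first packet) ctEntry { return ctEntry{assumedDstPort: first.dstPort} }

func allowed(e ctEntry, servicePort uint16) bool { return e.assumedDstPort == servicePort }

func main() {
	const apiServerPort = 6443
	ct := map[string]ctEntry{}

	// Connection established before the host firewall: only mid-stream replies
	// from :6443 are observed, so the entry records the wrong direction.
	midStreamReply := packet{srcPort: apiServerPort, dstPort: 48312}
	ct["old-conn"] = track(midStreamReply)

	// Flushing conntrack removes the entry, but the very next packet on the
	// same connection is still a mid-stream reply, so it is mis-tracked again.
	delete(ct, "old-conn")
	ct["old-conn"] = track(midStreamReply)
	fmt.Println("after flush, allowed:", allowed(ct["old-conn"], apiServerPort)) // still false

	// A connection re-established after the firewall is enabled starts with a
	// SYN toward the real service port and is therefore tracked correctly.
	ct["new-conn"] = track(packet{srcPort: 48313, dstPort: apiServerPort})
	fmt.Println("re-established connection, allowed:", allowed(ct["new-conn"], apiServerPort)) // true
}
```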

I don’t think that specific test has the workaround. Only K8sDatapathConfig Host Firewall has it.

The dashboard you want to follow is: https://lookerstudio.google.com/s/vquN3vXM-kM.

[dashboard screenshot]

I'd say by Monday we will have "official" confirmation, and then we can extend the workaround and consider what we should document.

Just checked all linked cases since merging https://github.com/cilium/cilium/pull/25323: all failures have happened on branches not rebased on top of the potential fix.

If anyone is reporting this issue, make sure your branch has https://github.com/cilium/cilium/commit/439a0a059fdcabe23a33b427b637494bc5a59eda.

I checked recent failures, and this flake seems to have disappeared since we merged https://github.com/cilium/cilium/pull/16064 (which parallelized the hostfw tests and made them a lot faster) on May 11th. The list of PRs linked to this issue above also seems to attest to that. I'll keep this open for a short while longer and we'll close it if that's confirmed.