cilium: CI: Endpoint can still connect while Cilium is not running

https://jenkins.cilium.io/job/Ginkgo-CI-Tests-Pipeline/2299/testReport/junit/k8s-1/10/K8sValidatedChaosTest_Endpoint_can_still_connect_while_Cilium_is_not_running/

Stacktrace

/home/jenkins/workspace/Ginkgo-CI-Tests-Pipeline/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:312
Cannot curl from "testclient-vrs9w" to testds-service
Expected
    <bool>: false
to be true
/home/jenkins/workspace/Ginkgo-CI-Tests-Pipeline/src/github.com/cilium/cilium/test/k8sT/Chaos.go:126
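The boolean being asserted on is the success of the curl from the client pod to testds-service. A standalone approximation of that failing check, built from the exact kubectl/curl command that appears in the output below (this is a hypothetical reproduction for illustration, not the test code from Chaos.go):

package main

import (
	"fmt"
	"os/exec"
)

// Standalone approximation of the check that fails at Chaos.go:126: curl the
// testds-service from inside the client pod and treat a non-zero curl exit
// status as "cannot connect". The command line is copied from the test output.
func main() {
	pod := "testclient-vrs9w" // pod name taken from the failure message
	cmd := exec.Command("kubectl", "exec", "-n", "default", pod, "--",
		"curl", "-s", "--fail", "--connect-timeout", "5", "--max-time", "5",
		"http://testds-service.default.svc.cluster.local:80/")
	if err := cmd.Run(); err != nil {
		fmt.Printf("Cannot curl from %q to testds-service: %v\n", pod, err)
		return
	}
	fmt.Println("curl from", pod, "to testds-service succeeded")
}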
Standard Output

time="2018-05-02T06:39:28Z" level=info msg=Starting testName=K8sChaosTest
time="2018-05-02T06:39:28Z" level=info msg="Vagrant: running command 'vagrant ssh-config k8s1-1.10'"
time="2018-05-02T06:39:29Z" level=debug msg="generated SSHConfig for node k8s1-1.10"
time="2018-05-02T06:39:29Z" level=debug msg="done importing ssh config"
time="2018-05-02T06:39:29Z" level=debug msg="running command: ls /tmp/"
time="2018-05-02T06:39:30Z" level=debug msg="running command: cat ${HOME}/.kube/config"
time="2018-05-02T06:39:30Z" level=debug msg="applying /home/vagrant/go/src/github.com/cilium/cilium/test/k8sT/manifests/cilium_ds.yaml" testName=K8sChaosTest
time="2018-05-02T06:39:30Z" level=debug msg="running command: kubectl apply -f  /home/vagrant/go/src/github.com/cilium/cilium/test/k8sT/manifests/cilium_ds.yaml"
time="2018-05-02T06:39:35Z" level=debug msg="running command: kubectl -n kube-system get pods -l k8s-app=cilium -o json"
time="2018-05-02T06:39:46Z" level=debug msg="running command: kubectl -n kube-system get pods -l k8s-app=kube-dns -o json"
time="2018-05-02T06:39:46Z" level=debug msg="applying /home/vagrant/go/src/github.com/cilium/cilium/test/k8sT/manifests/demo_ds.yaml" testName=K8sChaosTest
time="2018-05-02T06:39:46Z" level=debug msg="running command: kubectl apply -f  /home/vagrant/go/src/github.com/cilium/cilium/test/k8sT/manifests/demo_ds.yaml"
time="2018-05-02T06:39:51Z" level=debug msg="running command: kubectl -n default get pods -l zgroup=testDS -o json"
time="2018-05-02T06:39:52Z" level=info msg="WaitforPods: pods are not ready" data="false false" filter="-l zgroup=testDS" namespace=default testName=K8sChaosTest
time="2018-05-02T06:39:56Z" level=debug msg="running command: kubectl -n default get pods -l zgroup=testDS -o json"
time="2018-05-02T06:40:02Z" level=debug msg="running command: kubectl -n default get pods -l zgroup=testDSClient -o json"
time="2018-05-02T06:40:02Z" level=debug msg="running command: kubectl -n default get pods -l zgroup=testDS -o jsonpath='{range .items[*]}{@.metadata.name}{\"=\"}{@.status.podIP}{\"\\n\"}{end}'"
time="2018-05-02T06:40:02Z" level=debug msg="running command: kubectl exec -n default testclient-qdkdl -- ping -W 2 -c 5 10.10.1.95"
time="2018-05-02T06:40:07Z" level=debug msg="Pod testclient-qdkdl ping 10.10.1.95"
time="2018-05-02T06:40:07Z" level=debug msg="running command: kubectl -n kube-system get service kube-dns -o json"
time="2018-05-02T06:40:12Z" level=debug msg="running command: dig +short testds-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^$'"
time="2018-05-02T06:40:12Z" level=debug msg="running command: kubectl exec -n default testclient-qdkdl -- curl -s --fail --connect-timeout 5 --max-time 5 http://testds-service.default.svc.cluster.local:80/"
time="2018-05-02T06:40:12Z" level=debug msg="running command: kubectl exec -n default testclient-qdkdl -- ping -W 2 -c 5 10.10.0.158"
time="2018-05-02T06:40:17Z" level=debug msg="Pod testclient-qdkdl ping 10.10.0.158"
time="2018-05-02T06:40:17Z" level=debug msg="running command: kubectl -n kube-system get service kube-dns -o json"
time="2018-05-02T06:40:22Z" level=debug msg="running command: dig +short testds-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^$'"
time="2018-05-02T06:40:22Z" level=debug msg="running command: kubectl exec -n default testclient-qdkdl -- curl -s --fail --connect-timeout 5 --max-time 5 http://testds-service.default.svc.cluster.local:80/"
time="2018-05-02T06:40:23Z" level=debug msg="running command: kubectl exec -n default testclient-vrs9w -- ping -W 2 -c 5 10.10.1.95"
time="2018-05-02T06:40:27Z" level=debug msg="Pod testclient-vrs9w ping 10.10.1.95"
time="2018-05-02T06:40:27Z" level=debug msg="running command: kubectl -n kube-system get service kube-dns -o json"
time="2018-05-02T06:40:33Z" level=debug msg="running command: dig +short testds-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^$'"
time="2018-05-02T06:40:33Z" level=debug msg="running command: kubectl exec -n default testclient-vrs9w -- curl -s --fail --connect-timeout 5 --max-time 5 http://testds-service.default.svc.cluster.local:80/"
time="2018-05-02T06:40:33Z" level=debug msg="running command: kubectl exec -n default testclient-vrs9w -- ping -W 2 -c 5 10.10.0.158"
time="2018-05-02T06:40:38Z" level=debug msg="Pod testclient-vrs9w ping 10.10.0.158"
time="2018-05-02T06:40:38Z" level=debug msg="running command: kubectl -n kube-system get service kube-dns -o json"
time="2018-05-02T06:40:43Z" level=debug msg="running command: dig +short testds-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^$'"
time="2018-05-02T06:40:43Z" level=debug msg="running command: kubectl exec -n default testclient-vrs9w -- curl -s --fail --connect-timeout 5 --max-time 5 http://testds-service.default.svc.cluster.local:80/"
STEP: Deleting cilium pods
time="2018-05-02T06:40:43Z" level=debug msg="running command: kubectl -n kube-system delete pods -l k8s-app=cilium"
time="2018-05-02T06:40:49Z" level=debug msg="running command: kubectl -n kube-system get pods -l k8s-app=cilium -o json"
time="2018-05-02T06:40:49Z" level=info msg="WaitforPods: pods are not ready" data="false true" filter="-l k8s-app=cilium" namespace=kube-system testName=K8sChaosTest
time="2018-05-02T06:40:54Z" level=debug msg="running command: kubectl -n kube-system get pods -l k8s-app=cilium -o json"
time="2018-05-02T06:40:54Z" level=info msg="WaitforPods: pods are not ready" data="false true" filter="-l k8s-app=cilium" namespace=kube-system testName=K8sChaosTest
time="2018-05-02T06:40:59Z" level=debug msg="running command: kubectl -n kube-system get pods -l k8s-app=cilium -o json"
time="2018-05-02T06:40:59Z" level=debug msg="running command: kubectl -n default get pods -l zgroup=testDS -o jsonpath='{range .items[*]}{@.metadata.name}{\"=\"}{@.status.podIP}{\"\\n\"}{end}'"
time="2018-05-02T06:40:59Z" level=debug msg="running command: kubectl exec -n default testclient-qdkdl -- ping -W 2 -c 5 10.10.1.95"
time="2018-05-02T06:41:04Z" level=debug msg="Pod testclient-qdkdl ping 10.10.1.95"
time="2018-05-02T06:41:04Z" level=debug msg="running command: kubectl -n kube-system get service kube-dns -o json"
time="2018-05-02T06:41:10Z" level=debug msg="running command: dig +short testds-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^$'"
time="2018-05-02T06:41:10Z" level=debug msg="running command: kubectl exec -n default testclient-qdkdl -- curl -s --fail --connect-timeout 5 --max-time 5 http://testds-service.default.svc.cluster.local:80/"
time="2018-05-02T06:41:10Z" level=debug msg="running command: kubectl exec -n default testclient-qdkdl -- ping -W 2 -c 5 10.10.0.158"
time="2018-05-02T06:41:15Z" level=debug msg="Pod testclient-qdkdl ping 10.10.0.158"
time="2018-05-02T06:41:15Z" level=debug msg="running command: kubectl -n kube-system get service kube-dns -o json"
time="2018-05-02T06:41:20Z" level=debug msg="running command: dig +short testds-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^$'"
time="2018-05-02T06:41:20Z" level=debug msg="running command: kubectl exec -n default testclient-qdkdl -- curl -s --fail --connect-timeout 5 --max-time 5 http://testds-service.default.svc.cluster.local:80/"
time="2018-05-02T06:41:20Z" level=debug msg="running command: kubectl exec -n default testclient-vrs9w -- ping -W 2 -c 5 10.10.1.95"
time="2018-05-02T06:41:25Z" level=debug msg="Pod testclient-vrs9w ping 10.10.1.95"
time="2018-05-02T06:41:25Z" level=debug msg="running command: kubectl -n kube-system get service kube-dns -o json"
time="2018-05-02T06:41:30Z" level=debug msg="running command: dig +short testds-service.default.svc.cluster.local @10.96.0.10 | grep -v -e '^$'"
time="2018-05-02T06:41:30Z" level=debug msg="running command: kubectl exec -n default testclient-vrs9w -- curl -s --fail --connect-timeout 5 --max-time 5 http://testds-service.default.svc.cluster.local:80/"

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

I searched the logs and ignored the copious amounts of policy computation logs, as I view them as extraneous here:

grep "47219" cilium-4bc4g-logs.log | grep -v "Evaluating context for" > out.txt

The endpoint starts out with the reserved init identity, as expected:

2018-06-05T08:23:38.491617843Z level=debug msg="PUT /endpoint/{id} request" params="{HTTPRequest:0xc42417ab00 Endpoint:0xc421d4c3c0 ID:cilium-local:47219}"
2018-06-05T08:23:38.492685419Z level=debug msg="Assigning security relevant label" containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName=":" labels="reserved:init" policyRevision=0
2018-06-05T08:23:38.492721661Z level=debug msg="Endpoint has reserved identity, changing synchronously" containerID=4764327edc endpointID=47219 identityLabels="reserved:init" ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName=":" policyRevision=0
2018-06-05T08:23:38.492763522Z level=debug msg="Resolving identity for labels" containerID=4764327edc endpointID=47219 identityLabels="reserved:init" ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName=":" policyRevision=0
2018-06-05T08:23:38.492789242Z level=debug msg="Assigned new identity to endpoint" containerID=4764327edc endpointID=47219 identity=5 identityLabels="reserved:init" ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName=":" policyRevision=0
2018-06-05T08:23:38.49282331Z level=debug msg="Set identity for this endpoint" code=OK containerID=4764327edc endpointID=47219 endpointState=ready ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName=":" policyRevision=0 type=0

Policy enforcement is enabled for the endpoint because it is still in init mode (per the code block in the previous comment above):

2018-06-05T08:23:38.512149005Z level=debug msg="policy enforcement for ingress and egress enabled" containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName=":" policyRevision=0
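The rule being referred to (the code block from the previous comment is not reproduced here) amounts to: while the endpoint carries only the reserved init label, enforce policy in both directions. A minimal illustrative sketch of that decision, assuming this simplified shape rather than the actual function in pkg/endpoint:

package main

import "fmt"

// Illustrative sketch only (not the actual Cilium code): while an endpoint
// still carries only the reserved "init" identity label, policy enforcement
// is enabled for both ingress and egress, matching the log line above.
func enforcementForLabels(identityLabels map[string]struct{}) (ingress, egress bool) {
	if _, isInit := identityLabels["reserved:init"]; isInit {
		return true, true // drop traffic until the real identity is known
	}
	// Once real labels arrive and no policy selects the endpoint, enforcement
	// would normally be disabled again in the default enforcement mode.
	return false, false
}

func main() {
	ingress, egress := enforcementForLabels(map[string]struct{}{"reserved:init": {}})
	fmt.Printf("ingress=%t egress=%t\n", ingress, egress) // ingress=true egress=true
}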

The endpoint then gets a new identity because we received its labels from K8s:

2018-06-05T08:23:40.298118665Z level=debug msg="TriggerPolicyUpdatesLocked: changed: false" containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName=":" policyRevision=3
2018-06-05T08:23:40.298208655Z level=debug msg="Skipped invalid state transition to waiting-to-regenerate due to: Triggering endpoint regeneration due to policy updates" code=Warning containerID=4764327edc endpointID=47219 endpointState=waiting-to-regenerate ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName=":" policyRevision=3 type=0
2018-06-05T08:23:40.298248479Z level=debug msg="Shutting down controller for stopForUpdate; skipping stop function" name=sync-policymap-47219 subsys=controller uuid=c069a6f8-6899-11e8-a9ef-080027aa233b
2018-06-05T08:23:40.347270766Z level=debug msg="Refreshing labels of endpoint" containerID=4764327edc endpointID=47219 identityLabels="k8s:io.kubernetes.pod.namespace=default,k8s:zgroup=testDSClient" infoLabels="container:annotation.kubernetes.io/config.seen=2018-06-05T08:23:37.502585256Z,container:annotation.kubernetes.io/config.source=api,container:annotation.kubernetes.io/created-by={\"kind\":\"SerializedReference\",\"apiVersion\":\"v1\",\"reference\":{\"kind\":\"DaemonSet\",\"namespace\":\"default\",\"name\":\"testclient\",\"uid\":\"bec7a8d5-6899-11e8-874e-080027aa233b\",\"apiVersion\":\"extensions\",\"resourceVersion\":\"1159\"}}\n,container:io.kubernetes.container.name=POD,container:io.kubernetes.docker.type=podsandbox,container:io.kubernetes.pod.name=testclient-q5lx7,container:io.kubernetes.pod.uid=beca070c-6899-11e8-874e-080027aa233b,k8s:controller-revision-hash=2382745371,k8s:pod-template-generation=1"
2018-06-05T08:23:40.347394227Z level=debug msg="Assigning information label" containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" labels="container:annotation.kubernetes.io/created-by={\"kind\":\"SerializedReference\",\"apiVersion\":\"v1\",\"reference\":{\"kind\":\"DaemonSet\",\"namespace\":\"default\",\"name\":\"testclient\",\"uid\":\"bec7a8d5-6899-11e8-874e-080027aa233b\",\"apiVersion\":\"extensions\",\"resourceVersion\":\"1159\"}}\n" policyRevision=3
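The identityLabels/infoLabels split visible in the "Refreshing labels of endpoint" line follows from how labels are filtered by source; a rough sketch of that filtering, using only the prefixes that appear in the log above (the helper below is illustrative, not Cilium's implementation):

package main

import (
	"fmt"
	"strings"
)

// Rough illustration of the identity/info split shown in the log above:
// k8s:* labels become security-relevant identity labels, except bookkeeping
// labels such as controller-revision-hash and pod-template-generation, while
// container:* runtime labels are kept as informational labels only.
func splitLabels(all []string) (identity, info []string) {
	ignored := map[string]bool{
		"k8s:controller-revision-hash": true,
		"k8s:pod-template-generation":  true,
	}
	for _, label := range all {
		key := strings.SplitN(label, "=", 2)[0]
		if strings.HasPrefix(key, "k8s:") && !ignored[key] {
			identity = append(identity, label)
		} else {
			info = append(info, label)
		}
	}
	return identity, info
}

func main() {
	identity, info := splitLabels([]string{
		"k8s:io.kubernetes.pod.namespace=default",
		"k8s:zgroup=testDSClient",
		"k8s:controller-revision-hash=2382745371",
		"k8s:pod-template-generation=1",
		"container:io.kubernetes.pod.name=testclient-q5lx7",
	})
	fmt.Println("identityLabels:", identity)
	fmt.Println("infoLabels:", info)
}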

We then skip regeneration of policy for some reason; I'm not sure where these "Waiting for proxy updates to complete..." log messages come into play here:

2018-06-05T08:23:40.643061297Z level=debug msg="Waiting for proxy updates to complete..." containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" policyRevision=3
2018-06-05T08:23:40.643164371Z level=debug msg="Wait time for proxy updates: 183.991µs" containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" policyRevision=3
2018-06-05T08:23:40.643250984Z level=info msg="Regeneration of BPF program has completed" buildDuration=2.133814692s containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" policyRevision=3
2018-06-05T08:23:40.643350182Z level=info msg="Endpoint policy recalculated" containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" policyRevision=3
2018-06-05T08:23:40.643430853Z level=debug msg="Skipped invalid state transition to ready due to: Completed endpoint regeneration with no pending regeneration requests" code=Warning containerID=4764327edc endpointID=47219 endpointState=waiting-to-regenerate ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" policyRevision=3 type=0
2018-06-05T08:23:40.643561195Z level=debug msg="Successfully regenerated endpoint program due to updated security labels" code=OK containerID=4764327edc endpointID=47219 endpointState=waiting-to-regenerate ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" policyRevision=3 type=200
2018-06-05T08:23:40.644953456Z level=debug msg="Regenerating endpoint..." containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" policyRevision=3
2018-06-05T08:23:40.645145413Z level=info msg="Regenerating BPF program" containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" policyRevision=3 startTime="2018-06-05 08:23:40.645103821 +0000 UTC m=+47.842411312"
2018-06-05T08:23:40.645319305Z level=debug msg="Regenerating Endpoint BPF: endpoint policy updated & changes were needed" code=OK containerID=4764327edc endpointID=47219 endpointState=regenerating ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" policyRevision=3 type=0
2018-06-05T08:23:40.645371271Z level=debug msg="Starting regenerate..." containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" policyRevision=3
2018-06-05T08:23:40.645459826Z level=debug msg="skipping policy recalculation" containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" policyChanged=false policyRevision=3 policyRevision.next=3 policyRevision.repo=3
2018-06-05T08:23:40.648045457Z level=debug msg="BPF header file hashed (was: \"f2c633615da0ae6d875ee14578982875\")" bpfHeaderfileHash=b70cfda2194f482b4536895bd97994fe containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" policyRevision=3

The key point is that, after picking up this new identity, we appear to be 'skipping policy recalculation':

2018-06-05T08:23:40.645459826Z level=debug msg="skipping policy recalculation" containerID=4764327edc endpointID=47219 ipv4=10.10.0.144 ipv6="f00d::a0a:0:0:b873" k8sPodName="default:testclient-q5lx7" policyChanged=false policyRevision=3 policyRevision.next=3 policyRevision.repo=3

We skip policy recalculation even though the identity has changed and policy enforcement should now be disabled for ingress/egress:

https://github.com/cilium/cilium/blob/18ef7f2b8fa5a7017f5027e88a5274b96e6786cf/pkg/endpoint/policy.go#L492-L504

And ingress/egress policy enforcement is only updated later in the same function, after the block of code above, which indicates that the enforcement-update code is never reached and, as such, the endpoint is not regenerated correctly:

https://github.com/cilium/cilium/blob/18ef7f2b8fa5a7017f5027e88a5274b96e6786cf/pkg/endpoint/policy.go#L549-L570
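Put together, the failure mode being described is an early "skip" return that fires before the enforcement flags are recomputed further down in the same function. The following is a purely illustrative sketch of that control-flow pattern, using the values from the "skipping policy recalculation" log line; it is not the code linked above:

package main

import "fmt"

// Purely illustrative sketch (not the linked Cilium code) of the control flow
// described above: an early skip that returns before the ingress/egress
// enforcement flags are recomputed later in the same function, so an identity
// change never relaxes the enforcement that was set up in init mode.
type endpoint struct {
	nextPolicyRevision uint64
	ingressEnforced    bool
	egressEnforced     bool
}

func (e *endpoint) regeneratePolicy(repoRevision uint64, policyChanged bool) {
	// Guard corresponding to "skipping policy recalculation" with
	// policyChanged=false, policyRevision.next=3, policyRevision.repo=3 ...
	if !policyChanged && e.nextPolicyRevision >= repoRevision {
		fmt.Println("skipping policy recalculation")
		return // ... which means the code below never runs
	}

	// Enforcement is only recomputed here, after the early return, so the
	// endpoint keeps its init-time "enforce everything" settings even though
	// its identity has changed and no policy selects it.
	e.ingressEnforced = false
	e.egressEnforced = false
}

func main() {
	ep := &endpoint{nextPolicyRevision: 3, ingressEnforced: true, egressEnforced: true}
	ep.regeneratePolicy(3, false)
	fmt.Printf("ingress=%t egress=%t\n", ep.ingressEnforced, ep.egressEnforced) // still true/true
}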

This is a release-blocker, IMO.