cilium: CI: 1.9.K8sValidatedPolicyTest - DNS "redis-master.default.svc.cluster.local" is not ready after timeout

https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-K8s/139/testReport/junit/k8s-1/9/K8sValidatedPolicyTest_GuestBook_Examples_checks_policy_example/

/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-K8s/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:312
DNS entry is not ready after timeout
Expected
    <*errors.errorString | 0xc420049cb0>: {
        s: "Timeout reached: DNS \"redis-master.default.svc.cluster.local\" is not ready after timeout",
    }
to be nil
/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-K8s/src/github.com/cilium/cilium/test/k8sT/Policies.go:723

PR https://github.com/cilium/cilium/pull/3874

test_results_Cilium-PR-Ginkgo-Tests-K8s_139.zip

Most upvoted comments

So, I finally saw a test fail on https://github.com/cilium/cilium/pull/4406 (this tested undoing the MTU changes). The errors are a little different, and the branch doesn’t include adde5de7053849bf0475c9047f40b6d20a04316c but that isn’t terribly relevant.

tl;dr: we seem to be missing routes to cluster IPs, so kubedns can’t connect to the kube-apiserver. Cilium itself can reach it, however, so I’m not sure I’m interpreting this correctly.

f0c5341e_K8sValidatedUpdates_Updating_Cilium_stable_to_master.zip

The most interesting error is from kubedns.log:

E0612 19:23:34.370631 143 reflector.go:201] k8s.io/dns/pkg/dns/dns.go:192: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.96.0.1:443: getsockopt: no route to host
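To make that failure mode concrete, here is a minimal Go sketch (not part of the test suite; the address is copied from the log above) of the check kube-dns is effectively failing, a plain TCP dial to the apiserver cluster IP. A "no route to host" error from this dial points at routing/datapath trouble rather than DNS or TLS.

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Cluster IP and port taken from the kubedns error above.
	const apiserver = "10.96.0.1:443"

	conn, err := net.DialTimeout("tcp", apiserver, 5*time.Second)
	if err != nil {
		// Run from the kube-dns pod's netns on k8s2, this should reproduce
		// the "no route to host" failure seen in kubedns.log.
		fmt.Printf("dial %s failed: %v\n", apiserver, err)
		return
	}
	defer conn.Close()
	fmt.Printf("dial %s succeeded from %s\n", apiserver, conn.LocalAddr())
}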

Note that kubedns is running on k8s2 but the apiserver is on k8s1. This is what @eloycoto pointed out above. I had hoped it was because of the MTU stuff, but it isn’t. It’s also worth noting that kubedns is the only pod with restartCount > 0: kubedns: 5, sidecar: 4 and dnsmasq: 4. The cilium pod on k8s2 is cilium-490kg. Neither cilium instance has kubedns in its lb map.

The cilium-health report is also bad, like what @eloycoto saw:

cilium-health status

Probe time:   2018-06-12T19:23:25Z
Nodes:
  k8s2 (localhost):
    Host connectivity to 192.168.36.12:
      ICMP:          OK, RTT=409.49µs
      HTTP via L3:   OK, RTT=329.567µs
    Endpoint connectivity to 10.10.1.172:
      ICMP:          Connection timed out
      HTTP via L3:   Connection timed out
  k8s1:
    Host connectivity to 192.168.36.11:
      ICMP:          OK, RTT=662.562µs
      HTTP via L3:   OK, RTT=1.031337ms
    Endpoint connectivity to 10.10.0.172:
      ICMP:          Connection timed out
      HTTP via L3:   Connection timed out

Oddly, cilium doesn’t know about kubedns:

ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6                   IPv4          STATUS   
           ENFORCEMENT        ENFORCEMENT                                                                                     
29898      Disabled           Disabled          25244      reserved:health               f00d::a0a:100:0:74ca   10.10.1.172   ready   

We seem to be missing the kube-apiserver cluster IP route (clusterIP: 10.96.0.1, backend: 192.168.36.11). I’m not sure why. The podCIDRs for the nodes are 10.10.0.0/24 and 10.10.1.0/24.

k8s2 ip r

default via 10.0.2.2 dev enp0s3 proto dhcp src 10.0.2.15 metric 100 
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15 
10.0.2.2 dev enp0s3 proto dhcp scope link src 10.0.2.15 metric 100 
10.10.0.0/24 via 10.10.1.44 dev cilium_host src 10.10.1.44 
10.10.1.0/24 via 10.10.1.44 dev cilium_host src 10.10.1.44 
10.10.1.44 dev cilium_host scope link 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
172.28.128.0/24 dev enp0s9 proto kernel scope link src 172.28.128.5 
192.168.36.0/24 dev enp0s8 proto kernel scope link src 192.168.36.12 

k8s1 ip r

default via 10.0.2.2 dev enp0s3 proto dhcp src 10.0.2.15 metric 100 
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15 
10.0.2.2 dev enp0s3 proto dhcp scope link src 10.0.2.15 metric 100 
10.10.0.0/24 via 10.10.0.44 dev cilium_host src 10.10.0.44 
10.10.0.44 dev cilium_host scope link 
10.10.1.0/24 via 10.10.0.44 dev cilium_host src 10.10.0.44 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
172.28.128.0/24 dev enp0s9 proto kernel scope link src 172.28.128.4 
192.168.36.0/24 dev enp0s8 proto kernel scope link src 192.168.36.11 
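As a sanity check on the “missing cluster IP route” observation, the following Go sketch (just an illustration; the prefixes are copied from the k8s2 routing table above) confirms that 10.96.0.1 falls under none of the specific routes, so only the default route applies:

package main

import (
	"fmt"
	"net"
)

func main() {
	clusterIP := net.ParseIP("10.96.0.1")

	// Destination prefixes from the k8s2 routing table above (default route omitted).
	prefixes := []string{
		"10.0.2.0/24",
		"10.10.0.0/24",
		"10.10.1.0/24",
		"172.17.0.0/16",
		"172.28.128.0/24",
		"192.168.36.0/24",
	}

	covered := false
	for _, p := range prefixes {
		_, ipnet, err := net.ParseCIDR(p)
		if err != nil {
			fmt.Printf("bad prefix %q: %v\n", p, err)
			continue
		}
		if ipnet.Contains(clusterIP) {
			fmt.Printf("%s is covered by %s\n", clusterIP, p)
			covered = true
		}
	}
	if !covered {
		// Only the default route matches, so traffic to the cluster IP leaves
		// via enp0s3 instead of being handled by the service datapath.
		fmt.Printf("%s matches no specific route\n", clusterIP)
	}
}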

Looking through the logs for 172.17, we see:

$ grep -E '172.17' cilium-490kg-logs.log
2018-06-12T19:15:20.565200637Z level=debug msg="Considering route" route="{Ifindex: 5 Dst: 172.17.0.0/16 Src: 172.17.0.1 Gw: <nil> Flags: [] Table: 254}"
2018-06-12T19:15:20.568913369Z level=debug msg="Considering removing iptables rule" obj="-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE"
2018-06-12T19:15:20.568923164Z level=debug msg="Considering removing iptables rule" obj="-A POSTROUTING -s 172.17.0.2/32 -d 172.17.0.2/32 -p tcp -m tcp --dport 5000 -j MASQUERADE"
2018-06-12T19:15:20.576692598Z level=debug msg="Considering removing iptables rule" obj="-A DOCKER ! -i docker0 -p tcp -m tcp --dport 5000 -j DNAT --to-destination 172.17.0.2:5000"
2018-06-12T19:15:20.594023875Z level=debug msg="Considering removing iptables rule" obj="-A DOCKER -d 172.17.0.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 5000 -j ACCEPT"
2018-06-12T19:15:21.990182549Z level=debug msg="Skipping network because of gateway mismatch" endpointID=e78e9f05f07d266375ad9fc29fb94197bc492bed7dd83cda5c1a8c2362d4414e obj="&{<nil> [] [] ae6644e1a2b3b1c607dc92bd29bdc96e5196b32414d1784d383063693cc4548b e78e9f05f07d266375ad9fc29fb94197bc492bed7dd83cda5c1a8c2362d4414e 172.17.0.1 172.17.0.2 16   0 02:42:ac:11:00:02}" subsys=containerd-watcher

Looking for that message, it comes from pkg/workloads/docker.go:380: scopedLog.WithField(logfields.Object, contNetwork).Debug("Skipping network because of gateway mismatch"). Oddly, that happens in getCiliumIPv6, which is used when deciding whether to ignore workloads in pkg/workloads/docker.go.
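For context, here is a loose, hypothetical Go sketch of the shape of that check. The type and helper names below are made up, and the real getCiliumIPv6 logic compares against Cilium’s own (IPv6) addressing, so treat this only as an illustration of why a docker0 attachment with gateway 172.17.0.1 gets skipped:

package main

import "fmt"

// containerNetwork holds the fields of interest from a container's network
// attachment (hypothetical type, for illustration only).
type containerNetwork struct {
	Name    string
	Gateway string
}

// pickCiliumNetwork returns the first attached network whose gateway matches
// the expected Cilium gateway, logging a skip for every mismatch
// (hypothetical helper, not the real Cilium function).
func pickCiliumNetwork(networks []containerNetwork, ciliumGateway string) (containerNetwork, bool) {
	for _, n := range networks {
		if n.Gateway != ciliumGateway {
			fmt.Printf("Skipping network %q because of gateway mismatch (%s != %s)\n",
				n.Name, n.Gateway, ciliumGateway)
			continue
		}
		return n, true
	}
	return containerNetwork{}, false
}

func main() {
	// The docker0 attachment from the log above; 10.10.1.44 is cilium_host on k8s2.
	networks := []containerNetwork{{Name: "bridge", Gateway: "172.17.0.1"}}
	if _, ok := pickCiliumNetwork(networks, "10.10.1.44"); !ok {
		fmt.Println("no Cilium-managed network found; workload ignored")
	}
}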

10.96 is mentioned in a similar context:

2018-06-12T19:15:20.577310354Z level=debug msg="Considering removing iptables rule" obj="-A KUBE-SERVICES ! -s 10.10.0.0/16 -d 10.96.0.1/32 -p tcp -m comment --comment \"default/kubernetes:https cluster IP\" -m tcp --dport 443 -j KUBE-MARK-MASQ"
2018-06-12T19:15:20.577394862Z level=debug msg="Considering removing iptables rule" obj="-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment \"default/kubernetes:https cluster IP\" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y"

but there is no corresponding skip/remove log entry for it.

@eloycoto could this be a conflict between a hardcoded podCIDR CLI option to cilium and a derived value from k8s?

@raybejjani the “0.0.0.0 (0)” entry in cilium bpf lb list is normal; it is used in the BPF datapath to hold the number of backends the service has (I investigated this briefly in https://github.com/cilium/cilium/issues/3905).
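For readers unfamiliar with the lb map, here is a conceptual Go sketch of that master-entry idea. The types, field names and printed format are made up and the real BPF map layout differs; it only illustrates why a service always carries one extra, zero-address entry that records the backend count:

package main

import "fmt"

// lbValue mimics the shape of a service map value: slot 0 is the master entry
// whose Count field says how many real backends follow (hypothetical type).
type lbValue struct {
	BackendIP string
	Count     int
}

func main() {
	// Service map for the kube-apiserver cluster IP, keyed by backend slot.
	svc := map[int]lbValue{
		0: {BackendIP: "0.0.0.0", Count: 1}, // master entry, no real backend
		1: {BackendIP: "192.168.36.11"},     // the actual apiserver backend
	}

	master := svc[0]
	fmt.Printf("service has %d backend(s)\n", master.Count)
	for slot := 1; slot <= master.Count; slot++ {
		fmt.Printf("slot %d -> %s\n", slot, svc[slot].BackendIP)
	}
}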