cilium: CI: ip-identity watcher stopped receiving events, causing egress policy test failures

Looking at the CI run output here: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/2839/testReport/junit/(root)/runtime/RuntimeValidatedPolicies_L3_L4_Checks/

monitor.log shows (only the relevant lines included):

[2018-04-30 22:10:30] CPU 01: MARK 0xd943dbd1 FROM 17990 DEBUG: Failed to map daddr=10.15.157.163 to identity
[2018-04-30 22:10:31] CPU 01: MARK 0xd1f1055e FROM 17990 DEBUG: Failed to map daddr=10.15.157.163 to identity
[2018-04-30 22:10:33] CPU 01: MARK 0x6def8c8 FROM 17990 DEBUG: Failed to map daddr=10.15.157.163 to identity
[2018-04-30 22:10:37] CPU 01: MARK 0x7243a868 FROM 17990 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:38] CPU 01: MARK 0xb1e679db FROM 17990 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:40] CPU 01: MARK 0xbb1aa16e FROM 17990 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:42] CPU 01: MARK 0xc601188 FROM 17990 DEBUG: Failed to map daddr.p4=[::0:f426] to identity
[2018-04-30 22:10:43] CPU 01: MARK 0x68a2275c FROM 17990 DEBUG: Failed to map daddr.p4=[::0:f426] to identity
[2018-04-30 22:10:45] CPU 01: MARK 0xca8dbcd9 FROM 17990 DEBUG: Failed to map daddr.p4=[::0:f426] to identity
[2018-04-30 22:10:48] CPU 01: MARK 0xe9251566 FROM 17607 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:49] CPU 01: MARK 0x5daf9c85 FROM 17607 DEBUG: Failed to map daddr=10.15.83.26 to identity
[2018-04-30 22:10:51] CPU 01: MARK 0x9a98f2ba FROM 17607 DEBUG: Failed to map daddr=10.15.83.26 to identity

Both the IPv4 address (10.15.83.26) and the IPv6 address (::0:f426, which should match f00d::a0f:0:0:f426?) belong to container:id.httpd2 (from test-output.log):

17607      Disabled           Enabled           49868      container:id.app3             f00d::a0f:0:0:44c7   10.15.195.104   ready   
17990      Disabled           Enabled           10211      container:id.app2             f00d::a0f:0:0:4646   10.15.106.144   ready   
18544      Enabled            Disabled          37841      container:id.httpd1           f00d::a0f:0:0:4870   10.15.157.163   ready   
                                                           container:id.service1                                                      
28441      Disabled           Disabled          61150      container:id.app1             f00d::a0f:0:0:6f19   10.15.77.25     ready   
29898      Disabled           Disabled          4319       reserved:health               f00d::a0f:0:0:74ca   10.15.242.54    ready   
49896      Disabled           Disabled          37971      container:id.httpd3           f00d::a0f:0:0:c2e8   10.15.152.195   ready   
                                                           container:id.service1                                                      
62502      Enabled            Disabled          25083      container:id.httpd2           f00d::a0f:0:0:f426   10.15.83.26     ready   
                                                           container:id.service1                                                      

And indeed, the BPF ipcache has no entries for these IPs (from cilium-bpf-ipcache-list.md):

IP PREFIX/ADDRESS    IDENTITY   
f00d::a0f:0:0:1150   44408      
10.15.202.84         44408      
fc00::10ca:1         1          
10.15.28.238         1          
f00d::a0f:0:0:74ca   4319       
10.0.2.15            1          
10.15.242.54         4319       
f00d::a0f:0:0:8ad6   1          

In fact, the BPF ipcache appears to be missing entries for all the other endpoints as well.

I am not sure how to interpret kvstore_get.txt, but it contains this one line matching the IPv4 address of httpd2 that cannot be resolved to an identity:

cilium/state/ip/v1/default/10.15.83.26 => {"IP":"10.15.83.26","ID":25083,"Metadata":"cilium-global:default:runtime:62502"}

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

@joestringer can we debug this together tomorrow?

Hit this in #3835 too. There is evidence that the kvstore is updated with the relevant IP->ID mapping; however, the IP->ID kvstore watcher does not appear to have received any events for about five minutes. The monitor shows packets being dropped because the identity maps to the cluster ID rather than to the endpoint IP's ID. I suspect that in some cases consul does not properly trigger create events.

Note that PR #3835 adds CIDR-dependent L4, so the ipcache entries are extended with masks. One implication is that the "failed to map daddr" message in the monitor becomes a successful mapping of the daddr to identity 3 (cluster). This is still wrong; it should map to an actual endpoint identity.

Link: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/3029/

test_results_Cilium-PR-Ginkgo-Tests-Validated_3029.zip

Failure:

/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:312
"app3" expects curl public URL on httpd2 (10.15.77.25) to succeed
Expected
    <bool>: false
to be true
/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/src/github.com/cilium/cilium/test/runtime/Policies.go:377

Logs at time:

STEP: Client "app3" attempting to curl public URL on httpd2
time="2018-05-09T00:42:35Z" level=debug msg="running command: docker exec -i  app3 curl -s --fail --connect-timeout 5 --max-time 5 http://10.15.77.25:80/public"

Monitor drop:

[2018-05-09 00:42:37] ------------------------------------------------------------------------------
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 from-endpoint: 74 bytes, state new, identity 18020->0
[2018-05-09 00:42:37] Ethernet>.{Contents=[..14..] Payload=[..62..] SrcMAC=1e:5c:cb:a3:8e:3d DstMAC=a2:56:61:50:2e:74 EthernetType=IPv4 Length=0}
[2018-05-09 00:42:37] IPv4>.....{Contents=[..20..] Payload=[..40..] Version=4 IHL=5 TOS=0 Length=60 Id=32336 Flags=DF FragOffset=0 TTL=64 Protocol=TCP Checksum=2075 SrcIP=10.15.83.26 DstIP=10.15.77.25 Options=[] Padding=[]}
[2018-05-09 00:42:37] TCP>......{Contents=[..40..] Payload=[] SrcPort=54632 DstPort=80(http) Seq=1746970306 Ack=0 DataOffset=10 FIN=false SYN=true RST=false PSH=false ACK=false URG=false ECE=false CWR=false NS=false Window=28200 Checksum=46207 Urgent=0 Options=[..5..] Padding=[]}
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 DEBUG: Conntrack lookup 1/2: src=10.15.83.26:54632 dst=10.15.77.25:80
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=1
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 DEBUG: CT verdict: New, revnat=0
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 DEBUG: Successfully mapped daddr=10.15.77.25 to identity=3
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 DEBUG: Policy evaluation would deny packet from 18020 to 3
[2018-05-09 00:42:37] ------------------------------------------------------------------------------
[2018-05-09 00:42:37] CPU 01: MARK 0xae11567d FROM 62502 DROP: 74 bytes, reason Policy denied (CIDR), to ifindex 0, identity 18020->0
[2018-05-09 00:42:37] Ethernet>.{Contents=[..14..] Payload=[..62..] SrcMAC=1e:5c:cb:a3:8e:3d DstMAC=a2:56:61:50:2e:74 EthernetType=IPv4 Length=0}
[2018-05-09 00:42:37] IPv4>.....{Contents=[..20..] Payload=[..40..] Version=4 IHL=5 TOS=0 Length=60 Id=32336 Flags=DF FragOffset=0 TTL=64 Protocol=TCP Checksum=2075 SrcIP=10.15.83.26 DstIP=10.15.77.25 Options=[] Padding=[]}
[2018-05-09 00:42:37] TCP>......{Contents=[..40..] Payload=[] SrcPort=54632 DstPort=80(http) Seq=1746970306 Ack=0 DataOffset=10 FIN=false SYN=true RST=false PSH=false ACK=false URG=false ECE=false CWR=false NS=false Window=28200 Checksum=46207 Urgent=0 Options=[..5..] Padding=[]} 

Cilium log for upsert:

$ grep 10.15.77.25 cilium.log | grep upsert
May 09 00:41:47 runtime cilium-agent[11337]: level=debug msg="upserting IP->ID mapping to kvstore" identity=26401 ipAddr=10.15.77.25 ipMask="<nil>" modification=Upsert

KVstore:

$ grep 10.15.77.25 kvstore_get.txt 
cilium/state/ip/v1/default/10.15.77.25 => {"IP":"10.15.77.25","Mask":null,"ID":26401,"Metadata":"cilium-global:default:runtime:28441"}

Logs are missing from the daemon for about 12 seconds around the relevant time:

May 09 00:41:50 runtime cilium-agent[11337]: level=debug msg="Evaluating ingress context for source PolicyID" containerID=c510876817 endpointID=28441 ingress_context="{0 0 <nil> [cidr:9.9.0.0/19 cidr:0.0.0.0/1 cidr:0.0.0.0/3 cidr:9.8.0.0/15 cidr:9.9.0.0/17 cidr:9.9.9.0/28 reserved:world cidr:9.9.9.8/31 cidr:9.9.9.9/3#
May 09 00:41:50 runtime cilium-agent[11337]: level=debug msg="Evaluating egress context for source PolicyID" containerID=c510876817 egress_context="{0 0 <nil> [container:id.service1 container:id.httpd2] [cidr:9.9.0.0/19 cidr:0.0.0.0/1 cidr:0.0.0.0/3 cidr:9.8.0.0/15 cidr:9.9.0.0/17 cidr:9.9.9.0/28 reserved:world cidr:#
May 09 00:42:02 runtime cilium-agent[11337]: level=debug msg="GET /endpoint request" params="{HTTPRequest:0xc42178c900 Labels:[]}"
May 09 00:42:02 runtime cilium-agent[11337]: level=info msg="New monitor connected." count.listener=1

gops suggests that there have been no updates from the kvstore:

goroutine 345 [chan receive, 5 minutes]:                                        
github.com/cilium/cilium/pkg/ipcache.ipIdentityWatcher(0xc4216cd440, 0x2, 0x2)  
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/ipcache/ipcache.go:508 +0x139
created by github.com/cilium/cilium/pkg/ipcache.InitIPIdentityWatcher.func1     
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/ipcache/ipcache.go:643 +0x50 
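For context, a stripped-down sketch of the watcher pattern this dump implies (illustrative only; ipIdentityWatcher's real loop lives in pkg/ipcache/ipcache.go). The watcher blocks on a channel fed by the kvstore client, so if the producer side stops delivering events, the goroutine parks in "chan receive" indefinitely, exactly as gops shows:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	events := make(chan string)

	// Stand-in for ipIdentityWatcher: consumes kvstore events and
	// applies them to the local ipcache.
	go func() {
		for ev := range events {
			fmt.Println("ipcache update:", ev)
		}
	}()

	// One event arrives...
	events <- "upsert 10.15.77.25 -> 26401"

	// ...and if the kvstore client never sends another one (the suspected
	// consul behavior here), the consumer goroutine stays blocked in
	// "chan receive" forever -- the state captured in the goroutine dump.
	time.Sleep(10 * time.Millisecond)
}
```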

Also interesting: another goroutine is blocked on a kvstore operation:

goroutine 24148 [select, 3 minutes]:                                            
net/http.(*persistConn).roundTrip(0xc423f6cc60, 0xc42188c4b0, 0x0, 0x0, 0x0)    
        /usr/local/go/src/net/http/transport.go:1970 +0x608                     
net/http.(*Transport).RoundTrip(0xc4204e22d0, 0xc423a29b00, 0xc4204e22d0, 0x0, 0x0)
        /usr/local/go/src/net/http/transport.go:413 +0x992                      
net/http.send(0xc423a29b00, 0x2def520, 0xc4204e22d0, 0x0, 0x0, 0x0, 0xc420088ba0, 0x7ffff7fa8000, 0xc42407f890, 0x1)
        /usr/local/go/src/net/http/client.go:249 +0x1a9                         
net/http.(*Client).send(0xc422187b90, 0xc423a29b00, 0x0, 0x0, 0x0, 0xc420088ba0, 0x0, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:173 +0xfd                          
net/http.(*Client).Do(0xc422187b90, 0xc423a29b00, 0x2eb9400, 0x0, 0x0)          
        /usr/local/go/src/net/http/client.go:602 +0x28d                         
github.com/cilium/cilium/vendor/github.com/hashicorp/consul/api.(*Client).doRequest(0xc42217cea0, 0xc42333fbd0, 0xc42407fbc0, 0xc421ed80f0, 0x2e, 0xc42333fbd0)
        /home/vagrant/go/src/github.com/cilium/cilium/vendor/github.com/hashicorp/consul/api/api.go:368 +0xbe
github.com/cilium/cilium/vendor/github.com/hashicorp/consul/api.(*KV).deleteInternal(0xc42407fcc0, 0xc421ed80c0, 0x27, 0x0, 0x0, 0xc4204a0e00, 0x0, 0x0, 0x0)
        /home/vagrant/go/src/github.com/cilium/cilium/vendor/github.com/hashicorp/consul/api/kv.go:225 +0x296
github.com/cilium/cilium/vendor/github.com/hashicorp/consul/api.(*KV).Delete(0xc42407fcc0, 0xc421ed80c0, 0x27, 0x0, 0xc421ed80c0, 0x6331000000000027, 0x26)
        /home/vagrant/go/src/github.com/cilium/cilium/vendor/github.com/hashicorp/consul/api/kv.go:200 +0x52
github.com/cilium/cilium/pkg/kvstore.(*consulClient).Delete(0xc420088f88, 0xc421ed80c0, 0x27, 0x0, 0x0)
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/kvstore/consul.go:483 +0x55
github.com/cilium/cilium/pkg/kvstore.Delete(0xc421ed80c0, 0x27, 0x3, 0xc421ed80c0)
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/kvstore/kvstore.go:89 +0x56
github.com/cilium/cilium/pkg/ipcache.DeleteIPFromKVStore(0xc421f1beb4, 0xc, 0xc, 0xc42407fe50)                                                                                                                                                                                                                                 
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/ipcache/ipcache.go:227 +0xbc
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).runIPIdentitySync.func2(0xc42333fb30, 0xc42407ff28)
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:930 +0x63
github.com/cilium/cilium/pkg/controller.(*Controller).runController(0xc423c9b200)
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/controller/controller.go:233 +0x398
created by github.com/cilium/cilium/pkg/controller.(*Manager).UpdateController  
        /home/vagrant/go/src/github.com/cilium/cilium/pkg/controller/manager.go:81 +0x2c8

Unfortunately the consul log appears to be clean; there is no indication of errors in it.

@ianvernon to prepare a commit to wait for IPs to exist in the BPF datapath as a way to:

  • Catch eventual consistency issues
  • Cause the test to fail in this wait logic if the kvstore sync is misbehaving