cilium: DNS request from hostNetwork pods cannot be delivered to local backend with LRP

When running nslookup from within a hostNetwork pod with NodelocalDNS + Cilium (full kube-proxy replacement), DNS requests time out. I do not see any dropped packets with cilium monitor -t drop. Running tcpdump on the veth of the local node-cache backend pod doesn’t show anything, i.e. the packets never reach the local backend pod.

General Information

  • Cilium version (run cilium version): Client: 1.9.5 caf84d780 2021-04-23T00:39:47+00:00 go version go1.15.8 linux/amd64

  • Kernel version (run uname -a): Linux gke-nld-default-pool-5a5e9ec3-8llq 5.4.104+ SMP Tue Apr 6 09:49:56 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux

  • Orchestration system version in use (e.g. kubectl version, …) GKE cluster v1.20.6-gke.1400

How to reproduce the issue

  1. Deploy NodelocalDNS with Cilium following the getting started guide (GSG)
  2. Deploy the following pod and run nslookup from within it:
apiVersion: v1
kind: Pod
metadata:
  name: dnsclienthostnet1
spec:
  dnsPolicy: ClusterFirstWithHostNet
  hostNetwork: true
  containers:
    - image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.1
      name: dnsclient
      resources:
        limits:
          cpu: "0.1"
        requests:
          cpu: 100m
      command: ["sh", "-c"]
      args: ["sleep 36000"]

Note that this issue only happens when the pod has dnsPolicy: ClusterFirstWithHostNet.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

Spent some more time on this today and finally figured out the root cause.

TL;DR: we need an additional ACCEPT rule in filter:CILIUM_OUTPUT to accept the outgoing packet.

RCA: When a hostns pod sends a DNS packet to the local node-cache pod, the packet hits this rule (https://github.com/cilium/cilium/blob/master/pkg/datapath/iptables/iptables.go#L813) and skips conntrack. However, there’s no corresponding ACCEPT rule to allow it in filter:OUTPUT, so it falls through to the chain policy and is dropped there. This can be observed in the packet trace below:

root@gke-test2-default-pool-2ea80d29-4xrl:/home/cilium# dmesg | grep ID=40009
[ 6594.691542] TRACE: raw:OUTPUT:rule:3 IN= OUT=gke2d2b5b36051 SRC=10.48.0.1 DST=10.48.0.7 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=40009 PROTO=UDP SPT=42393 DPT=53 LEN=65 UID=0 GID=0 
[ 6594.707446] TRACE: raw:CILIUM_OUTPUT_raw:rule:5 IN= OUT=gke2d2b5b36051 SRC=10.48.0.1 DST=10.48.0.7 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=40009 PROTO=UDP SPT=42393 DPT=53 LEN=65 UID=0 GID=0 
[ 6594.724302] TRACE: raw:CILIUM_OUTPUT_raw:return:7 IN= OUT=gke2d2b5b36051 SRC=10.48.0.1 DST=10.48.0.7 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=40009 PROTO=UDP SPT=42393 DPT=53 LEN=65 UID=0 GID=0 
[ 6594.741412] TRACE: raw:OUTPUT:policy:4 IN= OUT=gke2d2b5b36051 SRC=10.48.0.1 DST=10.48.0.7 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=40009 PROTO=UDP SPT=42393 DPT=53 LEN=65 UID=0 GID=0 
[ 6594.757764] TRACE: mangle:OUTPUT:policy:2 IN= OUT=gke2d2b5b36051 SRC=10.48.0.1 DST=10.48.0.7 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=40009 PROTO=UDP SPT=42393 DPT=53 LEN=65 UID=0 GID=0 
[ 6594.774306] TRACE: filter:OUTPUT:rule:1 IN= OUT=gke2d2b5b36051 SRC=10.48.0.1 DST=10.48.0.7 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=40009 PROTO=UDP SPT=42393 DPT=53 LEN=65 UID=0 GID=0 
[ 6594.790459] TRACE: filter:CILIUM_OUTPUT:rule:2 IN= OUT=gke2d2b5b36051 SRC=10.48.0.1 DST=10.48.0.7 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=40009 PROTO=UDP SPT=42393 DPT=53 LEN=65 UID=0 GID=0 
[ 6594.807343] TRACE: filter:CILIUM_OUTPUT:return:5 IN= OUT=gke2d2b5b36051 SRC=10.48.0.1 DST=10.48.0.7 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=40009 PROTO=UDP SPT=42393 DPT=53 LEN=65 UID=0 GID=0 MARK=0xc00 
[ 6594.825245] TRACE: filter:OUTPUT:rule:2 IN= OUT=gke2d2b5b36051 SRC=10.48.0.1 DST=10.48.0.7 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=40009 PROTO=UDP SPT=42393 DPT=53 LEN=65 UID=0 GID=0 MARK=0xc00 
[ 6594.842374] TRACE: filter:KUBE-FIREWALL:return:3 IN= OUT=gke2d2b5b36051 SRC=10.48.0.1 DST=10.48.0.7 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=40009 PROTO=UDP SPT=42393 DPT=53 LEN=65 UID=0 GID=0 MARK=0xc00 
[ 6594.860244] TRACE: filter:OUTPUT:policy:5 IN= OUT=gke2d2b5b36051 SRC=10.48.0.1 DST=10.48.0.7 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=40009 PROTO=UDP SPT=42393 DPT=53 LEN=65 UID=0 GID=0 MARK=0xc00 
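
To make the end of the trace easier to spot, here is a minimal Python sketch that reduces TRACE lines to their table:chain:kind:rule hops. The helper names (`trace_path`, `last_hop`) are hypothetical, and the sample lines are abbreviated copies of the last few dmesg lines above. The final hop, filter:OUTPUT:policy:5, means no rule in filter:OUTPUT gave a verdict and the chain policy was applied.

```python
import re

# Matches the "TRACE: table:chain:kind:rulenum" prefix logged by the iptables TRACE target.
TRACE_RE = re.compile(r"TRACE: (\w+):([\w-]+):(rule|return|policy):(\d+)")

def trace_path(dmesg_lines):
    """Return the ordered (table, chain, kind, rulenum) hops found in the given lines."""
    hops = []
    for line in dmesg_lines:
        m = TRACE_RE.search(line)
        if m:
            table, chain, kind, num = m.groups()
            hops.append((table, chain, kind, int(num)))
    return hops

def last_hop(dmesg_lines):
    """Return the final hop of the trace, or None if no TRACE line was found."""
    hops = trace_path(dmesg_lines)
    return hops[-1] if hops else None

# Abbreviated copies of the last four lines of the failing trace (ID=40009).
sample = [
    "[ 6594.807343] TRACE: filter:CILIUM_OUTPUT:return:5 IN= OUT=gke2d2b5b36051 ID=40009",
    "[ 6594.825245] TRACE: filter:OUTPUT:rule:2 IN= OUT=gke2d2b5b36051 ID=40009",
    "[ 6594.842374] TRACE: filter:KUBE-FIREWALL:return:3 IN= OUT=gke2d2b5b36051 ID=40009",
    "[ 6594.860244] TRACE: filter:OUTPUT:policy:5 IN= OUT=gke2d2b5b36051 ID=40009",
]

if __name__ == "__main__":
    for hop in trace_path(sample):
        print(":".join(str(part) for part in hop))
```

A trace for a delivered packet would continue past the filter table into mangle:POSTROUTING instead of stopping at the policy hop.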

Here’s filter:OUTPUT:

root@gke-test2-default-pool-2ea80d29-4xrl:/home/cilium# iptables -L OUTPUT -n -v --line-numbers
Chain OUTPUT (policy DROP 48 packets, 3447 bytes)
num   pkts bytes target     prot opt in     out     source               destination         
1     275K  128M CILIUM_OUTPUT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cilium-feeder: CILIUM_OUTPUT */
2     300K  135M KUBE-FIREWALL  all  --  *      *       0.0.0.0/0            0.0.0.0/0           
3     300K  136M ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            state NEW,RELATED,ESTABLISHED
4        0     0 ACCEPT     all  --  *      lo      0.0.0.0/0            0.0.0.0/0

OSS NodelocalDNS does not have a specific ACCEPT rule either (that’s probably why we missed it in the first place), but it works because it DNATs the packet to a link-local dummy interface, so the packet is routed to the loopback device, hits rule no. 4 above, and is allowed through.

Solution: add the last 2 rules below to filter:CILIUM_OUTPUT to explicitly allow this untracked packet (to the nodelocaldns IP):
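
The reasoning above can be sketched as a toy model of the filter:OUTPUT chain. This is not how iptables is implemented; it only mirrors rules 3 and 4 and the DROP policy from the listing, and it assumes rules 1 and 2 (CILIUM_OUTPUT, KUBE-FIREWALL) return without a verdict, as they do in the trace. `Packet` and `filter_output_verdict` are hypothetical names. A packet that went through NOTRACK has conntrack state UNTRACKED, which does not match NEW, RELATED, or ESTABLISHED.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    out_iface: str  # outgoing interface, e.g. "lo" or a pod veth
    ct_state: str   # "NEW", "RELATED", "ESTABLISHED", or "UNTRACKED" after NOTRACK

def filter_output_verdict(pkt):
    """Toy model of rules 3 and 4 plus the DROP policy of the OUTPUT chain above."""
    # Rule 3: ACCEPT if state is NEW, RELATED, or ESTABLISHED.
    if pkt.ct_state in {"NEW", "RELATED", "ESTABLISHED"}:
        return "ACCEPT (rule 3)"
    # Rule 4: ACCEPT anything leaving via the loopback device.
    if pkt.out_iface == "lo":
        return "ACCEPT (rule 4)"
    # No rule matched: the chain policy applies.
    return "DROP (policy)"

if __name__ == "__main__":
    # Cilium LRP case: untracked packet leaving via the node-cache pod's veth.
    print(filter_output_verdict(Packet("gke2d2b5b36051", "UNTRACKED")))
    # OSS NodelocalDNS case: untracked packet routed to loopback.
    print(filter_output_verdict(Packet("lo", "UNTRACKED")))
```

Only the combination of "untracked" and "not leaving via lo" reaches the policy, which is why the OSS setup never ran into this.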

root@gke-test2-default-pool-a5ad9368-sq4k:/home/cilium# iptables -nvL CILIUM_OUTPUT -t filter
Chain CILIUM_OUTPUT (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0xa00/0xfffffeff /* cilium: ACCEPT for proxy return traffic */
14788 5037K MARK       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0xe00/0xf00 mark match ! 0xd00/0xf00 mark match ! 0xa00/0xe00 /* cilium: host->any mark as from host */ MARK xset 0xc00/0xf00
    0     0 ACCEPT     tcp  --  *      *       10.48.0.2            0.0.0.0/0            tcp spt:53
    0     0 ACCEPT     udp  --  *      *       10.48.0.2            0.0.0.0/0            udp spt:53
    0     0 ACCEPT     tcp  --  *      *       0.0.0.0/0            10.48.0.2            tcp dpt:53
    6   510 ACCEPT     udp  --  *      *       0.0.0.0/0            10.48.0.2            udp dpt:53
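
For anyone who wants to try the same workaround by hand, the four port-53 ACCEPT entries in the listing above can be written out as iptables commands. The sketch below only builds and prints the argument lists; `accept_rules` is a hypothetical helper, and the IP is whatever the local node-cache pod has on the node (10.48.0.2 in this listing). This is meant for manual testing and is not necessarily how a fix would be implemented in Cilium itself.

```python
def accept_rules(dns_ip, chain="CILIUM_OUTPUT"):
    """Build iptables argument lists mirroring the four port-53 ACCEPT rules above."""
    rules = []
    # First the source-side rules (replies from the DNS cache), then the destination side.
    for addr_flag, port_flag in (("-s", "--sport"), ("-d", "--dport")):
        for proto in ("tcp", "udp"):
            rules.append([
                "iptables", "-t", "filter", "-A", chain,
                addr_flag, f"{dns_ip}/32",
                "-p", proto, "-m", proto, port_flag, "53",
                "-j", "ACCEPT",
            ])
    return rules

if __name__ == "__main__":
    # Print the commands instead of executing them, so they can be reviewed first.
    for argv in accept_rules("10.48.0.2"):
        print(" ".join(argv))
```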

With these rules in place, the packet flows through as expected:

root@gke-test2-default-pool-a5ad9368-sq4k:/home/cilium# dmesg | grep ID=54644
[  662.783656] TRACE: raw:OUTPUT:rule:3 IN= OUT=gke8162daf509e SRC=10.48.0.1 DST=10.48.0.2 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=54644 PROTO=UDP SPT=55946 DPT=53 LEN=65 UID=0 GID=0 
[  662.800213] TRACE: raw:CILIUM_OUTPUT_raw:rule:6 IN= OUT=gke8162daf509e SRC=10.48.0.1 DST=10.48.0.2 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=54644 PROTO=UDP SPT=55946 DPT=53 LEN=65 UID=0 GID=0 
[  662.817271] TRACE: raw:CILIUM_OUTPUT_raw:return:7 IN= OUT=gke8162daf509e SRC=10.48.0.1 DST=10.48.0.2 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=54644 PROTO=UDP SPT=55946 DPT=53 LEN=65 UID=0 GID=0 
[  662.834603] TRACE: raw:OUTPUT:policy:4 IN= OUT=gke8162daf509e SRC=10.48.0.1 DST=10.48.0.2 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=54644 PROTO=UDP SPT=55946 DPT=53 LEN=65 UID=0 GID=0 
[  662.850878] TRACE: mangle:OUTPUT:policy:2 IN= OUT=gke8162daf509e SRC=10.48.0.1 DST=10.48.0.2 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=54644 PROTO=UDP SPT=55946 DPT=53 LEN=65 UID=0 GID=0 
[  662.867569] TRACE: filter:OUTPUT:rule:1 IN= OUT=gke8162daf509e SRC=10.48.0.1 DST=10.48.0.2 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=54644 PROTO=UDP SPT=55946 DPT=53 LEN=65 UID=0 GID=0 
[  662.883879] TRACE: filter:CILIUM_OUTPUT:rule:2 IN= OUT=gke8162daf509e SRC=10.48.0.1 DST=10.48.0.2 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=54644 PROTO=UDP SPT=55946 DPT=53 LEN=65 UID=0 GID=0 
[  662.900748] TRACE: filter:CILIUM_OUTPUT:rule:6 IN= OUT=gke8162daf509e SRC=10.48.0.1 DST=10.48.0.2 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=54644 PROTO=UDP SPT=55946 DPT=53 LEN=65 UID=0 GID=0 MARK=0xc00 
[  662.918460] TRACE: mangle:POSTROUTING:rule:1 IN= OUT=gke8162daf509e SRC=10.48.0.1 DST=10.48.0.2 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=54644 PROTO=UDP SPT=55946 DPT=53 LEN=65 UID=0 GID=0 MARK=0xc00 
[  662.936033] TRACE: mangle:CILIUM_POST_mangle:return:1 IN= OUT=gke8162daf509e SRC=10.48.0.1 DST=10.48.0.2 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=54644 PROTO=UDP SPT=55946 DPT=53 LEN=65 UID=0 GID=0 MARK=0xc00 
[  662.954440] TRACE: mangle:POSTROUTING:policy:2 IN= OUT=gke8162daf509e SRC=10.48.0.1 DST=10.48.0.2 LEN=85 TOS=0x00 PREC=0x00 TTL=64 ID=54644 PROTO=UDP SPT=55946 DPT=53 LEN=65 UID=0 GID=0 MARK=0xc00

And DNS requests work just fine in the hostns pod:

/ # nslookup nginx-service
Server:         10.28.16.10
Address:        10.28.16.10#53

Name:   nginx-service.default.svc.cluster.local
Address: 10.28.29.201

I added debug prints in bpf_sock.c and saw the following:

  <...>-72998   [001] ....  7662.985851: 0: sock4_xlate_fwd -> 174080010:13568  
  <...>-72998   [001] ....  7662.985869: 0: sock4_rewrite -> 50355210:13568  

The first line is printed right upon entry to sock4_xlate_fwd, the second right before the final return (I also have a printk in the LRP skip case, which never triggers).

Converting the IP/port to human-readable format, I can see that the first line points to the kube-dns svc VIP and the second to the nodelocaldns pod on the same node, so I’m pretty sure the rewrite happens correctly in this case.
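
For reference, the conversion can be reproduced with a few lines of Python. This sketch assumes the debug prints show network-byte-order fields read as native integers on a little-endian host (the node is x86_64); the helper names are hypothetical.

```python
import socket
import struct

def be32_to_ip(value):
    """Decode a network-order IPv4 address that was printed as a little-endian u32."""
    # Re-serialise the integer as little-endian to recover the original byte order.
    return socket.inet_ntoa(struct.pack("<I", value))

def be16_to_port(value):
    """Decode a network-order port that was printed as a little-endian u16."""
    return struct.unpack(">H", struct.pack("<H", value))[0]

if __name__ == "__main__":
    for ip, port in ((174080010, 13568), (50355210, 13568)):
        print(f"{ip}:{port} -> {be32_to_ip(ip)}:{be16_to_port(port)}")
```

Under that assumption the two lines decode to 10.64.96.10:53 and 10.92.0.3:53.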

I can see the NOTRACK rule being hit when issuing nslookup in the hostNetwork client (the counter stays unchanged when no lookup is issued):

Before:

[26:2132] -A CILIUM_OUTPUT_raw -d 10.92.0.3/32 -p udp -m udp --dport 53 -j NOTRACK 

After:

[27:2214] -A CILIUM_OUTPUT_raw -d 10.92.0.3/32 -p udp -m udp --dport 53 -j NOTRACK
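
The difference between the two counter readings can be computed as follows. This is a small sketch that assumes the `[packets:bytes]` prefix format shown in the two rule lines above; `counters` is a hypothetical helper.

```python
import re

# Leading "[packets:bytes]" counter prefix of a rule line.
COUNTER_RE = re.compile(r"^\[(\d+):(\d+)\]")

def counters(rule_line):
    """Return (packets, bytes) parsed from the counter prefix of a rule line."""
    m = COUNTER_RE.match(rule_line.strip())
    if m is None:
        raise ValueError("rule line has no [packets:bytes] prefix")
    return int(m.group(1)), int(m.group(2))

before = "[26:2132] -A CILIUM_OUTPUT_raw -d 10.92.0.3/32 -p udp -m udp --dport 53 -j NOTRACK"
after = "[27:2214] -A CILIUM_OUTPUT_raw -d 10.92.0.3/32 -p udp -m udp --dport 53 -j NOTRACK"

pkts = counters(after)[0] - counters(before)[0]
size = counters(after)[1] - counters(before)[1]

if __name__ == "__main__":
    print(f"{pkts} packet(s), {size} bytes matched between the two readings")
```

One additional packet of 82 bytes matched the rule between the two readings.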

The NOTRACK counter increase seems to align with the fact that “hacking the skip LRP translate doesn’t work”, because the traffic is indeed DNATed to nodelocaldns’s IP.