amazon-vpc-cni-k8s: Using Security Groups per pod with NodeLocal DNSCache doesn't work

What happened: I tried to attach a security group to a pod using the official guide. Everything works as expected, but when I try to use NodeLocal DNSCache, I can't connect to the CoreDNS IP (172.20.0.10) from the pods to which I attached a Security Group (I can connect from other pods). I used this file as a template for my installation. Here are my NodeLocal DNSCache DaemonSet and ConfigMap manifests:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: node-local-dns
  name: node-local-dns
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: node-local-dns
  template:
    metadata:
      annotations:
        prometheus.io/port: "9253"
        prometheus.io/scrape: "true"
      labels:
        k8s-app: node-local-dns
    spec:
      containers:
        - args:
            - -localip
            - 169.254.20.10,172.20.0.10
            - -conf
            - /etc/Corefile
            - -upstreamsvc
            - kube-dns-upstream
          image: k8s.gcr.io/dns/k8s-dns-node-cache:1.16.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              host: 169.254.20.10
              path: /health
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 60
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          name: node-cache
          ports:
            - containerPort: 53
              hostPort: 53
              name: dns
              protocol: UDP
            - containerPort: 53
              hostPort: 53
              name: dns-tcp
              protocol: TCP
            - containerPort: 9253
              hostPort: 9253
              name: metrics
              protocol: TCP
          resources:
            requests:
              cpu: 25m
              memory: 5Mi
          securityContext:
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /run/xtables.lock
              name: xtables-lock
            - mountPath: /etc/coredns
              name: config-volume
            - mountPath: /etc/kube-dns
              name: kube-dns-config
      dnsPolicy: Default
      hostNetwork: true
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: node-local-dns
      serviceAccountName: node-local-dns
      terminationGracePeriodSeconds: 30
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoExecute
          operator: Exists
        - effect: NoSchedule
          operator: Exists
      volumes:
        - hostPath:
            path: /run/xtables.lock
            type: FileOrCreate
          name: xtables-lock
        - configMap:
            defaultMode: 420
            name: kube-dns
            optional: true
          name: kube-dns-config
        - configMap:
            defaultMode: 420
            items:
              - key: Corefile
                path: Corefile.base
            name: node-local-dns
          name: config-volume
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
    type: RollingUpdate

---
apiVersion: v1
data:
  Corefile: |
    cluster.local:53 {
        errors
        cache {
                success 9984 30
                denial 9984 5
        }
        reload
        loop
        bind 169.254.20.10 172.20.0.10
        forward . __PILLAR__CLUSTER__DNS__
        prometheus :9253
        health 169.254.20.10:8080
        }
    in-addr.arpa:53 {
        errors
        cache 30
        reload
        loop
        bind 169.254.20.10 172.20.0.10
        forward . __PILLAR__CLUSTER__DNS__
        prometheus :9253
        }
    ip6.arpa:53 {
        errors
        cache 30
        reload
        loop
        bind 169.254.20.10 172.20.0.10
        forward . __PILLAR__CLUSTER__DNS__
        prometheus :9253
        }
    .:53 {
        errors
        cache 30
        reload
        loop
        bind 169.254.20.10 172.20.0.10
        forward . __PILLAR__UPSTREAM__SERVERS__
        prometheus :9253
        }
kind: ConfigMap
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
  name: node-local-dns
  namespace: kube-system
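
For context, the security groups are attached to the pods with a SecurityGroupPolicy object, as described in the official guide. A minimal sketch of such an object (the name, selector label, and security group ID below are placeholders, not my actual values):

apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: my-sgp                   # placeholder name
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: my-app                # placeholder label selecting the affected pods
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0     # placeholder security group ID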

Here is the final /etc/Corefile:

cluster.local:53 {
    errors
    cache {
            success 9984 30
            denial 9984 5
    }
    reload
    loop
    bind 169.254.20.10 172.20.0.10
    forward . 172.20.209.13
    prometheus :9253
    health 169.254.20.10:8080
    }
in-addr.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.20.0.10
    forward . 172.20.209.13
    prometheus :9253
    }
ip6.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.20.0.10
    forward . 172.20.209.13
    prometheus :9253
    }
.:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.20.0.10
    forward . /etc/resolv.conf
    prometheus :9253
    }

Note that I don't use the force_tcp option in the CoreDNS configuration, per the official recommendation.

Environment:

  • Kubernetes version (use kubectl version): v1.18.9-eks-d1db3c
  • CNI Version: 1.7.8
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 15 (9 by maintainers)


Most upvoted comments

The following workaround allows DNS resolution to work for pods using security groups.

In your node-local-dns DaemonSet, we can stop it from applying the iptables rules that block DNS resolution for pods using security groups by passing the following additional argument:

"-setupiptables=false"

We now have to create a custom dnsPolicy for all pods that use the Security Groups for Pods feature. For that, we need the kube-dns service IP, which can be found with KUBE_DNS_IP=$(kubectl get svc kube-dns -n kube-system -o jsonpath={.spec.clusterIP}), and the region of the cluster.
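
For example (assuming the AWS CLI on the machine is configured for the same region as the cluster; otherwise just read the region from the console):

KUBE_DNS_IP=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
AWS_REGION=$(aws configure get region)   # assumption: CLI default region matches the cluster's region
echo "kube-dns IP: ${KUBE_DNS_IP}, region: ${AWS_REGION}"

Substitute those two values into the dnsConfig below: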

dnsPolicy: "None"
dnsConfig:
  nameservers:
  - KUBE_DNS_IP
  searches:
  - default.svc.cluster.local
  - svc.cluster.local
  - cluster.local
  - AWS_REGION.compute.internal
  options:
  - name: ndots
    value: "5"

After replacing the values in the dnsPolicy and using it in the pods that use SGP, DNS resolution should go through. You can verify this with the following command:

kubectl exec -ti <pod-name> -- nslookup kubernetes.default

It should return output like this:

Server:		10.100.0.10
Address:	10.100.0.10#53
Name:	kubernetes.default.svc.cluster.local
Address: 10.100.0.1

Just to provide more context around this issue, I'm documenting a few behaviors below.

Pods using security groups don't use the local route table, so connections always go out of the branch ENI (via VLAN) and leave the host through the trunk ENI, even when these pods want to communicate with other pods on the same host. This ensures the security group's egress rules are applied. Therefore, pods using security groups will not be able to communicate with pods using host networking on the same host unless the security group allows such communication.
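
To see this on a node, a generic way to inspect the policy routing the CNI sets up for branch-ENI pods (exact interface and table names vary by CNI version, so this is only a way to poke around, not output from this issue):

ip rule show                # policy-routing rules, including the per-VLAN rules for branch-ENI pods
ip -d link show type vlan   # the VLAN sub-interfaces backing the branch ENIs
ip route show table all     # the extra route tables those rules point at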

NodeLocalDNS pod setup: the NodeLocalDNS pods perform two operations, as follows:

  1. They add NOTRACK iptables rules for the cluster DNS IP as well as the node-local DNS IP. This skips iptables DNAT and connection tracking (see http://www.netfilter.org/documentation/HOWTO//netfilter-hacking-HOWTO-3.html#ss3.3; connection tracking is needed for NAT). It also avoids potential race conditions in connection tracking.
  2. The node-local-dns pod's interface is set up using the local routing table.

Since pods using security groups go out through the branch ENI (VLAN) device and don't know about the local route table, packets will not reach NodeLocalDNS. Also, due to the NOTRACK iptables rules, these pods won't be able to communicate with the actual cluster DNS service IP either.
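
To confirm behavior (1) on a node where node-local-dns still runs with its default iptables setup, the rules can be listed from the raw table (exact rule text varies by node-local-dns version; this is just a sketch of where to look):

sudo iptables -t raw -S PREROUTING | grep -i notrack   # NOTRACK rules for 169.254.20.10 and the cluster DNS IP
sudo iptables -t raw -S OUTPUT | grep -i notrack
ip addr show nodelocaldns                              # dummy interface holding 169.254.20.10 (default interface name)
ip route show table local | grep 169.254.20.10         # local-table route that branch-ENI pods never consult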

There are a few workarounds for this. @abhipth is looking into whether we can use the cluster dnsPolicy and the cluster DNS nameserver in the pod spec to avoid this, and configure NodeLocalDNS not to add the NOTRACK iptables rules.

The other option we have is to add a flag in aws-node ipamd that tells the CNI plugin to add special ip rules on the host to enable NodeLocalDNS traffic within the host (from all VLAN interfaces use route table x, from NodeLocalDNS use route table x, and let traffic to NodeLocalDNS look up the local table). This might be the better and right approach, but we are open to suggestions on this.
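
Purely as an illustration of that idea (the priority and table number below are made up, and this is not something ipamd does today):

ip rule add to 169.254.20.10/32 lookup local prio 512   # let traffic destined to NodeLocalDNS hit the local table
ip rule add iif vlan.eth.1 lookup 101 prio 513          # example: traffic arriving from a branch-ENI VLAN keeps using its own table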

Sorry for the inconvenience this has caused. We will update our docs with whatever we decide is the right path forward, as NodeLocal DNS is a super useful feature and we want to support it for all pods.