ingress-nginx: Fails to resolve auth-url DNS

What happened:

We use the "External OAUTH Authentication" feature with nginx.ingress.kubernetes.io/auth-url set to the DNS name of a Service in our cluster (i.e. xxx.xxx.svc.cluster.local). We run ingress-nginx with 3 replicas, one on each of our hosts, with hostNetwork=true. Intermittently, the "External OAUTH Authentication" feature in one (or more) of our instances stops working because nginx can't resolve the DNS name of the auth-url (i.e. xxx.xxx.svc.cluster.local). Log messages (IPs etc. masked with XXX):

2022/10/27 11:23:00 [error] 28#28: *1026 xxx.xxx.svc.cluster.local could not be resolved (110: Operation timed out), client: XXX, server: _, request: "POST /kibana/api/core/capabilities HTTP/2.0", subrequest: "/_external-auth-XXX-Prefix", host: "XXX", referrer: "https://XXX/kibana/app/kibana"
XXX - - [27/Oct/2022:11:23:00 +0000] "POST /kibana/api/core/capabilities HTTP/2.0" 502 0 "https://XXX/kibana/app/kibana" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36" 0 29.999 [XXX] [] - - - - XX
2022/10/27 11:23:00 [error] 28#28: *1026 auth request unexpected status: 502 while sending to client, client: XXX, server: _, request: "POST /kibana/api/core/capabilities HTTP/2.0", host: "XXX", referrer: "https://XXX/kibana/app/kibana"
10.254.254.17 - - [27/Oct/2022:11:23:00 +0000] "POST /kibana/api/core/capabilities HTTP/2.0" 500 572 "https://XXX/kibana/app/kibana" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36" 70 29.999 [XXX] [] - - - - XXX

The pod never seems to recover from this state.

When this happens, it is still possible to nslookup the name and curl the auth-url (xxx.xxx.svc.cluster.local) from within the container. It seems like only nginx itself can't resolve the DNS name.
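
For example, checks along these lines still succeed (pod and namespace names are placeholders):

  kubectl -n <namespace> exec -it <ingress-nginx-pod> -- nslookup xxx.xxx.svc.cluster.local
  kubectl -n <namespace> exec -it <ingress-nginx-pod> -- curl -v http://xxx.xxx.svc.cluster.local:8080/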

Performing either of these actions fixes the issue (commands sketched below):

  • Exec into the container and run nginx -s reload
  • Delete the pod. A new working one is created.
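
For reference, roughly what we run (pod and namespace names are placeholders):

  # Reload nginx inside the running controller pod
  kubectl -n <namespace> exec -it <ingress-nginx-pod> -- nginx -s reload
  # Or delete the pod and let the DaemonSet recreate it
  kubectl -n <namespace> delete pod <ingress-nginx-pod>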

We are not sure what triggers the error when we have seen it in production, but we found that one way to trigger it is to delete the "core-dns" pods in the "kube-system" namespace.

This doesn't trigger the issue every time, but after a couple of recreations of the "core-dns" pods, at least one of the "ingress-nginx" instances gets stuck in this "broken" state.

Exec-ing into the container and manually changing the following line so that it does not use the variable $target, but instead sets the URL directly in the proxy_pass directive, fixes the issue, and it can no longer be reproduced by deleting "core-dns" pods. How can this be any different? https://github.com/kubernetes/ingress-nginx/blob/2488fb00649f36cc0c6d0eadbb3d38dccd12f7be/rootfs/etc/nginx/template/nginx.tmpl#L1163
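
If I understand plain nginx behaviour correctly (this is general nginx, not specific to ingress-nginx), a literal hostname in proxy_pass is resolved once when the configuration is loaded, while a hostname that comes from a variable is re-resolved at request time through nginx's resolver directive, which ingress-nginx appears to fill in from the nameservers it sees when it renders the config. A minimal sketch of the two forms (location names and the resolver address are placeholders):

  # Hostname resolved once, when the configuration is loaded or reloaded:
  location /static-example {
      proxy_pass http://xxx.xxx.svc.cluster.local:8080/;
  }

  # Hostname re-resolved at request time via the resolver, because proxy_pass uses a variable:
  location /variable-example {
      resolver 10.96.0.10 valid=30s;   # placeholder cluster DNS address
      set $target http://xxx.xxx.svc.cluster.local:8080/;
      proxy_pass $target;
  }

If that is right, the hard-coded URL never touches the runtime resolver, which would explain why the problem can no longer be reproduced after the change.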

What you expected to happen: DNS resolution of the auth-url should always work, even if "core-dns" pods are recreated.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.4.0
  Build:         50be2bf95fd1ef480420e2aa1d6c5c7c138c95ea
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.19.10

-------------------------------------------------------------------------------

Kubernetes version (use kubectl version): Server Version: v1.23.0

Environment:

  • Cloud provider or hardware configuration: Self hosted
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.6 LTS (Bionic Beaver)
  • Kernel (e.g. uname -a): 5.4.0-131-generic #147~18.04.1-Ubuntu SMP Sat Oct 15 13:10:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Basic cluster related info: Server Version: v1.23.0
NAME            STATUS   ROLES                  AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
xxx-1   Ready    control-plane,master   16d   v1.23.3   xxx   <none>        Ubuntu 18.04.6 LTS   5.4.0-131-generic   cri-o://1.21.7
xxx-1   Ready    <none>                 16d   v1.23.3   xxx   <none>        Ubuntu 18.04.6 LTS   5.4.0-131-generic   cri-o://1.21.7
xxx-2   Ready    <none>                 16d   v1.23.3   xxx   <none>        Ubuntu 18.04.6 LTS   5.4.0-131-generic   cri-o://1.21.7
xxx-3   Ready    <none>                 16d   v1.23.3   xxx  <none>        Ubuntu 18.04.6 LTS   5.4.0-131-generic   cri-o://1.21.7
xxx-4   Ready    <none>                 16d   v1.23.3   xxx   <none>        Ubuntu 18.04.6 LTS   5.4.0-131-generic   cri-o://1.21.7
xxx-5   Ready    <none>                 16d   v1.23.3   xxx   <none>        Ubuntu 18.04.6 LTS   5.4.0-131-generic   cri-o://1.21.7
  • How was the ingress-nginx-controller installed:
ingress-nginx:
  controller:
    admissionWebhooks:
      enabled: false
    daemonset:
      useHostPort: true
    dnsPolicy: ClusterFirstWithHostNet
    hostNetwork: true
    kind: DaemonSet
    nodeSelector:
      core: ""
    service:
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
  nameOverride: nginx-ingress
  • Current State of the controller:
    • kubectl describe ingressclasses
Name:         nginx
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=XXX
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=nginx-ingress
              app.kubernetes.io/part-of=nginx-ingress
              app.kubernetes.io/version=1.4.0
              helm.sh/chart=ingress-nginx-4.3.0
Annotations:  meta.helm.sh/release-name: XXX
              meta.helm.sh/release-namespace: XXX
Controller:   k8s.io/ingress-nginx
Events:       <none>
  • kubectl -n <ingresscontrollernamespace> get all -A -o wide
  • kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>
Name:             xxx
Namespace:        xxx
Priority:         0
Service Account:  xxx
Node:             xxx
Start Time:       Thu, 27 Oct 2022 13:19:25 +0200
Labels:           app.kubernetes.io/component=controller
                  app.kubernetes.io/instance=XXX
                  app.kubernetes.io/name=nginx-ingress
                  controller-revision-hash=58d9d684cc
                  pod-template-generation=6
Annotations:      <none>
Status:           Running
IP:              xxx
IPs:
  IP:           xxx
Controlled By:  DaemonSet/xxx
Containers:
  controller:
    Container ID:  cri-o://5a2c60b3f3dc788430e3079f616161e71ec502a8f6acdabfa4c578cb093fda1e
    Image:         registry.k8s.io/ingress-nginx/controller:v1.4.0@sha256:34ee929b111ffc7aa426ffd409af44da48e5a0eea1eb2207994d9e0c0882d143
    Image ID:      registry.k8s.io/ingress-nginx/controller@sha256:34ee929b111ffc7aa426ffd409af44da48e5a0eea1eb2207994d9e0c0882d143
    Ports:         80/TCP, 443/TCP
    Host Ports:    80/TCP, 443/TCP
    Args:
      /nginx-ingress-controller
      --publish-service=$(POD_NAMESPACE)/xxx
      --election-id=ingress-controller-leader
      --controller-class=k8s.io/ingress-nginx
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/xxx
    State:          Running
      Started:      Thu, 27 Oct 2022 14:02:08 +0200
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:      100m
      memory:   90Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       xxx(v1:metadata.name)
      POD_NAMESPACE:  xxx(v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-b5sd5 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-b5sd5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              core=
                             kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
  • kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>
Name:                     xxx
Namespace:                xx
Labels:                   app.kubernetes.io/component=controller
                          app.kubernetes.io/instance=xxx
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=nginx-ingress
                          app.kubernetes.io/part-of=nginx-ingress
                          app.kubernetes.io/version=1.4.0
                          helm.sh/chart=ingress-nginx-4.3.0
Annotations:              meta.helm.sh/release-name: xxx
                          meta.helm.sh/release-namespace: xxx
                          service.beta.kubernetes.io/aws-load-balancer-type: nlb
Selector:                 app.kubernetes.io/component=controller,app.kubernetes.io/instance=xxx,app.kubernetes.io/name=nginx-ingress
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       xxx
IPs:                      xxx
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  32655/TCP
Endpoints:                xxx
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  32081/TCP
Endpoints:                xxx
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
  • Current state of ingress object, if applicable:
Name:       xxx      
Labels:           app.kubernetes.io/instance=xxx
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=xxx
                  helm.sh/chart=xxx-0.1.0
Namespace:        xxx
Address:          
Ingress Class:    <none>
Default backend:  <default>
Rules:
  Host             Path  Backends
  ----             ----  --------
  xxx
                   /grafana(/|$)(.*)   grafana:80 (10.0.2.238:3000)
                   /kibana(/|$)(.*)    kibana-kibana:5601 (10.0.2.167:5601)
  *                
                   /grafana(/|$)(.*)   grafana:80 (10.0.2.238:3000)
                   /kibana(/|$)(.*)    kibana-kibana:5601 (10.0.2.167:5601)
Annotations:       cert-manager.io/cluster-issuer: xxx
                   kubernetes.io/ingress.class: nginx
                   meta.helm.sh/release-name: xxx
                   meta.helm.sh/release-namespace: xxx
                   nginx.ingress.kubernetes.io/auth-cache-duration: 200 202 60s, 401 0s
                   nginx.ingress.kubernetes.io/auth-cache-key: $cookie_xxx
                   nginx.ingress.kubernetes.io/auth-response-headers: xxx
                   nginx.ingress.kubernetes.io/auth-signin: /
                   nginx.ingress.kubernetes.io/auth-url: http://xxx.svc.cluster.local:8080/xxx
                   nginx.ingress.kubernetes.io/rewrite-target: /$2
                   nginx.ingress.kubernetes.io/server-snippet:
                     location = /kibana {
                       return 301 /kibana/app/kibana#/discover/xxx...
                     }
                   nginx.ingress.kubernetes.io/ssl-redirect: false
Events:            <none>
  • Others:

How to reproduce this issue: Delete the "core-dns" pods in the "kube-system" namespace a couple of times.
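
Roughly (the label selector is an assumption based on the default kubeadm/CoreDNS deployment labels):

  # Delete the coredns pods so the Deployment recreates them with new pod IPs
  kubectl -n kube-system delete pods -l k8s-app=kube-dns
  # Repeat a couple of times, then exercise an Ingress that uses auth-url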

Anything else we need to know:

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 24 (8 by maintainers)

Most upvoted comments

FWIW, I experienced this same issue with a reverse-proxy nginx in Grafana Mimir: the nginx process was trying to use a dead coredns instance as a resolver, timing out on every request.

My workaround for this problem was to deploy a DNS proxy (I use CoreDNS) as a sidecar container to nginx (explanation below).

Here is the section of the nginx Helm chart values that does what I mentioned above:

  ### START OF SECTION TO DEPLOY A DNS PROXY (CoreDNS) AS SIDECAR CONTAINER TO NGINX ###
  # We need this DNS proxy to work around an issue where, under the Cilium network infrastructure,
  # nginx fails to update its list of CoreDNS addresses. When CoreDNS pods are moved,
  # nginx keeps trying to connect to CoreDNS pods that no longer exist, so DNS resolution fails.
  #
  # We deploy a local DNS server (CoreDNS) that acts as a proxy between nginx and kube-system/kube-dns.
  # This way, the nginx instance always talks to this local DNS server only, which eliminates
  # the need to refresh/manage a fleet of CoreDNS servers.
  #
  # Link to some related issues:
  # - https://github.com/kubernetes/ingress-nginx/issues/9222
  # - https://github.com/projectcalico/calico/issues/4509
  dnsPolicy: None
  dnsConfig:
    nameservers:
      - 127.0.0.1
    searches:
      - ingress-nginx.svc.cluster.local
      - svc.cluster.local
      - cluster.local
  extraVolumes:
  - name: dns-proxy-config-volume
    configMap:
      name: dns-proxy-config # Will be created via Kustomize
  extraContainers:
  - name: dns-proxy
    image: coredns/coredns:1.10.0
    imagePullPolicy: IfNotPresent
    args:
    - -conf
    - /etc/coredns/Corefile
    volumeMounts:
    - mountPath: /etc/coredns
      name: dns-proxy-config-volume
      readOnly: true
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        memory: 1Gi
      requests:
        cpu: 512m
        memory: 512Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
        - NET_BIND_SERVICE
        drop:
        - all
      readOnlyRootFilesystem: true
  ### END OF SECTION TO DEPLOY A DNS PROXY (CoreDNS) AS SIDECAR CONTAINER TO NGINX ###

And the CoreDNS config file (Corefile):

.:53 {
  errors
  health
  forward . 172.20.0.10 # Forward to kube-system/kube-dns
  loop
  loadbalance
}
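
For completeness, a minimal sketch of what the dns-proxy-config ConfigMap could look like (the namespace is a placeholder; in my case it is generated via Kustomize):

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: dns-proxy-config
    namespace: ingress-nginx   # placeholder namespace
  data:
    Corefile: |
      .:53 {
        errors
        health
        forward . 172.20.0.10 # Forward to kube-system/kube-dns
        loop
        loadbalance
      }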

@paolostancato there are 3 people reporting a problem here and there is not enough data to draw any conclusions. From your last post, it seems you had coredns-related reasons causing problems. That is beyond the scope of this project's discussion in a GitHub issue.

It would be better to discuss this on the Kubernetes Slack.