ingress-nginx: Fails to resolve auth-url DNS
What happened:
We use the “External OAUTH Authentication” feature with nginx.ingress.kubernetes.io/auth-url set to the DNS name of a service in our cluster (i.e. xxx.xxx.svc.cluster.local). We run ingress-nginx with 3 replicas, one on each of our hosts, with hostNetwork=true. Intermittently the “External OAUTH Authentication” feature in one (or more) of our instances stops working because nginx cannot resolve the DNS name of the auth-url (i.e. xxx.xxx.svc.cluster.local). Log messages (IPs etc. masked with XXX):
2022/10/27 11:23:00 [error] 28#28: *1026 xxx.xxx.svc.cluster.local could not be resolved (110: Operation timed out), client: XXX, server: _, request: "POST /kibana/api/core/capabilities HTTP/2.0", subrequest: "/_external-auth-XXX-Prefix", host: "XXX", referrer: "https://XXX/kibana/app/kibana"
XXX - - [27/Oct/2022:11:23:00 +0000] "POST /kibana/api/core/capabilities HTTP/2.0" 502 0 "https://XXX/kibana/app/kibana" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36" 0 29.999 [XXX] [] - - - - XX
2022/10/27 11:23:00 [error] 28#28: *1026 auth request unexpected status: 502 while sending to client, client: XXX, server: _, request: "POST /kibana/api/core/capabilities HTTP/2.0", host: "XXX", referrer: "https://XXX/kibana/app/kibana"
10.254.254.17 - - [27/Oct/2022:11:23:00 +0000] "POST /kibana/api/core/capabilities HTTP/2.0" 500 572 "https://XXX/kibana/app/kibana" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36" 70 29.999 [XXX] [] - - - - XXX
The pod never seems to recover from this state.
When this happens it is still possible to nslookup the DNS name and curl the auth-url (xxx.xxx.svc.cluster.local) from within the container. It seems like it is just nginx that cannot resolve the name.
Performing any of these actions fixes the issue (see the commands sketched below):
- Exec into the container and run nginx -s reload
- Delete the pod; a new, working one is created.
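A minimal sketch of those two recovery actions, assuming an ingress-nginx namespace (the pod name is a placeholder):

  # reload nginx inside the running controller pod
  kubectl -n ingress-nginx exec <controller-pod> -- nginx -s reload

  # or delete the pod and let the DaemonSet recreate it
  kubectl -n ingress-nginx delete pod <controller-pod>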
We are not sure what triggers the error when we have seen it in production, but we found that one way to trigger it is to delete the “core-dns” pods in the “kube-system” namespace.
This does not trigger the issue every time, but after a couple of recreations of the “core-dns” pods at least one of the ingress-nginx instances gets stuck in this “broken” state.
Exec-ing into the container and manually changing the following line to not use the variable $target, and instead setting the URL directly in the proxy_pass directive, fixes the issue; it can then no longer be reproduced by deleting the “core-dns” pods. How can this behave any differently?
https://github.com/kubernetes/ingress-nginx/blob/2488fb00649f36cc0c6d0eadbb3d38dccd12f7be/rootfs/etc/nginx/template/nginx.tmpl#L1163
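For context, nginx handles the two forms differently: a literal hostname in proxy_pass is resolved once when the configuration is loaded or reloaded, while a hostname taken from a variable is re-resolved at request time through the resolver directive. A minimal sketch of the two forms (service name, port, location names, and the resolver address are placeholders, not the actual template output):

  resolver 10.96.0.10 valid=30s;   # cluster DNS; queried at request time for the variable form

  # variable form (as in the template): resolved per request via the resolver above
  location = /_external-auth-example {
      set $target http://auth.example.svc.cluster.local:8080/validate;
      proxy_pass $target;
  }

  # literal form: resolved once at configuration load/reload, so a later DNS
  # outage does not affect requests hitting this location
  location = /_external-auth-example-literal {
      proxy_pass http://auth.example.svc.cluster.local:8080/validate;
  }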
What you expected to happen: DNS resolution of the auth-url should always work, even if the “core-dns” pods are recreated.
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: v1.4.0
Build: 50be2bf95fd1ef480420e2aa1d6c5c7c138c95ea
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.19.10
-------------------------------------------------------------------------------
Kubernetes version (use kubectl version):
Server Version: v1.23.0
Environment:
- Cloud provider or hardware configuration: Self hosted
- OS (e.g. from /etc/os-release): Ubuntu 18.04.6 LTS (Bionic Beaver)
- Kernel (e.g. uname -a): 5.4.0-131-generic #147~18.04.1-Ubuntu SMP Sat Oct 15 13:10:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kubeadm
- Basic cluster related info:
Server Version: v1.23.0
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
xxx-1 Ready control-plane,master 16d v1.23.3 xxx <none> Ubuntu 18.04.6 LTS 5.4.0-131-generic cri-o://1.21.7
xxx-1 Ready <none> 16d v1.23.3 xxx <none> Ubuntu 18.04.6 LTS 5.4.0-131-generic cri-o://1.21.7
xxx-2 Ready <none> 16d v1.23.3 xxx <none> Ubuntu 18.04.6 LTS 5.4.0-131-generic cri-o://1.21.7
xxx-3 Ready <none> 16d v1.23.3 xxx <none> Ubuntu 18.04.6 LTS 5.4.0-131-generic cri-o://1.21.7
xxx-4 Ready <none> 16d v1.23.3 xxx <none> Ubuntu 18.04.6 LTS 5.4.0-131-generic cri-o://1.21.7
xxx-5 Ready <none> 16d v1.23.3 xxx <none> Ubuntu 18.04.6 LTS 5.4.0-131-generic cri-o://1.21.7
- How was the ingress-nginx-controller installed:
ingress-nginx:
  controller:
    admissionWebhooks:
      enabled: false
    daemonset:
      useHostPort: true
    dnsPolicy: ClusterFirstWithHostNet
    hostNetwork: true
    kind: DaemonSet
    nodeSelector:
      core: ""
    service:
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
  nameOverride: nginx-ingress
- Current State of the controller:
kubectl describe ingressclasses
Name: nginx
Labels: app.kubernetes.io/component=controller
app.kubernetes.io/instance=XXX
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=nginx-ingress
app.kubernetes.io/part-of=nginx-ingress
app.kubernetes.io/version=1.4.0
helm.sh/chart=ingress-nginx-4.3.0
Annotations: meta.helm.sh/release-name: XXX
meta.helm.sh/release-namespace: XXX
Controller: k8s.io/ingress-nginx
Events: <none>
kubectl -n <ingresscontrollernamespace> get all -A -o wide
kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>
Name: xxx
Namespace: xxx
Priority: 0
Service Account: xxx
Node: xxx
Start Time: Thu, 27 Oct 2022 13:19:25 +0200
Labels: app.kubernetes.io/component=controller
app.kubernetes.io/instance=XXX
app.kubernetes.io/name=nginx-ingress
controller-revision-hash=58d9d684cc
pod-template-generation=6
Annotations: <none>
Status: Running
IP: xxx
IPs:
IP: xxx
Controlled By: DaemonSet/xxx
Containers:
controller:
Container ID: cri-o://5a2c60b3f3dc788430e3079f616161e71ec502a8f6acdabfa4c578cb093fda1e
Image: registry.k8s.io/ingress-nginx/controller:v1.4.0@sha256:34ee929b111ffc7aa426ffd409af44da48e5a0eea1eb2207994d9e0c0882d143
Image ID: registry.k8s.io/ingress-nginx/controller@sha256:34ee929b111ffc7aa426ffd409af44da48e5a0eea1eb2207994d9e0c0882d143
Ports: 80/TCP, 443/TCP
Host Ports: 80/TCP, 443/TCP
Args:
/nginx-ingress-controller
--publish-service=$(POD_NAMESPACE)/xxx
--election-id=ingress-controller-leader
--controller-class=k8s.io/ingress-nginx
--ingress-class=nginx
--configmap=$(POD_NAMESPACE)/xxx
State: Running
Started: Thu, 27 Oct 2022 14:02:08 +0200
Ready: True
Restart Count: 1
Requests:
cpu: 100m
memory: 90Mi
Liveness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
Readiness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Environment:
POD_NAME: xxx(v1:metadata.name)
POD_NAMESPACE: xxx(v1:metadata.namespace)
LD_PRELOAD: /usr/local/lib/libmimalloc.so
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-b5sd5 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-b5sd5:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: core=
kubernetes.io/os=linux
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>
Name: xxx
Namespace: xx
Labels: app.kubernetes.io/component=controller
app.kubernetes.io/instance=xxx
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=nginx-ingress
app.kubernetes.io/part-of=nginx-ingress
app.kubernetes.io/version=1.4.0
helm.sh/chart=ingress-nginx-4.3.0
Annotations: meta.helm.sh/release-name: xxx
meta.helm.sh/release-namespace: xxx
service.beta.kubernetes.io/aws-load-balancer-type: nlb
Selector: app.kubernetes.io/component=controller,app.kubernetes.io/instance=xxx,app.kubernetes.io/name=nginx-ingress
Type: LoadBalancer
IP Family Policy: SingleStack
IP Families: IPv4
IP: xxx
IPs: xxx
Port: http 80/TCP
TargetPort: http/TCP
NodePort: http 32655/TCP
Endpoints: xxx
Port: https 443/TCP
TargetPort: https/TCP
NodePort: https 32081/TCP
Endpoints: xxx
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
- Current state of ingress object, if applicable:
Name: xxx
Labels: app.kubernetes.io/instance=xxx
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=xxx
helm.sh/chart=xxx-0.1.0
Namespace: xxx
Address:
Ingress Class: <none>
Default backend: <default>
Rules:
Host Path Backends
---- ---- --------
xxx
/grafana(/|$)(.*) grafana:80 (10.0.2.238:3000)
/kibana(/|$)(.*) kibana-kibana:5601 (10.0.2.167:5601)
*
/grafana(/|$)(.*) grafana:80 (10.0.2.238:3000)
/kibana(/|$)(.*) kibana-kibana:5601 (10.0.2.167:5601)
Annotations: cert-manager.io/cluster-issuer: xxx
kubernetes.io/ingress.class: nginx
meta.helm.sh/release-name: xxx
meta.helm.sh/release-namespace: xxx
nginx.ingress.kubernetes.io/auth-cache-duration: 200 202 60s, 401 0s
nginx.ingress.kubernetes.io/auth-cache-key: $cookie_xxx
nginx.ingress.kubernetes.io/auth-response-headers: xxx
nginx.ingress.kubernetes.io/auth-signin: /
nginx.ingress.kubernetes.io/auth-url: http://xxx.svc.cluster.local:8080/xxx
nginx.ingress.kubernetes.io/rewrite-target: /$2
nginx.ingress.kubernetes.io/server-snippet:
location = /kibana {
return 301 /kibana/app/kibana#/discover/xxx...
}
nginx.ingress.kubernetes.io/ssl-redirect: false
Events: <none>
- Others:
How to reproduce this issue: Delete the “core-dns” pods in the “kube-system” namespace a couple of times.
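One way to do this, assuming the CoreDNS pods carry the default kubeadm label k8s-app=kube-dns:

  kubectl -n kube-system delete pod -l k8s-app=kube-dns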
Anything else we need to know:
About this issue
- State: open
- Created 2 years ago
- Reactions: 1
- Comments: 24 (8 by maintainers)
FWIW I experienced this same issue with a reverse-proxy nginx in Grafana Mimir: the nginx process was trying to use a dead coredns instance as a resolver, timing out on every request.
My workaround for this problem was to deploy a DNS Proxy (I use CoreDNS) as a sidecar container to nginx (explanation below)
Here is the section of the nginx Helm chart values that does what I mentioned above
And the CoreDNS config file
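As a rough illustration of this kind of sidecar setup (the image, port, names, and forwarding target are assumptions, not the poster's actual configuration): CoreDNS runs next to nginx, listens on localhost, forwards to the pod's existing resolver with a short cache, and nginx is pointed at it via its resolver directive.

  # Corefile for the sidecar (mounted from a ConfigMap)
  .:5353 {
      errors
      cache 30
      forward . /etc/resolv.conf
  }

  # sidecar container added to the nginx pod spec; nginx then uses
  # "resolver 127.0.0.1:5353;" instead of the cluster DNS service
  containers:
    - name: coredns-sidecar
      image: coredns/coredns:1.10.1
      args: ["-conf", "/etc/coredns/Corefile"]
      volumeMounts:
        - name: coredns-config
          mountPath: /etc/coredns
  volumes:
    - name: coredns-config
      configMap:
        name: coredns-sidecar-config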
@paolostancato there are 3 people reporting a problem here, and there is not enough data to draw any conclusions. From your last post, it seems you had coredns-related causes for your problems. That is beyond the scope of this project's discussion in a GitHub issue.
Better to talk about this on the Kubernetes Slack.