kubernetes: nginx 1.9 DNS resolver periodically outputs an error
I’m not a Kubernetes or nginx expert, so I may need some guidance, but I really want to provide meaningful information and I’m willing to share more data.
I constantly see a 2016/05/25 21:34:35 [error] 170#170: unexpected response for hello-node.default.svc.cluster.local message when using the nginx resolver feature to resolve the pod IP. The pod can be reached just fine, but this message bugs me a lot. Here are my configs and logs.
kind: Service
apiVersion: v1
metadata:
  name: hello-node
spec:
  selector:
    app: hello-node
  ports:
    - port: 80
      targetPort: http
  clusterIP: None
# nginx.conf
user nginx;
worker_processes 1;

error_log /var/log/nginx/error.log debug;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;

    sendfile on;
    #tcp_nopush on;

    keepalive_timeout 65;

    #gzip on;

    include /etc/nginx/conf.d/*.conf;
}
# /etc/nginx/conf.d/hello-node.conf
server {
    listen 80;
    resolver 10.3.240.10 ipv6=off;
    set $backend_upstream "http://hello-node.default.svc.cluster.local:3000";

    location / {
        proxy_pass $backend_upstream;
    }
}
2016/05/25 21:34:35 [debug] 170#170: *67 http cleanup add: 000000000161EFD8
2016/05/25 21:34:35 [debug] 170#170: malloc: 0000000001652F50:160
2016/05/25 21:34:35 [debug] 170#170: resolve: "hello-node.default.svc.cluster.local"
2016/05/25 21:34:35 [debug] 170#170: malloc: 0000000001634DB0:54
2016/05/25 21:34:35 [debug] 170#170: resolve: "hello-node.default.svc.cluster.local" A 49529
2016/05/25 21:34:35 [debug] 170#170: send: fd:4 54 of 54
2016/05/25 21:34:35 [debug] 170#170: malloc: 0000000001634080:96
2016/05/25 21:34:35 [debug] 170#170: event timer add: -1: 30000:1464212105373
2016/05/25 21:34:35 [debug] 170#170: event timer add: -1: 5000:1464212080373
2016/05/25 21:34:35 [debug] 170#170: *67 http finalize request: -4, "/stubborn-poems.doc?" a:1, c:2
2016/05/25 21:34:35 [debug] 170#170: *67 http request count:2 blk:0
2016/05/25 21:34:35 [debug] 170#170: timer delta: 3
2016/05/25 21:34:35 [debug] 170#170: worker cycle
2016/05/25 21:34:35 [debug] 170#170: epoll timer: 5000
2016/05/25 21:34:35 [debug] 170#170: epoll: fd:14 ev:0004 d:00007F7E8D75B551
2016/05/25 21:34:35 [debug] 170#170: *67 http run request: "/stubborn-poems.doc?"
2016/05/25 21:34:35 [debug] 170#170: *67 http upstream check client, write event:1, "/stubborn-poems.doc"
2016/05/25 21:34:35 [debug] 170#170: *67 http upstream recv(): -1 (11: Resource temporarily unavailable)
2016/05/25 21:34:35 [debug] 170#170: epoll: fd:4 ev:0001 d:00007F7E8D75B2B0
2016/05/25 21:34:35 [debug] 170#170: recv: fd:4 86 of 4096
2016/05/25 21:34:35 [debug] 170#170: resolver DNS response 49529 fl:8580 1/2/0/0
2016/05/25 21:34:35 [debug] 170#170: resolver DNS response qt:1 cl:1
2016/05/25 21:34:35 [debug] 170#170: malloc: 00000000016528D0:36
2016/05/25 21:34:35 [debug] 170#170: resolver qs:hello-node.default.svc.cluster.local
2016/05/25 21:34:35 [debug] 170#170: resolver naddrs:2 cname:0000000000000000 ttl:30
2016/05/25 21:34:35 [debug] 170#170: malloc: 0000000001652900:8
2016/05/25 21:34:35 [debug] 170#170: malloc: 0000000001634C80:64
2016/05/25 21:34:35 [debug] 170#170: malloc: 000000000163A390:220
2016/05/25 21:34:35 [debug] 170#170: *67 http upstream resolve: "/stubborn-poems.doc?"
2016/05/25 21:34:35 [debug] 170#170: *67 name was resolved to 10.0.0.8
2016/05/25 21:34:35 [debug] 170#170: *67 name was resolved to 10.0.0.7
2016/05/25 21:34:35 [debug] 170#170: resolve name done: 0
2016/05/25 21:34:35 [debug] 170#170: event timer del: -1: 1464212105373
2016/05/25 21:34:35 [debug] 170#170: resolver expire
2016/05/25 21:34:35 [debug] 170#170: event timer del: -1: 1464212080373
2016/05/25 21:34:35 [debug] 170#170: *67 get rr peer, try: 2
2016/05/25 21:34:35 [debug] 170#170: *67 get rr peer, current: 000000000161F058 -1
2016/05/25 21:34:35 [debug] 170#170: *67 socket 15
2016/05/25 21:34:35 [debug] 170#170: *67 epoll add connection: fd:15 ev:80002005
2016/05/25 21:34:35 [debug] 170#170: *67 connect to 10.0.0.8:3000, fd:15 #68
2016/05/25 21:34:35 [debug] 170#170: *67 http upstream connect: -2
2016/05/25 21:34:35 [debug] 170#170: *67 posix_memalign: 0000000001652F50:128 @16
2016/05/25 21:34:35 [debug] 170#170: *67 event timer add: 15: 60000:1464212135388
2016/05/25 21:34:35 [debug] 170#170: recv: fd:4 86 of 4096
2016/05/25 21:34:35 [debug] 170#170: resolver DNS response 49529 fl:8580 1/2/0/0
2016/05/25 21:34:35 [debug] 170#170: resolver DNS response qt:1 cl:1
2016/05/25 21:34:35 [debug] 170#170: malloc: 00000000016528D0:36
2016/05/25 21:34:35 [debug] 170#170: resolver qs:hello-node.default.svc.cluster.local
2016/05/25 21:34:35 [error] 170#170: unexpected response for hello-node.default.svc.cluster.local
2016/05/25 21:34:35 [debug] 170#170: recv: fd:4 -1 of 4096
2016/05/25 21:34:35 [debug] 170#170: recv() not ready (11: Resource temporarily unavailable)
2016/05/25 21:34:35 [debug] 170#170: timer delta: 15
2016/05/25 21:34:35 [debug] 170#170: worker cycle
2016/05/25 21:34:35 [debug] 170#170: epoll timer: 35498
2016/05/25 21:34:35 [debug] 170#170: epoll: fd:15 ev:0004 d:00007F7E8D75B1D1
2016/05/25 21:34:35 [debug] 170#170: *67 http upstream request: "/stubborn-poems.doc?"
2016/05/25 21:34:35 [debug] 170#170: *67 http upstream send request handler
2016/05/25 21:34:35 [debug] 170#170: *67 http upstream send request
2016/05/25 21:34:35 [debug] 170#170: *67 http upstream send request body
2016/05/25 21:34:35 [debug] 170#170: *67 chain writer buf fl:1 s:331
2016/05/25 21:34:35 [debug] 170#170: *67 chain writer in: 000000000161F2A8
2016/05/25 21:34:35 [debug] 170#170: *67 writev: 331 of 331
2016/05/25 21:34:35 [debug] 170#170: *67 chain writer out: 0000000000000000
2016/05/25 21:34:35 [debug] 170#170: *67 event timer del: 15: 1464212135388
2016/05/25 21:34:35 [debug] 170#170: *67 event timer add: 15: 60000:1464212135388
2016/05/25 21:34:35 [debug] 170#170: timer delta: 0
2016/05/25 21:34:35 [debug] 170#170: worker cycle
2016/05/25 21:34:35 [debug] 170#170: epoll timer: 35498
2016/05/25 21:34:35 [debug] 170#170: epoll: fd:15 ev:2005 d:00007F7E8D75B1D1
2016/05/25 21:34:35 [debug] 170#170: *67 http upstream request: "/stubborn-poems.doc?"
2016/05/25 21:34:35 [debug] 170#170: *67 http upstream process header
2016/05/25 21:34:35 [debug] 170#170: *67 malloc: 000000000161F640:4096
2016/05/25 21:34:35 [debug] 170#170: *67 recv: fd:15 149 of 4096
2016/05/25 21:34:35 [debug] 170#170: *67 http proxy status 200 "200 OK"
2016/05/25 21:34:35 [debug] 170#170: *67 http proxy header: "Content-Type: text/plain"
2016/05/25 21:34:35 [debug] 170#170: *67 http proxy header: "Date: Wed, 25 May 2016 21:34:35 GMT"
2016/05/25 21:34:35 [debug] 170#170: *67 http proxy header: "Connection: close"
2016/05/25 21:34:35 [debug] 170#170: *67 http proxy header done
2016/05/25 21:34:35 [debug] 170#170: *67 HTTP/1.1 200 OK
Notice the line 2016/05/25 21:34:35 [error] 170#170: unexpected response for hello-node.default.svc.cluster.local in the middle.
Thanks!
About this issue
- State: closed
- Created 8 years ago
- Comments: 26 (7 by maintainers)
Hello guys,
Contrary to what I wrote in my last message, I am now almost certain that this is a k8s / docker issue. I can confirm that nginx is performing DNS requests as expected but is not receiving answers back. This seems to be caused by an incorrect configuration of the Node’s iptables DNAT rules for iptables-based k8s Services when a Pod is re-created.
How to reproduce (k8s v1.3.6):
1. Create a Service fooservice with type ClusterIP (10.0.0.10); the EndPoint in my case is 10.244.167.4:53, on both TCP and UDP.
2. Configure the nginx resolver with a valid parameter frequency to resolve fooservice (requests to nginx come from outside of the k8s cluster).
3. Re-create the Pod backing the Service: the ClusterIP is still 10.0.0.10, but the EndPoint is now 10.244.168.5:53.
4. Requests to fooservice via nginx start to fail (see the nginx logs); the traffic keeps going to 10.0.0.10, not to the endpoint!

My assumption is that at this point the iptables rules on the Node are not correctly cleaned up by kube-proxy, and the rule for port 54993 is stale (a sketch of how to inspect those rules follows below). I have found several open issues in the k8s lists pointing in this direction:
https://github.com/kubernetes/kubernetes/issues/19029
https://github.com/kubernetes/kubernetes/issues/31983
https://github.com/docker/docker/issues/8795
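Not part of the original comment, but one way to inspect the Node state being described might look roughly like this (the grep pattern and conntrack filters are generic examples):

```sh
# NAT rules programmed by kube-proxy for the DNS Service; the
# KUBE-SVC-*/KUBE-SEP-* chain hashes are cluster-specific.
sudo iptables -t nat -S | grep -i dns

# Conntrack entries for DNS traffic; a stale entry here keeps
# pointing UDP queries at the old, now-gone endpoint.
sudo conntrack -L -p udp --orig-port-dst 53
```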
Some of the bug reports in the other tickets also mention that it might be possible to trigger the timeout/refresh of the DNAT Node ports using conntrack on the Node, but I haven’t tried that yet.
Same issue here, and it was resolved by this: https://stackoverflow.com/a/37569397/2526528. I then also had to disable IPv6 for the resolver in nginx.
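The config snippet from this comment is not preserved above; as a sketch based on the server block from the original question (the valid= value is only illustrative), the result might look like:

```nginx
server {
    listen 80;

    # Re-resolve the service name frequently instead of trusting a stale
    # answer, and skip AAAA lookups (valid=10s is an illustrative value).
    resolver 10.3.240.10 valid=10s ipv6=off;

    set $backend_upstream "http://hello-node.default.svc.cluster.local:3000";

    location / {
        proxy_pass $backend_upstream;
    }
}
```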
Interesting observation testing locally with minikube.
I have an nginx pod with an endpoint that proxy_passes to another pod (via service_name.default.svc.cluster.local). There is a single kube-dns pod. When I access the nginx endpoint, it works fine.
Now, if I delete the dns pod, wait for the new dns pod to become Ready, and then access the nginx endpoint again, it still works. But if I delete the dns pod and access the nginx endpoint before the new dns pod is Ready, it fails, and it keeps failing even after the dns pod becomes Ready.
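Not part of the original comment, but on minikube the sequence being described might look roughly like this (the nginx URL and port are made up for illustration; k8s-app=kube-dns is the usual label):

```sh
# Delete the single kube-dns pod; a replacement is created automatically.
kubectl -n kube-system delete pod -l k8s-app=kube-dns

# Hit the nginx endpoint *before* the new dns pod is Ready
# (URL/port here are illustrative; use your own service address).
curl -v http://$(minikube ip):30080/

# Watch the new dns pod become Ready; per the observation above,
# nginx keeps failing even after this point.
kubectl -n kube-system get pods -l k8s-app=kube-dns -w
```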
I resolved this by running CoreDNS in Docker; my nginx and CoreDNS both run directly on the host.
Step 1: configure the Corefile (you may need to adjust the k8s master configuration; see https://coredns.io/plugins/kubernetes/); a sketch follows below.
Step 2: configure Docker and start the container.
Step 3: configure nginx and reload it.
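The actual configs for these steps were not included; purely as an illustration (the path, kubeconfig, and plugin options are assumptions, not the commenter’s real setup), a minimal Corefile could look something like:

```
# /etc/coredns/Corefile (hypothetical path)
.:53 {
    errors
    kubernetes cluster.local {
        kubeconfig /etc/coredns/kubeconfig   # assumed credentials for reaching the k8s API
        pods insecure
    }
    forward . /etc/resolv.conf   # everything else goes to the host resolver
    cache 30
}
```

The container could then be started along the lines of docker run -d --name coredns -v /etc/coredns/Corefile:/Corefile -p 53:53/udp -p 53:53/tcp coredns/coredns -conf /Corefile, with the nginx resolver directive pointed at the host’s address.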
@xingxing9688 I haven’t had a chance to test this again, but judging by the commit that closed this issue, it seems to have been fixed in k8s 1.7+.
@dlouzan great investigation! I have also been trying to understand the same issue, and the point I’ve reached so far is watching the iptables entries related to kube-dns. In the ‘normal’ state there are three types of entries for the three different IPs of the dns pods, as well as two “KUBE-SVC…” entries pointing to the kube-dns ClusterIP (10.0.220.2).
Then, if I delete all 3 dns pods simultaneously, all of the above entries are deleted except the two “KUBE-SVC…” entries pointing to the ClusterIP. It then takes a few seconds for new entries to appear for the newly created dns pods.
If, during those few seconds, one of my nginx pods tries to resolve its endpoint’s cluster IP, a new conntrack entry is created for it.
It is obviously using the only iptables entry that is still relevant, the one pointing to the kube-dns ClusterIP, and that connection never gets a reply… The conntrack entry has a TTL of 30s, and when running tcpdump on the k8s node I can see that my nginx pod keeps sending requests to this IP every 5 seconds, so the entry is never removed on its own (does anyone have an idea where this 5s interval is configured?).
At this point my nginx pod gets stuck on ‘Operation timed out’ resolver errors, and only a manual action such as a conntrack flush or a pod restart will resolve the issue.
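For illustration only (not from the original comment), the conntrack flush mentioned here could look like this on the Node:

```sh
# Drop conntrack entries for UDP DNS traffic so the next query creates a
# fresh entry that gets DNATed to the new kube-dns endpoint...
sudo conntrack -D -p udp --orig-port-dst 53

# ...or, more bluntly, flush the whole conntrack table.
sudo conntrack -F
```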
@dlouzan Thanks for the excellent investigation here! We’re running into this issue too, running the latest stable OpenResty 1.11.2.2, which builds with nginx core 1.11.2. We’re on k8s 1.4.
I’d also like to clarify that, at least according to this article (https://www.nginx.com/blog/dns-service-discovery-nginx-plus/), the resolver configuration for nginx is only available in nginx 1.11.3+. I’m currently in the middle of testing a configuration using an nginx upstream block configured with max_fails to brute-force DNS resolution until it succeeds. In our case the issue is intermittent and recovers in about a second.
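Not the commenter’s actual configuration, but a rough sketch of the upstream/max_fails approach described above (names and values are illustrative):

```nginx
# Hostnames in an upstream block are resolved when the config is loaded;
# max_fails/fail_timeout control how a failing peer is retried.
upstream hello_node_backend {
    server hello-node.default.svc.cluster.local:3000 max_fails=3 fail_timeout=5s;
}

server {
    listen 80;

    location / {
        proxy_pass http://hello_node_backend;
    }
}
```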