grpc-go: DNS resolution does not work on Docker

What version of gRPC are you using?

v1.28.0

What version of Go are you using (go version)?

1.14

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

When trying to use ChirpStack on Docker, DNS resolution does not seem to work.

https://github.com/brocaar/chirpstack-docker

What did you expect to see?

chirpstack-application-server and chirpstack-network-server connect and talk to each other.

What did you see instead?

DNS name resolution failures.

chirpstack-network-server_1 | time="2020-04-27T05:06:18Z" level=warning msg="creating insecure application-server client" server="chirpstack-application-server:8001"
chirpstack-network-server_1 | time="2020-04-27T05:06:23Z" level=warning msg="ccResolverWrapper: reporting error to cc: dns: A record lookup error: lookup chirpstack-application-server on 127.0.0.11:53: dial udp 127.0.0.11:53: operation was canceled"
chirpstack-network-server_1 | time="2020-04-27T05:06:23Z" level=error msg="gateway: handle gateway stats error" ctx_id=68623603-d2d2-4f2c-a6e4-34f56eae1dad error="get application-server client error: get application-server client error: create application-server api client error: dial application-server api error: context deadline exceeded"

chirpstack-docker sets up several containers that talk to each other over the network. When run, the ChirpStack apps connect to the Redis, PostgreSQL, and Mosquitto servers just fine using their DNS names, but the two ChirpStack apps cannot talk to each other over gRPC due to DNS resolution failures. I verified that the DNS names resolve in each container using nslookup on the container's CLI. I also verified that the apps work correctly when using IP addresses rather than DNS names.


Most upvoted comments

This is still an issue here; hopefully the original poster has time to work out some repro code. 👍

I placed a call to net.Dial() right before the grpc.DialContext() that is throwing this error, and it worked. net.Dial() was able to resolve the hostname passed to it, but grpc.DialContext() failed.
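
A minimal sketch of that comparison, assuming insecure dial options and the target address taken from the logs above:

package main

import (
    "context"
    "log"
    "net"
    "time"

    "google.golang.org/grpc"
)

func main() {
    // Target address assumed from the logs above.
    addr := "chirpstack-application-server:8001"

    // Plain net.Dial resolves the hostname and connects fine.
    if c, err := net.Dial("tcp", addr); err != nil {
        log.Printf("net.Dial failed: %v", err)
    } else {
        log.Println("net.Dial succeeded")
        c.Close()
    }

    // grpc.DialContext against the same target fails with the DNS
    // lookup error shown above. WithBlock makes the dial wait so the
    // resolution error surfaces here instead of on the first RPC.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    conn, err := grpc.DialContext(ctx, addr, grpc.WithInsecure(), grpc.WithBlock())
    if err != nil {
        log.Printf("grpc.DialContext failed: %v", err)
        return
    }
    defer conn.Close()
    log.Println("grpc.DialContext succeeded")
}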

I’m going to see if I can work up some sample code this weekend that reproduces this issue.

Is there any way to enable additional logging from gRPC or the underlying net library to further troubleshoot this issue?
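
(For what it's worth, grpc-go's internal logging can be turned up via the GRPC_GO_LOG_SEVERITY_LEVEL and GRPC_GO_LOG_VERBOSITY_LEVEL environment variables, or programmatically; a small sketch using the stock grpclog package:)

package main

import (
    "os"

    "google.golang.org/grpc/grpclog"
)

func init() {
    // Equivalent to setting GRPC_GO_LOG_SEVERITY_LEVEL=info and
    // GRPC_GO_LOG_VERBOSITY_LEVEL=2 in the container environment.
    grpclog.SetLoggerV2(grpclog.NewLoggerV2WithVerbosity(os.Stdout, os.Stdout, os.Stderr, 2))
}

func main() {
    // ... dial and run as usual; resolver activity is now logged.
}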

Seeing the same issue here.

Same here. It seems to be related to https://github.com/gliderlabs/docker-alpine/issues/539, an Alpine/BusyBox problem.
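
If the Alpine/BusyBox angle is right, the failure should be reproducible with Go's resolver alone, without gRPC in the picture; a quick sketch, with the hostname assumed from the logs above:

package main

import (
    "context"
    "log"
    "net"
    "time"
)

func main() {
    // Force the pure-Go resolver, which is what a static (CGO-disabled)
    // Alpine build uses anyway, bypassing the system's libc resolver.
    r := &net.Resolver{PreferGo: true}
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    addrs, err := r.LookupHost(ctx, "chirpstack-network-server")
    if err != nil {
        log.Fatalf("lookup failed: %v", err)
    }
    log.Printf("resolved: %v", addrs)
}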

I’ve done some more investigating and this seems even stranger now.

I have some more logging:

chirpstack-application-server_1  | time="2020-05-01T16:44:30Z" level=warning msg="creating insecure network-server client" server="chirpstack-network-server:8000"
chirpstack-application-server_1  | time="2020-05-01T16:44:30Z" level=debug msg="parsed scheme: \"\""
chirpstack-application-server_1  | time="2020-05-01T16:44:30Z" level=debug msg="scheme \"\" not registered, fallback to default scheme"
chirpstack-application-server_1  | time="2020-05-01T16:44:31Z" level=debug msg="Channel Connectivity change to SHUTDOWN"
chirpstack-application-server_1  | time="2020-05-01T16:44:35Z" level=debug msg="dns: SRV record lookup error: lookup _grpclb._tcp.chirpstack-network-server on 127.0.0.11:53: dial udp 127.0.0.11:53: operation was canceled"
chirpstack-application-server_1  | time="2020-05-01T16:44:35Z" level=debug msg="dns: A record lookup error: lookup chirpstack-network-server on 127.0.0.11:53: dial udp 127.0.0.11:53: operation was canceled"
chirpstack-application-server_1  | time="2020-05-01T16:44:35Z" level=warning msg="ccResolverWrapper: reporting error to cc: dns: A record lookup error: lookup chirpstack-network-server on 127.0.0.11:53: dial udp 127.0.0.11:53: operation was canceled"
chirpstack-application-server_1  | time="2020-05-01T16:44:35Z" level=error msg="finished unary call with code Unknown" ctx_id=f420a2bb-6f03-48a8-9ac9-36849f82fb45 error="rpc error: code = Unknown desc = context deadline exceeded" grpc.code=Unknown grpc.method=Create grpc.service=api.NetworkServerService grpc.start_time="2020-05-01T16:44:30Z" grpc.time_ms=5004.83 peer.address="127.0.0.1:52666" span.kind=server system=grpc

At this point I need to provide a little more background to explain the weird part I discovered. I have ChirpStack set up to create four services, each running in its own container (redis, postgresql, chirpstack-network-server, and chirpstack-application-server). These containers communicate over a virtual network that does not have access to the internet. I also have chirpstack-network-server and chirpstack-application-server connected to another virtual network so that my nginx proxy can serve the ChirpStack frontend and reach my mqtt container. I want to reiterate that neither of these two internal virtual networks has access to the internet.

Here is the interesting part: if I attach any network that has internet access to the container, then the grpc.DialContext() call resolves the IP and connects right away. I did some experiments, and it still seems to me to be a bug in either gRPC (likely) or Go itself (less likely). My reasoning: if it were a bug in Go, other libraries such as the mqtt, redis, and postgresql clients would be affected too. But since those all work with or without internet access, I think this is something deep inside of gRPC.

One experiment I did was to start up the container and try adding the network-server address to the application server; it failed as before. I then added and removed a network with internet access, without trying to resolve the address in between, and after that, adding the network-server again continued to fail. This shows that it is not simply that attaching an internet-connected network kick-starts something in Docker; rather, something is failing specifically because there is no internet connection.

Another experiment I tried was increasing the timeout set for the grpc.DialContext() call from ChirpStack. It was set at 500ms and I changed it to 60s; however, it appears the call still fails at about the same time. So, based on the message "operation was canceled", I'm assuming there is another timeout somewhere that I didn't see when looking at the gRPC resolver code.

I’m still trying to setup some example code to repro this, but I’m just a little time limited right now.

I guess the interim workaround for this would be to set up the Docker containers with an internet-connected virtual network rather than an isolated one. Or, use an isolated virtual network with static IPs assigned to the containers, and use those IPs in the software configuration. A sketch of the second option is below.
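
A docker-compose fragment sketching the static-IP workaround; the network name, subnet, and addresses here are illustrative, not what chirpstack-docker actually ships:

networks:
  chirpstack-internal:
    internal: true              # isolated: no internet access
    ipam:
      config:
        - subnet: 172.28.0.0/16

services:
  chirpstack-network-server:
    networks:
      chirpstack-internal:
        ipv4_address: 172.28.0.10
  chirpstack-application-server:
    networks:
      chirpstack-internal:
        ipv4_address: 172.28.0.11

The app configs would then point at the fixed IPs (e.g. 172.28.0.11:8001 instead of chirpstack-application-server:8001), taking the resolver out of the picture entirely.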