concourse: Upgrade from 4.0.0 to 4.2.1 results in timeout for any fly command

Bug Report

After upgrading Concourse from 4.0.0 to 4.2.1 and running a successful fly sync, fly <any command> fails with Get https://<url>/api/v1/info: dial tcp: i/o timeout, even though the ATC is reachable both via browser and via curl.

Previously posted issue (which received no attention so far - perhaps this is a better place to post it?) is here: https://github.com/concourse/fly/issues/264

Steps to Reproduce

By downloading previous versions of fly, it is apparent that the problem occurred between 4.0.0 and 4.1.0:

The following works:

curl -Ls https://github.com/concourse/concourse/releases/download/v4.0.0/fly_linux_amd64 > fly400  && chmod +x fly400 && \
./fly400 -t demo login -c 'https://<url>' --verbose

(and tells me to sync, see details)

logging in to team 'main'

2018/09/18 17:52:05 GET /api/v1/info HTTP/1.1
Host: <sanitized>
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip

2018/09/18 17:52:06 HTTP/1.1 200 OK
Content-Length: 43
Cache-Control: no-cache
Connection: keep-alive
Content-Type: application/json
Date: Tue, 18 Sep 2018 15:52:06 GMT
Pragma: no-cache
Server: nginx
X-Concourse-Version: 4.2.1
X-Content-Type-Options: nosniff
X-Download-Options: noopen
X-Xss-Protection: 1; mode=block

{"version":"4.2.1","worker_version":"2.1"}

WARNING:

fly version (4.0.0) is out of sync with the target (4.2.1). to sync up, run the following:

whereas this fails with a timeout

curl -Ls https://github.com/concourse/concourse/releases/download/v4.1.0/fly_linux_amd64 > fly410  && chmod +x fly410 && \
./fly410 -t demo login -c 'https://<url>' --verbose
logging in to team 'main'

2018/09/18 18:11:28 GET /api/v1/info HTTP/1.1
Host: <url>
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


could not reach the Concourse server called bridge:

    Get https://<url>/api/v1/info: dial tcp: i/o timeout

is the targeted Concourse running? better go catch it lol

Expected Results

I expect to be able to use fly <any command>

Actual Results

i/o timeout

Additional Context

curl https://<url>/api/v1/info
{"version":"4.2.1","worker_version":"2.1"}

Background to our setup:

We front Concourse with an nginx server that accepts only HTTPS traffic and does SSL termination. There is no http->https redirect, for instance; firewalls block all incoming traffic except on port 443.

After SSHing onto the ATC server, running ./fly421 -t demo login -c 'http://127.0.0.1:8080' --verbose on the ATC server itself works.

Version Info

  • Concourse version: 4.2.1 (4.1.0 also affected)
  • Deployment type (BOSH/Docker/binary): binary
  • Infrastructure/IaaS: AWS EC2, ubuntu16 AMI
  • Browser (if applicable): n/a
  • Did this used to work? yes, every 2.x and 3.x version, up to and including 4.0, works.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 18 (7 by maintainers)

Most upvoted comments

The issue was related to DNS: fly could not deal with multiple DNS entries in /etc/resolv.conf. I had two DNS servers in mine, and taking one out (either one, it wasn't related to a particular DNS server) was a successful workaround. The latest release of fly and concourse (5.3.0) no longer exhibits this problem, so whatever broke between 4.0.0 and 4.2.1 seems to have been fixed by now. So this issue can be closed.

After some investigation, it seems to be a DNS resolution issue with Go 1.11. If I replace the hostname in the URL of the concourse server with an IP, it can successfully connect.

It seems that Go 1.11 does not correctly use all DNS servers listed in /etc/resolv.conf (if there is more than one).

As a workaround, I configured my DNS connection to use only one DNS server, 1.1.1.1, and now it works.

Thanks for following up!

On Mon, Feb 25, 2019, 1:12 PM Colin Simmons notifications@github.com wrote:

Our issue turned out to be a problematic change we had made to our cloud config. This left the resolv.conf on our worker instance looking something like:

cat /etc/resolv.conf
nameserver 10.0.1.2
nameserver 10.0.0.2
search eu-west-2.compute.internal

Our mistake was thinking that the reserved DNS IP is always the third IP of the subnet when actually it is the .2 of the VPC super net. It seems that Golang (or maybe fly in particular) is more sensitive to invalid DNS entries than other applications. It was timing out trying to connect over 10.0.1.2 before trying on the valid 10.0.0.2.
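To spot an invalid entry like the 10.0.1.2 above, each nameserver can be queried on its own. The following is only a sketch, assuming dig is installed; the function names, the 2-second timeout, and the single try are illustrative choices, not anything fly or Concourse provides.

```shell
# Print the nameserver addresses from a resolv.conf-style file, one per line.
# The file path is a parameter so the function can be pointed at a copy.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Query each nameserver separately with a short timeout, so an entry that
# never answers shows up as FAILED instead of hiding behind a working one.
# Assumes dig is available; timeout and retry values are arbitrary.
probe_nameservers() {
  host="$1"
  file="${2:-/etc/resolv.conf}"
  for ns in $(list_nameservers "$file"); do
    if dig +time=2 +tries=1 +short "@$ns" "$host" >/dev/null 2>&1; then
      echo "$ns ok"
    else
      echo "$ns FAILED"
    fi
  done
}
```

It could be called as probe_nameservers <hostname of the ATC>; a server that never replies is reported as FAILED after the timeout.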


Thanks @dam5s! I can confirm the workaround: if /etc/resolv.conf contains two name servers (it usually does on my local setup), it fails. Removing or commenting out name servers so that only one remains in /etc/resolv.conf solves the problem.
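The workaround can be previewed without touching the real file. This is a minimal sketch with a hypothetical helper name; it only prints a modified copy to stdout and does not write to /etc/resolv.conf. On systems where that file is generated (for example by DHCP or a resolver service), a manual edit may be overwritten.

```shell
# Print a resolv.conf-style file with every nameserver line after the first
# commented out. Nothing is written back; redirect the output yourself
# if the result looks right.
keep_first_nameserver() {
  awk '$1 == "nameserver" && seen++ { print "#" $0; next } { print }' \
    "${1:-/etc/resolv.conf}"
}
```

Running keep_first_nameserver with no argument shows what /etc/resolv.conf would look like with only its first name server active.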

There seems to be a related issue here (though I’m on linux, not OS X). Is fly compiled with or without cgo?
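One way to probe the cgo question from the outside is Go's GODEBUG netdns setting, which the net package documents for choosing between the pure Go resolver and the cgo (system) resolver. The wrapper below is a sketch with a hypothetical name; whether forcing cgo has any effect depends on how the fly binary was built, which this thread does not establish.

```shell
# Run a command with Go's resolver choice forced through GODEBUG.
# mode is "go" (pure Go resolver) or "cgo" (system resolver; only honoured
# by binaries built with cgo). The "+1" suffix asks Go programs to print
# which resolver they picked.
with_netdns() {
  mode="$1"
  shift
  GODEBUG="netdns=${mode}+1" "$@"
}
```

For example, with_netdns go ./fly410 -t demo login -c 'https://<url>' --verbose and the same command with cgo could be compared to see whether the timeout follows the resolver.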