concourse: Workers stall due to dropped connection, client doesn't recover, web node leaks goroutines

Bug Report

In our large-scale algorithm test environment, workers gradually stalled over the course of 2 days:

Screen Shot 2020-01-29 at 10 36 01 AM

When I investigated one of the workers, the machine itself seemed fine, but its logs showed that it had simply stopped heartbeating. It must have been stuck somewhere.

lsof -p <worker pid> revealed that there was no longer even an active connection to the TSA (port 2222).

Through this time, web node memory usage and goroutines crept up:

Screen Shot 2020-01-29 at 10 36 39 AM

Upon restarting all of the workers (i.e. bosh restart worker), the web goroutines dropped dramatically, and the memory usage slightly recovered (note: these graphs are over the past hour):

Screen Shot 2020-01-29 at 10 38 54 AM

Steps to Reproduce

Not sure yet. May need to just chaos test this.

Expected Results

I would expect the worker client to notice that the connection is broken and to recover/reconnect.

Actual Results

The worker client hung forever and stopped heartbeating/registering.

Additional Context

Goroutine dumps: https://gist.github.com/vito/a53e8b484a3d254a0872ebe7f5d15773

The dumps are large; I would recommend cloning the gist instead:

$ git clone https://gist.github.com/vito/a53e8b484a3d254a0872ebe7f5d15773

Version Info

  • Concourse version: master
  • Deployment type (BOSH/Docker/binary): BOSH
  • Infrastructure/IaaS: GCP
  • Browser (if applicable): n/a
  • Did this used to work? probably not

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 49 (44 by maintainers)

Most upvoted comments

In light of this issue I wonder if it makes sense to start thinking about having all atc <-> worker requests happen over HTTP, and pushing all the “magic” into the transport. The TSA could become a SOCKS5 proxy. This would let us adopt a more standard approach and make things less error-prone.
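For illustration, here's a minimal sketch of what that could look like from the web/ATC side, assuming the TSA exposed a SOCKS5 listener and using golang.org/x/net/proxy (the address and wiring here are hypothetical, not an existing Concourse API):

package atc

import (
	"net/http"
	"time"

	"golang.org/x/net/proxy"
)

// newWorkerClient builds an HTTP client whose connections are dialed through
// a SOCKS5 proxy (the TSA, in this idea). "tsa.example.com:1080" is a
// placeholder address.
func newWorkerClient() (*http.Client, error) {
	dialer, err := proxy.SOCKS5("tcp", "tsa.example.com:1080", nil, proxy.Direct)
	if err != nil {
		return nil, err
	}

	transport := &http.Transport{
		// The SOCKS5 dialer from x/net implements proxy.ContextDialer.
		DialContext: dialer.(proxy.ContextDialer).DialContext,
		// Plain HTTP requests get ordinary timeouts, so a dead TCP
		// connection surfaces as an error instead of a silent hang.
		ResponseHeaderTimeout: 30 * time.Second,
	}

	return &http.Client{Transport: transport}, nil
}

A nice side effect would be that per-request timeouts and retries become standard net/http behavior instead of custom SSH keepalive logic.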

Deployed the patch from the 5082-worker-stall branch onto release/5.8.x on the algorithm env around 15:45.

Worker running/stall state on Algorithm Environment

Screen Shot 2020-03-26 at 1 35 31 PM

From worker/0's logs, there were a few occurrences where the keepalive request hit its 5s timeout:

worker/cd802654-265f-4dd8-a3f9-f5e85009182e:/var/vcap/sys/log/worker# zgrep "failed sending keepalive request" ./*
worker/cd802654-265f-4dd8-a3f9-f5e85009182e:/var/vcap/sys/log/worker# zgrep "timed out sending keepalive request" ./*
./worker.stdout.log.5.gz:{"timestamp":"2020-03-25T20:18:07.431851225Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.timeout","data":{"error":"timed out sending keepalive request","session":"4.1.2"}}
./worker.stdout.log.5.gz:{"timestamp":"2020-03-25T20:18:37.820903117Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.timeout","data":{"error":"timed out sending keepalive request","session":"4.1.7"}}
./worker.stdout.log.5.gz:{"timestamp":"2020-03-25T20:19:19.572281729Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.timeout","data":{"error":"timed out sending keepalive request","session":"4.1.9"}}

Looking at one occurrence of the timeout, the worker beacon process flaps a few times before registering successfully:

{"timestamp":"2020-03-25T20:18:07.431851225Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.timeout","data":{"error":"timed out sending keepalive request","session":"4.1.2"}}
{"timestamp":"2020-03-25T20:18:07.432110765Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.run.command-failed","data":{"command":"forward-worker","error":"wait: remote command exited without exit status or exit signal","session":"4.1.3"}}
{"timestamp":"2020-03-25T20:18:07.432191194Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.exited-with-error","data":{"error":"wait: remote command exited without exit status or exit signal","session":"4.1"}}
{"timestamp":"2020-03-25T20:18:07.432212352Z","level":"debug","source":"worker","message":"worker.beacon-runner.beacon.done","data":{"session":"4.1"}}
{"timestamp":"2020-03-25T20:18:07.432229722Z","level":"error","source":"worker","message":"worker.beacon-runner.failed","data":{"error":"wait: remote command exited without exit status or exit signal","session":"4"}}
{"timestamp":"2020-03-25T20:18:12.432424716Z","level":"info","source":"worker","message":"worker.beacon-runner.restarting","data":{"session":"4"}}
{"timestamp":"2020-03-25T20:18:12.432569861Z","level":"debug","source":"worker","message":"worker.beacon-runner.beacon.start","data":{"session":"4.1"}}
{"timestamp":"2020-03-25T20:18:37.820903117Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.timeout","data":{"error":"timed out sending keepalive request","session":"4.1.7"}}
{"timestamp":"2020-03-25T20:18:37.821187547Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.failed-to-listen-for-baggageclaim","data":{"error":"EOF","session":"4.1"}}
{"timestamp":"2020-03-25T20:18:37.821238438Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.exited-with-error","data":{"error":"EOF","session":"4.1"}}
{"timestamp":"2020-03-25T20:18:37.821257482Z","level":"debug","source":"worker","message":"worker.beacon-runner.beacon.done","data":{"session":"4.1"}}
{"timestamp":"2020-03-25T20:18:37.821284768Z","level":"error","source":"worker","message":"worker.beacon-runner.failed","data":{"error":"EOF","session":"4"}}
{"timestamp":"2020-03-25T20:18:42.821423407Z","level":"info","source":"worker","message":"worker.beacon-runner.restarting","data":{"session":"4"}}
{"timestamp":"2020-03-25T20:18:42.821587588Z","level":"debug","source":"worker","message":"worker.beacon-runner.beacon.start","data":{"session":"4.1"}}
{"timestamp":"2020-03-25T20:19:19.572281729Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.timeout","data":{"error":"timed out sending keepalive request","session":"4.1.9"}}
{"timestamp":"2020-03-25T20:19:19.572563907Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.failed-to-listen-for-garden","data":{"error":"EOF","session":"4.1"}}
{"timestamp":"2020-03-25T20:19:19.572620892Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.exited-with-error","data":{"error":"EOF","session":"4.1"}}
{"timestamp":"2020-03-25T20:19:19.572645343Z","level":"debug","source":"worker","message":"worker.beacon-runner.beacon.done","data":{"session":"4.1"}}
{"timestamp":"2020-03-25T20:19:19.572667403Z","level":"error","source":"worker","message":"worker.beacon-runner.failed","data":{"error":"EOF","session":"4"}}
{"timestamp":"2020-03-25T20:19:24.572819154Z","level":"info","source":"worker","message":"worker.beacon-runner.restarting","data":{"session":"4"}}
{"timestamp":"2020-03-25T20:19:24.572957615Z","level":"debug","source":"worker","message":"worker.beacon-runner.beacon.start","data":{"session":"4.1"}}
{"timestamp":"2020-03-25T20:19:24.613967778Z","level":"info","source":"guardian","message":"guardian.list-containers.starting","data":{"session":"38"}}
{"timestamp":"2020-03-25T20:19:24.614195639Z","level":"info","source":"guardian","message":"guardian.list-containers.finished","data":{"session":"38"}}
{"timestamp":"2020-03-25T20:19:29.584247754Z","level":"debug","source":"worker","message":"worker.beacon-runner.beacon.keepalive.tick","data":{"session":"4.1.11"}}

Update

$ fly -t al workers
name                                  containers  platform  tags  team  state    version  age
00af6708-cc9c-42c3-9ce8-1ed1534c365a  67          linux     none  none  running  2.2      4d
1cd47b0a-745a-4469-a631-b8cf7fd0726f  86          linux     none  none  running  2.2      4d
4210cee2-a763-41c7-ae76-bcca9436b098  47          linux     none  none  running  2.2      4d
4807111d-8924-4c2a-af02-a9ef1b0e6b14  34          linux     none  none  running  2.2      4d
501edb90-78c3-4c20-8ef1-d888579514d1  74          linux     none  none  running  2.2      4d
743b25bd-e005-45b4-aad9-8aec21b33dbf  63          linux     none  none  running  2.2      4d
7c279688-403a-4b55-a3ed-dc5d5be472f3  37          linux     none  none  running  2.2      4d
8504814a-866d-4216-84ae-52587d488e8b  40          linux     none  none  running  2.2      4d
86c1b6d9-adb4-4ba4-9814-233914635bdb  43          linux     none  none  running  2.2      4d
9f3ab78e-398b-44ac-8c3a-e8a43dd28f23  23          linux     none  none  running  2.2      4d
a1026981-01cb-49bc-9d8b-ef18d9f03022  61          linux     none  none  running  2.2      4d
b25f6e4d-e0de-4b72-b800-f3efd7c3f819  70          linux     none  none  running  2.2      4d
b94e5a14-4749-42ae-9664-e64e4923fee5  68          linux     none  none  running  2.2      4d
b96b9f3c-aa31-4e94-9071-c7b0b0da8804  51          linux     none  none  running  2.2      4d
bc375505-c0dd-40b0-b334-6f6be85e1ad0  57          linux     none  none  running  2.2      4d
c199752f-0223-42cf-b0aa-3c7bdf9a27ba  36          linux     none  none  running  2.2      4d
d7dc4a52-ad11-44e0-bac3-157e980c82c3  73          linux     none  none  running  2.2      4d
d8095726-ec06-44d6-8c9f-0605a3577413  44          linux     none  none  running  2.2      4d
e4360603-5ea1-476e-8113-fd495abbeade  65          linux     none  none  running  2.2      4d
f149ba16-d3de-4819-bbf2-009152e67adc  91          linux     none  none  running  2.2      4d


the following workers have not checked in recently:

name                                  containers  platform  tags  team  state    version  age
1ff3e50f-c141-463b-a054-5bdd5209656d  12          linux     none  none  stalled  2.2      4d
42324e06-1404-46e2-b907-52efad99daa2  25          linux     none  none  stalled  2.2      4d
466ae9a7-95eb-44dd-acbe-2a6d3aa25c0e  23          linux     none  none  stalled  2.2      4d
591a0953-b5e9-4de3-a978-f5144fefee53  62          linux     none  none  stalled  2.2      4d
681eff36-eb59-4932-ad28-1f2cc8b4a79d  24          linux     none  none  stalled  2.2      4d
695f7f58-0ddc-44b4-a59b-01781499bc0d  8           linux     none  none  stalled  2.2      4d
6ab00e9f-b20d-46c3-9999-416691cb50dc  20          linux     none  none  stalled  2.2      4d
77a923a2-4177-48c6-8013-77ae5fcd06da  19          linux     none  none  stalled  2.2      4d
ada20384-3f68-4fbb-860b-4c7168bd3b98  48          linux     none  none  stalled  2.2      4d
b5e0b037-dfce-4545-98cf-dfaf84496856  33          linux     none  none  stalled  2.2      4d

these stalled workers can be cleaned up by running:

    fly -t al prune-worker -w (name)

Looking at their logs:

bosh -e prod -d concourse-algorithm ssh worker/1ff3e50f-09c5-45a1-bc12-40b541693387 -c 'sudo grep "2222: i/o timeout" /var/vcap/sys/log/worker/*'
worker/1ff3e50f-09c5-45a1-bc12-40b541693387: stdout | /var/vcap/sys/log/worker/worker.stdout.log:{"timestamp":"2020-03-03T01:43:52.384226417Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.failed","data":{"error":"write tcp 10.1.0.22:56446-\u003e10.1.0.96:2222: i/o timeout","session":"4.1.2"}}
bosh -e prod -d concourse-algorithm ssh worker/42324e06-9279-4911-abf1-768a2a8520ec -c 'sudo grep "2222: i/o timeout" /var/vcap/sys/log/worker/*'
worker/42324e06-9279-4911-abf1-768a2a8520ec: stdout | /var/vcap/sys/log/worker/worker.stdout.log:{"timestamp":"2020-03-06T19:39:30.967925271Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.failed","data":{"error":"write tcp 10.1.0.86:55658-\u003e10.1.0.20:2222: i/o timeout","session":"4.1.3989498"}}

// DOESN'T HAVE KEEPALIVE TIMEOUT

bosh -e prod -d concourse-algorithm ssh worker/466ae9a7-e516-44ab-8ab1-b77a0d670e56 -c 'sudo grep "2222: i/o timeout" /var/vcap/sys/log/worker/*'
bosh -e prod -d concourse-algorithm ssh worker/591a0953-e59f-4342-970e-6f66e190e195 -c 'sudo grep "2222: i/o timeout" /var/vcap/sys/log/worker/*'
worker/591a0953-e59f-4342-970e-6f66e190e195: stdout | /var/vcap/sys/log/worker/worker.stdout.log:{"timestamp":"2020-03-06T19:05:15.555281213Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.failed","data":{"error":"write tcp 10.1.0.133:40566-\u003e10.1.0.20:2222: i/o timeout","session":"4.1.4394576"}}

// DOESN'T HAVE KEEPALIVE TIMEOUT

bosh -e prod -d concourse-algorithm ssh worker/681eff36-0e39-417f-8e3b-b4df9d73f544 -c 'sudo grep "2222: i/o timeout" /var/vcap/sys/log/worker/*'
bosh -e prod -d concourse-algorithm ssh worker/695f7f58-a31b-4816-b59c-8e08cc9209c7 -c 'sudo grep "2222: i/o timeout" /var/vcap/sys/log/worker/*'
worker/695f7f58-a31b-4816-b59c-8e08cc9209c7: stdout | /var/vcap/sys/log/worker/worker.stdout.log:{"timestamp":"2020-03-04T12:06:02.794700162Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.failed","data":{"error":"write tcp 10.1.0.111:49944-\u003e10.1.0.96:2222: i/o timeout","session":"4.1.2"}}
bosh -e prod -d concourse-algorithm ssh worker/6ab00e9f-f511-400e-a833-d9efefd699a5 -c 'sudo grep "2222: i/o timeout" /var/vcap/sys/log/worker/*'
worker/6ab00e9f-f511-400e-a833-d9efefd699a5: stdout | /var/vcap/sys/log/worker/worker.stdout.log:{"timestamp":"2020-03-06T18:59:50.425905275Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.failed","data":{"error":"write tcp 10.1.0.85:38688-\u003e10.1.0.20:2222: i/o timeout","session":"4.1.4665214"}}
bosh -e prod -d concourse-algorithm ssh worker/77a923a2-1ab4-439c-b797-0bdb8713d3c7 -c 'sudo grep "2222: i/o timeout" /var/vcap/sys/log/worker/*'
worker/77a923a2-1ab4-439c-b797-0bdb8713d3c7: stdout | /var/vcap/sys/log/worker/worker.stdout.log:{"timestamp":"2020-03-06T08:38:39.730227549Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.failed","data":{"error":"write tcp 10.1.0.122:53818-\u003e10.1.0.20:2222: i/o timeout","session":"4.1.4439291"}}

// DOESN'T HAVE KEEPALIVE TIMEOUT

bosh -e prod -d concourse-algorithm ssh worker/ada20384-a69a-486b-9fec-df89b3255887 -c 'sudo grep "2222: i/o timeout" /var/vcap/sys/log/worker/*'
bosh -e prod -d concourse-algorithm ssh worker/b5e0b037-48f9-4a77-86dc-0a5624db52b1 -c 'sudo grep "2222: i/o timeout" /var/vcap/sys/log/worker/*'
worker/b5e0b037-48f9-4a77-86dc-0a5624db52b1: stdout | /var/vcap/sys/log/worker/worker.stdout.log:{"timestamp":"2020-03-06T19:40:05.839687848Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.failed","data":{"error":"write tcp 10.1.0.124:36280-\u003e10.1.0.20:2222: i/o timeout","session":"4.1.4048217"}}

TL;DR

70% of the instances had the timeout error. For these, closing the TCP connection in the keepalive section on failure would result in the worker reconnecting.

For the remaining 30%, I suspect that they are stuck in

_, _, err := sshClient.Conn.SendRequest("keepalive", true, []byte("sup"))

in which case wrapping the request in a select with a timeout might prevent the worker from waiting indefinitely.
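As a rough sketch of that idea (not the actual Concourse code; assuming the beacon holds a golang.org/x/crypto/ssh client), the keepalive loop could bound each request with a timeout and close the connection on any failure so the beacon runner reconnects:

package beacon

import (
	"time"

	"golang.org/x/crypto/ssh"
)

// keepAlive sends periodic keepalive requests; on timeout or error it closes
// the underlying connection so the worker reconnects instead of stalling.
func keepAlive(sshClient *ssh.Client, interval, timeout time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		errs := make(chan error, 1)

		go func() {
			// The call suspected of blocking indefinitely.
			_, _, err := sshClient.Conn.SendRequest("keepalive", true, []byte("sup"))
			errs <- err
		}()

		select {
		case err := <-errs:
			if err == nil {
				continue
			}
			// Failed keepalive: tear down the TCP connection so the
			// beacon runner notices and reconnects.
			sshClient.Close()
			return
		case <-time.After(timeout):
			// SendRequest never returned; don't wait forever.
			sshClient.Close()
			return
		}
	}
}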

Next steps: look at their stack traces to confirm whether that is indeed the case.

thanks for sharing!

I might be very wrong, but I think we shouldn’t worry much about those bridge-related messages (coming from the wbrdg-* devices) - AFAIK those bridge devices are there just to provide the network connectivity to the containers that garden creates, which is not something that beacon would be involved with 🤔


    VM-1.......................................
    .
    .  host-net-namespace
    .
    .  beacon
    .  bclaim
    .  gdn
    .
<--> eth0 <--> bridge
    .          |
    .          |
    .        host-veth
    .          |
    .          |     container net ns .........
    .          |     .
    .          '-----+--container-veth
    .                .
    .                .
(i.e., beacon and any other processes in the host-net-namespace would not be using the bridge devices mentioned there)
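As a quick sanity check of that, one could compare the /proc/<pid>/ns/net links of the beacon process and PID 1 (which is always in the host network namespace); the same target means the same namespace. A small illustrative snippet (run as root; the PIDs here are placeholders):

package main

import (
	"fmt"
	"os"
)

// sameNetNS reports whether two PIDs share a network namespace by comparing
// their /proc/<pid>/ns/net symlink targets.
func sameNetNS(pidA, pidB int) (bool, error) {
	a, err := os.Readlink(fmt.Sprintf("/proc/%d/ns/net", pidA))
	if err != nil {
		return false, err
	}
	b, err := os.Readlink(fmt.Sprintf("/proc/%d/ns/net", pidB))
	if err != nil {
		return false, err
	}
	return a == b, nil
}

func main() {
	// 12345 stands in for the beacon's PID; compare it against PID 1.
	same, err := sameNetNS(1, 12345)
	if err != nil {
		panic(err)
	}
	fmt.Println("beacon shares the host network namespace:", same)
}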

update (in case @cirocosta is interested and has suggestions 😄 )

On our test environment (deployed yesterday morning), Datadog showed a worker stalling out at about 12:44 AM UTC.

We jumped onto the stalled worker and looked at its logs (specifically for the "worker.beacon-runner.beacon.heartbeated" message), seeing them happen about every 30 seconds; after 12:43:30, no more of those messages show up!

We did notice a GIANT influx of "baggageclaim.api.volume-server.get-volume.get-volume.volume-not-found" messages surrounding the heartbeated messages.

Looking at dmesg, at 12:43 AM UTC we got a lot of:

...
   3098 [  +0.349426] wbrdg-0afe0118: port 1(wu48a1glbs24-0) entered blocking state
   3099 [  +0.000003] wbrdg-0afe0118: port 1(wu48a1glbs24-0) entered forwarding state
   3100 [  +0.067942] wbrdg-0afe0094: port 1(wu48a1glbs13-0) entered blocking state
   3101 [  +0.000004] wbrdg-0afe0094: port 1(wu48a1glbs13-0) entered forwarding state
   3102 [  +0.054608] wbrdg-0afe00b4: port 1(wu48a1glbs1b-0) entered blocking state
   3103 [  +0.000003] wbrdg-0afe00b4: port 1(wu48a1glbs1b-0) entered forwarding state
...

Anywho, we grabbed a stack dump and noticed some hanging SSH library calls, which we are currently investigating!

Screen Shot 2020-03-03 at 3 34 14 PM