go: x/build: frequent "communication error to buildlet" failures on `plan9-arm`

#!watchflakes
post <- builder == "plan9-arm" && `communication error to buildlet`
plan9-arm at 349cc83389f71c459b7820b0deecdf81221ba46c
…
communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain

greplogs --dashboard -md -l -e '\Aplan9-arm.*(\n.*)*communication error to buildlet' --since=2022-01-01

2022-05-02T14:54:05-349cc83/plan9-arm
2022-04-27T14:23:28-f0c0e0f/plan9-arm
2022-04-26T02:28:58-17d7983/plan9-arm
2022-04-11T16:31:53-0179331/plan9-arm
2022-04-07T23:06:24-c451a02/plan9-arm
2022-04-05T14:15:59-62bceae/plan9-arm
2022-03-31T05:34:15-2b8178c/plan9-arm
2022-03-31T00:27:01-0a6ddcc/plan9-arm
2022-03-31T00:26:58-0775730/plan9-arm
2022-03-30T01:12:57-8fefeab/plan9-arm
2022-03-21T19:10:16-efbff6e/plan9-arm
2022-03-07T18:17:40-dcb6547/plan9-arm
2022-03-03T21:19:37-87a345c/plan9-arm
2022-03-01T19:32:51-44e92e1/plan9-arm
2022-02-25T00:25:34-b8b3196/plan9-arm
2022-02-01T18:15:07-125c5a3/plan9-arm
2022-01-27T21:25:18-ad345c2/plan9-arm
2022-01-19T16:33:11-985d97e/plan9-arm
2022-01-10T22:49:07-4ceb5a9/plan9-arm

@millerresearch, can something be done to prevent this builder from getting wedged?

(Compare #49756.)

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 21 (13 by maintainers)

Most upvoted comments

There was another failure mode: one of the Raspberry Pi builders had only 1 GB of RAM and no swap configured. I’ve added some swap space, so it should be more stable now.

IMO it would not be appropriate to retry the test: if it times out on one run, what’s to stop it from timing out on the next one?

Whenever I do a manual retry using the `retrybuilds` command after a communication error failure, the next attempt always succeeds. My strong hunch is that something in the underlying platform is stalling non-deterministically, not something within the Go code.

I will set up a process on my local builders to monitor progress on the log output file. If nothing is emitted for, say, 15 minutes, it will send an alert so I can go in with the debugger and try to find out what’s stalled.