go: x/build/env/openbsd-386: openbsd-386-62 gomote instance crashes periodically
@ianlancetaylor reported in https://github.com/golang/go/issues/36996#issuecomment-581596950 that the openbsd-386-62
gomote instance was crashing, which made debugging an OpenBSD issue more difficult and time-consuming:
> Unfortunately, the gomote then crashed before I could look at all the data. The gomote continues to crash periodically, forcing me to rebuild everything before I can do more testing.
We should investigate and try to fix that, or find another solution to make it easier to debug OpenBSD issues. This is the tracking issue for that. /cc @cagedmantis @toothrot
About this issue
- State: closed
- Created 4 years ago
- Comments: 16 (15 by maintainers)
This change has been deployed. I’ve managed to keep a GCE-based gomote session alive for hours.
I’m going to close this issue.
@ianlancetaylor Hopefully you can finish your debugging of OpenBSD now!
I just verified that my test instance was successfully deleted by the remote buildlet cleanup process (as opposed to the abandoned VM process), as intended.
OK! I believe I have tracked it down.
When using `gomote ssh`, a property named `Expires` on `RemoteBuildlet` is updated every minute while an SSH session is active: https://github.com/golang/build/blob/5bb938ef020fb4b7f22d366b1e0dc8f9b425cc2f/buildlet/remote.go#L171

For GCE VMs, we also track a different attribute, `delete-at`, in instance metadata. This property is not updated while SSHing, meaning we will eventually hit the default 45-minute timeout on these VMs and expire them here: https://github.com/golang/build/blob/5bb938ef020fb4b7f22d366b1e0dc8f9b425cc2f/cmd/coordinator/gce.go#L577

We could do one or more of the following:
- [...] `gce.go` to account for active SSH sessions
- [...] `delete-at` when SSHing, as we’ll rely on the remote buildlet cleanup instead.

I’m not sure which is best yet, or if some combination of them is best. I’ll keep looking. I believe most of the knowledge of this code rests with @bradfitz and @crawshaw.
My current belief is this issue is specifically related to GCE VMs, which is a narrow-ish subset of our VMs. I’m still reading through the coordinator code to fully understand how it works before saying with confidence what is causing it, but I have my suspicions.
The default timeout is 45 minutes: https://github.com/golang/build/blob/17a7d8724fa7128cd79bcb78e1fbe087043bf810/cmd/coordinator/coordinator.go#L140
I’m still tracing through this code, but it seems like this should happen for all GCE VMs. I’ll keep digging.