go: x/build/cmd/coordinator: failures are sometimes missing error output

https://build.golang.org/log/dab786adef1a18622f61641285864ac9c63fb7e3 is marked as failing on the dashboard, but the word FAIL does not appear in the output file at all.

Either the output is truncated, or the last test in the log (misc/cgo/testshared) exited with a nonzero status and no output.
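For context, here is a minimal sketch of the kind of check described above: download the build log and scan it for the FAIL marker that a failed test normally prints. The checkLog helper and the program around it are hypothetical illustrations under that assumption, not part of x/build/cmd/coordinator.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"strings"
)

// checkLog fetches a build log over HTTP and reports whether it contains
// the "FAIL" marker that a failed test normally prints. Hypothetical
// illustration only; not part of x/build/cmd/coordinator.
func checkLog(url string) (bool, error) {
	resp, err := http.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return strings.Contains(string(body), "FAIL"), nil
}

func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: checklog <build-log-url>")
	}
	found, err := checkLog(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("log contains FAIL:", found)
}
```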

CC @dmitshur @toothrot @cagedmantis

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 24 (23 by maintainers)

Most upvoted comments

In our team meeting this week, we discussed this and agreed that the next step here is to move this to the 1.19 milestone, to give us a chance to prioritize investigation of this issue during the 1.19 dev cycle. To make progress here, we need to do development work on x/build/cmd/coordinator, and the 1.18 freeze isn’t the right period for that work.

For the 1.18 freeze, if linux-386-longtest test failures happen at a rate higher than can be attributed to flakiness, we’ll need to find the best ways to reproduce and understand them even if the build logs do not include as much information as we’d like them to.

Updating this issue so that it better reflects the current state of things, and we can adjust it further as needed.

I feel your pain, and many thanks for doing this unrewarding work.

Still, this seems like a process problem. I see no reason that an issue like this should block a release. If the linux-386-longtest builder failed 50% of the time then it would have to block the release because it might be hiding other problems. But that’s not the case here; we are getting enough successful runs to have reason to believe that the build is basically OK.

You are pointing out that if these issues are not marked as release blockers, then they will never be addressed. That is the problem to fix. We shouldn’t get ourselves into a position where we have to hold up a release because of a problem with the builders when we have no reason to believe that the problem is, or is hiding, a problem with the release.

So I sympathize with the approach that you are taking, but I think we need to develop a different approach.

It looks to me like an issue with the builder system rather than something that will be fixed in the release.

In addition to the interaction with the porting policy, to me this kind of issue is also a matter of equity.

I watch the builders to check whether there have been regressions in the parts of the project for which I am responsible, and failures on the linux-386-longtest builder matter to me: for one, many of the cmd/go tests only run on the longtest builders; for another, many of the fuzzing tests I have reviewed this cycle have behaviors unique to the linux-386-longtest builder, because it has the unique combination of running non-short tests and having a non-amd64 GOARCH (which affects fuzzing instrumentation).

So when I see failures on this builder, I check them. A significant rate of false-positive failures causes a significant amount of unproductive, avoidable triage work, and that in turn contributes to feelings of frustration and burnout. Since the Go project does not seem to have anyone else triaging new or existing builder failures with any regularity, I feel that the costs of this ongoing flakiness have been externalized onto me.

#33598 went through a trajectory very similar to this issue: we had a series of recurring failures on the builders for darwin/amd64, which is also nominally a first-class port. I identified a way to reliably reproduce the problem in March 2020, and the issue remained unaddressed until I diagnosed and fixed it myself in October 2021 (CL 353549).

#39665 was also similar: Dmitri reported longtest failures on windows/amd64 (also a first-class port) in June 2020, and no apparent progress was made on even a diagnosis until I reported a new failure mode in November 2021 (in #49457) and marked it as a release-blocker, at which point the underlying issue was apparently fixed.

If we consider subrepo tests, there are many more examples. As I understand it from #11811, the policy of the project is that subrepo tests should be passing before a release, but for at least the past couple of years we have cut releases with frequently- or persistently-broken subrepo builders. (Some examples: #45700, #31567, and #36163, the last of which is on windows/amd64, a first-class port.)

My takeaway from those cases is that persistent builder issues generally will not be addressed unless I address them myself (as in #33598, #31567, and #36163), or they actively interfere with running x/build/cmd/releasebot, or they are explicitly marked with the release-blocker label.

Letting these kinds of persistent issues linger was understandable as a short-term situation in 2020, but it isn’t tenable as a steady state for a large, staffed project. We all want to land releases, and our existing policy (at least as I understand it) is that to land a release we also need to keep the build healthy. Adhering to that policy provides some backpressure on the accumulation of unplanned technical debt, and helps to “internalize the externality” of flaky and broken builders.