kubernetes: verify-openapi-spec.sh frequently hanging in CI
Many PRs have been failing due to the pull-kubernetes-verify job timing out.
It looks like both PR and CI jobs are affected:
- http://k8s-testgrid.appspot.com/presubmits-kubernetes-blocking#pull-kubernetes-verify&width=5&graph-metrics=test-duration-minutes
- http://k8s-testgrid.appspot.com/sig-release-master-blocking#verify&width=20&graph-metrics=test-duration-minutes
The failures aren’t consistent, but there appears to be a clear uptick starting around 13:00 PDT on Monday, April 30.
There aren’t any notable test-infra changes from that time.
Among the kubernetes changes from that window, nothing immediately leaps out. The only two PRs I find suspicious are:
- https://github.com/kubernetes/kubernetes/pull/58474, since it affects apiserver (which we run in update-openapi-spec.sh) and appears to have had several similar timeouts in verify-openapi-spec.sh before it merged.
- https://github.com/kubernetes/kubernetes/pull/60034, since it affects startup/shutdown logic, but in kubelet/device manager, so I don’t think it would affect this job.
I’ve been debugging the verify scripts with set -x, and the last lines printed before the hang are:
W0502 20:36:35.746] + echo '/go/src/k8s.io/kubernetes/api/openapi-spec up to date.'
W0502 20:36:35.746] + cp -a /go/src/k8s.io/kubernetes/_tmp/openapi-spec /go/src/k8s.io/kubernetes/api/openapi-spec/..
W0502 20:36:35.746] + rm -rf /go/src/k8s.io/kubernetes/_tmp
W0502 20:36:35.746] + echo 0
W0502 20:36:35.746] + tr -d '\n'
which corresponds to https://github.com/kubernetes/kubernetes/blob/0d43bdec2b8598ff542a1afdee876d417b4e7668/third_party/forked/shell2junit/sh2ju.sh#L48-L49 and https://github.com/kubernetes/kubernetes/blob/0d43bdec2b8598ff542a1afdee876d417b4e7668/third_party/forked/shell2junit/sh2ju.sh#L110-L112
so there might be some weird pipe/buffering nonsense going on causing everything to get stuck. (I’m still not sure what would cause this to start failing, though.)
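One generic way a pipeline can get stuck like this is a background process that outlives the script while still holding the write end of the pipe; the reader then never sees EOF. The sketch below only illustrates that mechanism and is not a confirmed diagnosis of this job. `leaky` and `quiet` are hypothetical stand-ins, not functions from the kubernetes repo.

```shell
#!/bin/sh
# Hypothetical stand-ins, not code from the kubernetes repo.

# Leaves a background process running that inherits stdout,
# i.e. the write end of whatever pipe the caller set up.
leaky() {
  sleep 2 &
  echo "done"
}

# Same, but the background process is detached from the pipe.
quiet() {
  sleep 2 >/dev/null 2>&1 &
  echo "done"
}

start=$(date +%s)
leaky_out=$(leaky | cat)   # cat sees EOF only once the background sleep exits
leaky_elapsed=$(( $(date +%s) - start ))

start=$(date +%s)
quiet_out=$(quiet | cat)   # cat sees EOF as soon as quiet returns
quiet_elapsed=$(( $(date +%s) - start ))

echo "leaky: ${leaky_out} after ${leaky_elapsed}s"
echo "quiet: ${quiet_out} after ${quiet_elapsed}s"
```

Both functions print their output immediately, but the first pipeline should only return after the background `sleep` finishes. If the lingering process never exits, the reader on the other end of the pipe waits forever.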
About this issue
- State: closed
- Created 6 years ago
- Comments: 16 (16 by maintainers)
Commits related to this issue
- Merge pull request #63380 from liggitt/revert-lease (automatic merge from submit-queue), committed to kubernetes/kubernetes 6 years ago
- Merge pull request #63383 from liggitt/lease-reconciler (automatic merge from submit-queue, batch tested with PRs 63315, 63383, 63318, 63439), committed to kubernetes/kubernetes 6 years ago
We can do both, but if circumstances beyond our control mean etcd is dead, the apiserver should still be well-behaved.
Would a better fix be to update the script to wait for the apiserver to exit?
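A minimal sketch of what "wait for the apiserver to exit" could look like in a shell script. This is an assumption about the shape of such a fix, not the actual contents of hack/update-openapi-spec.sh; the `sleep 30` process stands in for the apiserver, and `stop_server` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical stand-in for the apiserver: any long-running background process.
sleep 30 >/dev/null 2>&1 &
SERVER_PID=$!

# Hypothetical helper: signal the process, then block until it has exited.
stop_server() {
  kill "$1" 2>/dev/null || true
  wait "$1" 2>/dev/null || true   # returns once the child has been reaped
}

stop_server "$SERVER_PID"

# After wait returns, the PID should no longer refer to a live process.
if kill -0 "$SERVER_PID" 2>/dev/null; then
  status=running
else
  status=exited
fi
echo "server is ${status}"
```

Note that `wait` blocks for as long as the process takes to shut down, so this only helps if the server actually exits after being signalled; a server that hangs on shutdown would move the hang here rather than remove it.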
If I take d39eac929f3babfc19e372a89af71d8fa3cbdcf1 but pass --endpoint-reconciler-type=master-count to the apiserver in hack/update-openapi-spec.sh, the process is killed after the script runs.