kubernetes: pull-kubernetes-e2e-gce is nearing its timeout
I forget which timeout applies to which layer, but there are two:
--timeout=90
(ref: https://github.com/kubernetes/test-infra/blob/5df2f2f9ff8a75d5e0128889a314241b14749b2b/config/jobs/kubernetes/sig-gcp/sig-gcp-gce-config.yaml#L43)
--timeout=65m
(ref: https://github.com/kubernetes/test-infra/blob/5df2f2f9ff8a75d5e0128889a314241b14749b2b/config/jobs/kubernetes/sig-gcp/sig-gcp-gce-config.yaml#L56)
Tests are near 60m at this point. https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce&graph-metrics=test-duration-minutes&include-filter-by-regex=Timeout|Overall
It’s difficult to tell how we’re doing on the CI equivalent of this job because it seems to be flaking so badly that it’s perpetually failing? https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default&width=5 (this seems like a separate issue)
I queried BigQuery and exported the results into Data Studio to chart the average job duration since 2018:
SELECT
  timestamp_trunc(started, day) day,
  avg(elapsed)
FROM
  `k8s-gubernator.build.all`
WHERE
  job = "ci-kubernetes-e2e-gci-gce"
  AND started >= timestamp('2018-01-01')
GROUP BY
  day
ORDER BY
  day asc
So yeah, it’s been steadily going up, with some notable bumps here and there.
Opening this because I believe it’s more productive to hold the current threshold than it is to just raise the timeout. We should identify some of the top offenders in slowness and kick them out.
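One way to find the top offenders is to rank tests by their recorded duration in the JUnit XML reports that a run uploads as artifacts. The sketch below is only an illustration, not the tooling used for this job: it assumes standard JUnit `<testcase name="..." time="...">` elements, and the `slowest_tests` helper and the `SAMPLE` report (including its test names and timings) are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical report, made up for illustration; real reports come from job artifacts.
SAMPLE = """<testsuite>
  <testcase name="[sig-a] fast test" time="2.5"/>
  <testcase name="[sig-b] slow test" time="310.0"/>
  <testcase name="[sig-c] skipped test" time="0"><skipped/></testcase>
</testsuite>"""


def slowest_tests(junit_xml_docs, top_n=10):
    """Return the top_n (name, mean_seconds) pairs across several JUnit reports."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for doc in junit_xml_docs:
        root = ET.fromstring(doc)
        # iter() finds <testcase> at any depth, so <testsuites> wrappers also work.
        for case in root.iter("testcase"):
            if case.find("skipped") is not None:
                continue  # skipped tests report ~0s and would dilute the mean
            name = case.get("name", "<unnamed>")
            totals[name] += float(case.get("time") or 0.0)
            counts[name] += 1
    means = {name: totals[name] / counts[name] for name in totals}
    # Sort by mean duration, slowest first.
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    for name, seconds in slowest_tests([SAMPLE, SAMPLE], top_n=5):
        print(f"{seconds:8.1f}s  {name}")
```

Averaging over several runs smooths out one-off slow runs. If the job runs tests in parallel, a test's own duration does not translate one-to-one into wall-clock time, so the ranking is a starting point rather than a verdict.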
/kind cleanup
/sig release
as owners of this job
/priority important-soon
We may need to bump this up if we start hitting the timeout more.
About this issue
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 54 (36 by maintainers)
/remove-lifecycle rotten
I re-ran the query listed in the description, which shows the duration of the CI job.
It looks like we might need to reconsider this in a year, but it’s probably fine as closed for now.
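A rough way to put a number on "in a year" is to fit a straight line to the daily averages returned by the query and see when it reaches a limit such as the 65m value from the description. This is a minimal sketch under stated assumptions: the helpers `fit_line` and `days_until_limit` are hypothetical, the rows are assumed to be already converted to minutes, and a linear trend is itself an assumption given the bumps in the data.

```python
from datetime import date


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(rows, limit_minutes):
    """rows: (date, average elapsed minutes) pairs sorted by date.

    Returns the number of days after the last row at which the fitted line
    reaches limit_minutes, or None if the trend is flat or falling.
    """
    first_day = rows[0][0]
    xs = [(day - first_day).days for day, _ in rows]
    ys = [minutes for _, minutes in rows]
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    crossing = (limit_minutes - intercept) / slope
    return max(0.0, crossing - xs[-1])


if __name__ == "__main__":
    # Made-up numbers for illustration only, not real job data.
    rows = [
        (date(2020, 1, 1), 50.0),
        (date(2020, 1, 11), 51.0),
        (date(2020, 1, 21), 52.0),
    ]
    print(days_until_limit(rows, limit_minutes=65))
```

A result of None means the fitted trend never reaches the limit. Any finite result is only as good as the straight-line assumption behind it.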
Another problem is not that any particular test is slow, but that we only ever add more tests.
With lots of features trying to go GA, and every one of them adding a conformance test, we’ve added a bunch more tests and presubmit times have gone up. cc @aojea
/remove-lifecycle stale
/milestone clear
I don’t think this should be tracked against the release cycle.
Correct on the x/y. I can provide a link to something with more concrete numbers, but the main point was to illustrate that it’s been continually going up and to the right.