test-infra: release-blocking jobs must run in dedicated cluster: ci-kubernetes-build
What should be cleaned up or changed:
This is part of #18549
To properly monitor the outcome of this, you should be a member of k8s-infra-prow-viewers@kubernetes.io. PR yourself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 if you’re not a member.
NOTE: I am not tagging this as “help wanted” because it is blocked on https://github.com/kubernetes/k8s.io/issues/846. I would also recommend doing ci-kubernetes-build-fast first. Here is my guess at how we could do this:
- create a duplicate (canary) job that pushes to the new bucket writable by k8s-infra-prow-build (see the sketch after this list)
- ensure it’s building and pushing appropriately
- update a release-blocking job to pull from the new bucket
- if no problems, roll out changes progressively:
  - a few more jobs in release-blocking
  - all jobs in release-blocking that use this job’s results
  - a job that still runs in the “default” cluster
  - all jobs that use this job’s results
- rename jobs / get rid of the job that runs on the “default” cluster
- do the same for release-branch variants, can probably do a faster rollout
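For the first bullet, a canary variant of the job might look roughly like the sketch below. This is only an illustration: the job name, image, entrypoint, flags, dashboard, and resource numbers are placeholders rather than the real ci-kubernetes-build definition; the parts that matter for this issue are the cluster field and the push destination.

```yaml
# Hypothetical canary variant of ci-kubernetes-build (names/values are placeholders).
periodics:
- name: ci-kubernetes-build-canary        # hypothetical name for the duplicate job
  cluster: k8s-infra-prow-build           # schedule onto the dedicated community-owned cluster
  interval: 1h
  decorate: true
  extra_refs:
  - org: kubernetes
    repo: kubernetes
    base_ref: master
  annotations:
    testgrid-dashboards: sig-testing-canaries   # placeholder: keep it off release-blocking while canarying
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-master   # placeholder image/tag
      command:
      - runner.sh
      args:
      - ./hack/jenkins/build.sh             # placeholder for the real build-and-push entrypoint
      - --bucket=gs://k8s-release-dev       # the new bucket writable by k8s-infra-prow-build
      resources:                            # jobs in k8s-infra-prow-build must declare requests/limits
        requests:
          cpu: "4"
          memory: "8Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
```

Keeping this as a separately named job means the existing ci-kubernetes-build keeps feeding release-blocking jobs from the old bucket until the canary has a convincing run history.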
It will be helpful to note the date/time that PRs merge. This will allow you to compare before/after behavior.
Things to watch for the job
- https://prow.k8s.io/?job=ci-kubernetes-build
  - does the job start failing more often?
  - does the job start going into error state?
- https://testgrid.k8s.io/presubmits-kubernetes-blocking#ci-kubernetes-build&graph-metrics=test-duration-minutes
  - does the job duration look worse than before? spikier than before?
- https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=ci-kubernetes-build
  - do more failures show up than before?
- https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/ci-kubernetes-build
  - (can be used to answer some of the same questions as above)
- metrics explorer: CPU limit utilization for ci-kubernetes-build, for 6h
  - is the job wildly underutilizing its CPU limit? if so, perhaps tune it down (see the resources sketch after this list; if uncertain, post evidence in this issue and ask)
  - (it will probably be helpful to look at different time resolutions like 1h, 6h, 1d, 1w)
- metrics explorer: Memory limit utilization for ci-kubernetes-build, for 6h
  - is the job wildly underutilizing its memory limit? if so, perhaps tune it down (see the resources sketch after this list; if uncertain, post evidence in this issue and ask)
  - (it will probably be helpful to look at different time resolutions like 1h, 6h, 1d, 1w)
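If those graphs show the job sitting far below its requests/limits at every time resolution, tuning it down is just a change to the resources block of the job's container spec. A hypothetical example follows; the numbers are made up, not a recommendation:

```yaml
# Hypothetical resource tuning for the job's container spec: e.g. if CPU limit
# utilization stays well under 50% for a week, lower the requests/limits.
resources:
  requests:
    cpu: "2"        # down from the placeholder "4" in the canary sketch above
    memory: "6Gi"
  limits:
    cpu: "2"
    memory: "6Gi"
```

As noted above, if the data is not clear-cut, post the utilization evidence in this issue and ask before changing anything.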
Things to watch for the build cluster
- prow-build dashboard 1w
  - is the build cluster scaling as needed? (e.g. maybe it can’t scale because we’ve hit some kind of quota)
  - (it will probably be helpful to look at different time resolutions like 1h, 6h, 1d, 1w)
- prowjobs-experiment 1w
  - (shows resource consumption of all job runs, pretty noisy but putting this here for completeness)
- https://monitoring.prow.k8s.io/d/wSrfvNxWz/boskos-resource-usage?orgId=1
Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.
/wg k8s-infra
/sig testing
/area jobs
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 24 (24 by maintainers)
Commits related to this issue
- Add canary job for ci-kubernetes-build jobs variants. Ref : https://github.com/kubernetes/test-infra/issues/19483 Part of: https://github.com/kubernetes/test-infra/issues/18549 Signed-off-by: Arnaud... — committed to ameukam/test-infra by ameukam 4 years ago
- Merge pull request #19904 from ameukam/GH-19483 Add canary jobs for ci-kubernetes-build job variants. — committed to kubernetes/test-infra by k8s-ci-robot 4 years ago
- Add canary jobs for periodic-kubernetes-bazel-build variants Ensure periodic-kubernetes-bazel-build can run k8s-infra-prow-build cluster and push artifacts on k8s-release-dev bucket. Related to : ht... — committed to ameukam/test-infra by ameukam 3 years ago
kinder downloads image tarballs from e.g.:
and then mutates them to be k8s.gcr.io/..., so I don’t think kinder (or kubeadm CI) will be affected by the gcr.io/kubernetes-ci-images -> gcr.io/k8s-staging-ci-images change. However, reading https://github.com/kubernetes/k8s.io/issues/846, my understanding is that kinder needs to switch from downloading from gs://kubernetes-release-dev to gs://k8s-release-dev.
should we do this now - i.e. is gs://k8s-release-dev ready for usage?
As for the gcr.io/kubernetes-ci-images -> gcr.io/k8s-staging-ci-images change, it can be considered a breaking change for the kubeadm API:
EDIT, logged: https://github.com/kubernetes/kubeadm/issues/2355 https://github.com/kubernetes/kubeadm/issues/2356