test-infra: release-blocking jobs must run in dedicated cluster: ci-kubernetes-build
What should be cleaned up or changed:
This is part of #18549
To properly monitor the outcome of this, you should be a member of k8s-infra-prow-viewers@kubernetes.io. PR yourself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 if you’re not a member.
NOTE: I am not tagging this as “help wanted” because it is blocked on https://github.com/kubernetes/k8s.io/issues/846. I would also recommend doing ci-kubernetes-build-fast first. Here is my guess at how we could do this:
- create a duplicate (canary) job that pushes to the new bucket writable by k8s-infra-prow-build (see the sketch after this list)
- ensure it’s building and pushing appropriately
- update a release-blocking job to pull from the new bucket
- if no problems, roll out changes progressively:
  - a few more jobs in release-blocking
  - all jobs in release-blocking that use this job’s results
  - a job that still runs in the “default” cluster
  - all jobs that use this job’s results
- rename jobs / get rid of the job that runs on the “default” cluster
- do the same for release-branch variants, can probably do a faster rollout
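For the first bullet, a canary variant of the job might look roughly like the sketch below. This is only an illustration: the job name, image, entrypoint, flags, dashboard, and resource numbers are placeholders rather than the real ci-kubernetes-build definition; the parts that matter for this issue are the cluster field and the push destination.

```yaml
# Hypothetical canary variant of ci-kubernetes-build (names/values are placeholders).
periodics:
- name: ci-kubernetes-build-canary        # hypothetical name for the duplicate job
  cluster: k8s-infra-prow-build           # schedule onto the dedicated community-owned cluster
  interval: 1h
  decorate: true
  extra_refs:
  - org: kubernetes
    repo: kubernetes
    base_ref: master
  annotations:
    testgrid-dashboards: sig-testing-canaries   # placeholder: keep it off release-blocking while canarying
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-master   # placeholder image/tag
      command:
      - runner.sh
      args:
      - ./hack/jenkins/build.sh             # placeholder for the real build-and-push entrypoint
      - --bucket=gs://k8s-release-dev       # the new bucket writable by k8s-infra-prow-build
      resources:                            # jobs in k8s-infra-prow-build must declare requests/limits
        requests:
          cpu: "4"
          memory: "8Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
```

Keeping this as a separately named job means the existing ci-kubernetes-build keeps feeding release-blocking jobs from the old bucket until the canary has a convincing run history.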
It will be helpful to note the date/time that PRs merge. This will allow you to compare before/after behavior.
Things to watch for the job
- https://prow.k8s.io/?job=ci-kubernetes-build
  - does the job start failing more often?
  - does the job start going into error state?
- https://testgrid.k8s.io/presubmits-kubernetes-blocking#ci-kubernetes-build&graph-metrics=test-duration-minutes
  - does the job duration look worse than before? spikier than before?
- https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=ci-kubernetes-build
  - do more failures show up than before?
- https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/ci-kubernetes-build
  - (can be used to answer some of the same questions as above)
- metrics explorer: CPU limit utilization for ci-kubernetes-build, for 6h
  - is the job wildly underutilizing its CPU limit? if so, perhaps tune it down (see the resources sketch after this list; if uncertain, post evidence in this issue and ask)
  - (it will probably be helpful to look at different time resolutions like 1h, 6h, 1d, 1w)
- metrics explorer: Memory limit utilization for ci-kubernetes-build, for 6h
  - is the job wildly underutilizing its memory limit? if so, perhaps tune it down (see the resources sketch after this list; if uncertain, post evidence in this issue and ask)
  - (it will probably be helpful to look at different time resolutions like 1h, 6h, 1d, 1w)
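If those graphs show the job sitting far below its requests/limits at every time resolution, tuning it down is just a change to the resources block of the job's container spec. A hypothetical example follows; the numbers are made up, not a recommendation:

```yaml
# Hypothetical resource tuning for the job's container spec: e.g. if CPU limit
# utilization stays well under 50% for a week, lower the requests/limits.
resources:
  requests:
    cpu: "2"        # down from the placeholder "4" in the canary sketch above
    memory: "6Gi"
  limits:
    cpu: "2"
    memory: "6Gi"
```

As noted above, if the data is not clear-cut, post the utilization evidence in this issue and ask before changing anything.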
Things to watch for the build cluster
- prow-build dashboard 1w
  - is the build cluster scaling as needed? (e.g. maybe it can’t scale because we’ve hit some kind of quota)
  - (it will probably be helpful to look at different time resolutions like 1h, 6h, 1d, 1w)
- prowjobs-experiment 1w
  - (shows resource consumption of all job runs, pretty noisy but putting this here for completeness)
- https://monitoring.prow.k8s.io/d/wSrfvNxWz/boskos-resource-usage?orgId=1
Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.
/wg k8s-infra
/sig testing
/area jobs
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 24 (24 by maintainers)
Commits related to this issue
- Add canary job for ci-kubernetes-build jobs variants. Ref : https://github.com/kubernetes/test-infra/issues/19483 Part of: https://github.com/kubernetes/test-infra/issues/18549 Signed-off-by: Arnaud... — committed to ameukam/test-infra by ameukam 4 years ago
- Merge pull request #19904 from ameukam/GH-19483 Add canary jobs for ci-kubernetes-build job variants. — committed to kubernetes/test-infra by k8s-ci-robot 4 years ago
- Add canary jobs for periodic-kubernetes-bazel-build variants Ensure periodic-kubernetes-bazel-build can run k8s-infra-prow-build cluster and push artifacts on k8s-release-dev bucket. Related to : ht... — committed to ameukam/test-infra by ameukam 3 years ago
kinder downloads image tarballs from e.g.:
and then mutates them to be k8s.gcr.io/..., so I don’t think kinder (or kubeadm CI) will be affected by the gcr.io/kubernetes-ci-images -> gcr.io/k8s-staging-ci-images change. However, reading https://github.com/kubernetes/k8s.io/issues/846, my understanding is that kinder needs to switch from downloading from gs://kubernetes-release-dev to gs://k8s-release-dev.
should we do this now - i.e. is gs://k8s-release-dev ready for usage?
As for the gcr.io/kubernetes-ci-images -> gcr.io/k8s-staging-ci-images change, it can be considered a breaking change for the kubeadm API:
EDIT, logged: https://github.com/kubernetes/kubeadm/issues/2355 https://github.com/kubernetes/kubeadm/issues/2356