kubernetes: NPD jobs are failing: "failed to push gcr.io/node-problem-detector-staging/ci/node-problem-detector [...] 403 Forbidden"

Which jobs are failing?

Job name | Config source | Testgrid (or job history) link
ci-npd-build | Source | https://testgrid.k8s.io/sig-node-node-problem-detector#ci-npd-build
pull-npd-e2e-test | Source | https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-npd-e2e-test
pull-npd-e2e-node | Source | https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-npd-e2e-node

Which tests are failing?

Jobs fail to start. The container image builds successfully, but the push fails with the following error:

#33 pushing layers
#33 ...
#34 [auth] node-problem-detector-staging/ci/node-problem-detector:pull,push token for gcr.io
#34 DONE 0.0s
#33 exporting to image
#33 pushing layers 1.4s done
#33 ERROR: failed to push gcr.io/node-problem-detector-staging/ci/node-problem-detector:v0.8.13-44-g5558643-20230710.1614: failed to authorize: failed to fetch oauth token: unexpected status: 403 Forbidden
------
 > exporting to image:
------
ERROR: failed to solve: failed to push gcr.io/node-problem-detector-staging/ci/node-problem-detector:v0.8.13-44-g5558643-20230710.1614: failed to authorize: failed to fetch oauth token: unexpected status: 403 Forbidden
make: *** [Makefile:270: push-container] Error 1
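
For triage, the failing push line can be split into its parts (registry, repository, tag, HTTP status). What follows is a minimal Python sketch, not part of the NPD or Prow tooling: the helper name `parse_push_error` and the regular expression are assumptions made for illustration, written only against the error text quoted above. It does not handle registries that include a port.

```python
import re

# Hypothetical helper (not part of NPD or Prow). The pattern is written
# against the BuildKit error text quoted in this issue and is an assumption.
PUSH_ERROR = re.compile(
    r"failed to push (?P<registry>[^/\s]+)/(?P<repository>[^:\s]+):(?P<tag>[^:\s]+):"
    r".*unexpected status: (?P<status>\d{3})"
)


def parse_push_error(line):
    """Return registry, repository, tag and HTTP status, or None if no match."""
    match = PUSH_ERROR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])  # e.g. "403" -> 403
    return fields


if __name__ == "__main__":
    sample = (
        "#33 ERROR: failed to push gcr.io/node-problem-detector-staging/ci/"
        "node-problem-detector:v0.8.13-44-g5558643-20230710.1614: failed to "
        "authorize: failed to fetch oauth token: unexpected status: 403 Forbidden"
    )
    print(parse_push_error(sample))
```

A 403 at the token-fetch step means the registry refused the request for the identity in use, which points at credentials or permissions rather than at the build itself.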

Since when has it been failing?

2023-07-04 ~12:40 PDT (the CI job first failed on 2023-07-01 ~06:30 PDT)

Testgrid link

No response

Reason for failure (if possible)

The jobs were migrated to EKS in https://github.com/kubernetes/test-infra/pull/29751, and that migration appears to be the culprit.

Anything else we need to know?

No response

Relevant SIG(s)

/sig node

About this issue

  • State: open
  • Created a year ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Job pull-npd-e2e-node is still failing. It seems that the service account in the k8s-infra-prow-build cluster doesn’t have permission to push to the bucket gs://node-problem-detector-staging. We probably want to grant it that permission, since other jobs (like ci-npd-build) depend on it.

@rjsadow We don’t want to migrate jobs that depend on any GCP resource. It doesn’t make much sense to, say, push an image from EKS to GKE. Running the job on EKS would mean higher bandwidth charges for transferring data (in this case, images) to GKE, and that is usually much more expensive than running the job on GCP and uploading from there. Given that, I believe those jobs should be moved back to the GKE cluster.