test-infra: error during gcloud auth activate-service-account --key-file=/etc/service-account/service-account.json: exit status 1
SOLUTION: See https://github.com/kubernetes/test-infra/issues/27157#issuecomment-1318950082 - thanks @chaodaiG !
notes from @liggitt on Nov 17:
- This just started happening again on 2022-11-16 - https://storage.googleapis.com/k8s-triage/index.html?pr=1&text=error during gcloud auth activate-service-account
- Looks like it is failing ~20% of https://testgrid.k8s.io/google-gce#gce-containerd&width=20 runs
- Are particular nodes hitting the issue? looks like all the jobs in https://testgrid.k8s.io/google-gce#gce-containerd&width=20 are running on gke-prow-e2-default-pool-bdc23de7 nodepool … did that node pool change configuration / version / etc?
Previous Issue body: Example log: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/111859/pull-kubernetes-e2e-gce-storage-slow/1559899720485179392/build-log.txt
This is happening a lot across a variety of CI jobs. See chatter on #testing-ops as well ( https://kubernetes.slack.com/archives/C7J9RP96G/p1660676173294389 )
I0817 13:47:34.328] Call: gcloud auth activate-service-account --key-file=/etc/service-account/service-account.json
W0817 13:47:34.969] ERROR: (gcloud.auth.activate-service-account) There was a problem refreshing your current auth tokens: ('invalid_grant: Invalid JWT Signature.', {'error': 'invalid_grant', 'error_description': 'Invalid JWT Signature.'})
W0817 13:47:34.969] Please run:
W0817 13:47:34.969]
W0817 13:47:34.969] $ gcloud auth login
W0817 13:47:34.969]
W0817 13:47:34.970] to obtain new credentials.
W0817 13:47:34.970]
W0817 13:47:34.970] If you have already logged in with a different account:
W0817 13:47:34.970]
W0817 13:47:34.970] $ gcloud config set account ACCOUNT
W0817 13:47:34.970]
W0817 13:47:34.970] to select an already authenticated account to use.
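For triage, the failure signature in the log above can be matched mechanically. This is only a minimal sketch: the two marker strings are copied from the log excerpt, and the function name `has_invalid_jwt_failure` is hypothetical, not part of any test-infra tooling.

```python
# Marker strings copied from the log excerpt above.
COMMAND_MARKER = "gcloud.auth.activate-service-account"
ERROR_MARKER = "invalid_grant: Invalid JWT Signature."


def has_invalid_jwt_failure(log_text: str) -> bool:
    """Return True if a single log line carries both the command name and the error.

    Hypothetical helper: it only does substring matching on each line.
    """
    return any(
        COMMAND_MARKER in line and ERROR_MARKER in line
        for line in log_text.splitlines()
    )
```

Requiring both markers on the same line keeps the check from firing on the plain `Call: gcloud auth activate-service-account ...` line, which appears in every run.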
About this issue
- State: closed
- Created 2 years ago
- Comments: 40 (38 by maintainers)
Commits related to this issue
- post-test-infra-push-prow: explicitly disable GCS credentials secret Slack thread: https://kubernetes.slack.com/archives/C09QZ4DQB/p1660866377880809?thread_ts=1660813165.240589&cid=C09QZ4DQB Let's s... — committed to listx/test-infra by listx 2 years ago
- Service account key used for default build cluster stored in GCP secret manager The Prow team advocates for workload identity instead of relying on physical json keys for authenticating with GCS in P... — committed to chaodaiG/test-infra by chaodaiG 2 years ago
I’ve followed https://github.com/kubernetes/test-infra/issues/27157#issuecomment-1318950082 and rotated the keys again. For posterity the steps were:
1. In the kubernetes-jenkins-pull project’s GCP console, click on IAM & Admin and then Service Accounts. Find the pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com entry in the list and create a new JSON key (private key) for it. This will download the key to your local disk.
2. In the k8s-prow-builds project’s GCP console, go to Security -> Secret Manager. Then find default-k8s-build-cluster-service-account-key in the list. Now upload the JSON key from step 1 here as a new version for this secret. Behind the scenes, the kubernetes-external-secrets pod in the k8s-prow cluster will update this secret in O(seconds).
Note that these same steps were automated in https://github.com/kubernetes/test-infra/pull/28053 but the job has been unhealthy: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-test-infra-rotate-legacy-default-build-sa-json-key
I’ll see if I can at least get that job past the exec error.
UPDATE: See https://github.com/kubernetes/test-infra/pull/28786 for the exec error fix.
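For anyone who prefers the CLI to the console, the two manual steps could be approximated with gcloud. This is a sketch under assumptions, not what the automated job in the PR above does: the account, secret, and project names are taken from the steps, while the helper names and the `key.json` path are hypothetical. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

# Names taken from the manual steps in this thread.
SERVICE_ACCOUNT = "pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com"
KEY_PROJECT = "kubernetes-jenkins-pull"
SECRET_NAME = "default-k8s-build-cluster-service-account-key"
SECRET_PROJECT = "k8s-prow-builds"


def rotation_commands(key_path):
    """Build the gcloud invocations for step 1 (new key) and step 2 (new secret version)."""
    return [
        ["gcloud", "iam", "service-accounts", "keys", "create", key_path,
         f"--iam-account={SERVICE_ACCOUNT}", f"--project={KEY_PROJECT}"],
        ["gcloud", "secrets", "versions", "add", SECRET_NAME,
         f"--data-file={key_path}", f"--project={SECRET_PROJECT}"],
    ]


def rotate(key_path, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in rotation_commands(key_path):
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    rotate("key.json")  # hypothetical local path; dry run by default
```

Running it for real requires permission to create keys for the service account and to add secret versions in k8s-prow-builds, and the downloaded key file should be deleted from local disk afterwards.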
shadowing what you were doing was good experience @chaodaiG !! appreciate it.
cc @hakman @tobiasgiese @bobbypage @chaodaiG @BenTheElder
THANK YOU @listx 🙏
I’ve enabled the API in the k8s-prow project and after retrying the job, it succeeded: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-test-infra-rotate-legacy-default-build-sa-json-key/1630751309038620672
I can see in the Cloud Console that a new key has been created for pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com. I also see that this key has been loaded up in GCP Secret Manager (as expected).
So it appears that the job performed all of the manual steps I described in https://github.com/kubernetes/test-infra/issues/27157#issuecomment-1435088376.
/remove-lifecycle stale
We’re going to have this problem on a regular basis until we can migrate CI out of google.com, which won’t be anytime this year given the kubernetes.io budget issues.
This appears to be happening again.
See: https://github.com/kubernetes/test-infra/issues/27157#issuecomment-1220982143 for why moving to podutils / workload identity isn’t a workable answer.
Yes, that’s the driving reason. Creating a lot of keys was causing issues, e.g. the driver tests were attempting to clean up keys, and a bug caused the main CI key to be deleted, which was a fun day 🙃
https://github.com/kubernetes/test-infra/issues/27157#issuecomment-1318950082 has the hotfix approach, for someone with access.
This just started happening again on 2022-11-16 - https://storage.googleapis.com/k8s-triage/index.html?pr=1&text=error during gcloud auth activate-service-account
Looks like it is failing ~20% of https://testgrid.k8s.io/google-gce#gce-containerd&width=20 runs
@chaodaiG
https://cs.k8s.io/?q=E2E_GOOGLE_APPLICATION_CREDENTIALS&i=nope&files=&excludeFiles=&repos=
IIRC there are some number of e2e jobs that need to provide a service account key to a gce pd driver deployed to the cluster under test. The clusters these jobs stand up aren’t guaranteed to be GKE clusters, so I’m not sure changing the gce pd driver deployment to use workload identity is an option.
From https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/docs/kubernetes/user-guides/driver-install.md#install-driver:
Replacing use of a shared service account key would involve jobs having to run something like the driver’s setup-project.sh script prior to launching tests, which means permission to create a service account and service account keys in each project. I think it’s possible to provide jobs with this privilege via workload identity, but I forget if the churn/noise of key creation is the reason a shared account key was used in the first place.
cc @msau42 who I think is more familiar with this than I am
thanks @chaodaiG, please see https://github.com/kubernetes/test-infra/pull/27169 for the reverts
This is not surprising. Having someone remember to manually rotate this every 80 days doesn’t seem like a sustainable solution, so at this point I’m very curious to understand whether there is any job that has no choice but to use this physical service account key file.
The second goal is to figure out whether all of these jobs are still maintained.
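Until the key file goes away, the "remember every 80 days" part could at least be checked mechanically. A minimal sketch, assuming the input is the parsed output of `gcloud iam service-accounts keys list --iam-account=... --format=json`, where each entry has a `name` and an RFC 3339 `validAfterTime`. The function names and the sample data in the usage note are hypothetical; the 80-day threshold is the figure mentioned above.

```python
from datetime import datetime, timedelta, timezone


def parse_rfc3339(ts):
    """Parse a timestamp such as 2022-11-17T00:00:00Z into an aware datetime."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def keys_older_than(keys, now, max_age_days=80):
    """Return the names of keys whose validAfterTime is at least max_age_days before now."""
    cutoff = now - timedelta(days=max_age_days)
    return [k["name"] for k in keys if parse_rfc3339(k["validAfterTime"]) <= cutoff]
```

For example, with made-up data:

```python
sample = [
    {"name": "keys/old", "validAfterTime": "2022-08-17T00:00:00Z"},
    {"name": "keys/new", "validAfterTime": "2022-11-17T00:00:00Z"},
]
now = datetime(2022, 11, 20, tzinfo=timezone.utc)
print(keys_older_than(sample, now))  # ['keys/old']
```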
@chaodaiG looks like there are tons of these jobs with that preset - https://cs.k8s.io/?q=preset-service-account&i=nope&files=&excludeFiles=&repos=kubernetes/test-infra
Let me start with just the ones in the k8s-cri-containerd project used by containerd.
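To scope that audit offline, the same search as the cs.k8s.io query can be approximated against a local checkout. This is a rough sketch: it assumes job configs are `.yaml` files under some root directory and does plain substring matching rather than YAML parsing. The function name and the example path are hypothetical.

```python
import os


def files_mentioning(root, needle="preset-service-account", suffix=".yaml"):
    """Return sorted paths of files under root whose text contains needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on oddly encoded files.
            with open(path, encoding="utf-8", errors="replace") as fh:
                if needle in fh.read():
                    hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical path to a local clone of kubernetes/test-infra.
    for path in files_mentioning("test-infra/config/jobs"):
        print(path)
```

This only says which files mention the preset, not which individual jobs use it, so it is a starting point for narrowing down the list rather than a replacement for reading the configs.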