test-infra: Test cluster `default` reports `unauthorized` error
What happened:
We’re not able to run jobs across multiple projects; the failures always seem to affect the `default` cluster.
Example: https://github.com/kubernetes-sigs/release-sdk/pull/169#event-8728805504
Pod can not be created: create pod test-pod ... 5bca66f65b in cluster default: Unauthorized BaseSHA:8a85aa260e42313a68b0ad487b537b2b616641fc
What you expected to happen: Being able to run the jobs.
How to reproduce it (as minimally and precisely as possible): Right now it reproduces across multiple repositories, including k/k.
Please provide links to example occurrences, if any:
Anything else we need to know?: cc @kubernetes/sig-k8s-infra
About this issue
- State: closed
- Created a year ago
- Reactions: 6
- Comments: 17 (17 by maintainers)
Things should be fixed now. The root cause of this outage seems to be that the KES (kubernetes-external-secrets) deployment got stuck on an internal error. The pod did not crash, and the metrics did not indicate a failed secret sync (for which we already have an alert), so nothing flagged the problem.
Please don’t switch everything to the community cluster: we’re still very tight on GCP budget this year, and that cluster has already had capacity issues of late. We don’t want to resolve them by increasing autoscaling capacity because of the tight budget (we’re still on track to spend at least 3.4M against 3M in credits this year and are actively working to cut costs).
An EKS cluster is coming online that workloads could switch to in the near future, though hopefully we’ll have this resolved before then.
Ah I see a lot of errors from the kubernetes-external-secrets deployment like the following. I think that could be the explanation for the kubeconfig going stale:
I’ve kicked over the pod and now it has synced the secret.
This cluster is part of the Google infrastructure. If this is critical, I would advise moving the affected jobs to the community-owned infrastructure by adding `cluster: k8s-infra-prow-build` to the job definition. The on-call folks are in PST, so it will take some time before anyone can intervene.
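The suggestion above could look roughly like this in a Prow presubmit definition. This is a minimal sketch, not a tested job: the repository key, job name, image, and command are hypothetical placeholders, and only the `cluster: k8s-infra-prow-build` line comes from the comment itself.

```yaml
presubmits:
  example-org/example-repo:            # hypothetical repository key
    - name: pull-example-unit-test     # hypothetical job name
      # Schedule the job's pod on the community-owned build cluster
      # instead of the `default` cluster that returned `Unauthorized`.
      cluster: k8s-infra-prow-build
      always_run: true
      decorate: true
      spec:
        containers:
          - image: golang:1.21         # hypothetical image
            command:
              - make
            args:
              - test
```

Given the capacity and budget concerns raised elsewhere in this thread, a change like this is probably best limited to jobs that are genuinely critical and reverted once the `default` cluster is healthy again.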