test-infra: Test cluster `default` reports `unauthorized` error

What happened: We’re not able to run jobs across multiple projects; the failures always seem to involve the cluster `default`.

Example: https://github.com/kubernetes-sigs/release-sdk/pull/169#event-8728805504

```
Pod can not be created: create pod test-pod ... 5bca66f65b in cluster default: Unauthorized BaseSHA:8a85aa260e42313a68b0ad487b537b2b616641fc
```

What you expected to happen: Being able to run the jobs.

How to reproduce it (as minimally and precisely as possible): Right now it reproduces across multiple repositories, including k/k.

Please provide links to example occurrences, if any:

Anything else we need to know?: cc @kubernetes/sig-k8s-infra

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 6
  • Comments: 17 (17 by maintainers)

Most upvoted comments

Things should be fixed now. It seems the root cause of this outage was the KES (kubernetes-external-secrets) deployment getting stuck on an internal error that neither crashed the pod nor produced metrics indicating a failed secret sync (for which we already have an alert).
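For context, the existing alert is of the failed-sync kind, which only fires when a sync attempt is actually reported as failed. A minimal sketch of such a rule, assuming the sync-call counter that kubernetes-external-secrets exports (the exact metric name and thresholds used in prow-monitoring may differ):

```yaml
# Sketch of a failed-secret-sync alert; the metric name is assumed from the
# kubernetes-external-secrets docs, and the thresholds are illustrative.
groups:
  - name: external-secrets
    rules:
      - alert: ExternalSecretSyncFailing
        # Fires when any ExternalSecret reports non-success sync calls.
        expr: increase(kubernetes_external_secrets_sync_calls_count{status!="success"}[15m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "kubernetes-external-secrets failing to sync {{ $labels.namespace }}/{{ $labels.name }}"
```

A rule like this never fires in the failure mode described above: a poller that silently stops scheduling its next run emits no non-success samples at all, so the outage was invisible to both the alert and the pod’s own health checks.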

Please don’t switch everything to the community cluster: we’re still very tight on GCP budget this year, and that cluster has already had capacity issues of late. Given the budget, we don’t want to resolve them by increasing autoscaling capacity (we’re still on track for at least $3.4M of spend against $3M in credits this year and are actively working to cut costs).

There is an EKS cluster coming online that workloads could switch to in the near future, though hopefully we’ll have this resolved before then anyway.

Ah, I see a lot of errors like the following from the kubernetes-external-secrets deployment. I think that could explain the kubeconfig going stale:

{"level":30,"message_time":"2023-03-13T19:25:54.105Z","pid":18,"hostname":"kubernetes-external-secrets-5f98c9ff97-ngs9k","payload":{},"msg":"starting poller for prow-monitoring/prometheus-alert-slack-post-testing-ops-secret-url"}
{"level":50,"message_time":"2023-03-13T19:25:54.106Z","pid":18,"hostname":"kubernetes-external-secrets-5f98c9ff97-ngs9k","payload":{"err":{"type":"TypeError","message":"Cannot read property 'get' of undefined","stack":"TypeError: Cannot read property 'get' of undefined\n    at Poller._scheduleNextPoll (/app/lib/poller.js:361:30)\n    at Poller.start (/app/lib/poller.js:415:10)\n    at Daemon._addPoller (/app/lib/daemon.js:59:43)\n    at Daemon.start (/app/lib/daemon.js:89:16)\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (internal/process/task_queues.js:95:5)"}},"msg":"status check went boom for prow-monitoring/prometheus-alert-slack-post-testing-ops-secret-url"}

I’ve kicked over the pod and now it has synced the secret.
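For the record, “kicking over” the pod amounts to deleting it and letting the Deployment recreate it. A sketch, assuming the deployment lives in an `external-secrets` namespace on the service cluster (adjust the namespace to the actual setup):

```sh
# Delete the stuck pod; the Deployment controller recreates it, which restarts
# the pollers and re-syncs the secrets. The namespace here is an assumption.
kubectl -n external-secrets delete pod kubernetes-external-secrets-5f98c9ff97-ngs9k

# Or bounce the whole deployment without naming the pod:
kubectl -n external-secrets rollout restart deployment/kubernetes-external-secrets
```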

This cluster is part of the Google-owned infrastructure. If this is critical, I’d advise moving to the community-owned infrastructure by setting `cluster: k8s-infra-prow-build` on the affected jobs (see the sketch below). The on-call folks are in PST, so it will take some time before an intervention happens.
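For illustration, moving a job means setting the `cluster` field in its Prow job config. A minimal sketch with a hypothetical presubmit (the `cluster` field is real; the job name, image, and command are made up):

```yaml
presubmits:
  kubernetes-sigs/release-sdk:
    - name: pull-release-sdk-unit-test  # hypothetical job name
      # Schedule on the community-owned build cluster instead of `default`:
      cluster: k8s-infra-prow-build
      decorate: true
      spec:
        containers:
          - image: golang:1.20          # hypothetical image
            command: ["make", "test"]
```

Jobs that don’t set `cluster` run on `default`, which is why this outage showed up across so many repositories at once.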