kueue: Non-leading replica fails due to cert-controller not starting
What happened:
If I run two replicas, the manager crashes after a while; it looks like the health probe fails and the pod gets restarted.
What you expected to happen:
Both pods keep running fine.
How to reproduce it (as minimally and precisely as possible):
Run the Kueue controller manager with two replicas; after a while, the non-leading replica crashes.
Anything else we need to know?:
I1212 18:02:44.725166 1 leaderelection.go:250] attempting to acquire leader lease kueue-system/c1f6bfd2.kueue.x-k8s.io...
{"level":"info","ts":"2023-12-12T18:02:44.725142061Z","caller":"controller/controller.go:178","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *v1.Secret"}
{"level":"info","ts":"2023-12-12T18:02:44.726132561Z","caller":"controller/controller.go:178","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":"2023-12-12T18:02:44.726421287Z","caller":"controller/controller.go:178","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":"2023-12-12T18:02:44.726453798Z","caller":"controller/controller.go:186","msg":"Starting Controller","controller":"cert-rotator"}
{"level":"error","ts":"2023-12-12T18:04:44.726776367Z","caller":"controller/controller.go:203","msg":"Could not wait for Cache to sync","controller":"cert-rotator","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced for Kind *v1.Secret","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:203\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/manager/runnable_group.go:223"}
{"level":"info","ts":"2023-12-12T18:04:44.72690754Z","caller":"manager/internal.go:516","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2023-12-12T18:04:44.72693391Z","caller":"manager/internal.go:520","msg":"Stopping and waiting for leader election runnables"}
{"level":"error","ts":"2023-12-12T18:04:44.726934751Z","caller":"manager/internal.go:490","msg":"error received after stop sequence was engaged","error":"failed waiting for reader to sync","errorVerbose":"failed waiting for reader to sync\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:258\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/manager/runnable_group.go:223\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/manager/internal.go:490"}
{"level":"info","ts":"2023-12-12T18:04:44.72773714Z","caller":"manager/internal.go:526","msg":"Stopping and waiting for caches"}
{"level":"info","ts":"2023-12-12T18:04:44.727923434Z","caller":"manager/internal.go:530","msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":"2023-12-12T18:04:44.727976656Z","caller":"manager/internal.go:533","msg":"Stopping and waiting for HTTP servers"}
{"level":"info","ts":"2023-12-12T18:04:44.727990216Z","logger":"controller-runtime.metrics","caller":"server/server.go:231","msg":"Shutting down metrics server with timeout of 1 minute"}
{"level":"info","ts":"2023-12-12T18:04:44.727998496Z","caller":"manager/server.go:43","msg":"shutting down server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":"2023-12-12T18:04:44.728063958Z","caller":"manager/internal.go:537","msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":"2023-12-12T18:04:44.728084368Z","logger":"setup","caller":"kueue/main.go:182","msg":"Could not run manager","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced for Kind *v1.Secret","stacktrace":"main.main\n\t/workspace/cmd/kueue/main.go:182\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"}
Environment:
- Kubernetes version (use kubectl version): Server Version: v1.28.3-eks-4f4795d
- Kueue version (use git describe --tags --dirty --always): v0.5.1
- Cloud provider or hardware configuration: EKS
- OS (e.g. cat /etc/os-release): Bottlerocket OS
- Kernel (e.g. uname -a):
- Install tools: helm
- Others:
A new version of cert-controller has been released with the fix, and I’ve opened #1509, which upgrades our dependency. That fixes the issue of non-leading replicas not starting.
I’ve also opened #1510 to track the work of making the visibility extension API server highly available.
@alculquicondor right, I don’t think it has anything to do with the probes.
So what happens in the non-leader-elected mode is the following: the cert-rotator’s caches never sync on the non-leading replica, which ends with the Could not run manager message.
controller-runtime exposes the LeaderElectionRunnable interface, so controllers can implement its NeedLeaderElection method to control whether the manager should start them in non-leader instances. Also, managed webhooks are always started, irrespective of leader election.
In the case of the OPA cert-controller, there is a RequireLeaderElection option that is correctly set by Kueue, but I suspect there is an issue in cert-controller that prevents it from being taken into account, which is the root cause of this issue. I’ll fix it upstream.
For the visibility extension API server, we would need to make sure it is safe to run multiple instances of ClusterQueueReconciler concurrently, or find a way to only run the read-only part?
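To make that concrete, here is a minimal sketch of the controller-runtime side, assuming only the manager.Runnable and manager.LeaderElectionRunnable interfaces; the certRotatorLike type and the leader-election ID are made-up placeholders, not Kueue’s or cert-controller’s actual code:

```go
package main

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// certRotatorLike is a hypothetical runnable playing the role of the
// cert-rotator. Whether it runs on non-leading replicas is decided by
// NeedLeaderElection below; cert-controller's RequireLeaderElection option
// is presumably meant to feed into the same decision.
type certRotatorLike struct {
	requireLeaderElection bool
}

// Start implements manager.Runnable; the real cert rotation work would go here.
func (r *certRotatorLike) Start(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

// NeedLeaderElection implements manager.LeaderElectionRunnable. Returning
// true means the manager only starts this runnable after it has won the
// leader election; returning false means it is started on every replica.
func (r *certRotatorLike) NeedLeaderElection() bool {
	return r.requireLeaderElection
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection:   true,
		LeaderElectionID: "example.kueue.x-k8s.io", // placeholder ID
	})
	if err != nil {
		panic(err)
	}
	// Webhook servers registered with the manager do not go through this
	// interface and are started on every replica regardless.
	if err := mgr.Add(&certRotatorLike{requireLeaderElection: true}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

With wiring like this, a runnable whose NeedLeaderElection returns true is held back until the replica wins the election, while everything else (including webhook servers) comes up on every replica.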
Yeah, ideally all replicas should reply to webhooks.
However, we recently introduced this feature: https://github.com/kubernetes-sigs/kueue/tree/main/keps/168-2-pending-workloads-visibility. In this case, only the leader can respond. Another alternative would be for non-leaders to also maintain the queues (we do this in kube-scheduler), so that they can also respond to the API extension requests (sketched below).
I’m not actually sure what behavior controller-runtime applies here.
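For what it’s worth, here is a hedged sketch of one possible shape for the read path, relying only on the manager’s Elected() channel from controller-runtime; the visibilityServer type, its endpoint, and the port are invented for illustration and are not Kueue’s actual implementation:

```go
package main

import (
	"context"
	"errors"
	"net/http"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// visibilityServer is a hypothetical stand-in for the visibility extension
// API server: it is started on every replica, but only the replica that has
// become the leader answers, mirroring the "only the leader can respond"
// behavior described above.
type visibilityServer struct {
	mgr manager.Manager
}

// NeedLeaderElection returns false so the manager also starts this runnable
// on non-leading replicas (like webhooks).
func (s *visibilityServer) NeedLeaderElection() bool { return false }

// Start implements manager.Runnable and serves a toy HTTP endpoint.
func (s *visibilityServer) Start(ctx context.Context) error {
	mux := http.NewServeMux()
	mux.HandleFunc("/pending-workloads", func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-s.mgr.Elected():
			// Leader: the queues are maintained here, so answer directly.
			w.Write([]byte("pending workloads ...\n"))
		default:
			// Non-leader: without locally maintained queues, the request
			// cannot be answered by this replica.
			http.Error(w, "not the leader", http.StatusServiceUnavailable)
		}
	})
	srv := &http.Server{Addr: ":8082", Handler: mux} // placeholder port
	go func() {
		<-ctx.Done()
		srv.Close()
	}()
	if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection:   true,
		LeaderElectionID: "example.kueue.x-k8s.io", // placeholder ID
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Add(&visibilityServer{mgr: mgr}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

The default branch is where the alternative above would plug in: if non-leaders also maintained the queues, they could serve read-only answers there instead of refusing.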