kuberay: [Bug][Operator][Leader election?] Operator failure and restart, logs attached
Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
I was running the nightly KubeRay operator for development purposes and observed a failure with some strange error messages logged.
The logs contained roughly 1000 lines of the "Read request instance not found error!" message, with names of pods unrelated to KubeRay, followed by some complaints about leases and leader election, and then a crash and restart.
Ideally, there shouldn’t be any issues related to leader election, since we don’t support leader election right now.
Only the last few "Read request instance not found error!" lines are pasted below.
2022-09-28T18:52:13.868Z INFO controllers.RayCluster Read request instance not found error! {"name": "kube-system/konnectivity-agent-795c8dddf.171917992063a820"}
2022-09-28T18:52:13.869Z INFO controllers.RayCluster Read request instance not found error! {"name": "kube-system/konnectivity-agent-795c8dddf.171917a01bf956aa"}
2022-09-28T18:52:13.869Z INFO controllers.RayCluster Read request instance not found error! {"name": "kube-system/konnectivity-agent.171913fcc9e51685"}
2022-09-28T18:52:13.869Z INFO controllers.RayCluster Read request instance not found error! {"name": "kube-system/konnectivity-agent.1719140d14d73e54"}
2022-09-28T18:52:23.872Z INFO controllers.RayCluster Read request instance not found error! {"name": "kube-system/konnectivity-agent-795c8dddf-cqw2g.1719179920d0d07b"}
2022-09-28T18:52:53.884Z INFO controllers.RayCluster Read request instance not found error! {"name": "kube-system/konnectivity-agent-795c8dddf-pp5jv.171917a01c50dde1"}
2022-09-28T19:59:07.446Z INFO controllers.RayCluster Read request instance not found error! {"name": "default/gk3-autopilot-cluster-1-nap-1l3c0g4h-554d0488-ltwl.1718f0a38651b857"}
2022-09-28T19:59:07.447Z INFO controllers.RayCluster Read request instance not found error! {"name": "default/gk3-autopilot-cluster-1-nap-1l3c0g4h-83672b81-t6rs.1718f0a3d8493393"}
E0928 21:27:20.873442 1 leaderelection.go:330] error retrieving resource lock ray-system/ray-operator-leader: Get "https://10.8.56.1:443/api/v1/namespaces/ray-system/configmaps/ray-operator-leader": context deadline exceeded
2022-09-28T21:27:20.874Z DEBUG events Normal {"object": {"kind":"ConfigMap","apiVersion":"v1"}, "reason": "LeaderElection", "message": "kuberay-operator-56749d657d-x5f65_085aa91b-20ef-4011-b7f4-4edff573f5c7 stopped leading"}
2022-09-28T21:27:20.874Z DEBUG events Normal {"object": {"kind":"Lease","namespace":"ray-system","name":"ray-operator-leader","uid":"e6a67c4c-c3a9-4b83-a0c6-7a9a3f02a1a2","apiVersion":"coordination.k8s.io/v1","resourceVersion":"31046663"}, "reason": "LeaderElection", "message": "kuberay-operator-56749d657d-x5f65_085aa91b-20ef-4011-b7f4-4edff573f5c7 stopped leading"}
I0928 21:27:20.874714 1 leaderelection.go:283] failed to renew lease ray-system/ray-operator-leader: timed out waiting for the condition
2022-09-28T21:27:20.903Z INFO Stopping and waiting for non leader election runnables
2022-09-28T21:27:20.903Z INFO Stopping and waiting for leader election runnables
2022-09-28T21:27:20.903Z INFO Stopping and waiting for caches
2022-09-28T21:27:20.903Z INFO Stopping and waiting for webhooks
2022-09-28T21:27:20.903Z INFO Wait completed, proceeding to shutdown the manager
2022-09-28T21:27:20.903Z ERROR setup problem running manager {"error": "leader election lost"}
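For context on the leader-election lines above, here is a minimal sketch (my own illustration, not KubeRay's actual main.go; the timeout values are assumptions in the style of controller-runtime's defaults) of how a controller-runtime manager holds the ray-system/ray-operator-leader lock. When the lock cannot be renewed before the renew deadline, the manager gives up leadership and exits, which matches the shutdown and restart seen in the logs.

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Illustrative values, not KubeRay's configuration.
	leaseDuration := 15 * time.Second
	renewDeadline := 10 * time.Second

	// The manager owns the ray-system/ray-operator-leader lock. If the API
	// server call to renew it times out (the "context deadline exceeded"
	// line above), leadership is lost, Start returns with an error, and the
	// pod exits and is restarted by its Deployment.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "ray-operator-leader",
		LeaderElectionNamespace: "ray-system",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```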
Reproduction script
I don’t know how to reproduce this yet.
Anything else
I don’t know yet.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
About this issue
- State: closed
- Created 2 years ago
- Comments: 17 (1 by maintainers)
We are filtering out the events that we do not care about here:
https://github.com/ray-project/kuberay/blob/af8fb0c833a850cd925bed8c59321d7b4bde6514/ray-operator/controllers/ray/raycluster_controller.go#L122
So I think we may need to change this part so that it only processes the pod events it cares about.
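To illustrate what that could look like (a minimal sketch under my own assumptions; the function name and the exact check are not the merged change), a controller-runtime predicate on the Event watch could drop anything that is not a Pod event before it ever reaches Reconcile:

```go
// Sketch only: names here are illustrative, not KubeRay's actual code.
package eventfilter

import (
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// podEventsOnly keeps only the watched corev1.Event objects whose involved
// object is a Pod, so events for unrelated resources never trigger a
// reconcile (and never produce the "Read request instance not found error!"
// log line).
func podEventsOnly() predicate.Predicate {
	keep := func(obj client.Object) bool {
		ev, ok := obj.(*corev1.Event)
		if !ok {
			return false
		}
		return ev.InvolvedObject.Kind == "Pod"
	}
	return predicate.Funcs{
		CreateFunc:  func(e event.CreateEvent) bool { return keep(e.Object) },
		UpdateFunc:  func(e event.UpdateEvent) bool { return keep(e.ObjectNew) },
		DeleteFunc:  func(e event.DeleteEvent) bool { return false },
		GenericFunc: func(e event.GenericEvent) bool { return keep(e.Object) },
	}
}
```

In the controller, a predicate like this could be attached to the Event watch via builder.WithPredicates, or the same check could be folded into the existing event-filter function linked above.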
This is generally OK, since it covers the case where we have an unhealthy pod but are not going to deal with it. If this is also not something we want, we can remove it later.
Not seeing the spam of messages anymore, but still seeing these ~10 log messages associated with unrelated pods at startup:
Yes, will test today and report back!
Yes, I’ll deploy from the PR branch tomorrow and let it run for a while to collect metrics. Thanks for working on this so promptly! 😃
Let’s try to filter the events by owner type = RayCluster. That should be enough to bring down the memory usage.
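A rough sketch of that owner check (the helper name and the ray.io API-group prefix check are my assumptions, not the merged code): given the owner references of the Pod an Event points at, keep the event only when one of them is a RayCluster.

```go
package eventfilter

import (
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ownedByRayCluster reports whether any of the given owner references is a
// RayCluster custom resource (KubeRay's CRDs live in the ray.io API group).
func ownedByRayCluster(owners []metav1.OwnerReference) bool {
	for _, ref := range owners {
		if ref.Kind == "RayCluster" && strings.HasPrefix(ref.APIVersion, "ray.io/") {
			return true
		}
	}
	return false
}
```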