kuberay: [Bug][Operator][Leader election?] Operator failure and restart, logs attached

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I was running the nightly KubeRay operator for development purposes and observed a failure with some strange error messages logged.

The logs contained roughly 1000 lines of "Read request instance not found error!" referencing pods unrelated to KubeRay, followed by complaints about leases and leader election, and then a crash and restart.

Ideally, there shouldn’t be any issues related to leader election, since we don’t support leader election right now.
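For context, in a controller-runtime based operator, leader election is controlled by the manager options, and the lock name matches the "ray-operator-leader" resource that shows up in the logs below. A minimal sketch of where that toggle lives (assuming the standard controller-runtime ctrl.Options API, not KubeRay's actual main.go):

```go
// Minimal sketch (not KubeRay's actual main.go): leader election in a
// controller-runtime based operator is just a manager option; the lock name
// here mirrors the "ray-operator-leader" resource seen in the logs below.
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true, // false disables lease handling entirely
		LeaderElectionID:        "ray-operator-leader",
		LeaderElectionNamespace: "ray-system",
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // mgr.Start(ctrl.SetupSignalHandler()) would run the operator
}
```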

Only the last few lines of the "Read request instance not found error!" spam are pasted below.

2022-09-28T18:52:13.868Z	INFO	controllers.RayCluster	Read request instance not found error!	{"name": "kube-system/konnectivity-agent-795c8dddf.171917992063a820"}
2022-09-28T18:52:13.869Z	INFO	controllers.RayCluster	Read request instance not found error!	{"name": "kube-system/konnectivity-agent-795c8dddf.171917a01bf956aa"}
2022-09-28T18:52:13.869Z	INFO	controllers.RayCluster	Read request instance not found error!	{"name": "kube-system/konnectivity-agent.171913fcc9e51685"}
2022-09-28T18:52:13.869Z	INFO	controllers.RayCluster	Read request instance not found error!	{"name": "kube-system/konnectivity-agent.1719140d14d73e54"}
2022-09-28T18:52:23.872Z	INFO	controllers.RayCluster	Read request instance not found error!	{"name": "kube-system/konnectivity-agent-795c8dddf-cqw2g.1719179920d0d07b"}
2022-09-28T18:52:53.884Z	INFO	controllers.RayCluster	Read request instance not found error!	{"name": "kube-system/konnectivity-agent-795c8dddf-pp5jv.171917a01c50dde1"}
2022-09-28T19:59:07.446Z	INFO	controllers.RayCluster	Read request instance not found error!	{"name": "default/gk3-autopilot-cluster-1-nap-1l3c0g4h-554d0488-ltwl.1718f0a38651b857"}
2022-09-28T19:59:07.447Z	INFO	controllers.RayCluster	Read request instance not found error!	{"name": "default/gk3-autopilot-cluster-1-nap-1l3c0g4h-83672b81-t6rs.1718f0a3d8493393"}
E0928 21:27:20.873442       1 leaderelection.go:330] error retrieving resource lock ray-system/ray-operator-leader: Get "https://10.8.56.1:443/api/v1/namespaces/ray-system/configmaps/ray-operator-leader": context deadline exceeded
2022-09-28T21:27:20.874Z	DEBUG	events	Normal	{"object": {"kind":"ConfigMap","apiVersion":"v1"}, "reason": "LeaderElection", "message": "kuberay-operator-56749d657d-x5f65_085aa91b-20ef-4011-b7f4-4edff573f5c7 stopped leading"}
2022-09-28T21:27:20.874Z	DEBUG	events	Normal	{"object": {"kind":"Lease","namespace":"ray-system","name":"ray-operator-leader","uid":"e6a67c4c-c3a9-4b83-a0c6-7a9a3f02a1a2","apiVersion":"coordination.k8s.io/v1","resourceVersion":"31046663"}, "reason": "LeaderElection", "message": "kuberay-operator-56749d657d-x5f65_085aa91b-20ef-4011-b7f4-4edff573f5c7 stopped leading"}
I0928 21:27:20.874714       1 leaderelection.go:283] failed to renew lease ray-system/ray-operator-leader: timed out waiting for the condition
2022-09-28T21:27:20.903Z	INFO	Stopping and waiting for non leader election runnables
2022-09-28T21:27:20.903Z	INFO	Stopping and waiting for leader election runnables
2022-09-28T21:27:20.903Z	INFO	Stopping and waiting for caches
2022-09-28T21:27:20.903Z	INFO	Stopping and waiting for webhooks
2022-09-28T21:27:20.903Z	INFO	Wait completed, proceeding to shutdown the manager
2022-09-28T21:27:20.903Z	ERROR	setup	problem running manager	{"error": "leader election lost"}

Reproduction script

I don’t know how to reproduce this yet.

Anything else

I don’t know yet.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 17 (1 by maintainers)

Most upvoted comments

We are filtering out the events that we do not care about here:

https://github.com/ray-project/kuberay/blob/af8fb0c833a850cd925bed8c59321d7b4bde6514/ray-operator/controllers/ray/raycluster_controller.go#L122

So I think we may need to change this part so that it only processes the Pod events it cares about.
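
Something like the following predicate could work. This is only a rough sketch, assuming controller-runtime's predicate.Funcs API; the "Unhealthy" reason check is based on the probe-failure events shown in this thread and is not the current controller code.

```go
package ray

import (
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// eventFilter drops watched corev1.Event objects up front unless they refer
// to a Pod and report a probe failure ("Unhealthy"), so node events,
// konnectivity-agent events, etc. never reach the reconciler. Whether the
// Pod really is a Ray node pod still has to be checked in the reconciler.
func eventFilter() predicate.Funcs {
	keep := func(obj client.Object) bool {
		ev, ok := obj.(*corev1.Event)
		if !ok {
			return false
		}
		return ev.InvolvedObject.Kind == "Pod" && ev.Reason == "Unhealthy"
	}
	return predicate.Funcs{
		CreateFunc:  func(e event.CreateEvent) bool { return keep(e.Object) },
		UpdateFunc:  func(e event.UpdateEvent) bool { return keep(e.ObjectNew) },
		DeleteFunc:  func(e event.DeleteEvent) bool { return false },
		GenericFunc: func(e event.GenericEvent) bool { return keep(e.Object) },
	}
}
```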

This PR seems to suggest that the event watching was meant to support recovering from readiness/liveness probe failures. My understanding is that this is already covered by watching the child pods. I’m curious whether I’m missing something, and whether there is anything in particular I should look out for while we run a test with event watching disabled.
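
By "watching the child pods" I mean the builder-level watch, roughly like the sketch below (assuming controller-runtime's builder API and an assumed import path for the RayCluster types; this is illustrative, not KubeRay's actual SetupWithManager):

```go
package ray

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"

	// Assumed import path/alias for the RayCluster API types; adjust to the
	// KubeRay version in use.
	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// toyReconciler stands in for the real RayClusterReconciler; only the watch
// wiring matters for this sketch.
type toyReconciler struct{}

func (r *toyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func setup(mgr ctrl.Manager, r *toyReconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rayv1alpha1.RayCluster{}).
		// Owns() enqueues the owning RayCluster whenever one of its Pods is
		// created, updated (e.g. a readiness probe flips the Ready condition),
		// or deleted, without a separate watch on corev1.Event objects.
		Owns(&corev1.Pod{}).
		Complete(r)
}
```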

Not seeing the spam of messages anymore, but I am seeing these ~10 log messages associated with unrelated pods at startup:

2023-03-28T03:24:55.557Z	INFO	controllers.RayCluster	no ray node pod found for event	{"event": "&Event{ObjectMeta:{gke-metadata-server-kf45m.175075bb8d2a4d81  kube-system  c3768cad-377d-43c6-9aef-c648170a4dd7 500053417 0 2023-03-28 02:55:09 +0000 UTC <nil> <nil> map[] map[] [] []  [{kubelet Update v1 2023-03-28 02:55:09 +0000 UTC FieldsV1 {\"f:count\":{},\"f:firstTimestamp\":{},\"f:involvedObject\":{},\"f:lastTimestamp\":{},\"f:message\":{},\"f:reason\":{},\"f:source\":{\"f:component\":{},\"f:host\":{}},\"f:type\":{}} }]},InvolvedObject:ObjectReference{Kind:Pod,Namespace:kube-system,Name:gke-metadata-server-kf45m,UID:4c109758-c5db-42a4-afaa-f759dd984ce3,APIVersion:v1,ResourceVersion:4099672145,FieldPath:spec.containers{gke-metadata-server},},Reason:Unhealthy,Message:Readiness probe failed: Get \"http://127.0.0.1:989/healthz\": dial tcp 127.0.0.1:989: connect: connection refused,Source:EventSource{Component:kubelet,Host:gke-cr-west1-n1-32-208-preempt-202303-c73cb1c6-rw5j,},FirstTimestamp:2023-03-28 02:55:09 +0000 UTC,LastTimestamp:2023-03-28 02:55:09 +0000 UTC,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}"}
2023-03-28T03:24:55.557Z	INFO	controllers.RayCluster	no ray node pod found for event	{"event": "&Event{ObjectMeta:{gke-metadata-server-qcd6c.175075bb7adbf927  kube-system  a705c3d8-2121-43f5-b8ce-49492d8f3f87 500053397 0 2023-03-28 02:55:09 +0000 UTC <nil> <nil> map[] map[] [] []  [{kubelet Update v1 2023-03-28 02:55:09 +0000 UTC FieldsV1 {\"f:count\":{},\"f:firstTimestamp\":{},\"f:involvedObject\":{},\"f:lastTimestamp\":{},\"f:message\":{},\"f:reason\":{},\"f:source\":{\"f:component\":{},\"f:host\":{}},\"f:type\":{}} }]},InvolvedObject:ObjectReference{Kind:Pod,Namespace:kube-system,Name:gke-metadata-server-qcd6c,UID:c607a02f-f2ae-4b41-a6af-c7f4f26421be,APIVersion:v1,ResourceVersion:4099672288,FieldPath:spec.containers{gke-metadata-server},},Reason:Unhealthy,Message:Readiness probe failed: Get \"http://127.0.0.1:989/healthz\": dial tcp 127.0.0.1:989: connect: connection refused,Source:EventSource{Component:kubelet,Host:gke-cr-west1-n1-32-208-preempt-202303-4964322f-69mr,},FirstTimestamp:2023-03-28 02:55:09 +0000 UTC,LastTimestamp:2023-03-28 02:55:09 +0000 UTC,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}"}
2023-03-28T03:24:55.557Z	INFO	controllers.RayCluster	no ray node pod found for event	{"event": "&Event{ObjectMeta:{gke-metadata-server-cmx5x.175075c006fe45fd  kube-system  630d5f76-84e2-4453-a86e-de454edb9fce 500053961 0 2023-03-28 02:55:28 +0000 UTC <nil> <nil> map[] map[] [] []  [{kubelet Update v1 2023-03-28 02:55:28 +0000 UTC FieldsV1 {\"f:count\":{},\"f:firstTimestamp\":{},\"f:involvedObject\":{},\"f:lastTimestamp\":{},\"f:message\":{},\"f:reason\":{},\"f:source\":{\"f:component\":{},\"f:host\":{}},\"f:type\":{}} }]},InvolvedObject:ObjectReference{Kind:Pod,Namespace:kube-system,Name:gke-metadata-server-cmx5x,UID:ebd44c55-0353-4673-8fd5-74c9ffef055c,APIVersion:v1,ResourceVersion:4099672960,FieldPath:spec.containers{gke-metadata-server},},Reason:Unhealthy,Message:Readiness probe failed: Get \"http://127.0.0.1:989/healthz\": dial tcp 127.0.0.1:989: connect: connection refused,Source:EventSource{Component:kubelet,Host:gke-cr-west1-n1-32-208-preempt-202303-4964322f-7l6p,},FirstTimestamp:2023-03-28 02:55:28 +0000 UTC,LastTimestamp:2023-03-28 02:55:29 +0000 UTC,Count:2,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}"}

This is generally OK, since this is the case where we have an unhealthy pod that we are not going to deal with. If these messages are also not something we want, we can remove them later.

Yes, will test today and report back!

Are you able to try our patched version later and help us to verify this?

Yes, I’ll deploy from the PR branch tomorrow and let it run for a while to collect metrics. Thanks for working on this so promptly! 😃

Let’s try to filter the events by owner type = RayCluster. That should be enough to bring down the memory usage.
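
Roughly this kind of check (an illustrative helper, not the actual patch; it assumes the Pod referenced by the event's involvedObject has already been fetched):

```go
package ray

import (
	corev1 "k8s.io/api/core/v1"
)

// podOwnedByRayCluster reports whether the Pod named by an event's
// involvedObject is owned by a RayCluster. Events about any other pods can
// then be dropped instead of being cached and reconciled.
func podOwnedByRayCluster(pod *corev1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "RayCluster" {
			return true
		}
	}
	return false
}
```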