kuberay: [Bug] --forced-cluster-upgrade Causes termination loop for ray head node

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

  1. When I enable --forced-cluster-upgrade on the KubeRay operator (v0.3), it continuously terminates the Ray head node (and then replaces it with an apparently identical pod).
  2. I expect this flag to restart my cluster at most once, when the flag is first enabled.

2022-09-13T13:39:37.524Z	INFO	controllers.RayCluster	reconcilePods 	{"head pod found": "deployment-prod-eks-kuberay-head-gwfrx"}
2022-09-13T13:39:37.524Z	INFO	controllers.RayCluster	reconcilePods	{"head pod is up and running... checking workers": "deployment-prod-eks-kuberay-head-gwfrx"}
2022-09-13T13:39:37.524Z	INFO	controllers.RayCluster	reconcilePods	{"removing the pods in the scaleStrategy of": "workergroup"}
2022-09-13T13:39:37.524Z	INFO	controllers.RayCluster	reconcilePods	{"all workers already exist for group": "workergroup"}
2022-09-13T13:39:37.525Z	INFO	controllers.RayCluster	updateStatus	{"service port's name is empty. Not adding it to RayCluster status.endpoints": {"protocol":"TCP","port":6379,"targetPort":6379}}
2022-09-13T13:39:40.744Z	INFO	controllers.RayCluster	reconciling RayCluster	{"cluster name": "deployment-dev-eks-kuberay"}
2022-09-13T13:39:40.744Z	INFO	controllers.RayCluster	reconcileServices 	{"headService service found": "deployment-dev-eks-kuberay-head-svc"}

For some reason my RayCluster has this list of workersToDelete, which feels relevant (a quick way to inspect it is shown right after the YAML below).

  workerGroupSpecs:
    - groupName: workergroup
      maxReplicas: 75
      minReplicas: 0
      rayStartParams:
        block: 'true'
        node-ip-address: $MY_POD_IP
        redis-password: LetMeInRay
      replicas: 0
      scaleStrategy:
        workersToDelete:
          - ty-deployment-prod-eks-kuberay-worker-workergroup-5qg4z
          - ty-deployment-prod-eks-kuberay-worker-workergroup-6bhv4
          - ty-deployment-prod-eks-kuberay-worker-workergroup-75tj6
          - ty-deployment-prod-eks-kuberay-worker-workergroup-bwb99
          - ty-deployment-prod-eks-kuberay-worker-workergroup-d6mmb
          - ty-deployment-prod-eks-kuberay-worker-workergroup-d8hjt
          - ty-deployment-prod-eks-kuberay-worker-workergroup-gmvn5
          - ty-deployment-prod-eks-kuberay-worker-workergroup-gv8z8
          - ty-deployment-prod-eks-kuberay-worker-workergroup-hx989
          - ty-deployment-prod-eks-kuberay-worker-workergroup-jh26v
          - ty-deployment-prod-eks-kuberay-worker-workergroup-kb9xv
          - ty-deployment-prod-eks-kuberay-worker-workergroup-lhgpf
          - ty-deployment-prod-eks-kuberay-worker-workergroup-mbb75
          - ty-deployment-prod-eks-kuberay-worker-workergroup-nlvtq
          - ty-deployment-prod-eks-kuberay-worker-workergroup-pp4lk
          - ty-deployment-prod-eks-kuberay-worker-workergroup-q6gkt
          - ty-deployment-prod-eks-kuberay-worker-workergroup-qcj9p
          - ty-deployment-prod-eks-kuberay-worker-workergroup-qmb54
          - ty-deployment-prod-eks-kuberay-worker-workergroup-vrf7c
          - ty-deployment-prod-eks-kuberay-worker-workergroup-xt5bb
          - ty-deployment-prod-eks-kuberay-worker-workergroup-xwl97
          - ty-deployment-prod-eks-kuberay-worker-workergroup-zxcb2
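
To confirm what the operator currently sees on the custom resource, the stale list can be read back with kubectl. This is a sketch based on my setup: the namespace is a placeholder and the [0] index assumes a single worker group.

  # Print workersToDelete for the first worker group on the RayCluster CR
  kubectl get raycluster deployment-prod-eks-kuberay -n <namespace> \
    -o jsonpath='{.spec.workerGroupSpecs[0].scaleStrategy.workersToDelete}'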

Reproduction script

  1. Deploy KubeRay v0.3
  2. Create a RayCluster
  3. Add the flag to the KubeRay operator Deployment args (a fuller sketch follows the list):
          args: [ '--forced-cluster-upgrade' ]
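
For completeness, this is roughly where the flag ends up in the operator Deployment; the container name and image tag below are from my install and may differ depending on how the operator was deployed.

  # Fragment of the kuberay-operator Deployment (names per my install)
  spec:
    template:
      spec:
        containers:
          - name: kuberay-operator
            image: kuberay/operator:v0.3.0
            args: [ '--forced-cluster-upgrade' ]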

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 17 (3 by maintainers)

Most upvoted comments

@alex-treebeard could you provide more detail about the situation? For example, is there any extra container injected, or is there any change made to the head pod template by a webhook when a pod is created?

With the current implementation, I believe either of these fairly common situations would lead to the behavior observed here.

I think this will finally be resolved by supporting rolling upgrades.

We will need to expedite the design and implementation of rolling upgrade support #527

Btw, Dima, can you give me more context on --forced-cluster-upgrade? @DmitriGekhtman

^ After taking a look at the code, I’m almost certain that this is what’s happening.

There are quite a few circumstances where the K8s environment may mutate a pod's resource requests or limits; for example, GKE Autopilot may do this to suit its internal bin-packing.

With the current implementation of --forced-cluster-upgrade, this would lead to the sort of churning behavior observed in this issue.
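
A quick way to check whether that is happening in a given cluster is to compare the head container resources declared on the RayCluster CR with those on the live head pod. The cluster and pod names below come from the logs above, and the jsonpath assumes a single head container:

  # Resources as declared in the RayCluster head group template
  kubectl get raycluster deployment-prod-eks-kuberay \
    -o jsonpath='{.spec.headGroupSpec.template.spec.containers[0].resources}'

  # Resources on the running head pod, after any webhook / Autopilot mutation
  kubectl get pod deployment-prod-eks-kuberay-head-gwfrx \
    -o jsonpath='{.spec.containers[0].resources}'

If those two outputs differ, a strict template-vs-pod comparison under --forced-cluster-upgrade would presumably flag the head pod as outdated on every reconcile, which matches the churn in the logs above.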