kuberay: [Bug] --forced-cluster-upgrade Causes termination loop for ray head node
Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
- When I enable --forced-cluster-upgrade on the ray operator (v0.3), it continuously terminates the ray head node (which it then replaces with an apparently identical pod). I expect this flag to terminate my cluster at most once, when it is first enabled.
2022-09-13T13:39:37.524Z INFO controllers.RayCluster reconcilePods {"head pod found": "deployment-prod-eks-kuberay-head-gwfrx"}
2022-09-13T13:39:37.524Z INFO controllers.RayCluster reconcilePods {"head pod is up and running... checking workers": "deployment-prod-eks-kuberay-head-gwfrx"}
2022-09-13T13:39:37.524Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "workergroup"}
2022-09-13T13:39:37.524Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "workergroup"}
2022-09-13T13:39:37.525Z INFO controllers.RayCluster updateStatus {"service port's name is empty. Not adding it to RayCluster status.endpoints": {"protocol":"TCP","port":6379,"targetPort":6379}}
2022-09-13T13:39:40.744Z INFO controllers.RayCluster reconciling RayCluster {"cluster name": "deployment-dev-eks-kuberay"}
2022-09-13T13:39:40.744Z INFO controllers.RayCluster reconcileServices {"headService service found": "deployment-dev-eks-kuberay-head-svc"}
For some reason my RayCluster has this list of workersToDelete, which feels relevant.
workerGroupSpecs:
- groupName: workergroup
  maxReplicas: 75
  minReplicas: 0
  rayStartParams:
    block: 'true'
    node-ip-address: $MY_POD_IP
    redis-password: LetMeInRay
  replicas: 0
  scaleStrategy:
    workersToDelete:
    - ty-deployment-prod-eks-kuberay-worker-workergroup-5qg4z
    - ty-deployment-prod-eks-kuberay-worker-workergroup-6bhv4
    - ty-deployment-prod-eks-kuberay-worker-workergroup-75tj6
    - ty-deployment-prod-eks-kuberay-worker-workergroup-bwb99
    - ty-deployment-prod-eks-kuberay-worker-workergroup-d6mmb
    - ty-deployment-prod-eks-kuberay-worker-workergroup-d8hjt
    - ty-deployment-prod-eks-kuberay-worker-workergroup-gmvn5
    - ty-deployment-prod-eks-kuberay-worker-workergroup-gv8z8
    - ty-deployment-prod-eks-kuberay-worker-workergroup-hx989
    - ty-deployment-prod-eks-kuberay-worker-workergroup-jh26v
    - ty-deployment-prod-eks-kuberay-worker-workergroup-kb9xv
    - ty-deployment-prod-eks-kuberay-worker-workergroup-lhgpf
    - ty-deployment-prod-eks-kuberay-worker-workergroup-mbb75
    - ty-deployment-prod-eks-kuberay-worker-workergroup-nlvtq
    - ty-deployment-prod-eks-kuberay-worker-workergroup-pp4lk
    - ty-deployment-prod-eks-kuberay-worker-workergroup-q6gkt
    - ty-deployment-prod-eks-kuberay-worker-workergroup-qcj9p
    - ty-deployment-prod-eks-kuberay-worker-workergroup-qmb54
    - ty-deployment-prod-eks-kuberay-worker-workergroup-vrf7c
    - ty-deployment-prod-eks-kuberay-worker-workergroup-xt5bb
    - ty-deployment-prod-eks-kuberay-worker-workergroup-xwl97
    - ty-deployment-prod-eks-kuberay-worker-workergroup-zxcb2
Reproduction script
- Deploy KubeRay 0.3
- Create a RayCluster
- Add args to the kuberay operator deployment (see the sketch below):
  args: [ '--forced-cluster-upgrade' ]
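For clarity, here is a minimal sketch of where that flag goes on the operator Deployment. The Deployment name, container name, and image tag below are assumptions based on a typical install and may differ in your environment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kuberay-operator              # assumed name; depends on how the operator was installed
spec:
  template:
    spec:
      containers:
        - name: kuberay-operator          # assumed container name
          image: kuberay/operator:v0.3.0  # assumed image tag
          args: [ '--forced-cluster-upgrade' ]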
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
About this issue
- State: open
- Created 2 years ago
- Comments: 17 (3 by maintainers)
With the current implementation, I believe either of these fairly common situations would lead to the behavior observed here.
I think this will finally be resolved by supporting rolling upgrades.
We will need to expedite the design and implementation of rolling upgrade support (#527).
Btw, Dima, can you give me more context on --forced-cluster-upgrade? @DmitriGekhtman
^ After taking a look at the code, I’m almost certain that this is what’s happening.
There are quite a few circumstances where the K8s environment may mutate a pod’s resource requests or limits; for example, GKE Autopilot may do this to suit its internal bin-packing.
With the current implementation of --forced-cluster-upgrade, this would lead to the sort of churning behavior observed in this issue.
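To make that churn mechanism concrete, here is a hypothetical before/after of the head pod’s resources; the specific values are invented for illustration. If the platform rewrites the live pod like this, a strict comparison between the RayCluster template and the running pod would never converge, so --forced-cluster-upgrade would keep deleting and recreating the head on every reconcile.
# Resources as declared in the RayCluster head group template (hypothetical values)
resources:
  requests:
    cpu: 900m
    memory: 3500Mi
# The same fields on the live pod after an environment-side mutation,
# e.g. an autopilot-style bin-packer rounding the request up (hypothetical values)
resources:
  requests:
    cpu: "1"
    memory: 4Gi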