kuberay: [Bug][RayJobs] Ray spawns more pods than the threshold (Max Workers for a specific Group).
Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
Others
What happened + What you expected to happen
While worker pods are terminating after a Ray job, the cluster tries to respawn new workers and starts scaling above the threshold. I’ve seen Ray clusters scale to over 100 worker pods due to this issue when the threshold is 25.
The workersToDelete array continues to grow and more pods continue to spawn. Note that the threshold is only 5 for this type of worker.
I think it’s because the Ray workers get stuck in a terminating state, so it tries to respawn them so they can terminate gracefully. This behavior happens randomly and not after every job. Is there a way to tell the autoscaler not to spawn additional pods once the replicas are equal to maxWorkers?
Reproduction script
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-complete
  namespace: ray
  labels:
    controller-tools.k8s.io: '1.0'
spec:
  enableInTreeAutoscaling: true
  headGroupSpec:
    rayStartParams:
      port: '6379'
      dashboard-host: 0.0.0.0
      node-manager-port: '9998'
      block: 'true'
      no-monitor: 'true'
      num-gpus: '0'
      object-manager-port: '9999'
      node-ip-address: $MY_POD_IP
      object-store-memory: '100000000'
    replicas: 1
    serviceType: ClusterIP
    template:
      metadata:
        labels:
          groupName: headgroup
          rayCluster: raycluster
          rayNodeType: head
      spec:
        containers:
          - env:
              - name: CPU_REQUEST
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: requests.cpu
              - name: CPU_LIMITS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: limits.cpu
              - name: MEMORY_LIMITS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: limits.memory
              - name: MEMORY_REQUESTS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: requests.memory
              - name: MY_POD_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
            image: 'rayproject/ray:2.0.0-py39'
            imagePullPolicy: Always
            lifecycle:
              preStop:
                exec:
                  command:
                    - /bin/sh
                    - '-c'
                    - ray stop
            name: ray-head
            ports:
              - containerPort: 6379
                name: gcs
                protocol: TCP
              - containerPort: 8265
                name: dashboard
                protocol: TCP
              - containerPort: 10001
                name: client
                protocol: TCP
            resources:
              limits:
                cpu: '20'
                memory: 20Gi
              requests:
                cpu: '10'
                memory: 20Gi
  rayVersion: 2.0.0
  workerGroupSpecs:
    - groupName: cpu-group
      maxReplicas: 15
      minReplicas: 0
      rayStartParams:
        block: 'true'
        node-ip-address: $MY_POD_IP
        num-gpus: '0'
      replicas: 0
      template:
        spec:
          containers:
            - env:
                - name: RAY_DISABLE_DOCKER_CPU_WARNING
                  value: '1'
                - name: TYPE
                  value: worker
                - name: CPU_REQUEST
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: requests.cpu
                - name: CPU_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: limits.cpu
                - name: MEMORY_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: limits.memory
                - name: MEMORY_REQUESTS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: requests.memory
                - name: MY_POD_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              image: 'rayproject/ray:2.0.0-py39-9'
              lifecycle:
                preStop:
                  exec:
                    command:
                      - /bin/sh
                      - '-c'
                      - ray stop
              name: ray-worker
              ports:
                - containerPort: 80
                  protocol: TCP
              resources:
                limits:
                  cpu: '18'
                  memory: 28Gi
                requests:
                  cpu: '18'
                  memory: 28Gi
              volumeMounts:
                - mountPath: /var/log
                  name: log-volume
          initContainers:
            - command:
                - sh
                - '-c'
                - >-
                  until nslookup $RAY_IP.$(cat
                  /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local;
                  do echo waiting for myservice; sleep 2; done
              image: 'busybox:1.28'
              name: init-myservice
          volumes:
            - emptyDir: {}
              name: log-volume
    - groupName: gpu-group
      maxReplicas: 5
      minReplicas: 0
      rayStartParams:
        block: 'true'
        node-ip-address: $MY_POD_IP
        num-gpus: '1'
      replicas: 9
      scaleStrategy:
        workersToDelete:
          - raycluster-complete-worker-gpu-group-tm72g
          - raycluster-complete-worker-gpu-group-tm71g
          - raycluster-complete-worker-gpu-group-tm42g
          - raycluster-complete-worker-gpu-group-tm32g
          - raycluster-complete-worker-gpu-group-tm62g
          - raycluster-complete-worker-gpu-group-tm22g
          - raycluster-complete-worker-gpu-group-tm52g
          - raycluster-complete-worker-gpu-group-tm12g
          - raycluster-complete-worker-gpu-group-tm32g
      template:
        spec:
          containers:
            - env:
                - name: RAY_DISABLE_DOCKER_CPU_WARNING
                  value: '1'
                - name: TYPE
                  value: worker
                - name: CPU_REQUEST
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker-gpu
                      resource: requests.cpu
                - name: CPU_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker-gpu
                      resource: limits.cpu
                - name: MEMORY_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker-gpu
                      resource: limits.memory
                - name: MEMORY_REQUESTS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker-gpu
                      resource: requests.memory
                - name: MY_POD_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              image: 'rayproject/ray:2.0.0-py39-cu111'
              lifecycle:
                preStop:
                  exec:
                    command:
                      - /bin/sh
                      - '-c'
                      - ray stop
              name: gpu
              ports:
                - containerPort: 80
                  protocol: TCP
              resources:
                limits:
                  cpu: 20
                  memory: 20Gi
                  nvidia.com/gpu: 1
                requests:
                  cpu: 1
                  memory: 20Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /var/log
                  name: log-volume
          initContainers:
            - command:
                - sh
                - '-c'
                - >-
                  until nslookup $RAY_IP.$(cat
                  /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local;
                  do echo waiting for myservice; sleep 2; done
              image: 'busybox:1.28'
              name: init-myservice
          volumes:
            - emptyDir: {}
              name: log-volume
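To confirm the overshoot against this manifest, a small script along the following lines can poll the RayCluster CR and compare each group's replicas to its maxReplicas. This is only a sketch, not part of the original report; it assumes the kubernetes Python client is installed and you have kubeconfig access, and the namespace and cluster name mirror the manifest above.

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Fetch the RayCluster custom resource defined above.
raycluster = api.get_namespaced_custom_object(
    group="ray.io",
    version="v1alpha1",
    namespace="ray",
    plural="rayclusters",
    name="raycluster-complete",
)

# Flag any worker group whose requested replicas exceed its maxReplicas.
for group in raycluster["spec"]["workerGroupSpecs"]:
    name = group["groupName"]
    replicas = group.get("replicas", 0)
    max_replicas = group["maxReplicas"]
    to_delete = group.get("scaleStrategy", {}).get("workersToDelete") or []
    status = "OVER LIMIT" if replicas > max_replicas else "ok"
    print(f"{name}: replicas={replicas} maxReplicas={max_replicas} "
          f"workersToDelete={len(to_delete)} [{status}]")
```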
Anything else
Running Ray 2.0 with the latest stable KubeRay operator.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 20 (1 by maintainers)
Commits related to this issue
- [autoscaler][kuberay] Never request more than maxReplicas worker pods (#29770) Partially addresses ray-project/kuberay#560, in which it was observed that "replicas" was being set higher than "maxRepl... — committed to ray-project/ray by DmitriGekhtman 2 years ago
- [autoscaler][kuberay] Never request more than maxReplicas worker pods (#29770) Partially addresses ray-project/kuberay#560, in which it was observed that "replicas" was being set higher than "maxRepl... — committed to WeichenXu123/ray by DmitriGekhtman 2 years ago
The output of ray status from within the Ray cluster is also very useful.
@peterghaddad, do you mean autoscaler termination of worker pods due to idle timeout, as opposed to RayCluster termination due to job completion? Actually, I’m not 100% sure RayCluster termination due to job completion is a feature of RayJobs right now (I need to have another look at the docs and code).
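As a supplement to ray status, roughly the same information can be pulled programmatically from a pod inside the cluster; the sketch below uses the standard ray.cluster_resources(), ray.available_resources(), and ray.nodes() APIs, though the exact contents vary by Ray version.

```python
import ray

# Run from a pod inside the Ray cluster; "auto" picks up the local GCS address.
ray.init(address="auto")

# Totals the cluster reports, and what is currently free.
print("cluster resources:  ", ray.cluster_resources())
print("available resources:", ray.available_resources())
print("alive nodes:", sum(1 for node in ray.nodes() if node["Alive"]))
```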
I’m trying to narrow down whether the issue is with the job controller, the raycluster controller, the autoscaler, or some combination of the three.
Some safeguards around the maxWorkers in the RayCluster controller would be the simplest surface-level fix.
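For illustration, the safeguard being suggested amounts to clamping the requested replica count to the group's bounds before acting on it. A minimal sketch with a hypothetical helper name, not KubeRay's or the autoscaler's actual code:

```python
def clamp_worker_request(requested: int, min_replicas: int, max_replicas: int) -> int:
    """Never request fewer than minReplicas or more than maxReplicas worker pods."""
    return max(min_replicas, min(requested, max_replicas))

# With the gpu-group values from the manifest above (minReplicas=0, maxReplicas=5),
# the runaway requests of 9 and 18 workers would both be capped at 5.
assert clamp_worker_request(9, 0, 5) == 5
assert clamp_worker_request(18, 0, 5) == 5
assert clamp_worker_request(3, 0, 5) == 3
```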
@DmitriGekhtman and @harryge00, this is a pretty big problem that we are continuing to experience.
For example, if we have a Ray remote function @ray.remote(num_gpus=1) that is only called once, the cluster sometimes gets into a weird state like this:
It is trying to resize to 18 GPUs, but maxWorkers for GPUs is 10.
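For context, the workload pattern is essentially a single GPU task, along the lines of the following hypothetical reconstruction (not the exact job that triggered this):

```python
import ray

ray.init(address="auto")  # submitted as a RayJob against the cluster above

# Hypothetical reconstruction of the workload pattern described here:
# a single task that reserves one GPU.
@ray.remote(num_gpus=1)
def gpu_task():
    # Placeholder for the real GPU work.
    return "done"

# The task is invoked exactly once, so at most one GPU worker should be needed,
# yet the autoscaler was observed trying to scale the GPU group to 18 nodes
# while maxWorkers for the group is 10.
print(ray.get(gpu_task.remote()))
```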