kuberay: [Bug][RayJobs] Ray spawns more pods than threshold (Max Workers for specific Group).

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

Others

What happened + What you expected to happen

During worker pod termination after a Ray Job, the cluster tries to respawn new workers and starts to scale above the threshold. I’ve seen Ray clusters scale to over 100 worker pods because of this issue when the threshold is 25.

The workersToDelete array continues to grow and more pods continue to spawn. Note that the threshold is only 5 for this worker type.

I think it’s because the Ray workers get stuck in a terminating state, so the operator tries to respawn them while they terminate gracefully. This behavior happens randomly and not after every job. Is there a way to tell the autoscaler not to spawn additional pods once the replicas are equal to maxWorkers?
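
For reference, this is a minimal sketch of how the drift between replicas, maxReplicas, and workersToDelete can be watched while the problem unfolds (it assumes the official kubernetes Python client and kubeconfig access; the name and namespace match the manifest below):

import time

from kubernetes import client, config

# Poll the RayCluster custom resource and print each worker group's
# replica count against its maxReplicas, plus the size of workersToDelete.
config.load_kube_config()
api = client.CustomObjectsApi()

while True:
    rc = api.get_namespaced_custom_object(
        group="ray.io", version="v1alpha1", namespace="ray",
        plural="rayclusters", name="raycluster-complete",
    )
    for group in rc["spec"]["workerGroupSpecs"]:
        to_delete = group.get("scaleStrategy", {}).get("workersToDelete", [])
        print(f"{group['groupName']}: replicas={group['replicas']} "
              f"maxReplicas={group['maxReplicas']} "
              f"workersToDelete={len(to_delete)}")
    time.sleep(10)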

Reproduction script

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-complete
  namespace: ray
  labels:
    controller-tools.k8s.io: '1.0'
spec:
  enableInTreeAutoscaling: true
  headGroupSpec:
    rayStartParams:
      port: '6379'
      dashboard-host: 0.0.0.0
      node-manager-port: '9998'
      block: 'true'
      no-monitor: 'true'
      num-gpus: '0'
      object-manager-port: '9999'
      node-ip-address: $MY_POD_IP
      object-store-memory: '100000000'
    replicas: 1
    serviceType: ClusterIP
    template:
      metadata:
        labels:
          groupName: headgroup
          rayCluster: raycluster
          rayNodeType: head
      spec:
        containers:
          - env:
              - name: CPU_REQUEST
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: requests.cpu
              - name: CPU_LIMITS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: limits.cpu
              - name: MEMORY_LIMITS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: limits.memory
              - name: MEMORY_REQUESTS
                valueFrom:
                  resourceFieldRef:
                    containerName: ray-head
                    resource: requests.memory
              - name: MY_POD_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
            image: 'rayproject/ray:2.0.0-py39'
            imagePullPolicy: Always
            lifecycle:
              preStop:
                exec:
                  command:
                    - /bin/sh
                    - '-c'
                    - ray stop
            name: ray-head
            ports:
              - containerPort: 6379
                name: gcs
                protocol: TCP
              - containerPort: 8265
                name: dashboard
                protocol: TCP
              - containerPort: 10001
                name: client
                protocol: TCP
            resources:
              limits:
                cpu: '20'
                memory: 20Gi
              requests:
                cpu: '10'
                memory: 20Gi
  rayVersion: 2.0.0
  workerGroupSpecs:
    - groupName: cpu-group
      maxReplicas: 15
      minReplicas: 0
      rayStartParams:
        block: 'true'
        node-ip-address: $MY_POD_IP
        num-gpus: '0'
      replicas: 0
      template:
        spec:
          containers:
            - env:
                - name: RAY_DISABLE_DOCKER_CPU_WARNING
                  value: '1'
                - name: TYPE
                  value: worker
                - name: CPU_REQUEST
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: requests.cpu
                - name: CPU_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: limits.cpu
                - name: MEMORY_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: limits.memory
                - name: MEMORY_REQUESTS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker
                      resource: requests.memory
                - name: MY_POD_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              image: 'rayproject/ray:2.0.0-py39-9'
              lifecycle:
                preStop:
                  exec:
                    command:
                      - /bin/sh
                      - '-c'
                      - ray stop
              name: ray-worker
              ports:
                - containerPort: 80
                  protocol: TCP
              resources:
                limits:
                  cpu: '18'
                  memory: 28Gi
                requests:
                  cpu: '18'
                  memory: 28Gi
              volumeMounts:
                - mountPath: /var/log
                  name: log-volume
          initContainers:
            - command:
                - sh
                - '-c'
                - >-
                  until nslookup $RAY_IP.$(cat
                  /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local;
                  do echo waiting for myservice; sleep 2; done
              image: 'busybox:1.28'
              name: init-myservice
          volumes:
            - emptyDir: {}
              name: log-volume
    - groupName: gpu-group
      maxReplicas: 5
      minReplicas: 0
      rayStartParams:
        block: 'true'
        node-ip-address: $MY_POD_IP
        num-gpus: '1'
      replicas: 9
      scaleStrategy:
        workersToDelete:
          - raycluster-complete-worker-gpu-group-tm72g
          - raycluster-complete-worker-gpu-group-tm71g
          - raycluster-complete-worker-gpu-group-tm42g
          - raycluster-complete-worker-gpu-group-tm32g
          - raycluster-complete-worker-gpu-group-tm62g
          - raycluster-complete-worker-gpu-group-tm22g 
          - raycluster-complete-worker-gpu-group-tm52g
          - raycluster-complete-worker-gpu-group-tm12g
          - raycluster-complete-worker-gpu-group-tm32g
      template:
        spec:
          containers:
            - env:
                - name: RAY_DISABLE_DOCKER_CPU_WARNING
                  value: '1'
                - name: TYPE
                  value: worker
                - name: CPU_REQUEST
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker-gpu
                      resource: requests.cpu
                - name: CPU_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker-gpu
                      resource: limits.cpu
                - name: MEMORY_LIMITS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker-gpu
                      resource: limits.memory
                - name: MEMORY_REQUESTS
                  valueFrom:
                    resourceFieldRef:
                      containerName: ray-worker-gpu
                      resource: requests.memory
                - name: MY_POD_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              image: 'rayproject/ray:2.0.0-py39-cu111'
              lifecycle:
                preStop:
                  exec:
                    command:
                      - /bin/sh
                      - '-c'
                      - ray stop
              name: ray-worker-gpu
              ports:
                - containerPort: 80
                  protocol: TCP
              resources:
                limits:
                  cpu: 20
                  memory: 20Gi
                  nvidia.com/gpu: 1
                requests:
                  cpu: 1
                  memory: 20Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /var/log
                  name: log-volume
          initContainers:
            - command:
                - sh
                - '-c'
                - >-
                  until nslookup $RAY_IP.$(cat
                  /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local;
                  do echo waiting for myservice; sleep 2; done
              image: 'busybox:1.28'
              name: init-myservice
          volumes:
            - emptyDir: {}
              name: log-volume

Anything else

Running Ray 2.0 with the latest stable KubeRay operator.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 20 (1 by maintainers)

Most upvoted comments

The output of ray status from within the Ray cluster is also very useful.

During worker pod termination after a Ray Job

@peterghaddad do you mean autoscaler termination of worker pods due to idle timeout, as opposed to RayCluster termination due to job completion? Actually, I’m not 100% sure RayCluster termination due to job completion is a feature of RayJobs right now (I need to have another look at the docs and code).

I’m trying to narrow down whether the issue is with the job controller, the raycluster controller, the autoscaler, or some combination of the three.

Some safeguards around the maxWorkers in the RayCluster controller would be the simplest surface-level fix.
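
To make that concrete, here is a rough sketch of the kind of clamp I mean (illustrative only; the actual controller is written in Go and this is not KubeRay code):

# Illustrative sketch: never reconcile a worker group above its maxReplicas,
# regardless of what the autoscaler or a growing workersToDelete list implies.
def clamp_replicas(group_spec):
    desired = group_spec["replicas"]
    desired = max(desired, group_spec.get("minReplicas", 0))
    desired = min(desired, group_spec["maxReplicas"])
    return desired

# With the gpu-group values from the manifest above (replicas: 9, maxReplicas: 5),
# the group would reconcile to 5 pods instead of 9.
clamp_replicas({"replicas": 9, "minReplicas": 0, "maxReplicas": 5})  # -> 5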

@DmitriGekhtman and @harryge00 This is a pretty big problem that we are continuing to experience.

For example, if we have a Ray remote function decorated with @ray.remote(num_gpus=1) that is only called once, the cluster sometimes gets into a weird state like the following (a minimal sketch of the workload is included after the log):

(scheduler +1h9m10s) Resized to 220 CPUs, 10 GPUs.
(scheduler +1h9m10s) Removing 975 nodes of type gpu-group (max_workers_per_type).
(scheduler +1h11m20s) Resized to 320 CPUs, 15 GPUs.
(scheduler +1h11m20s) Removing 2 nodes of type gpu-group (idle).
(scheduler +1h11m20s) Removing 1223 nodes of type gpu-group (max_workers_per_type).
(scheduler +1h11m20s) Error: No available node types can fulfill resource request {'node:': 0.01}. Add suitable node types to this cluster to resolve this issue.
(scheduler +1h14m1s) Removing 5 nodes of type gpu-group (idle).
(scheduler +1h14m1s) Removing 1531 nodes of type gpu-group (max_workers_per_type).
(scheduler +1h17m19s) Resized to 340 CPUs, 16 GPUs.
(scheduler +1h17m19s) Removing 1 nodes of type gpu-group (idle).
(scheduler +1h17m19s) Removing 1877 nodes of type gpu-group (max_workers_per_type).
(scheduler +1h17m19s) Error: No available node types can fulfill resource request {'node:': 0.01}. Add suitable node types to this cluster to resolve this issue.
(scheduler +1h21m15s) Resized to 380 CPUs, 18 GPUs.
(scheduler +1h21m15s) Removing 2303 nodes of type gpu-group (max_workers_per_type).
(scheduler +1h25m59s) Removing 6 nodes of type gpu-group (idle).
(scheduler +1h25m59s) Removing 2767 nodes of type gpu-group (max_workers_per_type).
(scheduler +1h25m59s) Error: No available node types can fulfill resource request {'node:': 0.01}. Add suitable node types to this cluster to resolve this issue.
(scheduler +1h31m36s) Resized to 440 CPUs, 21 GPUs.
(scheduler +1h31m36s) Removing 3269 nodes of type gpu-group (max_workers_per_type).

It is trying to resize to 18 GPUs, but maxWorkers for the GPU group is 10.
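
For completeness, the workload is roughly shaped like this (a minimal sketch; the function name and body are placeholders for our actual GPU task):

import ray

ray.init(address="auto")  # connect to the existing Ray cluster

# A single GPU task submitted once; despite only one pending task, the log
# above shows gpu-group repeatedly resizing past its cap.
@ray.remote(num_gpus=1)
def gpu_task():
    ...  # placeholder for the actual work

ray.get(gpu_task.remote())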