ray: [Bug] Resource demand backlog caps out at 1+ in multi-node K8s cluster

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

My Setup:

I have deployed a Ray cluster on Kubernetes with EKS. The cluster has a 15-CPU worker type that is always on (min_workers: 1). Other worker types include 7 CPUs, 15 CPUs, and 30 CPUs.

I’ve also configured upscaling_speed=9999, AUTOSCALER_MAX_LAUNCH_BATCH=9999, and AUTOSCALER_MAX_CONCURRENT_LAUNCHES=9999, as recommended here: https://github.com/ray-project/ray/issues/21683#issuecomment-1018015823

Main Issue:

Whenever I launch 200 tasks, the autoscaler only launches a single 7-CPU worker. After that worker is ready, it launches another single 7-CPU worker. I expect it to launch enough workers for at least the remaining 185 CPUs in the first iteration.

In the Ray operator pod logs, I see {'CPU': 1.0}: 1+ pending tasks/actors. I would expect this to be something like “185+ pending tasks/actors”.

However, I can successfully use ray.autoscaler.sdk.request_resources to request 200 CPUs all at once.
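For reference, a minimal sketch of that workaround, assuming request_resources works over the same Ray Client connection used in the repro script below:

import ray
from ray.autoscaler.sdk import request_resources

# Same Ray Client address as the repro script below.
ray.init("ray://mycluster.internal:10001")

# Ask the autoscaler to provision capacity for 200 CPUs in one shot,
# instead of relying on the (apparently capped) task demand backlog.
request_resources(num_cpus=200)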


Ray Operator Pod Logs: kubectl logs -f po/ray-operator-987dc99b9-47v92 | grep Resources -A 10

Resources
---------------------------------------------------------------
Usage:
 0.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 (no resource demands)
py38-cu112,karpenter:2022-02-04 10:47:45,337    DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.6050498485565186, '10.16.126.217': 0.6049869060516357}\n - NodeIdleSeconds: Min=0 Mean=38219 Max=76437\n - ResourceUsage: 0.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:45,338    DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
 4.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:47:51,278    DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.5087563991546631, '10.16.126.217': 0.5086848735809326}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 4.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:51,289    DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
 10.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:47:57,323    DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.4549856185913086, '10.16.126.217': 0.4549136161804199}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 10.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:57,324    DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
 15.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:48:03,348    DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.54921555519104, '10.16.126.217': 0.5491447448730469}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 15.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:48:03,350    DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
 15.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors

Ray Operator Pod Description:

Name:         ray-operator-987dc99b9-47v92
Namespace:    karpenter
Priority:     0
Node:         ip-10-16-65-175.us-west-2.compute.internal/10.16.65.175
Start Time:   Thu, 03 Feb 2022 13:05:01 -0600
Labels:       cluster.ray.io/component=operator
              pod-template-hash=987dc99b9
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           10.16.68.93
IPs:
  IP:           10.16.68.93
Controlled By:  ReplicaSet/ray-operator-987dc99b9
Containers:
  ray:
    Container ID:  docker://f78dc8efee6c1f8483438e325c275a281c310341a1f304f4811aa8c80d263903
    Image:         rayproject/ray:bbc64e
    Image ID:      docker-pullable://rayproject/ray@sha256:6a0520c449555a31418ed79188827553e43276db6abc1bc4341edd16dd832eb3
    Port:          <none>
    Host Port:     <none>
    Command:
      ray-operator
    State:          Running
      Started:      Thu, 03 Feb 2022 13:05:03 -0600
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:                1
      ephemeral-storage:  1Gi
      memory:             1Gi
    Environment:
      AUTOSCALER_MAX_NUM_FAILURES:         inf
      AUTOSCALER_MAX_LAUNCH_BATCH:         9999
      AUTOSCALER_MAX_CONCURRENT_LAUNCHES:  9999
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wq2kl (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-wq2kl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

Versions / Dependencies

Ray Operator Image: rayproject/ray:bbc64e
Ray Client Version: 1.9.2
Python: 3.8
Kubernetes: 1.21 (EKS)

Reproduction script

from pprint import pprint

import ray

# Connect via Ray Client to the cluster's load balancer endpoint.
ray.init("ray://mycluster.internal:10001")

@ray.remote
def task():
    import time
    time.sleep(30)

pprint(ray.cluster_resources())

# Submit 200 one-CPU tasks; the autoscaler should scale out to run all of them.
results = ray.get([task.remote() for _ in range(200)])

Anything else

This issue occurs every time.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 48 (39 by maintainers)

Most upvoted comments

You need to wait for the workers to come up before it can be reproduced.

@iycheng after a bit of fiddling I was able to reproduce on a K8s cluster: https://console.anyscale.com/o/anyscale-internal/clusters/ses_ts49ySqkJ1bSxYD7tXuhjWmH

I think the main thing is that you need a large number of worker nodes (wait for 20 nodes to start). Under those conditions, when I ran the following script, the CPUs in the cluster took a long time to become utilized, though eventually they did (and the reported demands eventually increased from 1 to ~9k):

import ray
import time

@ray.remote
def foo():
    import time
    time.sleep(999)

ray.get([foo.remote() for _ in range(10000)])

I tried it today and it does not seem reproducible with a cluster on EC2.

Resources
---------------------------------------------------------------
Usage:
 0.00/164.205 GiB memory
 0.0/1.0 node
 0.00/74.365 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 200+ pending tasks/actors

The commit is 40fa56f40cb1dbdc50d33291e5adf45e7f32d471.

My connection path is: Ray Client -> VPN -> AWS NLB (managed as a Kubernetes LoadBalancer Service using aws-load-balancer-controller) -> Ray head Kubernetes pod.

Clients connect from outside the Kubernetes cluster. I connect using ray.init("ray://mycluster.internal:10001"); mycluster.internal is set up to resolve to the AWS NLB.

Here are the Kubernetes objects. I masked some IDs with “********”.

kubectl describe svc/ray-cluster

Name:                     ray-cluster
Namespace:                karpenter
Labels:                   <none>
Annotations:              service.beta.kubernetes.io/aws-load-balancer-name: ray-cluster-gjdk
                          service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
                          service.beta.kubernetes.io/aws-load-balancer-type: external
Selector:                 cluster.ray.io/component=ray-py38-cu112-ray-head
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.20.108.183
IPs:                      172.20.108.183
LoadBalancer Ingress:     ray-cluster-gjdk-********.elb.us-west-2.amazonaws.com
Port:                     <unset>  10001/TCP
TargetPort:               10001/TCP
NodePort:                 <unset>  32570/TCP
Endpoints:                10.16.102.85:10001
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

kubectl describe po/ray-py38-cu112-head-4xnxj

Name:         ray-py38-cu112-head-4xnxj
Namespace:    karpenter
Priority:     0
Node:         ip-10-16-99-98.us-west-2.compute.internal/10.16.99.98
Start Time:   Mon, 07 Feb 2022 13:29:34 -0600
Labels:       cluster.ray.io/component=ray-py38-cu112-ray-head
              ray-cluster-name=ray-py38-cu112
              ray-file-mounts-contents=********
              ray-launch-config=********
              ray-node-name=ray-ray-py38-cu112-head
              ray-node-status=up-to-date
              ray-node-type=head
              ray-node-uuid=********
              ray-runtime-config=********
              ray-user-node-type=head
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           10.16.102.85
IPs:
  IP:           10.16.102.85
Controlled By:  RayCluster/ray-py38-cu112
Containers:
  ray-node:
    Container ID:  docker://6e6d69dc421a3a890efa5a75a15528e945c5196f1a8f67076f536a7ba309017e
    Image:         rayproject/ray-ml:1.10.0-py38-cu112
    Image ID:      docker-pullable://rayproject/ray-ml@sha256:604197d454df1d1de3bef55cdba72529120eeef0dae1adb90a24f3500dea6bff
    Ports:         6379/TCP, 10001/TCP, 8265/TCP, 8000/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/bash
      -c
      --
    Args:
      trap : TERM INT; sleep infinity & wait;
    State:          Running
      Started:      Mon, 07 Feb 2022 13:34:58 -0600
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     0
      memory:  7G
    Requests:
      cpu:     0
      memory:  7G
    Environment:
      RAY_gcs_server_rpc_server_thread_num:  1
      RAY_PROFILING:                         1
    Mounts:
      /dev/shm from dshm (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2lnvv (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  kube-api-access-2lnvv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              karpenter.sh/capacity-type=on-demand
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

The autoscaler is only seeing 1 request queued. This is going to be tough to debug without a repro.

resource_load_by_shape {
  resource_demands {
    shape {
      key: "CPU"
      value: 1.0
    }
    num_ready_requests_queued: 1
  }
}

@vicyap can you try to wrap your submission code in a remote task, so we can figure out whether it’s a Ray Core or Ray Client issue?
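For example, a minimal sketch of that wrapping based on the repro script above (the outer task runs on a cluster node, so the 200 inner submissions go through Ray core directly rather than through the Ray Client connection):

import ray

ray.init("ray://mycluster.internal:10001")

@ray.remote
def task():
    import time
    time.sleep(30)

@ray.remote
def submit_all():
    # Submissions happen from inside the cluster, not through Ray Client.
    return ray.get([task.remote() for _ in range(200)])

ray.get(submit_all.remote())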

Hi @DmitriGekhtman, thank you for looking into this. Regarding “One suggestion is to try using the same Ray version across all components”: I did initially use the same version, 1.9.2, on all components and still hit this bug/behavior. I thought upgrading the Ray operator to the latest Docker image (rayproject/ray:bbc64e for me at the time) might resolve it, but it did not.

I’ll try again with everything at 1.10 and set AUTOSCALER_LOG_RESOURCE_BATCH_DATA=1.

I was not able to replicate with a similar setup, but with everything running Ray 1.10.0.

One suggestion is to try using the same Ray version across all components (client, operator, head, workers). (We’ll add that recommendation to the docs soon.)

Issues with unexpected resource demands are kind of a recurring theme lately. I’m going to push a PR adding a debug flag to enable logging the entire resource message in the autoscaler logs.