ray: [Bug] Resource demand backlog caps out at 10+ in multi-node K8s cluster
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Clusters
What happened + What you expected to happen
My Setup:
I have deployed a Ray cluster on Kubernetes (EKS). The cluster has a 15-CPU worker type that is always on (min_workers: 1); the other available worker types have 7, 15, and 30 CPUs.
I’ve also configured: upscaling_speed=9999, AUTOSCALER_MAX_LAUNCH_BATCH=9999, AUTOSCALER_MAX_CONCURRENT_LAUNCHES=9999 as recommended here: https://github.com/ray-project/ray/issues/21683#issuecomment-1018015823
Main Issue:
Whenever I launch 200 tasks, the autoscaler launches only a single 7-CPU worker. After that worker is ready, it launches another single 7-CPU worker, and so on. I expect it to launch enough workers for at least 185 additional CPUs in the first iteration (200 one-CPU tasks minus the 15 CPUs already available).
In the Ray operator pod logs, I see {'CPU': 1.0}: 1+ pending tasks/actors. I would expect this to read something like {'CPU': 1.0}: 185+ pending tasks/actors.
However, I can successfully use ray.autoscaler.sdk.request_resources to request 200 CPUs all at once.
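For concreteness, the workaround looks roughly like this (a minimal sketch; mycluster.internal is just my own DNS name for the NLB):

import ray
from ray.autoscaler.sdk import request_resources

ray.init("ray://mycluster.internal:10001")

# Explicitly ask the autoscaler for 200 CPUs in one shot. Unlike the
# task backlog above, this demand scales the cluster up as expected.
request_resources(num_cpus=200)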
Ray Operator Pod Logs:
kubectl logs -f po/ray-operator-987dc99b9-47v92 | grep Resources -A 10
Resources
---------------------------------------------------------------
Usage:
0.0/15.0 CPU
0.00/25.900 GiB memory
0.00/10.249 GiB object_store_memory
Demands:
(no resource demands)
py38-cu112,karpenter:2022-02-04 10:47:45,337 DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.6050498485565186, '10.16.126.217': 0.6049869060516357}\n - NodeIdleSeconds: Min=0 Mean=38219 Max=76437\n - ResourceUsage: 0.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:45,338 DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
4.0/15.0 CPU
0.00/25.900 GiB memory
0.00/10.249 GiB object_store_memory
Demands:
{'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:47:51,278 DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.5087563991546631, '10.16.126.217': 0.5086848735809326}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 4.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:51,289 DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
10.0/15.0 CPU
0.00/25.900 GiB memory
0.00/10.249 GiB object_store_memory
Demands:
{'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:47:57,323 DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.4549856185913086, '10.16.126.217': 0.4549136161804199}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 10.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:57,324 DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
15.0/15.0 CPU
0.00/25.900 GiB memory
0.00/10.249 GiB object_store_memory
Demands:
{'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:48:03,348 DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.54921555519104, '10.16.126.217': 0.5491447448730469}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 15.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:48:03,350 DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
15.0/15.0 CPU
0.00/25.900 GiB memory
0.00/10.249 GiB object_store_memory
Demands:
{'CPU': 1.0}: 1+ pending tasks/actors
Ray Pod Description:
Name: ray-operator-987dc99b9-47v92
Namespace: karpenter
Priority: 0
Node: ip-10-16-65-175.us-west-2.compute.internal/10.16.65.175
Start Time: Thu, 03 Feb 2022 13:05:01 -0600
Labels: cluster.ray.io/component=operator
pod-template-hash=987dc99b9
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.16.68.93
IPs:
IP: 10.16.68.93
Controlled By: ReplicaSet/ray-operator-987dc99b9
Containers:
ray:
Container ID: docker://f78dc8efee6c1f8483438e325c275a281c310341a1f304f4811aa8c80d263903
Image: rayproject/ray:bbc64e
Image ID: docker-pullable://rayproject/ray@sha256:6a0520c449555a31418ed79188827553e43276db6abc1bc4341edd16dd832eb3
Port: <none>
Host Port: <none>
Command:
ray-operator
State: Running
Started: Thu, 03 Feb 2022 13:05:03 -0600
Ready: True
Restart Count: 0
Limits:
cpu: 1
memory: 2Gi
Requests:
cpu: 1
ephemeral-storage: 1Gi
memory: 1Gi
Environment:
AUTOSCALER_MAX_NUM_FAILURES: inf
AUTOSCALER_MAX_LAUNCH_BATCH: 9999
AUTOSCALER_MAX_CONCURRENT_LAUNCHES: 9999
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wq2kl (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-wq2kl:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Versions / Dependencies
Ray Operator Image: rayproject/ray:bbc64e
Ray Client Version: 1.9.2
Python: 3.8
Kubernetes: 1.21 (EKS)
Reproduction script
from pprint import pprint

import ray

ray.init("ray://mycluster.internal:10001")

@ray.remote
def task():
    import time
    time.sleep(30)
    pprint(ray.cluster_resources())

results = ray.get([task.remote() for _ in range(200)])
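(As a side note, while the 200 tasks are queued I can watch the cluster's total CPU count from a second client session with something like the sketch below; this is only illustrative, not part of the reproduction itself.)

import time

import ray

ray.init("ray://mycluster.internal:10001")
for _ in range(20):
    # Total CPUs currently registered in the cluster; this should jump
    # quickly if the autoscaler saw the full 200-CPU backlog.
    print(ray.cluster_resources().get("CPU", 0), "CPUs in cluster")
    time.sleep(15)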
Anything else
This issue occurs every time.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 48 (39 by maintainers)
You need to wait for the workers to come up before it can be reproduced.
@iycheng after a bit of fiddling I was able to reproduce on a K8s cluster: https://console.anyscale.com/o/anyscale-internal/clusters/ses_ts49ySqkJ1bSxYD7tXuhjWmH
I think the main thing is you need a large number of worker nodes (wait for 20 nodes to start). In that condition, when I ran the script I saw that the CPUs in the cluster took a long time to become utilized, though eventually they did (and eventually the demands increased from 1 to ~9k).
I tried it today and it does not seem reproducible with a cluster on EC2.
The commit is 40fa56f40cb1dbdc50d33291e5adf45e7f32d471. Cc @iycheng
Ray Client -> VPN -> AWS NLB (managed as a Kubernetes LoadBalancer Service using aws-load-balancer-controller) -> Ray Head Kubernetes Pod.
Clients connect from outside of the Kubernetes cluster. I connect using ray.init("ray://mycluster.internal:10001"). I set up mycluster.internal so it resolves to the AWS NLB. Here are the Kubernetes objects; I masked some IDs with “********”.
kubectl describe svc/ray-cluster
kubectl describe po/ray-py38-cu112-head-4xnxj
Autoscaler is only seeing 1 request queued. This is going to be tough to debug without a repro.
@vicyap can you try to wrap your submission code in a remote task so we can try to rule out whether it’s a ray core or ray client issue?
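In case it is useful, here is roughly what that suggestion looks like (a sketch only; submit_all is an illustrative name): submit the 200 tasks from inside a single remote task, so the work is launched from a Ray worker on the cluster rather than through the Ray Client connection.

import ray

ray.init("ray://mycluster.internal:10001")

@ray.remote
def task():
    import time
    time.sleep(30)

@ray.remote
def submit_all():
    # Runs on the cluster, so the 200 tasks are submitted by a Ray worker
    # rather than by the external Ray Client session.
    return ray.get([task.remote() for _ in range(200)])

ray.get(submit_all.remote())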
Hi @DmitriGekhtman, thank you for looking into this. Regarding “One suggestion is to try using the same Ray version across all components”: I did initially use the same version, 1.9.2, on all components and still hit this bug/behavior. I thought upgrading the Ray operator to the latest Docker image (rayproject/ray:bbc64e for me at the time) might resolve it, but it did not. I'll try again with everything at 1.10 and set AUTOSCALER_LOG_RESOURCE_BATCH_DATA=1.
I was not able to replicate this with a similar set-up, but with everything running Ray 1.10.0.
One suggestion is to try using the same Ray version across all components (client, operator, head, workers). (We'll add that recommendation to the docs soon.)
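(For reference, a quick way to compare the client's Ray version with the cluster's is something like the following sketch; check_version is just an illustrative helper.)

import ray

print("client:", ray.__version__)

ray.init("ray://mycluster.internal:10001")

@ray.remote
def check_version():
    import ray
    return ray.__version__

print("cluster:", ray.get(check_version.remote()))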
Issues with unexpected resource demands are kind of a recurring theme lately. I’m going to push a PR adding a debug flag to enable logging the entire resource message in the autoscaler logs.