ray: [Bug] Ray Autoscaler is not spinning down idle nodes due to secondary object copies

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Tune

What happened + What you expected to happen

Ray Autoscaler is not spinning down idle nodes if they ever ran a trial for the active Ray Tune job

The issue is seen with Ray Tune on a Ray 1.9.1 cluster with a CPU head node and GPU worker nodes (min=0, max=9).
The Higgs Ray Tune job is set up to run up to 10 trials using async hyperband for 1 hour with a
max_concurrency of 3. I see at most 3 trials running at a time (each requiring 1 GPU and 4 CPUs).
Except in the first log output at job startup, no PENDING trials are reported.
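
For context, the shape of the Higgs Tune job is roughly as follows (an illustrative sketch only, not the actual Ludwig AutoML code in the reproduction script below; the trainable, search space, metric, and the HyperOptSearch/ConcurrencyLimiter choices are assumptions):

```python
# Sketch of a Tune job matching the description above: up to 10 trials,
# async hyperband, a 1-hour budget, at most 3 concurrent trials, and
# 1 GPU + 4 CPUs per trial. Placeholder trainable and search space.
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch  # requires `hyperopt`


def train_higgs(config):
    # Placeholder for the real Ludwig training step.
    tune.report(metric_score=0.5)


analysis = tune.run(
    train_higgs,
    metric="metric_score",
    mode="min",
    config={"training.learning_rate": tune.loguniform(1e-3, 1e-1)},
    num_samples=10,                       # up to 10 trials
    time_budget_s=3600,                   # 1-hour time limit
    scheduler=AsyncHyperBandScheduler(),  # async hyperband
    search_alg=ConcurrencyLimiter(HyperOptSearch(), max_concurrent=3),
    resources_per_trial={"cpu": 4, "gpu": 1},
)
```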

At the time the Higgs Ray Tune job is stopped for the 1-hour time limit at 12:43:48, the console log (see below) shows:
  • 3 nodes running Higgs trials (10.0.4.71, 10.0.6.5, 10.0.4.38)
  • 2 nodes that previously ran Higgs trials but are no longer doing so (10.0.2.129, 10.0.3.245)
The latter 2 nodes last reported running trials at 12:21:30, so they should have been spun down by the idle timeout.

Note that, in this run, multiple Ray Tune jobs were running in the same Ray cluster with some overlap:
 MushroomEdibility Ray Tune 1hr job ran from 11:20-12:20
 ForestCover       Ray Tune 1hr job ran from 11:22-12:22
 Higgs             Ray Tune 1hr job ran from 11:45-12:45
After 12:22 there was no overlap of jobs, so the 2 idle workers that remained up had no use other than the historical Higgs trials.
The two other nodes that became idle when MushroomEdibility and ForestCover completed were spun down at that point, leaving the 2 idle nodes that Higgs had used still running.
In a similar scenario later in the run, I observed that after the Higgs job completed, all Higgs trial workers were spun down.

Current time: 2022-01-24 12:43:48 (running for 00:59:30.27) ...
Number of trials: 9/10 (3 RUNNING, 6 TERMINATED)

+----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------+
| Trial name     | status     | loc             |   combiner.bn_momentum |   combiner.bn_virtual_bs |   combiner.num_steps |   combiner.output_size |   combiner.relaxation_factor |   combiner.size |   combiner.sparsity |   training.batch_size |   training.decay_rate |   training.decay_steps |   training.learning_rate |   iter |   total time (s) |   metric_score |
|----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------|
| trial_04bb0f22 | RUNNING    | 10.0.4.71:1938  |                   0.7  |                     2048 |                    4 |                      8 |                          1   |              32 |              0.0001 |                  8192 |                  0.95 |                    500 |                    0.025 |     18 |         3566.91  |       0.489641 |
| trial_3787a9c4 | RUNNING    | 10.0.6.5:17263  |                   0.9  |                     4096 |                    7 |                     24 |                          1.5 |              64 |              0      |                   256 |                  0.9  |                  10000 |                    0.01  |        |                  |                |
| trial_39a0ad6e | RUNNING    | 10.0.4.38:8657  |                   0.8  |                      256 |                    3 |                     16 |                          1.2 |              64 |              0.001  |                  4096 |                  0.95 |                   2000 |                    0.005 |      4 |         1268.2   |       0.50659  |
| trial_05396980 | TERMINATED | 10.0.2.129:2985 |                   0.8  |                      256 |                    9 |                    128 |                          1   |              32 |              0      |                  2048 |                  0.95 |                  10000 |                    0.005 |      1 |          913.295 |       0.53046  |
| trial_059befa6 | TERMINATED | 10.0.3.245:282  |                   0.98 |                     1024 |                    3 |                      8 |                          1   |               8 |              1e-06  |                  1024 |                  0.8  |                    500 |                    0.005 |      1 |          316.455 |       0.573849 |
| trial_c433a60c | TERMINATED | 10.0.3.245:281  |                   0.8  |                     1024 |                    7 |                     24 |                          2   |               8 |              0.001  |                   256 |                  0.95 |                  20000 |                    0.01  |      1 |         1450.99  |       0.568653 |
| trial_277d1a8a | TERMINATED | 10.0.4.38:8658  |                   0.9  |                      256 |                    5 |                     64 |                          1.5 |              64 |              0.0001 |                   512 |                  0.95 |                  20000 |                    0.005 |      1 |          861.914 |       0.56506  |
| trial_26f6b0b0 | TERMINATED | 10.0.2.129:3079 |                   0.6  |                      256 |                    3 |                     16 |                          1.2 |              16 |              0.01   |                  1024 |                  0.9  |                   8000 |                    0.005 |      1 |          457.482 |       0.56582  |
| trial_2acddc5e | TERMINATED | 10.0.3.245:504  |                   0.6  |                      512 |                    5 |                     32 |                          2   |               8 |              0      |                  2048 |                  0.95 |                  10000 |                    0.025 |      1 |          447.483 |       0.594953 |
+----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------+

Versions / Dependencies

Ray 1.9.1

Reproduction script

https://github.com/ludwig-ai/experiments/blob/main/automl/validation/run_nodeless.sh, run with Ray deployed on a K8s cluster. I can provide the Ray deployment script if desired.

Anything else

This problem is highly reproducible for me.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 67 (29 by maintainers)

Most upvoted comments

Hmm, do we know what created those objects and what is referencing them? “ray memory” can show you more information on this.

Yes, thank you @mwtian and @iycheng, that worked! At least it worked fine in my single-worker-node repro scenario above, so hopefully it will work in general.

I was able to do the “pip install” for the workers using “setupCommands:”, and I was able to do the “pip install” for the head here:

```
  headStartRayCommands:
    - ray stop
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
    - ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0 &> /tmp/raylogs
```

Don’t report memory usage of secondary objects for the purposes of autoscaler downscaling.

I vote for this approach.
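
To make that proposal concrete: a node holding only secondary copies of objects (and no running tasks or actors) should still count as idle for downscaling. A toy sketch of that policy, purely illustrative and not Ray's actual internals (the names and data structures here are assumptions):

```python
# Toy model of the proposed downscaling policy: only primary object copies
# contribute to the usage the autoscaler considers, so a node left holding
# nothing but secondary copies becomes eligible for the idle timeout.
from dataclasses import dataclass
from typing import List


@dataclass
class ObjectCopy:
    size_bytes: int
    is_primary: bool  # False for secondary copies fetched by past tasks


def usage_reported_for_downscaling(copies: List[ObjectCopy]) -> int:
    """Object-store bytes counted when deciding whether a node is idle."""
    return sum(c.size_bytes for c in copies if c.is_primary)


def node_is_idle(copies: List[ObjectCopy], running_tasks: int) -> bool:
    # Under this policy, the two workers that only held secondary copies
    # from finished Higgs trials would report zero usage and be scaled
    # down after the idle timeout.
    return running_tasks == 0 and usage_reported_for_downscaling(copies) == 0
```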

Still looking at this. Will have a fix soon.

My experimental run last night with this change looked great, with the Ray Autoscaler scaling down as desired. Thank you very much!!

I did not restart the Ray Cluster after upgrading. But “ray --version” showed the expected version, 2.0.0.dev0, after upgrading on both the head and the worker.

In a K8s deployment, the autoscaler runs in a separate operator node; that node shows that it is running version 2.0.0.dev0 as well.

I can run the pip install in the setupCommands for the workers, but unfortunately Ray does not support specifying setupCommands for the head when deploying onto K8s. What I think I can do instead is bring up the Ray head (with 0 workers), update the head, and then restart it, if you think that may work.

I’d recommend, if possible, running head and worker images with the changes built in rather than running pip install at runtime or in setup commands.

diff --git a/deploy/components/example_cluster.yaml b/deploy/components/example_cluster.yaml
index 1513e8fde..d4359fb77 100644
--- a/deploy/components/example_cluster.yaml
+++ b/deploy/components/example_cluster.yaml
@@ -4,14 +4,14 @@ metadata:
   name: example-cluster
 spec:
   # The maximum number of workers nodes to launch in addition to the head node.
-  maxWorkers: 3
+  maxWorkers: 9
   # The autoscaler will scale up the cluster faster with higher upscaling speed.
   # E.g., if the task requires adding more nodes then autoscaler will gradually
   # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
   # This number should be > 0.
   upscalingSpeed: 1.0
   # If a node is idle for this many minutes, it will be removed.
-  idleTimeoutMinutes: 5
+  idleTimeoutMinutes: 3
   # Specify the pod type for the ray head node (as configured below).
   headPodType: head-node
   # Optionally, configure ports for the Ray head service.
@@ -35,6 +35,8 @@ spec:
       metadata:
         # The operator automatically prepends the cluster name to this field.
         generateName: ray-head-
+        annotations:
+          node.elotl.co/instance-family-exclusions: "g3,g3s"
       spec:
         restartPolicy: Never
 
@@ -48,11 +50,25 @@ spec:
         containers:
         - name: ray-node
           imagePullPolicy: Always
-          image: rayproject/ray:latest
+          # image: rayproject/ray:latest
+          image: ludwigai/ludwig-ray-gpu:tf-legacy
           # Do not change this command - it keeps the pod alive until it is
           # explicitly killed.
           command: ["/bin/bash", "-c", "--"]
           args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
+          env:
+          - name: AWS_ACCESS_KEY_ID
+            value: "<censored>"
+          - name: AWS_SECRET_ACCESS_KEY
+            value: "<censored>"
+          - name: AWS_DEFAULT_REGION
+            value: "us-west-2"
+          - name: TUNE_TRIAL_STARTUP_GRACE_PERIOD
+            value: "120.0"
+          - name: TUNE_TRIAL_RESULT_WAIT_TIME_S
+            value: "120"
+          - name: TUNE_STATE_REFRESH_PERIOD
+            value: "5"
           ports:
           - containerPort: 6379  # Redis port
           - containerPort: 10001  # Used by Ray Client
@@ -67,9 +83,9 @@ spec:
             name: dshm
           resources:
             requests:
-              cpu: 1000m
-              memory: 512Mi
-              ephemeral-storage: 1Gi
+              cpu: 7
+              memory: 50Gi
+              ephemeral-storage: 64Gi
             limits:
               # The maximum memory that this pod is allowed to use. The
               # limit will be detected by ray and split to use 10% for
@@ -78,21 +94,25 @@ spec:
               # the object store size is not set manually, ray will
               # allocate a very large object store in each pod that may
               # cause problems for other pods.
-              memory: 512Mi
+              memory: 50Gi
   - name: worker-node
     # Minimum number of Ray workers of this Pod type.
-    minWorkers: 2
+    minWorkers: 0
     # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
-    maxWorkers: 3
+    maxWorkers: 9
     # User-specified custom resources for use by Ray.
     # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
     rayResources: {"example-resource-a": 1, "example-resource-b": 1}
+    setupCommands:
+    - pip install boto3==1.17.106 awscli==1.19.106 botocore==1.20.106 s3fs==2021.10.0 aiobotocore==1.4.2 s3transfer==0.4.0 fsspec==2021.10.0
     podConfig:
       apiVersion: v1
       kind: Pod
       metadata:
         # The operator automatically prepends the cluster name to this field.
         generateName: ray-worker-
+        annotations:
+          node.elotl.co/instance-family-exclusions: "g3,g3s"
       spec:
         restartPolicy: Never
         volumes:
@@ -102,9 +122,23 @@ spec:
         containers:
         - name: ray-node
           imagePullPolicy: Always
-          image: rayproject/ray:latest
+          # image: rayproject/ray:latest
+          image: ludwigai/ludwig-ray-gpu:tf-legacy
           command: ["/bin/bash", "-c", "--"]
           args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
+          env:
+          - name: AWS_ACCESS_KEY_ID
+            value: "<censored>"
+          - name: AWS_SECRET_ACCESS_KEY
+            value: "<censored>"
+          - name: AWS_DEFAULT_REGION
+            value: "us-west-2"
+          - name: TUNE_TRIAL_STARTUP_GRACE_PERIOD
+            value: "120.0"
+          - name: TUNE_TRIAL_RESULT_WAIT_TIME_S
+            value: "120"
+          - name: TUNE_STATE_REFRESH_PERIOD
+            value: "5"
           # This volume allocates shared memory for Ray to use for its plasma
           # object store. If you do not provide this, Ray will fall back to
           # /tmp which cause slowdowns if is not a shared memory volume.
@@ -113,9 +147,9 @@ spec:
             name: dshm
           resources:
             requests:
-              cpu: 1000m
-              memory: 512Mi
-              ephemeral-storage: 1Gi
+              cpu: 7
+              memory: 28Gi
+              ephemeral-storage: 64Gi
             limits:
               # The maximum memory that this pod is allowed to use. The
               # limit will be detected by ray and split to use 10% for
@@ -124,7 +158,8 @@ spec:
               # the object store size is not set manually, ray will
               # allocate a very large object store in each pod that may
               # cause problems for other pods.
-              memory: 512Mi
+              memory: 28Gi
+              nvidia.com/gpu: 1 # requesting 1 GPU
   # Commands to start Ray on the head node. You don't need to change this.
   # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
   headStartRayCommands: