ray: [xgboost_ray] Model training hangs/freezes when multiple actors are used
Description
Background/Context
I was following the guide for running a sample XGBoost model using xgboost_ray (see the Link section below).
Problem
I have enough memory allocated (8 GB) and the program runs, but it seems to keep training indefinitely and never completes.
The Ray dashboard doesn't show any errors in the logs.
I ran the job with the following RayParams configuration:
- num_actors = 2
- cpus_per_actor = 1
However, if I change num_actors from 2 to 1, training completes successfully and DOES NOT get stuck.
I'm not sure why it hangs when more than one actor is used, even though the guide linked above says to use 2 actors. 🤔
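One quick sanity check before training is to confirm that the connected cluster actually exposes at least two free CPUs for the two 1-CPU actors. A minimal sketch of that check, assuming the same client address used in the script further below:

# Sketch: confirm the cluster can schedule two 1-CPU training actors.
# Assumes the ray://example-cluster-ray-head:10001 client address from the app script.
import ray

ray.init("ray://example-cluster-ray-head:10001")

# Total resources registered across the head and worker pods.
print("cluster resources:  ", ray.cluster_resources())
# Resources currently free; num_actors=2 with cpus_per_actor=1 needs >= 2 free CPUs.
print("available resources:", ray.available_resources())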
Steps to reproduce
Create a Ray cluster using the no-Helm approach: kubectl -n <namespace> create -f ./example_cluster.yaml
example_cluster.yaml file:
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
spec:
  # The maximum number of worker nodes to launch in addition to the head node.
  maxWorkers: 3
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes then the autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: head-node
  # Optionally, configure ports for the Ray head service.
  # The ports specified below are the defaults.
  headServicePorts:
    - name: client
      port: 10001
      targetPort: 10001
    - name: dashboard
      port: 8265
      targetPort: 8265
    - name: ray-serve
      port: 8000
      targetPort: 8000
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
    - name: head-node
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          # The operator automatically prepends the cluster name to this field.
          generateName: ray-head-
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          restartPolicy: Never
          serviceAccount: default-editor
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which causes slowdowns if it is not a shared memory volume.
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: rayproject/ray:latest
              # Do not change this command - it keeps the pod alive until it is
              # explicitly killed.
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
              ports:
                - containerPort: 6379   # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
                - containerPort: 10001  # Used by Ray Client
                - containerPort: 8265   # Used by Ray Dashboard
                - containerPort: 8000   # Used by Ray Serve
              # This volume allocates shared memory for Ray to use for its plasma
              # object store. If you do not provide this, Ray will fall back to
              # /tmp, which causes slowdowns if it is not a shared memory volume.
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
              resources:
                requests:
                  cpu: 4
                  memory: 8Gi
                  ephemeral-storage: 8Gi
                limits:
                  cpu: 4
                  # The maximum memory that this pod is allowed to use. The
                  # limit will be detected by ray and split to use 10% for
                  # redis, 30% for the shared memory object store, and the
                  # rest for application memory. If this limit is not set and
                  # the object store size is not set manually, ray will
                  # allocate a very large object store in each pod that may
                  # cause problems for other pods.
                  memory: 8Gi
    - name: worker-node
      # Minimum number of Ray workers of this Pod type.
      minWorkers: 2
      # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
      maxWorkers: 3
      # User-specified custom resources for use by Ray.
      # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
      rayResources: {"example-resource-a": 1, "example-resource-b": 1}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          # The operator automatically prepends the cluster name to this field.
          generateName: ray-worker-
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccount: default-editor
          restartPolicy: Never
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: rayproject/ray:latest
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
              # This volume allocates shared memory for Ray to use for its plasma
              # object store. If you do not provide this, Ray will fall back to
              # /tmp, which causes slowdowns if it is not a shared memory volume.
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
              resources:
                requests:
                  cpu: 4
                  memory: 8Gi
                  ephemeral-storage: 8Gi
                limits:
                  cpu: 4
                  # The maximum memory that this pod is allowed to use. The
                  # limit will be detected by ray and split to use 10% for
                  # redis, 30% for the shared memory object store, and the
                  # rest for application memory. If this limit is not set and
                  # the object store size is not set manually, ray will
                  # allocate a very large object store in each pod that may
                  # cause problems for other pods.
                  memory: 8Gi
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --disable-usage-stats --head --port=6379 --no-monitor --dashboard-host 0.0.0.0 &> /tmp/raylogs
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --disable-usage-stats --address=$RAY_HEAD_IP:6379 &> /tmp/raylogs
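Before running the training script, it is worth confirming that both worker pods actually registered with the head node; a two-actor job that never gets resources for its second actor would hang in exactly this way. A minimal sketch, assuming the same client address as the app script:

# Sketch: list the Ray nodes that are alive and the CPUs each one contributes.
# With minWorkers: 2 above, three nodes (1 head + 2 workers) should be listed.
import ray

ray.init("ray://example-cluster-ray-head:10001")

alive = [n for n in ray.nodes() if n["Alive"]]
print(f"{len(alive)} node(s) alive")
for n in alive:
    print(n["NodeManagerAddress"], n["Resources"].get("CPU", 0), "CPUs")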
Ray App Script:
import ray
from xgboost_ray import RayDMatrix, RayParams, train, predict
from sklearn.datasets import load_breast_cancer
import xgboost as xgb
# Initialize connection to remote Ray cluster on K8s and install dependencies
runtime_env = {"pip": ["pyarrow==8.0.0", "xgboost_ray", "gcsfs", "sklearn"]}
ray.init("ray://example-cluster-ray-head:10001", runtime_env=runtime_env)
# XG Boost Model Training
train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)
evals_result = {}
bst = train(
    {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    train_set,
    evals_result=evals_result,
    evals=[(train_set, "train")],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=1),  # num_actors=2
)
bst.save_model("model.xgb")
print("Final training error: {:.4f}".format(
evals_result["train"]["error"][-1]))
# Model Prediction
data, labels = load_breast_cancer(return_X_y=True)
dpred = RayDMatrix(data, labels)
bst = xgb.Booster(model_file="model.xgb")
pred_ray = predict(bst, dpred, ray_params=RayParams(num_actors=1))
print(pred_ray)
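To isolate whether the hang is specific to the Kubernetes cluster, the same two-actor training call can be run against a throwaway local Ray instance; if it finishes locally with num_actors=2, the script itself is fine and the problem lies in the cluster setup. A minimal sketch (local only, assuming xgboost_ray and scikit-learn are installed in the local environment):

# Sketch: reproduce the two-actor training on a local Ray instance to rule out the script.
import ray
from sklearn.datasets import load_breast_cancer
from xgboost_ray import RayDMatrix, RayParams, train

ray.init(num_cpus=2)  # local Ray, not the remote K8s cluster

train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)
bst = train(
    {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    train_set,
    evals=[(train_set, "train")],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=1),
)
print("local two-actor run finished")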
Link
https://docs.ray.io/en/latest/ray-more-libs/xgboost-ray.html#usage
About this issue
- State: closed
- Created 2 years ago
- Comments: 26 (12 by maintainers)
We just came off a debugging session.
Results:
This seems to be a communication issue between Kubernetes pods.
We'll try to repro this on our side to see if we can resolve it. One thing to try would be to open up all ports for cross-node communication.
cc @DmitriGekhtman, can you help me set this up/deploy on our infrastructure to repro?
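One XGBoost-independent way to check cross-node scheduling and communication is to fan out a handful of trivial remote tasks and see which pods they land on. A minimal sketch, assuming the same client address as the app script above:

# Sketch: check that remote tasks can run on, and return results from, the worker pods.
import ray

ray.init("ray://example-cluster-ray-head:10001")

@ray.remote(num_cpus=1)
def where_am_i():
    import socket  # imported in the task so it resolves on the pod running it
    return socket.gethostname()

# If every task reports only the head pod's hostname, work may not be reaching
# the worker pods at all, which would point to a pod-to-pod networking issue.
print(set(ray.get([where_am_i.remote() for _ in range(16)])))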
Hey @Anando304! Your intuition about the Ray release is most astute 😃 😃 I’ll take a look into reproducing this issue today.
Besides unblocking you, we have selfish ulterior motives: In the context of the Ray release, we want Ray Tune on K8s to be a great experience 😃
Can you post logs @Anando304 (/tmp/raylogs)?