ray: [Core] [Bug] Remote client environment is not setting up properly.

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core, Ray Clusters

What happened + What you expected to happen

Hi everyone, I have a Ray cluster deployed on Azure K8s. I connect to it using the kubectl command given in the documentation. I initially ran the task in a local Ray environment to test that it scales and works. Then, when I try to put an object on the cluster as a test, it throws the following error:

Put failed:
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-9-bd8872eca93e> in <module>
     16     'combined_dfs': combined_dfs_ray,
     17 }
---> 18 run_validation = validate_source.remote(config)
ModuleNotFoundError: No module named 'sklearn'

But sklearn is available in the local env. Also, for the algorithm we are running, we use the runtime_env option when calling ray.init to make our custom code available; all the dependencies are already installed in that env.

LOCAL_PORT = 10001
ray.init(f"ray://127.0.0.1:{LOCAL_PORT}",
         runtime_env={
             "working_dir": "../src",
         })

From the put command, I was expecting an ObjectRef back, indicating that the object had been successfully transferred to the cluster.
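For reference, the runtime_env passed to ray.init can also carry a "pip" field that installs Python packages on the cluster side; a minimal sketch, with an illustrative package list that is not taken from the original report:

import ray

LOCAL_PORT = 10001

# "pip" asks Ray to build a runtime environment on the cluster that
# includes these packages, so remote tasks can import them even when
# the cluster image itself does not ship them.
ray.init(
    f"ray://127.0.0.1:{LOCAL_PORT}",
    runtime_env={
        "working_dir": "../src",
        "pip": ["scikit-learn"],  # illustrative; pin to match the local version
    },
)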

Versions / Dependencies

I am using:

  • Conda (Windows)
  • Python 3.8.12
  • Ray[default] 1.9.1
  • Remote client on Azure k8s

Reproduction script

I am trying to work out a small code sample, but the problem is that simple code with no external dependencies works fine in the remote client environment; the error only occurs when I use our existing dev env. I am trying to create an example in the meantime. Also, if there are any logs that could help provide more information, please let me know.
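A minimal sketch of the kind of reproduction this points at, assuming a cluster whose workers do not have sklearn installed (the function name is illustrative):

import ray

ray.init("ray://127.0.0.1:10001")  # connect through the Ray client

@ray.remote
def uses_sklearn():
    # The import runs on a cluster worker, not locally, so it raises
    # ModuleNotFoundError if the worker environment lacks sklearn.
    import sklearn
    return sklearn.__version__

print(ray.get(uses_sklearn.remote()))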

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

I had the same problem.

In [4]: ray.init(address='auto', namespace='algo-serve', runtime_env={'pip':['requests==2.1.2'], 'env_vars': dict(os.environ)})
2022-01-10 12:20:57,224	INFO worker.py:843 -- Connecting to existing Ray cluster at address: 10.251.192.213:6379
Out[4]:
{'node_ip_address': '10.251.192.213',
 'raylet_ip_address': '10.251.192.213',
 'redis_address': '10.251.192.213:6379',
 'object_store_address': '/tmp/ray/session_2022-01-10_11-43-51_099860_49411/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2022-01-10_11-43-51_099860_49411/sockets/raylet',
 'webui_url': '10.251.192.213:8265',
 'session_dir': '/tmp/ray/session_2022-01-10_11-43-51_099860_49411',
 'metrics_export_port': 61428,
 'node_id': 'ac4b598c8477048e81b883069027527fbaf0ba9be758441c427be970'}

(raylet) [2022-01-10 12:20:57,396 E 49504 49504] agent_manager.cc:237: Failed to delete URIs, status = IOError: , maybe there are some network problems, will retry it later.
(raylet) [2022-01-10 12:20:57,397 E 49504 49504] agent_manager.cc:237: Failed to delete URIs, status = IOError: , maybe there are some network problems, will retry it later.
(raylet, ip=10.251.183.221) [2022-01-10 12:20:57,615 E 157 157] agent_manager.cc:237: Failed to delete URIs, status = IOError: , maybe there are some network problems, will retry it later.
(raylet, ip=10.251.183.221) [2022-01-10 12:20:57,617 E 157 157] agent_manager.cc:237: Failed to delete URIs, status = IOError: , maybe there are some network problems, will retry it later.
(raylet, ip=10.251.183.221) [2022-01-10 12:20:57,682 E 157 157] agent_manager.cc:237: Failed to delete URIs, status = IOError: , maybe there are some network problems, will retry it later.
(raylet, ip=10.251.183.221) [2022-01-10 12:20:57,809 E 157 157] agent_manager.cc:237: Failed to delete URIs, status = IOError: , maybe there are some network problems, will retry it later.

Thanks @dongruixiao, could you share some more details about your setup? Are you using Helm charts as well?

I only run it on my custom cluster, implemented via node_provider.py, and do not use Helm.

And this is my config:

cluster_name: default

max_workers: 2

upscaling_speed: 1.0

idle_timeout_minutes: 5

provider:
    type: external
    module: test.my_provider
auth:
    ssh_user: ubuntu

available_node_types:
    ray.head.default:
        resources: 
          ...

    ray.worker.default:
        min_workers: 0
        max_workers: 32

        resources:
            ... 
        
head_node_type: ray.head.default

file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude: []

rsync_filter: []

initialization_commands: []

setup_commands:
    - test ! -z $all_proxy || echo 'export all_proxy="..."' >> ~/.bashrc
    - test -d $HOME/anaconda3 || wget https://repo.continuum.io/archive/Anaconda3-2021.11-Linux-x86_64.sh
    - test -d $HOME/anaconda3 || bash Anaconda3-2021.11-Linux-x86_64.sh -b -p $HOME/anaconda3
    - which conda || echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
 
head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited
    - ray start --head --port=6379 --object-manager-port=8076 --include-dashboard true --dashboard-host 0.0.0.0 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ulimit -c unlimited
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}

When running on the cluster, you need to make sure the deps are also installed there. You don't need to do this locally, because locally everything runs in the same env (the local env).
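One quick way to verify this is to run a trivial remote task that reports what the worker environment can actually import; a minimal sketch (the function name here is illustrative):

import ray

ray.init("ray://127.0.0.1:10001")  # or however you normally connect

@ray.remote
def check_worker_env():
    import importlib.util
    import sys
    # Report which interpreter the worker uses and whether it can import sklearn.
    return {
        "python": sys.executable,
        "sklearn_available": importlib.util.find_spec("sklearn") is not None,
    }

print(ray.get(check_worker_env.remote()))

If sklearn_available comes back False, the deps need to be installed on the cluster nodes, or shipped via the "pip" field of runtime_env as in the comment above.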