ray: [Ray Client] Unable to connect to client when utilizing a proxy

What happened + What you expected to happen

When using a Ray cluster and setting the proxy environment variables on the pod (i.e. the gRPC proxy and HTTP proxy variables), I am unable to connect to the remote cluster using ray.init(). This only occurs when the proxies are set, and I have verified that my no_proxy environment variable is set correctly.

Are there specific configurations that need to be set when running behind a proxy? I am rebuilding the image with those proxy environment variables set. Essentially, I am trying to use runtime environments to install pip packages; however, as soon as the proxy settings are in place, the gRPC client is unable to connect to the Ray Client server.
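
Roughly, the call that fails looks like this (the address and the pip package list are placeholders, not my exact values):

import ray

# Placeholder Ray Client address; the head pod exposes the client server on port 10001.
ray.init(
    "ray://head-svc.example.com:10001",
    runtime_env={"pip": ["torch"]},
)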

These are the specific errors I receive:

INFO:ray.util.client.server.server:Starting Ray Client server on 0.0.0.0:10001
INFO:ray.util.client.server.proxier:New data connection from client 0d8a49073b31478a9518ecc41101c230: 
INFO:ray.util.client.server.proxier:SpecificServer started on port: 23000 with PID: 298 for client: 0d8a49073b31478a9518ecc41101c230
ERROR:ray.util.client.server.proxier:Timeout waiting for channel for 0d8a49073b31478a9518ecc41101c230
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 352, in get_channel
    server.channel).result(timeout=CHECK_CHANNEL_TIMEOUT_S)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
ERROR:ray.util.client.server.proxier:Timeout waiting for channel for 0d8a49073b31478a9518ecc41101c230
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 352, in get_channel
    server.channel).result(timeout=CHECK_CHANNEL_TIMEOUT_S)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
WARNING:ray.util.client.server.proxier:Retrying Logstream connection. 1 attempts failed.
ERROR:ray.util.client.server.proxier:Channel not found for 0d8a49073b31478a9518ecc41101c230
$ ERROR:ray.util.client.server.proxier:Channel not found for 0d8a49073b31478a9518ecc41101c230
sh: 19: ERROR:ray.util.client.server.proxier:Channel: not found

I was able to work around this issue by setting the proxy environment variables for pip only, so that the gRPC traffic bypasses the proxy. I do not believe this is the best workaround, though, and I still receive some proxy-related errors in the head node's logs:

ERROR:ray.util.client.server.proxier:Proxying Datapath failed!
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 659, in Datapath
    for resp in resp_stream:
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating requests!"
        debug_error_string = "None"
>
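
For reference, the workaround boils down to a container environment along these lines (shown as Python for brevity, with a placeholder proxy URL): pip is pointed at the proxy explicitly, while the variables that gRPC reads (grpc_proxy, http_proxy, https_proxy) are left unset so the Ray Client channel connects directly.

import os

# Placeholder proxy URL: only pip is routed through the proxy.
os.environ["PIP_PROXY"] = "http://proxy.example.com:3128"

# Leave the generic proxy variables unset so the gRPC / Ray Client
# traffic is not sent through the proxy.
for var in ("grpc_proxy", "http_proxy", "HTTP_PROXY", "https_proxy", "HTTPS_PROXY"):
    os.environ.pop(var, None)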

Additionally, I have noticed issues when a worker node tries to install the pip packages: the Ray head node can install them, but the worker nodes cannot. Any insight would be appreciated.

(raylet, ip=10.131.4.120) [2022-04-12 12:29:46,896 E 127 127] (raylet) agent_manager.cc:190: Failed to create runtime env: {"envVars": {"PIP_NO_CACHE_DIR": "1"}, "extensions": {"_ray_commit": "fec30a25dbb5f3fa81d2bf419f75f5d40bc9fc39"}, "pythonRuntimeEnv": {"pipRuntimeEnv": {"config": {"packages": ["torch"]}}}, "uris": {"pipUri": "pip://9c0e795652de796e136d31847c5f97c28b7d09d7"}}, error message: Failed to install pip requirements:
(raylet, ip=10.131.4.120) Collecting torch
(raylet, ip=10.131.4.120) Downloading torch-1.11.0-cp37-cp37m-manylinux1_x86_64.whl (750.6 MB)
(raylet, ip=10.131.4.120) 
(raylet, ip=10.131.4.120) [2022-04-12 12:29:46,897 E 127 127] (raylet) worker_pool.cc:623: [Eagerly] Couldn't create a runtime environment for job 01000000.
(raylet, ip=10.131.4.120) [2022-04-12 12:29:46,897 E 127 127] (raylet) agent_manager.cc:190: Failed to create runtime env: {"envVars": {"PIP_NO_CACHE_DIR": "1"}, "extensions": {"_ray_commit": "fec30a25dbb5f3fa81d2bf419f75f5d40bc9fc39"}, "pythonRuntimeEnv": {"pipRuntimeEnv": {"config": {"packages": ["torch"]}}}, "uris": {"pipUri": "pip://9c0e795652de796e136d31847c5f97c28b7d09d7"}}, error message: Failed to install pip requirements:
(raylet, ip=10.131.4.120) Collecting torch
(raylet, ip=10.131.4.120) Downloading torch-1.11.0-cp37-cp37m-manylinux1_x86_64.whl (750.6 MB)
(raylet, ip=10.131.4.120) 
(pid=gcs_server) [2022-04-12 12:29:46,897 E 126 126] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 078ddb83b3aa89732f7ed6578e448bc4f1fadf340e14edbdbbc251d5 for actor 24c732320cea8daebcbfbce701000000(PPOTrainer.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILE
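
For what it's worth, a quick way to check which proxy-related variables each node actually sees is to run a small task against the cluster (the address below is a placeholder):

import os
import ray

# Placeholder Ray Client address; no runtime_env is needed for this check.
ray.init("ray://head-svc.example.com:10001")

@ray.remote
def proxy_env():
    # Return the proxy-related variables visible to this worker process.
    keys = ("http_proxy", "https_proxy", "no_proxy", "grpc_proxy", "PIP_PROXY")
    return {k: os.environ.get(k) for k in keys}

# With several tasks, the scheduler usually spreads them across more than one node.
print(ray.get([proxy_env.remote() for _ in range(8)]))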

Versions / Dependencies

Ray 1.11.0, Python 3.7.12

Reproduction script

Calling ray.init() in a Jupyter notebook, as in the snippet above.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

Thanks for the details. From the missing-file error, it looks like it might be a bug in runtime environments where some necessary files are getting deleted too early.

What’s interesting is that this only occurs when proxies are set; I’m not sure yet how that could be related. I’d be curious to know whether the issue still persists on ray==1.12.0rc1 or on the nightly wheels.

What would be the simplest way to reproduce this? Also, does it happen every time, or randomly?