ray: [Ray Client] Unable to connect to client when utilizing a proxy
What happened + What you expected to happen
When using a Ray cluster with proxy environment variables set on the pod (i.e., the gRPC proxy and HTTP proxy variables), I am unable to connect to the remote cluster using ray.init(). This only occurs when the proxies are set, and I have verified that my no_proxy environment variable is set properly.
Are there specific configurations that need to be set when running behind a proxy? I am rebuilding the image with those environment variables baked in. Essentially, I am trying to use runtime environments to install pip packages; however, when the proxy settings are set, the gRPC client is unable to connect to the Ray Client server.
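For reference, the proxy-related variables in question look roughly like the following (a sketch with placeholder values; the real proxy URL and no_proxy entries come from the pod spec):

import os

PROXY = "http://proxy.example.internal:3128"  # placeholder proxy address

os.environ["grpc_proxy"] = PROXY   # consulted by gRPC's C core before http(s)_proxy
os.environ["http_proxy"] = PROXY
os.environ["https_proxy"] = PROXY
# Hosts that must bypass the proxy, e.g. loopback and the Ray head service.
os.environ["no_proxy"] = "localhost,127.0.0.1,ray-head.example.svc"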
These are the specific errors I receive:
INFO:ray.util.client.server.server:Starting Ray Client server on 0.0.0.0:10001
INFO:ray.util.client.server.proxier:New data connection from client 0d8a49073b31478a9518ecc41101c230:
INFO:ray.util.client.server.proxier:SpecificServer started on port: 23000 with PID: 298 for client: 0d8a49073b31478a9518ecc41101c230
ERROR:ray.util.client.server.proxier:Timeout waiting for channel for 0d8a49073b31478a9518ecc41101c230
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 352, in get_channel
server.channel).result(timeout=CHECK_CHANNEL_TIMEOUT_S)
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_utilities.py", line 139, in result
self._block(timeout)
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_utilities.py", line 85, in _block
raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
ERROR:ray.util.client.server.proxier:Timeout waiting for channel for 0d8a49073b31478a9518ecc41101c230
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 352, in get_channel
server.channel).result(timeout=CHECK_CHANNEL_TIMEOUT_S)
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_utilities.py", line 139, in result
self._block(timeout)
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_utilities.py", line 85, in _block
raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
WARNING:ray.util.client.server.proxier:Retrying Logstream connection. 1 attempts failed.
ERROR:ray.util.client.server.proxier:Channel not found for 0d8a49073b31478a9518ecc41101c230
$ ERROR:ray.util.client.server.proxier:Channel not found for 0d8a49073b31478a9518ecc41101c230
sh: 19: ERROR:ray.util.client.server.proxier:Channel: not found
I was able to work around this issue by manually setting the pip environment variables to use the proxy while the gRPC server does not (a sketch of this appears after the traceback below), though I do not believe this is the best workaround. I still receive some proxy errors in the head node's logs:
ERROR:ray.util.client.server.proxier:Proxying Datapath failed!
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 659, in Datapath
for resp in resp_stream:
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
return self._next()
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception iterating requests!"
debug_error_string = "None"
>
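For completeness, the workaround amounts to something like the following (a sketch only, assuming pip's PIP_PROXY environment variable and placeholder values; the variables actually baked into the image may differ):

import os

PROXY = "http://proxy.example.internal:3128"  # placeholder proxy address

# Let pip reach the package index through the proxy...
os.environ["PIP_PROXY"] = PROXY  # same effect as `pip install --proxy=...`

# ...while keeping the proxy variables that gRPC consults unset, so the
# Ray Client / proxier gRPC channels connect directly.
for var in ("grpc_proxy", "http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ.pop(var, None)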
Additionally, I have noticed issues when a worker node tries to install the pip packages. The Ray head node is able to install them, but the worker nodes are not. Any insight would be appreciated.
(raylet, ip=10.131.4.120) [2022-04-12 12:29:46,896 E 127 127] (raylet) agent_manager.cc:190: Failed to create runtime env: {"envVars": {"PIP_NO_CACHE_DIR": "1"}, "extensions": {"_ray_commit": "fec30a25dbb5f3fa81d2bf419f75f5d40bc9fc39"}, "pythonRuntimeEnv": {"pipRuntimeEnv": {"config": {"packages": ["torch"]}}}, "uris": {"pipUri": "pip://9c0e795652de796e136d31847c5f97c28b7d09d7"}}, error message: Failed to install pip requirements:
(raylet, ip=10.131.4.120) Collecting torch
(raylet, ip=10.131.4.120) Downloading torch-1.11.0-cp37-cp37m-manylinux1_x86_64.whl (750.6 MB)
(raylet, ip=10.131.4.120)
(raylet, ip=10.131.4.120) [2022-04-12 12:29:46,897 E 127 127] (raylet) worker_pool.cc:623: [Eagerly] Couldn't create a runtime environment for job 01000000.
(raylet, ip=10.131.4.120) [2022-04-12 12:29:46,897 E 127 127] (raylet) agent_manager.cc:190: Failed to create runtime env: {"envVars": {"PIP_NO_CACHE_DIR": "1"}, "extensions": {"_ray_commit": "fec30a25dbb5f3fa81d2bf419f75f5d40bc9fc39"}, "pythonRuntimeEnv": {"pipRuntimeEnv": {"config": {"packages": ["torch"]}}}, "uris": {"pipUri": "pip://9c0e795652de796e136d31847c5f97c28b7d09d7"}}, error message: Failed to install pip requirements:
(raylet, ip=10.131.4.120) Collecting torch
(raylet, ip=10.131.4.120) Downloading torch-1.11.0-cp37-cp37m-manylinux1_x86_64.whl (750.6 MB)
(raylet, ip=10.131.4.120)
(pid=gcs_server) [2022-04-12 12:29:46,897 E 126 126] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 078ddb83b3aa89732f7ed6578e448bc4f1fadf340e14edbdbbc251d5 for actor 24c732320cea8daebcbfbce701000000(PPOTrainer.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILE
Versions / Dependencies
Ray v1.11.0, Python 3.7.12
Reproduction script
Using ray.init in a Jupyter Notebook
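A minimal sketch of the call (the address and package list are placeholders standing in for the real cluster service and dependencies):

import ray

# Connect through the Ray Client server exposed on the head node (port 10001)
# and ask the cluster to install pip packages via a runtime environment.
ray.init(
    "ray://ray-head.example.svc:10001",  # placeholder address
    runtime_env={"pip": ["torch"]},
)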
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (9 by maintainers)
Commits related to this issue
- [Core] Stop iteratoring cancelled grpc request streams (#23865) — committed to rueian/ray by rueian 2 years ago
- [Core] Stop iteratoring cancelled grpc request streams (#23865) Signed-off-by: Rueian <rueiancsie@gmail.com> — committed to rueian/ray by rueian 2 years ago
- [Core] Stop iteratoring cancelled grpc request streams (#23865) (#27951) This fixes the below grpc error mentioned in #23865. grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of R... — committed to ray-project/ray by rueian 2 years ago
- Merge branch 'master' of https://github.com/ddelange/ray into pytorch-extra-index-url * 'master' of https://github.com/ddelange/ray: (1154 commits) [Tune] [PBT] [Doc] Fix and clean up PBT examples ... — committed to ddelange/ray by ddelange 2 years ago
- [Core] Stop iteratoring cancelled grpc request streams (#23865) (#27951) This fixes the below grpc error mentioned in #23865. grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC... — committed to WeichenXu123/ray by rueian 2 years ago
Thanks for the details. From the missing-file error it looks like it might be a bug in runtime environments where some necessary files are getting deleted too early.
What's interesting is that this only occurs when proxies are set; I'm not sure yet how that could be related. I'd be curious to know whether the issue still persists on ray==1.12.0rc1 or on the nightly wheels. What would be the simplest way to reproduce this? Also, does it happen every time, or randomly?