ray: [Core] Cannot Connect to Head Node GCS Through URL
What happened + What you expected to happen
- I am working in a managed Kubernetes environment. We have a three-node setup (managed K8S Deployment + Service + Ingress): one head node and two worker nodes. Using the Service and Ingress configurations, I expose port 8265 of my container through the (internal) URL http://head-node-dashboard.company.internal.domain.com, and port 6379 through http://head-node-gcs.company.internal.domain.com.
When I try to submit jobs to the dashboard URL, everything works fine:
ray job submit --working-dir ./ --address='http://head-node-dashboard.company.internal.domain.com' -- python ./script.py
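For reference, the equivalent submission through the Python job submission SDK (targeting the same dashboard URL; the entrypoint and working directory just mirror the CLI call above) would look roughly like this:
from ray.job_submission import JobSubmissionClient

# Point the client at the dashboard URL, same address as the CLI call above.
client = JobSubmissionClient("http://head-node-dashboard.company.internal.domain.com")

# Submit the same entrypoint, with the current directory as the working dir.
job_id = client.submit_job(
    entrypoint="python ./script.py",
    runtime_env={"working_dir": "./"},
)
print(job_id)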
But when I try to connect to the GCS, it fails. There are two ways that this happens:
- Connecting a worker node to the head node with ray start:
$ > ray start --address='head-node-gcs.company.internal.domain.com:80'
Local node IP: 10.251.222.101
2023-03-18 06:51:17,521 WARNING utils.py:1446 -- Unable to connect to GCS at head-node-gcs.company.internal.domain.com:80. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
- Connecting to the head node through ray.init():
$ > python
>>> import ray
a) If I connect without any protocol defined:
>>> ray.init(address='head-node-gcs.company.internal.domain.com:80')
2023-03-18 06:58:11,670 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: head-node-gcs.company.internal.domain.com:80...
2023-03-18 06:58:16,743 WARNING utils.py:1333 -- Unable to connect to GCS at head-node-gcs.company.internal.domain.com:80. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
b) If I connect with the http:// protocol specified:
>>> ray.init(address='http://head-node-gcs.company.internal.domain.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 1230, in init
builder = ray.client(address, _deprecation_warn_enabled=False)
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 382, in client
builder = _get_builder_from_address(address)
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 350, in _get_builder_from_address
assert "ClientBuilder" in dir(
AssertionError: Module: http does not have ClientBuilder.
c) If I connect with the ray:// protocol specified:
>>> ray.init(address='ray://head-node-gcs.company.internal.domain.com')
/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py:253: UserWarning: Ray Client connection timed out. Ensure that the Ray Client port on the head node is reachable from your local machine. See https://docs.ray.io/en/latest/cluster/ray-client.html#step-2-check-ports for more information.
warnings.warn(
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 1248, in init
ctx = builder.connect()
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 178, in connect
client_info_dict = ray.util.client_connect.connect(
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client_connect.py", line 47, in connect
conn = ray.connect(
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/__init__.py", line 252, in connect
conn = self.get_context().connect(*args, **kw_args)
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/__init__.py", line 94, in connect
self.client_worker = Worker(
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py", line 139, in __init__
self._connect_channel()
File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py", line 260, in _connect_channel
raise ConnectionError("ray client connection timeout")
ConnectionError: ray client connection timeout
- The worker-to-head-node connection should work with the URL specified. It works if I give the local IP of the head node:
$ > ray start --address='10.251.222.100:6379'
Local node IP: 10.251.222.101
2023-03-18 07:20:41,943 WARNING services.py:1791 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=2.47gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
[2023-03-18 07:20:41,964 I 115596 115596] global_state_accessor.cc:356: This node has an IP address of 10.251.222.101, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
--------------------
Ray runtime started.
--------------------
To terminate the Ray runtime, run
ray stop
This is the behavior I’m hoping to get from the command ray start --address='head-node-gcs.company.internal.domain.com:80'.
- This is the relevant part of the Service config of the head node:
ports:
  - name: ray-dashboard
    port: 8265
    targetPort: 8265
    protocol: TCP
  - name: ray-gcs
    port: 6379
    targetPort: 6379
    protocol: TCP
  - name: ray-client
    port: 10001
    targetPort: 10001
    protocol: TCP
  - name: ray-serve
    port: 8000
    targetPort: 8000
    protocol: TCP
type: ClusterIP
This is the relevant part of the Ingress config of the head node:
spec:
  rules:
    - host: head-node-dashboard.company.internal.domain.com
      http:
        paths:
          - path: /
            backend:
              serviceName: head-node-svc
              servicePort: 8265
    - host: head-node-gcs.company.internal.domain.com
      http:
        paths:
          - path: /
            backend:
              serviceName: head-node-svc
              servicePort: 6379
    - host: head-node-client.company.internal.domain.com
      http:
        paths:
          - path: /
            backend:
              serviceName: head-node-svc
              servicePort: 10001
    - host: head-node-serve.company.internal.domain.com
      http:
        paths:
          - path: /
            backend:
              serviceName: head-node-svc
              servicePort: 8000
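As far as I understand, the GCS port (6379) and the Ray Client port (10001) speak gRPC rather than plain HTTP, so an HTTP Ingress rule like the ones above may not be able to proxy them. If the cluster happens to run ingress-nginx (an assumption; I don’t actually know which controller our platform uses), the usual way to expose a raw TCP port would be the controller’s tcp-services ConfigMap. The namespace and service names below are placeholders:
# Hypothetical sketch for ingress-nginx: route external TCP ports to the head-node Service.
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx          # placeholder: wherever the controller runs
data:
  "6379": "default/head-node-svc:6379"      # GCS
  "10001": "default/head-node-svc:10001"    # Ray Client
The controller’s own Service would also need to expose those ports; this ConfigMap only covers the routing half.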
Versions / Dependencies
$ > ray --version
ray, version 2.3.0
$ > python --version
Python 3.7.4
$ > uname -a
Linux head-node-659568794c-rwmpk 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 GNU/Linux
Reproduction script
I don’t think this is directly reproducible since I’m running this in a managed Kubernetes environment, but the Service and Ingress configuration snippets provided above should help set up the basic networking.
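A rough sketch of the commands involved, for anyone trying to approximate the setup elsewhere (I haven’t shown our exact head-node startup command; the flags below are just the standard ones I’d expect, and the hostnames are specific to our Ingress):
# On the head-node pod (assumed standard startup flags):
ray start --head --port=6379 --dashboard-host=0.0.0.0 --dashboard-port=8265

# On a worker pod, or any machine that can resolve the internal domain (this is what fails):
ray start --address='head-node-gcs.company.internal.domain.com:80'

# Job submission against the dashboard URL (this part works):
ray job submit --working-dir ./ --address='http://head-node-dashboard.company.internal.domain.com' -- python ./script.py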
Issue Severity
High: It blocks me from completing my task.
About this issue
- State: open
- Created a year ago
- Comments: 21 (9 by maintainers)
I’m facing the same issue as @marrrcin and @jednymslowem described. I’ve used the Getting Started guide to set up a Ray cluster on KiND, did the required port-forwarding, and set the RAY_ADDRESS env variable to point to the right URL (I’m able to access the dashboard), and I get the same error. I would very much appreciate any help; I currently can’t figure this out.
It’s still a valid issue on Ray 2.6.3 (I’m using the kuberay-operator-0.5.0 Helm chart on GKE). Just a basic setup as shown in the quick start guide with port-forward; the code (run in a notebook) fails with the same error.
@jjyao - could you provide an update on this?
@kevin85421 So, we don’t have any alternative to kubectl per se. The way our company has set up K8S is very opinionated. They’ve built a UI that we can access for creating new ‘apps’. These ‘apps’ are essentially a K8S Deployment (pod) + a K8S Service + a K8S Ingress component. The UI also allows us to specify mounted storage locations and hardware resource allocation. Once the ‘app’ is created, though, we do get access to the underlying K8S configuration .yaml for the three components.
Hi @RishabhMalviya, @jjyao, were you able to find a solution to this? I’m running into a similar issue trying to run a Ray Serve application on a remote cluster from my local machine. Even submitting jobs to the dashboard URL doesn’t seem to work, although I am able to see the dashboard just fine if I go to the dashboard URL. My workaround right now is to forward the port with kubectl and then submit a job, roughly as in the sketch below.
The strange thing was that it was working fine about 1.5 weeks ago, but then I came back to the cluster yesterday and I received this error. Deleting the cluster and creating a new one didn’t seem to help.
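For anyone else landing here, a generic version of that port-forward workaround (the service name, namespace, and script are placeholders, not the exact commands from my cluster) looks like:
# Forward the dashboard port from the head-node Service to localhost.
kubectl -n <namespace> port-forward service/head-node-svc 8265:8265

# In another shell, submit against the forwarded port.
ray job submit --working-dir ./ --address='http://127.0.0.1:8265' -- python ./script.py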