ray: [Core] Cannot Connect to Head Node GCS Through URL

What happened + What you expected to happen

  1. I am working in a managed Kubernetes environment. We have a three-node setup (managed K8S Deployment + Service + Ingress): one head node and two worker nodes. Using the Service and Ingress configurations, I expose port 8265 of my container through the (internal) URL http://head-node-dashboard.company.internal.domain.com, and port 6379 through http://head-node-gcs.company.internal.domain.com.

When I try to submit jobs to the dashboard URL, everything works fine:

ray job submit --working-dir ./ --address='http://head-node-dashboard.company.internal.domain.com' -- python ./script.py
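
This presumably works because job submission is a plain HTTP API served on the dashboard port, which an HTTP Ingress can proxy. As a quick sanity check, the version endpoint that (to the best of my knowledge) the job CLI itself probes can be hit directly:

curl http://head-node-dashboard.company.internal.domain.com/api/version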

But when I try to connect to the GCS, it fails. This happens in two ways (see the sketch after this list):

  • Connecting a worker node to the head node with ray start:
$ > ray start --address='head-node-gcs.company.internal.domain.com:80'
Local node IP: 10.251.222.101
2023-03-18 06:51:17,521 WARNING utils.py:1446 -- Unable to connect to GCS at head-node-gcs.company.internal.domain.com:80. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
  • Connecting to the head node through ray.init():
$ > python
>>> import ray

a) If I connect without any protocol defined:

>>> ray.init(address='head-node-gcs.company.internal.domain.com:80')
2023-03-18 06:58:11,670 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: head-node-gcs.company.internal.domain.com:80...
2023-03-18 06:58:16,743 WARNING utils.py:1333 -- Unable to connect to GCS at head-node-gcs.company.internal.domain.com:80. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

b) If I connect with the http:// protocol specified:

>>> ray.init(address='http://head-node-gcs.company.internal.domain.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 1230, in init
    builder = ray.client(address, _deprecation_warn_enabled=False)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 382, in client
    builder = _get_builder_from_address(address)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 350, in _get_builder_from_address
    assert "ClientBuilder" in dir(
AssertionError: Module: http does not have ClientBuilder.

c) If I connect with the ray:// protocol specified:

>>> ray.init(address='ray://head-node-gcs.company.internal.domain.com')
/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py:253: UserWarning: Ray Client connection timed out. Ensure that the Ray Client port on the head node is reachable from your local machine. See https://docs.ray.io/en/latest/cluster/ray-client.html#step-2-check-ports for more information.
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 1248, in init
    ctx = builder.connect()
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 178, in connect
    client_info_dict = ray.util.client_connect.connect(
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client_connect.py", line 47, in connect
    conn = ray.connect(
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/__init__.py", line 252, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/__init__.py", line 94, in connect
    self.client_worker = Worker(
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py", line 139, in __init__
    self._connect_channel()
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py", line 260, in _connect_channel
    raise ConnectionError("ray client connection timeout")
ConnectionError: ray client connection timeout
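
For what it's worth, both failures are consistent with the Ingress only proxying HTTP: ray start speaks gRPC to the GCS port (6379), and ray:// targets the Ray Client server on port 10001 (also gRPC), so neither handshake can pass through an HTTP-only route on port 80. Under that assumption, here is a minimal sketch of paths that should work (the default namespace is hypothetical; head-node-svc is taken from the Ingress config below):

# Inside the cluster, workers can join via the Service DNS name, which carries raw TCP:
ray start --address=head-node-svc.default.svc.cluster.local:6379

# From outside the cluster, tunnel the Ray Client port and connect with ray://:
kubectl port-forward service/head-node-svc 10001:10001 &
python -c 'import ray; ray.init(address="ray://127.0.0.1:10001")'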
  2. The worker-to-head-node connection should work with the URL specified. It works if I give the local IP of the head node:
$ > ray start --address='10.251.222.100:6379'
Local node IP: 10.251.222.101
2023-03-18 07:20:41,943 WARNING services.py:1791 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=2.47gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
[2023-03-18 07:20:41,964 I 115596 115596] global_state_accessor.cc:356: This node has an IP address of 10.251.222.101, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.

--------------------
Ray runtime started.
--------------------

To terminate the Ray runtime, run
  ray stop

This is the behavior I’m hoping to get from the command ray start --address='head-node-gcs.company.internal.domain.com:80'.
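
One way to see the difference between the two addresses is to compare raw TCP reachability with what actually answers on the port (hostnames are the ones from this report):

# Direct GCS port: the TCP connect succeeds, and ray start works against it
nc -vz 10.251.222.100 6379

# Ingress host: the TCP connect also succeeds, but an HTTP server answers,
# so the gRPC handshake that ray start performs presumably never completes
nc -vz head-node-gcs.company.internal.domain.com 80
curl -i http://head-node-gcs.company.internal.domain.com/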

  3. This is the relevant part of the Service config of the head node:
  ports:
  - name: ray-dashboard
    port: 8265
    targetPort: 8265
    protocol: TCP
  - name: ray-gcs
    port: 6379
    targetPort: 6379
    protocol: TCP
  - name: ray-client
    port: 10001
    targetPort: 10001
    protocol: TCP
  - name: ray-serve
    port: 8000
    targetPort: 8000
    protocol: TCP
  type: ClusterIP

This is the relevant part of the Ingress config of the head node:

spec:
  rules:
  - host: head-node-dashboard.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 8265
  - host: head-node-gcs.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 6379
  - host: head-node-client.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 10001
  - host: head-node-serve.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 8000
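
If the root cause is HTTP-only routing, the GCS (6379) and Ray Client (10001) ports need a raw TCP path rather than an HTTP Ingress rule. One common approach with the nginx ingress controller is its TCP services ConfigMap; everything below is a hypothetical sketch (the ingress-nginx and default namespaces are assumed, and the controller must be running with --tcp-services-configmap and publish these ports on its own Service):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  # "external port": "namespace/service:port"
  "6379": "default/head-node-svc:6379"
  "10001": "default/head-node-svc:10001"
EOF

A type: LoadBalancer (or NodePort) Service in front of the head pod would be the other usual way to expose these ports as plain TCP.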

Versions / Dependencies

$ > ray --version
ray, version 2.3.0

$ > python --version
Python 3.7.4

$ > uname -a
Linux head-node-659568794c-rwmpk 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 GNU/Linux

Reproduction script

I don’t think this is reproducible as-is, since I’m running it in a managed Kubernetes environment. But the Service and Ingress configuration snippets provided above should help set up the basic networking.

Issue Severity

High: It blocks me from completing my task.

About this issue

  • State: open
  • Created a year ago
  • Comments: 21 (9 by maintainers)

Most upvoted comments

I’m facing the same issue as @marrrcin and @jednymslowem described. I used the Getting Started Guide to set up a Ray cluster on KinD, did the required port-forwarding, and set the RAY_ADDRESS env variable to point to the right URL (I’m able to access the dashboard), and I get the same error:

2023-09-16 01:16:49,657	INFO worker.py:1313 -- Using address 127.0.0.1:8265 set in the environment variable RAY_ADDRESS
2023-09-16 01:16:49,658	INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 127.0.0.1:8265...
2023-09-16 01:16:54,857	ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-09-16 01:16:54,857	WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 127.0.0.1:8265. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.

I would very much appreciate any help; I currently can’t figure this out.
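
One thing that may be going on here: 127.0.0.1:8265 is the dashboard port, which only serves the HTTP job-submission API, while a bare host:port passed to ray.init() is treated as a GCS address (port 6379 by default). A hedged sketch of the distinction, assuming the port-forwards from the guide and a placeholder service name:

# Job submission wants the HTTP dashboard address:
RAY_ADDRESS=http://127.0.0.1:8265 ray job submit --working-dir ./ -- python ./script.py

# A bare host:port for ray.init() would need the GCS port forwarded instead:
kubectl port-forward service/<head-svc> 6379:6379

Even then, a driver outside the cluster also needs the raylet and object store ports to be reachable, so the Ray Client route (ray:// on port 10001) or job submission is usually the more reliable option from a local machine.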

It’s still a valid issue on Ray 2.6.3 (I’m using the kuberay-operator-0.5.0 Helm chart on GKE). Just a basic setup as shown in the quick start guide, with port-forward and the following code (run in a notebook):

import ray
ray.init("127.0.0.1:8266")

fails with:

2023-08-30 13:01:44,006	INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 127.0.0.1:8266...
2023-08-30 13:01:49,650	ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-08-30 13:01:49,651	WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 127.0.0.1:8266. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.

@jjyao - could you provide some update on this?

@kevin85421 So, we don’t have any alternative to kubectl per se. The way our company has set up K8S is very opinionated.

They’ve built a UI that we can access for creating new ‘apps’. These ‘apps’ are essentially a K8S Deployment (pod) + a K8S Service + a K8S Ingress component. The UI also allows us to specify mounted storage locations and hardware resource allocation.

Once the ‘app’ is created though, we do get access to the underlying K8S configuration .yaml for the three components.

Hi @RishabhMalviya, @jjyao, were you able to find a solution to this? I’m running into a similar issue trying to run a Ray Serve application on a remote cluster from my local machine. Even submitting jobs to the dashboard URL doesn’t seem to work, although I can see the dashboard just fine if I go to the dashboard URL. My workaround right now is to forward the port with kubectl and then submit a job, like the following:

kubectl port-forward service/raycluster-service-name-head-svc 8265:8265
ray job submit --address http://localhost:8265 --working-dir="./" -- serve run --host="0.0.0.0" --working-dir="./" --non-blocking model_file:model

The strange thing was that it was working fine about 1.5 weeks ago, but then I came back to the cluster yesterday and received this error. Deleting the cluster and creating a new one didn’t seem to help.
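
For interactive connections rather than job submission, the analogous workaround, sketched under the same assumption about the service name, would be to tunnel the Ray Client port:

kubectl port-forward service/raycluster-service-name-head-svc 10001:10001 &
python -c 'import ray; ray.init(address="ray://127.0.0.1:10001")'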