ray: Unable to access Ray dashboard when running Ray on Kubernetes

What is the problem?

Unable to access the Ray dashboard when running Ray on a Kubernetes cluster.

Ray version and other system information (Python version, TensorFlow version, OS): latest rayproject/autoscaler docker image

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

I just added containerPort: 8265 to expose port 8265 on the head deployment. When I get the IP of the head pod using kubectl get deployments -o wide and then go to that ip:8265, I get "This site can't be reached".

Here’s the ray-cluster.yaml:

# Ray head node service, allowing worker pods to discover the head node.
apiVersion: v1
kind: Service
metadata:
  namespace: ray
  name: ray-head
spec:
  ports:
    # Redis ports.
    - name: redis-primary
      port: 6379
      targetPort: 6379
    - name: redis-shard-0
      port: 6380
      targetPort: 6380
    - name: redis-shard-1
      port: 6381
      targetPort: 6381

    # Ray internal communication ports.
    - name: object-manager
      port: 12345
      targetPort: 12345
    - name: node-manager
      port: 12346
      targetPort: 12346
  selector:
    component: ray-head
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: ray
  name: ray-head
spec:
  # Do not change this - Ray currently only supports one head node per cluster.
  replicas: 1
  selector:
    matchLabels:
      component: ray-head
      type: ray
  template:
    metadata:
      labels:
        component: ray-head
        type: ray
    spec:
      # If the head node goes down, the entire cluster (including all worker
      # nodes) will go down as well. If you want Kubernetes to bring up a new
      # head node in this case, set this to "Always," else set it to "Never."
      restartPolicy: Always

      # This volume allocates shared memory for Ray to use for its plasma
      # object store. If you do not provide this, Ray will fall back to
      # /tmp, which can cause slowdowns if it is not a shared memory volume.
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      containers:
        - name: ray-head
          image: rayproject/autoscaler
          # imagePullPolicy: Never
          imagePullPolicy: Always
          command: [ "/bin/bash", "-c", "--" ]
          args: 
            - "ray start --head --node-ip-address=$MY_POD_IP --port=6379 --redis-shard-ports=6380,6381 --num-cpus=$MY_CPU_REQUEST \
            --object-manager-port=12345 --node-manager-port=12346 --dashboard-port=8265 --redis-password=password --block"
          ports:
            - containerPort: 6379 # Redis port.
            - containerPort: 6380 # Redis port.
            - containerPort: 6381 # Redis port.
            - containerPort: 12345 # Ray internal communication.
            - containerPort: 12346 # Ray internal communication.
            - containerPort: 8265 # Ray dashboard

          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which can cause slowdowns if it is not a shared memory volume.
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          env:
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP

            # This is used in the ray start command so that Ray can spawn the
            # correct number of processes. Omitting this may lead to degraded
            # performance.
            - name: MY_CPU_REQUEST
              value: "3"
            #   valueFrom:
            #    resourceFieldRef:
            #      resource: requests.cpu
          resources:
            requests:
              cpu: 100m
              memory: 512Mi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: ray
  name: ray-worker
spec:
  # Change this to scale the number of worker nodes started in the Ray cluster.
  replicas: 1
  selector:
    matchLabels:
      component: ray-worker
      type: ray
  template:
    metadata:
      labels:
        component: ray-worker
        type: ray
    spec:
      restartPolicy: Always
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      containers:
      - name: ray-worker
        image: rayproject/autoscaler
        # imagePullPolicy: Never
        imagePullPolicy: Always
        command: ["/bin/bash", "-c", "--"]
        args:
          - "ray start --node-ip-address=$MY_POD_IP --num-cpus=$MY_CPU_REQUEST --address=$RAY_HEAD_SERVICE_HOST:$RAY_HEAD_SERVICE_PORT_REDIS_PRIMARY \
          --object-manager-port=12345 --node-manager-port=12346 --redis-password=password --block"
        ports:
          - containerPort: 12345 # Ray internal communication.
          - containerPort: 12346 # Ray internal communication.
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP

          # This is used in the ray start command so that Ray can spawn the
          # correct number of processes. Omitting this may lead to degraded
          # performance.
          - name: MY_CPU_REQUEST
            value: "3"
            # valueFrom:
            #  resourceFieldRef:
            #    resource: requests.cpu
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
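
For reference, pod IPs are listed by kubectl get pods rather than kubectl get deployments; a sketch of inspecting the head pod, assuming the ray namespace and labels from the manifest above (note that pod IPs are generally reachable only from inside the cluster, which is why the port-forward approach mentioned in the comments below is the usual way to reach the dashboard):

kubectl -n ray get pods -l component=ray-head -o wide    # shows the head pod and its pod IP
kubectl -n ray describe deployment ray-head              # lists the container ports, including 8265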

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 24 (14 by maintainers)

Most upvoted comments

still getting this

I believe this is not fixed. Even with -p 8265 on the docker run command, it doesn't work on Mac with standalone Docker (not Kubernetes):

docker run --shm-size=100M -t --tty --interactive -p 8265 rayproject/ray
(base) ray@8d3972dc8dbc:/$ ray start --head
Local node IP: 172.17.0.2
2021-01-15 10:06:00,245 INFO services.py:1173 -- View the Ray dashboard at http://localhost:8265

=> not able to see the dashboard…

@mfitton please reopen

For anyone having problems accessing the Ray dashboard with Docker, setting the option --dashboard-host=0.0.0.0 solved it for me.

By default, that option is localhost, and localhost inside a Docker container refers to the container itself (not the host system). Setting it to 0.0.0.0 opens the dashboard on all interfaces.
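
Applied to the ray-cluster.yaml in this issue, that means adding the flag to the head container's ray start command; a sketch, reusing the other flag values from the manifest above:

ray start --head --node-ip-address=$MY_POD_IP --port=6379 --redis-shard-ports=6380,6381 \
  --num-cpus=$MY_CPU_REQUEST --object-manager-port=12345 --node-manager-port=12346 \
  --dashboard-port=8265 --dashboard-host=0.0.0.0 --redis-password=password --block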

To view the dashboard, run the following locally (outside the cluster): kubectl port-forward ray-head 8265:8265, then open http://localhost:8265 in a browser.

Check out the docs for the recommended way of using Ray on K8s.
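
With the Deployment from the manifest above, the head pod gets a generated name, so a rough equivalent is to forward to the deployment directly (a sketch; -n ray matches the manifest's namespace):

kubectl -n ray port-forward deployment/ray-head 8265:8265
# then open http://localhost:8265 in a browser

Forwarding to the deployment avoids having to look up the generated pod name first.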

@danlg It should work. Try this:

docker run --shm-size=100M -t --tty --interactive -p 8265:8265 rayproject/ray

(base) ray@9b7d8802e3a2:/$ ray start  --head  --dashboard-host=0.0.0.0

View the Ray dashboard at http://localhost:8265

@Vysybyl I believe the issue you're seeing is the one that #11313 is tracking. I'm working on a fix at the moment. As for the issue that @ankur6ue mentioned, I'm not sure what the cause is exactly. Could you provide additional information such as the dashboard log and error log? They are at /tmp/ray/session_latest/logs/dashboard.err and /tmp/ray/session_latest/logs/dashboard.out.
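
When the cluster is running on Kubernetes, those files can be read from the head pod without opening a shell; a sketch, again assuming the ray namespace and ray-head Deployment from the manifest above:

kubectl -n ray exec deploy/ray-head -- cat /tmp/ray/session_latest/logs/dashboard.err
kubectl -n ray exec deploy/ray-head -- cat /tmp/ray/session_latest/logs/dashboard.out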