ray: Unable to access Ray dashboard when running Ray on Kubernetes
What is the problem?
Unable to access the Ray dashboard when running Ray on a Kubernetes cluster.
Ray version and other system information (Python version, TensorFlow version, OS): latest rayproject/autoscaler docker image
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
I just added containerPort: 8265 to expose port 8265 on the head deployment. When I get the IP of the head pod using kubectl get deployments -o wide and then go to that ip:8265, I get "This site can’t be reached"…
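For reference, a rough sketch of the steps described above (using get pods here, since -o wide on pods lists the pod IP; the ray namespace matches the YAML below, and the placeholder IP is an assumption):

    kubectl -n ray get pods -o wide        # note the head pod's IP
    # then open http://<head-pod-ip>:8265 in a browser -> "This site can't be reached"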
Here’s the ray-cluster.yaml:
# Ray head node service, allowing worker pods to discover the head node.
apiVersion: v1
kind: Service
metadata:
  namespace: ray
  name: ray-head
spec:
  ports:
    # Redis ports.
    - name: redis-primary
      port: 6379
      targetPort: 6379
    - name: redis-shard-0
      port: 6380
      targetPort: 6380
    - name: redis-shard-1
      port: 6381
      targetPort: 6381
    # Ray internal communication ports.
    - name: object-manager
      port: 12345
      targetPort: 12345
    - name: node-manager
      port: 12346
      targetPort: 12346
  selector:
    component: ray-head
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: ray
  name: ray-head
spec:
  # Do not change this - Ray currently only supports one head node per cluster.
  replicas: 1
  selector:
    matchLabels:
      component: ray-head
      type: ray
  template:
    metadata:
      labels:
        component: ray-head
        type: ray
    spec:
      # If the head node goes down, the entire cluster (including all worker
      # nodes) will go down as well. If you want Kubernetes to bring up a new
      # head node in this case, set this to "Always," else set it to "Never."
      restartPolicy: Always
      # This volume allocates shared memory for Ray to use for its plasma
      # object store. If you do not provide this, Ray will fall back to
      # /tmp, which can cause slowdowns if it is not a shared memory volume.
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
      containers:
        - name: ray-head
          image: rayproject/autoscaler
          # imagePullPolicy: Never
          imagePullPolicy: Always
          command: [ "/bin/bash", "-c", "--" ]
          args:
            - "ray start --head --node-ip-address=$MY_POD_IP --port=6379 --redis-shard-ports=6380,6381 --num-cpus=$MY_CPU_REQUEST \
              --object-manager-port=12345 --node-manager-port=12346 --dashboard-port 8265 --redis-password=password --block"
          ports:
            - containerPort: 6379 # Redis port.
            - containerPort: 6380 # Redis port.
            - containerPort: 6381 # Redis port.
            - containerPort: 12345 # Ray internal communication.
            - containerPort: 12346 # Ray internal communication.
            - containerPort: 8265 # Ray dashboard.
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp, which can cause slowdowns if it is not a shared memory volume.
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          env:
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            # This is used in the ray start command so that Ray can spawn the
            # correct number of processes. Omitting this may lead to degraded
            # performance.
            - name: MY_CPU_REQUEST
              value: "3"
              # valueFrom:
              #   resourceFieldRef:
              #     resource: requests.cpu
          resources:
            requests:
              cpu: 100m
              memory: 512Mi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: ray
  name: ray-worker
spec:
  # Change this to scale the number of worker nodes started in the Ray cluster.
  replicas: 1
  selector:
    matchLabels:
      component: ray-worker
      type: ray
  template:
    metadata:
      labels:
        component: ray-worker
        type: ray
    spec:
      restartPolicy: Always
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
      containers:
        - name: ray-worker
          image: rayproject/autoscaler
          # imagePullPolicy: Never
          imagePullPolicy: Always
          command: ["/bin/bash", "-c", "--"]
          args:
            - "ray start --node-ip-address=$MY_POD_IP --num-cpus=$MY_CPU_REQUEST --address=$RAY_HEAD_SERVICE_HOST:$RAY_HEAD_SERVICE_PORT_REDIS_PRIMARY \
              --object-manager-port=12345 --node-manager-port=12346 --redis-password=password --block"
          ports:
            - containerPort: 12345 # Ray internal communication.
            - containerPort: 12346 # Ray internal communication.
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          env:
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            # This is used in the ray start command so that Ray can spawn the
            # correct number of processes. Omitting this may lead to degraded
            # performance.
            - name: MY_CPU_REQUEST
              value: "3"
              # valueFrom:
              #   resourceFieldRef:
              #     resource: requests.cpu
          resources:
            requests:
              cpu: 100m
              memory: 512Mi
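For completeness, a minimal sketch of how the config above is applied and how the dashboard can then be reached from outside the cluster (assuming the ray namespace already exists and the head was started with --dashboard-host=0.0.0.0, as discussed in the comments below; the pod name shown is a placeholder):

    kubectl apply -f ray-cluster.yaml
    kubectl -n ray get pods                              # wait for the head pod to be Running
    kubectl -n ray port-forward <head-pod-name> 8265:8265
    # then open http://localhost:8265 in a browser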
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 24 (14 by maintainers)
still getting this
I believe this is not fixed. Even with -p 8265 on the docker run command it doesn’t work on Mac with standalone Docker (not Kubernetes):

docker run --shm-size=100M -t --tty --interactive -p 8265 rayproject/ray
(base) ray@8d3972dc8dbc:/$ ray start --head
Local node IP: 172.17.0.2
2021-01-15 10:06:00,245 INFO services.py:1173 -- View the Ray dashboard at http://localhost:8265

=> not able to see the dashboard…
@mfitton please reopen
For anyone with problems accessing the Ray Dashboard with Docker, setting the option --dashboard-host=0.0.0.0 solved it for me. By default, that option is localhost, and localhost inside a Docker container is the container itself (not the host system). So 0.0.0.0 will open the dashboard on all interfaces.

To view the dashboard, you need to run the following locally (outside the cluster):

kubectl port-forward ray-head 8265:8265

then open http://localhost:8265 in a browser. Check out the docs for the recommended way of using Ray on K8s.
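To make that concrete for the standalone-Docker case above, a rough sketch (the explicit host port mapping 8265:8265 is an assumption, not output quoted in this thread):

    docker run --shm-size=100M -it -p 8265:8265 rayproject/ray
    # inside the container:
    ray start --head --dashboard-host=0.0.0.0
    # then open http://localhost:8265 on the host machine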
@danlg It should work. Try this:
View the Ray dashboard at http://localhost:8265
@Vysybyl I believe the issue you’re seeing is one that #11313 is tracking. I’m working on a fix at the moment. As far as the issue that @ankur6ue mentioned, I’m not sure what the cause is exactly. Could you provide any additional information, like the dashboard log and error log? They’re at /tmp/ray/session_latest/logs/dashboard.err and /tmp/ray/session_latest/logs/dashboard.out.
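For the Kubernetes setup from the YAML above, one way to pull those logs is via kubectl exec against the head deployment (a sketch; the namespace and deployment name follow the config earlier in this issue):

    kubectl -n ray exec deploy/ray-head -- cat /tmp/ray/session_latest/logs/dashboard.err
    kubectl -n ray exec deploy/ray-head -- cat /tmp/ray/session_latest/logs/dashboard.out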