kuberay: [Bug] Autoscaler deployment fails - reports Forbidden access (403) to Kubernetes API

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I am unable to deploy an autoscaling Ray cluster to AKS. Our Ops team has deployed the KubeRay operator, and we added enableInTreeAutoscaling: true to the RayCluster manifest, but the deployment fails: the autoscaler sidecar errors out and the head pod reports CrashLoopBackOff. We are deploying to a single namespace, and our deployment service accounts only have access to that namespace. We are quite new to k8s, KubeRay, and Ray clusters in general, so it is entirely possible we have made a mistake along the way. After some digging through the code (here and here), it looks like the service account the autoscaler uses (created by the Ray deployment) in that single namespace does not have permission for https://kubernetes.default:443/api/v1/apis/ray.io/v1alpha1/namespace/apm0005738-sb/rayclusters, because it cannot do this at the cluster scope. I accessed the pods directly and confirmed this with a curl command (having grabbed the auth bearer token from /var/run/secrets/kubernetes.io/serviceaccount/token), which reports:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "apis \"ray.io\" is forbidden: User \"system:serviceaccount:apm0005738-sb:forecasting-raycluster\" cannot get resource \"apis/v1alpha1\" in API group \"\" at the cluster scope",
  "reason": "Forbidden",
  "details": {
    "name": "ray.io",
    "kind": "apis"
  },
  "code": 403,
}
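For context: as we understand it, the operator is supposed to create a ServiceAccount, Role, and RoleBinding for the autoscaler automatically when enableInTreeAutoscaling is set. The commands below are only a rough sketch of how we have been checking (and temporarily granting) the namespace-scoped access ourselves; the role/binding name raycluster-autoscaler-test is made up for illustration, and the verb list is our guess at what the autoscaler needs.

# check whether the autoscaler's service account can read the RayCluster CR in its own namespace
# (impersonation with --as requires that your own user has impersonate permission)
kubectl -n apm0005738-sb auth can-i get rayclusters.ray.io \
  --as=system:serviceaccount:apm0005738-sb:forecasting-raycluster

# if not, a namespace-scoped Role/RoleBinding along these lines should let the autoscaler
# read and patch the RayCluster (names and verbs are hypothetical, not what the operator creates)
kubectl -n apm0005738-sb create role raycluster-autoscaler-test \
  --verb=get,list,watch,patch --resource=rayclusters.ray.io
kubectl -n apm0005738-sb create rolebinding raycluster-autoscaler-test \
  --role=raycluster-autoscaler-test \
  --serviceaccount=apm0005738-sb:forecasting-raycluster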

Reproduction script

I believe our Ops team deployed the KubeRay operator at namespace scope only, but they used their own deployment service accounts to do so, and those accounts do have permission to deploy at the cluster scope (which was needed for the CRDs). I believe the Ops team ran the following to deploy the KubeRay operator: helm install kuberay-operator kuberay/kuberay-operator --version 0.4.0 --namespace apm0005738-sb
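To double-check whether the operator install ended up namespace-scoped or cluster-scoped, something like the following should show which RBAC objects the chart actually created (assuming the chart's default resource names contain "kuberay"):

kubectl -n apm0005738-sb get role,rolebinding,serviceaccount | grep -i kuberay
kubectl get clusterrole,clusterrolebinding | grep -i kuberay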

The RayCluster manifest we are using is as follows; it was deployed using service accounts with permissions over only the namespace:

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"   
  name: forecasting-raycluster
  namespace: apm0005738-sb
spec:
  rayVersion: '2.3.0'
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default
    idleTimeoutSeconds: 60
    imagePullPolicy: Always
    securityContext: {}
    env: []
    envFrom: []
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  # Ray head pod configuration
  headGroupSpec:
    # Kubernetes Service Type. This is an optional field, and the default value is ClusterIP.
    serviceType: NodePort # ClusterIP
    # the following params are used to complete the ray start command: ray start --head --block --dashboard-host='0.0.0.0' ...
    rayStartParams:
      dashboard-host: '0.0.0.0'
      block: 'true'
    # pod template
    template:
      metadata:
        # Custom labels. NOTE: To avoid conflicts with the KubeRay operator, do not define custom labels that start with `raycluster`.
        # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
        labels: {}
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.3.0 
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8000
            name: serve
          - containerPort: 52365
            name: dashboard-agent                                   
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
          resources:
            limits:
              cpu: "4"
              memory: "8G"
            requests: # limits and requests should be equal
              cpu: "4"
              memory: "8G"
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
  # the number of pod replicas in this worker group
  - replicas: 1
    minReplicas: 1
    maxReplicas: 3 
    # logical group name; here it is ray-worker-group, but any functional name works
    groupName: ray-worker-group
    # the following params are used to complete the ray start command: ray start --block
    rayStartParams:
      block: 'true'
    #pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.3.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          # Optional: volume mounts.
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
          # The resource requests and limits in this config are too small for production!
          # For an example with more realistic resource configuration, see
          # ray-cluster.autoscaler.large.yaml.
          # It is better to use a few large Ray pods than many small ones.
          # For production, it is ideal to size each Ray pod to take up the
          # entire Kubernetes node on which it is scheduled.
          resources:
            limits:
              cpu: "4"
              memory: "8G"
            requests:
              cpu: "4"
              memory: "8G"
        initContainers:
        # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
        - name: init
          image: busybox:1.28
          # Change the cluster domain suffix (.svc.cluster.local) if your cluster does not use the default
          command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for K8s Service $RAY_IP; sleep 2; done"]
          # command: ['sh', '-c', "until nslookup $RAY_IP.apm0005738-sb.svc.cluster.local; do echo waiting for K8s Service $RAY_IP; sleep 2; done"]          
          # Special requirement at Dow: container resource limits must be supplied
          resources:
            limits:
              cpu: "1"
              memory: "1G"
            requests:
              cpu: "1"
              memory: "1G"
        # use volumes
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
          - name: ray-logs
            emptyDir: {}
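After applying this manifest, a quick way to confirm the operator actually injected the autoscaler sidecar and to see which service account the head pod runs as (using our head pod's name) is something like:

kubectl -n apm0005738-sb get pod forecasting-raycluster-head-x7vlv \
  -o jsonpath='{.spec.serviceAccountName}{"\n"}'
kubectl -n apm0005738-sb get pod forecasting-raycluster-head-x7vlv \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{end}'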

Anything else

This issue seems closely related to https://github.com/ray-project/kuberay/issues/924, but ours occurs on first deployment rather than when attempting a cluster update/change, so we felt it perhaps warranted its own issue.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (2 by maintainers)

Most upvoted comments

As it turns out, I was using an incorrect curl command above; the correct one is: curl -H "Authorization: Bearer $TOKEN" --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt https://kubernetes.default:443/apis/ray.io/v1alpha1/namespaces/apm0005738-sb/rayclusters/forecasting-raycluster

Previously I had /api/v1/apis/ray.io/v1alpha1/..., but that was wrong; I needed /apis/ray.io/v1alpha1/.... Querying the correct endpoint does return what looks like the correct data (the RayCluster manifest).
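For anyone hitting the same thing, here is that check written out with the standard in-cluster service-account paths (just a sketch; the paths are the conventional mount locations and the cluster name is ours):

# run from inside the head pod
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
NS=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)
curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer $TOKEN" \
  "https://kubernetes.default:443/apis/ray.io/v1alpha1/namespaces/$NS/rayclusters/forecasting-raycluster"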

I think this is what you are looking for …

❯ kubectl -n apm0005738-sb logs pod/forecasting-raycluster-head-x7vlv -c autoscaler
E0315 09:20:58.096963   98231 memcache.go:255] couldn't get resource list for external.metrics.k8s.io/v1beta1: Got empty response for: external.metrics.k8s.io/v1beta1
The Ray head is ready. Starting the autoscaler.
2023-03-15 14:20:30,769	INFO monitor.py:167 -- session_name: session_2023-03-15_14-18-06_679719_1
2023-03-15 14:20:30,771	INFO monitor.py:198 -- Starting autoscaler metrics server on port 44217
2023-03-15 14:20:30,772	INFO monitor.py:218 -- Monitor: Started
2023-03-15 14:20:30,797	ERROR monitor.py:503 -- Error in monitor loop
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 414, in connect
    self.sock = ssl_wrap_socket(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/local/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/local/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1alpha1/namespaces/apm0005738-sb/rayclusters/forecasting-raycluster (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 547, in run
    self._initialize_autoscaler()
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 233, in _initialize_autoscaler
    self.autoscaler = StandardAutoscaler(
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 247, in __init__
    self.reset(errors_fatal=True)
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 1107, in reset
    raise e
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 1024, in reset
    new_config = self.config_reader()
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 59, in __call__
    ray_cr = self._fetch_ray_cr_from_k8s_with_retries()
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 71, in _fetch_ray_cr_from_k8s_with_retries
    return self._fetch_ray_cr_from_k8s()
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 85, in _fetch_ray_cr_from_k8s
    result = requests.get(
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 563, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1alpha1/namespaces/apm0005738-sb/rayclusters/forecasting-raycluster (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')))
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 414, in connect
    self.sock = ssl_wrap_socket(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/local/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/local/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1alpha1/namespaces/apm0005738-sb/rayclusters/forecasting-raycluster (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2422, in main
    return cli()
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2168, in kuberay_autoscaler
    run_kuberay_autoscaler(cluster_name, cluster_namespace)
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/run_autoscaler.py", line 64, in run_kuberay_autoscaler
    ).run()
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 547, in run
    self._initialize_autoscaler()
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 233, in _initialize_autoscaler
    self.autoscaler = StandardAutoscaler(
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 247, in __init__
    self.reset(errors_fatal=True)
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 1107, in reset
    raise e
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/autoscaler.py", line 1024, in reset
    new_config = self.config_reader()
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 59, in __call__
    ray_cr = self._fetch_ray_cr_from_k8s_with_retries()
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 71, in _fetch_ray_cr_from_k8s_with_retries
    return self._fetch_ray_cr_from_k8s()
  File "/usr/local/lib/python3.10/site-packages/ray/autoscaler/_private/kuberay/autoscaling_config.py", line 85, in _fetch_ray_cr_from_k8s
    result = requests.get(
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 563, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='kubernetes.default', port=443): Max retries exceeded with url: /apis/ray.io/v1alpha1/namespaces/apm0005738-sb/rayclusters/forecasting-raycluster (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')))
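The CERTIFICATE_VERIFY_FAILED in that trace suggests the certificate presented at kubernetes.default does not chain to a CA the autoscaler process trusts (for example because of a TLS-intercepting proxy or a custom CA in our environment), even though the curl with --cacert above worked. A quick way to inspect this from inside the autoscaler container, assuming openssl is available in the image, is something like:

# show who actually issued the certificate the pod sees for the API server
openssl s_client -connect kubernetes.default:443 -showcerts </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer
# and check whether that certificate verifies against the mounted service-account CA bundle
openssl s_client -connect kubernetes.default:443 \
  -CAfile /var/run/secrets/kubernetes.io/serviceaccount/ca.crt </dev/null 2>/dev/null \
  | grep "Verify return code"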

Let me know if you need anything else. Thanks!