kuberay: [Bug] Quickstart example is not working.

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

Others

What happened + What you expected to happen

At first glance this appears to be more of a Ray issue, but it nevertheless concerns the KubeRay and Kubernetes documentation.

I followed the KubeRay docs and was able to use the autoscaler, which successfully added more workers.

NAMESPACE     NAME                                           READY   STATUS    RESTARTS   AGE
default       raycluster-complete-head-xc9pg                 2/2     Running   0          13m
default       raycluster-complete-worker-small-group-fvfn4   1/1     Running   0          10m
default       raycluster-complete-worker-small-group-g9bk5   1/1     Running   0          10m
default       raycluster-complete-worker-small-group-jzjs4   1/1     Running   0          14m

Operator

ray-system    kuberay-operator-74455ff8c6-5qdrx              1/1     Running   0          26m

Surprisingly, there was nothing for autoscaling; was this expected?

I was able to successfully port-forward the head node service, for both the dashboard (:8265) and the Ray client server port (:10001).

After this, since there wasn't any working example from KubeRay, I tried to use the Kubernetes documentation to run a program.

I started with Running Ray programs with Ray Jobs Submission, using the sample script.py.

This resulted in the following error:

$ ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.26.0"]}' -- "python script.py"
Job submission server address: http://127.0.0.1:8265
2022-06-02 00:34:53,279    ERROR packaging.py:84 -- Issue with path: /Users/gauta/.docker/run/docker-cli-api.sock
Traceback (most recent call last):
  File "/usr/local/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/ray/scripts/scripts.py", line 1958, in main
    return cli()
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/cli.py", line 151, in job_submit
    job_id = client.submit_job(
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/sdk.py", line 255, in submit_job
    self._upload_working_dir_if_needed(runtime_env)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/sdk.py", line 232, in _upload_working_dir_if_needed
    package_uri = self._upload_package_if_needed(
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/sdk.py", line 211, in _upload_package_if_needed
    package_uri = get_uri_for_directory(package_path, excludes=excludes)
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 333, in get_uri_for_directory
    hash_val = _hash_directory(directory, directory,
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 121, in _hash_directory
    _dir_travel(root, excludes, handler, logger=logger)
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 88, in _dir_travel
    _dir_travel(sub_path, excludes, handler, logger=logger)
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 88, in _dir_travel
    _dir_travel(sub_path, excludes, handler, logger=logger)
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 88, in _dir_travel
    _dir_travel(sub_path, excludes, handler, logger=logger)
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 85, in _dir_travel
    raise e
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 82, in _dir_travel
    handler(path)
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 112, in handler
    with path.open("rb") as f:
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/pathlib.py", line 1252, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/pathlib.py", line 1120, in _opener
    return self._accessor.open(self, flags, mode)
OSError: [Errno 102] Operation not supported on socket: '/Users/gauta/.docker/run/docker-cli-api.sock'

Then I tried Using Ray Client to connect from within the Kubernetes cluster.

However, this also resulted in the following error:

$ python3 ray/doc/kubernetes/example_scripts/run_local_example.py 
Traceback (most recent call last):
  File "/Users/gauta/ray/doc/kubernetes/example_scripts/run_local_example.py", line 60, in <module>
    ray.init(f"ray://127.0.0.1:{LOCAL_PORT}")
  File "/usr/local/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/worker.py", line 800, in init
    return builder.connect()
  File "/usr/local/lib/python3.9/site-packages/ray/client_builder.py", line 151, in connect
    client_info_dict = ray.util.client_connect.connect(
  File "/usr/local/lib/python3.9/site-packages/ray/util/client_connect.py", line 33, in connect
    conn = ray.connect(
  File "/usr/local/lib/python3.9/site-packages/ray/util/client/__init__.py", line 228, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/usr/local/lib/python3.9/site-packages/ray/util/client/__init__.py", line 88, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/util/client/worker.py", line 697, in _server_init
    raise ConnectionAbortedError(
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 649, in Datapath
    modified_init_req, job_config = prepare_runtime_init_req(init_req)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 571, in prepare_runtime_init_req
    job_config = pickle.loads(req.job_config)
AttributeError: Can't get attribute 'ParsedRuntimeEnv' on <module 'ray._private.runtime_env.validation' from '/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/validation.py'>

Any help would be appreciated.

Reproduction script

Provided details in above section.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 18 (5 by maintainers)

Commits related to this issue

Most upvoted comments

Thanks, I was able to run both options: submitting with ray job submit and connecting to the RayCluster.

Feel free to close this issue. Thanks @architkulkarni and @DmitriGekhtman for help.

I’ve also created a small PR https://github.com/ray-project/ray/compare/master...goswamig:patch-1

The URI issue might be the same as this one: https://github.com/ray-project/ray/issues/23423

According to the reports on that issue, if you change the contents of the working_dir in some way (e.g. by adding or editing a file), the error is likely to go away; can you let us know if that works? The issue should be fixed in the nightly wheels, if you're willing to try them out (installing them both on your local machine and on the cluster). (We're not yet fully confident that our fix fully addresses the issue, so the issue is still open.)

Another workaround, if you don't need working_dir, is to omit the working_dir field entirely.
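As an illustrative sketch (not the exact fix, and assuming script.py already exists on the cluster image), the runtime-env payload without working_dir would look like this, so the client never tries to hash or upload the local directory:

```python
import json

# Hedged sketch: build a --runtime-env-json payload that omits working_dir
# entirely, keeping only the pip dependencies. With no working_dir, the
# client skips the local directory traversal that hit the .sock file.
runtime_env = {"pip": ["requests==2.26.0"]}  # note: no "working_dir" key
payload = json.dumps(runtime_env)

# The resulting submit command (illustrative only):
cmd = f"ray job submit --runtime-env-json='{payload}' -- python script.py"
print(cmd)
```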

I started with Running Ray programs with Ray Jobs Submission with sample script.py

It looks like the runtime env machinery hit an issue with a file in your local directory. Could you try again with a directory that doesn't contain a socket file, perhaps an empty directory?
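One way to do that, sketched below under the assumption that script.py sits in the current directory, is to stage only the entrypoint into a fresh directory so the packager never walks unrelated files (such as Docker's .sock files under the home directory):

```python
import pathlib
import tempfile

# Hedged sketch: copy just the entrypoint into a clean staging directory.
src = pathlib.Path("script.py")  # hypothetical entrypoint in the current dir
staging = pathlib.Path(tempfile.mkdtemp(prefix="ray-job-"))
if src.exists():  # copy the real script when present
    (staging / src.name).write_bytes(src.read_bytes())

# Then submit from the clean directory (illustrative command only):
cmd = (f"cd {staging} && ray job submit "
       f"--runtime-env-json='{{\"working_dir\": \"./\"}}' -- python {src.name}")
print(cmd)
```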

@architkulkarni @edoakes we could probably make the UX smoother here by skipping over .sock files or adding better error handling for files that can't be processed.
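This is not Ray's actual packaging code, but a minimal sketch of how a directory traversal could detect Unix domain sockets and skip them (assumes a Unix-like OS for AF_UNIX):

```python
import os
import pathlib
import socket
import stat
import tempfile

def is_unix_socket(path: pathlib.Path) -> bool:
    """True for Unix domain sockets, which can't be opened like regular files."""
    return stat.S_ISSOCK(os.stat(path, follow_symlinks=False).st_mode)

tmp = pathlib.Path(tempfile.mkdtemp())

# A regular file: safe for the directory hasher to include.
regular = tmp / "data.txt"
regular.write_text("hello")

# A Unix domain socket: the kind of file that triggered the OSError above.
sock_path = tmp / "demo.sock"
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.bind(str(sock_path))

print(is_unix_socket(regular))    # False
print(is_unix_socket(sock_path))  # True
```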

Then I tried Using Ray Client to connect from within the Kubernetes cluster. However, this also resulted in the following error.

The good news is that the connection was successfully established! The bad news is that Ray client is extremely sensitive to Ray version mismatches: the Ray versions on the client and server side must match exactly. I suspect there is a mismatch in your situation.
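A hedged pre-flight sketch of that check, with the two version strings collected yourself, e.g. locally via `python -c "import ray; print(ray.__version__)"` and on the cluster via `kubectl exec <head-pod> -- python -c "import ray; print(ray.__version__)"` (the pod name is whatever `kubectl get pods` shows for your head node):

```python
def versions_match(client_version: str, server_version: str) -> bool:
    # Ray client needs an exact match, not just agreement on major.minor.
    return client_version.strip() == server_version.strip()

print(versions_match("1.12.1", "1.12.1"))  # True
print(versions_match("1.12.1", "1.13.0"))  # False
```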

Thanks for posting this bug report.

Surprisingly, there was nothing for autoscaling; was this expected?

The autoscaler for each Ray cluster runs as a sidecar container in the Ray head pod. The output of kubectl get pod <ray-head> will show two containers, and kubectl get pod <ray-head> -o yaml will show the full details of the pod.

I will follow up soon to help address the problems encountered here.