kuberay: [Bug] Quickstart example is not working.
Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
Others
What happened + What you expected to happen
At first this appears to be more of a Ray issue, but it is nevertheless part of the KubeRay and Kubernetes documentation.
I followed the KubeRay docs and was able to use the autoscaler, which successfully added more workers.
NAMESPACE NAME READY STATUS RESTARTS AGE
default raycluster-complete-head-xc9pg 2/2 Running 0 13m
default raycluster-complete-worker-small-group-fvfn4 1/1 Running 0 10m
default raycluster-complete-worker-small-group-g9bk5 1/1 Running 0 10m
default raycluster-complete-worker-small-group-jzjs4 1/1 Running 0 14m
Operator
ray-system kuberay-operator-74455ff8c6-5qdrx 1/1 Running 0 26m
Surprisingly, there was nothing for autoscaling. Was this expected?
I was able to successfully port-forward the head node service for the dashboard (port 8265)
and the Ray client server (port 10001).
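For reference, a sketch of the port-forward commands used (assuming the default head service name raycluster-complete-head-svc that the operator creates for this cluster):
$ kubectl port-forward service/raycluster-complete-head-svc 8265:8265    # dashboard / job submission server
$ kubectl port-forward service/raycluster-complete-head-svc 10001:10001  # Ray client server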
After this, since there wasn't a working example in the KubeRay docs, I used the Kubernetes documentation to run a program.
I started with "Running Ray programs with Ray Jobs Submission" with the sample script.py,
which resulted in the following error:
$ ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.26.0"]}' -- "python script.py"
Job submission server address: http://127.0.0.1:8265
2022-06-02 00:34:53,279 ERROR packaging.py:84 -- Issue with path: /Users/gauta/.docker/run/docker-cli-api.sock
Traceback (most recent call last):
  File "/usr/local/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/ray/scripts/scripts.py", line 1958, in main
    return cli()
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/cli.py", line 151, in job_submit
    job_id = client.submit_job(
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/sdk.py", line 255, in submit_job
    self._upload_working_dir_if_needed(runtime_env)
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/sdk.py", line 232, in _upload_working_dir_if_needed
    package_uri = self._upload_package_if_needed(
  File "/usr/local/lib/python3.9/site-packages/ray/dashboard/modules/job/sdk.py", line 211, in _upload_package_if_needed
    package_uri = get_uri_for_directory(package_path, excludes=excludes)
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 333, in get_uri_for_directory
    hash_val = _hash_directory(directory, directory,
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 121, in _hash_directory
    _dir_travel(root, excludes, handler, logger=logger)
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 88, in _dir_travel
    _dir_travel(sub_path, excludes, handler, logger=logger)
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 88, in _dir_travel
    _dir_travel(sub_path, excludes, handler, logger=logger)
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 88, in _dir_travel
    _dir_travel(sub_path, excludes, handler, logger=logger)
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 85, in _dir_travel
    raise e
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 82, in _dir_travel
    handler(path)
  File "/usr/local/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 112, in handler
    with path.open("rb") as f:
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/pathlib.py", line 1252, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/pathlib.py", line 1120, in _opener
    return self._accessor.open(self, flags, mode)
OSError: [Errno 102] Operation not supported on socket: '/Users/gauta/.docker/run/docker-cli-api.sock'
Then I tried "Using Ray Client to connect from within the Kubernetes cluster".
However, this also resulted in the following error:
$ python3 ray/doc/kubernetes/example_scripts/run_local_example.py
Traceback (most recent call last):
  File "/Users/gauta/ray/doc/kubernetes/example_scripts/run_local_example.py", line 60, in <module>
    ray.init(f"ray://127.0.0.1:{LOCAL_PORT}")
  File "/usr/local/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/worker.py", line 800, in init
    return builder.connect()
  File "/usr/local/lib/python3.9/site-packages/ray/client_builder.py", line 151, in connect
    client_info_dict = ray.util.client_connect.connect(
  File "/usr/local/lib/python3.9/site-packages/ray/util/client_connect.py", line 33, in connect
    conn = ray.connect(
  File "/usr/local/lib/python3.9/site-packages/ray/util/client/__init__.py", line 228, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/usr/local/lib/python3.9/site-packages/ray/util/client/__init__.py", line 88, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "/usr/local/lib/python3.9/site-packages/ray/util/client/worker.py", line 697, in _server_init
    raise ConnectionAbortedError(
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 649, in Datapath
    modified_init_req, job_config = prepare_runtime_init_req(init_req)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 571, in prepare_runtime_init_req
    job_config = pickle.loads(req.job_config)
AttributeError: Can't get attribute 'ParsedRuntimeEnv' on <module 'ray._private.runtime_env.validation' from '/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/validation.py'>
Any help would be appreciated.
Reproduction script
Details are provided in the section above.
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
About this issue
- State: closed
- Created 2 years ago
- Comments: 18 (5 by maintainers)
Commits related to this issue
- Fixing example: the double quotes are not needed, see [this](https://github.com/ray-project/kuberay/issues/283) — committed to goswamig/ray by goswamig 2 years ago
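In other words, the fixed invocation passes the entrypoint without the quotes, roughly (a sketch based on the quickstart command above):
$ ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.26.0"]}' -- python script.py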
Thanks, I was able to run both options: using ray submit and connecting to the RayCluster. Feel free to close this issue. Thanks @architkulkarni and @DmitriGekhtman for the help.
I've also created a small PR: https://github.com/ray-project/ray/compare/master...goswamig:patch-1
The URI issue might be the same as this one: https://github.com/ray-project/ray/issues/23423
According to the reports on that issue, if you change the contents of the working_dir in some way (e.g. by adding a file or editing a file), the error is likely to go away. Can you let us know if that works? The issue should be fixed in the nightly wheels, if you're willing to try them out (installing them both on your local machine and on the cluster). (We're not fully confident yet that our fix fully addresses the issue, so the issue is still open.)
Another workaround, if you don't need working_dir, is to omit the working_dir field entirely. It looks like the runtime env machinery hit an issue with a file in your local directory. Could you try again with a directory that doesn't contain a socket file, perhaps an empty directory?
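For example, a sketch of the clean-directory workaround (assuming the port-forward to :8265 is still active; /tmp/ray-job is an arbitrary scratch path):
$ mkdir /tmp/ray-job && cp script.py /tmp/ray-job && cd /tmp/ray-job   # fresh working_dir containing only the script
$ ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.26.0"]}' -- "python script.py"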
@architkulkarni @edoakes We could probably make the UX smoother here by skipping over .sock files or adding better error handling for files that can't be processed.
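Until then, the excludes field of runtime_env can serve as a user-side workaround; a sketch (the pattern list is illustrative):
$ ray job submit --runtime-env-json='{"working_dir": "./", "excludes": ["*.sock"], "pip": ["requests==2.26.0"]}' -- "python script.py"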
The good news is that the connection was successfully established! The bad news is that Ray Client is extremely sensitive to Ray version mismatches: the Ray versions on the client and server side must match exactly. I suspect there is a mismatch in your situation.
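One way to check both sides (a sketch; the head pod name is taken from the listing above, and ray-head is the container name used in the KubeRay sample configs):
$ ray --version                                                              # client-side Ray version
$ kubectl exec raycluster-complete-head-xc9pg -c ray-head -- ray --version   # server-side Ray version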
Thanks for posting this bug report.
The autoscaler for each Ray cluster runs as a sidecar container in the Ray head pod: the output of kubectl get pod <ray-head> will show two containers, and kubectl get pod <ray-head> -o yaml will show the full details of the pod. I will follow up soon to help address the problems encountered here.
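For example, a quick way to list just the container names (pod name taken from the listing above; the autoscaler sidecar's container name may vary by KubeRay version):
$ kubectl get pod raycluster-complete-head-xc9pg -o jsonpath='{.spec.containers[*].name}'
# expected output along the lines of: ray-head autoscaler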