ray: [k8s] Ray Nightly: setup fails with object store memory greater than available
Starting a Ray cluster with the latest nightly and the example.yaml file from GitHub fails when running on a local k8s cluster:
```
2021-03-08 12:53:14,976 INFO services.py:1251 -- View the Ray dashboard at http://0.0.0.0:8265
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1707, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 564, in start
    ray_params, head=True, shutdown_at_exit=block, spawn_reaper=block)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 233, in __init__
    self.start_ray_processes()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 887, in start_ray_processes
    huge_pages=self._ray_params.huge_pages
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 1728, in determine_plasma_store_config
    "The requested object store memory size is greater "
ValueError: The requested object store memory size is greater than the total available memory.
command terminated with exit code 1
```
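The `ValueError` above comes from Ray validating the requested plasma store size against the memory it detects as available. A minimal, hypothetical sketch of that kind of check (not Ray's actual implementation; the real logic in `ray/_private/services.py` is more involved and also accounts for things like `/dev/shm` size):

```python
def check_object_store_memory(requested_bytes: int, available_bytes: int) -> int:
    """Reject an object store request larger than the detected available memory.

    Hypothetical simplification of the validation that raises in
    determine_plasma_store_config above.
    """
    if requested_bytes > available_bytes:
        raise ValueError(
            "The requested object store memory size is greater "
            "than the total available memory.")
    return requested_bytes


# A 200 MB store against 1 GiB of detected memory passes the check;
# requesting more than the detected total raises the ValueError seen above.
check_object_store_memory(200_000_000, 2**30)
```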
```
New status: update-failed
!!!
Setup command `kubectl -n ray exec -it local-cluster-ray-head-d9q6c -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":1}'"'"';ulimit -n 65536; ray start --head --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0)'` failed with exit code 1. stderr:
!!!
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/tgaddair/.pyenv/versions/3.7.8/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/tgaddair/repos/ludwig/env/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 127, in run
    self.do_update()
  File "/Users/tgaddair/repos/ludwig/env/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 451, in do_update
    run_env="auto")
  File "/Users/tgaddair/repos/ludwig/env/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 177, in run
    self.process_runner.check_call(final_cmd, shell=True)
  File "/Users/tgaddair/.pyenv/versions/3.7.8/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'kubectl -n ray exec -it local-cluster-ray-head-d9q6c -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":1}'"'"';ulimit -n 65536; ray start --head --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0)'' returned non-zero exit status 1.
```
This cluster was provisioned on a local macOS system using k3d. It has a single node and is overcommitted:
```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                4100m (51%)   0 (0%)
  memory             2630Mi (44%)  3754Mi (63%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
```
Adding `ray start ... --object-store-memory=200000000` to the setup commands solved the issue.
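With that workaround, the head start command in the autoscaler YAML would look roughly like this (`head_start_ray_commands` is the standard Ray cluster-config key; the 200 MB value is the one from above, so adjust for your pod's memory limit):

```yaml
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0 --object-store-memory=200000000
```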
cc @richardliaw
About this issue
- State: closed
- Created 3 years ago
- Comments: 19 (19 by maintainers)
just tested – this is fixed.
better k8s resource detection 😃