ray: [k8s] Ray Nightly: setup fails w/ object store memory greater than available

Starting a Ray cluster with the latest nightly and the example.yaml file from GitHub fails on a local k8s cluster:

2021-03-08 12:53:14,976 INFO services.py:1251 -- View the Ray dashboard at http://0.0.0.0:8265
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1707, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 564, in start
    ray_params, head=True, shutdown_at_exit=block, spawn_reaper=block)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 233, in __init__
    self.start_ray_processes()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 887, in start_ray_processes
    huge_pages=self._ray_params.huge_pages
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 1728, in determine_plasma_store_config
    "The requested object store memory size is greater "
ValueError: The requested object store memory size is greater than the total available memory.
command terminated with exit code 1
  New status: update-failed
  !!!
  Setup command `kubectl -n ray exec -it local-cluster-ray-head-d9q6c -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":1}'"'"';ulimit -n 65536; ray start --head --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0)'` failed with exit code 1. stderr:
  !!!
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/tgaddair/.pyenv/versions/3.7.8/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/tgaddair/repos/ludwig/env/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 127, in run
    self.do_update()
  File "/Users/tgaddair/repos/ludwig/env/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 451, in do_update
    run_env="auto")
  File "/Users/tgaddair/repos/ludwig/env/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 177, in run
    self.process_runner.check_call(final_cmd, shell=True)
  File "/Users/tgaddair/.pyenv/versions/3.7.8/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'kubectl -n ray exec -it local-cluster-ray-head-d9q6c -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":1}'"'"';ulimit -n 65536; ray start --head --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0)'' returned non-zero exit status 1.

This cluster was provisioned on a local macOS machine using k3d. It has a single node and is overcommitted:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                4100m (51%)   0 (0%)
  memory             2630Mi (44%)  3754Mi (63%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
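
The error is consistent with Ray deriving its default object-store size from the memory visible to the process, which on k8s typically tracks the host rather than the pod's limit (the maintainer comments below point at k8s resource detection as the fix). Below is a minimal sketch of how the two numbers can diverge. This is a hypothetical illustration, not Ray's actual sizing code; it assumes a cgroup v1 mount, and psutil is already a Ray dependency:

import psutil

# What a naive default sees: total memory visible to the process,
# which inside a pod is usually the host's memory, not the pod's.
host_total = psutil.virtual_memory().total

# The pod's actual ceiling, as enforced by the kernel (cgroup v1 path;
# under cgroup v2 the equivalent file is /sys/fs/cgroup/memory.max and
# may contain the string "max" instead of a number).
try:
    with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
        pod_limit = int(f.read().strip())
except FileNotFoundError:
    pod_limit = host_total  # not containerized, or a different cgroup layout

print(f"host total: {host_total}, pod limit: {pod_limit}")
# An object store sized as a fraction of host_total can exceed pod_limit,
# matching the "greater than the total available memory" error above.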

Adding `--object-store-memory=200000000` to the `ray start ...` invocation in the setup commands solved the issue.
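
For anyone hitting the same error from the Python API rather than the autoscaler YAML, the equivalent cap can be passed to ray.init; a short sketch, with the 200 MB figure mirroring the workaround above:

import ray

# Cap the plasma object store explicitly instead of letting Ray size it
# from detected system memory (mirrors --object-store-memory=200000000).
ray.init(object_store_memory=200 * 1000 ** 2)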

cc @richardliaw

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 19 (19 by maintainers)

Most upvoted comments

Just tested – this is fixed.

@DmitriGekhtman will this be solved with better k8s resource detection that you’re working on?

better k8s resource detection 😃