ray: [Ray Core] "AttributeError: 'RayInternalKvStore' object has no attribute 'del_keys'" when using the Ray Collective Communication Library

What happened + What you expected to happen

I am trying to use the Ray Collective Communication Library for communication between distributed CPU workers, with gloo as the backend. Running my script produces the following error.

@rkooo567

Error output:
NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
2022-10-02 13:40:43,958	INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.29.58.27:6379...
2022-10-02 13:40:43,963	INFO worker.py:1509 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
(pid=8135) 2022-10-02 13:40:45,050	WARNING collective.py:20 -- NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
2022-10-02 13:40:45,117	WARNING worker.py:1829 -- It looks like you're creating a detached actor in an anonymous namespace. In order to access this actor in the future, you will need to explicitly connect to this namespace with ray.init(namespace="a2d0ba68-b14a-4cc4-8fcc-f383b869e3ed", ...)
2022-10-02 13:40:45,124	WARNING worker.py:1829 -- It looks like you're creating a detached actor in an anonymous namespace. In order to access this actor in the future, you will need to explicitly connect to this namespace with ray.init(namespace="a2d0ba68-b14a-4cc4-8fcc-f383b869e3ed", ...)
(pid=8137) 2022-10-02 13:40:45,119	WARNING collective.py:20 -- NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
(pid=37650, ip=172.29.58.192) 2022-10-02 13:40:46,116	WARNING collective.py:20 -- NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
Traceback (most recent call last):
  File "demo_collective_communication_all_reduce.py", line 32, in <module>
    _ = ray.get(init_rets)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2275, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::Worker.setup() (pid=8135, ip=172.29.58.27, repr=<demo_collective_communication_all_reduce.Worker object at 0x7f69ead45520>)
  File "demo_collective_communication_all_reduce.py", line 14, in setup
    collective.init_collective_group(world_size, rank, "gloo", "default")
  File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective.py", line 148, in init_collective_group
    _group_mgr.create_collective_group(backend, world_size, rank, group_name)
  File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective.py", line 63, in create_collective_group
    g = GLOOGroup(
  File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective_group/gloo_collective_group.py", line 209, in __init__
    self._rendezvous.meet()
  File "/usr/local/lib/python3.8/dist-packages/ray/util/collective/collective_group/gloo_collective_group.py", line 158, in meet
    self._store.delKeys(keys)
AttributeError: 'RayInternalKvStore' object has no attribute 'del_keys'
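The error message says the store object has no `del_keys` method, which suggests that the Ray-side store and the installed pygloo disagree on the store interface. A minimal sketch of how one might check which methods an object is missing before handing it to a rendezvous; `FakeStore`, `missing_methods`, and the `REQUIRED` list are hypothetical names chosen for illustration and are not part of Ray or pygloo:

```python
from typing import Iterable, List


def missing_methods(obj: object, required: Iterable[str]) -> List[str]:
    """Return the names in `required` that `obj` does not expose as callables."""
    return [name for name in required if not callable(getattr(obj, name, None))]


class FakeStore:
    """Hypothetical stand-in for a key-value store; not Ray's RayInternalKvStore."""

    def set(self, key, value):
        pass

    def get(self, key):
        return None


# Assumed method list for illustration only; `del_keys` is the name
# taken from the error message above.
REQUIRED = ["set", "get", "del_keys"]

print(missing_methods(FakeStore(), REQUIRED))
```

Running the same check against the real store object inside the actor would show whether `del_keys` is the only missing method or one of several.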

Versions / Dependencies

Ray==2.0.0
Pygloo==built from source (see "AttributeError: module 'pygloo.rendezvous' has no attribute 'CustomStore'", #4 by matthewdeng)
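Since the problem appears to depend on which Ray and pygloo builds are paired, it may help to print the installed versions on every node. A small sketch using the standard-library `importlib.metadata` (Python 3.8+); `installed_version` is a hypothetical helper name, and a source-built pygloo may not report the version you expect, or any metadata at all:

```python
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or a marker if no metadata is found."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed (or no package metadata)"


# Distribution names are assumed to match the import names used in this issue.
for name in ("ray", "pygloo", "numpy"):
    print(f"{name}: {installed_version(name)}")
```

Comparing this output between the head node and the worker nodes would rule out a mismatch between machines.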

Reproduction script

import numpy as np

import ray
import ray.util.collective as col
from ray.util.collective.types import Backend, ReduceOp

@ray.remote(num_cpus=4)
class Worker:
    def __init__(self):
        self.buffer = None
        self.list_buffer = None

    def init_tensors(self):
        self.buffer = np.ones((10,), dtype=np.float32)
        self.list_buffer = [np.ones((10,), dtype=np.float32) for _ in range(2)]
        return True

    def init_group(self, world_size, rank, backend=Backend.NCCL, group_name="default"):
        col.init_collective_group(world_size, rank, backend, group_name)
        return True

    def do_allreduce(self, group_name="default", op=ReduceOp.SUM):
        col.allreduce(self.buffer, group_name, op)
        return self.buffer



def create_collective_workers(num_workers=2, group_name="default", backend="nccl"):
    # Start one actor per rank and allocate its buffers before joining the group.
    actors = [None] * num_workers
    for i in range(num_workers):
        actor = Worker.remote()
        ray.get([actor.init_tensors.remote()])
        actors[i] = actor
    world_size = num_workers
    # Launch init_group on all ranks together and wait for every call to finish.
    init_results = ray.get(
        [
            actor.init_group.remote(world_size, i, backend, group_name)
            for i, actor in enumerate(actors)
        ]
    )
    return actors, init_results

world_size = 2
group_name = "default"
actors, _ = create_collective_workers(
    num_workers=world_size, group_name=group_name, backend=Backend.GLOO
)
results = ray.get([a.do_allreduce.remote(group_name) for a in actors])

Issue Severity

High: It blocks me from completing my task.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

I was able to get the script to run with the changes from #29084. One more question: when can we expect pygloo (3.0.0) to be released? It is annoying to have to build pygloo from source every time we want to use collective communication.

Yes. I didn’t release it because I thought I was the only one using it. @ericl Could you please help transfer the pygloo ownership to me or JIAO, including both the repo ownership and the PyPI ownership?

I am on Linux only.

@pratkpranav Actually, the ownership has not been transferred to me yet. CC @ericl

Yeah, sure! Thanks for the help.

Thanks for the quick reply. I did build pygloo from source, but I am still facing this issue.