rmm: [BUG] Shouldn't need to synchronize on creation of DeviceBuffer even in the default stream

We recently merged a PR that does a cudaStreamSynchronize on any DeviceBuffer creation on the default stream. This was primarily to fix a UCX-py issue we were seeing where we were sending memory before it was necessarily computed.

@jakirkham The more I think about this the more I think we should be synchronizing before a send in UCX-py, but we shouldn’t need to synchronize here. In general, calls should be enqueued onto the default stream and we shouldn’t need to synchronize here to guarantee correct results / behavior. Things like copying to host memory would synchronize before returning, and kernel launches wouldn’t actually execute until the host --> device copy was done anyway.

I’m a bit surprised that UCX-py is doing something that ends up requiring the default stream to be synchronized. I’d expect the CUDA API calls to either implicitly synchronize things or be enqueued onto the stream.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 38 (36 by maintainers)

Most upvoted comments

Yes, which is what we’re trying to guard against, ideally without synchronizing the entire device.

hb[:] = 2

# what is `rdb` here?

https://github.com/rapidsai/rmm/blob/branch-0.13/include/rmm/device_buffer.hpp#L120 https://github.com/rapidsai/rmm/blob/branch-0.13/python/rmm/_lib/device_buffer.pxd#L30

The input is marked as const here where this shouldn’t be allowed, but because we’re just casting an integer to a const void* we’re bypassing this safety and allowing this UB it seems.

Not quite. When a C++ function marks it’s input as const as in:

void do_stuff(foo const * f);

That’s telling you that do_stuff will treat f as const and not modify it. It does not mean all fs passed into do_stuff need to be const.

foo const c_f;
foo f;

do_stuff(&c_f); // this is fine
do_stuff(&f); // this is fine too

As a general rule (with some caveats) it’s always okay to add const in a function parameter. All it means is that function won’t try to modify it.

Long story short, there’s no reason to be casting to void const* (that is technically UB), you can just cast to void* and pass it into the function and it will work.

Well, @jrhemstad wrote device_buffer so he should cast an opinion. I was asking a what if, not expressing an opinion on whether it’s a good idea! 😃

Additionally, it requires libcudf supporting explicitly passing a stream to it’s C++ APIs which I’m not sure of the state of.

It’s something we were hoping to avoid, but could consider. We were hoping to take the thrust route and use futures instead of streams…

Yes we likely can, but does it work with cnmem under the hood?

Not if you try and use CNMEM w/ multiple threads. If you’re only using one thread then it’s fine.

Thanks for the explanation @jrhemstad. Either way it seems like we’re in a bit of a pickle where we’re dependent on synchronous behavior for our DeviceBuffer cython class construction but the only way to achieve that on the default stream is to synchronize the entire device.

I would welcome ideas / suggestions here on a more elegant / performant solution.