rmm: [BUG] Shouldn't need to synchronize on creation of DeviceBuffer even in the default stream
We recently merged a PR that does a cudaStreamSynchronize on any DeviceBuffer creation on the default stream. This was primarily to fix a UCX-py issue we were seeing where we were sending memory before it was necessarily computed.
@jakirkham The more I think about this the more I think we should be synchronizing before a send in UCX-py, but we shouldn’t need to synchronize here. In general, calls should be enqueued onto the default stream and we shouldn’t need to synchronize here to guarantee correct results / behavior. Things like copying to host memory would synchronize before returning, and kernel launches wouldn’t actually execute until the host --> device copy was done anyway.
I’m a bit surprised that UCX-py is doing something that ends up requiring the default stream to be synchronized. I’d expect the CUDA API calls to either implicitly synchronize things or be enqueued onto the stream.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 38 (36 by maintainers)
Yes, which is what we’re trying to guard against, ideally without synchronizing the entire device.
Not quite. When a C++ function marks it’s input as
constas in:That’s telling you that
do_stuffwill treatfasconstand not modify it. It does not mean allfs passed intodo_stuffneed to beconst.As a general rule (with some caveats) it’s always okay to add
constin a function parameter. All it means is that function won’t try to modify it.Long story short, there’s no reason to be casting to
void const*(that is technically UB), you can just cast tovoid*and pass it into the function and it will work.Well, @jrhemstad wrote
device_bufferso he should cast an opinion. I was asking a what if, not expressing an opinion on whether it’s a good idea! 😃It’s something we were hoping to avoid, but could consider. We were hoping to take the thrust route and use futures instead of streams…
Not if you try and use CNMEM w/ multiple threads. If you’re only using one thread then it’s fine.
Thanks for the explanation @jrhemstad. Either way it seems like we’re in a bit of a pickle where we’re dependent on synchronous behavior for our
DeviceBuffercython class construction but the only way to achieve that on the default stream is to synchronize the entire device.I would welcome ideas / suggestions here on a more elegant / performant solution.