ucx-py: Join Hang exploration with GDB

@pentschev and @Akshay-Venkatesh spent time looking at the hang with GDB using the following test

py.test -s -v "tests/test_ucx_w_dask_workers.py::test_dask_join[5000000-50000000-500000000]"

They found things hanging much of the time in uct_cuda_copy_ep_put_short and uct_cuda_copy_ep_get_short in ib_md.c of UCX. From ucx-py, we send data through ucp_tag_send_nb, which in turn breaks the data up into small chunks and begins transferring them. During IB transfers, these small sends/receives are executed with uct_cuda_copy_ep_put_short/uct_cuda_copy_ep_get_short. The process which receives the data allocates host buffers and then executes a host-to-device copy.

A critical thing to note is that the buffer sizes we’ve seen thus far are 8174 bytes. This is quite small, especially in the context of moving data to a GPU. It’s possible the hang occurs because the number of outstanding events significantly exceeds what cuda_copy can handle. This particular problem may be mitigated by the following PR: https://github.com/openucx/ucx/pull/4123 . Additionally, @Akshay-Venkatesh is reviewing the chunking with his team.

Also, @pentschev replaced the cudaMemcpyAsync calls with cudaMemcpy (synchronous copy) and the hang still manifested just the same. Lastly, we’ve seen some errors between different branches of ucx-cuda. Those differences are being explored in issue: https://github.com/rapidsai/ucx-py/issues/186
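For reference, a minimal sketch of what that send path looks like at the UCX C API level: a buffer is posted with ucp_tag_send_nb and the worker is polled until the request completes. This is an illustration, not the actual ucx-py code path; the endpoint, worker, buffer, size, and tag are assumed to be set up elsewhere.

```c
/* Sketch only: post a (possibly device) buffer with ucp_tag_send_nb and
 * drive the worker until the request completes. */
#include <ucp/api/ucp.h>

static void send_cb(void *request, ucs_status_t status)
{
    /* Called by UCX once the (possibly chunked) transfer has finished. */
    (void)request;
    (void)status;
}

static ucs_status_t blocking_send(ucp_worker_h worker, ucp_ep_h ep,
                                  const void *buf, size_t nbytes,
                                  ucp_tag_t tag)
{
    ucs_status_ptr_t req = ucp_tag_send_nb(ep, buf, nbytes,
                                           ucp_dt_make_contig(1), tag, send_cb);
    if (UCS_PTR_IS_ERR(req)) {
        return UCS_PTR_STATUS(req);   /* immediate failure */
    }
    if (!UCS_PTR_IS_PTR(req)) {
        return UCS_OK;                /* completed immediately */
    }

    /* UCX only makes progress while the worker is polled; the hang shows up
     * as this loop spinning while the cuda_copy short put/get never finish. */
    while (ucp_request_check_status(req) == UCS_INPROGRESS) {
        ucp_worker_progress(worker);
    }
    ucs_status_t status = ucp_request_check_status(req);
    ucp_request_free(req);
    return status;
}
```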

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 37 (27 by maintainers)

Most upvoted comments

We actually found the root cause of the bug. Thanks @kkraus14 @quasiben @shwina @mt-jones for taking the time for our live debugging session.

To describe it briefly, the problem is the Numba forall call, which internally calls cuOccupancyMaxPotentialBlockSize. That driver function takes two function references: the CUDA kernel itself, and a callback that calculates how much dynamic shared memory the launch requires. The problem lies in the latter, which is defined in https://github.com/numba/numba/blob/master/numba/cuda/compiler.py#L288. Since that is a Python lambda, when cuOccupancyMaxPotentialBlockSize calls it back, the callback tries to acquire the GIL. That causes a deadlock: the thread inside cuOccupancyMaxPotentialBlockSize holds the CUDA driver mutex and waits for the GIL, while the thread executing cudaMemcpyAsync holds the GIL and waits for that same CUDA mutex. The GIL can therefore never be acquired, since neither thread can complete.
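As a toy illustration only (plain pthreads, no CUDA or Python involved), the lock ordering described above looks like this: one thread takes the "driver" lock and then wants the "GIL", while the other holds the "GIL" and then wants the "driver" lock.

```c
/* Toy ABBA deadlock mirroring the ordering described above; running it hangs
 * by design. This is not CUDA or CPython code, just two mutexes and threads. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t gil         = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t driver_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for cuOccupancyMaxPotentialBlockSize invoking a Python lambda. */
static void *occupancy_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&driver_lock);   /* driver takes its internal lock */
    sleep(1);                           /* make the interleaving reliable */
    pthread_mutex_lock(&gil);           /* Python callback needs the GIL  */
    puts("occupancy thread: never reached");
    pthread_mutex_unlock(&gil);
    pthread_mutex_unlock(&driver_lock);
    return NULL;
}

/* Stands in for a Python thread holding the GIL while calling cudaMemcpyAsync. */
static void *memcpy_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&gil);           /* Python code runs with the GIL  */
    sleep(1);
    pthread_mutex_lock(&driver_lock);   /* the copy needs the driver lock */
    puts("memcpy thread: never reached");
    pthread_mutex_unlock(&driver_lock);
    pthread_mutex_unlock(&gil);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, occupancy_thread, NULL);
    pthread_create(&b, NULL, memcpy_thread, NULL);
    pthread_join(a, NULL);              /* hangs: classic ABBA deadlock */
    pthread_join(b, NULL);
    return 0;
}
```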

What we need to ensure is that callbacks passed to libcuda never try to acquire the GIL. To fix that in the present case, we can simply pass a C function pointer instead of a Python function.

I will submit a PR to Numba shortly for this.
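For illustration, here is a rough driver-API sketch of the idea (this is not the actual Numba patch; the PTX file name and kernel name are made up, and error handling is trimmed). The point is that the block-size-to-shared-memory callback is a plain C function, so the driver can invoke it without the GIL ever being involved.

```c
/* Sketch: cuOccupancyMaxPotentialBlockSize with a C callback instead of a
 * Python lambda. Assumes a PTX file "vector_add.ptx" containing a kernel
 * named "vector_add" (both hypothetical). */
#include <cuda.h>
#include <stdio.h>

/* C callback: dynamic shared memory needed for a given block size. */
static size_t smem_for_block(int block_size)
{
    (void)block_size;
    return 0;   /* this kernel uses no dynamic shared memory */
}

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction kernel;
    int min_grid_size = 0, block_size = 0;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "vector_add.ptx");            /* hypothetical PTX    */
    cuModuleGetFunction(&kernel, mod, "vector_add"); /* hypothetical kernel */

    /* Same driver call Numba reaches, but the callback never touches Python. */
    cuOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, kernel,
                                     smem_for_block, 0, 0);

    printf("suggested block size: %d (min grid size %d)\n",
           block_size, min_grid_size);

    cuCtxDestroy(ctx);
    return 0;
}
```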

This makes me cautiously very excited 😃

Thanks for the update @Akshay-Venkatesh

I understand that. What if we grab the GIL before making a CUDA call because we will touch some Python objects? Isn’t that also unsafe?

That would be unsafe too, yes. I guess we have to be careful not to take the GIL and then launch a CUDA call before releasing it again. I’m not sure there’s a way to prevent that other than being careful.
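As a hypothetical sketch of that "be careful" rule for a hand-written C extension (the module name gil_safe_copy, the function copy_to_device, and the use of cuMemcpyHtoD are all made up for illustration): release the GIL before entering the CUDA driver, and only re-take it after the driver call returns, so Python objects are never touched while the driver lock may be held.

```c
/* Sketch of a C extension that drops the GIL around a CUDA driver call.
 * Assumes a current CUDA context exists on the calling thread. */
#include <Python.h>
#include <cuda.h>

static PyObject *copy_to_device(PyObject *self, PyObject *args)
{
    unsigned long long dst;   /* device pointer passed as an integer */
    Py_buffer src;            /* host buffer (bytes-like object)     */
    CUresult err;

    (void)self;
    if (!PyArg_ParseTuple(args, "Ky*", &dst, &src)) {
        return NULL;
    }

    Py_BEGIN_ALLOW_THREADS    /* drop the GIL before touching libcuda */
    err = cuMemcpyHtoD((CUdeviceptr)dst, src.buf, (size_t)src.len);
    Py_END_ALLOW_THREADS      /* re-acquire the GIL to build the result */

    PyBuffer_Release(&src);
    if (err != CUDA_SUCCESS) {
        PyErr_SetString(PyExc_RuntimeError, "cuMemcpyHtoD failed");
        return NULL;
    }
    Py_RETURN_NONE;
}

static PyMethodDef methods[] = {
    {"copy_to_device", copy_to_device, METH_VARARGS,
     "Host-to-device copy performed with the GIL released."},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT, "gil_safe_copy", NULL, -1, methods
};

PyMODINIT_FUNC PyInit_gil_safe_copy(void)
{
    return PyModule_Create(&module);
}
```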