ucx-py: Crash with InfiniBand

Today we started running a larger-scale test with 17 DGX-1 machines. After less than one minute, the machine where the scheduler was running hung completely. I went to check the workers’ logs, and one of them reports the error below dozens of times:

Exception ignored in: 'ucp._libs.send_recv._stream_recv_callback'
asyncio.base_futures.InvalidStateError: invalid state
asyncio.base_futures.InvalidStateError: invalid state
Exception ignored in: 'ucp._libs.send_recv._stream_recv_callback'
asyncio.base_futures.InvalidStateError: invalid state
asyncio.base_futures.InvalidStateError: invalid state
Exception ignored in: 'ucp._libs.send_recv._send_callback'
asyncio.base_futures.InvalidStateError: invalid state
distributed.core - INFO - Event loop was unresponsive in Worker for 4.38s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
asyncio.base_futures.InvalidStateError: invalid state
Exception ignored in: 'ucp._libs.send_recv._stream_recv_callback'
asyncio.base_futures.InvalidStateError: invalid state
distributed.core - INFO - Event loop was unresponsive in Worker for 6.17s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
asyncio.base_futures.InvalidStateError: invalid state
Exception ignored in: 'ucp._libs.send_recv._send_callback'
asyncio.base_futures.InvalidStateError: invalid state
asyncio.base_futures.InvalidStateError: invalid state
Exception ignored in: 'ucp._libs.send_recv._stream_recv_callback'
asyncio.base_futures.InvalidStateError: invalid state
distributed.core - INFO - Event loop was unresponsive in Worker for 7.38s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

This is an error that I’ve seen multiple times before on different occasions, but I wasn’t able to track it down.
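For anyone unfamiliar with this exception: asyncio raises InvalidStateError when a result is set on a future that is already done or cancelled. The sketch below only illustrates that general mechanism with plain asyncio. It is not UCX-Py code, and `complete_safely` is a hypothetical helper name. Whether this is what actually happens inside `_stream_recv_callback` and `_send_callback` is an assumption, not something confirmed here.

```python
import asyncio


def complete_safely(future, result):
    """Hypothetical guard: set the result only if the future is still pending.

    Returns True if the result was set, False if the future was already
    done (which includes cancelled futures).
    """
    if future.done():
        return False
    future.set_result(result)
    return True


async def demo():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()

    # Simulate the awaiting side giving up before the callback fires.
    fut.cancel()

    # An unguarded callback would do this and raise InvalidStateError.
    try:
        fut.set_result(b"payload")
        raised = False
    except asyncio.InvalidStateError:
        raised = True

    # The guarded variant skips the completed future instead of raising.
    guarded = complete_safely(fut, b"payload")
    return raised, guarded


raised, guarded = asyncio.run(demo())
print("unguarded raised InvalidStateError:", raised)
print("guarded call set a result:", guarded)
```

A guard like this hides the symptom only. If a callback fires for a future that was already completed or cancelled, the ordering problem behind it still needs to be understood.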

After many of those errors, one of the CUDA workers finally errors out as follows:

ERROR Ignored except: <class 'ValueError'> Both peers must set guarantee_msg_order identically
[rl-dgx-d19-u08-rapids-dgx102:24389:0:24389] ib_mlx5_log.c:139  Transport retry count exceeded on mlx5_3:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[rl-dgx-d19-u08-rapids-dgx102:24389:0:24389] ib_mlx5_log.c:139  RC QP 0x520 wqe[9]: RDMA_WRITE s-- [rva 0x7f6342565e00 rkey 0x1c56] [va 0x7f325ec3c000 len 58307648 lkey 0x18c6]
==== backtrace ====
    0  /home/rgelhausen/conda/envs/rapids/lib/libucs.so.0(ucs_fatal_error_message+0xdf) [0x7f39358e7b4c]
    1  /home/rgelhausen/conda/envs/rapids/lib/libucs.so.0(ucs_log_default_handler+0x159) [0x7f39358eb79b]
    2  /home/rgelhausen/conda/envs/rapids/lib/libucs.so.0(ucs_log_dispatch+0xf8) [0x7f39358eb9d3]
    3  /home/rgelhausen/conda/envs/rapids/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x667) [0x7f3934a490c3]
    4  /home/rgelhausen/conda/envs/rapids/lib/ucx/libuct_ib.so.0(+0x860bc) [0x7f3934aae0bc]
    5  /home/rgelhausen/conda/envs/rapids/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x7f) [0x7f3934a4b75b]
    6  /home/rgelhausen/conda/envs/rapids/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_iface_progress+0x542a) [0x7f3934aad2f8]
    7  /home/rgelhausen/conda/envs/rapids/lib/libucp.so.0(+0x2d4f0) [0x7f3935fa44f0]
    8  /home/rgelhausen/conda/envs/rapids/lib/libucp.so.0(ucp_worker_progress+0x137) [0x7f3935faad51]
    9  /home/rgelhausen/conda/envs/rapids/lib/python3.7/site-packages/ucp/_libs/core.cpython-37m-x86_64-linux-gnu.so(+0x14291) [0x7f3936253291]
   10  /home/rgelhausen/conda/envs/rapids/lib/python3.7/site-packages/ucp/_libs/core.cpython-37m-x86_64-linux-gnu.so(+0x2cf21) [0x7f393626bf21]
   11  /home/rgelhausen/conda/envs/rapids/lib/python3.7/site-packages/ucp/_libs/core.cpython-37m-x86_64-linux-gnu.so(+0x5a50b) [0x7f393629950b]
   12  /home/rgelhausen/conda/envs/rapids/lib/python3.7/site-packages/ucp/_libs/core.cpython-37m-x86_64-linux-gnu.so(+0x15fe5) [0x7f3936254fe5]
   13  /home/rgelhausen/conda/envs/rapids/lib/python3.7/site-packages/ucp/_libs/core.cpython-37m-x86_64-linux-gnu.so(+0x36417) [0x7f3936275417]
   14  /home/rgelhausen/conda/envs/rapids/bin/python(_PyMethodDef_RawFastCallDict+0xa1) [0x559ed9f67aa1]
   15  /home/rgelhausen/conda/envs/rapids/bin/python(_PyCFunction_FastCallDict+0x21) [0x559ed9f67dd1]
   16  /home/rgelhausen/conda/envs/rapids/bin/python(+0x134e5e) [0x559ed9f65e5e]
   17  /home/rgelhausen/conda/envs/rapids/bin/python(_PyObject_CallMethodIdObjArgs+0xbd) [0x559ed9fc1ded]
   18  /home/rgelhausen/conda/envs/rapids/lib/python3.7/lib-dynload/_asyncio.cpython-37m-x86_64-linux-gnu.so(+0xca8e) [0x7f3a2ffd1a8e]
   19  /home/rgelhausen/conda/envs/rapids/bin/python(_PyObject_FastCallKeywords+0x49b) [0x559ed9f9ee7b]
   20  /home/rgelhausen/conda/envs/rapids/bin/python(+0x201893) [0x559eda032893]
   21  /home/rgelhausen/conda/envs/rapids/bin/python(_PyMethodDef_RawFastCallDict+0x194) [0x559ed9f67b94]
   22  /home/rgelhausen/conda/envs/rapids/bin/python(_PyCFunction_FastCallDict+0x21) [0x559ed9f67dd1]
   23  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x5cc4) [0x559eda0040b4]
   24  /home/rgelhausen/conda/envs/rapids/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x559ed9f9629b]
   25  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x559ed9ffea90]
   26  /home/rgelhausen/conda/envs/rapids/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x559ed9f9629b]
   27  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x559ed9ffea90]
   28  /home/rgelhausen/conda/envs/rapids/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x559ed9f9629b]
   29  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x559ed9ffea90]
   30  /home/rgelhausen/conda/envs/rapids/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x559ed9f9629b]
   31  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x559ed9ffea90]
   32  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0xab8) [0x559ed9f46f08]
   33  /home/rgelhausen/conda/envs/rapids/bin/python(_PyFunction_FastCallKeywords+0x387) [0x559ed9f96527]
   34  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x559ed9ffea90]
   35  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0xab8) [0x559ed9f46f08]
   36  /home/rgelhausen/conda/envs/rapids/bin/python(_PyFunction_FastCallDict+0x400) [0x559ed9f47ab0]
   37  /home/rgelhausen/conda/envs/rapids/bin/python(_PyObject_Call_Prepend+0x63) [0x559ed9f65b63]
   38  /home/rgelhausen/conda/envs/rapids/bin/python(PyObject_Call+0x62) [0x559ed9f58522]
   39  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e53) [0x559eda000243]
   40  /home/rgelhausen/conda/envs/rapids/bin/python(_PyFunction_FastCallDict+0x10b) [0x559ed9f477bb]
   41  /home/rgelhausen/conda/envs/rapids/bin/python(_PyObject_Call_Prepend+0xde) [0x559ed9f65bde]
   42  /home/rgelhausen/conda/envs/rapids/bin/python(PyObject_Call+0x62) [0x559ed9f58522]
   43  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x1e53) [0x559eda000243]
   44  /home/rgelhausen/conda/envs/rapids/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x559ed9f9629b]
   45  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x559ed9ffea90]
   46  /home/rgelhausen/conda/envs/rapids/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x559ed9f9629b]
   47  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x559ed9ffea90]
   48  /home/rgelhausen/conda/envs/rapids/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x559ed9f9629b]
   49  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x416) [0x559ed9ffe806]
   50  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2f9) [0x559ed9f46749]
   51  /home/rgelhausen/conda/envs/rapids/bin/python(_PyFunction_FastCallKeywords+0x387) [0x559ed9f96527]
   52  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalFrameDefault+0x14e7) [0x559ed9fff8d7]
   53  /home/rgelhausen/conda/envs/rapids/bin/python(_PyEval_EvalCodeWithName+0x2f9) [0x559ed9f46749]
   54  /home/rgelhausen/conda/envs/rapids/bin/python(PyEval_EvalCodeEx+0x44) [0x559ed9f47674]
   55  /home/rgelhausen/conda/envs/rapids/bin/python(PyEval_EvalCode+0x1c) [0x559ed9f4769c]
   56  /home/rgelhausen/conda/envs/rapids/bin/python(+0x22cbc4) [0x559eda05dbc4]
   57  /home/rgelhausen/conda/envs/rapids/bin/python(PyRun_StringFlags+0x7d) [0x559eda068f1d]
   58  /home/rgelhausen/conda/envs/rapids/bin/python(PyRun_SimpleStringFlags+0x3f) [0x559eda068f7f]
   59  /home/rgelhausen/conda/envs/rapids/bin/python(+0x23807d) [0x559eda06907d]
   60  /home/rgelhausen/conda/envs/rapids/bin/python(_Py_UnixMain+0x3c) [0x559eda0693fc]
   61  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f3a321a5b97]
===================
[rl-dgx-d19-u08-rapids-dgx102:24389:0:24389] Process frozen...

I think the first errors are coming from UCX-Py, but the last one may be either a consequence of those or an issue in UCX itself (or a misconfiguration on our side).

cc @Akshay-Venkatesh in case you have ideas too.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

I just ran the dask-cuda benchmark, and #439 resolves the InvalidStateError.

Turns out the machine this was being run on was simply heavily oversubscribed. Sorry for the long tangent, Peter, and thanks for the help debugging, everyone.
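A quick way to spot this kind of oversubscription before starting a run is to compare the load average with the number of CPUs. The snippet below is a rough sketch using only the Python standard library. The helper names and the 1.0 threshold are assumptions chosen for illustration, not something taken from this issue, and load average is only a coarse signal.

```python
import os


def oversubscription_ratio(load_1min, cpu_count):
    """Return the 1-minute load average divided by the CPU count."""
    return load_1min / cpu_count


def is_oversubscribed(load_1min, cpu_count, threshold=1.0):
    """Hypothetical check: treat a ratio above `threshold` as oversubscribed."""
    return oversubscription_ratio(load_1min, cpu_count) > threshold


# os.getloadavg() is only available on Unix-like systems.
if hasattr(os, "getloadavg"):
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    ratio = oversubscription_ratio(load_1min, cpus)
    print(f"load(1m)={load_1min:.2f} cpus={cpus} ratio={ratio:.2f}")
    if is_oversubscribed(load_1min, cpus):
        print("Machine looks oversubscribed; event-loop stalls are likely.")
```

Running this on the scheduler and worker nodes before a benchmark gives a cheap sanity check that other jobs are not already competing for the same cores.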