ucx-py: test_shutdown_closed_peer fails locally

When running the following locally…

$ python -m pytest tests/test_disconnect.py 

…I am getting the following test failure:

=================================== FAILURES ===================================
__________________________ test_shutdown_closed_peer ___________________________

caplog = <_pytest.logging.LogCaptureFixture object at 0x7fe8339ca4f0>

    def test_shutdown_closed_peer(caplog):
        client_queue = mp.Queue()
        server_queue = mp.Queue()
        p1 = mp.Process(
            target=_test_shutdown_closed_peer_server, args=(client_queue, server_queue)
        )
        p1.start()
        p2 = mp.Process(
            target=_test_shutdown_closed_peer_client, args=(client_queue, server_queue)
        )
        p2.start()
        p2.join()
        server_queue.put("client is down")
        p1.join()
    
>       assert not p1.exitcode
E       AssertionError: assert not 1
E        +  where 1 = <SpawnProcess name='SpawnProcess-1' pid=54486 parent=54399 stopped exitcode=1>.exitcode

tests/test_disconnect.py:71: AssertionError
----------------------------- Captured stdout call -----------------------------
[1601921102.562273] [dgx15:54486:0]          rc_ep.c:321  UCX  WARN  destroying rc ep 0x55cdea70bcf8 with uncompleted operation 0x55cdea91e6c0
[1601921102.588147] [dgx15:54486:0]          mpool.c:43   UCX  WARN  object 0x55cdea91a380 was not returned to mpool ucp_requests
[1601921102.588153] [dgx15:54486:0]      callbackq.c:450  UCX  WARN  0 fast-path and 1 slow-path callbacks remain in the queue
----------------------------- Captured stderr call -----------------------------
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/datasets/jkirkham/miniconda/envs/rapids16dev/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/datasets/jkirkham/miniconda/envs/rapids16dev/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/datasets/jkirkham/devel/ucx-py/tests/test_disconnect.py", line 43, in _test_shutdown_closed_peer_server
    assert log.find("""UCXError('<[Send shutdown]""") != -1
AssertionError
=========================== short test summary info ============================
FAILED tests/test_disconnect.py::test_shutdown_closed_peer - AssertionError: ...
============================== 1 failed in 1.94s ===============================

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

As of https://github.com/rapidsai/ucx-py/pull/693, all tests are confirmed passing with UCX >= 1.9 for various combinations of transports. I therefore believe this is resolved and am closing the issue, but please reopen it if you still see the failure.

Would it make sense to rename the test to “test_terminate”/“test_unexpected_disconnect” or something that more clearly identifies that the client hasn’t disconnected after ep.close? Alternatively, we should probably add a comment that we expect to see UCX errors because it’s testing for an unexpected disconnect.

I think both renaming the test and adding a comment are good ideas.

Yes, the test is part of #494, which tests shutdown of an already closed peer.
