ucx-py: Summary of recent UCX TPCxBB tests and intermittent failures

We’ve been running UCX TPCxBB tests on a Slurm cluster. Across multiple configurations, we’ve hit intermittent, as-yet-unexplained failures using UCX and InfiniBand. We have been using UCX 1.8 and 1.9 rather than UCX 1.10 due to the previously discussed issues (see #668 and associated issues/PRs). This issue summarizes several of the configurations we’ve recently tested and the failures we’ve seen with them.

The setup included a manual QA check of the mappings between GPUs, Mellanox NICs, and NIC interfaces. The specific failures are being triaged and may get their own, more detailed issues, which can be cross-linked for tracking.
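As an aside, the manual mapping check described above can be sketched as a small script. This is only an illustration: the mapping values and the `check_mappings` helper below are hypothetical, not the actual check we ran; on a real node the data would come from tools like `nvidia-smi topo -m` and `ibdev2netdev`.

```python
# Hypothetical GPU -> (NIC device, interface) mapping recorded during manual QA.
# The values here are illustrative only; real values would be collected per node.
expected = {
    0: ("mlx5_0", "ib0"),
    1: ("mlx5_1", "ib1"),
}

# What the node actually reports (again, illustrative data).
observed = {
    0: ("mlx5_0", "ib0"),
    1: ("mlx5_1", "ib1"),
}

def check_mappings(expected, observed):
    """Return the GPUs whose NIC/interface pairing differs from the QA record."""
    mismatches = []
    for gpu, pair in expected.items():
        if observed.get(gpu) != pair:
            mismatches.append(gpu)
    return mismatches

print(check_mappings(expected, observed))  # an empty list means the topology matches
```

A check like this is cheap to run at job startup, which matters on Slurm where the allocated nodes (and thus the GPU/NIC topology) can vary between runs.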

Initial Setup

  • UCX 1.8.1
  • UCX-Py 0.18 (2020-01-19)
  • RAPIDS 0.18 nightlies (2020-01-19)
  • CUDA 11.0
  • Ubuntu 18.04
  • MLNX_OFED-5.1-2.5.8.0

With this setup, we were able to run a few queries successfully. However, we experienced intermittent segfaults that were not consistently reproducible.

We also saw the following warning related to libibcm, which we are triaging but which may resolve itself with Ubuntu 20.04. Others (including @pentschev) have suggested that we may simply no longer need libibcm.

> libibcm: kernel ABI version 0 doesn't match library version 5.
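One way to triage this kind of warning, as a rough diagnostic sketch (the install paths below are illustrative, not our actual layout), is to check whether libibcm is present and whether UCX's transport libraries still link against it:

```shell
# Is libibcm present in the dynamic linker cache at all?
ldconfig -p | grep libibcm

# Does UCX's transport layer actually link it? Adjust the path to your UCX prefix.
ldd /opt/ucx/lib/libuct.so | grep ibcm
```

If nothing links against it, the warning is likely harmless noise from an unrelated consumer, which would be consistent with the suggestion that libibcm is no longer needed.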

Second Setup

  • UCX 1.9
  • UCX-Py 0.18 (2020-01-19)
  • RAPIDS 0.18 nightlies (2020-01-19)
  • CUDA 11.0
  • Ubuntu 18.04
  • MLNX_OFED-5.1-2.5.8.0

The only change in this setup was to use OpenUCX 1.9. With this setup, we were also able to run a few queries successfully. However, we again experienced intermittent failures. Failing queries included both large and small queries, suggesting that the failures were driven not by out-of-memory errors but by something else.

Third Setup (~pending, may succeed – will edit as appropriate~)

  • UCX 1.9
  • UCX-Py 0.18 (2020-01-19)
  • RAPIDS 0.18 nightlies (2020-01-19)
  • CUDA 11.0
  • Ubuntu 20.04 Focal
  • MLNX_OFED-5.1-2.5.8.0

After additional discussions, we upgraded from Ubuntu 18.04 to Ubuntu 20.04. In this test, we also removed `--with-cm` from the UCX build process. We now consistently see compute occur and then, shortly after, a hang.
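For reference, the build change amounts to dropping that one flag from the configure step. The sketch below shows roughly where it would have appeared; the prefix and other flags are illustrative, not our exact build line:

```shell
# Hypothetical UCX 1.9 source build. Previously --with-cm was passed here;
# in this test it is simply omitted.
./contrib/configure-release \
    --prefix=/opt/ucx-1.9 \
    --with-cuda=/usr/local/cuda-11.0 \
    --with-verbs \
    --enable-mt
make -j && make install
```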

@quasiben please feel free to edit/correct me if I’ve misstated anything.

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Thanks @jakirkham, I think we have the right patch:

```dockerfile
ADD https://raw.githubusercontent.com/rapidsai/ucx-split-feedstock/master/recipe/cuda-alloc-rcache.patch /tmp/ib_registration_cache.patch
```

@beckernick it would be good to have GPU-BDB tested again with UCX 1.11, we believe that issues here have been resolved.

The patch quoted in https://github.com/rapidsai/ucx-py/issues/670#issuecomment-763177952 is the correct one for 1.9, there’s only one patch needed. IIRC, the old patches from 1.8 will not apply to 1.9.

It’s a dirty remnant from the Dockerfile – the patch is being written out under the name /tmp/ib_reg…