LightGBM: [ci] [dask] CI jobs failing with Dask 2022.7.1

Description

Created from https://github.com/microsoft/LightGBM/pull/5388#issuecomment-1195848451.

All CUDA CI jobs and several Linux jobs are failing with the following:

FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[binary-classification]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[multiclass-classification]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[regression]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[ranking]
= 4 failed, 700 passed, 10 skipped, 2 xfailed, 395 warnings in 655.51s (0:10:55) =

client.restart() calls in that test are resulting in the following:

raise TimeoutError(f"{len(bad_nannies)}/{len(nannies)} nanny worker(s) did not shut down within {timeout}s") E asyncio.exceptions.TimeoutError: 1/2 nanny worker(s) did not shut down within 120s

traceback:
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3360: in restart
    return self.sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:338: in sync
    return sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:405: in sync
    raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:378: in f
    result = yield future
/root/miniforge/envs/test-env/lib/python3.9/site-packages/tornado/gen.py:762: in run
    value = future.result()
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3329: in _restart
    await self.scheduler.restart(timeout=timeout, wait_for_workers=wait_for_workers)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:1153: in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:943: in send_recv
    raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:769: in _handle_comm
    result = await result
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:778: in wrapper
    return await func(*args, **kwargs)
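
For context, here's a hedged, minimal sketch (not the actual test_machines_should_be_used_if_provided code; the cluster size and timeout are illustrative) of the kind of pattern that hits this path: a small LocalCluster whose workers run under nannies, followed by a Client.restart() call.

import distributed

if __name__ == "__main__":
    # Two nanny-managed workers, roughly matching the "1/2 nanny worker(s)" in the error above.
    cluster = distributed.LocalCluster(n_workers=2, threads_per_worker=1)
    client = distributed.Client(cluster)

    # ... training that intentionally fails and leaves one worker stuck ...

    # In distributed 2022.7.1, restart() waits for every nanny to stop its worker;
    # if one worker never shuts down, this raises TimeoutError instead of returning.
    client.restart(timeout=120)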

It looks like those jobs are getting dask and distributed 2022.7.1

    dask-2022.7.1              |     pyhd8ed1ab_0           5 KB  conda-forge
    dask-core-2022.7.1         |     pyhd8ed1ab_0         840 KB  conda-forge
    dbus-1.13.6                |       h5008d03_3         604 KB  conda-forge
    distributed-2022.7.1       |     pyhd8ed1ab_0         735 KB  conda-forge

which hit conda-forge 3 days ago.

Reproducible example

Here’s an example: https://github.com/microsoft/LightGBM/runs/7522939980?check_suite_focus=true

I don’t believe the failure is related to anything specific to the PR that the failed build came from.

Additional Comments

Note that this should not be a concern for jobs using Python < 3.8, as dask / distributed have dropped support for those Python versions.

Logs from an example build on #5388 where I tried to pin to exactly dask==2022.7.0 (build link):

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - dask==2022.7.0 -> python[version='>=3.8']
  - distributed==2022.7.0 -> python[version='>=3.8']

Most upvoted comments

It seems that the timeout is for individual sockets; we need one for the whole network setup. This is the relevant part: https://github.com/microsoft/LightGBM/blob/f94050a4cc94909e10a0064baff11cec795eb250/src/network/linkers_socket.cpp#L196-L216. So it doesn’t wait forever: it has a finite number of retries and waits longer each time. I added a log there to see how many retries it would make and there were 20; however, after those 20 tries, instead of failing the setup it assumed everything was ok and then segfaulted.
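
For illustration only, here is a rough Python sketch of the retry behavior described above (the real logic is C++ in src/network/linkers_socket.cpp; the function name, retry count, and waits here are made up):

import socket
import time

def try_connect(host, port, num_retries=20, base_wait_s=0.1):
    """Retry a TCP connection, waiting longer after each failed attempt."""
    wait = base_wait_s
    for _ in range(num_retries):
        try:
            with socket.create_connection((host, port), timeout=wait):
                return True
        except OSError:
            time.sleep(wait)
            wait *= 2  # each retry waits longer than the previous one
    # The problem described above: exhausting the retries was not treated as a
    # failure, so setup continued as if the connection existed and later segfaulted.
    return False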

I’ll keep investigating this

Separate issue for discussing migration to the official manylinux image: opened #5514

@jameslamb Done! Feel free to do your experiments in the dev2 branch.

You can check the diff of the .github/workflows/publish_image.yml file to see how to push images built from multiple Dockerfiles to Docker Hub.

@StrikerRUS sorry it took me so long to respond to this!

I agree with the proposal to bump the minimum glibc version and to start publishing wheels built with one of the standard manylinux images.

That could definitely be related, especially if something on the dask-specific code paths leads to conflicting shared-library loading (like some of the issues we’ve talked about in #5106).

scipy 1.9.0 changelog: http://scipy.github.io/devdocs/release.1.9.0.html

The only item there that, in my opinion, could be related to our issue is:

SciPy switched to Meson as its build system

Specifically,

The build defaults to using OpenBLAS

It’s quite strange that only the Linux regular and Linux bdist jobs are affected; scipy has been upgraded in Linux mpi_source as well, for example…

Or, alternatively, we can try to figure out why raising an error the way that this test intentionally does creates issues for Client.restart()

It’s not the error itself; it’s actually the other worker that gets stuck. Since distributed training expects all machines to be ready, when one fails to bind its port the other ones just wait forever. That’s what I thought restart fixed, but I’m not sure anymore haha. I think a better fix would be to add a timeout to the distributed training setup, so that if one of the machines fails to connect, the process is interrupted in the rest of them. I’ll give that a shot and let you know.
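
A hedged Python sketch of that idea, purely to illustrate it (the real change would live in LightGBM’s C++ networking code; all names and timeouts here are made up): every pending peer is retried until a single overall deadline passes, so no machine waits forever on a peer that failed to bind its port.

import socket
import time

def setup_network(peers, overall_timeout_s=60.0, per_attempt_timeout_s=1.0):
    """Connect to every (host, port) peer or raise once the overall deadline passes."""
    deadline = time.monotonic() + overall_timeout_s
    pending = set(peers)
    while pending:
        if time.monotonic() > deadline:
            # Interrupt setup on this machine instead of waiting forever.
            raise TimeoutError(f"could not reach {len(pending)} peer(s) within {overall_timeout_s}s")
        for host, port in list(pending):
            try:
                with socket.create_connection((host, port), timeout=per_attempt_timeout_s):
                    pending.discard((host, port))
            except OSError:
                pass  # keep retrying until the overall deadline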

Also, at some point soon-ish, I’m hoping to finish up and merge https://github.com/dask/distributed/pull/6427. This might break your test yet again.

Currently, the Nanny.kill process goes:

  1. Call Worker.close
  2. If that doesn’t shut it down, send a SIGTERM to the worker. I think with your current test setup this will actually kill the worker process. But if you were running dask-worker (like users would be), it would just trigger Worker.close, so effectively a no-op.
  3. If it’s still not closed after the timeout… 🤷

After that PR:

  1. Call Worker.close
  2. If that doesn’t shut it down, send a SIGKILL to the worker. SIGKILL cannot be caught so the process will shut down nearly immediately (modulo it being suspended or currently waiting on a system call).
  3. If it’s still not closed after the timeout, raise an error.

The key difference is that maybe what you have right now is managing to block/ignore the SIGTERM, preventing the stuck worker from shutting down. Since you can’t block a SIGKILL, after this PR is merged I would expect (and hope) that client.restart would result in both workers restarting successfully, not just 1. If you’re counting on the stuck worker to survive the restart, I don’t think it would (it’s a bug that it’s able to right now).
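
To make the SIGTERM vs. SIGKILL difference concrete, here is a small, self-contained, POSIX-only Python illustration (not related to the distributed codebase itself): a child process that ignores SIGTERM keeps running, but SIGKILL cannot be caught or ignored.

import os
import signal
import time

child = os.fork()
if child == 0:
    # Child: ignore SIGTERM, simulating a worker that blocks graceful shutdown.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    time.sleep(3600)
    os._exit(0)

time.sleep(0.5)                  # give the child time to install its handler
os.kill(child, signal.SIGTERM)   # ignored: the child keeps running
time.sleep(0.5)
os.kill(child, signal.SIGKILL)   # cannot be ignored: the child is terminated
os.waitpid(child, 0)
print("the child only went away after SIGKILL")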