LightGBM: [ci] [dask] CI jobs failing with Dask 2022.7.1
Description
Created from https://github.com/microsoft/LightGBM/pull/5388#issuecomment-1195848451.
All CUDA CI jobs and several Linux jobs are failing with the following.
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[binary-classification]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[multiclass-classification]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[regression]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[ranking]
= 4 failed, 700 passed, 10 skipped, 2 xfailed, 395 warnings in 655.51s (0:10:55) =
`client.restart()` calls in that test are resulting in the following:
raise TimeoutError(f"{len(bad_nannies)}/{len(nannies)} nanny worker(s) did not shut down within {timeout}s") E asyncio.exceptions.TimeoutError: 1/2 nanny worker(s) did not shut down within 120s
traceback:
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3360: in restart
return self.sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:338: in sync
return sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:405: in sync
raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:378: in f
result = yield future
/root/miniforge/envs/test-env/lib/python3.9/site-packages/tornado/gen.py:762: in run
value = future.result()
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3329: in _restart
await self.scheduler.restart(timeout=timeout, wait_for_workers=wait_for_workers)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:1153: in send_recv_from_rpc
return await send_recv(comm=comm, op=key, **kwargs)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:943: in send_recv
raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:769: in _handle_comm
result = await result
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:778: in wrapper
return await func(*args, **kwargs)
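For context, here is a rough sketch of the pattern the failing test exercises (illustrative only, not the exact test code), assuming a two-worker cluster with nanny processes:

```python
from distributed import Client, LocalCluster

# Illustrative sketch only (not the exact LightGBM test code): two workers,
# each managed by a nanny process, then a client.restart().
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=True)
client = Client(cluster)

# ... run distributed training here; a worker stuck on a bound port may
# never shut down cleanly ...

# With distributed 2022.7.1 this raises asyncio.exceptions.TimeoutError
# when a nanny cannot restart its worker within the timeout.
client.restart()
```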
It looks like those jobs are getting `dask` and `distributed` 2022.7.1:
dask-2022.7.1 | pyhd8ed1ab_0 5 KB conda-forge
dask-core-2022.7.1 | pyhd8ed1ab_0 840 KB conda-forge
dbus-1.13.6 | h5008d03_3 604 KB conda-forge
distributed-2022.7.1 | pyhd8ed1ab_0 735 KB conda-forge
which hit conda-forge 3 days ago.
Reproducible example
Here’s an example: https://github.com/microsoft/LightGBM/runs/7522939980?check_suite_focus=true
I don’t believe the failure is related to anything specific to the PR that the failing build came from.
Additional Comments
Note that this should not be a concern for jobs using Python < 3.8, as dask / distributed have dropped support for those Python versions.
Logs from an example build on #5388 where I tried to pin to exactly dask==2022.7.0 (build link):
UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:
Specifications:
- dask==2022.7.0 -> python[version='>=3.8']
- distributed==2022.7.0 -> python[version='>=3.8']
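One possible workaround, sketched here as a hypothetical snippet rather than the project’s actual CI logic, is to apply the exact pin only on Python versions that dask / distributed 2022.7.x still support:

```python
import sys

# Sketch of a conditional pin (hypothetical helper, not LightGBM's actual CI
# script): dask / distributed 2022.7.x require Python >= 3.8, so only apply
# the exact pin where it is satisfiable and let older environments resolve
# an earlier release.
if sys.version_info >= (3, 8):
    dask_specs = ["dask==2022.7.0", "distributed==2022.7.0"]
else:
    dask_specs = ["dask", "distributed"]
print(" ".join(dask_specs))
```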
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 29
Commits related to this issue
- [ci] remove constraints on dask and scipy in CI (fixes #5390) — committed to microsoft/LightGBM by jameslamb 2 years ago
It seems that timeout applies to individual sockets; we need one for the whole network setup. This is the relevant part: https://github.com/microsoft/LightGBM/blob/f94050a4cc94909e10a0064baff11cec795eb250/src/network/linkers_socket.cpp#L196-L216 So it doesn’t wait forever: it has a finite number of retries and waits longer each time. I added a log there to see how many retries it would make, and there were 20. However, after those 20 tries, instead of failing the setup it assumed everything was OK and then segfaulted.
I’ll keep investigating this.
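A minimal sketch of that idea, in Python rather than the C++ linkers code linked above (names and numbers are illustrative): one deadline for the whole network setup, retries with a growing wait, and an explicit failure instead of assuming the connections succeeded once the retries run out.

```python
import socket
import time

def connect_all(machines, overall_timeout=120.0, per_try_timeout=5.0):
    # One deadline for the whole setup, not just per socket.
    deadline = time.monotonic() + overall_timeout
    connections = {}
    pending = set(machines)
    wait = 0.5
    while pending:
        for host, port in list(pending):
            try:
                connections[(host, port)] = socket.create_connection(
                    (host, port), timeout=per_try_timeout
                )
                pending.discard((host, port))
            except OSError:
                pass  # keep retrying this machine until the deadline
        if pending:
            if time.monotonic() + wait > deadline:
                # Fail loudly instead of assuming everything connected.
                raise TimeoutError(
                    f"could not reach {sorted(pending)} within {overall_timeout}s"
                )
            time.sleep(wait)
            wait = min(wait * 2, 10.0)  # back off, bounded by the overall deadline
    return connections
```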
opened #5514
@jameslamb Done! Feel free to do your experiments in the `dev2` branch. You can check the diff for the `.github/workflows/publish_image.yml` file for how to make it possible to push multiple Dockerfiles into Docker Hub.

@StrikerRUS sorry it took me so long to respond to this!
I agree with the proposal to bump the minimum `glibc` version and to start publishing wheels using one of the standard `manylinux` images.

That could definitely be related, especially if something on the `dask`-specific code paths leads to conflicting shared-library loading (like some of the issues we’ve talked about in #5106).

`scipy` 1.9.0 changelog: http://scipy.github.io/devdocs/release.1.9.0.html
The only item there that, in my opinion, can be related to our issue is
Specifically,
It’s quite strange that only `Linux regular` and `Linux bdist` are affected. `scipy` has been upgraded in `Linux mpi_source`, for example, as well…

It’s not the error itself; it’s actually the other worker that gets stuck. Since distributed training expects all machines to be ready, when one fails to bind the port the other ones just wait forever. That’s what I thought `restart` fixed, but I’m not sure anymore haha. I think a better fix would be to add a timeout to the distributed training, so that if one of the machines fails to connect, the process is interrupted in the rest of them. I’ll give that a shot and let you know.
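A rough sketch of what that timeout could look like, with hypothetical names rather than LightGBM’s actual API (`run_training` stands in for a per-machine training entry point):

```python
import multiprocessing as mp

# Running the training in a separate process means a machine stuck waiting
# for its peers can actually be interrupted once the deadline expires.
def train_with_deadline(run_training, params, timeout_s=300):
    proc = mp.Process(target=run_training, args=(params,))
    proc.start()
    proc.join(timeout=timeout_s)
    if proc.is_alive():
        proc.terminate()  # interrupt the machine stuck waiting for a peer
        proc.join()
        raise TimeoutError(f"distributed training did not finish within {timeout_s}s")
```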
Also, at some point soon-ish, I’m hoping to finish up and merge https://github.com/dask/distributed/pull/6427. This might break your test yet again.

Currently, the `Nanny.kill` process goes:
1. `Worker.close`
2. SIGTERM the `dask-worker` process (like users would be); it would just trigger `Worker.close`, so effectively a no-op

After that PR:
1. `Worker.close`
2. if that doesn’t work, SIGKILL

The key difference is that maybe what you have right now is managing to block/ignore the SIGTERM, preventing the stuck worker from shutting down. Since you can’t block a SIGKILL, after this PR is merged I would expect (and hope) that `client.restart` would result in both workers restarting successfully, not just 1. If you’re counting on the stuck worker to survive the restart, I don’t think it would (it’s a bug that it’s able to right now).
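To illustrate the SIGTERM / SIGKILL distinction that expectation relies on (plain Python, not dask code, POSIX only):

```python
import os
import signal
import time

# A process can ignore SIGTERM, so a stuck worker may survive it, but
# SIGKILL cannot be caught or blocked.
signal.signal(signal.SIGTERM, signal.SIG_IGN)  # SIGTERM is now a no-op here
os.kill(os.getpid(), signal.SIGTERM)
time.sleep(0.1)
print("still alive after SIGTERM")
# os.kill(os.getpid(), signal.SIGKILL)  # would terminate unconditionally
```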