LightGBM: Linux CI jobs hang forever after completing all Python tests successfully
This problem started timing out our CI jobs about 5 days ago. The CI jobs that most frequently run over the allowed 60-minute limit are Linux_latest regular and Linux_latest sdist at Azure Pipelines. Also, I just saw CUDA Version / cuda 10.0 pip (linux, clang, Python 3.8) encounter the same problem.
From the test logs, I guess the root cause is connected to the following warning message from the joblib/threadpoolctl package:
2022-01-14T22:12:20.3133305Z tests/python_package_test/test_sklearn.py::test_sklearn_integration[LGBMRegressor()-check_regressors_train(readonly_memmap=True,X_dtype=float32)]
2022-01-14T22:12:20.3133965Z /root/miniconda/envs/test-env/lib/python3.8/site-packages/threadpoolctl.py:546: RuntimeWarning:
2022-01-14T22:12:20.3134409Z Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
2022-01-14T22:12:20.3134727Z the same time. Both libraries are known to be incompatible and this
2022-01-14T22:12:20.3135026Z can cause random crashes or deadlocks on Linux when loaded in the
2022-01-14T22:12:20.3135283Z same Python program.
2022-01-14T22:12:20.3135545Z Using threadpoolctl may cause crashes or deadlocks. For more
2022-01-14T22:12:20.3135827Z information and possible workarounds, please see
2022-01-14T22:12:20.3136165Z https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
2022-01-14T22:12:20.3136411Z
2022-01-14T22:12:20.3136614Z warnings.warn(msg, RuntimeWarning)
About this issue
- State: closed
- Created 2 years ago
- Comments: 16
Commits related to this issue
- [ci] use conda-forge in CI jobs (fixes #4948) — committed to microsoft/LightGBM by jameslamb 2 years ago
- [ci] use conda-forge in Linux and macOS CI jobs (#4953) * [ci] use conda-forge in CI jobs (fixes #4948) * comment out more jobs * try reverting graphviz patch, running more cuda jobs * get g... — committed to lorentzenchr/LightGBM by jameslamb 2 years ago
@StrikerRUS @guolinke could I have “Write” access on https://github.com/guolinke/lightgbm-ci-docker?
I realized today that to make this change to mamba I'll need to update that image, and it would be easier if I could directly push to the dev branch there and have new images pushed, so I could test them in LightGBM's CI. Otherwise, I'll have to make a PR from my fork of lightgbm-ci-docker into the dev branch of that repo, and then if anything breaks repeat the process of PR-ing from my fork to dev and waiting for approval.

@jmoralez Let's go deeper! Mambaforge! 😄 https://github.com/conda-forge/miniforge#mambaforge
I also support using conda-forge, and maybe we could consider using mamba in CI. The time to solve environments and install packages is reduced significantly.
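For reference, a rough sketch of what the environment-creation step could look like with mamba (the environment name and package list here are only illustrative, not the exact set our CI scripts install):

```sh
# Rough sketch only: mamba accepts the same arguments as conda but
# solves environments much faster. Package list is for illustration.
mamba create -y -n test-env -c conda-forge \
    python=3.8 numpy scipy scikit-learn pandas joblib dask distributed

# activation still goes through conda's shell hooks
source activate test-env
```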
@StrikerRUS thanks for all the research! I strongly support moving LightGBM's CI to using only conda-forge, and I'd be happy to do that work.

I wonder, maybe it's a good time to migrate from the default conda channel to the conda-forge one? Besides this particular issue with different libomp implementations, the default conda channel is extremely slow in terms of updates and lacks some packages required for our CI.
Just some small examples.
The dask-core package, which is included in the Anaconda distribution (this clarification emphasizes its importance to conda maintainers), is currently at version 2021.10.0, while version 2022.1.0 is already available on conda-forge: https://anaconda.org/conda-forge/dask-core
The LightGBM version on the default conda channel is 3.2.1: https://anaconda.org/anaconda/lightgbm. Related issue: https://github.com/microsoft/LightGBM/pull/3544#issuecomment-724143449.
Requests for adding new and upgrading existing R packages tend to be ignored: https://github.com/ContinuumIO/anaconda-issues/issues/11604, https://github.com/ContinuumIO/anaconda-issues/issues/11571. For this reason, we have already migrated to conda-forge for building our docs: #4767.
In addition, the conda-forge channel often supports more architectures (Miniforge): https://github.com/microsoft/LightGBM/issues/4843#issuecomment-1012313487.
Download stats for LightGBM (especially for the recent versions) show that users already prefer conda-forge to default: https://anaconda.org/conda-forge/lightgbm/files vs https://anaconda.org/anaconda/lightgbm/files.

Just a reminder: it's better not to mix different channels in one environment, not only because of possible package conflicts, but also because of the long time and high memory consumption needed to resolve the environment specification during the installation phase (this matters for CI): https://github.com/microsoft/LightGBM/pull/4054#pullrequestreview-607286413, https://github.com/ContinuumIO/anaconda-issues/issues/11604#issue-564647005.
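To make the "don't mix channels" point concrete, here is a minimal sketch of what pinning CI environments to conda-forge only could look like (the file location and package set are assumptions, not what our CI currently does):

```sh
# Illustrative sketch: pin everything to conda-forge with strict priority so
# the solver never mixes builds (and OpenMP runtimes) from "defaults".
cat > "${HOME}/.condarc" <<'EOF'
channels:
  - conda-forge
channel_priority: strict
EOF

# A one-off environment can also bypass any configured channels entirely:
conda create -y -n test-env --override-channels -c conda-forge python=3.8
```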
Excellent investigation, thank you!
I vote for option 1, setting MKL_THREADING_LAYER=GNU in the environment for clang Linux jobs. I like that because it's what I'd most like to recommend to a LightGBM user facing this issue… I think "set this environment variable" is less invasive than "install this other conda package" (a rough sketch is below).

I'd also like to ask… @xhochy, if you have time, could you advise us on this issue? I'm wondering if you've experienced a similar issue with the lightgbm conda-forge feedstock or other projects that both depend on numpy and link to OpenMP themselves.

Given that I remember only the Azure Pipelines Linux_latest and CUDA cuda 10.0 pip (linux, clang, Python 3.8) CI jobs have faced this problem multiple times, I believe only CI jobs where we use clang to compile LightGBM suffer from this problem.

https://github.com/microsoft/LightGBM/blob/4aaeb22932b72aaff632b40ba596ca2533071ca2/.vsts-ci.yml#L76-L79 https://github.com/microsoft/LightGBM/blob/4aaeb22932b72aaff632b40ba596ca2533071ca2/.github/workflows/cuda.yml#L30-L33