LightGBM: Linux CI jobs hang forever after completing all Python tests successfully

This problem started causing our CI jobs to time out about 5 days ago. The jobs that most frequently exceed the allowed 60-minute limit are Linux_latest regular and Linux_latest sdist at Azure Pipelines. Also, I just saw CUDA Version / cuda 10.0 pip (linux, clang, Python 3.8) hit the same problem.

Judging from the test logs, I suspect the root cause is related to the following warning message from the joblib/threadpoolctl package:

2022-01-14T22:12:20.3133305Z tests/python_package_test/test_sklearn.py::test_sklearn_integration[LGBMRegressor()-check_regressors_train(readonly_memmap=True,X_dtype=float32)]
2022-01-14T22:12:20.3133965Z   /root/miniconda/envs/test-env/lib/python3.8/site-packages/threadpoolctl.py:546: RuntimeWarning: 
2022-01-14T22:12:20.3134409Z   Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
2022-01-14T22:12:20.3134727Z   the same time. Both libraries are known to be incompatible and this
2022-01-14T22:12:20.3135026Z   can cause random crashes or deadlocks on Linux when loaded in the
2022-01-14T22:12:20.3135283Z   same Python program.
2022-01-14T22:12:20.3135545Z   Using threadpoolctl may cause crashes or deadlocks. For more
2022-01-14T22:12:20.3135827Z   information and possible workarounds, please see
2022-01-14T22:12:20.3136165Z       https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
2022-01-14T22:12:20.3136411Z   
2022-01-14T22:12:20.3136614Z     warnings.warn(msg, RuntimeWarning)

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 16

Most upvoted comments

@StrikerRUS @guolinke could I have “Write” access on https://github.com/guolinke/lightgbm-ci-docker?

I realized today that to make this switch to mamba I’ll need to update that image, and it would be easier if I could push directly to the dev branch there and have new images published, so that I could test them in LightGBM’s CI.

Otherwise, I’ll have to open a PR from my fork of lightgbm-ci-docker into the dev branch of that repo and then, if anything breaks, repeat the process of PR-ing from my fork to dev and waiting for approval.

I also support using conda-forge, and maybe we could consider using mamba in CI: the time to solve environments and install packages is reduced significantly.

@StrikerRUS thanks for all the research! I strongly support moving LightGBM’s CI to using only conda-forge, and I’d be happy to do that work.

I wonder, maybe it’s a good time to migrate from the default conda channel to the conda-forge one? Besides this particular issue with different libomp implementations, the default conda channel is extremely slow in terms of updates and lacks some packages required for our CI.

Just a few small examples:

The dask-core package, which is included in the Anaconda distribution (I mention this to emphasize its importance to the conda maintainers), is currently at version 2021.10.0, while conda-forge already has 2022.1.0.

https://docs.anaconda.com/anaconda/packages/pkg-docs/

https://anaconda.org/conda-forge/dask-core

The LightGBM version on the default conda channel is 3.2.1: https://anaconda.org/anaconda/lightgbm. Related issue: https://github.com/microsoft/LightGBM/pull/3544#issuecomment-724143449.

Requests for adding new R packages and upgrading existing ones tend to be ignored: https://github.com/ContinuumIO/anaconda-issues/issues/11604, https://github.com/ContinuumIO/anaconda-issues/issues/11571. For this reason, we have already migrated to conda-forge for building our docs: #4767.

In addition, the conda-forge channel often supports more architectures (Miniforge): https://github.com/microsoft/LightGBM/issues/4843#issuecomment-1012313487.

Download stats for LightGBM (especially for the recent versions) show that users already prefer conda-forge over the default channel: https://anaconda.org/conda-forge/lightgbm/files vs https://anaconda.org/anaconda/lightgbm/files.

Just a reminder: it’s better not to mix different channels in one environment, not only because of possible package conflicts, but also because of the long time and high memory consumption needed to resolve the environment specification during installation (which matters for CI): https://github.com/microsoft/LightGBM/pull/4054#pullrequestreview-607286413, https://github.com/ContinuumIO/anaconda-issues/issues/11604#issue-564647005.

Excellent investigation, thank you!

I vote for option 1, setting MKL_THREADING_LAYER=GNU in the environment for the clang Linux jobs. I like that because it’s what I’d most like to recommend to a LightGBM user facing this issue: I think “set this environment variable” is less invasive than “install this other conda package”.
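
In the CI job itself this would just be an environment variable set before the tests run; the sketch below (an illustration, not our actual CI code) shows the same workaround from the Python side and the ordering constraint that matters: the variable has to be set before numpy, and therefore MKL, is loaded.

  # Workaround sketch: ask MKL to use the GNU OpenMP runtime (libgomp)
  # instead of Intel's libiomp, so that libiomp never gets loaded next to
  # the LLVM libomp that a clang-built LightGBM links against.
  import os

  os.environ["MKL_THREADING_LAYER"] = "GNU"  # must happen before MKL loads

  import numpy as np      # noqa: E402  MKL now threads via libgomp
  import lightgbm as lgb  # noqa: E402  clang build brings LLVM's libomp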


I’d also like to ask: @xhochy, if you have time, could you advise us on this issue? I’m wondering if you’ve experienced a similar issue with the lightgbm conda-forge feedstock or other projects that both depend on numpy and link to OpenMP themselves.

Given that, as far as I remember, only the Azure Pipelines Linux_latest and the CUDA cuda 10.0 pip (linux, clang, Python 3.8) CI jobs have faced this problem (multiple times already), I believe only the CI jobs where we use clang to compile LightGBM suffer from it.

https://github.com/microsoft/LightGBM/blob/4aaeb22932b72aaff632b40ba596ca2533071ca2/.vsts-ci.yml#L76-L79
https://github.com/microsoft/LightGBM/blob/4aaeb22932b72aaff632b40ba596ca2533071ca2/.github/workflows/cuda.yml#L30-L33
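
Purely as an illustration of that theory (hypothetical, Linux-only, not something we run in CI), the sketch below lists the OpenMP runtimes mapped into the process after importing an MKL-backed numpy and a clang-built lightgbm; seeing both libiomp and libomp in the output would be consistent with only the clang jobs being affected.

  # Hypothetical check (Linux only): numpy from the defaults channel pulls in
  # MKL (and thus libiomp), while a clang-built lightgbm links LLVM's libomp.
  # Listing the OpenMP runtimes mapped into the process shows whether both
  # ended up loaded at the same time.
  import numpy     # noqa: F401
  import lightgbm  # noqa: F401

  with open("/proc/self/maps") as maps:
      libs = {line.rsplit("/", 1)[-1].strip() for line in maps if ".so" in line}

  print(sorted(name for name in libs
               if name.startswith(("libiomp", "libomp", "libgomp"))))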