scikit-learn: ImportError: dlopen: cannot load any more object with static TLS with torch built with gcc 5.5

I am not sure whether this is a PyTorch bug, a scikit-learn bug, or a numba bug, but this used to work in scikit-learn 0.20.3 and stopped working in the 0.21.0 series, so for now I am going to guess that it is a regression in scikit-learn.

When I run the following series of imports (minimized from the original import, which was import librosa), the program fails:

import torch
import soundfile
import scipy.signal
import numba
import sklearn

with

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/sklearn/__check_build/__init__.py", line 44, in <module>
    from ._check_build import check_build  # noqa
ImportError: dlopen: cannot load any more object with static TLS

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_torch.py", line 5, in <module>
    import sklearn
  File "/opt/conda/lib/python3.6/site-packages/sklearn/__init__.py", line 75, in <module>
    from . import __check_build
  File "/opt/conda/lib/python3.6/site-packages/sklearn/__check_build/__init__.py", line 46, in <module>
    raise_build_error(e)
  File "/opt/conda/lib/python3.6/site-packages/sklearn/__check_build/__init__.py", line 41, in raise_build_error
    %s""" % (e, local_dir, ''.join(dir_content).strip(), msg))
ImportError: dlopen: cannot load any more object with static TLS
___________________________________________________________________________
Contents of /opt/conda/lib/python3.6/site-packages/sklearn/__check_build:
_check_build.cpython-36m-x86_64-linux-gnu.so__pycache__               __init__.py
setup.py
___________________________________________________________________________
It seems that scikit-learn has not been built correctly.

If you have installed scikit-learn from source, please do not forget
to build the package before using it: run `python setup.py install` or
`make` in the source directory.

If you have used an installer, please check that it is suited for your
Python version, your operating system and your platform.

Downgrading to scikit-learn 0.20.3 makes the problem go away.

Versions

jenkins@260bf77532d0:~/workspace/test$ python
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sklearn; sklearn.show_versions()

System:
    python: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)  [GCC 7.3.0]
executable: /opt/conda/bin/python
   machine: Linux-4.15.0-29-generic-x86_64-with-debian-jessie-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
  lib_dirs: /opt/conda/lib
cblas_libs: mkl_rt, pthread

Python deps:
       pip: 19.1.1
setuptools: 41.0.1
   sklearn: 0.21.2
     numpy: 1.16.4
     scipy: 1.1.0
    Cython: None
    pandas: None

Also, you may be interested in:

jenkins@260bf77532d0:~/workspace/test$ pip list | grep numba
numba                  0.43.1         
jenkins@260bf77532d0:~/workspace/test$ pip list | grep torch
torch                  1.2.0a0+ab800ad

The build of torch must be done with gcc 5.5.0 to cause this problem; other versions of gcc are known not to cause this problem.

For ease of reproduction, you can use the Docker image ezyang/scikit-learn-tls-repro:1 (https://cloud.docker.com/repository/registry-1.docker.io/ezyang/scikit-learn-tls-repro). Once inside the container, follow the reproduction instructions described above. (EDIT: At the time of writing, the Docker image is still uploading. It should be done soon.)

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 22
  • Comments: 21 (7 by maintainers)

Most upvoted comments

pip install scikit-learn

I solved it by importing sklearn first and then tensorflow. The import order is what triggers this error.

pip install scikit-learn

You should use 'pip install scikit-learn==0.20.3', and before that you may want to run 'pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/'. That works for me.

Going through the threads for other projects related to this issue, I see a common resolution is to upgrade their version of glibc to be >=2.21 which includes this bug fix. Users that have Ubuntu >= 16.04 or Debian >= 9 would get glibc>=2.21 by default.

@nscozzaro What version of glibc do you have running on your system? (ldd --version)
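For anyone who would rather check this from Python than with ldd --version, here is a minimal sketch. The 2.21 threshold is simply the number quoted in the comment above and is not independently verified; the helper names (parse_version, glibc_meets_minimum) are made up for illustration. platform.libc_ver() is a standard-library call and may return empty strings on systems that do not use glibc.

```python
import platform
import re

# Threshold quoted in the comment above; treat it as the thread's claim,
# not as an independently verified fact.
MIN_GLIBC = (2, 21)


def parse_version(text):
    """Turn a string such as '2.23' into a tuple of ints, e.g. (2, 23)."""
    numbers = re.findall(r"\d+", text)
    return tuple(int(n) for n in numbers[:2])


def glibc_meets_minimum(version_text, minimum=MIN_GLIBC):
    """Return True if version_text is at least `minimum`.

    Returns False when the version cannot be parsed (for example an
    empty string on a non-glibc system).
    """
    version = parse_version(version_text)
    if len(version) < 2:
        return False
    return version >= minimum


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    print("C library:", lib or "unknown", version or "unknown")
    print("Meets the quoted minimum:", glibc_meets_minimum(version))
```

Versions are compared as integer tuples rather than strings, so that 2.3 is correctly treated as older than 2.21.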

Thanks, this comment helped me solve the issue.

Working thank you

The issue @ezyang linked has a bunch of information on this TLS (thread-local storage) issue. Here’s some info I dug up before: https://github.com/pytorch/pytorch/issues/2575#issuecomment-369892859

TL;DR: Something in the chain of imports was C/C++ code that was not compiled with the -fPIC flag. Importing that library causes a problem that turns all imports to “static TLS”. There is a maximum number of such “static TLS” slots (the names I use here are surely incorrect). The exact number of slots depends on the OS and how it was compiled.

In the linked pytorch issue 2575, there is a mention that it was OpenMP, compiled without the flag, that caused the cascade. This scikit-learn issue might be due to some new library being introduced, or some other change, eating just a few more static TLS slots.

Note: Not a real expert. There might be other sources for this error than “one/some lib missing the -fPIC flag when it was compiled”. Haven’t found one though.
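If you want to see which shared objects a chain of imports has actually pulled into the process, one rough way on Linux is to read /proc/self/maps. The sketch below only lists what is mapped; it does not tell you which library uses static TLS, and the entries come back in the order they appear in the maps file (by address), not in load order. The function names are made up for illustration.

```python
import os


def shared_objects_from_maps(maps_text):
    """Return the unique shared-object paths named in /proc/<pid>/maps text."""
    seen = []
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        name = os.path.basename(path)
        is_shared_object = name.endswith(".so") or ".so." in name
        if is_shared_object and path not in seen:
            seen.append(path)
    return seen


def loaded_shared_objects():
    """List shared objects mapped into this process (empty list off Linux)."""
    try:
        with open("/proc/self/maps") as handle:
            return shared_objects_from_maps(handle.read())
    except OSError:
        return []  # /proc is not available on this platform
```

Calling loaded_shared_objects() before and after each import and comparing the two lists shows which libraries that import brought in, which may help narrow down the suspect.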

@ezyang you may want to share the Dockerfile if that’s possible.

If anyone is interested in reproducing this error the right docker incantation to use is something like this:

docker run -it ezyang/scikit-learn-tls-repro:1 bash

Note that you need to specify the tag i.e. 1 explicitly otherwise you get a cryptic error message (the ‘latest’ tag does not exist):

Unable to find image 'ezyang/scikit-learn-tls-repro:latest' locally
docker: Error response from daemon: manifest for ezyang/scikit-learn-tls-repro:latest not found.

I have no idea why this would happen, but I have seen numerous bug reports related to this, e.g. with pytorch and OpenCV https://github.com/pytorch/pytorch/issues/2083 or OpenCV and TensorFlow https://github.com/tensorflow/models/issues/523. All in all I would guess that this is not a scikit-learn bug.

The fact that it depends on the order of import is fishy, for example this works in your docker image:

python -c 'import torch; import sklearn; import soundfile; import scipy.signal; import numba'
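To check the order dependence more systematically, something like the following sketch could be used. It runs each import order in a fresh interpreter (so one failed dlopen does not affect the next attempt) and collects the orders that fail. The function names are hypothetical, and trying every permutation of the five modules from the report means 120 subprocess runs, so it may take a while.

```python
import itertools
import subprocess
import sys


def try_import_order(modules, timeout=120):
    """Import `modules` in order in a fresh interpreter.

    Returns (ok, last_stderr_line). Raises subprocess.TimeoutExpired if
    the child process does not finish within `timeout` seconds.
    """
    code = "; ".join("import " + name for name in modules)
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # text mode; also works on Python 3.6
        timeout=timeout,
    )
    lines = result.stderr.strip().splitlines()
    last_line = lines[-1] if lines else ""
    return result.returncode == 0, last_line


def failing_orders(modules):
    """Try every permutation of `modules`; return [(order, message), ...]."""
    failures = []
    for order in itertools.permutations(modules):
        ok, message = try_import_order(order)
        if not ok:
            failures.append((order, message))
    return failures


# Example, inside the repro image (module list taken from the report):
#   for order, message in failing_orders(
#           ["torch", "soundfile", "scipy.signal", "numba", "sklearn"]):
#       print(" -> ".join(order), "|", message)
```

Only the last line of stderr is kept per failure, which for an ImportError is normally the exception message itself.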
                
Collecting scikit-learn
  Using cached https://files.pythonhosted.org/packages/85/04/49633f490f726da6e454fddc8e938bbb5bfed2001681118d3814c219b723/scikit_learn-0.21.2-cp36-cp36m-manylinux1_x86_64.whl