threadpoolctl: Slower without `OMP_NUM_THREADS=1` than with `OMP_NUM_THREADS=1`

I tried `with threadpool_limits(1, user_api=None):` on a not-so-simple case: https://gitlab.com/paugier/tsp-pythran (branch threadpoolctl) on Debian.
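For readers who have not used it, the call mentioned above has roughly the following shape. This is only a minimal sketch, not the content of run-test-omp.py: `work` is a hypothetical stand-in for the compiled search function, `timed_run` is a helper name invented here, and the fallback to a no-op context is there only so the snippet also runs where threadpoolctl is not installed.

```python
import contextlib
import time

try:
    from threadpoolctl import threadpool_limits
except ImportError:  # assumption: degrade to "no limit" if threadpoolctl is missing
    threadpool_limits = None


def work(n=200_000):
    """Hypothetical stand-in for the compiled, OpenMP-accelerated function."""
    return sum(i * i for i in range(n))


def timed_run(limit=None):
    """Run work() once, optionally under threadpool_limits, and return seconds."""
    if limit is not None and threadpool_limits is not None:
        # user_api=None applies the limit to every supported thread pool
        ctx = threadpool_limits(limits=limit, user_api=None)
    else:
        ctx = contextlib.nullcontext()
    with ctx:
        start = time.perf_counter()
        work()
        return time.perf_counter() - start


if __name__ == "__main__":
    print(f"no limit: {timed_run():.3f} s")
    print(f"limit=1:  {timed_run(1):.3f} s")
```

With a pure-Python `work` the two timings should be close; the interesting comparison only appears once `work` calls into code that actually uses OpenMP or BLAS threads.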

The case uses Pythran (through Transonic, but I don’t see how that could change anything here) to build an extension accelerated with OpenMP. Pythran uses the system OpenBLAS library.

To reproduce (sorry, I don’t use Git):

hg clone https://gitlab.com/paugier/tsp-pythran.git
cd tsp-pythran
hg up threadpoolctl
# compile the extension with openmp
transonic tsp.py -pf "-march=native -DUSE_XSIMD -fopenmp"
# wait to get the extension ready
python run-test-omp.py
OMP_NUM_THREADS=1 python run-test-omp.py
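The last two commands differ only in the environment variable, so the comparison can also be scripted. The following is a sketch under assumptions: `run_with_omp_threads` is a helper name invented for illustration, and it presumes run-test-omp.py sits in the current directory and prints its timings to standard output.

```python
import os
import subprocess
import sys


def run_with_omp_threads(cmd, n_threads=None):
    """Run cmd with OMP_NUM_THREADS set to n_threads (unset if None); return stdout."""
    env = dict(os.environ)
    env.pop("OMP_NUM_THREADS", None)  # start from a clean state
    if n_threads is not None:
        env["OMP_NUM_THREADS"] = str(n_threads)
    result = subprocess.run(
        cmd, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    script = [sys.executable, "run-test-omp.py"]  # assumed to be in the cwd
    for n in (None, 1, 2):
        print(f"--- OMP_NUM_THREADS={n} ---")
        print(run_with_omp_threads(script, n))
```

Running each setting in its own subprocess matters because the variable is read when the threading runtime is initialized, so changing it inside an already running interpreter may have no effect.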

The good news is that threadpoolctl manages to reduce the number of threads used by OpenMP. However, I get a strange result that I don’t understand:

I’m not sure it’s an issue, but python run-test-omp.py (or OMP_NUM_THREADS=2 python run-test-omp.py) runs slower than OMP_NUM_THREADS=1 python run-test-omp.py.

I actually get the same behavior if the extension is built without OpenMP, i.e. just with transonic tsp.py.

OMP_NUM_THREADS=1 python run-test-omp.py
[{'filename_prefixes': ('libopenblas',),
  'internal_api': 'openblas',
  'module_path': '/home/users/augier3pi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/numpy/.libs/libopenblasp-r0-382c8f3a.3.5.dev.so',
  'n_thread': 1,
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'version': '0.3.5.dev'},
 {'filename_prefixes': ('libiomp', 'libgomp', 'libomp', 'vcomp'),
  'internal_api': 'openmp',
  'module_path': '/usr/lib/x86_64-linux-gnu/libgomp.so.1',
  'n_thread': 1,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filename_prefixes': ('libopenblas',),
  'internal_api': 'openblas',
  'module_path': '/home/users/augier3pi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/scipy/.libs/libopenblasp-r0-8dca6697.3.0.dev.so',
  'n_thread': 1,
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'version': None},
 {'filename_prefixes': ('libopenblas',),
  'internal_api': 'openblas',
  'module_path': '/usr/lib/libopenblas.so.0',
  'n_thread': 1,
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'version': None}]
start search
run time = 0.43 s
start search
run time = 0.43 s
start search
run time = 0.44 s
start search
run time = 0.46 s


python run-test-omp.py
[{'filename_prefixes': ('libopenblas',),
  'internal_api': 'openblas',
  'module_path': '/home/users/augier3pi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/numpy/.libs/libopenblasp-r0-382c8f3a.3.5.dev.so',
  'n_thread': 4,
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'version': '0.3.5.dev'},
 {'filename_prefixes': ('libiomp', 'libgomp', 'libomp', 'vcomp'),
  'internal_api': 'openmp',
  'module_path': '/usr/lib/x86_64-linux-gnu/libgomp.so.1',
  'n_thread': 4,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filename_prefixes': ('libopenblas',),
  'internal_api': 'openblas',
  'module_path': '/home/users/augier3pi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/scipy/.libs/libopenblasp-r0-8dca6697.3.0.dev.so',
  'n_thread': 4,
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'version': None},
 {'filename_prefixes': ('libopenblas',),
  'internal_api': 'openblas',
  'module_path': '/usr/lib/libopenblas.so.0',
  'n_thread': 4,
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'version': None}]
start search
run time = 0.57 s
start search
run time = 0.59 s
start search
run time = 0.58 s
start search
run time = 0.58 s
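To put a number on the difference between the two runs above, the reported timings can be averaged. A small sketch using only the values printed in the logs; the variable and function names are chosen here for illustration.

```python
import re
from statistics import mean

# Timings copied from the two logs above
WITH_OMP_1 = """run time = 0.43 s
run time = 0.43 s
run time = 0.44 s
run time = 0.46 s"""

WITHOUT_OMP_1 = """run time = 0.57 s
run time = 0.59 s
run time = 0.58 s
run time = 0.58 s"""


def run_times(output):
    """Extract every 'run time = X s' value from a log as a float."""
    return [float(t) for t in re.findall(r"run time = ([0-9.]+) s", output)]


mean_with = mean(run_times(WITH_OMP_1))        # about 0.44 s
mean_without = mean(run_times(WITHOUT_OMP_1))  # about 0.58 s
slowdown = mean_without / mean_with            # about 1.32

if __name__ == "__main__":
    print(f"OMP_NUM_THREADS=1: {mean_with:.2f} s")
    print(f"default:           {mean_without:.2f} s")
    print(f"slowdown factor:   {slowdown:.2f}")
```

So the run without OMP_NUM_THREADS=1 is roughly 30% slower on average, well outside the spread of about 0.03 s seen within each set of four runs.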

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 20 (3 by maintainers)

Most upvoted comments

But I don’t believe threadpoolctl can do anything in this case.

I tend to agree, but I am not sure. It looks like OpenBLAS relies on OMP_NUM_THREADS=1 here, but they don’t seem to check omp_get_max_threads to disable the mapping. I did not investigate enough to see whether it could be set programmatically.
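As a starting point for that kind of investigation, the value OpenMP itself reports can be read from Python with ctypes. This is only a sketch: `openmp_max_threads` is a name invented here, it assumes the runtime is GNU libgomp and that the system linker can locate it, and it says nothing about whether OpenBLAS would honor a value changed at run time.

```python
import ctypes
import ctypes.util


def openmp_max_threads(lib_name="gomp"):
    """Return omp_get_max_threads() from the named OpenMP runtime, or None."""
    path = ctypes.util.find_library(lib_name)
    if path is None:
        return None  # runtime not found on this system
    try:
        lib = ctypes.CDLL(path)
        func = lib.omp_get_max_threads  # standard OpenMP runtime routine
    except (OSError, AttributeError):
        return None
    func.restype = ctypes.c_int
    func.argtypes = []
    return func()


if __name__ == "__main__":
    print("omp_get_max_threads():", openmp_max_threads())
```

Note that the library found this way is not necessarily the same copy that a given extension module was linked against, which is one reason threadpoolctl inspects the libraries already loaded in the process instead.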

So this seems to be an issue unrelated to this library, no? Note that we just reduced the overhead of the context manager, so the results should now be even closer with or without threadpool_limits.

Let us know if you feel like there is still some issue.