auto-sklearn: Error running "fit" with many cores.
Hi! I’m experiencing a problem when I fit an AutoSklearn instance in a virtual machine with many cores.
I have run exactly the same code, with the same dataset in three different virtual machines:
- in a VM with 4 cores and 15 GB of RAM: works ok ✅
- in a VM with 8 cores and 30 GB of RAM: works ok ✅
- in a VM with 40 cores and 157 GB of RAM: fails ❌ with the following error:
ValueError: Dummy prediction failed with run state StatusType.CRASHED and additional output: {'error': 'Result queue is empty', 'exit_status': "<class 'pynisher.limit_function_call.AnythingException'>", 'subprocess_stdout': '', 'subprocess_stderr': 'Process pynisher function call:\nTraceback (most recent call last):\n File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap\n self.run()\n File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run\n self._target(*self._args, **self._kwargs)\n File "/usr/local/lib/python3.7/site-packages/pynisher/limit_function_call.py", line 133, in subprocess_func\n return_value = ((func(*args, **kwargs), 0))\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/__init__.py", line 40, in fit_predict_try_except_decorator\n return ta(queue=queue, **kwargs)\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 1164, in eval_holdout\n budget_type=budget_type,\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 194, in __init__\n budget_type=budget_type,\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/abstract_evaluator.py", line 199, in __init__\n threadpool_limits(limits=1)\n File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 171, in __init__\n self._original_info = self._set_threadpool_limits()\n File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 280, in _set_threadpool_limits\n module.set_num_threads(num_threads)\n File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 659, in set_num_threads\n return set_func(num_threads)\nKeyboardInterrupt\n', 'exitcode': 1, 'configuration_origin': 'DUMMY'}.
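The innermost frames of that traceback are in threadpoolctl: auto-sklearn calls threadpool_limits(limits=1) in each evaluator process to pin the BLAS libraries to a single thread, and that is where the dummy run dies. As a hypothetical isolation check (not part of the original report), one could try that same call on its own on the large VM:

# Hypothetical isolation check: does limiting the BLAS thread pool work on this machine?
import numpy as np                      # loads OpenBLAS, so there is a pool to limit
from threadpoolctl import threadpool_info, threadpool_limits

print(threadpool_info())                # which BLAS/OpenMP libraries were detected
with threadpool_limits(limits=1):       # the call auto-sklearn makes before fitting
    np.ones((1000, 1000)) @ np.ones((1000, 1000))
print("threadpool_limits(limits=1) completed without error")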
This is the code I was running:
from autosklearn.classification import AutoSklearnClassifier
from autosklearn.metrics import roc_auc
automl = AutoSklearnClassifier(time_left_for_this_task=600, metric=roc_auc)
automl.fit(x_train, y_train, x_validation, y_validation)
Limiting the number of cores with the param nproc seems to work, but it’s a pity that we cannot take advantage of larger infra 😦
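The report doesn't show how the cores were limited. One hypothetical, Linux-only way to emulate it (an assumption, not the reporter's actual command) is to pin the process to a few CPUs before auto-sklearn and its BLAS backend start up; whether this actually avoids the crash depends on whether OpenBLAS honours the affinity mask when sizing its thread pool:

import os
# Hypothetical sketch (Linux-only): restrict this process to 4 CPUs so that
# libraries which size their thread pools from the visible core count see a
# small machine. Must happen before numpy/autosklearn are imported.
os.sched_setaffinity(0, {0, 1, 2, 3})
from autosklearn.classification import AutoSklearnClassifier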
The dataset doesn’t seem to be the problem: I reproduced the bug with datasets of different sizes and different feature types, and every time it raises the same error (it’s not something that happens stochastically).
Also, the error is almost instantaneous: clearly it doesn’t even start to fit when it fails.
Environment and installation:
- OS: linux
- Python version: 3.7
- Auto-sklearn version: 0.13.0
About this issue
- Original URL
- State: open
- Created 3 years ago
- Reactions: 13
- Comments: 17 (4 by maintainers)
The workaround I found to fix this issue is to limit the number of cores with the env var OPENBLAS_NUM_THREADS before importing anything from autosklearn. For example:
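A minimal sketch of that workaround, assuming a limit of one OpenBLAS thread:

import os
# Set the limit before numpy/scipy/autosklearn are imported, otherwise OpenBLAS
# has already created one thread per core.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
from autosklearn.classification import AutoSklearnClassifier  # import only after the env var is set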
I have been getting this error as well on macOS Monterey 12.0 and auto-sklearn==0.13.0, and I had not updated any libraries in my environment before this error started showing up. It happens when calling fit regardless of parameters:

Hey @erinaldi,
We recently started reworking pynisher, which is in charge of limiting resources for spawned processes. This error line is directly from pynisher and is in the comment you linked:

resource.setrlimit(resource.RLIMIT_AS, (mem_in_b, mem_in_b))
ValueError: current limit exceeds maximum limit

We have another push on getting it to work tomorrow, hopefully, but we still need a solution for Windows before we can make a release with it.

If you'd like more context or have any solutions: we can use the builtin Python module resource for limiting memory on Unix-based systems, but there is no Windows equivalent, as it's a Unix-only module. We need to find a substitute and then set up some local testing for it (we have no Windows machines). There are also other discrepancies between the three core operating systems. The error above seems to happen regardless of the memory you provide for RLIMIT_XXX, and we think that RLIMIT_AS only works on Linux, or at least doesn't work on newer macOS systems.

If we can't get a Windows version working soon, we will push the Mac-fixed version as soon as we can and hopefully it will solve the issue for you 😃
Best, Eddie
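For readers unfamiliar with the Unix-only API discussed above, here is a minimal sketch of how resource.setrlimit is used (illustrative values, not code from pynisher or auto-sklearn); Python raises the quoted ValueError when the OS rejects the requested pair of limits:

import resource

# RLIMIT_AS caps the process's virtual address space (Unix only).
mem_in_b = 4 * 1024 ** 3                                 # hypothetical 4 GiB cap
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("current RLIMIT_AS:", soft, hard)

try:
    # pynisher sets both the soft and the hard limit to the requested value
    # (as in the error line quoted above); if the OS rejects the pair with
    # EINVAL, Python raises "ValueError: current limit exceeds maximum limit".
    resource.setrlimit(resource.RLIMIT_AS, (mem_in_b, mem_in_b))
except ValueError as e:
    print("setrlimit rejected the value:", e)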
Hi @sofidenner,
We don’t have the infrastructure (a machine with that many cores) to test this properly, which makes things difficult, but we just want to write here to say we are aware of the issue and are sorry that we have no solution as of yet.
I get the same error as https://github.com/automl/auto-sklearn/issues/360#issuecomment-963293965, on macOS Monterey with an M1 Pro. The installation was successful and importing the package works, so it seems to be related to this. I understand auto-sklearn is not tested on macOS, but I thought I would report this known issue anyway in case someone finds a solution that does not require downgrading the OS.
Hey @mfeurer, I understand. I tried with a lower memory configuration and it still didn't go away. However, in case anyone faces this issue like me: the error stopped happening when I downgraded my macOS version to below 12.0 (Monterey).