ray: [cloudpickle] Too much override for cloudpickle, breaks scikit-learn usage
What is the problem?
import ray
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
import numpy as np
X, y = make_classification(
    n_samples=11000,
    n_features=1000,
    n_informative=50,
    n_redundant=0,
    n_classes=10,
    class_sep=2.5)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1000)
# Example parameters to tune from SGDClassifier
parameter_grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}
from sklearn.model_selection import GridSearchCV
# n_jobs > 1 enables use of multiple cores like Tune does (n_jobs=-1 would use all of them)
sklearn_search = GridSearchCV(SGDClassifier(), parameter_grid, n_jobs=4, cv=4)
sklearn_search.fit(x_train, y_train)
Running this (with ray imported) produces:
--------------------------------------------------------------------------------
LokyProcess-10 failed with traceback:
--------------------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/rliaw/miniconda3/envs/test/lib/python3.7/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 197, in <module>
prep_data = pickle.load(from_parent)
ValueError: unsupported pickle protocol: 5
--------------------------------------------------------------------------------
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {EXIT(1)}
This happens because scikit-learn uses joblib, a library for parallel computation. Joblib runs work in child processes (via loky) and uses cloudpickle on the parent side to transfer data to them over IPC. I think these child processes accidentally pick up some of Ray's cloudpickle attributes, causing deserialization to break down.
Can we keep the Ray cloudpickle port from affecting third-party libraries?
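A minimal sketch of the failure mode, assuming the parent process ends up pickling with protocol 5 (e.g., via the pickle5 backport that Ray bundles on Python 3.7) while the stock pickle in the loky child only understands protocols up to 4:

import pickle

# Protocol-5 payloads start with the PROTO opcode (0x80) followed by the
# protocol number 0x05. A child whose pickle module tops out at protocol 4
# fails on this header with "ValueError: unsupported pickle protocol: 5".
payload = pickle.dumps({"x": 1}, protocol=5)  # needs Python >= 3.8 or pickle5
assert payload[:2] == b"\x80\x05"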
cc @suquark
# Pickler-selection logic quoted from joblib's vendored loky backend,
# which decides how work is serialized for child processes.
import os
import sys

if sys.platform != "win32":
    from ._posix_reduction import _mk_inheritable  # noqa: F401
else:
    from . import _win_reduction  # noqa: F401

# global variable to change the pickler behavior
try:
    from joblib.externals import cloudpickle  # noqa: F401
    DEFAULT_ENV = "cloudpickle"
except ImportError:
    # If cloudpickle is not present, fall back to pickle
    DEFAULT_ENV = "pickle"

ENV_LOKY_PICKLER = os.environ.get("LOKY_PICKLER", DEFAULT_ENV)
_LokyPickler = None
_loky_pickler_name = None
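Given that selection logic, one untested workaround sketch is to pin loky to the stdlib pickler via the LOKY_PICKLER environment variable shown above; note that plain pickle cannot serialize lambdas or interactively defined functions, so this only helps when the estimators and parameters are importable:

import os

# Must be set before joblib/loky is imported, since LOKY_PICKLER is read
# at import time in the snippet above.
os.environ["LOKY_PICKLER"] = "pickle"

from sklearn.model_selection import GridSearchCV  # joblib is imported after the variable is set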
About this issue
- State: open
- Created 4 years ago
- Reactions: 2
- Comments: 29 (24 by maintainers)
I’m parallelizing the 10-fold cross-validation as "folds_val_test = Parallel(n_jobs=10)(delayed(train)(fold) for fold in range(n_folds))", which uses scikit-learn inside each fold.
Issue:

LokyProcess-10 failed with traceback:
Traceback (most recent call last):
  File "/home//my-kernel/lib/python3.6/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 197, in <module>
    prep_data = pickle.load(from_parent)
ValueError: unsupported pickle protocol: 5

Trial 0 failed because of the following error: TerminatedWorkerError('A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.\n\nThe exit codes of the workers are {EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1)}',)
Traceback (most recent call last):
  File "/home/chethan-kernel/my-kernel/lib/python3.6/site-packages/optuna/_optimize.py", line 217, in _run_trial
    value_or_values = func(trial)
  File "/home/thesis/OPC_Full/OPC_Full_Dataset/EV_GCN-master/train_eval_evgcn.py", line 140, in objective
    folds_val_test = Parallel(n_jobs=10)(delayed(train)(fold) for fold in range(n_folds))
  File "/home/chethan-kernel/my-kernel/lib/python3.6/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/home/chethan-kernel/my-kernel/lib/python3.6/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/chve882b/chethan-kernel/my-kernel/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/chethan-kernel/my-kernel/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/home/chethan-kernel/my-kernel/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
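For context, a minimal self-contained version of that pattern (train here is a hypothetical stand-in for the reporter's per-fold training routine):

from joblib import Parallel, delayed

def train(fold):
    # hypothetical stand-in: the real routine trains a model on one CV fold
    return fold * fold

n_folds = 10
# Each loky worker unpickles the (train, fold) payload sent by the parent;
# that deserialization step is where the protocol-5 ValueError above fires.
folds_val_test = Parallel(n_jobs=n_folds)(delayed(train)(fold) for fold in range(n_folds))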
@suquark let me get another repro for this, as I think it is out of date.
Another fix I’ve found is to disable joblib parallelism, i.e.,
n_jobs=1. @xianyinxin, as mentioned on Slack, I’d love to know more about your workload (how you’re using Ray + sklearn)
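A related sketch, assuming the installed Ray version ships ray.util.joblib: registering Ray's own joblib backend routes the scikit-learn work through Ray workers instead of loky, sidestepping the pickler conflict entirely:

import joblib
import ray
from ray.util.joblib import register_ray

ray.init()
register_ray()  # registers a "ray" backend with joblib

# GridSearchCV now fans out over Ray workers rather than loky processes.
with joblib.parallel_backend("ray"):
    sklearn_search.fit(x_train, y_train)  # sklearn_search from the repro above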