ray: [cloudpickle] Overly aggressive cloudpickle override breaks scikit-learn usage

What is the problem?

import ray
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(
    n_samples=11000,
    n_features=1000,
    n_informative=50,
    n_redundant=0,
    n_classes=10,
    class_sep=2.5)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Example parameters to tune from SGDClassifier
parameter_grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}
from sklearn.model_selection import GridSearchCV
# n_jobs=-1 would enable all cores (as Tune does); n_jobs=4 is enough to reproduce
sklearn_search = GridSearchCV(SGDClassifier(), parameter_grid, n_jobs=4, cv=4)
sklearn_search.fit(x_train, y_train)

Running this produces:

--------------------------------------------------------------------------------
LokyProcess-10 failed with traceback:
--------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rliaw/miniconda3/envs/test/lib/python3.7/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 197, in <module>
    prep_data = pickle.load(from_parent)
ValueError: unsupported pickle protocol: 5


--------------------------------------------------------------------------------
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {EXIT(1)}

This happens because scikit-learn uses joblib, a library for parallel computation. Joblib runs work in child processes, so it has an IPC layer where the client side pickles data (with cloudpickle) to transfer it to the children. I think these clients pick up some of Ray's cloudpickle attributes by accident, causing deserialization in the child to break down: the parent serializes with protocol 5 (presumably via the pickle5 backport that Ray's cloudpickle bundles), while the child's stdlib pickle on Python 3.7 only understands up to protocol 4, hence the ValueError above.
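
A minimal sketch of that hypothesis (assuming Ray's vendored cloudpickle emits protocol-5 pickles on Python 3.7, where the stdlib tops out at protocol 4):

import pickle
import ray.cloudpickle as ray_pickle

# "Parent" side: serialize with Ray's vendored cloudpickle, which
# (by assumption) uses pickle protocol 5 via its pickle5 backport.
blob = ray_pickle.dumps(lambda x: x + 1)

# "Child" side: the stdlib unpickler, which only knows protocols <= 4
# on Python 3.7.
try:
    pickle.loads(blob)
except ValueError as e:
    print(e)  # expected: unsupported pickle protocol: 5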

Can we keep the Ray cloudpickle port from affecting third-party libs?

cc @suquark
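
For context, here is where loky selects its pickler at import time (this appears to be joblib/externals/loky/backend/reduction.py):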

import os
import sys

if sys.platform != "win32":
    from ._posix_reduction import _mk_inheritable  # noqa: F401
else:
    from . import _win_reduction  # noqa: F401

# Global variable to change the pickler behavior.
try:
    from joblib.externals import cloudpickle  # noqa: F401
    DEFAULT_ENV = "cloudpickle"
except ImportError:
    # If cloudpickle is not present, fall back to the stdlib pickle.
    DEFAULT_ENV = "pickle"

# LOKY_PICKLER overrides the default pickler; note it is read once, at import time.
ENV_LOKY_PICKLER = os.environ.get("LOKY_PICKLER", DEFAULT_ENV)
_LokyPickler = None
_loky_pickler_name = None
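
Since ENV_LOKY_PICKLER is read from the environment at import time, one possible workaround (untested, based only on the excerpt above) is to pin loky to the stdlib pickler before joblib, or anything that imports it, is loaded:

import os

# Must happen before joblib/loky is imported, because LOKY_PICKLER is
# read once at import time (see the excerpt above).
os.environ["LOKY_PICKLER"] = "pickle"

import ray  # noqa: E402
from sklearn.model_selection import GridSearchCV  # noqa: E402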

Most upvoted comments

I'm parallelizing the 10-fold cross-validation as folds_val_test = Parallel(n_jobs=10)(delayed(train)(fold) for fold in range(n_folds)), which uses scikit-learn inside each fold.
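
(For reference, a self-contained version of that pattern, with a hypothetical train stub standing in for the real per-fold training:)

from joblib import Parallel, delayed

n_folds = 10

def train(fold):
    # Hypothetical stub; the real function trains and evaluates one fold.
    return fold

folds_val_test = Parallel(n_jobs=10)(delayed(train)(fold) for fold in range(n_folds))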

Issue:

LokyProcess-10 failed with traceback:
Traceback (most recent call last):
  File "/home//my-kernel/lib/python3.6/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 197, in <module>
    prep_data = pickle.load(from_parent)
ValueError: unsupported pickle protocol: 5

Trial 0 failed because of the following error: TerminatedWorkerError('A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.\n\nThe exit codes of the workers are {EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1), EXIT(1)}',)

Traceback (most recent call last):
  File "/home/chethan-kernel/my-kernel/lib/python3.6/site-packages/optuna/_optimize.py", line 217, in _run_trial
    value_or_values = func(trial)
  File "/home/thesis/OPC_Full/OPC_Full_Dataset/EV_GCN-master/train_eval_evgcn.py", line 140, in objective
    folds_val_test = Parallel(n_jobs=10)(delayed(train)(fold) for fold in range(n_folds))
  File "/home/chethan-kernel/my-kernel/lib/python3.6/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/home/chethan-kernel/my-kernel/lib/python3.6/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/chve882b/chethan-kernel/my-kernel/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/chethan-kernel/my-kernel/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/home/chethan-kernel/my-kernel/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

@suquark let me get another repro for this, as I think it is out of date.

Another fix I've found is to disable joblib parallelism, i.e., n_jobs=1. @xianyinxin, as mentioned on Slack, I'd love to know more about what your workload is (how you're using Ray + sklearn).
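
Applied to the repro at the top, that workaround is just:

# Serial joblib sidesteps the loky/cloudpickle IPC entirely.
sklearn_search = GridSearchCV(SGDClassifier(), parameter_grid, n_jobs=1, cv=4)
sklearn_search.fit(x_train, y_train)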