scikit-learn: Parallelism in GridSearchCV is ending up with a permission error

Description - Parallelism (n_jobs=-1) in GridSearchCV stops with a PermissionError.

Steps/Code to Reproduce -

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.utils import parallel_backend

# X_Train_Vectors, X_test_Vectors and Y_Train are defined earlier (not shown here)
# Standardization of data
X_Train_Vectors_Std = StandardScaler(with_mean=False).fit_transform(X_Train_Vectors)
X_test_Vectors_Std = StandardScaler(with_mean=False).fit_transform(X_test_Vectors)

# List of lambda (C) values to be searched
lambdaList = [10**-4, 10**-2, 10**0, 10**2, 10**4]
time_split = TimeSeriesSplit(n_splits=5)
param_search = dict(C=lambdaList)

grid = GridSearchCV(estimator=LogisticRegression(solver='saga'),
                    param_grid=param_search,
                    n_jobs=-1,
                    scoring='f1_weighted',
                    cv=time_split.split(X_Train_Vectors_Std),
                    return_train_score=True)
grid.fit(X_Train_Vectors_Std, Y_Train)
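
The snippet above imports parallel_backend but never uses it. As a sketch of one thing that could be tried (an assumption on my part, not something verified in this report), the search could be run under the thread-based backend, which avoids the loky worker processes and therefore the joblib memmapping temp folder that the traceback below complains about:

# Sketch only: forcing the threading backend keeps the fit in a single process,
# so no joblib_memmapping_folder_* temp files are created on Windows.
# Thread-based parallelism may be slower for estimators that do not release the GIL.
with parallel_backend('threading'):
    grid.fit(X_Train_Vectors_Std, Y_Train)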

Expected Results: No error is expected.

Actual Results

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\joblib\disk.py:122: UserWarning: Unable to delete folder C:\Users\HANI\AppData\Local\Temp\joblib_memmapping_folder_13296_3875384810 after 5 tentatives.
  .format(folder_path, RM_SUBDIRS_N_RETRY))

---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-6-c065dfe04993> in <module>()
      9 grid = GridSearchCV(estimator = LogisticRegression(solver='saga'), param_grid = param_search,n_jobs = -1, scoring = 'f1_weighted', cv=time_split.split(X_Train_Vectors_Std)
     10                           ,return_train_score = True )
---> 11 grid.fit(X_Train_Vectors_Std,Y_Train)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    720                 return results_container[0]
    721 
--> 722             self._run_search(evaluate_candidates)
    723 
    724         results = results_container[0]

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __exit__(self, exc_type, exc_value, traceback)
    730 
    731     def __exit__(self, exc_type, exc_value, traceback):
--> 732         self._terminate_backend()
    733         self._managed_backend = False
    734 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _terminate_backend(self)
    760     def _terminate_backend(self):
    761         if self._backend is not None:
--> 762             self._backend.terminate()
    763 
    764     def _dispatch(self, batch):

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in terminate(self)
    524             # in latter calls but we free as much memory as we can by deleting
    525             # the shared memory
--> 526             delete_folder(self._workers._temp_folder)
    527             self._workers = None
    528 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\joblib\disk.py in delete_folder(folder_path, onerror)
    113             while True:
    114                 try:
--> 115                     shutil.rmtree(folder_path, False, None)
    116                     break
    117                 except (OSError, WindowsError):

C:\ProgramData\Anaconda3\lib\shutil.py in rmtree(path, ignore_errors, onerror)
    492             os.close(fd)
    493     else:
--> 494         return _rmtree_unsafe(path, onerror)
    495 
    496 # Allow introspection of whether or not the hardening against symlink

C:\ProgramData\Anaconda3\lib\shutil.py in _rmtree_unsafe(path, onerror)
    387                 os.unlink(fullname)
    388             except OSError:
--> 389                 onerror(os.unlink, fullname, sys.exc_info())
    390     try:
    391         os.rmdir(path)

C:\ProgramData\Anaconda3\lib\shutil.py in _rmtree_unsafe(path, onerror)
    385         else:
    386             try:
--> 387                 os.unlink(fullname)
    388             except OSError:
    389                 onerror(os.unlink, fullname, sys.exc_info())

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\HANI\\AppData\\Local\\Temp\\joblib_memmapping_folder_13296_3875384810\\13296-2443532547352-7b8cd102e07c472ab00885ea9ca3e72d.pkl'

Versions

Windows-10-10.0.17134-SP0
Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
NumPy 1.15.2
SciPy 1.1.0
Scikit-Learn 0.20.0

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 3
  • Comments: 57 (38 by maintainers)

Most upvoted comments

Interestingly,

import numpy as np
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier


for _ in range(10):
    X_train = np.random.rand(int(2e6)).reshape((int(1e6), 2))
    y_train = np.random.randint(0, 2, int(1e6))
    X_train = pd.DataFrame(X_train)

    clf = RandomForestClassifier()
    gs = RandomizedSearchCV(
        clf,
        param_distributions={"n_estimators": np.array([1]),
                             "max_depth": np.array([2])},
        n_iter=1,
        cv=2,
        scoring="accuracy",
        verbose=1,
        n_jobs=2
    )

    gs.fit(X_train, y_train)

always fails (though never at the first iteration of the for loop). Note the use of a pandas DataFrame for X_train.

However, when X_train is a numpy array

import numpy as np
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier


for _ in range(10):
    X_train = np.random.rand(int(2e6)).reshape((int(1e6), 2))
    y_train = np.random.randint(0, 2, int(1e6))

    clf = RandomForestClassifier()
    gs = RandomizedSearchCV(
        clf,
        param_distributions={"n_estimators": np.array([1]),
                             "max_depth": np.array([2])},
        n_iter=1,
        cv=2,
        scoring="accuracy",
        verbose=1,
        n_jobs=2
    )

    gs.fit(X_train, y_train)

does not fail.

The fact that I cannot reproduce this in a VM might be because memory-mapped files may behave differently in a VM (a sketch of the memmapping mechanism follows this comment).

I will try to reproduce with a CI worker in this PR: https://github.com/joblib/joblib/pull/942

Please post your comments on the joblib side.
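
For context, a minimal sketch (not taken from this thread) of the automatic memmapping mentioned above: with the default loky backend, joblib dumps large numpy inputs to .pkl files in a joblib_memmapping_folder_* temp directory and passes them to the workers as memory maps, and that directory is the one the parent process later fails to delete on Windows. The head_sum helper below is made up purely for illustration.

import numpy as np
from joblib import Parallel, delayed  # vendored as sklearn.externals.joblib in scikit-learn 0.20


def head_sum(a):
    # toy helper, only to give the workers something to do
    return float(a[:10].sum())


# ~16 MB array, well above joblib's default 1 MB memmapping threshold (max_nbytes='1M')
X = np.random.rand(int(1e6), 2)

# With the loky backend, X is written once to a temporary .pkl file and each
# worker opens it as a numpy memmap instead of receiving a pickled copy.
results = Parallel(n_jobs=2)(delayed(head_sum)(X) for _ in range(4))
print(results)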

Yes, in my last tests the same thing happened as you mentioned. Numpy arrays work (at least in the implementation I am working on), and if I try a DataFrame, it tends to raise an error around the fifth iteration.

Actually, when in the same IPython session I first run the snippet with X_train as a pandas DataFrame and then the snippet with X_train as a numpy array, I get the error for both. But if I quit the IPython session, start a new one, and run only the snippet with X_train as a numpy array, I don’t get an error.

@albertcthomas Thanks for providing the snippets. For me, both pieces of code, with numpy arrays and with DataFrames, give an error.

With a DataFrame as X_train:

Fitting 2 folds for each of 1 candidates, totalling 2 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.5s finished
C:\Users\lucas\Anaconda3\lib\site-packages\sklearn\externals\joblib\disk.py:122: UserWarning: Unable to delete folder C:\Users\lucas\AppData\Local\Temp\joblib_memmapping_folder_113764_4112142399 after 5 tentatives.
  .format(folder_path, RM_SUBDIRS_N_RETRY))

The run with numpy arrays:

Fitting 2 folds for each of 1 candidates, totalling 2 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.4s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.4s finished
C:\Users\lucas\Anaconda3\lib\site-packages\sklearn\externals\joblib\disk.py:122: UserWarning: Unable to delete folder C:\Users\lucas\AppData\Local\Temp\joblib_memmapping_folder_113764_4112142399 after 5 tentatives.
  .format(folder_path, RM_SUBDIRS_N_RETRY))

In scikit-learn 0.19.2, I tried running the code you provided:

import numpy as np
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
import sklearn

print(sklearn.__version__)
for _ in range(10):
    X_train = np.random.rand(int(2e6)).reshape((int(1e6), 2))
    y_train = np.random.randint(0, 2, int(1e6))
    X_train = pd.DataFrame(X_train)

    clf = RandomForestClassifier()
    gs = RandomizedSearchCV(
        clf,
        param_distributions={"n_estimators": np.array([1]),
                             "max_depth": np.array([2])},
        n_iter=1,
        cv=2,
        scoring="accuracy",
        verbose=1,
        n_jobs=-1
    )

    gs.fit(X_train, y_train)

It runs the random search without error.

I have to check the library dependencies, as on another computer it ran smoothly with 0.20.

If you have any idea what I can still try, I am open to suggestions. In the meantime, I will continue with 0.19.

Thanks! Instead of posting pictures, please format your code snippets and complete error messages as text; this greatly improves their readability and reusability. For example:

```python
print(something)
```

generates:

print(something)

And:

```pytb
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'hello'
```

generates:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'hello'

Also, if the code is very long you can post a link to it instead.

You can edit your comments at any time to improve readability. This helps maintainers a lot.