cca_zoo: TerminatedWorkerError when using GridSearchCV
Hi James, with the latest version of cca_zoo I get this error:
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGSEGV(-11), SIGSEGV(-11), SIGSEGV(-11)}
This didn’t happen in older versions, even though I am running the exact same script. Can you reproduce it? Here’s my full code; my X, y and groups are attached as txt files.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from cca_zoo.model_selection import GridSearchCV
from cca_zoo.linear import SCCA_PMD
###############################################################################
## Settings ###################################################################
###############################################################################
n_jobs = 8
pre_dispatch = 3
rng = np.random.RandomState(42)
###############################################################################
## Prepare Analysis ###########################################################
###############################################################################
X = np.loadtxt('X.txt')
y = np.loadtxt('y.txt')
groups = np.loadtxt('groups.txt')
###############################################################################
## Analysis settings ##########################################################
###############################################################################
# define latent dimensions
latent_dimensions = 3
# pretend that there are subject groups in the dataset
cv = GroupShuffleSplit(n_splits=10,train_size=0.7,random_state=rng)
# define a search space (optimize left and right penalty parameters)
param_grid = {'tau':[np.arange(0.1,1.1,0.1),0]}
# define an estimator
estimator = SCCA_PMD(latent_dimensions=latent_dimensions,random_state=rng)
##############################################################################
## Run GridSearch
##############################################################################
def scorer(estimator, views):
    scores = estimator.score(views)
    return np.mean(scores)
grid = GridSearchCV(estimator,param_grid,scoring=scorer,n_jobs=n_jobs,cv=cv)
grid.fit([X,y],groups=groups)
Note that X and y were normalized prior to the GridSearch, so each fold “sees” a different subset of the already-normalized dataset. Not sure if this is related to https://github.com/jameschapman19/cca_zoo/issues/175
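A possible stopgap while this is being looked into, only a sketch reusing the objects from the script above and not verified to avoid this particular crash (if the fault is inside native code it may still take down the whole process): steer joblib away from the default loky worker processes, either via the threading backend or by falling back to a single job. This assumes GridSearchCV dispatches its work through joblib, as the scikit-learn class it wraps does.
from joblib import parallel_backend

# Untested workaround sketch: run the same search without separate worker processes.
with parallel_backend('threading', n_jobs=n_jobs):
    grid = GridSearchCV(estimator, param_grid, scoring=scorer, cv=cv)
    grid.fit([X, y], groups=groups)

# Alternatively, the simplest fallback is to disable parallelism entirely:
# grid = GridSearchCV(estimator, param_grid, scoring=scorer, n_jobs=1, cv=cv)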
About this issue
- State: open
- Created a year ago
- Comments: 29 (29 by maintainers)
Ah no, because I haven’t been testing the n_jobs > 1 behaviour. Will add it to the tests.
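A rough sketch of what such a test could look like, using the same API as the report above; the test name, data shapes and the best_estimator_ check are assumptions, not existing cca_zoo tests:
import numpy as np
from cca_zoo.linear import SCCA_PMD
from cca_zoo.model_selection import GridSearchCV

def test_gridsearchcv_with_parallel_jobs():
    # Hypothetical regression test: a small random two-view dataset is enough,
    # the point is only that n_jobs > 1 does not kill the worker processes.
    rng = np.random.RandomState(0)
    X = rng.rand(60, 10)
    y = rng.rand(60, 8)

    def scorer(estimator, views):
        return np.mean(estimator.score(views))

    param_grid = {'tau': [[0.1, 0.5], [0.1, 0.5]]}
    estimator = SCCA_PMD(latent_dimensions=1, random_state=0)
    grid = GridSearchCV(estimator, param_grid, scoring=scorer, cv=3, n_jobs=2)
    grid.fit([X, y])
    assert hasattr(grid, 'best_estimator_')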
Geez, that doesn’t sound trivial. For now I will just use the working conda environment for the analysis. Let me know if I should test something out for you. It’s probably a good idea to implement a testing workflow with different OS runners in the long term.
Thanks for this. Will have a dig around.
True, now I remember that we had this issue before. Unfortunately I still get the error, even when using
param_grid = {'tau':[list(np.arange(0.1,1.0,0.1)),0]}
P.S.: Maybe it would make sense to open a separate issue for the data types accepted in param_grid? I think it would make sense if lists, numpy arrays and other iterables were all valid inputs.
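For illustration, the coercion could be as simple as the following (the helper name and its placement are hypothetical, not existing cca_zoo code):
import numpy as np

def _as_candidate_list(values):
    # Hypothetical helper (not part of cca_zoo): turn a scalar into a
    # one-element list and any iterable (list, tuple, numpy array, ...)
    # into a plain Python list before the grid is expanded.
    if np.isscalar(values):
        return [values]
    return list(values)

# With such a step, these would be accepted interchangeably:
param_grid = {'tau': [np.arange(0.1, 1.0, 0.1), 0]}        # numpy array + scalar
param_grid = {'tau': [list(np.arange(0.1, 1.0, 0.1)), 0]}  # list + scalar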