cca_zoo: TerminatedWorkerError when using GridSearchCV

Hi James, with the latest version of cca_zoo I get this error:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGSEGV(-11), SIGSEGV(-11), SIGSEGV(-11)}

This didn’t happen in older versions, even though I am using the exact same script. Can you reproduce it? Here is my full code; my X, y and groups are attached as txt files.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from cca_zoo.model_selection import GridSearchCV
from cca_zoo.linear import SCCA_PMD

###############################################################################
## Settings ###################################################################
###############################################################################

n_jobs = 8
pre_dispatch = 3
rng = np.random.RandomState(42)

###############################################################################
## Prepare Analysis ###########################################################
###############################################################################

X = np.loadtxt('X.txt')
y = np.loadtxt('y.txt')
groups = np.loadtxt('groups.txt')

###############################################################################
## Analysis settings ##########################################################
###############################################################################

# define latent dimensions
latent_dimensions = 3

# pretend that there are subject groups in the dataset
cv = GroupShuffleSplit(n_splits=10, train_size=0.7, random_state=rng)

# define a search space (optimize left and right penalty parameters)
param_grid = {'tau': [np.arange(0.1, 1.1, 0.1), 0]}

# define an estimator
estimator = SCCA_PMD(latent_dimensions=latent_dimensions, random_state=rng)

##############################################################################
## Run GridSearch
##############################################################################

def scorer(estimator, views):
    # average the per-dimension scores into a single value for GridSearchCV
    scores = estimator.score(views)
    return np.mean(scores)

grid = GridSearchCV(estimator, param_grid, scoring=scorer, n_jobs=n_jobs, cv=cv)
grid.fit([X, y], groups=groups)

Data:

groups.txt X.txt y.txt

Note that X and y were normalized before the grid search, so each fold “sees” a different batch of the already-normalized dataset. Not sure if this is related to https://github.com/jameschapman19/cca_zoo/issues/175
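As a diagnostic sketch (assuming cca_zoo's GridSearchCV dispatches its folds through joblib in the same way scikit-learn's does), the same search can also be run under joblib's threading backend, which keeps every fold in the main process so a segfaulting worker cannot kill the run:

from joblib import parallel_backend

# Run the same search with thread-based workers instead of forked processes;
# slower, but useful to check whether the crash is specific to process workers.
with parallel_backend('threading', n_jobs=n_jobs):
    grid.fit([X, y], groups=groups)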

About this issue

  • State: open
  • Created a year ago
  • Comments: 29 (29 by maintainers)

Most upvoted comments

Ah, no, because I haven’t been testing the n_jobs > 1 behaviour. Will add it to the tests; something along the lines of the sketch below could be a starting point.
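A minimal sketch of such a test, assuming GridSearchCV and SCCA_PMD keep the interfaces used in this issue (the data shapes, grid, and test name are illustrative only):

import numpy as np
import pytest
from cca_zoo.linear import SCCA_PMD
from cca_zoo.model_selection import GridSearchCV


def scorer(estimator, views):
    # same scoring function as in the report above
    return np.mean(estimator.score(views))


@pytest.mark.parametrize('n_jobs', [1, 2])
def test_gridsearchcv_runs_with_parallel_jobs(n_jobs):
    # small random two-view dataset, only meant to exercise the parallel path
    rng = np.random.RandomState(0)
    X = rng.rand(50, 10)
    Y = rng.rand(50, 8)
    param_grid = {'tau': [[0.1, 0.5], [0.1, 0.5]]}
    estimator = SCCA_PMD(latent_dimensions=1, random_state=rng)
    grid = GridSearchCV(estimator, param_grid, scoring=scorer, cv=3, n_jobs=n_jobs)
    grid.fit([X, Y])
    assert hasattr(grid, 'best_estimator_')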

Geez, that doesn’t sound trivial. For now I will just use the working conda environment for the analysis. Let me know if I should test something out for you. In the long term it is probably a good idea to set up a testing workflow with different OS runners.

Thanks for this. Will have a dig around.

True, I remember now that we had this issue before. Unfortunately I still get the error, even when using param_grid = {'tau': [list(np.arange(0.1, 1.0, 0.1)), 0]}

P.S.: Maybe it would make sense to open a separate issue for the data types accepted in param_grid? I think it would make sense if lists, numpy arrays, and other iterables were all valid inputs.
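For what it’s worth, a minimal sketch of that kind of normalization, using a hypothetical helper (normalize_param_grid is not part of cca_zoo’s API), could look like this:

import numpy as np

def normalize_param_grid(param_grid):
    # Coerce numpy arrays and other iterables in each per-view entry to plain lists,
    # leaving scalars (like the 0 for the second view) untouched.
    cleaned = {}
    for key, per_view in param_grid.items():
        cleaned[key] = [
            list(value) if isinstance(value, (np.ndarray, tuple, range)) else value
            for value in per_view
        ]
    return cleaned

# The grid from this issue then becomes {'tau': [[0.1, 0.2, ..., 1.0], 0]}.
param_grid = normalize_param_grid({'tau': [np.arange(0.1, 1.1, 0.1), 0]})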