scikit-learn: Infinite loop bug in GridSearchCV with svm.SVC(), Windows 10

This report is going to be a bit short, as I cannot determine with 100% accuracy which steps trigger this bug.

Setup:

  • Windows 10, most recent update
  • Anaconda, most recent update
  • Jupyter Notebook and prompt, latest update (issue tested in both; no warnings, errors, bugs, etc. are printed in either)
  • All packages (scikit-learn, numpy, etc.) at the latest update
  • AMD FX 8350 (8 cores), Nvidia GeForce GTX 980, 16 GB RAM

The issue: When running with n_jobs set to -1, my grid_search_wrapper runs fine with MLPClassifier() and takes up ~70% of CPU processing power. The fits (192 candidates x 10-fold cross-validation = 1920) run in about 8 minutes and return the expected dataframe of results.

When running with clf set to an SVM (svm.SVC()), the process always starts up and prints:

Fitting X folds for each of Y candidates, totalling (sic) X*Y fits

After this, my computer sits for hours without any progress. Killing the kernel does not halt the ~10-15 spawned processes. With n_jobs = -1, killing python through Task Manager ends the CPU usage. With n_jobs = 1, my CPU usage is only ~20% (I believe only one core is being utilized), but no python processes appear in Task Manager, so I have to restart my computer to stop the single-core calculation.

Note that training individual models without passing them through the grid search function succeeds. I have not tested every combination by hand, but I have tested each individual kernel. Training a single SVM model took, on average, 1-2 minutes, as sketched below.
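A minimal sketch of the standalone training that succeeds (x_train/y_train are the same arrays passed to the wrapper in the code dump below; the kernel here is just one example):

from sklearn import svm

clf = svm.SVC(kernel='poly', degree=3)
clf.fit(x_train, y_train)          # completes in ~1-2 minutes
print(clf.score(x_test, y_test))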

Here are the variations of inputs to the grid_search_wrapper function and the resulting outcome:

with n_jobs = -1:
ml_params = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'degree': [2,3,4],
    'tol': [1e-3, 1e-4, 1e-2]
}

FAIL

ml_params = {
    'kernel': ['linear', 'rbf', 'sigmoid'],
}

PASS

ml_params = {
    'kernel': ['linear', 'rbf', 'sigmoid'],
    'tol': [1e-3, 1e-4, 1e-2]
}

FAIL

ml_params = {
    'kernel': ['linear']
}

PASS

ml_params = {
    'kernel': ['rbf']
}

PASS

ml_params = {
    'kernel': ['sigmoid']
}

PASS

ml_params = {
    'kernel': ['poly']
}

FAIL

with n_jobs = 1:

ml_params = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'degree': [2,3,4],
    'tol': [1e-3, 1e-4, 1e-2]
}

FAIL

ml_params = {
    'kernel': ['linear', 'rbf', 'sigmoid'],
}

FAIL

ml_params = {
    'kernel': ['linear', 'rbf', 'sigmoid'],
    'tol': [1e-3, 1e-4, 1e-2]
}

FAIL


ml_params = {
    'kernel': ['linear']
}

PASS


ml_params = {
    'kernel': ['rbf']
}

PASS


ml_params = {
    'kernel': ['sigmoid']
}

PASS


ml_params = {
    'kernel': ['poly']
}

FAIL

Note that k_folds was set to 3 instead of the 10 used when training MLPClassifier, to make it faster to figure out what was happening with SVM. For example, the full grid above (4 kernels x 3 degrees x 3 tolerances = 36 candidates) over 3 folds should report "Fitting 3 folds for each of 36 candidates, totalling 108 fits". I believe this setting is irrelevant to the problem.

Data set: 15000 instances x 90 predictors (relatively small; memory usage is about 2 GB during SVM runs).
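For anyone trying to reproduce this without my data, a synthetic set of the same shape should work as a stand-in (a sketch; the make_classification settings are assumptions, not properties of my real data):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# hypothetical stand-in: 15000 instances x 90 predictors, binary target
X, y = make_classification(n_samples=15000, n_features=90,
                           n_informative=30, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)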

Code dump:

def grid_search_wrapper(clf, param_grid, scoring, X_train, X_test, y_train, y_test, refit_score='accuracy_score'):
    """
    Fits a GridSearchCV classifier, using refit_score for optimization,
    and prints classifier performance metrics.
    Adapted from https://towardsdatascience.com/fine-tuning-a-classifier-in-scikit-learn-66e048c21e65
    """
    import pandas as pd
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import StratifiedKFold

    skf = StratifiedKFold(n_splits=3)
    # n_jobs was toggled between 1 and -1 for the runs described above
    grid_search = GridSearchCV(clf, param_grid, cv=skf, scoring=scoring,
                               refit=refit_score, return_train_score=True,
                               n_jobs=1, verbose=1)
    grid_search.fit(X_train, y_train)

    # make predictions on the held-out test set
    y_pred = grid_search.predict(X_test)

    print('Best params for {}'.format(refit_score))
    print(grid_search.best_params_)

    # confusion matrix on the test data
    print('\nConfusion matrix of model optimized for {} on the test data:'.format(refit_score))
    print(pd.DataFrame(confusion_matrix(y_test, y_pred),
                       columns=['pred_neg', 'pred_pos'], index=['neg', 'pos']))
    return grid_search

#ignore my shitty code and not importing at the top
from sklearn.metrics import roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix

#gridsearch for optimal MLPs
#ml_params = {
#    'activation': ['relu', 'tanh', 'logistic'],
#    'alpha': [1e-3, 1e-4, 1e-5, 1e-6],
#    'hidden_layer_sizes': [[100,25,], [50,50,], [75,25,25], [50,25,10]],
#    'max_iter': [100, 500, 1000, 2500]    
#}

ml_params = {
    #    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    #poly kernel is problematic
    'kernel': ['linear', 'rbf', 'sigmoid'],
    #'degree': [2,3,4],
    'tol': [1e-3, 1e-4, 1e-2]

#SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
#    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
#    max_iter=-1, probability=False, random_state=None, shrinking=True,
#    tol=0.001, verbose=False)    
}
scorers = {
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'accuracy_score': make_scorer(accuracy_score)
}
grid_search_clf = grid_search_wrapper(refit_score = 'recall_score', param_grid = ml_params, scoring = scorers, X_train = x_train, X_test = x_test, y_train = y_train, y_test = y_test, clf = svm.SVC())


results = pd.DataFrame(grid_search_clf.cv_results_)
results = results.sort_values(by='mean_test_recall_score', ascending=False) 

#for MLP
#results[['mean_test_precision_score', 'mean_test_accuracy_score', 'mean_test_recall_score', 'param_activation', 'param_alpha', 'param_hidden_layer_sizes', 'param_max_iter']]

#for svm
results[['mean_test_precision_score', 'mean_test_accuracy_score', 'mean_test_recall_score', 'param_kernel', 'param_tol']]

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 16 (6 by maintainers)

Most upvoted comments

No, the issue is probably the infinite max_iter.

Searching over tol is an unusual thing to do. You should try setting a finite max_iter.
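SVC defaults to max_iter=-1 (no iteration limit), so a slowly-converging kernel/tol combination can run for hours and look like a hang. A sketch of the suggested workaround (the cap values here are arbitrary):

from sklearn import svm

# cap libsvm's iterations so no grid candidate can run unbounded;
# fits that hit the cap emit a ConvergenceWarning instead of hanging
clf = svm.SVC(max_iter=10000)

# or include the cap in the search itself
ml_params = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'max_iter': [1000, 10000],
}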