cuml: [BUG] Bootstrapping causes accuracy drop in cuML RF

Describe the bug
I have been investigating the accuracy bug in cuML RF (#2518), and I managed to isolate the cause of the accuracy drop: the bootstrapping option causes cuML RF to do worse than sklearn.

Steps/Code to reproduce bug
Download the dataset in NumPy format, which was obtained from #2561:

Then run the following script:

import itertools

import numpy as np
from sklearn.model_selection import cross_validate, KFold
from sklearn.ensemble import RandomForestClassifier
from cuml.ensemble import RandomForestClassifier as cuml_RandomForestClassifier

# Preprocessed data
X = np.load('data/loans_X.npy')
y = np.load('data/loans_y.npy')

param_range = {
    'n_estimators': [1, 10, 100],
    'max_features': [1.0],
    'bootstrap': [False, True],
    'random_state': [0]
}

max_depth = 21
n_bins = 64

cv_fold = KFold(n_splits=10, shuffle=True, random_state=2020)

param_set = (dict(zip(param_range, x)) for x in itertools.product(*param_range.values()))
for params in param_set:
    print(f'==== params = {params} ====')
    skl_clf = RandomForestClassifier(n_jobs=-1, max_depth=max_depth, **params)
    scores = cross_validate(skl_clf, X, y, cv=cv_fold, n_jobs=-1, return_train_score=True)
    skl_train_acc = scores['train_score']
    skl_cv_acc = scores['test_score']
    print(f'sklearn: Training accuracy = {skl_train_acc.mean()} (std={skl_train_acc.std()}), ' +
          f'CV accuracy = {skl_cv_acc.mean()} (std={skl_cv_acc.std()})')
    
    for split_algo in [0, 1]:
        cuml_clf = cuml_RandomForestClassifier(n_bins=n_bins, max_depth=max_depth, n_streams=1,
                                               split_algo=split_algo, **params)
        scores = cross_validate(cuml_clf, X, y, cv=cv_fold, return_train_score=True)
        cuml_train_acc = scores['train_score']
        cuml_cv_acc = scores['test_score']
        print(f'cuml, split_algo = {split_algo}: Training accuracy = {cuml_train_acc.mean()} ' +
              f'(std={cuml_train_acc.std()}), CV accuracy = {cuml_cv_acc.mean()} ' +
              f'(std={cuml_cv_acc.std()})')

cuML RF gives substantially lower training accuracy than sklearn (up to 9 percentage points lower):

Training accuracy, bootstrap=True

| n_estimators | sklearn | cuML (split_algo=0) | cuML (split_algo=1) |
|---|---|---|---|
| 1 | 0.876951 | 0.822472 | 0.821807 |
| 10 | 0.925004 | 0.857921 | 0.861096 |
| 100 | 0.931354 | 0.84961 | 0.852527 |

On the other hand, turning off bootstrapping with bootstrap=False improves the accuracy of cuML RF relative to sklearn:

Training accuracy, bootstrap=False

| n_estimators | sklearn | cuML (split_algo=0) | cuML (split_algo=1) |
|---|---|---|---|
| 1 | 0.92087 | 0.921404 | 0.928852 |
| 10 | 0.922088 | 0.921404 | 0.928852 |
| 100 | 0.92228 | 0.921404 | 0.928852 |
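For context (my own aside, not from the original report): a bootstrap sample of size n contains, in expectation, only about 1 − 1/e ≈ 63.2% unique rows, so with `bootstrap=True` each tree effectively trains on a reduced dataset, which is one reason training accuracy can drop. A quick NumPy check:

```python
# Sketch: fraction of unique rows in a size-n sample drawn with replacement.
# In expectation this approaches 1 - 1/e ~ 0.632 as n grows.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
indices = rng.choice(n, size=n, replace=True)  # one bootstrap's row indices
unique_frac = np.unique(indices).size / n
print(f'{unique_frac:.3f}')  # close to 1 - 1/e ~ 0.632
```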

To confirm that bootstrapping is the issue, I wrote the following script, which generates bootstraps with NumPy and feeds the same bootstraps into both cuML RF and sklearn:

import time

import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from cuml.ensemble import RandomForestClassifier as cuml_RandomForestClassifier

def fit_with_custom_bootstrap(base_estimator, X, y, *, n_estimators, random_state):
    """Fit n_estimators clones of base_estimator, each on its own bootstrap
    (rows sampled with replacement) drawn with a NumPy RNG."""
    assert len(X.shape) == 2 and len(y.shape) == 1
    assert X.shape[0] == y.shape[0]
    rng = np.random.default_rng(seed=random_state)
    estimators = []
    for _ in range(n_estimators):
        estimator = clone(base_estimator)
        # Sample row indices with replacement to form a bootstrap of the same size
        indices = rng.choice(X.shape[0], size=(X.shape[0],), replace=True)
        bootstrap_X, bootstrap_y = X[indices, :], y[indices]
        assert bootstrap_X.shape == X.shape
        assert bootstrap_y.shape == y.shape
        estimator.fit(bootstrap_X, bootstrap_y)

        estimators.append(estimator)
    return estimators

def predict_unweighted_vote(estimators, X_test):
    """Hard majority vote over predicted class labels; assumes binary labels {0, 1}."""
    s = np.zeros((X_test.shape[0], 2))
    for estimator in estimators:
        # Tally one vote per estimator for its predicted class
        s[np.arange(X_test.shape[0]), estimator.predict(X_test).astype(np.int32)] += 1.0
    s /= len(estimators)
    return np.argmax(s, axis=1)

def predict_weighted_vote(estimators, X_test):
    """Soft vote: average predicted class probabilities across estimators."""
    s = estimators[0].predict_proba(X_test)
    for estimator in estimators[1:]:
        s += estimator.predict_proba(X_test)
    s /= len(estimators)
    return np.argmax(s, axis=1)

X = np.load('data/loans_X.npy')
y = np.load('data/loans_y.npy')
assert np.array_equal(np.unique(y), np.array([0., 1.]))

max_depth = 21
n_bins = 64
split_algo = 0
n_estimators = 1  # Also number of bootstraps

# Since we generate our own bootstraps, disable bootstrap in cuML / sklearn
params = {
    'n_estimators': 1,
    'max_features': 1.0,
    'bootstrap': False,
    'random_state': 0
}

cuml_clf = cuml_RandomForestClassifier(n_bins=n_bins, max_depth=max_depth, n_streams=1,
                                       split_algo=split_algo, **params)

tstart = time.perf_counter()
estimators = fit_with_custom_bootstrap(cuml_clf, X, y, n_estimators=n_estimators, random_state=0)
tend = time.perf_counter()
print(f'cuml, Training: {tend - tstart} sec')
tstart = time.perf_counter()
y_pred = predict_unweighted_vote(estimators, X)
tend = time.perf_counter()
print(f'cuml, Prediction: {tend - tstart} sec')
print(accuracy_score(y, y_pred))

skl_clf = RandomForestClassifier(n_jobs=-1, max_depth=max_depth, **params)

tstart = time.perf_counter()
estimators = fit_with_custom_bootstrap(skl_clf, X, y, n_estimators=n_estimators, random_state=0)
tend = time.perf_counter()
print(f'sklearn, Training: {tend - tstart} sec')
tstart = time.perf_counter()
y_pred = predict_weighted_vote(estimators, X)
tend = time.perf_counter()
print(f'sklearn, Prediction: {tend - tstart} sec')
print(accuracy_score(y, y_pred))

The results now look a lot better: cuML RF now gives training accuracy competitive with sklearn.

| n_estimators | sklearn | cuML (split_algo=0) | cuML (split_algo=1) |
|---|---|---|---|
| 1 | 0.87526379 | 0.875951111 | 0.875735555 |
| 10 | 0.92300364 | 0.921437212 | 0.931396502 |
| 100 | 0.9296966 | 0.919802517 | 0.930890215 |
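One caveat worth noting (my own aside, not a claim from the report): the script aggregates cuML predictions with a hard majority vote and sklearn predictions with probability averaging, and those two schemes can disagree on the same ensemble. A toy illustration with made-up per-estimator class-1 probabilities for a single test row:

```python
# Sketch: hard (majority) voting vs soft (probability-averaged) voting.
# The probabilities below are invented purely for illustration.
import numpy as np

probas = np.array([0.9, 0.4, 0.45])  # each estimator's P(class 1)

hard_votes = (probas >= 0.5).astype(int)  # per-estimator labels: [1, 0, 0]
hard_pred = int(hard_votes.mean() >= 0.5)  # majority of labels -> class 0
soft_pred = int(probas.mean() >= 0.5)      # mean prob ~0.583   -> class 1
print(hard_pred, soft_pred)  # 0 1
```

So even with identical trained estimators, the two aggregation rules are not guaranteed to produce identical accuracies.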

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 24 (16 by maintainers)

Most upvoted comments

I’ll try this on the Airline delays & NYC taxi datasets and report back. Good detective work so far @hcho3 !

Got it. It is then likely that bootstrapping is not the only issue causing the accuracy drop. There are probably multiple factors in play. I will investigate further.