cuml: [BUG] Bootstrapping causes accuracy drop in cuML RF
**Describe the bug**
I have been investigating the accuracy bug in cuML RF (#2518), and I managed to isolate the cause of the accuracy drop: the bootstrapping option causes cuML RF to do worse than sklearn.
**Steps/Code to reproduce bug**
Download the dataset as NumPy arrays (obtained from #2561):
- loans_X.npy (392.9 MB)
- loans_y.npy (5.0 MB)
Then run the following script:
```python
import itertools

import numpy as np
from sklearn.model_selection import cross_validate, KFold
from sklearn.ensemble import RandomForestClassifier
from cuml.ensemble import RandomForestClassifier as cuml_RandomForestClassifier

# Preprocessed data
X = np.load('data/loans_X.npy')
y = np.load('data/loans_y.npy')

param_range = {
    'n_estimators': [1, 10, 100],
    'max_features': [1.0],
    'bootstrap': [False, True],
    'random_state': [0]
}
max_depth = 21
n_bins = 64
cv_fold = KFold(n_splits=10, shuffle=True, random_state=2020)

param_set = (dict(zip(param_range, x)) for x in itertools.product(*param_range.values()))
for params in param_set:
    print(f'==== params = {params} ====')
    skl_clf = RandomForestClassifier(n_jobs=-1, max_depth=max_depth, **params)
    scores = cross_validate(skl_clf, X, y, cv=cv_fold, n_jobs=-1, return_train_score=True)
    skl_train_acc = scores['train_score']
    skl_cv_acc = scores['test_score']
    print(f'sklearn: Training accuracy = {skl_train_acc.mean()} (std={skl_train_acc.std()}), ' +
          f'CV accuracy = {skl_cv_acc.mean()} (std={skl_cv_acc.std()})')
    for split_algo in [0, 1]:
        cuml_clf = cuml_RandomForestClassifier(n_bins=n_bins, max_depth=max_depth, n_streams=1,
                                               split_algo=split_algo, **params)
        scores = cross_validate(cuml_clf, X, y, cv=cv_fold, return_train_score=True)
        cuml_train_acc = scores['train_score']
        cuml_cv_acc = scores['test_score']
        print(f'cuml, split_algo = {split_algo}: Training accuracy = {cuml_train_acc.mean()} ' +
              f'(std={cuml_train_acc.std()}), CV accuracy = {cuml_cv_acc.mean()} ' +
              f'(std={cuml_cv_acc.std()})')
```
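(As an aside, the generator expression above enumerates the Cartesian product of `param_range` by hand; sklearn's built-in `ParameterGrid` yields the same parameter dictionaries and could be used instead:)

```python
# Equivalent enumeration of the grid using sklearn's built-in helper
from sklearn.model_selection import ParameterGrid

for params in ParameterGrid(param_range):
    print(params)  # e.g. {'bootstrap': False, 'max_features': 1.0, ...}
```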
cuML RF gives substantially lower training accuracy than sklearn (up to about 8 percentage points lower):
**Training accuracy, `bootstrap=True`**
| n_estimators | sklearn | cuML (split_algo=0) | cuML (split_algo=1) |
|---|---|---|---|
| 1 | 0.876951 | 0.822472 | 0.821807 |
| 10 | 0.925004 | 0.857921 | 0.861096 |
| 100 | 0.931354 | 0.84961 | 0.852527 |
On the other hand, turning off bootstrapping with `bootstrap=False` improves the accuracy of cuML RF relative to sklearn. (Note that the cuML columns below are constant across `n_estimators`: with `bootstrap=False` and `max_features=1.0`, every tree presumably sees identical data, so adding trees changes nothing.)
**Training accuracy, `bootstrap=False`**
| n_estimators | sklearn | cuML (split_algo=0) | cuML (split_algo=1) |
|---|---|---|---|
| 1 | 0.92087 | 0.921404 | 0.928852 |
| 10 | 0.922088 | 0.921404 | 0.928852 |
| 100 | 0.92228 | 0.921404 | 0.928852 |
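For context on how large a drop bootstrapping should cost: a bootstrap of n rows drawn with replacement covers on average only 1 - (1 - 1/n)^n ≈ 63.2% of the unique rows, so some loss of training accuracy on the full training set is expected even from a correct implementation. That is consistent with sklearn's modest drop above, but not with cuML's much larger one. A quick illustrative NumPy check:

```python
import numpy as np

# Empirical check: fraction of unique rows covered by one bootstrap of size n.
# Theory: 1 - (1 - 1/n)**n, which tends to 1 - 1/e ~= 0.632 for large n.
rng = np.random.default_rng(0)
n = 100_000
indices = rng.choice(n, size=n, replace=True)
print(len(np.unique(indices)) / n)  # prints roughly 0.632
```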
To make sure that bootstrapping is the issue, I wrote the following script to generate bootstraps with NumPy and fed the same bootstraps into both cuML RF and sklearn:
```python
import time

import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from cuml.ensemble import RandomForestClassifier as cuml_RandomForestClassifier


def fit_with_custom_bootstrap(base_estimator, X, y, *, n_estimators, random_state):
    assert len(X.shape) == 2 and len(y.shape) == 1
    assert X.shape[0] == y.shape[0]
    rng = np.random.default_rng(seed=random_state)
    estimators = []
    for _ in range(n_estimators):
        estimator = clone(base_estimator)
        # Sample row indices with replacement to form one bootstrap
        indices = rng.choice(X.shape[0], size=(X.shape[0],), replace=True)
        bootstrap_X, bootstrap_y = X[indices, :], y[indices]
        assert bootstrap_X.shape == X.shape
        assert bootstrap_y.shape == y.shape
        estimator.fit(bootstrap_X, bootstrap_y)
        estimators.append(estimator)
    return estimators


def predict_unweighted_vote(estimators, X_test):
    # Combine predicted class labels with an unweighted majority vote
    s = np.zeros((X_test.shape[0], 2))
    for estimator in estimators:
        s[np.arange(X_test.shape[0]), estimator.predict(X_test).astype(np.int32)] += 1.0
    s /= len(estimators)
    return np.argmax(s, axis=1)


def predict_weighted_vote(estimators, X_test):
    # Combine predicted class probabilities by averaging
    s = estimators[0].predict_proba(X_test)
    for estimator in estimators[1:]:
        s += estimator.predict_proba(X_test)
    s /= len(estimators)
    return np.argmax(s, axis=1)


X = np.load('data/loans_X.npy')
y = np.load('data/loans_y.npy')
assert np.array_equal(np.unique(y), np.array([0., 1.]))

max_depth = 21
n_bins = 64
split_algo = 0
n_estimators = 1  # Also the number of bootstraps

# Since we generate our own bootstraps, disable bootstrapping in cuML / sklearn
params = {
    'n_estimators': 1,
    'max_features': 1.0,
    'bootstrap': False,
    'random_state': 0
}

cuml_clf = cuml_RandomForestClassifier(n_bins=n_bins, max_depth=max_depth, n_streams=1,
                                       split_algo=split_algo, **params)
tstart = time.perf_counter()
estimators = fit_with_custom_bootstrap(cuml_clf, X, y, n_estimators=n_estimators, random_state=0)
tend = time.perf_counter()
print(f'cuml, Training: {tend - tstart} sec')
tstart = time.perf_counter()
y_pred = predict_unweighted_vote(estimators, X)
tend = time.perf_counter()
print(f'cuml, Prediction: {tend - tstart} sec')
print(accuracy_score(y, y_pred))

skl_clf = RandomForestClassifier(n_jobs=-1, max_depth=max_depth, **params)
tstart = time.perf_counter()
estimators = fit_with_custom_bootstrap(skl_clf, X, y, n_estimators=n_estimators, random_state=0)
tend = time.perf_counter()
print(f'sklearn, Training: {tend - tstart} sec')
tstart = time.perf_counter()
y_pred = predict_weighted_vote(estimators, X)
tend = time.perf_counter()
print(f'sklearn, Prediction: {tend - tstart} sec')
print(accuracy_score(y, y_pred))
```
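The script above trains a single tree (`n_estimators = 1`); to fill in the other rows of the table below, one can loop over ensemble sizes, reusing the helpers already defined (a sketch, keeping the unweighted vote for cuML and the weighted vote for sklearn, as in the script):

```python
# Sketch: sweep ensemble sizes using the helpers defined above.
# Assumes X, y, cuml_clf, skl_clf, fit_with_custom_bootstrap,
# predict_unweighted_vote, predict_weighted_vote and accuracy_score
# are in scope from the script.
for n_est in [1, 10, 100]:
    cuml_ests = fit_with_custom_bootstrap(cuml_clf, X, y, n_estimators=n_est, random_state=0)
    skl_ests = fit_with_custom_bootstrap(skl_clf, X, y, n_estimators=n_est, random_state=0)
    cuml_acc = accuracy_score(y, predict_unweighted_vote(cuml_ests, X))
    skl_acc = accuracy_score(y, predict_weighted_vote(skl_ests, X))
    print(f'n_estimators={n_est}: cuml={cuml_acc}, sklearn={skl_acc}')
```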
The results now look much better: cuML RF achieves training accuracy competitive with sklearn's.

**Training accuracy, NumPy-generated bootstraps**
| n_estimators | sklearn | cuML (split_algo=0) | cuML (split_algo=1) |
|---|---|---|---|
| 1 | 0.87526379 | 0.875951111 | 0.875735555 |
| 10 | 0.92300364 | 0.921437212 | 0.931396502 |
| 100 | 0.9296966 | 0.919802517 | 0.930890215 |
---

Comments from the thread:

- I'll try this on the Airline delays & NYC taxi datasets and report back. Good detective work so far, @hcho3!
- Got it. It is then likely that bootstrapping is not the only issue causing the accuracy drop; there are probably multiple factors in play. I will investigate further.