cuml: [BUG] Considerable difference between the sklearn and cuml RF accuracy

The difference in accuracy between the sklearn and cuML RF classifiers varies in the range of 3-7% (the 3% difference was obtained after hyper-parameter tuning) for the example below. The base code for the example is the RF demo notebook in the notebooks-contrib repo (https://github.com/rapidsai/notebooks-contrib/blob/branch-0.14/intermediate_notebooks/examples/rf_demo.ipynb):

from cuml import RandomForestClassifier as cuRF
from sklearn.ensemble import RandomForestClassifier as sklRF
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
import cudf
import numpy as np
import pandas as pd
import os
from urllib.request import urlretrieve
import gzip
 
# ## Helper function to download and extract the Higgs dataset
 
def download_higgs(compressed_filepath, decompressed_filepath):
    higgs_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz'
    if not os.path.isfile(compressed_filepath):
        urlretrieve(higgs_url, compressed_filepath)
    if not os.path.isfile(decompressed_filepath):
        cf = gzip.GzipFile(compressed_filepath)
        with open(decompressed_filepath, 'wb') as df:
            df.write(cf.read())
 
 
def main():
 
    # ## Download Higgs data and read using cudf
 
    data_dir = 'raid/data/rfc/'
    if not os.path.exists(data_dir):
        print('creating rf data directory')
        os.makedirs(data_dir)
 
    compressed_filepath = data_dir + 'HIGGS.csv.gz'  # Point this at the gzipped Higgs data file if you already have it
    decompressed_filepath = data_dir + 'HIGGS.csv'  # Point this at the decompressed Higgs data file if you already have it
    download_higgs(compressed_filepath, decompressed_filepath)
 
    col_names = ['label'] + ["col-{}".format(i) for i in range(2, 30)] # Assign column names
    dtypes_ls = ['int32'] + ['float32' for _ in range(2, 30)] # Assign dtypes to each column
    data = cudf.read_csv(decompressed_filepath, names=col_names, dtype=dtypes_ls)
    print(data.head().to_pandas())  # preview the first rows
 
 
    # ## Make train test splits
 
    X, y = data[data.columns.difference(['label'])].fillna(value=0).as_matrix(), data['label'].to_array()

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=500_000)
    print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
 
    # cuml Random Forest params
 
    cu_rf_params = {
        'n_estimators': 25,
        'max_depth': 25,
        'n_bins': 512,
        'seed': 0,
    }
 
    # Train cuml RF
    cu_rf = cuRF(**cu_rf_params)
    cu_rf.fit(X_train, y_train)
 
 
    # sklearn Random Forest params
 
    skl_rf_params = {
        'n_estimators': 25,
        'max_depth': 25,
        'random_state' : 0,
    }
 
    # Train sklearn RF in parallel
    skl_rf = sklRF(**skl_rf_params, n_jobs=20)
    skl_rf.fit(X_train, y_train)
 
    cu_preds = cu_rf.predict(X_test)
    sk_preds = skl_rf.predict(X_test)
    cu_conf_mat = confusion_matrix(y_test, cu_preds)
    sk_conf_mat = confusion_matrix(y_test, sk_preds)
    print("cuml confusion matrix:")
    print(cu_conf_mat)
    print("sklearn confusion matrix:")
    print(sk_conf_mat)

    # ## Predict and compare cuml and sklearn RandomForestClassifier
 
    print("cuml RF Accuracy Score: ", accuracy_score(cu_rf.predict(X_test), y_test))
    print("sklearn RF Accuracy Score: ", accuracy_score(skl_rf.predict(X_test), y_test))
 
if __name__ == '__main__':
    main()

Output:


(7179904, 28) (7179904,) (500000, 28) (500000,)
/home/saloni/miniconda3/envs/float64-rf/bin/ipython:61: UserWarning: For reproducible results, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
[W] [14:07:02.530014] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
[W] [14:11:31.697048] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
cuml confusion matrix:
[[147193  88201]
 [ 56040 208566]]
sklearn confusion matrix:
[[168559  66835]
 [ 61668 202938]]
[W] [14:11:57.015702] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
cuml RF Accuracy Score:  0.711518
sklearn RF Accuracy Score:  0.742994
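
Two notes on the warnings in this output. Per the UserWarning itself, passing n_streams=1 should make cuML runs reproducible with a fixed seed, and the "[W] Expected column ('F') major order" messages can be avoided by handing cuML Fortran-ordered arrays. A minimal sketch of both adjustments, assuming this cuML version accepts n_streams as a constructor argument (as the warning implies); np.asfortranarray is standard NumPy:

import numpy as np

# Single CUDA stream: the UserWarning above recommends n_streams==1
# for reproducible results when a seed is set.
cu_rf_params = {
    'n_estimators': 25,
    'max_depth': 25,
    'n_bins': 512,
    'seed': 0,
    'n_streams': 1,
}

# Convert to column-major ('F') order up front so cuML does not have to
# copy the data internally (np.asfortranarray only copies if needed).
X_train = np.asfortranarray(X_train)
X_test = np.asfortranarray(X_test)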

Most upvoted comments

On reducing the number of trees to n_estimators=1 with max_depth=1, we see a large difference between sklearn's and cuML's confusion matrices and accuracy. For reference, the dataset has 52% 1's and 48% 0's:

cu_rf_params = {
    'n_estimators': 1,
    'max_depth': 1,
    'n_bins': 512,
    'max_features': 1.0,
    'seed': 0,
}

skl_rf_params = {
    'n_estimators': 1,
    'max_depth': 1,
    'max_features': 1.0,
    'random_state': 0,
}

Output:

(10500000, 28) (10500000,) (500000, 28) (500000,)
/home/saloni/miniconda3/envs/duplication-reduc/bin/ipython:35: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
[W] [08:32:01.751873] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
[W] [08:32:23.564072] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
cuml confusion matrix:
[[     0 234585]
 [     0 265415]]
sklearn confusion matrix:
[[ 91895 142690]
 [ 51969 213446]]
[W] [08:32:25.031102] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
cuml RF Accuracy Score:  0.53083
sklearn RF Accuracy Score:  0.610682
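
Note that in this run cuML's depth-1 tree predicts class 1 for every test sample (the first column of its confusion matrix is all zeros), so its 0.53083 accuracy is exactly the majority-class fraction, 265415 / 500000. A quick sanity check of that baseline, reusing y_test from the script above:

import numpy as np

# Class priors of the test labels; a constant "always predict 1" model
# scores exactly the majority-class fraction.
counts = np.bincount(y_test.astype(np.int64))
print(counts / counts.sum())        # per-class fractions
print(counts.max() / counts.sum())  # majority-class baseline, 0.53083 here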

When max_depth is increased to max_depth=10, the accuracy of both models increases considerably, but cuML's accuracy is still lower than sklearn's:

(10500000, 28) (10500000,) (500000, 28) (500000,)
/home/saloni/miniconda3/envs/duplication-reduc/bin/ipython:35: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
[W] [08:39:34.370599] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
[W] [08:42:55.570608] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
cuml confusion matrix:
[[156682  78541]
 [ 71548 193229]]
sklearn confusion matrix:
[[159556  75667]
 [ 72049 192728]]
[W] [08:42:57.083508] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
cuml RF Accuracy Score:  0.699822
sklearn RF Accuracy Score:  0.704568

I think I spotted at least one of the reasons for the discrepancy: tree depth is 0-based in sklearn and 1-based in cuML. This means that when max_depth is set to 1, the sklearn-trained model has three nodes (one root and two leaf nodes), while the cuML model contains one root and one leaf node. To get similar models from both, the max_depth parameter needs to be larger by 1 for cuML. I modified the code posted above by @hcho3 to accommodate this (a quick check of the sklearn node count follows the code):

import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from cuml.ensemble import RandomForestClassifier as cuml_RandomForestClassifier

from sklearn.model_selection import cross_val_score

# Preprocessed data
X = np.load('data/loans_X.npy')
y = np.load('data/loans_y.npy')
X_test = np.load('data/loans_X_test.npy')
y_test = np.load('data/loans_y_test.npy')

params = {
    'n_estimators': 1,
    'max_features': 1.0,
    'bootstrap': False
}

n_bins = 512
max_depth = 8

skl_clf = RandomForestClassifier(n_jobs=-1, max_depth=max_depth, **params)
skl_clf.fit(X, y)
skl_train_accuracy = skl_clf.score(X, y)
skl_test_accuracy = skl_clf.score(X_test, y_test)
print(f'sklearn: Training accuracy = {skl_train_accuracy}, Test accuracy = {skl_test_accuracy}')

cuml_clf = cuml_RandomForestClassifier(n_bins=n_bins, max_depth=max_depth+1, **params)
cuml_clf.fit(X, y)
cuml_train_accuracy = cuml_clf.score(X, y)
cuml_test_accuracy = cuml_clf.score(X_test, y_test)
print(f'cuml: Training accuracy = {cuml_train_accuracy}, Test accuracy = {cuml_test_accuracy}')
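
To verify the 0-based vs. 1-based depth claim directly, sklearn's fitted trees expose their structure through the standard tree_ attribute; with max_depth=1 each tree should report three nodes (one root, two leaves). A small sketch, run after fitting skl_clf above:

# Inspect the single fitted sklearn tree; with max_depth=1 this prints
# node_count == 3 and max_depth == 1 (depth counted from 0 at the root).
tree = skl_clf.estimators_[0].tree_
print('node_count =', tree.node_count)
print('max_depth =', tree.max_depth)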

I got the following results with this modified code:

depth = 1
sklearn: Training accuracy = 0.5961166666666666, Test accuracy = 0.7918
cuml: Training accuracy = 0.5961166620254517, Test accuracy = 0.7918000221252441

depth = 2
sklearn: Training accuracy = 0.6000111111111112, Test accuracy = 0.8201
cuml: Training accuracy = 0.6000111103057861, Test accuracy = 0.8201000094413757

depth = 3
sklearn: Training accuracy = 0.6409666666666667, Test accuracy = 0.4536
cuml: Training accuracy = 0.6409666538238525, Test accuracy = 0.4535999894142151

depth = 4
sklearn: Training accuracy = 0.6518222222222222, Test accuracy = 0.45665
cuml: Training accuracy = 0.6518222093582153, Test accuracy = 0.45669999718666077

Up to depth 4, the models and their accuracies are very close to each other. However, for greater depths the cuML test accuracy starts dropping:

depth = 5
sklearn: Training accuracy = 0.6656722222222222, Test accuracy = 0.50165
cuml: Training accuracy = 0.665672242641449, Test accuracy = 0.4603999853134155

depth = 6
sklearn: Training accuracy = 0.6787611111111111, Test accuracy = 0.5725
cuml: Training accuracy = 0.6787777543067932, Test accuracy = 0.5057500004768372

depth = 7
sklearn: Training accuracy = 0.7041055555555555, Test accuracy = 0.54715
cuml: Training accuracy = 0.7040888667106628, Test accuracy = 0.48510000109672546

depth = 8
sklearn: Training accuracy = 0.72555, Test accuracy = 0.38175
cuml: Training accuracy = 0.7254166603088379, Test accuracy = 0.24729999899864197

So the issue is only partially solved at the moment and needs further analysis.
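
One way to continue the analysis is to script the per-depth comparison above. A sketch of the sweep, reusing X, y, X_test, y_test, params, and n_bins from the modified code (and keeping the +1 adjustment for cuML's 1-based depth):

# Sweep max_depth and print train/test accuracy for both libraries.
for max_depth in range(1, 9):
    skl_clf = RandomForestClassifier(n_jobs=-1, max_depth=max_depth, **params)
    skl_clf.fit(X, y)
    cuml_clf = cuml_RandomForestClassifier(n_bins=n_bins,
                                           max_depth=max_depth + 1, **params)
    cuml_clf.fit(X, y)
    print(f'depth = {max_depth}')
    print(f'sklearn: Training accuracy = {skl_clf.score(X, y)}, '
          f'Test accuracy = {skl_clf.score(X_test, y_test)}')
    print(f'cuml: Training accuracy = {cuml_clf.score(X, y)}, '
          f'Test accuracy = {cuml_clf.score(X_test, y_test)}')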