cuml: [BUG] Considerable difference between the sklearn and cuml RF accuracy
The difference in accuracy between the sklearn and cuML RF varies in the range of 3-7% (a 3% difference remains even after hyper-parameter tuning) for the example below. The base code for the example is the RF notebook in the notebooks-contrib repo (https://github.com/rapidsai/notebooks-contrib/blob/branch-0.14/intermediate_notebooks/examples/rf_demo.ipynb):
```python
from cuml import RandomForestClassifier as cuRF
from sklearn.ensemble import RandomForestClassifier as sklRF
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
import cudf
import numpy as np
import pandas as pd
import os
from urllib.request import urlretrieve
import gzip

# ## Helper function to download and extract the Higgs dataset
def download_higgs(compressed_filepath, decompressed_filepath):
    higgs_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz'
    if not os.path.isfile(compressed_filepath):
        urlretrieve(higgs_url, compressed_filepath)
    if not os.path.isfile(decompressed_filepath):
        cf = gzip.GzipFile(compressed_filepath)
        with open(decompressed_filepath, 'wb') as df:
            df.write(cf.read())

def main():
    # ## Download Higgs data and read it using cudf
    data_dir = 'raid/data/rfc/'
    if not os.path.exists(data_dir):
        print('creating rf data directory')
        os.makedirs(data_dir)
    compressed_filepath = data_dir + 'HIGGS.csv.gz'  # Path for the gzipped Higgs data file, if you already have it
    decompressed_filepath = data_dir + 'HIGGS.csv'   # Path for the decompressed Higgs data file, if you already have it
    download_higgs(compressed_filepath, decompressed_filepath)

    col_names = ['label'] + ['col-{}'.format(i) for i in range(2, 30)]  # Assign column names
    dtypes_ls = ['int32'] + ['float32' for _ in range(2, 30)]           # Assign dtypes to each column
    data = cudf.read_csv(decompressed_filepath, names=col_names, dtype=dtypes_ls)
    data.head().to_pandas()

    # ## Make train/test splits
    X, y = data[data.columns.difference(['label'])].fillna(value=0).as_matrix(), data['label'].to_array()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=500_000)
    print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

    # cuml Random Forest params
    cu_rf_params = {
        'n_estimators': 25,
        'max_depth': 25,
        'n_bins': 512,
        'seed': 0,
    }

    # Train cuml RF
    cu_rf = cuRF(**cu_rf_params)
    cu_rf.fit(X_train, y_train)

    # sklearn Random Forest params
    skl_rf_params = {
        'n_estimators': 25,
        'max_depth': 25,
        'random_state': 0,
    }

    # Train sklearn RF in parallel
    skl_rf = sklRF(**skl_rf_params, n_jobs=20)
    skl_rf.fit(X_train, y_train)

    cu_preds = cu_rf.predict(X_test)
    sk_preds = skl_rf.predict(X_test)
    cu_conf_mat = confusion_matrix(y_test, cu_preds)
    sk_conf_mat = confusion_matrix(y_test, sk_preds)
    print("cuml confusion matrix:")
    print(cu_conf_mat)
    print("sklearn confusion matrix:")
    print(sk_conf_mat)

    # ## Predict and compare the cuml and sklearn RandomForestClassifier
    print("cuml RF Accuracy Score: ", accuracy_score(y_test, cu_rf.predict(X_test)))
    print("sklearn RF Accuracy Score: ", accuracy_score(y_test, skl_rf.predict(X_test)))

if __name__ == '__main__':
    main()
```
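The `Expected column ('F') major order` warnings in the output below indicate that cuML received row-major (C-order) arrays and had to copy them. A minimal sketch, using plain NumPy (no cuML required), of converting the arrays to Fortran order up front so the extra copy is avoided:

```python
import numpy as np

# Row-major (C-order) array, as produced by the train/test split above
X = np.random.rand(1000, 28).astype(np.float32)
print(X.flags['F_CONTIGUOUS'])  # False: this layout triggers cuML's conversion warning

# Convert once to column-major (Fortran) order before fit/predict,
# so cuML does not need to copy the data internally
X_f = np.asfortranarray(X)
print(X_f.flags['F_CONTIGUOUS'])  # True
```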
Output:
```
(7179904, 28) (7179904,) (500000, 28) (500000,)
/home/saloni/miniconda3/envs/float64-rf/bin/ipython:61: UserWarning: For reproducible results, n_streams==1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_seed is set
[W] [14:07:02.530014] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
[W] [14:11:31.697048] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
cuml confusion matrix:
[[147193  88201]
 [ 56040 208566]]
sklearn confusion matrix:
[[168559  66835]
 [ 61668 202938]]
[W] [14:11:57.015702] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization.
cuml RF Accuracy Score:  0.711518
sklearn RF Accuracy Score:  0.742994
```
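The reported accuracies can be cross-checked directly from the confusion matrices above, as the trace (correct predictions on the diagonal) divided by the total number of test samples:

```python
import numpy as np

# Confusion matrices copied from the output above
cu_conf = np.array([[147193,  88201],
                    [ 56040, 208566]])
sk_conf = np.array([[168559,  66835],
                    [ 61668, 202938]])

def accuracy_from_confusion(cm):
    # Accuracy = correctly classified samples (diagonal) / all test samples
    return np.trace(cm) / cm.sum()

print(accuracy_from_confusion(cu_conf))  # 0.711518
print(accuracy_from_confusion(sk_conf))  # 0.742994
```

Both matrices sum to the 500,000 test samples, and the derived accuracies match the printed scores exactly.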
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 17 (16 by maintainers)
Commits related to this issue
- Enable probability output from RF binary classifier (alternative implementaton) (#3869) Alternative implementation of #3862 that does not depend on #3854 Closes #3764 Closes #2518 Authors: - Phi... — committed to rapidsai/cuml by hcho3 3 years ago
- Implement vector leaf for random forest (#4191) Fixes #3764,#2518 To do: - post charts confirming the improvement in accuracy - address python tests - benchmark Authors: - Rory Mitchell (htt... — committed to rapidsai/cuml by RAMitchell 3 years ago
On reducing the number of trees to 1 and setting `max_depth=1`, we see a large difference between sklearn's and cuML's confusion matrices and accuracies. For reference, the dataset has 52% 1's and 48% 0's:
Output:
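The single-tree experiment described above can be sketched as follows. This is a hedged, sklearn-only sketch on synthetic stand-in data (the real run uses the HIGGS reproduction script above); the class balance roughly mimics the 52%/48% split noted in the comment:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the Higgs data: 28 features, binary label
rng = np.random.RandomState(0)
X = rng.rand(10_000, 28).astype(np.float32)
y = (X[:, 0] + 0.1 * rng.randn(10_000) > 0.48).astype(np.int32)

# Single decision stump: one tree of depth 1
rf = RandomForestClassifier(n_estimators=1, max_depth=1, random_state=0)
rf.fit(X, y)
print(confusion_matrix(y, rf.predict(X)))

# In sklearn a depth-1 tree still has 3 nodes: one root and two leaves
print(rf.estimators_[0].tree_.node_count)  # 3
```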
When the depth is increased to `max_depth=10`, the accuracy of both models increases considerably, but cuML's accuracy remains lower than sklearn's.

I think I spotted at least one of the reasons for the discrepancy. The depth of the tree is 0-based in sklearn and 1-based in cuML. This means that when `max_depth` is set to `1`, the sklearn-trained model has three nodes (one root and two leaf nodes), whereas the cuML model contains one root node and one leaf node. To get similar models from both libraries, the `max_depth` parameter needs to be larger by 1 in the cuML case. I modified the code posted above by @hcho3 to accommodate this, and got the following result:
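The off-by-one compensation described above can be sketched as a small helper that derives the cuML parameter dict from the sklearn one. This is a hedged sketch using the parameter names from the reproduction script (`max_depth`, `random_state`/`seed`), not an official cuML utility:

```python
def to_cuml_params(skl_params):
    """Translate sklearn RF params to cuML ones, compensating for cuML's
    1-based depth convention (sketch; assumes the param names used above)."""
    cu_params = dict(skl_params)
    # sklearn max_depth=1 means root + 2 leaves; cuML needs max_depth=2
    # to grow the same shape, so bump the depth by one
    cu_params['max_depth'] = skl_params['max_depth'] + 1
    # cuML's RF takes 'seed' where sklearn takes 'random_state'
    cu_params['seed'] = cu_params.pop('random_state')
    return cu_params

print(to_cuml_params({'n_estimators': 25, 'max_depth': 1, 'random_state': 0}))
# -> {'n_estimators': 25, 'max_depth': 2, 'seed': 0}
```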
Up to depth 4, the models and their accuracies are very close to each other. However, at greater depths the cuML testing accuracy starts dropping.

So the issue is only partially solved at the moment and needs further analysis.