LightGBM: [dask] DaskLGBMClassifier very slow and not using CPU
Using @jameslamb's Dockerfile to set up Dask + LightGBM:
wget https://raw.githubusercontent.com/jameslamb/talks/main/recent-developments-in-lightgbm/Dockerfile
sudo docker build -t dasklgbm .
sudo docker run --rm -p 8787:8787 dasklgbm
sudo docker ps -a
sudo docker exec -ti ... /bin/bash
pip3 install -U dask-ml
ipython
Then run this code:
import pandas as pd
from sklearn import metrics
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import dask.array as da
from dask_ml import preprocessing
from lightgbm.dask import DaskLGBMClassifier
cluster = LocalCluster(n_workers=16, threads_per_worker=1)
client = Client(cluster)
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
d_all = pd.concat([d_train,d_test])
dx_all = dd.from_pandas(d_all, npartitions=16)
vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
    dx_all[col] = preprocessing.LabelEncoder().fit_transform(dx_all[col])
X_all = dx_all[vars_cat+vars_num].to_dask_array(lengths=True)
y_all = da.where((dx_all["dep_delayed_15min"]=="Y").to_dask_array(lengths=True),1,0)
X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]
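# persist() returns a new collection; reassign so the persisted data is what gets used
X_train = X_train.persist()
y_train = y_train.persist()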
client.has_what()
md = DaskLGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100, tree_learner="data", silent=False)
%time md.fit( client=client, X=X_train, y=y_train)
md_loc = md.to_local()
X_test_loc = X_test.compute()
# the local model predicts on the materialized NumPy array, not the Dask array
y_pred = md_loc.predict_proba(X_test_loc)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
It runs very slowly (more than 30 minutes, versus under 4 seconds for regular LightGBM) and does not use the CPUs while running.
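One thing worth checking when workers sit idle is whether the persisted partitions are actually spread across them. The `client.has_what()` call in the script above returns a mapping from worker address to the keys that worker holds. Below is a minimal sketch of summarizing such a mapping. The helper names `summarize_worker_keys` and `idle_workers` are hypothetical, and the sample addresses and keys are made up for illustration; they are not output from this run.

```python
def summarize_worker_keys(has_what):
    """Count how many keys each worker holds.

    `has_what` is assumed to be a dict of worker address -> collection of keys,
    which is the shape returned by `client.has_what()`.
    """
    return {worker: len(keys) for worker, keys in has_what.items()}


def idle_workers(has_what):
    """Return the sorted addresses of workers that hold no keys at all."""
    return sorted(worker for worker, keys in has_what.items() if len(keys) == 0)


# Hypothetical sample mapping (not real output from the cluster above)
sample = {
    "tcp://127.0.0.1:40001": ["x-0", "x-1", "y-0"],
    "tcp://127.0.0.1:40002": ["x-2"],
    "tcp://127.0.0.1:40003": [],
}
counts = summarize_worker_keys(sample)
print(counts)
print(idle_workers(sample))
```

In the session above this would be called as `summarize_worker_keys(client.has_what())`. A very uneven count, or many workers with zero keys, would suggest the data is not distributed the way the 16-worker setup intends.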

For comparison, regular LightGBM:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics
import lightgbm as lgb
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
d_all = pd.concat([d_train,d_test])
vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
    d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])
X_all = d_all[vars_cat+vars_num].to_numpy()
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)
X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
%time md.fit(X_train, y_train)
y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
This runs in 3.7 seconds.
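`%time` is an IPython magic, so the timings above cannot be reproduced from a plain Python script as written. Here is a minimal sketch of a wall-clock timer using only the standard library. The `timed` helper is a hypothetical name, and the `sum(range(...))` workload merely stands in for a `md.fit(...)` call. Unlike `%time`, it reports wall time only, not CPU time.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_wall_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the scripts above this would be, for example,
# timed(md.fit, X_train, y_train)
result, elapsed = timed(sum, range(1_000_000))
print(f"wall time: {elapsed:.3f} s")
```

Using the same helper for both the Dask and the single-machine run keeps the two measurements comparable.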
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 19 (12 by maintainers)
@jameslamb Sure, thanks. I think it’s fixed based on @jmoralez's results above.
@szilard this issue was closed today by a bot we use to close issues that have been awaiting a response for too long. If you run these benchmarks again in the future and find that this problem still exists, please come back and re-open this, and we’d be happy to help.
Not yet, I can ping you here once 3.2 is released.
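The last comment refers to a 3.2 release. Assuming, as the thread suggests but does not state outright, that the Dask behavior discussed here changes with that release, a quick check of the installed version before re-running the benchmark might look like the sketch below. `parse_version` and `at_least` are hypothetical helpers written for this example and are not part of LightGBM.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn a dotted version such as '3.2.0' into a tuple of ints.

    Only the leading digits of each dot-separated part are kept, and parsing
    stops at the first part without leading digits. Pre-release suffixes are
    therefore ignored, so '3.2.0rc1' is treated the same as '3.2.0'.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def at_least(installed, minimum):
    """Return True if `installed` is the same as or newer than `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = version("lightgbm")
    print(installed, at_least(installed, "3.2"))
except PackageNotFoundError:
    print("lightgbm is not installed in this environment")
```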