LightGBM: instability caused by floating-point errors
How are you using LightGBM?
LightGBM component: Python package
Environment info
Operating System: Ubuntu 20.04.1 LTS
CPU/GPU model: Intel® Xeon® Platinum 8259CL CPU @ 2.50GHz
C++ compiler version: gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)
CMake version: 3.16.3
Java version: N/A
Python version: 3.8.5
R version: N/A
Other:
LightGBM version or commit hash: 3.0.0 (also reproduces on 3.0.0.99, commit afc76d2cb8234f6876ed75d923a7916bfef9a1e5)
Error message and / or logs
Given the same data, parameters, and seeds, LightGBM sometimes produces different models and outputs, which makes experiments hard to reproduce. I believe this is the result of floating-point error propagation: the models initially start out identical, but when the pickled model files are inspected they begin to differ after more than 250 iterations.

I noticed this behavior before in LightGBM 2.3.1 but worked around it by disabling bagging (#2598). In this version I get the same non-determinism even without bagging. In the notebook inside the linked repo I ran the model 5 times and got different results every time; the results also change when the notebook is restarted.
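For illustration, here is one possible way to locate the divergence point (a hypothetical helper, not how the models were originally compared; it assumes two boosters trained as in the repro below):

# Hypothetical helper: compare the text dumps of two boosters iteration by
# iteration and return the first iteration at which they differ.
def first_divergent_iteration(model_a, model_b, num_iterations=1000):
    for i in range(1, num_iterations + 1):
        dump_a = model_a.model_to_string(num_iteration=i)
        dump_b = model_b.model_to_string(num_iteration=i)
        if dump_a != dump_b:
            return i
    return None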
Reproducible example(s)
Data and code required to reproduce this bug: https://github.com/rebidaldal/lightgbmBugreport
import lightgbm as lgb
import numpy as np
import pandas as pd

X_train = pd.read_feather("X_train.feather")
y_train = pd.read_feather("y_train.feather")
X_test = pd.read_feather("X_test.feather")

param = {
    'boosting_type': 'gbdt',
    'feature_fraction': 0.2,
    'feature_fraction_bynode': 0.5,
    'lambda_l2': 1,
    'learning_rate': 0.02,
    'max_bin': 31,
    'max_delta_step': 63,
    'max_depth': 20,
    'metric': 'rmse',
    'min_data_in_bin': 8191,
    'min_data_in_leaf': 8191,
    'min_gain_to_split': 1,
    'num_leaves': 100,
    'objective': 'regression',
    'verbosity': -1,
    'seed': 1,
}

dTrain = lgb.Dataset(X_train, label=y_train)
model = lgb.train(param, dTrain, 1000)
preds = model.predict(X_test)
preds
array([15.032409 , 15.06296509, 14.98345116, …, 15.24531394, 15.1225563 , 15.07920504])
dTrain = lgb.Dataset(X_train, label=y_train)
model2 = lgb.train(param, dTrain, 1000)
preds2 = model2.predict(X_test)
preds2
array([15.05416012, 15.00900578, 14.97445924, …, 15.17322033, 15.05276989, 15.01204842])
np.corrcoef(preds, preds2)
array([[1.        , 0.98490951],
       [0.98490951, 1.        ]])
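To quantify the drift, here is a short sketch (not part of the original repro) that repeats the training several times and reports the largest absolute prediction difference from the first run; it reuses the param, X_train, y_train, and X_test objects defined above:

# Repeat training with identical data/params/seed and measure the spread.
all_preds = []
for run in range(5):
    booster = lgb.train(param, lgb.Dataset(X_train, label=y_train), 1000)
    all_preds.append(booster.predict(X_test))
stacked = np.vstack(all_preds)
# A result of 0.0 here would mean the runs are fully deterministic.
print(np.max(np.abs(stacked - stacked[0])))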
Steps to reproduce
- Download and unzip data from repo: https://github.com/rebidaldal/lightgbmBugreport
- Run the Jupyter notebook from the repo (bug.ipynb) or the code above
- Observe that the results differ between runs
Hi Guolin,
I will try to make Docker containers for the deterministic and default (3.0.99) branches and share them with you. I already have some Amazon Machine Images on hand, but they are kind of troublesome to share. Are there any other branches you would suggest I try to reproduce the bug?
Thanks.
Hi Guolin,
Today, as I was trying to set up the Docker image, I noticed that as of version 3.1.0 with param["deterministic"] = True the instability issue looks solved. After 100 runs the results do not differ at all 😄 Also, as far as I can see with this rudimentary benchmark, performance is not adversely affected at all (a 1% difference is negligible).
One remaining issue: param["deterministic"] = True needs param["boost_from_average"] = False to work correctly; maybe it should override that, or display a warning to do so. Tomorrow I will test this version with a larger real-life use case and see if the issue is solved for good.
Hi @guolinke,
I have checked out branch #3385 with the two fixes and re-ran my tests, and it looks like the instability problem is solved for now. I will do more tests on real-life-sized data and let you know if there are any remaining issues.
Thank you very much for the fixes.
#3385 should fix the bagging issue.
@rebidaldal please let me know if the problem still exists.