LightGBM: instability caused by floating-point errors
How are you using LightGBM?
LightGBM component: Python package
Environment info
Operating System: Ubuntu 20.04.1 LTS
CPU/GPU model: Intel® Xeon® Platinum 8259CL CPU @ 2.50GHz
C++ compiler version: gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)
CMake version: 3.16.3
Java version: N/A
Python version: 3.8.5
R version: N/A
Other:
LightGBM version or commit hash: 3.0.0 (also reproduces on 3.0.0.99, commit afc76d2cb8234f6876ed75d923a7916bfef9a1e5)
Error message and / or logs
Given the same data, parameters, and seeds, LightGBM sometimes produces different models and outputs, which makes experiments hard to reproduce. I believe this is the result of floating-point error propagation: the models initially start out identical, but when the pickled model files are inspected they begin to differ after more than 250 iterations.

I noticed this behavior before in LightGBM 2.3.1 but worked around it by disabling bagging (#2598). In this version I get the same non-determinism even without bagging. In the notebook inside the linked repo I ran the model 5 times and got different results every time; the results also change when the notebook is restarted.
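For illustration, here is one possible way to locate the divergence point (a hypothetical helper, not how the models were originally compared; it assumes two boosters trained as in the repro below):

# Hypothetical helper: compare the text dumps of two boosters iteration by
# iteration and return the first iteration at which they differ.
def first_divergent_iteration(model_a, model_b, num_iterations=1000):
    for i in range(1, num_iterations + 1):
        dump_a = model_a.model_to_string(num_iteration=i)
        dump_b = model_b.model_to_string(num_iteration=i)
        if dump_a != dump_b:
            return i
    return None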
Reproducible example(s)
Data and code required to reproduce this bug: https://github.com/rebidaldal/lightgbmBugreport
import lightgbm as lgb
import numpy as np
import pandas as pd

X_train = pd.read_feather("X_train.feather")
y_train = pd.read_feather("y_train.feather")
X_test = pd.read_feather("X_test.feather")

param = {
    'boosting_type': 'gbdt',
    'feature_fraction': 0.2,
    'feature_fraction_bynode': 0.5,
    'lambda_l2': 1,
    'learning_rate': 0.02,
    'max_bin': 31,
    'max_delta_step': 63,
    'max_depth': 20,
    'metric': 'rmse',
    'min_data_in_bin': 8191,
    'min_data_in_leaf': 8191,
    'min_gain_to_split': 1,
    'num_leaves': 100,
    'objective': 'regression',
    'verbosity': -1,
    'seed': 1,
}

dTrain = lgb.Dataset(X_train, label=y_train)
model = lgb.train(param, dTrain, 1000)
preds = model.predict(X_test)
preds
array([15.032409 , 15.06296509, 14.98345116, …, 15.24531394, 15.1225563 , 15.07920504])
dTrain = lgb.Dataset(X_train, label=y_train)
model2 = lgb.train(param, dTrain, 1000)
preds2 = model2.predict(X_test)
preds2
array([15.05416012, 15.00900578, 14.97445924, …, 15.17322033, 15.05276989, 15.01204842])
np.corrcoef(preds, preds2)
array([[1.        , 0.98490951],
       [0.98490951, 1.        ]])
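To quantify the drift, here is a short sketch (not part of the original repro) that repeats the training several times and reports the largest absolute prediction difference from the first run; it reuses the param, X_train, y_train, and X_test objects defined above:

# Repeat training with identical data/params/seed and measure the spread.
all_preds = []
for run in range(5):
    booster = lgb.train(param, lgb.Dataset(X_train, label=y_train), 1000)
    all_preds.append(booster.predict(X_test))
stacked = np.vstack(all_preds)
# A result of 0.0 here would mean the runs are fully deterministic.
print(np.max(np.abs(stacked - stacked[0])))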
Steps to reproduce
- Download and unzip data from repo: https://github.com/rebidaldal/lightgbmBugreport
- Run the Jupyter notebook from the repo (bug.ipynb) or the code above
- Observe that the results differ between runs
Hi Guolin,
I will try to make Docker containers for the deterministic and default (3.0.99) branches and share them with you. I already have some Amazon Machine Images on hand, but they are kind of troublesome to share. Are there any other branches you would suggest I try to reproduce the bug?
Thanks.
Hi Guolin,
Today, as I was trying to set up the Docker image, I noticed that as of version 3.1.0 with param["deterministic"] = True the instability issue looks solved. After 100 runs the results do not differ at all 😄 Also, as far as I can see with this rudimentary benchmark, performance is not adversely affected at all (a 1% difference is negligible).
One remaining issue: param["deterministic"] = True needs param["boost_from_average"] = False to work correctly; maybe it should override that, or display a warning to do so. Tomorrow I will test this version with a larger real-life use case and see if the issue is solved for good.
Hi @guolinke,
I have checked out branch #3385 with the two fixes and re-ran my tests, and it looks like the instability problem is solved for now. I will do more tests on real-life-sized data and let you know if there are any remaining issues.
Thank you very much for the fixes.
#3385 should fix the bagging issue.
@rebidaldal please let me know if the problem still exists.