LightGBM: lightgbm.basic.LightGBMError: Bug in GPU histogram! split 11937: 12, smaller_leaf: 10245, larger_leaf: 1704

version: 2.3.2

[LightGBM] [Fatal] Bug in GPU histogram! split 11937: 12, smaller_leaf: 10245, larger_leaf: 1704

Traceback (most recent call last):
  File "lgb_prefit_4ff5fa97-86b3-420c-aa87-5f01abcc18c3.py", line 10, in <module>
    model.fit(X, y, sample_weight=sample_weight, init_score=init_score, eval_set=eval_set, eval_names=valid_X_features, eval_sample_weight=eval_sample_weight, eval_init_score=init_score, eval_metric=eval_metric, early_stopping_rounds=early_stopping_rounds, feature_name=X_features, verbose=verbose_fit)
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 818, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 610, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 250, in train
    booster.update(fobj=fobj)
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2106, in update
    ctypes.byref(is_finished)))
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 46, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: Bug in GPU histogram! split 11937: 12, smaller_leaf: 10245, larger_leaf: 1704

script and pickle file:

lgbm_histbug.zip

@sh1ng I need help checking whether this is fixed in an even later master.

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 39 (10 by maintainers)

Most upvoted comments

@guolinke I have just built it from the latest master branch, and it still fails. I’ll try to extract a minimal reproducible example and open a new issue then.
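
For reference, the skeleton of the repro I have in mind looks roughly like this; the synthetic data and parameter values are placeholders (the real script will load the pickled dataset from the zip above), so this exact snippet may not trigger the failure on its own:

    import numpy as np
    import lightgbm as lgb

    # placeholder synthetic data; the real repro uses the attached pickled dataset
    rng = np.random.default_rng(123)
    X = rng.normal(size=(100_000, 200))
    y = rng.integers(0, 2, size=100_000)

    params = {
        'objective': 'binary',
        'device_type': 'gpu',   # the OpenCL learner where the histogram error is reported
        'num_leaves': 255,
        'min_gain_to_split': 0.0,
        'verbosity': -1,
    }

    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=200)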

Still happens in version 3.0

lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 630

https://github.com/h2oai/h2o4gpu/blob/master/tests/python/open_data/gbm/test_lightgbm.py#L265-L284

@nightflight-dk Thanks for giving it a try. Since #4528 is a very large PR, we plan to decompose it into several parts and merge them one by one. We expect to finish the merge process by the end of this month. Multi-GPU and distributed training will be added after #4528 is merged. I will point it out here once the PRs for that are open.

+1, this bug makes LightGBM’s GPU training unusable. It still happens for me on the latest master.

Since there hasn’t been any activity for a year, I would like to bring this topic up again.

I’m on version 3.3.3 (Python package), training on GPU, on Windows.

This issue has been bugging me for the past 2 days… The dataset has 500k rows and 1500 features. There seems to be some correlation with the min_gain_to_split parameter: with a value of 1 I have not yet seen any errors, but with the default value of 0 it crashes quite often. Take this with caution, since I have not run enough tests yet…
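
In case it helps anyone, the workaround I’m currently testing just pins that one parameter and leaves everything else in my configuration unchanged; a minimal sketch:

    import lightgbm as lgb

    params = {
        'device_type': 'gpu',
        'objective': 'multiclass',
        'num_class': 3,
        'metric': 'multi_logloss',
        # default is 0, which is where the crashes show up for me;
        # with 1 I have not seen the error yet, but I have not run enough tests to be sure
        'min_gain_to_split': 1,
        'verbosity': -1,
    }
    # train_set is the lgb.Dataset built from my 500k x 1500 data
    # booster = lgb.train(params, train_set)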

It crashed with the following parameter sets:

{'learning_rate': 0.43467624523546383, 'max_depth': 8, 'num_leaves': 201, 'feature_fraction': 0.9, 'bagging_fraction': 0.7000000000000001, 'bagging_freq': 8}

{'learning_rate': 0.021403440298427053, 'max_depth': 2, 'num_leaves': 176, 'lambda_l1': 3.8066251775052895, 'lambda_l2': 1.08526150100961e-08, 'feature_fraction': 0.6, 'bagging_fraction': 0.9, 'bagging_freq': 6}

{'learning_rate': 0.3493368922746614, 'max_depth': 6, 'num_leaves': 109, 'lambda_l1': 4.506588272812341e-05, 'lambda_l2': 2.5452579091348995e-07, 'feature_fraction': 0.7000000000000001, 'bagging_fraction': 1.0, 'bagging_freq': 6, 'min_gain_to_split': 0}

{'learning_rate': 0.17840010040986135, 'max_depth': 12, 'num_leaves': 251, 'lambda_l1': 0.004509589012189404, 'lambda_l2': 3.882151732343819e-08, 'feature_fraction': 0.30000000000000004, 'bagging_fraction': 1.0, 'bagging_freq': 8, 'min_gain_to_split': 0}

The code is:

    params = {
        'device_type': 'gpu',
        'objective': 'multiclass',
        'metric': 'multi_logloss',
        'boosting_type': 'gbdt',
        'num_class': 3,
        'random_state': 123,
        'verbosity': -1,  # hides "No further splits with positive gain, best gain: -inf" warnings
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.9, log=True),  # default 0.1
        'max_depth': trial.suggest_int('max_depth', 2, 12),  # default -1 (no limit)
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),  # default 31
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),  # default 0
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),  # default 0
        'feature_fraction': trial.suggest_float('feature_fraction', 0.1, 1.0, step=0.1),  # default 1
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.1, 1.0, step=0.1),  # default 1
        'bagging_freq': trial.suggest_int('bagging_freq', 0, 10),  # default 0
        'min_gain_to_split': trial.suggest_int('min_gain_to_split', 0, 5),  # default 0
    }

with a few changes here and there
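
For completeness, the surrounding Optuna setup looks roughly like this; the helper variables and num_boost_round are placeholders rather than the exact code from my script (X_train/X_test/y_train/y_test come from the train_test_split call shown further down, and _NUMBER_OF_TRIALS is a constant defined elsewhere):

    import lightgbm as lgb
    import optuna
    from sklearn.metrics import log_loss

    def objective(trial):
        params = {
            'device_type': 'gpu',
            'objective': 'multiclass',
            'metric': 'multi_logloss',
            'num_class': 3,
            'num_leaves': trial.suggest_int('num_leaves', 2, 256),
            # ... remaining hyperparameters sampled as in the block above ...
        }
        train_set = lgb.Dataset(X_train, label=y_train)
        valid_set = lgb.Dataset(X_test, label=y_test, reference=train_set)
        model = lgb.train(params, train_set, num_boost_round=500, valid_sets=[valid_set])
        preds = model.predict(X_test)  # (n_samples, num_class) probabilities for multiclass
        return log_loss(y_test, preds)

    study = optuna.create_study(direction='minimize')
    study.optimize(objective, n_trials=_NUMBER_OF_TRIALS)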

The exception is:

[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at D:\a\1\s\python-package\compile\src\treelearner\serial_tree_learner.cpp, line 653 .

[W 2022-11-07 09:49:32,774] Trial 49 failed because of the following error: LightGBMError('Check failed: (best_split_info.left_count) > (0) at D:\\a\\1\\s\\python-package\\compile\\src\\treelearner\\serial_tree_learner.cpp, line 653 .\n')
Traceback (most recent call last):
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "D:\dev\Pycharm2022\LearningCNN\test9_realData4_optuna.py", line 174, in objective
    model = lgb.train(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\engine.py", line 292, in train
    booster.update(fobj=fobj)
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 3021, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at D:\a\1\s\python-package\compile\src\treelearner\serial_tree_learner.cpp, line 653 .

Traceback (most recent call last):
  File "D:\dev\Pycharm2022\LearningCNN\test9_realData4_optuna.py", line 237, in <module>
    study.optimize(objective, n_trials=_NUMBER_OF_TRIALS)
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\study.py", line 419, in optimize
    _optimize(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 66, in _optimize
    _optimize_sequential(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 160, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 234, in _run_trial
    raise func_err
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "D:\dev\Pycharm2022\LearningCNN\test9_realData4_optuna.py", line 174, in objective
    model = lgb.train(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\engine.py", line 292, in train
    booster.update(fobj=fobj)
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 3021, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at D:\a\1\s\python-package\compile\src\treelearner\serial_tree_learner.cpp, line 653 .


Process finished with exit code 1

I am using Optuna for hyperparameter optimization, so the set of parameters is different on every trial.

I tried different train/test split ratios (0.19/0.20/0.21), which does not seem to fix anything:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.19, random_state=42, shuffle=True)

I also experimented with the amount of data (600_000/600_001/200_001 rows). Nothing seems to fix the issue… Can a fix be expected in the next major release? I see that the topic is still active…

Thank you @nightflight-dk. We have actually rewritten the LightGBM GPU version, and the previous OpenCL and CUDA versions will be deprecated; refer to PR https://github.com/microsoft/LightGBM/pull/4528
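
For anyone who wants to try the rewritten learner once it is merged, it is selected through the device_type parameter rather than the current 'gpu' (OpenCL) value; a rough sketch, assuming the new value is exposed as 'cuda' (it has appeared as the experimental 'cuda_exp' in some intermediate builds):

    import lightgbm as lgb

    params = {
        'objective': 'multiclass',
        'num_class': 3,
        # 'gpu'  = existing OpenCL learner (the one hitting this bug)
        # 'cuda' = rewritten CUDA learner from #4528; requires a CUDA-enabled build
        'device_type': 'cuda',
    }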

Hi, I’m using the GPU setting and have the same issue. I tried "deterministic = True", but it did not solve the problem. I saw that LightGBM v3.2.0 may fix this defect. I have a few questions:

  1. In the v3.2.0 release thread, I noticed that this bug (#2793) is not in bold. Does this mean it may not be fixed until a later release?
  2. Does a fix exist for it in a non-release (build-from-source) option? If so, can you please point me to it?
  3. Assuming a fix may be part of the v3.2.0 release, is this release about to happen? I noticed that v3.1.1 was released 3 months ago.

I apologize if my questions are a bit out of bounds. Best regards