LightGBM: Parameter min_data_in_leaf ignored by lightgbm.cv()

Environment info

Component: Python package

Operating System: Windows 10

CPU/GPU model: GeForce 960M

CMake version: 3.18.2

Python version: 3.8.3

LightGBM version: 3.0.0

Error message and / or logs

[LightGBM] [Warning] min_data_in_leaf is set=10, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=10

Reproducible example(s)

import lightgbm

# train_data and categorical_feature are defined earlier in the original script
param = {'min_data_in_leaf': 200,
         'feature_pre_filter': False,
         'objective': 'multiclass',
         'metric': 'multi_logloss',
         'num_class': 21}

cvm = lightgbm.cv(param, nfold=4, train_set=train_data, categorical_feature=categorical_feature)

Steps to reproduce

  1. Run the script above.
  2. Receive the warning, [LightGBM] [Warning] min_data_in_leaf is set=10, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=10.
  3. Notice the warning reports min_data_in_leaf=10 instead of the 200 that was passed.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 40

Most upvoted comments

@mosscoder Please don’t forget to set bagging_freq.

Note: to enable bagging, bagging_freq should be set to a non-zero value as well: https://lightgbm.readthedocs.io/en/latest/Parameters.html#bagging_fraction
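A minimal sketch of the two parameters working together (the values are illustrative, not taken from this thread):

param = {'objective': 'multiclass',
         'num_class': 21,
         'bagging_fraction': 0.8,  # draw 80% of the rows for each bagging round
         'bagging_freq': 1}        # re-sample every iteration; the default 0 disables bagging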

Are the hyperparameters unchangeable while searching with cv?

Good question. It is indeed a path-dependent issue, but it requires a conjunction of as many as three conditions:

  • at least two models (CV or single) trained in a sequence (in a notebook or a .py script) (kudos to @Merudo), AND
  • a synonymous parameter name passed together with the primary one, AND
  • the primary parameter's value changing to its default (a change measured from the previous model run to the current one).

Then the default value passed via the primary name takes precedence during training over the custom value passed via the synonym. The model object preserves both, so a reader (e.g. one extracting the model object from storage in systems like MLflow) naturally assumes that the custom value was used for training, which is not true here. I tried various scenarios (paths) and synonyms, and all required these three conditions to produce an incorrect metric (one based on the default despite a custom value passed to a synonym). I can imagine it extends to other hyperparameters too. Reloading the dataset before each model is trained does not help. I hope someone can reproduce this with independent tests; a self-contained sketch follows the example output below.

For example:

{'feature_pre_filter': False,
 'metric': 'multi_logloss',
 'min_data': 20,
 'min_data_in_leaf': 1,
 'num_class': 3,
 'objective': 'multiclass',
 'verbose': -1}
0.11551373151193695
In repeat 0 we expect the user has set CUSTOM value to min_data and/or min_data_in_leaf
.. because model metric (0.11551) is consistent with using custom value(s)
.. but what she actually used for min_data was 20
.. and what she actually used for min_data_in_leaf was 1
.. min_data changed from 20 to 20
.. min_data_in_leaf changed from 1 to 1
.. as expected

{'feature_pre_filter': False,
 'metric': 'multi_logloss',
 'min_child_samples': 1,
 'min_data_in_leaf': 20,
 'num_class': 3,
 'objective': 'multiclass',
 'verbose': -1}
0.09750119910994513
In repeat 0 we expect the user has set DEFAULT value to both min_child_samples and min_data_in_leaf..
.. because model metric (0.09750) is consistent with using default values
.. but what she actually used for min_child_samples was 1
.. and what she actually used for min_data_in_leaf was 20
.. min_child_samples changed from 20 to 1
.. min_data_in_leaf changed from 1 to 20

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-1-f6519d947615> in <module>
    106                         print(".. %s changed from %d to %d" % (DEFAULT_NAME, prev_test_val_for_def_name, test_val_for_def_name))
    107                         # assert(test_val == DEFAULT_VALUE) # OK (never fails)
--> 108                         assert(test_val == DEFAULT_VALUE and test_val_for_def_name == DEFAULT_VALUE)
    109                         print(".. as expected\n")
    110 

AssertionError: 
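A minimal, self-contained sketch of that scenario (my reconstruction of the setup, using the digits data from the examples further down; it is not the exact script that produced the output above):

from sklearn.datasets import load_digits

import lightgbm as lgb

X, y = load_digits(n_class=3, return_X_y=True)
train_data = lgb.Dataset(X, y)

base = {'objective': 'multiclass', 'num_class': 3,
        'metric': 'multi_logloss', 'feature_pre_filter': False,
        'verbose': -1}

# Run 1: a custom value passed via the primary name.
cvm = lgb.cv({**base, 'min_data_in_leaf': 1}, train_set=train_data, nfold=4)

# Run 2: the primary name back at its default (20) plus a custom value via
# the synonym. The primary name wins the conflict, so training effectively
# uses the default 20 while min_child_samples=1 is ignored, even though both
# values remain visible in the parameter dict afterwards.
cvm = lgb.cv({**base, 'min_data_in_leaf': 20, 'min_child_samples': 1},
             train_set=train_data, nfold=4)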

@mirekphd I think the purpose of parameter aliases is convenience, not for the user to switch to a different alias on every run of a sequential training. If this is indeed a problem, we can throw an error directly when the user tries this, or disable aliases entirely.

I believe consistently using only one parameter name (min_data_in_leaf, for example) solves the problem:

         'min_data_in_leaf':15,
#          'min_child_samples': 30,
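Spelled out in full, the parameter dict then carries a single, unambiguous name (same values as in the examples below):

param = {'min_data_in_leaf': 15,
         'feature_pre_filter': False,
         'objective': 'multiclass',
         'metric': 'multi_logloss',
         'num_class': 3}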

Yup, this is true: a metric-based regression test where there is no need for name-conflict resolution never fails:

                # param = {**orig_param, **{test_name:test_val}} # OK (never fails)
                param = {**orig_param, **{test_name:test_val}, **{DEFAULT_NAME:test_val_for_def_name}}

I think it is related to #2594: the Dataset object preserves its initial parameter values. The affected parameters are:

https://github.com/microsoft/LightGBM/blob/5b5f4e39a9d9b075ef0aedafb9e400ede521a34f/python-package/lightgbm/basic.py#L810-L828

It seems lgb.cv stores the parameters passed in previous calls on the Dataset object, and uses those instead of the most recent ones.

Example:

from sklearn.datasets import load_digits

import lightgbm as lgb

X, y = load_digits(n_class=3, return_X_y=True)
train_data = lgb.Dataset(X, y)

categorical_feature = [0, 2]
param = {'min_data_in_leaf': 15,
         'min_child_samples': 30,
         'feature_pre_filter': False,
         'objective': 'multiclass',
         'metric': 'multi_logloss',
         'num_class': 3}

cvm = lgb.cv(param, nfold=4, train_set=train_data, categorical_feature=categorical_feature)

param['min_data_in_leaf'] = 20

cvm = lgb.cv(param, nfold=4, train_set=train_data, categorical_feature=categorical_feature)

On the second call, I get the warnings:

[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=20, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=20
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=20, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=20
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000639 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 778
[LightGBM] [Info] Number of data points in the train set: 402, number of used features: 57
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=20, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=20
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000651 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 778
[LightGBM] [Info] Number of data points in the train set: 403, number of used features: 57
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=20, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=20
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000526 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 778
[LightGBM] [Info] Number of data points in the train set: 403, number of used features: 57
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15
[LightGBM] [Warning] min_data_in_leaf is set=20, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=20
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000535 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 778
[LightGBM] [Info] Number of data points in the train set: 403, number of used features: 57
[LightGBM] [Warning] min_data_in_leaf is set=15, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=15

The log still warns that min_data_in_leaf is set to 15, even though it was changed to 20 before the second call.
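One way to observe the cached state directly (a sketch; it assumes the Python Dataset object exposes its merged construction parameters via the params attribute):

# Inspect the parameters cached on the Dataset after the two cv() calls
# (hypothetical output; the point is that the first call's value sticks):
print(train_data.params)
# expected to include 'min_data_in_leaf': 15 rather than the 20 passed last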

Workaround

A workaround is to rebuild the dataset after each call to lgb.cv. For example:

from sklearn.datasets import load_digits

import lightgbm as lgb

X, y = load_digits(n_class=3, return_X_y=True)
train_data = lgb.Dataset(X, y)

categorical_feature = [0, 2]
param = {'min_data_in_leaf': 15,
         'min_child_samples': 30,
         'feature_pre_filter': False,
         'objective': 'multiclass',
         'metric': 'multi_logloss',
         'num_class': 3}

cvm = lgb.cv(param, nfold=4, train_set=train_data, categorical_feature=categorical_feature)

train_data = lgb.Dataset(X, y)    # rebuild the dataset to forget previously entered parameters
param['min_data_in_leaf'] = 20

cvm = lgb.cv(param, nfold=4, train_set=train_data, categorical_feature=categorical_feature)

In this case, there is no reference to min_data_in_leaf being 15.
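The same idea can be wrapped in a small helper for a hyperparameter search loop, so that every trial trains on a freshly built Dataset (a sketch; cv_fresh is a hypothetical name, not part of the LightGBM API):

def cv_fresh(param, X, y, **cv_kwargs):
    # Build a new Dataset per call so that parameters cached by earlier
    # calls cannot leak into this one.
    return lgb.cv(param, train_set=lgb.Dataset(X, y), **cv_kwargs)

for leaf_size in (15, 20, 50):
    cvm = cv_fresh({**param, 'min_data_in_leaf': leaf_size}, X, y,
                   nfold=4, categorical_feature=categorical_feature)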

it may be just a coincidence that your metrics are the same in 2.3.1 and 3.0.0 for a particular set of params.

In general yes, but not in this case, for various reasons. One of them is that exact equality arising as a confluence of several changes acting in opposite directions is a rather unlikely explanation (and one that does not pass the Occam's razor test, if I may say so :) when we observe it on 4 mostly uncorrelated metrics, each computed on 3 different datasets, i.e. on 12 values in total…

For instance, I can reveal the MSE values here, as these metrics are dataset-dependent:

MSE with default hyperparameters:

              v2.3.1 (py37)   v3.0.0 (py38)
  test set 1  142091.4        142091.4
  test set 2  166397.3        166397.3
  test set 3  149005.5        149005.5

MSE with custom hyperparameters:

              v2.3.1 (py37)   v2.3.1 (py38)   v3.0.0 (py38)
  test set 1  137259.1        137259.1        140917.6
  test set 2  178320.1        178320.1        188110.4
  test set 3  145246.7        145246.7        151146.2

But I will nevertheless try to produce a self-contained example 😃