LightGBM: Parameter min_data_in_leaf ignored by lightgbm.cv()
Environment info
Component: Python package
Operating System: Windows 10
CPU/GPU model: GeForce 960M
CMake version: 3.18.2
Python version: 3.8.3
LightGBM version: 3.0.0
Error message and / or logs
[LightGBM] [Warning] min_data_in_leaf is set=10, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=10
Reproducible example(s)
```python
param = {
    'min_data_in_leaf': 200,
    'feature_pre_filter': False,
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'num_class': 21,
}
cvm = lightgbm.cv(param, nfold=4, train_set=train_data,
                  categorical_feature=categorical_feature)
```
Steps to reproduce
- Run the script above.
- Receive the warning:
```
[LightGBM] [Warning] min_data_in_leaf is set=10, min_child_samples=30 will be ignored. Current value: min_data_in_leaf=10
```
- Notice that the warning reports `min_data_in_leaf=10` instead of `200`.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 40
@mosscoder Please don’t forget to set `bagging_freq`.

Good question. It is indeed a path-dependent issue, but it requires a conjunction of as many as three conditions:
Then the default parameter value passed via the primary name takes precedence during model training over the custom value passed via the synonym. The model object preserves both, so a reader (e.g. someone extracting the model object from storage in systems like MLflow) naturally assumes that the custom value was used for training, which is not true here. I tried various scenarios (paths) and synonyms, and all required these three conditions to produce an incorrect metric (one based on the default despite a custom value being passed to a synonym). I can imagine it extends to other hyperparameters too. Reloading the dataset before each model is trained does not help. I hope someone can reproduce this using independent tests.
For example:
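A minimal, self-contained sketch of the precedence rule described above, using a simplified stand-in for LightGBM's alias resolution (the helper and values below are hypothetical, not the library's actual API):

```python
# Simplified stand-in for LightGBM-style alias handling (hypothetical helper):
# when both the primary name and a synonym are present, the primary name wins
# and the synonym's custom value is silently dropped.
ALIASES = {"min_data_in_leaf": ("min_child_samples", "min_data", "min_samples_leaf")}

def resolve(params, main_name="min_data_in_leaf", default=20):
    params = dict(params)  # do not mutate the caller's dict
    if main_name in params:
        for alias in ALIASES[main_name]:
            params.pop(alias, None)  # synonym value is discarded
        return params
    for alias in ALIASES[main_name]:
        if alias in params:
            params[main_name] = params.pop(alias)
            return params
    params[main_name] = default
    return params

# A stale default stored under the primary name (e.g. cached on the dataset)
# overrides the custom value the user passed via the synonym:
print(resolve({"min_data_in_leaf": 20, "min_child_samples": 200}))
# → {'min_data_in_leaf': 20}
```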
@mirekphd I think the purpose of parameter aliases is convenience, not for the user to use a different alias every time in sequential training. If this is indeed a problem, we can throw an error directly when the user attempts this behavior, or disable aliasing entirely.
Yup, this is true: a metric-based regression test in which there is no need for name-conflict resolution never fails.
I think it is related to #2594: the Dataset object will preserve its initial values. The affected parameters are:
https://github.com/microsoft/LightGBM/blob/5b5f4e39a9d9b075ef0aedafb9e400ede521a34f/python-package/lightgbm/basic.py#L810-L828
It seems `lgb.cv` stores parameters entered in previous calls in the dataset, and uses those instead of the most recent parameters. Example:
On the second call, I get the warnings:
It warns that `min_data_in_leaf` is set to 15, while it was instead set to 20.

Workaround
A workaround is to rebuild the dataset for each call to `lgb.cv`. In this case, there is no reference to `min_data_in_leaf` being 15.
In general yes, but not in this case, for various reasons. One of them is that exact equality arising as a confluence of several changes acting in opposite directions is a rather unlikely explanation (and one not passing the Occam's razor test, if I may say so :)) when we observe it on 4 mostly uncorrelated metrics, each computed on 3 different datasets, i.e. on 12 values in total…
For instance, I can reveal the MSE values here, as these metrics are dataset-dependent:
vs.
But I will nevertheless try to produce a self-contained example 😃