LightGBM: Error saving very large LightGBM models

How are you using LightGBM?

LightGBM component: Python package

Environment info

Operating System: Windows 10

CPU/GPU model: GPU

C++ compiler version: NA

CMake version: NA

Java version: NA

Python version: 3.6.6

R version: NA

Other: NA

LightGBM version or commit hash: 3.1.0

Error message and / or logs

I’m observing errors when training sufficiently large tree models (on either CPU or GPU). Namely, when max_leaves and the number of boosting rounds are high enough, all the boosting rounds finish, but an error occurs when the model is serialised and then deserialised.

To avoid the automatic to_string and from_string calls after the final boosting round, I’ve tried setting keep_training_booster=True, saving the model out to disk, and then reloading it. Saving the model as text or as pickle succeeds in both cases, but loading it back fails.
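Roughly, the two paths look like this (a minimal sketch; the file names are placeholders and gbm is the trained Booster from the example further down):

import pickle

import lightgbm as lgb

# Path 1: plain-text model file
gbm.save_model("model.txt")                    # save succeeds
gbm_txt = lgb.Booster(model_file="model.txt")  # fails/crashes for the huge model

# Path 2: pickle (as far as I can tell, pickling a Booster serialises it via its model string)
with open("model.pkl", "wb") as f:
    pickle.dump(gbm, f)                        # save succeeds
with open("model.pkl", "rb") as f:
    gbm_pkl = pickle.load(f)                   # crashes for the huge model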

I’ve investigated this issue and found that, when writing out to a text file, the last tree written is “Tree=4348” even though I requested more boosting rounds than that. When loading the model there is then a mismatch between the number of elements in the “tree_sizes” attribute of the file (5000) and the actual number of trees in the file (4348), which causes an error.
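For reference, this is roughly how I checked the mismatch (a small sketch; “model.txt” is a placeholder path):

# Count the "Tree=" sections in the model file and compare with the number of
# entries in the "tree_sizes=" header line.
n_tree_sections = 0
n_tree_sizes = None
with open("model.txt") as f:
    for line in f:
        if line.startswith("Tree="):
            n_tree_sections += 1
        elif line.startswith("tree_sizes="):
            n_tree_sizes = len(line.split("=", 1)[1].split())
print(f"tree_sizes entries: {n_tree_sizes}, Tree= sections: {n_tree_sections}")
# In my case the header listed 5000 sizes but the last section written was "Tree=4348".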

I believe the underlying issue is the same as in https://github.com/microsoft/LightGBM/issues/2828. I also found this comment alluding to a 2GB limit of string streams, and my text file is almost exactly 2GB: https://github.com/microsoft/LightGBM/issues/372#issuecomment-325695953

I added some of my own logging inside the LightGBM Python layer and captured the following logs:

....
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 38
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 43
[LightGBM] [Debug] Trained a tree with leaves = 5000 and max_depth = 46
CASEY: Finished final boosting iteration
Training complete: 21154.90s
Attempting to save model as pickle
Attempting to convert model with default buffer len 1048576
Creating string buffer for actual len (2147483648)
Converting model with actual buffer len (2147483648)
Converted model to string
Decode model string to utf-8
Successfully saved model as pickle
Attempting to load model from pickle
<program hangs and pulls up dialog box indicating python has stopped working>

Reproducible example(s)

Note the model must be really large to observe this error; training it took almost 6 hours on a V100 GPU. If model size does not depend on the number of rows or columns, you may be able to use smaller numbers than I did and speed things up a little (a rough size-estimate sketch follows the example below). Before reaching enough boosting rounds for the model to crash, model performance keeps improving, so there is reason to believe a model this big really is necessary.

import time

import numpy as np
import lightgbm as lgb

n = int(2e7)
m = 250
max_leaves = 5000
max_bin = 255
x_train = np.random.randn(n, m).astype(np.float32)
A = np.random.randint(-5, 5, size=(m, 1))
y_train = (x_train @ A).astype(np.float32)

print(f"x_train.shape = {x_train.shape}, y_train.shape = {y_train.shape}")
n_boosting_rounds = 5000
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'device': 'gpu',
    'metric': {'rmse'},
    'num_leaves': max_leaves,
    'bagging_fraction': 0.5,
    'feature_fraction': 0.5,
    'learning_rate': 0.01,
    'verbose': 2,
    'max_bin': max_bin,
}
ds_train = lgb.Dataset(x_train, y_train.ravel(), free_raw_data=False).construct()
start = time.perf_counter()
gbm = lgb.train(
    params,
    ds_train,
    num_boost_round=n_boosting_rounds,
    keep_training_booster=True,  # set this to False and the code will crash here
)
elapsed_train_time = time.perf_counter() - start

print(f"Training complete: {elapsed_train_time:.2f}s")

model_format = "text"  # defined here so the snippet runs standalone
train_start = time.strftime("%Y%m%d-%H%M%S")  # timestamp used in the output filename
model_file = f"{train_start}_model.txt"
print(f"Attempting to save model as {model_format}", flush=True)
gbm.save_model(model_file)
print(f"Successfully saved model as {model_format}!", flush=True)

print(f"Attempting to load model from {model_format}", flush=True)
gbm = lgb.Booster(model_file=model_file) # program dies here!!
print(f"Successfully loaded model from {model_format}", flush=True)

Steps to reproduce

  1. Generate some fake linear data
  2. Train a GBDT with enough boosting rounds and leaves to cause an error (the boosting itself is fine; it is the clean-up step after boosting that is problematic)

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 17

Most upvoted comments

@StrikerRUS Although this issue is closed, I’ll leave this here for reference in case there are plans for a fix.

The original issue was seen on Windows, where 5000 leaves and 5000 boosting rounds were sufficient to observe the problem consistently on data of shape (3e6, 250). I reran the same experiment on a Linux machine on the CPU, using both 5000 and 8000 boosting rounds. Both runs produced an output text file over 2GB (which I never observed on Windows) and neither produced any Python crashes.

The larger of the two files was 3.7GB. I manually checked the tail of the file and found “Tree=7999”, indicating the full model is present without the truncation I was seeing previously.
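For completeness, the tail check was roughly the following (a sketch; the path is a placeholder and the 5MB window is arbitrary but comfortably larger than a single tree’s text):

import os
import re

# Scan the last few MB of the model file for the highest "Tree=" section header.
path = "model.txt"  # placeholder
with open(path, "rb") as f:
    f.seek(max(0, os.path.getsize(path) - 5_000_000))
    tail = f.read().decode("utf-8", errors="ignore")

tree_ids = [int(m) for m in re.findall(r"^Tree=(\d+)", tail, flags=re.MULTILINE)]
print("last tree section:", max(tree_ids) if tree_ids else "none found")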

All of this strongly suggests this is the same issue another user referenced in a previous comment.

Sorry, this was locked accidentally. Just unlocked it. We’d still love help with this feature!

One more thing to add: I can train a huge model on Linux (larger than 2GB), then load the model on Windows and do inference. I cross-referenced the predictions with the Linux ones on a few thousand random data points and the L1 norm of the error is 0, so I’m fairly confident the model loaded on Windows is not corrupt (I was worried it was silently loading only 2GB of trees). The model load function appears to use string streams as well, so I’m now less sure about my previous hypothesis about the cause.
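The cross-check itself was nothing fancy; roughly this (a sketch, with placeholder file names for the Linux-trained model, the sampled rows, and the predictions saved on the Linux box):

import numpy as np
import lightgbm as lgb

# Load the Linux-trained model on the Windows machine and compare predictions
# on the same random rows against the predictions produced on Linux.
booster = lgb.Booster(model_file="model_trained_on_linux.txt")
x_check = np.load("random_rows.npy")
preds_windows = booster.predict(x_check)
preds_linux = np.load("preds_from_linux.npy")
print("L1 norm of the difference:", np.abs(preds_windows - preds_linux).sum())  # 0 for me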

OK, got it! Thanks a lot for all the details! I’m going to link this issue to the feature request for supporting huge models so that these details will be available there.

Hmmm, however, this issue is marked as closed. https://bugs.python.org/issue16865

Python version: 3.6.6

Maybe you could try a newer Python version?

Thanks @StrikerRUS! I knew pickle has some issues at the 4GB limit, but I thought I might be safe at 2GB. I will kick off a run now with joblib to see if that helps.

I’m not certain exactly how these serialisation libraries work, so hopefully they’re not calling some of the object’s methods during serialisation, which could lead to the string conversion issue again. Will comment here when I have some results, though.
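For reference, the joblib attempt is just the following (a sketch; the path is a placeholder). As far as I understand, joblib falls back to pickle for objects like the Booster, so it may still end up going through the same model-string conversion:

import joblib

# Dump and reload the trained Booster with joblib.
joblib.dump(gbm, "model.joblib")
gbm_loaded = joblib.load("model.joblib")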