xgboost: Memory leak in subsequent model fitting?

Hi, I created an XGBoost model with data subsampling (subsample < 1) and implemented a loop that trains it with different random_state values and keeps the best-performing initialization.

This is the simplified Python code:

import xgboost as xgb

# setup train_x, train_y, eval_set, eval_metric #
accuracies = []
best = None
for i in range(100):
    # Train a fresh model with a different seed on each iteration.
    variation = xgb.XGBClassifier(random_state=i, n_jobs=-1)
    variation.fit(train_x, train_y, eval_set=eval_set, eval_metric=eval_metric)
    # Accuracy on the second evaluation set at the last boosting round.
    accuracy = 1 - variation.evals_result()['validation_1']['error'][-1]
    if best is None or accuracy > max(accuracies):
        best = variation
    accuracies.append(accuracy)

It works as expected, but if I monitor the RAM during the process, usage keeps growing after each iteration even though I assign a new XGBClassifier to the same variation variable. Shouldn't Python free the now-unreferenced objects each time and keep memory usage roughly constant?
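If it helps anyone reproduce this without a GUI monitor, here is a minimal sketch that logs the process's resident set size after each iteration (assuming the third-party psutil package; rss_mb is just an illustrative helper name):

import os
import psutil  # third-party: pip install psutil

_process = psutil.Process(os.getpid())

def rss_mb():
    # Resident set size of the current process, in MiB.
    return _process.memory_info().rss / (1024 ** 2)

# Call inside the training loop, after each fit:
#     print(f"iteration {i}: {rss_mb():.1f} MiB")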

[Screenshot of the System Monitor Resources tab showing CPU and memory usage.] Computation starts at the first 100% CPU spike on the left, and a new iteration starts at each drop in CPU usage; RAM usage steps up accordingly but never goes down.

Is this a bug in resource deallocation inside XGBClassifier, or am I missing something?

Many thanks in advance, Cheers

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 18 (7 by maintainers)

Most upvoted comments

Awesome, that was it! 🏆 And there is more: with forced garbage collection, “del variation” is no longer needed, since a new instance is assigned to the same variable and Python recycles the memory occupied by the previous one (as I expected in my first comment in this issue).

Many thanks! 👍
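For future readers, here is a minimal sketch of the fix discussed above: calling Python's gc.collect() at the end of each iteration (the training setup names are from the original post):

import gc

for i in range(100):
    variation = xgb.XGBClassifier(random_state=i, n_jobs=-1)
    variation.fit(train_x, train_y, eval_set=eval_set, eval_metric=eval_metric)
    # ... evaluate and track the best model as in the original loop ...
    # Force a collection pass so the memory held by the previous model
    # is released promptly; no explicit `del variation` is needed.
    gc.collect()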

Sorry for the long wait. I’m refactoring the IO logic right now. Will start squashing bugs once it’s sorted out.

Not yet. Marked as blocking. Thanks for the patience.