xgboost: [jvm-packages] XGBoost Spark training quite slow - Good practices

Hello!

I posted a question about some OOM errors I was facing during training here: https://discuss.xgboost.ai/t/xgboost4j-spark-fails-with-oom-errors/1054. Thankfully, I was able to resolve those issues and found a configuration that works. However, training takes ~8 minutes per boosting round, and I’m looking for tips to speed it up.

My data is ~120 GB before being transformed into feature vectors. I’m using Spark 2.3.2 and XGBoost 0.82. My configuration ensures most of the training and validation data is cached in memory. I’ve set spark.memory.storageFraction = 0.16, which is quite low, because execution apparently requires a lot of memory: I get OOM errors if I increase this value. I’ve also tried increasing spark.memory.fraction, but that likewise produces OOM errors once I raise it to 0.7.

Here’s my full spark configuration:

spark.executor.cores             8
spark.driver.memory              11171M
spark.executor.memory            10000M
spark.executor.memoryOverhead    15600M
spark.default.parallelism        1280
spark.sql.shuffle.partitions     2000
spark.memory.fraction            0.6
spark.memory.storageFraction     0.16
spark.task.cpus                  4
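
For reference, here is a back-of-the-envelope sketch of what these settings mean under Spark’s unified memory model (the 300 MB reserved-memory constant is Spark’s default; the exact numbers are my own calculation, not from the original post):

```scala
// Rough per-executor memory budget implied by the configuration above.
// 300 MB is Spark's default reserved memory.
val executorHeapMb  = 10000.0 // spark.executor.memory
val reservedMb      = 300.0
val memoryFraction  = 0.6     // spark.memory.fraction
val storageFraction = 0.16    // spark.memory.storageFraction

val unifiedMb   = (executorHeapMb - reservedMb) * memoryFraction // ~5820 MB shared pool
val storageMb   = unifiedMb * storageFraction                    // ~931 MB protected for cached data
val executionMb = unifiedMb - storageMb                          // ~4889 MB for shuffles/sorts

println(f"unified=$unifiedMb%.0f MB, storage=$storageMb%.0f MB, execution=$executionMb%.0f MB")
```

Note that storage and execution can borrow from each other at runtime; storageFraction only sets the floor that is protected from eviction, which is why lowering it helps under execution-memory pressure. The large 15600M memoryOverhead is off-heap, and is likely where XGBoost’s native (C++) allocations live, which would explain why it needs to exceed the heap here.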

For xgboost, I use this configuration:

val booster = new XGBoostClassifier(
  Map(
    "missing" -> -999999.0,
    "booster" -> "gbtree",
    "objective" -> "binary:logistic",
    "eval_metric" -> "logloss",
    "tree_method" -> "approx",
    "eta" -> 0.2,
    "gamma" -> 1.0,
    "alpha" -> 20,
    "max_depth" -> 4,
    "num_round" -> 1800,
    "num_workers" -> 160,
    "nthread" -> 4,
    "timeout_request_workers" -> 60000L
  )
).setLabelCol(targetVar)
.setEvalSets(evalSet)
.setUseExternalMemory(true)
.setCheckpointInterval(2)
.setCheckpointPath("checkpoints_path")
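
One thing worth double-checking (my own observation, not from the original post): xgboost4j-spark runs one distributed worker per Spark task, so num_workers = 160 with spark.task.cpus = 4 reserves 640 cores for training, and the library repartitions the input to num_workers if the partition count doesn’t already match. Pre-partitioning and caching the data up front avoids paying for that repartition inside fit(). A minimal sketch, where trainDF and booster stand in for the DataFrame and classifier defined elsewhere:

```scala
// Align the DataFrame's partitioning with num_workers before training,
// so xgboost4j-spark does not trigger its own repartition during fit().
val numWorkers = 160
val alignedDF  = trainDF.repartition(numWorkers).cache()
alignedDF.count() // materialize the cache before training starts

val model = booster.fit(alignedDF)
```

Keeping nthread equal to spark.task.cpus (both 4 here) is also the recommended pairing, so that each worker uses exactly the cores its task owns.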

FYI, I can train this data on a single very large machine at ~1 minute per iteration (though the first iteration takes more than an hour, on top of ~0.5 hours for loading the data). The goal is to move this whole training process to xgboost-spark so it scales with the data and we don’t have to keep provisioning larger machines.

Posting here because I didn’t get any responses on the discussion forum.

@CodingCat @trivialfis

Any help will be appreciated. Thank you!

About this issue

  • State: open
  • Created 5 years ago
  • Comments: 21 (10 by maintainers)

Most upvoted comments

@billbargens one iteration = building one tree here.

Meanwhile I had another question - how do you know the progress of training if you don’t use checkpoints?

I think you might be able to track the executor or driver logs and see how many iterations have been completed (a.k.a. 🌲 built).

Try removing

.setCheckpointInterval(2)
.setCheckpointPath("checkpoints_path")

for now; a fix is coming soon.
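
Concretely, that means dropping the two checkpoint-related setters from the snippet above and following progress via the per-iteration eval_metric lines in the driver/executor logs instead (a sketch of the suggested workaround, not a tested fix; params stands for the parameter Map shown earlier):

```scala
// Same classifier as before, minus checkpointing; training progress can be
// tracked through the per-iteration logloss lines that appear in the logs.
val booster = new XGBoostClassifier(params)
  .setLabelCol(targetVar)
  .setEvalSets(evalSet)
  .setUseExternalMemory(true)
// .setCheckpointInterval(2)               // removed for now
// .setCheckpointPath("checkpoints_path")  // removed for now
```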