xgboost: [jvm-packages] XGBoost Spark training quite slow - Good practices
Hello!
I posted a question about some OOM errors I was facing with training here: https://discuss.xgboost.ai/t/xgboost4j-spark-fails-with-oom-errors/1054. Thankfully, I was able to resolve these issues and found a configuration that works. However, it takes ~8 min for 1 round. I’m looking for tips to speed this up.
My data is ~120GB before being transformed into vectors. I’m using Spark 2.3.2 and XGBoost 0.82. My configuration ensures that most of the training and validation data is held in memory. I’m using spark.memory.storageFraction = 0.16, which is quite low. Execution apparently requires a lot of memory; I get OOM errors if I increase this value. I’ve also tried increasing spark.memory.fraction, and I get OOM errors once it reaches 0.7.
Here’s my full spark configuration:
spark.executor.cores 8
spark.driver.memory 11171M
spark.executor.memory 10000M
spark.executor.memoryOverhead 15600M
spark.default.parallelism 1280
spark.sql.shuffle.partitions 2000
spark.memory.fraction 0.6
spark.memory.storageFraction 0.16
spark.task.cpus 4
For xgboost, I use this configuration:
val booster = new XGBoostClassifier(
  Map(
    "missing" -> -999999.0,
    "booster" -> "gbtree",
    "objective" -> "binary:logistic",
    "eval_metric" -> "logloss",
    "tree_method" -> "approx",
    "eta" -> 0.2,
    "gamma" -> 1.0,
    "alpha" -> 20,
    "max_depth" -> 4,
    "num_round" -> 1800,
    "num_workers" -> 160,
    "nthread" -> 4,
    "timeout_request_workers" -> 60000L
  )
).setLabelCol(targetVar)
  .setEvalSets(evalSet)
  .setUseExternalMemory(true)
  .setCheckpointInterval(2)
  .setCheckpointPath("checkpoints_path")
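As a rough sanity check, the Spark and XGBoost settings quoted in this post can be turned into the cluster resources they imply. The sketch below uses only the values from the post. The object and field names are made up for illustration. The unified-memory estimate assumes Spark's default 300 MB reserved heap and treats spark.executor.memory as the usable heap, so the real figures will be somewhat lower.

```scala
// Sanity-check arithmetic for the settings quoted in this post.
// Input values are copied from the post; everything else is derived.
object ResourceCheck {
  val executorCores = 8         // spark.executor.cores
  val taskCpus = 4              // spark.task.cpus
  val numWorkers = 160          // XGBoost num_workers
  val nthread = 4               // XGBoost nthread
  val executorMemoryMb = 10000  // spark.executor.memory
  val memoryOverheadMb = 15600  // spark.executor.memoryOverhead
  val memoryFraction = 0.6      // spark.memory.fraction
  val storageFraction = 0.16    // spark.memory.storageFraction

  // Each XGBoost worker runs as one Spark task that reserves taskCpus cores,
  // and all workers have to be running at the same time.
  val tasksPerExecutor: Int = executorCores / taskCpus
  val executorsNeeded: Int =
    math.ceil(numWorkers.toDouble / tasksPerExecutor).toInt
  val totalCores: Int = numWorkers * taskCpus

  // Container size requested per executor: JVM heap plus overhead.
  val containerMb: Int = executorMemoryMb + memoryOverheadMb
  val clusterMemoryGb: Double = executorsNeeded.toLong * containerMb / 1024.0

  // Spark unified memory: (heap - 300 MB reserved) * spark.memory.fraction,
  // of which storageFraction is protected from eviction by execution.
  val unifiedMb: Double = (executorMemoryMb - 300) * memoryFraction
  val protectedStorageMb: Double = unifiedMb * storageFraction

  def main(args: Array[String]): Unit = {
    println(s"tasks per executor:       $tasksPerExecutor")
    println(s"executors needed:         $executorsNeeded")
    println(s"cores needed:             $totalCores")
    println(s"container size (MB):      $containerMb")
    println(f"cluster memory (GB):      $clusterMemoryGb%.0f")
    println(f"unified memory (MB):      $unifiedMb%.0f per executor")
    println(f"protected storage (MB):   $protectedStorageMb%.0f per executor")
    println(s"nthread <= task cpus:     ${nthread <= taskCpus}")
  }
}
```

With these inputs, 160 workers at 4 CPUs each need 640 cores across 80 executors, and each executor requests about 25 GB, most of it overhead outside the JVM heap. Note that storageFraction only sets the portion of unified memory that execution cannot evict. Cached data can use more than that while execution memory is free.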
FYI, I can train on this data on a single very large machine, where it takes ~1 min per iteration (though the first iteration takes more than 1 hour, in addition to 0.5 hours for loading the data). The goal is to move the whole training process to xgboost-spark so it can scale with the data and we don’t have to keep getting larger machines.
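To put the per-round timings in this post in perspective, here is a small estimate of the total training time for the configured 1800 rounds. It is only a sketch. It assumes the per-round time stays constant, ignores checkpointing overhead, and treats the single-machine startup cost ("more than 1 hour" plus 0.5 hours of loading) as a lower bound of 1.5 hours. The object and field names are illustrative.

```scala
// Rough total-time estimate from the timings quoted in this post.
// Assumes constant time per round, which is a simplification.
object TrainingTimeEstimate {
  val numRound = 1800               // "num_round" from the XGBoost config
  val sparkMinPerRound = 8.0        // ~8 min per round on Spark
  val singleMinPerRound = 1.0       // ~1 min per round on the single machine
  val singleStartupHours = 1.0 + 0.5 // first iteration (>1 h) + data loading

  // Total wall-clock time on Spark, in hours and days.
  val sparkHours: Double = numRound * sparkMinPerRound / 60
  val sparkDays: Double = sparkHours / 24

  // Total wall-clock time on the single machine (lower bound), in hours.
  val singleHours: Double =
    numRound * singleMinPerRound / 60 + singleStartupHours

  def main(args: Array[String]): Unit = {
    println(f"Spark:          $sparkHours%.1f h ($sparkDays%.1f days)")
    println(f"Single machine: at least $singleHours%.1f h")
    println(f"Slowdown:       ${sparkHours / singleHours}%.1fx")
  }
}
```

Under these assumptions the Spark run would take about 240 hours (10 days) against roughly 31.5 hours on the single machine, so the per-round time is what needs to come down.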
Posting here because I didn’t get any responses on the discussion forum.
Any help will be appreciated. Thank you!
About this issue
- State: open
- Created 5 years ago
- Comments: 21 (10 by maintainers)
@billbargens one iteration = building one tree here.
I think you can track the executor log or the driver log to see how many iterations (a.k.a. 🌲) have been completed.
try to remove
for now, a fix is coming soon