xgboost: [jvm-packages] spark hangs when training is run in quick succession

I am getting an infinite hang when I run the following code a few times in quick succession:

    import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

    // Sample dataset shipped with Spark; replace __SPARK_HOME_LOCATION__ with your Spark install path
    val dataPath = "__SPARK_HOME_LOCATION__/data/mllib/sample_binary_classification_data.txt"
    val data = spark.read.format("libsvm").option("vectorType", "dense").load(dataPath)
    val xgbClassifier = new XGBoostClassifier()

    // Train, score the same data, and print the predictions
    xgbClassifier.fit(data).transform(data).show()

Steps to reproduce:

1. Open spark-shell with the xgboost jars on the classpath.
2. Run the above code.
3. Quickly rerun the last line until a hang occurs.
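
The rerun step can be scripted instead of retyped. The following is a minimal sketch, assuming a tight loop triggers the hang the same way manual reruns do (not verified); `rerun` is a hypothetical helper name, not part of xgboost4j-spark:

```scala
// Hypothetical helper (not part of xgboost4j-spark): runs `body` n times
// back to back, logging each run so a stalled iteration is easy to spot.
def rerun(n: Int)(body: => Unit): Int = {
  var completed = 0
  for (i <- 1 to n) {
    println(s"run $i of $n: starting")
    body // blocks here indefinitely if the hang occurs
    completed += 1
    println(s"run $i of $n: finished")
  }
  completed // number of runs that returned
}

// Usage in spark-shell, assuming `xgbClassifier` and `data` are defined as above:
//   rerun(10) { xgbClassifier.fit(data).transform(data).show() }
```

If a "starting" line is printed with no matching "finished" line, that iteration is the one that hung.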

Other information: When the hang happens, I only get the tracker message and nothing after it, so I have to kill the Spark job. If I wait between runs, they always succeed.

    Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=XXX.XXX.XXX.XXX, DMLC_TRACKER_PORT=9096, DMLC_NUM_WORKER=2}

My environment:

XGBoost Master
Spark 2.4.3
(Happens in both Zeppelin and spark-shell)

About this issue

State: closed
Created 5 years ago
Comments: 16 (15 by maintainers)

Most upvoted comments

For folks who have a similar issue: a quick fix is to run the following before your training job.

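    val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
    // One element per default-parallelism slot
    val nWorker = spark.sparkContext.defaultParallelism
    // Barrier stage: all tasks are scheduled together, and each one calls Rabit.shutdown()
    spark.range(0, nWorker).rdd.barrier.mapPartitions { x => { ml.dmlc.xgboost4j.java.Rabit.shutdown(); x } }.collect()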

austinzh on Apr 17, 2023

Sure. Let me take it.

austinzh on Apr 17, 2023