xgboost: [jvm-packages] XGBoostClassifier training fails with large data on a multi-node cluster

Hi, I have a pipeline that combines hyperparameter tuning, an evaluator, and cross-validation on an XGBoostClassifier model. However, I run into the following error and would like some help understanding what it means. Any suggestions or insight would be greatly appreciated. I can also provide more information if required.

20/12/10 09:39:34 ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training: ExceptionFailure(ml.dmlc.xgboost4j.java.XGBoostError,[09:39:34] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:159: [09:39:34] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:78: Check failed: jenv->ExceptionOccurred(): 
Stack trace:
  [bt] (0) /tmp/libxgboost4j169322301632920248.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x57) [0x7f6600a5e947]
  [bt] (1) /tmp/libxgboost4j169322301632920248.so(XGBoost4jCallbackDataIterNext+0x2d55) [0x7f6600a5b595]
  [bt] (2) /tmp/libxgboost4j169322301632920248.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int)+0x2c0) [0x7f6600b1cda0]
  [bt] (3) /tmp/libxgboost4j169322301632920248.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int, std::string const&, unsigned long)+0x45) [0x7f6600b11d15]
  [bt] (4) /tmp/libxgboost4j169322301632920248.so(XGDMatrixCreateFromDataIter+0x153) [0x7f6600a5f943]
  [bt] (5) /tmp/libxgboost4j169322301632920248.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f6600a57426]
  [bt] (6) [0x7f68ad018427]


Stack trace:
  [bt] (0) /tmp/libxgboost4j169322301632920248.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x57) [0x7f6600a5e947]
  [bt] (1) /tmp/libxgboost4j169322301632920248.so(XGBoost4jCallbackDataIterNext+0x2664) [0x7f6600a5aea4]
  [bt] (2) /tmp/libxgboost4j169322301632920248.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int)+0x2c0) [0x7f6600b1cda0]
  [bt] (3) /tmp/libxgboost4j169322301632920248.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int, std::string const&, unsigned long)+0x45) [0x7f6600b11d15]
  [bt] (4) /tmp/libxgboost4j169322301632920248.so(XGDMatrixCreateFromDataIter+0x153) [0x7f6600a5f943]
  [bt] (5) /tmp/libxgboost4j169322301632920248.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f6600a57426]
  [bt] (6) [0x7f68ad018427]

,[Ljava.lang.StackTraceElement;@2c0925ec,ml.dmlc.xgboost4j.java.XGBoostError: [09:39:34] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:159: [09:39:34] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:78: Check failed: jenv->ExceptionOccurred(): 
Stack trace:
  [bt] (0) /tmp/libxgboost4j169322301632920248.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x57) [0x7f6600a5e947]
  [bt] (1) /tmp/libxgboost4j169322301632920248.so(XGBoost4jCallbackDataIterNext+0x2d55) [0x7f6600a5b595]
  [bt] (2) /tmp/libxgboost4j169322301632920248.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int)+0x2c0) [0x7f6600b1cda0]
  [bt] (3) /tmp/libxgboost4j169322301632920248.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int, std::string const&, unsigned long)+0x45) [0x7f6600b11d15]
  [bt] (4) /tmp/libxgboost4j169322301632920248.so(XGDMatrixCreateFromDataIter+0x153) [0x7f6600a5f943]
  [bt] (5) /tmp/libxgboost4j169322301632920248.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f6600a57426]
  [bt] (6) [0x7f68ad018427]


Stack trace:
  [bt] (0) /tmp/libxgboost4j169322301632920248.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x57) [0x7f6600a5e947]
  [bt] (1) /tmp/libxgboost4j169322301632920248.so(XGBoost4jCallbackDataIterNext+0x2664) [0x7f6600a5aea4]
  [bt] (2) /tmp/libxgboost4j169322301632920248.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int)+0x2c0) [0x7f6600b1cda0]
  [bt] (3) /tmp/libxgboost4j169322301632920248.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter>(xgboost::data::IteratorAdapter*, float, int, std::string const&, unsigned long)+0x45) [0x7f6600b11d15]
  [bt] (4) /tmp/libxgboost4j169322301632920248.so(XGDMatrixCreateFromDataIter+0x153) [0x7f6600a5f943]
  [bt] (5) /tmp/libxgboost4j169322301632920248.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f6600a57426]
  [bt] (6) [0x7f68ad018427]


	at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
	at ml.dmlc.xgboost4j.java.DMatrix.<init>(DMatrix.java:54)
	at ml.dmlc.xgboost4j.scala.DMatrix.<init>(DMatrix.scala:42)
	at ml.dmlc.xgboost4j.scala.spark.Watches$.buildWatches(XGBoost.scala:790)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainForNonRanking$1.apply(XGBoost.scala:451)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainForNonRanking$1.apply(XGBoost.scala:450)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 32 (2 by maintainers)

Most upvoted comments

@dchristle ignore my last question. I forgot there is .setMissing.

// XGBoostRegBridge is my own wrapper around the trained model
val bri = new XGBoostRegBridge("uid", model)
bri.xgbRegressionModel.setFeaturesCol("feature_vector")
// Treat 0.0 as the missing value
bri.xgbRegressionModel.setMissing(0.0F)
val pred = bri.xgbRegressionModel.transform(train_sparse)

The above code works! I can’t believe it was that simple. I will have to make sure the predictions are coming out as expected, but it does work! @monicasenapati
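
One possible reading of why `.setMissing(0.0F)` helped, offered as an assumption rather than something confirmed in this thread: a sparse feature vector stores only its non-zero entries, so the configured missing value has to agree with what "not stored" means. The sketch below is plain Scala with no Spark or XGBoost dependency; `MissingSketch` and `activeEntries` are hypothetical names used only for illustration and are not part of xgboost4j-spark.

```scala
// Hypothetical illustration, not xgboost4j-spark code.
object MissingSketch {
  // Keep only the (index, value) pairs of a dense row whose value is not
  // the configured missing marker.
  def activeEntries(row: Array[Float], missing: Float): Seq[(Int, Float)] =
    row.toSeq.zipWithIndex.collect {
      // NaN never equals itself, so a NaN marker needs an explicit check.
      case (v, i) if !(v == missing || (v.isNaN && missing.isNaN)) => (i, v)
    }

  def main(args: Array[String]): Unit = {
    val dense = Array(0.0f, 1.5f, 0.0f, 2.0f)
    // A sparse encoding of the same row stores only the non-zero entries.
    val sparse = Seq((1, 1.5f), (3, 2.0f))

    // With missing = 0.0f the dense view reduces to the sparse one.
    println(activeEntries(dense, 0.0f) == sparse)
    // With missing = NaN the dense view keeps the explicit zeros,
    // which the sparse encoding never stored.
    println(activeEntries(dense, Float.NaN).size)
  }
}
```

When the marker is 0.0, the dense and sparse views of a row describe the same set of present features; with any other marker they can disagree, which is why the predictions are still worth checking, as noted above.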

@monicasenapati I am trying to use the xgboost4j-spark library, which has a different API than the xgboost4j library. The xgboost4j-spark transform function uses distributed computing for predictions, which is necessary for me due to the size of my data. My data is highly sparse and in long format, so I’m also trying to avoid memory-intensive operations on it. I’m definitely open to alternatives.

@jmpanfil I use xgboost4j-spark too and have similar data: highly sparse and very large. I'm not sure whether the data type could be an issue, though. If you find something, please let me know, since I have hit a roadblock too. Thank you!

Great! I will try to reproduce the error on my end and investigate the root cause.

@hcho3 Thank you so much for your time; I appreciate it. I was able to get past that issue. It turned out to be a bug in my code, which was not parsing the input CSV files as I intended. The issue reported here now appears to be fixed. I am running into another Spark error, which I will have to fix separately.
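
Since the root cause here was a CSV parsing bug, a quick check of the raw rows before training can surface this kind of problem early. The following is a minimal sketch in plain Scala, assuming a simple comma-separated numeric file with no quoted fields; `CsvCheckSketch` and `badRows` are hypothetical names, not taken from the attached script.

```scala
import scala.util.Try
import java.util.regex.Pattern

// Hypothetical illustration of a pre-training sanity check.
object CsvCheckSketch {
  // Returns the 1-based line numbers of rows that do not have
  // `expectedCols` fields or contain a field that does not parse as a Double.
  def badRows(lines: Seq[String], expectedCols: Int, sep: String = ","): Seq[Int] =
    lines.zipWithIndex.collect {
      case (line, i) if !isValid(line, expectedCols, sep) => i + 1
    }

  private def isValid(line: String, expectedCols: Int, sep: String): Boolean = {
    // A limit of -1 keeps trailing empty fields, so "1,2," counts as three fields.
    val fields = line.split(Pattern.quote(sep), -1)
    fields.length == expectedCols &&
      fields.forall(f => Try(f.trim.toDouble).isSuccess)
  }

  def main(args: Array[String]): Unit = {
    // Row 2 has an empty field and row 3 has too few columns.
    val sample = Seq("1.0,0,3.5", "2.0,,1", "0.5,7")
    println(badRows(sample, expectedCols = 3))
  }
}
```

Running a check like this on a sample of the input files before building the feature vectors makes malformed rows visible as line numbers instead of as a native stack trace during training.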


ErrorSample.zip: this contains a training script and the sample data I am trying to train on.