xgboost: [jvm-packages] Models saved using xgboost4j-spark cannot be loaded in Python xgboost

A similar problem was reported in this issue, which was closed without verification. The page cited as the reason for closing claims there should be no problem, yet multiple people have reported running into it.

Here I’ll attempt to provide specific steps to reproduce the problem based on the instructions for using XGBoost with Spark from Databricks. The steps should be reproducible in the Databricks Community Edition.

The instructions in the Scala notebook work sufficiently well for xgboostModel.save("/tmp/myXgboostModel") to generate /tmp/myXgboostModel/data and /tmp/myXgboostModel/metadata/part-00000 (and the associated _SUCCESS file) using saveModelAsHadoopFile() under the covers.

The data file (download it) is 90388 bytes in my environment and begins with ??_reg_??features??label?.

The metadata file is:

{"class":"ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel","timestamp":1499039951741,"sparkVersion":"2.0.2","uid":"XGBoostRegressionModel_e053248158d9","paramMap":{"use_external_memory":true,"lambda_bias":0.0,"lambda":1.0,"sample_type":"uniform","max_bin":16,"subsample":1.0,"labelCol":"label","alpha":0.0,"predictionCol":"prediction","skip_drop":0.0,"booster":"gbtree","min_child_weight":1.0,"scale_pos_weight":1.0,"grow_policy":"depthwise","tree_method":"auto","sketch_eps":0.03,"featuresCol":"features","colsample_bytree":1.0,"normalize_type":"tree","gamma":0.0,"max_depth":6,"eta":0.3,"max_delta_step":0.0,"colsample_bylevel":1.0,"rate_drop":0.0}}

Attempting to load the model in Python with:

import xgboost as xgb
bst = xgb.Booster({'nthread':4})
bst.load_model("/dbfs/tmp/myXgboostModel/data")

results in

XGBoostError: [01:22:40] src/gbm/gbm.cc:20: Unknown gbm type 
---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-10-b93cf7356f83> in <module>()
      1 import xgboost as xgb
      2 bst = xgb.Booster({'nthread':4})
----> 3 bst.load_model("/dbfs/tmp/myXgboostModel/data")

/usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/core.pyc in load_model(self, fname)
   1005         if isinstance(fname, STRING_TYPES):
   1006             # assume file name, cannot use os.path.exist to check, file can be from URL.
-> 1007             _check_call(_LIB.XGBoosterLoadModel(self.handle, c_str(fname)))
   1008         else:
   1009             buf = fname

/usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/core.pyc in _check_call(ret)
    125     """
    126     if ret != 0:
--> 127         raise XGBoostError(_LIB.XGBGetLastError())
    128 
    129 

XGBoostError: [01:22:40] src/gbm/gbm.cc:20: Unknown gbm type 

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/libxgboost.so(_ZN7xgboost15GradientBooster6CreateERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorISt10shared_ptrINS_7DMatrixEESaISC_EEf+0x429) [0x7f8941d33ce9]
[bt] (1) /usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/libxgboost.so(_ZN7xgboost11LearnerImpl4LoadEPN4dmlc6StreamE+0x6d5) [0x7f8941bce9f5]
[bt] (2) /usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/libxgboost.so(XGBoosterLoadModel+0x28) [0x7f8941d364f8]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f895fd05e40]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f895fd058ab]
[bt] (5) /databricks/python/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f895ff153df]
[bt] (6) /databricks/python/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7f895ff19d82]
[bt] (7) /databricks/python/bin/python(PyObject_Call+0x43) [0x4b0de3]
[bt] (8) /databricks/python/bin/python(PyEval_EvalFrameEx+0x601f) [0x4c9b6f]
[bt] (9) /databricks/python/bin/python(PyEval_EvalCodeEx+0x255) [0x4c22e5]

The obvious question is whether the data file is in the same format as typical model output. I can't find any information on this topic. If it is not, what is the correct way to read models saved in Hadoop format from Python?
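One way to investigate is to look at what precedes the booster bytes. The prefix shown earlier (??_reg_??features??label?) looks like a run of strings written with Java's DataOutputStream.writeUTF, each preceded by a 2-byte big-endian length. The sketch below is only a diagnostic under that assumption: the helper names read_java_utf and split_spark_header are hypothetical, and the count of four leading strings (type tag, features column, label column, prediction column) is a guess based on the prefix and the metadata file, not a documented format.

```python
import struct


def read_java_utf(buf, offset):
    """Read one string in Java writeUTF layout starting at offset.

    Layout assumed: 2-byte big-endian unsigned length, then that many bytes.
    Returns (text, next_offset).
    """
    (length,) = struct.unpack_from(">H", buf, offset)
    start = offset + 2
    end = start + length
    if end > len(buf):
        raise ValueError("string length runs past the end of the buffer")
    # Java's modified UTF-8 matches plain UTF-8 for ASCII names like "features".
    return buf[start:end].decode("utf-8"), end


def split_spark_header(buf, n_strings=4):
    """Split the leading length-prefixed strings from the remaining bytes.

    n_strings=4 is an assumption for a regression model, not a known constant.
    Returns (list_of_strings, remaining_bytes).
    """
    offset = 0
    fields = []
    for _ in range(n_strings):
        text, offset = read_java_utf(buf, offset)
        fields.append(text)
    return fields, buf[offset:]


# Possible usage (untested against a real model file):
#   with open("/dbfs/tmp/myXgboostModel/data", "rb") as f:
#       fields, payload = split_spark_header(f.read())
#   print(fields)
#   bst.load_model(bytearray(payload))  # only if the fields look plausible
```

If the printed fields do not look like "_reg_", "features", "label" and "prediction", the assumption about the layout is wrong and the remaining bytes should not be trusted as a booster.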

Environment information:

OS: Debian Linux on AWS (Databricks Runtime 3.0 beta)
Scala & Python built from master at ed8bc4521e2967d7c6290a4be5895c10327f021a

Python build instructions:

cd /databricks/driver
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
git checkout ed8bc4521e2967d7c6290a4be5895c10327f021a
make -j
cd python-package
sudo python setup.py install

About this issue

State: closed
Created 7 years ago
Comments: 21 (7 by maintainers)

Most upvoted comments

@ssimeonov xgboostModel.booster.saveModel("/tmp/xgbm") succeeds. However, even though the Python booster then loads the model successfully, the probability predicted by the Spark booster differs from the probability predicted by the Python booster, even on the same instance. Are you facing this issue too?

DevHaufior on Apr 26, 2018
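To tell whether a mismatch like the one described in this comment is floating-point noise or a real divergence, it can help to quantify it first. The following is a minimal sketch, assuming the Spark and Python predictions for the same rows have already been collected into two plain Python lists in the same order. The function name compare_predictions and the default tolerance are illustrative choices, not part of xgboost.

```python
def compare_predictions(spark_probs, python_probs, abs_tol=1e-6):
    """Compare two equal-length sequences of predicted probabilities.

    Returns (max_abs_diff, mismatched_indices), where mismatched_indices
    lists the rows whose absolute difference exceeds abs_tol.
    """
    if len(spark_probs) != len(python_probs):
        raise ValueError("prediction lists differ in length")
    max_diff = 0.0
    mismatched = []
    for i, (a, b) in enumerate(zip(spark_probs, python_probs)):
        diff = abs(a - b)
        max_diff = max(max_diff, diff)
        if diff > abs_tol:
            mismatched.append(i)
    return max_diff, mismatched
```

A small maximum difference with no mismatched rows points to rounding. Many mismatched rows suggest the two sides are not scoring the same model or the same feature values.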