xgboost: [jvm-packages] Models saved using xgboost4j-spark cannot be loaded in Python xgboost
A similar problem was reported in this issue, which was closed without verification. The page cited as the reason for closing claims there should be no problem, yet multiple people have reported experiencing it.
Here I’ll attempt to provide specific steps to reproduce the problem based on the instructions for using XGBoost with Spark from Databricks. The steps should be reproducible in the Databricks Community Edition.
The instructions in the Scala notebook work well enough that xgboostModel.save("/tmp/myXgboostModel") generates /tmp/myXgboostModel/data and /tmp/myXgboostModel/metadata/part-00000 (and the associated _SUCCESS file), using saveModelAsHadoopFile() under the covers.
The data file (download it) is 90388 bytes in my environment and begins with ??_reg_??features??label?.
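For anyone who wants to check the first bytes of their own data file, here is a minimal sketch of how a header like the one above can be rendered. The helper names `render_bytes` and `peek_header` are hypothetical, and the `/dbfs/...` path is simply the one used later in this report.

```python
def render_bytes(raw):
    """Render bytes as text, showing every non-printable byte as '?'."""
    # bytearray yields integers on both Python 2 and Python 3
    return "".join(chr(b) if 32 <= b < 127 else "?" for b in bytearray(raw))


def peek_header(path, n=32):
    """Read the first n bytes of a file and render them for inspection."""
    with open(path, "rb") as f:
        return render_bytes(f.read(n))


# Example call (path assumed from this report, not executed here):
# print(peek_header("/dbfs/tmp/myXgboostModel/data"))
```

A run of readable column names separated by non-printable bytes suggests that something was written in front of the model itself, although this check alone does not say what.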
The metadata file is:
{"class":"ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel","timestamp":1499039951741,"sparkVersion":"2.0.2","uid":"XGBoostRegressionModel_e053248158d9","paramMap":{"use_external_memory":true,"lambda_bias":0.0,"lambda":1.0,"sample_type":"uniform","max_bin":16,"subsample":1.0,"labelCol":"label","alpha":0.0,"predictionCol":"prediction","skip_drop":0.0,"booster":"gbtree","min_child_weight":1.0,"scale_pos_weight":1.0,"grow_policy":"depthwise","tree_method":"auto","sketch_eps":0.03,"featuresCol":"features","colsample_bytree":1.0,"normalize_type":"tree","gamma":0.0,"max_depth":6,"eta":0.3,"max_delta_step":0.0,"colsample_bylevel":1.0,"rate_drop":0.0}}
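To rule out an unusual booster setting as the cause, the relevant fields can be pulled out of the metadata with a few lines of Python. This is only a sketch: `summarize_metadata` is a hypothetical helper, and the metadata path in the comment is assumed from the paths mentioned in this report.

```python
import json


def summarize_metadata(text):
    """Extract the model class, Spark version and booster type from the metadata JSON."""
    meta = json.loads(text)
    params = meta.get("paramMap", {})
    return {
        "class": meta.get("class"),
        "sparkVersion": meta.get("sparkVersion"),
        "booster": params.get("booster"),
    }


# Example call (path is an assumption, not executed here):
# with open("/dbfs/tmp/myXgboostModel/metadata/part-00000") as f:
#     print(summarize_metadata(f.read()))
```

For the metadata shown above, the booster field is gbtree, so the "Unknown gbm type" error does not appear to come from an exotic booster choice.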
Attempting to load the model in Python with:
import xgboost as xgb
bst = xgb.Booster({'nthread':4})
bst.load_model("/dbfs/tmp/myXgboostModel/data")
results in
XGBoostError: [01:22:40] src/gbm/gbm.cc:20: Unknown gbm type
---------------------------------------------------------------------------
XGBoostError Traceback (most recent call last)
<ipython-input-10-b93cf7356f83> in <module>()
1 import xgboost as xgb
2 bst = xgb.Booster({'nthread':4})
----> 3 bst.load_model("/dbfs/tmp/myXgboostModel/data")
/usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/core.pyc in load_model(self, fname)
1005 if isinstance(fname, STRING_TYPES):
1006 # assume file name, cannot use os.path.exist to check, file can be from URL.
-> 1007 _check_call(_LIB.XGBoosterLoadModel(self.handle, c_str(fname)))
1008 else:
1009 buf = fname
/usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/core.pyc in _check_call(ret)
125 """
126 if ret != 0:
--> 127 raise XGBoostError(_LIB.XGBGetLastError())
128
129
XGBoostError: [01:22:40] src/gbm/gbm.cc:20: Unknown gbm type
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/libxgboost.so(_ZN7xgboost15GradientBooster6CreateERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorISt10shared_ptrINS_7DMatrixEESaISC_EEf+0x429) [0x7f8941d33ce9]
[bt] (1) /usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/libxgboost.so(_ZN7xgboost11LearnerImpl4LoadEPN4dmlc6StreamE+0x6d5) [0x7f8941bce9f5]
[bt] (2) /usr/local/lib/python2.7/dist-packages/xgboost-0.6-py2.7.egg/xgboost/libxgboost.so(XGBoosterLoadModel+0x28) [0x7f8941d364f8]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f895fd05e40]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f895fd058ab]
[bt] (5) /databricks/python/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f895ff153df]
[bt] (6) /databricks/python/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7f895ff19d82]
[bt] (7) /databricks/python/bin/python(PyObject_Call+0x43) [0x4b0de3]
[bt] (8) /databricks/python/bin/python(PyEval_EvalFrameEx+0x601f) [0x4c9b6f]
[bt] (9) /databricks/python/bin/python(PyEval_EvalCodeEx+0x255) [0x4c22e5]
The obvious question is whether the data file written this way has the same format as the model files the Python package normally loads. I can’t find any documentation on this. If the formats differ, what is the correct way to read a model saved in Hadoop format from Python?
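One guess, based only on how the file begins, is that the Spark wrapper writes a few strings in front of the raw booster bytes. The sketch below assumes those strings were written with Java's `DataOutputStream.writeUTF` (a 2-byte big-endian length followed by the encoded text). That format, the number of header fields, and the helper names `read_java_utf` and `split_spark_header` are all assumptions on my part and have not been verified against the xgboost4j-spark source.

```python
import struct


def read_java_utf(buf, offset):
    """Read one length-prefixed string; return (text, offset after the string).

    Assumes the writeUTF layout. Java's modified UTF-8 matches plain UTF-8
    for ASCII names such as column names, which is all this sketch handles.
    """
    (length,) = struct.unpack_from(">H", buf, offset)
    start = offset + 2
    end = start + length
    if end > len(buf):
        raise ValueError("header is truncated or not in the assumed format")
    return bytes(buf[start:end]).decode("utf-8"), end


def split_spark_header(buf, n_fields):
    """Split buf into (list of header strings, remaining bytes).

    n_fields is a guess supplied by the caller; it is not read from the file.
    """
    fields = []
    offset = 0
    for _ in range(n_fields):
        text, offset = read_java_utf(buf, offset)
        fields.append(text)
    return fields, buf[offset:]


# Possible use, untested against real xgboost4j-spark output:
# with open("/dbfs/tmp/myXgboostModel/data", "rb") as f:
#     fields, raw_model = split_spark_header(f.read(), n_fields=4)  # 4 is a guess
# bst.load_model(bytearray(raw_model))  # the traceback shows load_model has a buffer branch
```

If the remaining bytes still fail to load, then either the assumed layout or the field count is wrong, which would at least narrow down the answer to the question above.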
Environment information:
- OS: Debian Linux on AWS (Databricks Runtime 3.0 beta)
- Scala & Python built from master at ed8bc4521e2967d7c6290a4be5895c10327f021a
Python build instructions:
cd /databricks/driver
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
git checkout ed8bc4521e2967d7c6290a4be5895c10327f021a
make -j
cd python-package
sudo python setup.py install
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 21 (7 by maintainers)
@ssimeonov xgboostModel.booster.saveModel("/tmp/xgbm") succeeds. However, although Python's booster then loads the model successfully, the probability predicted by the Spark booster differs from the probability predicted by the Python booster, even on the same instance. Are you facing this issue as well?
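To make the mismatch easier to report, a small comparison like the sketch below could be run on predictions exported from both sides for the same rows. The helper name `compare_predictions` and the tolerance of 1e-6 are hypothetical choices, not anything defined by xgboost.

```python
def compare_predictions(spark_preds, python_preds, tol=1e-6):
    """Return (largest absolute difference, indices whose difference exceeds tol)."""
    if len(spark_preds) != len(python_preds):
        raise ValueError("prediction lists must have the same length")
    diffs = [abs(a - b) for a, b in zip(spark_preds, python_preds)]
    mismatches = [i for i, d in enumerate(diffs) if d > tol]
    max_diff = max(diffs) if diffs else 0.0
    return max_diff, mismatches


# Example with made-up numbers: the second row disagrees.
max_diff, bad_rows = compare_predictions([0.1, 0.9], [0.1, 0.5])
print(max_diff, bad_rows)
```

Posting the mismatching rows together with their feature values would make it possible to reproduce the difference.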