xgboost: an illegal memory access was encountered in GPU lossguide

@RAMitchell This happens with gpu_hist_experimental when max_depth is 0 and the number of leaves is 2^6.

[17:32:01] /home/jenkins/slave_dir_from_mr-0xc1/workspace/Pipeline_nonccl-cuda8_h2oai-DBHMFLWQV3M3EOCHHC2PD3CFRCEC3CG5T2F2WDY636RTNYJQOPAA/rabit/include/rabit/./internal/../../dmlc/./logging.h:300: [17:32:01] /home/jenkins/slave_dir_from_mr-0xc1/workspace/Pipeline_nonccl-cuda8_h2oai-DBHMFLWQV3M3EOCHHC2PD3CFRCEC3CG5T2F2WDY636RTNYJQOPAA/src/tree/updater_gpu_hist_experimental.cu:509: GPU plugin exception: /home/jenkins/slave_dir_from_mr-0xc1/workspace/Pipeline_nonccl-cuda8_h2oai-DBHMFLWQV3M3EOCHHC2PD3CFRCEC3CG5T2F2WDY636RTNYJQOPAA/src/tree/../common/device_helpers.cuh(107): an illegal memory access was encountered


Stack trace returned 10 entries:
[bt] (0) /home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/xgboost/libxgboost.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fd7217a761c]
[bt] (1) /home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/xgboost/libxgboost.so(_ZN7xgboost4tree24GPUHistMakerExperimental6UpdateERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EEPNS_7DMatrixERKS2_IPNS_7RegTreeESaISD_EE+0x217) [0x7fd721a09817]
[bt] (2) /home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/xgboost/libxgboost.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EEPNS_7DMatrixEiPS2_ISt10unique_ptrINS_7RegTreeESt14default_deleteISD_EESaISG_EE+0x9ce) [0x7fd72186499e]
[bt] (3) /home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/xgboost/libxgboost.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPSt6vectorINS_6detail18bst_gpair_internalIfEESaIS7_EEPNS_11ObjFunctionE+0xbad) [0x7fd721865e2d]
[bt] (4) /home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/xgboost/libxgboost.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiPNS_7DMatrixE+0x361) [0x7fd721872be1]
[bt] (5) /home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/xgboost/libxgboost.so(XGBoosterUpdateOneIter+0x27) [0x7fd7217b1cf7]
[bt] (6) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fd789f25e40]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fd789f258ab]
[bt] (8) /home/jon/.pyenv/versions/3.6.1/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2cf) [0x7fd78a139c4f]
[bt] (9) /home/jon/.pyenv/versions/3.6.1/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x8c19) [0x7fd78a130c19]

gpu_hist_crash.zip

This fails on any recent gpu_hist_experimental lossguide code, and on the latest dmlc code with gpu_hist (that’s how the script is set up).
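
For readers without the attachment, here is a minimal sketch of the configuration described above. It is not the attached xgb.py, whose contents are not shown in this thread. It assumes the standard xgboost parameter names `tree_method`, `grow_policy`, `max_depth` and `max_leaves`. The helper name `lossguide_params` is hypothetical.

```python
# Hypothetical sketch only: the real xgb.py from the attachment is not reproduced here.
def lossguide_params(tree_method="gpu_hist", max_leaves=2 ** 6):
    """Build a parameter dict for loss-guided (leaf-wise) tree growth."""
    return {
        "tree_method": tree_method,  # "gpu_hist" on current master, per the report
        "grow_policy": "lossguide",  # split the leaf with the largest loss reduction
        "max_depth": 0,              # 0 means no depth limit
        "max_leaves": max_leaves,    # 2^6 leaves, as in the report
    }


params = lossguide_params()
print(params)

# With xgboost and a CUDA GPU available, training would look roughly like this
# (X and y are placeholder data, not the data from the attachment):
#   import xgboost as xgb
#   dtrain = xgb.DMatrix(X, label=y)
#   xgb.train(params, dtrain, num_boost_round=10)
```

The training call is left commented out so the sketch runs without a GPU. Whether this exact combination triggers the crash likely also depends on the data in the attachment.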

@RAMitchell The attached zip contains code and files to reproduce. In the xgboost version just prior, when the updater was still gpu_hist_experimental, it would show the above message. Now (using the head of dmlc master as of right now) it just gives:

jon@mr-dl10:~/h2oai/tmp/371c0b_20171128172953_43053$ python xgb.py
Segmentation fault (core dumped)

Actually, I ran it yet again; it locked up my system for 10 seconds and then printed:

jon@mr-dl10:~/h2oai/tmp/371c0b_20171128172953_43053$ python xgb.py
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  /home/jon/xgboost.dmlc/src/tree/updater_gpu_hist.cu(436): an illegal memory access was encountered
Aborted (core dumped)

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 25 (8 by maintainers)

Most upvoted comments

Any updates? I am getting the following error with both gpu_exact and gpu_hist:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  xgboost/src/predictor/…/common/device_helpers.cuh(79): unknown error
Aborted (core dumped)

Re-opening the issue, to see if there’s something that can be done.