xgboost: cudaErrorIllegalAddress: an illegal memory access was encountered
Hello. While using XGBoost, I have been encountering persistent errors.
File "/home/siwon/.local/lib/python3.10/site-packages/xgboost/core.py", line 271, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [20:09:18] ../src/tree/updater_gpu_hist.cu:799: Exception in gpu_hist: [20:09:18] ../src/c_api/../data/../common/common.h:46: ../src/tree/gpu_hist/row_partitioner.cuh: 295: cudaErrorIllegalAddress: an illegal memory access was encountered
Stack trace:
[bt] (0) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x738dca) [0x7f68a1d1fdca]
[bt] (1) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x73ca89) [0x7f68a1d23a89]
[bt] (2) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xaf6920) [0x7f68a20dd920]
[bt] (3) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xaf74d1) [0x7f68a20de4d1]
[bt] (4) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xafbfe1) [0x7f68a20e2fe1]
[bt] (5) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xafcf66) [0x7f68a20e3f66]
[bt] (6) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x41b919) [0x7f68a1a02919]
[bt] (7) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x41c790) [0x7f68a1a03790]
[bt] (8) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x47ee17) [0x7f68a1a65e17]
Stack trace:
[bt] (0) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xadc86a) [0x7f68a20c386a]
[bt] (1) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0xafd1b4) [0x7f68a20e41b4]
[bt] (2) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x41b919) [0x7f68a1a02919]
[bt] (3) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x41c790) [0x7f68a1a03790]
[bt] (4) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(+0x47ee17) [0x7f68a1a65e17]
[bt] (5) /home/siwon/.local/lib/python3.10/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7f68a1735e50]
[bt] (6) /lib/x86_64-linux-gnu/libffi.so.8(+0x7e2e) [0x7f6908ad1e2e]
[bt] (7) /lib/x86_64-linux-gnu/libffi.so.8(+0x4493) [0x7f6908ace493]
[bt] (8) /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so(+0xa3e9) [0x7f690b3793e9]
terminate called after throwing an instance of 'thrust::system::system_error'
what(): device free failed: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted
I tested both the latest stable release, 1.7.3, and version 2.0.0, and the same error occurred in both (I moved to the newer version because a post in the issues suggested trying the nightly build). Training stops intermittently with this error, and when I try to resume, the same error recurs and training makes no further progress.
These are the hyperparameters I am using:
{'booster': 'gbtree', 'objective': 'binary:logistic', 'eval_metric': 'logloss', 'learning_rate': 0.2017833004505858, 'n_estimators': 825, 'max_depth': 10, 'subsample': 0.95, 'colsample_bytree': 0.895082641310972, 'gamma': 0, 'min_child_weight': 1, 'lambda': 0.0065881886741868895, 'alpha': 0.21095574595537914, 'device': 'gpu', 'tree_method': 'gpu_hist', 'scale_pos_weight': 13.479705048987812}
If anyone can offer advice on this error, it would be greatly appreciated.
About this issue
- State: open
- Created 10 months ago
- Comments: 21 (12 by maintainers)
Thank you for sharing! I think https://github.com/dmlc/xgboost/pull/9529 should be able to fix it. The PR will be part of 2.0.
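For anyone retrying on 2.0 with the parameter set from the question, here is a minimal sketch of how that dict could be adapted. It rests on two assumptions that are not stated in the thread: that training goes through the native `xgboost.train` API (where `n_estimators` is not a booster parameter and the round count is passed as `num_boost_round`), and that the 2.0 convention of `device="cuda"` with `tree_method="hist"` is used in place of `tree_method="gpu_hist"`. The helper name `to_native_params` is hypothetical.

```python
def to_native_params(sklearn_style, use_gpu=True, default_rounds=10):
    """Split a scikit-learn-style dict into (params, num_boost_round).

    Hypothetical helper for illustration; not part of XGBoost.
    """
    params = dict(sklearn_style)  # copy so the caller's dict is untouched
    # n_estimators belongs to the scikit-learn wrapper; the native API
    # takes the number of rounds as a separate argument instead.
    num_boost_round = params.pop("n_estimators", default_rounds)
    # Assumption: XGBoost 2.0 style, where the device is chosen with the
    # `device` parameter and the tree method is plain "hist".
    params["tree_method"] = "hist"
    params["device"] = "cuda" if use_gpu else "cpu"
    return params, num_boost_round


# Subset of the hyperparameters posted in the question.
posted = {
    "booster": "gbtree",
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "n_estimators": 825,
    "max_depth": 10,
    "subsample": 0.95,
    "device": "gpu",
    "tree_method": "gpu_hist",
}

gpu_params, rounds = to_native_params(posted)
cpu_params, _ = to_native_params(posted, use_gpu=False)

# With xgboost installed and a DMatrix named dtrain (not shown here):
# booster = xgboost.train(gpu_params, dtrain, num_boost_round=rounds)
```

Running the same data once with `cpu_params` may help tell whether the crash is specific to the GPU code path, though that is a diagnostic suggestion rather than something confirmed in this thread.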