xgboost: dask xgboost fit: munmap_chunk(): invalid pointer: 0x00007fa5380304b0
Using RAPIDS 0.14, conda, Ubuntu 16/18, CUDA 10.0, CUDA 11.1 driver, and dask/distributed 2.17, which matches RAPIDS 0.14.
After updating to the 1.3.0 master nightly, I’m hitting this with any dask fit. It’s pervasive, so I’ll probably have to go back to 1.2.1 for now unless there’s an easy fix. Once it happens, the worker is restarted and xgboost hangs.
It’s late here, so I’ll post a repro on the weekend if possible.
*** Error in `dask-worker [tcp://172.16.4.18:34761]': munmap_chunk(): invalid pointer: 0x00007fa5380304b0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777f5)[0x7fa6385ca7f5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x1a8)[0x7fa6385d76e8]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN2dh10AllReducer4InitEi+0xc7b)[0x7fa44fdb452b]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost4tree23GPUHistMakerSpecialisedINS_6detail20GradientPairInternalIdEEE6UpdateEPNS_16HostDeviceVectorINS3_IfEEEEPNS_7DMatrixERKSt6vectorIPNS_7RegTreeESaISE_EE+0x2e8)[0x7fa44ff3e568]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_7DMatrixEiPSt6vectorISt10unique_ptrINS_7RegTreeESt14default_deleteISC_EESaISF_EE+0x18e9)[0x7fa44fc2c559]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_20PredictionCacheEntryE+0x12d)[0x7fa44fc30e0d]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiSt10shared_ptrINS_7DMatrixEE+0x52b)[0x7fa44fc68c0b]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x58)[0x7fa44fb44648]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c)[0x7fa636db9630]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d)[0x7fa636db8fed]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce)[0x7fa636dcff9e]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x139d5)[0x7fa636dd09d5]
dask-worker [tcp://172.16.4.18:34761](_PyObject_FastCallDict+0x8b)[0x55e17dd7c00b]
dask-worker [tcp://172.16.4.18:34761](+0x1a179e)[0x55e17de0a79e]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x30a)[0x55e17de2d18a]
dask-worker [tcp://172.16.4.18:34761](+0x16f256)[0x55e17ddd8256]
dask-worker [tcp://172.16.4.18:34761](+0x170231)[0x55e17ddd9231]
dask-worker [tcp://172.16.4.18:34761](+0x1a1725)[0x55e17de0a725]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x30a)[0x55e17de2d18a]
dask-worker [tcp://172.16.4.18:34761](+0x16f256)[0x55e17ddd8256]
dask-worker [tcp://172.16.4.18:34761](+0x170231)[0x55e17ddd9231]
dask-worker [tcp://172.16.4.18:34761](+0x1a1725)[0x55e17de0a725]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x10c6)[0x55e17de2df46]
dask-worker [tcp://172.16.4.18:34761](PyEval_EvalCodeEx+0x329)[0x55e17dddd4f9]
dask-worker [tcp://172.16.4.18:34761](+0x175426)[0x55e17ddde426]
dask-worker [tcp://172.16.4.18:34761](PyObject_Call+0x3e)[0x55e17dd7bc2e]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x1ab6)[0x55e17de2e936]
dask-worker [tcp://172.16.4.18:34761](PyEval_EvalCodeEx+0x966)[0x55e17ddddb36]
dask-worker [tcp://172.16.4.18:34761](+0x175426)[0x55e17ddde426]
dask-worker [tcp://172.16.4.18:34761](PyObject_Call+0x3e)[0x55e17dd7bc2e]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x1ab6)[0x55e17de2e936]
dask-worker [tcp://172.16.4.18:34761](PyEval_EvalCodeEx+0x329)[0x55e17dddd4f9]
dask-worker [tcp://172.16.4.18:34761](+0x175426)[0x55e17ddde426]
dask-worker [tcp://172.16.4.18:34761](PyObject_Call+0x3e)[0x55e17dd7bc2e]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x1ab6)[0x55e17de2e936]
dask-worker [tcp://172.16.4.18:34761](+0x16fffb)[0x55e17ddd8ffb]
dask-worker [tcp://172.16.4.18:34761](+0x1a1725)[0x55e17de0a725]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x30a)[0x55e17de2d18a]
dask-worker [tcp://172.16.4.18:34761](PyEval_EvalCodeEx+0x329)[0x55e17dddd4f9]
dask-worker [tcp://172.16.4.18:34761](+0x175426)[0x55e17ddde426]
dask-worker [tcp://172.16.4.18:34761](PyObject_Call+0x3e)[0x55e17dd7bc2e]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x1ab6)[0x55e17de2e936]
dask-worker [tcp://172.16.4.18:34761](+0x16fffb)[0x55e17ddd8ffb]
dask-worker [tcp://172.16.4.18:34761](+0x1a1725)[0x55e17de0a725]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x30a)[0x55e17de2d18a]
dask-worker [tcp://172.16.4.18:34761](+0x16fffb)[0x55e17ddd8ffb]
dask-worker [tcp://172.16.4.18:34761](+0x1a1725)[0x55e17de0a725]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x30a)[0x55e17de2d18a]
dask-worker [tcp://172.16.4.18:34761](_PyFunction_FastCallDict+0x11b)[0x55e17ddd966b]
dask-worker [tcp://172.16.4.18:34761](_PyObject_FastCallDict+0x26f)[0x55e17dd7c1ef]
dask-worker [tcp://172.16.4.18:34761](_PyObject_Call_Prepend+0x63)[0x55e17dd80cf3]
dask-worker [tcp://172.16.4.18:34761](PyObject_Call+0x3e)[0x55e17dd7bc2e]
dask-worker [tcp://172.16.4.18:34761](+0x210c36)[0x55e17de79c36]
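For reading the trace above, a minimal Python sketch that pulls the mangled symbols out of the libxgboost.so frames may help. It assumes the glibc frame layout seen here, module(symbol+0xoffset)[0xaddress]; the names FRAME_RE, parse_frame and xgboost_symbols are made up for this illustration and are not part of xgboost or dask.

```python
import re

# Assumed glibc backtrace frame layout: module(symbol+0xOFFSET)[0xADDRESS].
# The symbol part may be empty, e.g. "dask-worker [...](+0x1a179e)[0x...]".
FRAME_RE = re.compile(
    r"^(?P<module>.+)\((?P<symbol>[^()+]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frame(line):
    """Return a dict with module, symbol, offset, address, or None if no match."""
    match = FRAME_RE.match(line.strip())
    return match.groupdict() if match else None


def xgboost_symbols(trace):
    """List the mangled symbols of frames that live in libxgboost.so, in order."""
    symbols = []
    for line in trace.splitlines():
        frame = parse_frame(line)
        if frame and frame["module"].endswith("libxgboost.so") and frame["symbol"]:
            symbols.append(frame["symbol"])
    return symbols


if __name__ == "__main__":
    # Two frames copied from the trace above, used only as sample input.
    sample = "\n".join([
        "/lib/x86_64-linux-gnu/libc.so.6(cfree+0x1a8)[0x7fa6385d76e8]",
        "/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/"
        "libxgboost.so(_ZN2dh10AllReducer4InitEi+0xc7b)[0x7fa44fdb452b]",
    ])
    for symbol in xgboost_symbols(sample):
        print(symbol)
```

Piping the printed symbols through c++filt demangles them. The top libxgboost.so frame, _ZN2dh10AllReducer4InitEi, is dh::AllReducer::Init(int), so the invalid free appears to happen while the all-reducer is being initialised from the GPU hist updater.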
FYI @trivialfis
About this issue
- State: closed
- Created 4 years ago
- Comments: 39 (39 by maintainers)
Odd: the issue was closed remotely although I only referenced it. Bad GitHub.