cudf: [BUG] Invalid memory access in dask cudf concat.

This has happened in our CI a few times, but I haven’t been able to reproduce it deterministically yet. Below is the error log copied from one of the failing Jenkins runs. I’m opening this issue to see whether someone else can reproduce it more deterministically. Sorry for the noise.

The cudf version in use is 0.18.

[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] terminate called after throwing an instance of 'thrust::system::system_error'
[2021-03-25T08:11:20.565Z]   what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
[2021-03-25T08:11:20.565Z] Fatal Python error: Aborted
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36acffd700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 300 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/queue.py", line 179 in get
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/threadpoolexecutor.py", line 51 in _worker
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36ad7fe700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 28 in poll
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 48 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/process.py", line 140 in join
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 233 in _watch_process
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36adfff700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 296 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/queue.py", line 170 in get
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 218 in _watch_message_queue
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36f0f89700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 28 in poll
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 48 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/process.py", line 140 in join
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 233 in _watch_process
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36f3ffb700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 296 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/queue.py", line 170 in get
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 218 in _watch_message_queue
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36f47fc700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 28 in poll
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 48 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/process.py", line 140 in join
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 233 in _watch_process
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36f57fe700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 296 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/queue.py", line 170 in get
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 218 in _watch_message_queue
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36f5fff700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 28 in poll
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/popen_fork.py", line 48 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/multiprocessing/process.py", line 140 in join
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 233 in _watch_process
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f36ff46b700 (most recent call first):
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 296 in wait
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/queue.py", line 170 in get
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/process.py", line 218 in _watch_message_queue
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.565Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.565Z] 
[2021-03-25T08:11:20.565Z] Thread 0x00007f3702ffd700 (most recent call first):
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 300 in wait
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/queue.py", line 179 in get
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/threadpoolexecutor.py", line 51 in _worker
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.566Z] 
[2021-03-25T08:11:20.566Z] Thread 0x00007f37037fe700 (most recent call first):
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/profile.py", line 269 in _watch
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.566Z] 
[2021-03-25T08:11:20.566Z] Thread 0x00007f3703fff700 (most recent call first):
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/selectors.py", line 468 in select
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/asyncio/base_events.py", line 1750 in _run_once
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/asyncio/base_events.py", line 541 in run_forever
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 199 in start
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/distributed/utils.py", line 428 in run_loop
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.566Z] 
[2021-03-25T08:11:20.566Z] Thread 0x00007f3803fff700 (most recent call first):
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/concurrent/futures/thread.py", line 78 in _worker
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 870 in run
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 926 in _bootstrap_inner
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/threading.py", line 890 in _bootstrap
[2021-03-25T08:11:20.566Z] 
[2021-03-25T08:11:20.566Z] Current thread 0x00007f3a53a72740 (most recent call first):
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/column/column.py", line 278 in _concat
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/series.py", line 1734 in _concat
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/cudf/core/reshape.py", line 378 in concat
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask_cudf/backends.py", line 210 in concat_cudf
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/dataframe/methods.py", line 422 in concat
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/dataframe/core.py", line 102 in _concat
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/dataframe/core.py", line 107 in finalize
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/base.py", line 566 in <listcomp>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/base.py", line 566 in compute
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/dask/base.py", line 283 in compute
[2021-03-25T08:11:20.566Z]   File "tests/python/test_with_dask.py", line 186 in run_boost_from_prediction
[2021-03-25T08:11:20.566Z]   File "/workspace/tests/python-gpu/test_gpu_with_dask.py", line 183 in test_boost_from_prediction
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/python.py", line 1641 in runtest
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 255 in <lambda>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 311 in from_call
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 255 in call_runtest_hook
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 215 in call_and_report
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 126 in runtestprotocol
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/main.py", line 323 in _main
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/main.py", line 269 in wrap_session
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/config/__init__.py", line 163 in main
[2021-03-25T08:11:20.566Z]   File "/opt/python/envs/gpu_test/lib/python3.7/site-packages/_pytest/config/__init__.py", line 185 in console_main

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 23 (16 by maintainers)

Most upvoted comments

I think it’s caused by xgboost setting the device to something other than 0 in another test.

Relevant piece:

[2021-03-31T19:09:52.917Z] ========= Invalid __global__ read of size 8
[2021-03-31T19:09:52.917Z] =========     at 0x00000460 in void cudf::detail::fused_concatenate_kernel<long, int=256, bool=0>(cudf::column_device_view const *, unsigned long const *, int, cudf::mutable_column_device_view, int*)
[2021-03-31T19:09:52.917Z] =========     by thread (19,0,0) in block (2,0,0)
[2021-03-31T19:09:52.917Z] =========     Address 0x7f471c0006f8 is out of bounds
[2021-03-31T19:09:52.917Z] =========     Device Frame:void cudf::detail::fused_concatenate_kernel<long, int=256, bool=0>(cudf::column_device_view const *, unsigned long const *, int, cudf::mutable_column_device_view, int*) (void cudf::detail::fused_concatenate_kernel<long, int=256, bool=0>(cudf::column_device_view const *, unsigned long const *, int, cudf::mutable_column_device_view, int*) : 0x460)
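
If the cause really is an earlier test leaving a non-zero device active, one possible mitigation on the test-suite side is to restore the device after each test. The sketch below is only an illustration of that idea, not what xgboost or cudf actually do: `restore_device` and `FakeRuntime` are hypothetical names, and the fake runtime stands in for real CUDA bindings so the example runs without a GPU.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def restore_device(
    get_device: Callable[[], int],
    set_device: Callable[[int], None],
) -> Iterator[int]:
    """Remember the active device ordinal and restore it on exit.

    Hypothetical helper: the getter and setter are passed in so that the
    same logic can wrap whichever CUDA binding the test suite uses.
    """
    saved = get_device()
    try:
        yield saved
    finally:
        # Only switch back if the wrapped code changed the device.
        if get_device() != saved:
            set_device(saved)


class FakeRuntime:
    """Stand-in for a CUDA runtime binding (assumption, for illustration)."""

    def __init__(self) -> None:
        self.device = 0

    def get_device(self) -> int:
        return self.device

    def set_device(self, ordinal: int) -> None:
        self.device = ordinal


runtime = FakeRuntime()
with restore_device(runtime.get_device, runtime.set_device):
    # Simulates a test that selects another device and never switches back.
    runtime.set_device(1)

print(runtime.device)  # prints 0: the original device is active again
```

In a real suite, the two callables could presumably be `cupy.cuda.runtime.getDevice` and `cupy.cuda.runtime.setDevice`, with the context manager wrapped in an autouse pytest fixture. Whether that would have avoided this particular crash is untested.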