gpu-bdb: [BUG] Memory View Error in distributed.merge_frames
[BUG] Memory View Error in distributed.protocol.utils.merge_frames
We seem to be hitting this memoryview error in distributed.protocol.utils.merge_frames:
IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fbd33549150>>, <Task finished coro=<Worker.gather_dep() done, defined at /home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/worker.py:1965> exception=TypeError("memoryview: a bytes-like object is required, not 'rmm._lib.device_buffer.DeviceBuffer'")>)
Traceback (most recent call last):
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
Full Trace
distributed.worker - INFO - Can't find dependencies for key ('getitem-merge-drop-duplicates-chunk-06145ed4cdda05382101cc39bb5f569f', 0, 55, 0)
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fbd33549150>>, <Task finished coro=<Worker.gather_dep() done, defined at /home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/worker.py:1965> exception=TypeError("memoryview: a bytes-like object is required, not 'rmm._lib.device_buffer.DeviceBuffer'")>)
Traceback (most recent call last):
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/worker.py", line 1983, in gather_dep
self.rpc, deps, worker, who=self.address
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/worker.py", line 3258, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/utils_comm.py", line 390, in retry_operation
operation=operation,
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/worker.py", line 3245, in _get_data
max_connections=max_connections,
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/core.py", line 644, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/comm/ucx.py", line 293, in read
allow_offload=self.allow_offload,
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/comm/utils.py", line 87, in from_frames
res = _from_frames()
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/comm/utils.py", line 66, in _from_frames
frames, deserialize=deserialize, deserializers=deserializers
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/protocol/core.py", line 129, in loads
fs = merge_frames(head, fs)
File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/protocol/utils.py", line 61, in merge_frames
frames = list(map(memoryview, frames))
Minimal Reproducer:
## Fresh Env (0.15.0a200720)
>>> import cudf
>>> from distributed.protocol import utils
>>> cudf.set_allocator(pool=True, initial_pool_size=1e+10)
>>> df_a = cudf.DataFrame()
>>> df_a['id'] = [0, 1, 2]
>>> header, frames = df_a.device_serialize()
>>> utils.merge_frames({'lengths':header['lengths']},frames)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/nvme/0/vjawa/conda/envs/july_20_rapids/lib/python3.7/site-packages/distributed/protocol/utils.py", line 61, in merge_frames
frames = list(map(memoryview, frames))
TypeError: memoryview: a bytes-like object is required, not 'Buffer'
Earlier Env (0.15.0a200716):
Output:
[<cudf.core.buffer.Buffer object at 0x7f53d547ee50>]
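For anyone without a GPU environment handy, the failure mode can be illustrated with plain Python. This is only a sketch: `FakeDeviceBuffer` and `wrap_frames` are hypothetical names made up for illustration, not part of cudf, rmm, or distributed. The one thing taken from the traceback is the expression `list(map(memoryview, frames))`, which raises as soon as a frame does not expose the Python buffer protocol.

```python
class FakeDeviceBuffer:
    """Hypothetical stand-in for a GPU buffer.

    It records a size but does not implement the buffer protocol,
    so memoryview() cannot wrap it.
    """

    def __init__(self, nbytes):
        self.nbytes = nbytes


def wrap_frames(frames):
    # Same expression as the failing line in the traceback.
    return list(map(memoryview, frames))


# Host frames (bytes, bytearray) are wrapped without a problem.
host_views = wrap_frames([b"abc", bytearray(b"de")])

# A frame without the buffer protocol triggers the TypeError.
try:
    wrap_frames([b"abc", FakeDeviceBuffer(24)])
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```

The printed message has the same shape as the one in the report, with the stand-in class name in place of 'Buffer' or 'DeviceBuffer'.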
ENV:
# packages in environment at /home/rgelhausen/conda/envs/rapids-tpcx-bb:
cudf 0.15.0a200720 py37_gce826c57c_2961 rapidsai-nightly
cuml 0.15.0a200720 cuda10.2_py37_gee3e8a539_1200 rapidsai-nightly
dask-cuda 0.15.0a200720 py37_76 rapidsai-nightly
dask-cudf 0.15.0a200720 py37_gce826c57c_2961 rapidsai-nightly
libcudf 0.15.0a200720 cuda10.2_gce826c57c_2961 rapidsai-nightly
libcuml 0.15.0a200720 cuda10.2_gee3e8a539_1200 rapidsai-nightly
libcumlprims 0.15.0a200622 cuda10.2_45 rapidsai-nightly
librmm 0.15.0a200720 cuda10.2_gb458cfc_361 rapidsai-nightly
rmm 0.15.0a200720 py37_gb458cfc_361 rapidsai-nightly
ucx 1.8.1+g6b29558 cuda10.2_0 rapidsai-nightly
ucx-proc 1.0.0 gpu rapidsai-nightly
ucx-py 0.15.0a200720+g6b29558 py37_142 rapidsai-nightly
(rapids-tpcx-bb) rgelhausen@rl-dgx2-d17-u16-rapids-dgx202:~/shared/tpcx-bb/tpcx_bb$ conda list | grep dask
dask 2.21.0+2.g56d0d1be pypi_0 pypi
dask-cuda 0.15.0a200720 py37_76 rapidsai-nightly
dask-cudf 0.15.0a200720 py37_gce826c57c_2961 rapidsai-nightly
About this issue
- State: closed
- Created 4 years ago
- Comments: 17 (17 by maintainers)
No, I don’t see it with stable.
Got it, thanks @jakirkham for explaining.
merge_frames is only designed for use with host frames currently. So if it encounters a device frame, there is a good chance it tries to move it to host, which would be slow.
Well, if it was using merge_frames before, it would have been extremely slow/inefficient. Anyway, my point is that we should be figuring out where merge_frames is getting involved and fix that (as opposed to reverting something).
Thanks, more details would be very helpful.
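To make the host-only point concrete, here is a rough sketch of a guard that reports which frames are not host frames before any merging happens. It is not the distributed implementation: `is_host_frame` and `merge_host_frames` are hypothetical helpers, the merge is a simplified copy-based re-chunking by `lengths`, and "host frame" is assumed to mean "anything `memoryview` accepts".

```python
def is_host_frame(frame):
    """Return True if `frame` exposes the Python buffer protocol."""
    try:
        memoryview(frame)
    except TypeError:
        return False
    return True


def merge_host_frames(lengths, frames):
    """Simplified stand-in: re-chunk host frames so they match `lengths`.

    Raises early, naming the offending types, if any frame is not a
    host frame, instead of failing inside memoryview().
    """
    bad = [type(f).__name__ for f in frames if not is_host_frame(f)]
    if bad:
        raise TypeError(f"non-host frames passed to host-only merge: {bad}")

    # Concatenate everything (this copies; fine for a sketch).
    data = b"".join(memoryview(f).tobytes() for f in frames)
    if sum(lengths) != len(data):
        raise ValueError("lengths do not add up to the total frame size")

    # Slice the concatenated bytes back out according to `lengths`.
    merged, offset = [], 0
    for n in lengths:
        merged.append(data[offset:offset + n])
        offset += n
    return merged


merged = merge_host_frames([3, 2], [b"ab", b"cde"])
print(merged)
```

A check like this at the call site would show which code path hands device frames to a host-only routine, which is the part that needs tracking down.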