gpu-bdb: [BUG] Memory View Error in distributed.merge_frames

We seem to be hitting this memoryview error in distributed.protocol.utils.merge_frames:

IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fbd33549150>>, <Task finished coro=<Worker.gather_dep() done, defined at /home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/worker.py:1965> exception=TypeError("memoryview: a bytes-like object is required, not 'rmm._lib.device_buffer.DeviceBuffer'")>)
Traceback (most recent call last):
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()

Full Trace


distributed.worker - INFO - Can't find dependencies for key ('getitem-merge-drop-duplicates-chunk-06145ed4cdda05382101cc39bb5f569f', 0, 55, 0)
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fbd33549150>>, <Task finished coro=<Worker.gather_dep() done, defined at /home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/worker.py:1965> exception=TypeError("memoryview: a bytes-like object is required, not 'rmm._lib.device_buffer.DeviceBuffer'")>)
Traceback (most recent call last):                                                                                                                                    
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback                                         
    ret = callback()                                                                                                                   
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result                                
    future.result()                                                                                                                                                   
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/worker.py", line 1983, in gather_dep
    self.rpc, deps, worker, who=self.address                                                                                                                          
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/worker.py", line 3258, in get_data_from_worker                             
    return await retry_operation(_get_data, operation="get_data_from_worker")                                                                                         
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/utils_comm.py", line 390, in retry_operation                               
    operation=operation,                                                                                                                 
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/utils_comm.py", line 370, in retry            
    return await coro()                                                                                                                
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/worker.py", line 3245, in _get_data         
    max_connections=max_connections,                                                                                                                                  
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/core.py", line 644, in send_recv              
    response = await comm.read(deserializers=deserializers)                                                                                                           
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/comm/ucx.py", line 293, in read                                            
    allow_offload=self.allow_offload,                                                                                             
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/comm/utils.py", line 87, in from_frames                                    
    res = _from_frames()                                                                                                                                              
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/comm/utils.py", line 66, in _from_frames                                   
    frames, deserialize=deserialize, deserializers=deserializers                                                                         
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/protocol/core.py", line 129, in loads                                      
    fs = merge_frames(head, fs)                                                                                                          
  File "/home/rgelhausen/conda/envs/rapids-tpcx-bb/lib/python3.7/site-packages/distributed/protocol/utils.py", line 61, in merge_frames
    frames = list(map(memoryview, frames))

Minimal Reproducer:

## Fresh Env (0.15.0a200720)
>>> import cudf
>>> from distributed.protocol import utils
>>> cudf.set_allocator(pool=True, initial_pool_size=1e+10)
>>> df_a = cudf.DataFrame()
>>> df_a['id'] = [0, 1, 2]
>>> header, frames = df_a.device_serialize()
>>> utils.merge_frames({'lengths':header['lengths']},frames)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nvme/0/vjawa/conda/envs/july_20_rapids/lib/python3.7/site-packages/distributed/protocol/utils.py", line 61, in merge_frames
    frames = list(map(memoryview, frames))
TypeError: memoryview: a bytes-like object is required, not 'Buffer'

Earlier Env (0.15.0a200716):

Output:
[<cudf.core.buffer.Buffer object at 0x7f53d547ee50>]

ENV:
# packages in environment at /home/rgelhausen/conda/envs/rapids-tpcx-bb:
cudf                      0.15.0a200720   py37_gce826c57c_2961    rapidsai-nightly
cuml                      0.15.0a200720   cuda10.2_py37_gee3e8a539_1200    rapidsai-nightly
dask-cuda                 0.15.0a200720           py37_76    rapidsai-nightly
dask-cudf                 0.15.0a200720   py37_gce826c57c_2961    rapidsai-nightly
libcudf                   0.15.0a200720   cuda10.2_gce826c57c_2961    rapidsai-nightly
libcuml                   0.15.0a200720   cuda10.2_gee3e8a539_1200    rapidsai-nightly
libcumlprims              0.15.0a200622       cuda10.2_45    rapidsai-nightly
librmm                    0.15.0a200720   cuda10.2_gb458cfc_361    rapidsai-nightly
rmm                       0.15.0a200720   py37_gb458cfc_361    rapidsai-nightly
ucx                       1.8.1+g6b29558       cuda10.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.15.0a200720+g6b29558        py37_142    rapidsai-nightly
(rapids-tpcx-bb) rgelhausen@rl-dgx2-d17-u16-rapids-dgx202:~/shared/tpcx-bb/tpcx_bb$ conda list | grep dask
dask                      2.21.0+2.g56d0d1be          pypi_0    pypi
dask-cuda                 0.15.0a200720           py37_76    rapidsai-nightly
dask-cudf                 0.15.0a200720   py37_gce826c57c_2961    rapidsai-nightly

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

No, I don’t see it with stable.

merge_frames is only designed for use with host frames currently. So if it encounters a device frame, there is a good chance it tries to move it to host, which would be slow.
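
To illustrate why a device frame trips the `memoryview` call shown in the traceback: `memoryview` only accepts objects that implement Python's buffer protocol, which host objects such as `bytes` and `bytearray` do. The sketch below is a minimal illustration, not code from distributed, rmm, or cudf. `FakeDeviceBuffer` and `is_host_frame` are hypothetical names introduced here so the example runs without a GPU.

```python
class FakeDeviceBuffer:
    """Hypothetical stand-in for a device-backed frame.

    It does not implement the buffer protocol, so memoryview() rejects it.
    This is NOT the rmm or cudf buffer type.
    """

    def __init__(self, nbytes):
        self.nbytes = nbytes


def is_host_frame(frame):
    """Return True if `frame` can be wrapped in a memoryview."""
    try:
        memoryview(frame)
    except TypeError:
        return False
    return True


frames = [b"abc", bytearray(b"defg"), FakeDeviceBuffer(16)]
host_flags = [is_host_frame(f) for f in frames]
print(host_flags)

# Mirrors the failing line from the traceback: mapping memoryview over
# the frames raises once the non-host frame is reached.
try:
    list(map(memoryview, frames))
    error_message = None
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

A check like this only tells host frames apart from everything else; it says nothing about how a device frame should be handled once it is found.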

Got it, thanks @jakirkham for explaining.

> merge_frames is only designed for use with host frames currently. So if it encounters a device frame, there is a good chance it tries to move it to host, which would be slow.

Well, if it was using merge_frames before, it would have been extremely slow/inefficient. Anyway, my point is that we should figure out where merge_frames is getting involved and fix that (as opposed to reverting something).
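
One way to find out where a merge function is being handed device frames is to wrap it so that every such call records the caller's stack. The sketch below is only an assumption about how one might debug this, not something taken from the thread. `record_non_host_calls`, `fake_merge_frames`, and `FakeDeviceFrame` are hypothetical names, and `fake_merge_frames` is a stand-in rather than distributed's implementation.

```python
import functools
import traceback


class FakeDeviceFrame:
    """Hypothetical device frame: no buffer protocol support."""


def _is_host(frame):
    """True if the frame can be wrapped in a memoryview."""
    try:
        memoryview(frame)
    except TypeError:
        return False
    return True


def record_non_host_calls(func, log):
    """Wrap func(header, frames); log the call stack for non-host frames."""

    @functools.wraps(func)
    def wrapper(header, frames):
        bad = [type(f).__name__ for f in frames if not _is_host(f)]
        if bad:
            # format_stack() captures who called the wrapper.
            log.append({"types": bad, "stack": traceback.format_stack()})
        return func(header, frames)

    return wrapper


def fake_merge_frames(header, frames):
    """Stand-in merge (hypothetical): concatenates host frames into one."""
    return [b"".join(bytes(memoryview(f)) for f in frames)]


calls = []
traced = record_non_host_calls(fake_merge_frames, calls)

# Host-only frames pass through and nothing is logged.
merged = traced({"lengths": [5]}, [b"he", b"llo"])

# A device frame is logged before the underlying TypeError propagates.
try:
    traced({"lengths": [8]}, [FakeDeviceFrame()])
except TypeError:
    pass

print(calls[0]["types"])
```

When wrapping a real function this way, keep in mind that rebinding an attribute on a module does not affect other modules that already imported the function by name, so the wrapper has to be installed where the call actually happens.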

Thanks, more details would be very helpful.