cudf: Mismatching __sizeof__ and alloc_size results

The results of __sizeof__() and alloc_size do not match for dask_cudf.core.DataFrame partitions. The following snippet reproduces the issue:

import cudf
import dask_cudf
from numba import cuda

rows = int(1e6)

free_before = cuda.current_context().get_memory_info()[0]
df = cudf.DataFrame([('A', [8] * rows), ('B', [32] * rows)])
free_after = cuda.current_context().get_memory_info()[0]
print("df size:          ", free_before - free_after)
print("df __sizeof__():  ", df.__sizeof__())
print("df alloc_size:    ", df.as_gpu_matrix().alloc_size)

free_before = cuda.current_context().get_memory_info()[0]
cdf = dask_cudf.from_cudf(df, npartitions=16)
free_after = cuda.current_context().get_memory_info()[0]
print("cdf size:         ", free_before - free_after)
print("cdf __sizeof__(): ", sum(p.compute().__sizeof__() for p in cdf.partitions))
print("cdf alloc_size:   ", sum(p.compute().as_gpu_matrix().alloc_size for p in cdf.partitions))

The results I get are:

df size:           25165824
df __sizeof__():   16000064
df alloc_size:     16000000
cdf size:          25165824
cdf __sizeof__():  32000000
cdf alloc_size:    16000000

As we can see, the summed partition __sizeof__() is double the summed alloc_size, and the latter matches the cudf.DataFrame's alloc_size, even though the device memory actually allocated (which includes allocation overhead) is larger for both the cudf and dask_cudf dataframes.
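
For reference, the raw column data alone accounts for the 16000000 figure. A back-of-the-envelope check (assuming both columns are stored as int64, the default for Python integer lists):

rows = int(1e6)
n_columns = 2
bytes_per_value = 8  # assuming int64 storage
print(rows * n_columns * bytes_per_value)      # 16000000, matches alloc_size
print(rows * n_columns * bytes_per_value * 2)  # 32000000, matches the summed partition __sizeof__()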

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 19 (14 by maintainers)

Most upvoted comments

I’m not tracking with how cuDF is wrapping Pandas/NumPy objects

I meant only if that happens. I don’t think it does, but I also don’t know cuDF in detail, so I’m just raising awareness in case it actually does happen.

or why lots of small cuDF objects would be problematic with regards to accurately determining the size of currently allocated memory.

It would be a problem if we ignore the object size and only take GPU memory into account. For example, if a cuDF object takes 1 kB of host memory but there are somehow 1M cuDF objects, that is 1 GB of host memory that goes unaccounted for.

Before the last few comments, I was thinking Dask should know about GPU+CPU memory usage, à la @sizeof.register(…), but now it sounds like we only want Dask to know about the GPU memory usage?

Dask (more specifically dask-cuda) has two LRU caches: one for device memory (which can be spilled to host once it reaches a certain size) and another for host memory (which can be spilled to disk). We therefore need to know GPU and CPU memory usage separately. In the case of cuDF, for example, we may be fine ignoring CPU memory usage for now (if it really is just a few hundred bytes) and only taking GPU memory usage into account, since that is where the bulk of the memory lies.

I believe Dask internally uses the sizeof method to understand how much memory Python objects are consuming and to determine if/when it should start spilling. I agree that reporting CPU/GPU memory through sizeof arbitrarily is not a good long-term approach, but in the short term Dask needs it to manage memory.
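
For context, Dask's size accounting goes through the sizeof dispatch in dask.sizeof, which falls back to sys.getsizeof (and hence __sizeof__()) when no handler is registered for a type. A minimal, purely illustrative sketch of an explicit registration (not cudf's actual code):

import cudf
from dask.sizeof import sizeof  # Dask's sizeof dispatch

@sizeof.register(cudf.DataFrame)
def sizeof_cudf_dataframe(df):
    # Whatever is returned here feeds dask-cuda's LRU caches and therefore
    # drives spilling decisions, so an inaccurate __sizeof__() skews spilling.
    return int(df.__sizeof__())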

The relevant code is here:

https://github.com/rapidsai/cudf/blob/0233e4b53205b65196e74889803d8e3a75d1893e/python/cudf/dataframe/dataframe.py#L296-L297

I’m moving this to the cudf issue tracker.

My first guess is that we’re not handling the index?
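
If the index does turn out to be the culprit, the fix would look roughly like the sketch below. The attribute names (_cols for the column dict, _index for the index) are assumptions and may not match cudf's actual internals:

def __sizeof__(self):
    # Sum the device buffers of all columns...
    column_bytes = sum(col.__sizeof__() for col in self._cols.values())
    # ...and include the index, which the current implementation may be leaving out.
    return column_bytes + self._index.__sizeof__()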