cudf: [BUG] `len` after `fillna` operation uses way more memory than expected

I have a DataFrame that occupies ~700-800 MB of GPU memory when persisted. When I fill all of the nulls in the DataFrame using fillna and then call len on the result, memory usage explodes.

Reproducer:

# Create a dataframe and write to file
import numpy as np
import pandas as pd
import dask.dataframe

pdf = pd.DataFrame()
for i in range(80):
    pdf[str(i)] = pd.Series([12, None] * 100000)
ddf = dask.dataframe.from_pandas(pdf, npartitions=1)
ddf.to_parquet('temp_data.parquet')

# Read the dataframe from file
import os
import dask
import dask_cudf
import cudf

path = 'temp_data.parquet/'
files = [fn for fn in os.listdir(path) if fn.endswith('.parquet')]
parts = [dask.delayed(cudf.io.parquet.read_parquet)(path=path + fn)
         for fn in files]

temp = dask_cudf.from_delayed(parts)
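
As an aside, the same dask_cudf DataFrame can probably be built more directly with dask_cudf.read_parquet instead of hand-wiring delayed cudf reads; a minimal sketch, assuming the same 'temp_data.parquet' directory written above:

import dask_cudf

# Read the whole parquet dataset in one call instead of listing the
# individual .parquet files manually.
temp = dask_cudf.read_parquet('temp_data.parquet')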

Now, when I run len(temp):

nvidia-smi shows memory usage peaking at the following:


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   47C    P0    28W /  70W |    685MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   33C    P8    10W /  70W |     10MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   32C    P8    10W /  70W |     10MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   32C    P8     9W /  70W |     10MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    404162      C   /conda/envs/rapids/bin/python                841MiB |
+-----------------------------------------------------------------------------+

Now for the fillna operation:

%%time
for col in temp.columns:
    temp[col] = temp[col].fillna(-1)

CPU times: user 35.6 s, sys: 1.26 s, total: 36.8 s
Wall time: 38.7 s

This is slow. There is also no change in memory usage, which leads me to believe the operation is only done at a metadata level and not on the complete data.

Finally, len(temp):

nvidia-smi usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   46C    P0    28W /  70W |  13681MiB / 15079MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   33C    P8    10W /  70W |     10MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   32C    P8     9W /  70W |     10MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   32C    P8     9W /  70W |     10MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    404162      C   /conda/envs/rapids/bin/python              13755MiB |
+-----------------------------------------------------------------------------+

That is more than a 16x spike in memory usage. I'm not sure whether my approach is wrong or whether there is some other underlying issue.
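
For reference, a DataFrame-level fillna might be a lighter-weight alternative to the per-column loop above; a sketch, assuming dask_cudf's DataFrame accepts a scalar fill value the way upstream dask.dataframe does:

# Fill every column in one call instead of 80 separate column assignments.
temp = temp.fillna(-1)

# The fillna above is lazy; len() is what triggers the actual computation.
print(len(temp))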

Environment info:

  • cudf: built from source at commit rapidsai/cudf@79af3a8806bbe01a
  • dask-cudf: built from source at commit 24798dd8cf9502

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 22 (13 by maintainers)

Most upvoted comments

In the current state of https://github.com/rapidsai/dask-cudf/pull/270 I’ve shaved off 14 seconds.

Great! I tried this approach out for this smaller example and everything worked as expected with no memory overhead.

If the performance is as expected and the original example in the issue is an inefficient way to do fillna, the issue can probably be closed.

If @quasiben and others feel there is some merit in digging a bit further to see why the performance degrades so drastically in this case, that would be great, especially if some bigger issue is uncovered with how things are being handled.

I think this uncovered a bigger issue around meta_nonempty using considerable memory unnecessarily.
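
For context, meta_nonempty is the machinery dask uses to build a tiny, non-empty stand-in DataFrame from a collection's metadata so it can probe operations without touching real data. A minimal sketch of what it produces, using the pandas dispatch in dask.dataframe.utils rather than the dask-cudf code path under discussion:

import pandas as pd
from dask.dataframe.utils import meta_nonempty

# An empty "meta" frame with 80 float64 columns, mirroring the reproducer.
meta = pd.DataFrame({str(i): pd.Series([], dtype='float64') for i in range(80)})

# meta_nonempty fills in a couple of rows of dummy values per column.
sample = meta_nonempty(meta)
print(sample.shape)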

I’ve never seen a more colorful GitHub issue.

Looking at this now