cudf: [BUG] Holding on to a RMM allocation exception makes cuDF go OOM

Describe the bug

When trying to allocate a too-large cuDF dataframe, RMM raises a std::bad_alloc exception as expected. However, when holding on to that exception (a MemoryError in Python), subsequent small dataframe allocations fail with the same std::bad_alloc exception. This issue relates to https://github.com/rapidsai/dask-cuda/issues/725

Steps/Code to reproduce bug

In the following code we try to allocate a small Series (create_single_cudf_ser()) after a big Series allocation has failed (create_big_cudf_ser()).

Notice:

  • If not using an RMM pool (pool_allocator=False), it works.
  • If we use rmm.DeviceBuffer instead of Series, it works.
  • If we delete the exception before the next small allocation, it works.
  • This could be an RMM bug, but I was not able to reproduce it using RMM alone.
import cudf
import rmm

# Setting `pool_allocator=False` makes it work
rmm.reinitialize(pool_allocator=True)

# create_big_cudf_ser -> Causes OOM on a worker
def create_big_cudf_ser():
    # return rmm.DeviceBuffer(size=30*2**30) # <= Makes it work
    n_rows = 300_000_000
    s_1 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_2 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_3 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_4 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_5 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_6 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_7 = cudf.Series([1], dtype="int64").repeat(n_rows)
    return len(s_1) + len(s_2) + len(s_3) + len(s_4) + len(s_5) + len(s_6) + len(s_7)


# create_single_cudf_ser -> should succeed on a worker
def create_single_cudf_ser():
    # return rmm.DeviceBuffer(size=2**30)  # <= Makes it work
    n_rows = 200_000_000
    series_1 = cudf.Series([1], dtype="int64").repeat(n_rows)
    return len(series_1)


print("create_single_cudf_ser()")
create_single_cudf_ser()
err = None
try:
    print("create_big_cudf_ser()")
    create_big_cudf_ser()
except MemoryError as e:
    err = e
    print(e)
# del err  # <= Makes it work
print("create_single_cudf_ser()")
create_single_cudf_ser()
print("FINISHED")
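A possible workaround sketch (not part of the original report): if the exception must be retained, keep only its message, or detach its __traceback__ before storing it, so the frames whose locals own device memory can be freed. The helper name hold_error_safely is hypothetical.

```python
import traceback

def hold_error_safely(exc):
    """Hypothetical helper: keep an exception without pinning its frames.

    The traceback attached to a caught exception references the stack
    frames where it was raised; those frames' locals can own large
    (device) allocations. Dropping the traceback releases them.
    """
    # Keep a human-readable description of the error...
    text = "".join(traceback.format_exception_only(type(exc), exc))
    # ...then detach the frame chain before storing the exception.
    exc.__traceback__ = None
    return exc, text

def failing():
    raise MemoryError("std::bad_alloc: simulated pool exhaustion")

err = None
try:
    failing()
except MemoryError as e:
    err, msg = hold_error_safely(e)

print(err.__traceback__ is None)  # the frames are no longer referenced
print(msg.strip())
```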

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

@shwina, @vyasr, @VibhuJawa, thanks for the investigation. I think I have a general solution to the issue: https://github.com/dask/distributed/pull/5338

@madsbk @shwina @randerzander @VibhuJawa just wanted to give interested parties an update since I haven’t been able to resolve this issue yet. Apologies in advance, this description will go a bit into the weeds but I think it’s helpful to document my results so far in case someone else has ideas on how to proceed beyond what I’m doing.

I was able to reproduce the bug from the script in the initial issue, here’s a slightly smaller MWE:

import cudf
import rmm

# Setting `pool_allocator=False` makes it work
rmm.reinitialize(pool_allocator=True)

def create_cudf_ser(n_rows, num):
    series_list = []
    for _ in range(num):
        # Works if I don't append to the list and instead create temporary
        # Series objects that are never saved
        series_list.append(cudf.Series([1], dtype="int64").repeat(n_rows))

print("Create one small series")
create_cudf_ser(200_000_000, 1)
err = None
try:
    print("Create two large series")
    create_cudf_ser(900_000_000, 2)  # Works if I instead call with (1_800_000_000, 1) or even larger
except MemoryError as e:
    err = e
    print(e)
# del err  # <= Makes it work
print("Create one small series")
create_cudf_ser(200_000_000, 1)
print("FINISHED")

The fact that deleting the exception made it work made me pretty suspicious of what was happening here, so I implemented __del__ with a print statement for the Series, Column, and Buffer classes, and I added corresponding print statements in __init__ to see what was happening (all print(..., flush=True) to ensure immediate output). Here’s the output:

Without the `del err`

Create one small series
Creating series  140620888999312
Creating buffer  140618239789520
Creating column  140618053628560
Creating buffer  140618241784848
Creating column  140618239693312
Creating buffer  140618239816080
Creating column  140618238428224
Creating buffer  140618239789584
Creating column  140618053513440
Deleting series  140620888999312
Deleting column  140618053628560
Deleting buffer  140618239789520
Deleting column  140618239693312
Deleting buffer  140618241784848
Deleting series  140618239789968
Deleting column  140618053513440
Deleting buffer  140618239789584
Deleting column  140618238428224
Deleting buffer  140618239816080
Create two large series
Creating series  140618241784848
Creating buffer  140618239789968
Creating column  140618238428224
Creating buffer  140618239816272
Creating column  140618053513440
Creating buffer  140618239790224
Creating column  140618239693312
Creating buffer  140618239789712
Creating column  140618053628560
Deleting series  140618241784848
Deleting column  140618238428224
Deleting buffer  140618239789968
Deleting column  140618053513440
Deleting buffer  140618239816272
Creating series  140618241784848
Creating buffer  140618239789968
Creating column  140618053513440
Creating buffer  140618239816080
Creating column  140618238428224
std::bad_alloc: RMM failure at:../include/rmm/mr/device/pool_memory_resource.hpp:183: Maximum pool size exceeded
Create one small series
Creating series  140618238506192
Creating buffer  140618053464528
Creating column  140618053579552
Creating buffer  140618053464656
Creating column  140618053565040
Traceback (most recent call last):
  File "test2.py", line 24, in <module>
    create_cudf_ser(200_000_000, 1)
  File "test2.py", line 10, in create_cudf_ser
    series_list.append(cudf.Series([1], dtype="int64").repeat(n_rows))
  File "/home/nfs/vyasr/local/rapids/cudf/python/cudf/cudf/core/frame.py", line 1744, in repeat
    return self._repeat(repeats)
  File "/home/nfs/vyasr/local/rapids/cudf/python/cudf/cudf/core/frame.py", line 1751, in _repeat
    *libcudf.filling.repeat(self, count)
  File "cudf/_lib/filling.pyx", line 58, in cudf._lib.filling.repeat
  File "cudf/_lib/filling.pyx", line 86, in cudf._lib.filling._repeat_via_size_type
MemoryError: std::bad_alloc: RMM failure at:../include/rmm/mr/device/pool_memory_resource.hpp:183: Maximum pool size exceeded
Deleting series  140618238506192
Deleting column  140618053579552
Deleting buffer  140618053464528
Deleting column  140618053565040
Deleting buffer  140618053464656
Deleting series  140618241784848
Deleting series  140618238505936
Deleting column  140618238428224
Deleting column  140618053513440
Deleting buffer  140618239816080
Deleting column  140618053628560
Deleting buffer  140618239789968
Deleting buffer  140618239789712
Deleting column  140618239693312
Deleting buffer  140618239790224

With the `del err`

Create one small series
Creating series  140689729279504
Creating buffer  140686900742160
Creating column  140686894084752
Creating buffer  140687118690896
Creating column  140686901871104
Creating buffer  140687118689104
Creating column  140686900679744
Creating buffer  140686900741904
Creating column  140686893969632
Deleting series  140689729279504
Deleting column  140686894084752
Deleting buffer  140686900742160
Deleting column  140686901871104
Deleting buffer  140687118690896
Deleting series  140687675218512
Deleting column  140686893969632
Deleting buffer  140686900741904
Deleting column  140686900679744
Deleting buffer  140687118689104
Create two large series
Creating series  140687125825360
Creating buffer  140689729279504
Creating column  140686900679744
Creating buffer  140689729279696
Creating column  140686893969632
Creating buffer  140686893859856
Creating column  140686901871104
Creating buffer  140687118689104
Creating column  140686894084752
Deleting series  140687125825360
Deleting column  140686900679744
Deleting buffer  140689729279504
Deleting column  140686893969632
Deleting buffer  140689729279696
Creating series  140687125825360
Creating buffer  140689729279504
Creating column  140686893969632
Creating buffer  140689729279440
Creating column  140686900679744
std::bad_alloc: RMM failure at:../include/rmm/mr/device/pool_memory_resource.hpp:183: Maximum pool size exceeded
Deleting series  140687125825360
Deleting column  140686893969632
Deleting buffer  140689729279504
Deleting column  140686900679744
Deleting buffer  140689729279440
Deleting series  140686900741904
Deleting column  140686894084752
Deleting buffer  140687118689104
Deleting column  140686901871104
Deleting buffer  140686893859856
Create one small series
Creating series  140686893859856
Creating buffer  140687118689680
Creating column  140686901871104
Creating buffer  140687118690896
Creating column  140686894084752
Creating buffer  140687125825360
Creating column  140686900679744
Creating buffer  140687118689104
Creating column  140686893969632
Deleting series  140686893859856
Deleting column  140686901871104
Deleting buffer  140687118689680
Deleting column  140686894084752
Deleting buffer  140687118690896
Deleting series  140689729279696
Deleting column  140686893969632
Deleting buffer  140687118689104
Deleting column  140686900679744
Deleting buffer  140687125825360
FINISHED

As you can see, the number of objects that are deleted after the initial (caught) OOM exception and before the subsequent call to create_cudf_ser is larger if we delete the stored exception. If we don’t delete the exception, those objects are only deleted at the very end after the uncaught MemoryError triggers a termination of the script and a full cleanup of the environment. The fact that deleting the error has this effect and some of the other corresponding fixes in the script above (not storing the series objects into local variables, making one larger object instead of multiple smaller ones) suggests a bug related to Python’s reference counting. Manually forcing garbage collection inside or outside the function call does not appear to fix the issue, indicating that the most likely culprit is that the MemoryError object is actively maintaining a reference to the function-local stack frame and preventing its destruction.
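This hypothesis, that the stored exception keeps the raising frames (and therefore their locals) alive, can be demonstrated in pure Python with no GPU involved. The following minimal sketch, assuming CPython's reference-counting semantics, uses a weakref as a stand-in for a device buffer:

```python
import weakref

class Buffer:
    """Stand-in for an object owning a large device allocation."""

def allocate_and_fail():
    buf = Buffer()            # a frame-local, like the Series in the MWE
    probe = weakref.ref(buf)  # lets us observe buf's lifetime from outside
    raise MemoryError(probe)  # smuggle the weakref out via the exception

err = None
try:
    allocate_and_fail()
except MemoryError as e:
    err = e

probe = err.args[0]
# The stored exception's __traceback__ references the frame of
# allocate_and_fail, whose locals still include `buf`:
alive_while_held = probe() is not None

del err  # dropping the exception releases the frame and its locals
alive_after_del = probe() is not None

print(alive_while_held, alive_after_del)  # True False
```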

It is possible that the problem is not in Python itself, but rather in Cython's handling of references when it propagates C++ exceptions to Python (in this case RMM's std::bad_alloc). I am not yet sure of the best way to distinguish these cases, since the changes required to make RMM fail silently instead of throwing could leave its internal memory pool in an invalid state. I've inspected the Cython-generated C++ code and I'm not seeing anything immediately wrong with the macro it uses to propagate exceptions, but I'm not terribly familiar with Python's C APIs, so it's possible that I'm missing something.

In summary, unless I'm missing something, this appears to be a problem with Python itself, not with cudf or rmm (please correct me if you see an alternative). Whether it originates in Python or Cython isn't clear; if I can't narrow it down by inspection, I may have to build up an MWE independent of RAPIDS libraries to confirm, so I don't have a good estimate of how much longer this will take to fix.

EDIT The example above actually fails whether pool_allocator is set to True or False; I forgot to test the final version for that, but @VibhuJawa pointed out that this error can still occur without using a pool. Conversely, I found that the exact same code will work if I redefine create_cudf_ser using a list comprehension:

def create_cudf_ser(n_rows, num):
    series_list = [cudf.Series([1], dtype="int64").repeat(n_rows) for _ in range(num)]

There is absolutely no reason that this should not be an equivalent function unless it’s an underlying bug in the Python interpreter (again, perhaps due to interactions with the Cython-generated C++ code).

Inspired by @vyasr's example, here is a pure CuPy example which has the same issue (and deleting err makes it work):

import cupy as cp

def create_cupy_aray(n_rows):
    s_1 = cp.asarray([1], dtype="int64").repeat(n_rows)
    s_2 = cp.asarray([1], dtype="int64").repeat(n_rows)
    s_3 = cp.asarray([1], dtype="int64").repeat(n_rows)
    s_4 = cp.asarray([1], dtype="int64").repeat(n_rows)
    s_5 = cp.asarray([1], dtype="int64").repeat(n_rows)
    

print("Create one small array")
create_cupy_aray(200_000_000)

err = None
try:
    print("Create large array")
    create_cupy_aray(900_000_000)  # Works if I instead call with 1_800_000_000 or even larger
except MemoryError as e:
    err = e
    print(e)

# del err  # <= Makes it work
print("Create one small array again")
create_cupy_aray(400_000_000)
print("FINISHED")

Create one small array
Create large array
Out of memory allocating 7,200,000,000 bytes (allocated so far: 28,800,000,512 bytes).
Create one small array again
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
/tmp/ipykernel_72472/3675052165.py in <module>
     22 # del err  # <= Makes it work
     23 print("Create one small array again")
---> 24 create_cupy_aray(400_000_000)
     25 print("FINISHED")

/tmp/ipykernel_72472/3675052165.py in create_cupy_aray(n_rows)
      3 def create_cupy_aray(n_rows):
      4     s_1 = cp.asarray([1], dtype="int64").repeat(n_rows)
----> 5     s_2 = cp.asarray([1], dtype="int64").repeat(n_rows)
      6     s_3 = cp.asarray([1], dtype="int64").repeat(n_rows)
      7     s_4 = cp.asarray([1], dtype="int64").repeat(n_rows)

cupy/_core/core.pyx in cupy._core.core.ndarray.repeat()

cupy/_core/core.pyx in cupy._core.core.ndarray.repeat()

cupy/_core/_routines_manipulation.pyx in cupy._core._routines_manipulation._ndarray_repeat()

cupy/_core/_routines_manipulation.pyx in cupy._core._routines_manipulation._repeat()

cupy/_core/core.pyx in cupy._core.core.ndarray.__init__()

cupy/cuda/memory.pyx in cupy.cuda.memory.alloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.MemoryPool.malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.MemoryPool.malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool.malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool._malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc()

OutOfMemoryError: Out of memory allocating 3,200,000,000 bytes (allocated so far: 32,000,000,512 bytes).

Anyone tried to repro in pure C++?

I have not, mostly because the evidence in my post above pretty clearly indicates that the problem is that the owning Python objects are simply not being destructed, so our C++ RAII design is moot. Moreover, I’m pretty confident that this is entirely unrelated to libcudf/librmm given some of the strange modifications to the Python script that allow the code to work, particularly when those should be exactly equivalent. One example is the list comprehension example above, where changing

def create_cudf_ser(n_rows, num):
    series_list = []
    for _ in range(num):
        series_list.append(cudf.Series([1], dtype="int64").repeat(n_rows))

to

def create_cudf_ser(n_rows, num):
    series_list = [cudf.Series([1], dtype="int64").repeat(n_rows) for _ in range(num)]

makes it work. Another is that (with some additional changes to the script) I observe that

def create_cudf_ser(n_rows):
    series_list = []
    for _ in range(5):
        series_list.append(cudf.Series([1], dtype="int64").repeat(n_rows))

works, while

def create_cudf_ser(n_rows):
    s_1 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_2 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_3 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_4 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_5 = cudf.Series([1], dtype="int64").repeat(n_rows)

fails. There’s absolutely no reason for those to be different unless Python is failing to destroy objects whose reference counts reach zero. It seems likely that particular Python bytecodes are not playing nice with Cython’s C++ exception forwarding mechanisms.
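One way to chase this down is to compare the bytecode of the two supposedly equivalent forms with the standard dis module. The sketch below mirrors the two shapes with plain objects, independent of cudf; the function names are illustrative:

```python
import dis

def with_loop(n):
    # Append into a frame-local list, as in the failing variant.
    items = []
    for _ in range(n):
        items.append(object())

def with_comprehension(n):
    # Build the list in a comprehension, as in the working variant.
    items = [object() for _ in range(n)]

ops_loop = [i.opname for i in dis.get_instructions(with_loop)]
ops_comp = [i.opname for i in dis.get_instructions(with_comprehension)]

# The two compile to different instruction sequences, so the
# interpreter's handling of the partially built list during stack
# unwinding can also differ.
print(ops_loop == ops_comp)  # False
```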

I can reproduce this as well. I can also confirm that my MWE fails irrespective of whether rmm is using a pool.

Thanks @VibhuJawa – I see the same (I had to uncomment the following line to see the error):

schedule_oom_tasks()  #### OOMS the worker

Thanks (edited the code too).
