cudf: [BUG] Holding on to a RMM allocation exception makes cuDF go OOM
Describe the bug
When we try to allocate a cuDF dataframe that is too large, RMM raises a std::bad_alloc exception as expected. However, if we hold on to that exception (a MemoryError in Python), subsequent small dataframe allocations fail with the same std::bad_alloc.
This issue relates to https://github.com/rapidsai/dask-cuda/issues/725
Steps/Code to reproduce bug
In the following code we try to allocate a small Series (create_single_cudf_ser()) after the allocation of a big Series has failed (create_big_cudf_ser()).
Notice:
- If not using an RMM pool (`pool_allocator=False`), it works.
- If we use `rmm.DeviceBuffer` instead of `Series`, it works.
- If we delete the exception before the next small allocation, it works.
- This could be an RMM bug, but I was not able to reproduce it using RMM alone.
import cudf
import rmm

# Setting `pool_allocator=False` makes it work
rmm.reinitialize(pool_allocator=True)

# create_big_cudf_ser -> Causes OOM on a worker
def create_big_cudf_ser():
    # return rmm.DeviceBuffer(size=30 * 2**30)  # <= Makes it work
    n_rows = 300_000_000
    s_1 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_2 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_3 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_4 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_5 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_6 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_7 = cudf.Series([1], dtype="int64").repeat(n_rows)
    return len(s_1) + len(s_2) + len(s_3) + len(s_4) + len(s_5) + len(s_6) + len(s_7)

# create_single_cudf_ser -> should succeed on a worker
def create_single_cudf_ser():
    # return rmm.DeviceBuffer(size=2**30)  # <= Makes it work
    n_rows = 200_000_000
    series_1 = cudf.Series([1], dtype="int64").repeat(n_rows)
    return len(series_1)

print("create_single_cudf_ser()")
create_single_cudf_ser()

err = None
try:
    print("create_big_cudf_ser()")
    create_big_cudf_ser()
except MemoryError as e:
    err = e
    print(e)
# del err  # <= Makes it work

print("create_single_cudf_ser()")
create_single_cudf_ser()
print("FINISHED")
About this issue
- State: closed
- Created 3 years ago
- Comments: 18 (18 by maintainers)
@shwina, @vyasr, @VibhuJawa, thanks for the investigation. I think I have a general solution to the issue: https://github.com/dask/distributed/pull/5338
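For context, one general way to hold on to an exception without pinning the stack frames referenced by its traceback is to clear those frames first. A minimal sketch of that pattern (my own illustration; not necessarily what the linked PR does):

import traceback

def allocate():
    big = bytearray(10**8)  # stand-in for a large device allocation
    raise MemoryError("simulated std::bad_alloc")

err = None
try:
    allocate()
except MemoryError as e:
    # Drop the locals held by the frames in the traceback (e.g. `big`)
    # so they can be freed, while keeping the exception object itself
    # around for later inspection.
    traceback.clear_frames(e.__traceback__)
    err = e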
@madsbk @shwina @randerzander @VibhuJawa just wanted to give interested parties an update since I haven’t been able to resolve this issue yet. Apologies in advance, this description will go a bit into the weeds but I think it’s helpful to document my results so far in case someone else has ideas on how to proceed beyond what I’m doing.
I was able to reproduce the bug from the script in the initial issue; here's a slightly smaller MWE:
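The MWE itself is not preserved in this copy of the thread. A hypothetical reconstruction, using the `create_cudf_ser` name that the rest of the discussion refers to:

import cudf
import rmm

# Per the EDIT below, the failure reportedly occurs with or without the pool.
rmm.reinitialize(pool_allocator=True)

def create_cudf_ser():
    n_rows = 300_000_000
    # Each Series is bound to a named local, so it stays referenced by
    # this frame until the frame itself is destroyed.
    s_1 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_2 = cudf.Series([1], dtype="int64").repeat(n_rows)
    s_3 = cudf.Series([1], dtype="int64").repeat(n_rows)
    return len(s_1) + len(s_2) + len(s_3)

err = None
try:
    create_cudf_ser()  # expected to OOM
except MemoryError as e:
    err = e
# del err  # <= makes the small allocation below work

cudf.Series([1], dtype="int64").repeat(200_000_000)  # fails while err is held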
The fact that deleting the exception made it work made me pretty suspicious of what was happening here, so I implemented `__del__` with a print statement for the Series, Column, and Buffer classes, and I added corresponding print statements in `__init__` to see what was happening (all `print(..., flush=True)` to ensure immediate output). Here's the output:

Without the `del err`:

With the `del err`:
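The instrumentation described above might look roughly like this (a hypothetical sketch; the original patch and its captured output are not preserved in this copy of the thread, and the module paths may vary across cudf versions):

import cudf

def _trace(cls):
    """Print on construction and destruction so object lifetimes are
    visible immediately (flush=True avoids buffering)."""
    original_init = cls.__init__

    def traced_init(self, *args, **kwargs):
        print(f"__init__ {cls.__name__} {id(self):#x}", flush=True)
        original_init(self, *args, **kwargs)

    def traced_del(self):
        print(f"__del__ {cls.__name__} {id(self):#x}", flush=True)

    cls.__init__ = traced_init
    cls.__del__ = traced_del

for cls in (cudf.Series, cudf.core.column.ColumnBase, cudf.core.buffer.Buffer):
    _trace(cls)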
As you can see, the number of objects that are deleted after the initial (caught) OOM exception and before the subsequent call to create_cudf_ser is larger if we delete the stored exception. If we don't delete the exception, those objects are only deleted at the very end, after the uncaught MemoryError triggers a termination of the script and a full cleanup of the environment. The fact that deleting the error has this effect, along with some of the other corresponding fixes in the script above (not storing the series objects into local variables, making one larger object instead of multiple smaller ones), suggests a bug related to Python's reference counting. Manually forcing garbage collection inside or outside the function call does not appear to fix the issue, indicating that the most likely culprit is that the MemoryError object is actively maintaining a reference to the function-local stack frame and preventing its destruction.

It is possible that the problem is not in Python itself, but rather in Cython's handling of references when it propagates C++ exceptions to Python (in this case RMM's std::bad_alloc), but I am not yet sure of the best way to distinguish those, since the changes required to make RMM fail silently instead of throw could leave its memory pool in an invalid internal state. I've inspected the Cython-generated C++ code and I'm not seeing anything immediately wrong with the macro it's using to propagate exceptions, but I'm not terribly familiar with Python's C APIs, so it's possible that I'm missing something.

In summary, unless I'm missing something, this appears to be a problem with Python itself, not with cudf or rmm (please correct me if you see an alternative). Whether this is coming from Python or Cython isn't clear, and I may have to build up an MWE independent of RAPIDS libraries to confirm that if I can't narrow this down by inspection, so I don't have a good idea of how much longer this will take to fix.
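To make the suspected mechanism concrete, here is a minimal, RAPIDS-free sketch (my own illustration, not from the thread) of how a held exception's traceback can pin a frame's locals:

import gc

class Tracked:
    """Stand-in for a large device allocation."""
    def __del__(self):
        print("Tracked deleted", flush=True)

def fail():
    payload = Tracked()  # frame-local, like s_1..s_7 in the repro
    raise MemoryError("simulated OOM")

err = None
try:
    fail()
except MemoryError as e:
    err = e  # the traceback on `e` references fail()'s frame

gc.collect()
print("still holding err", flush=True)  # "Tracked deleted" has not printed yet
del err  # releasing the exception releases the traceback, frame, and payload
print("after del err", flush=True)      # "Tracked deleted" printed just above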
EDIT: The example above actually fails whether pool_allocator is set to True or False; I forgot to test the final version for that, but @VibhuJawa pointed out that this error can still occur without using a pool. Conversely, I found that the exact same code will work if I redefine create_cudf_ser using a list comprehension. There is absolutely no reason that this should not be an equivalent function unless it's an underlying bug in the Python interpreter (again, perhaps due to interactions with the Cython-generated C++ code).
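The before/after snippets are not preserved in this copy of the thread; based on the surrounding description, the change replaces the named-locals body sketched above with a list comprehension, roughly:

import cudf

# Hypothetical list-comprehension rewrite of create_cudf_ser (see the
# sketch above); reportedly this version works even while err is held.
def create_cudf_ser():
    n_rows = 300_000_000
    series = [cudf.Series([1], dtype="int64").repeat(n_rows) for _ in range(3)]
    return sum(len(s) for s in series)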
Inspired by @vyasr's example, see a pure cupy example (a hypothetical reconstruction is sketched at the end of this exchange), which has the same issue (and deleting err makes it work).

I have not, mostly because the evidence in my post above pretty clearly indicates that the problem is that the owning Python objects are simply not being destructed, so our C++ RAII design is moot. Moreover, I'm pretty confident that this is entirely unrelated to libcudf/librmm given some of the strange modifications to the Python script that allow the code to work, particularly when those should be exactly equivalent. One example is the list comprehension example above, where changing the named-locals version to the comprehension version makes it work. Another is that (with some additional changes to the script) I observe that one spelling works while a semantically identical one fails. There's absolutely no reason for those to be different unless Python is failing to destroy objects whose reference counts reach zero. It seems likely that particular Python bytecodes are not playing nice with Cython's C++ exception forwarding mechanisms.
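Regarding the pure cupy reproducer mentioned above: the original snippet is not preserved here. A hypothetical reconstruction following the same hold-the-exception pattern (array sizes are illustrative):

import cupy

def create_big_arrays():
    # Intentionally larger than device memory.
    a_1 = cupy.ones(2**34, dtype="int64")
    a_2 = cupy.ones(2**34, dtype="int64")
    return a_1.size + a_2.size

err = None
try:
    create_big_arrays()
except MemoryError as e:  # CuPy's OutOfMemoryError is a MemoryError subclass
    err = e
# del err  # <= reportedly makes the small allocation below succeed

cupy.ones(2**20, dtype="int64")  # small allocation still fails while err is held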
I can reproduce this as well. I can also confirm that my MWE fails irrespective of whether or not rmm is using a pool.
Thanks (edited the code too).
Thanks @VibhuJawa – I see the same (I had to uncomment the following line to see the error):