cuml: [BUG] RMM-only context destroyed error with Random Forest in loop

It seems we may have an RMM-only memory leak with RandomForestRegressor. This could come up in a wide range of workloads, such as using RandomForestRegressor with RMM during hyper-parameter optimization.

In the following example:

  • Without an RMM pool, repeatedly fitting the model, predicting, and deleting the model/predictions results in a peak memory usage of about 1.2 GB
  • With an RMM pool, the same loop causes memory usage to grow without bound. This can be triggered by uncommenting the RMM-related lines in the script below (a direct-RMM variant of that setup is sketched after the script). After 15-17 iterations, the entire 5 GB pool is exhausted.

Is it possible there is a place where RMM isn’t getting visibility of a call to free memory?

import cudf
import cuml
import rmm
import cupy as cp
from dask.utils import parse_bytes
from sklearn.datasets import make_regression

# cudf.set_allocator(pool=True, initial_pool_size=parse_bytes("5GB"))
# cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

NFEATURES = 20

X, y = make_regression(
    n_samples=10000,
    n_features=NFEATURES,
    random_state=12,
)

X = X.astype("float32")
X = cp.asarray(X)
y = cp.asarray(y)

for i in range(30):
    print(i)
    clf = cuml.ensemble.RandomForestRegressor(n_estimators=50)
    clf.fit(X, y)
    preds = clf.predict(X)
    del clf, preds
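
For reference, the same pool setup can also be expressed directly through RMM rather than through cudf.set_allocator; a minimal sketch, assuming the 0.15-era rmm.reinitialize and rmm_cupy_allocator APIs that appear elsewhere in this thread:

import rmm
import cupy as cp
from dask.utils import parse_bytes

# Carve out a 5 GB RMM pool and route CuPy allocations through it,
# mirroring the two commented-out lines in the repro above.
rmm.reinitialize(pool_allocator=True, initial_pool_size=parse_bytes("5GB"))
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)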

Environment: 2020-07-31 nightly at ~ 9AM EDT

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 40 (40 by maintainers)

Most upvoted comments

Fixed via PR 510 to the RMM repo.

Thanks for the repro @JohnZed. I was able to simplify it even further. This repro will actually segfault.

// From RMM's pool_memory_resource tests; Pool is presumably an alias for
// rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>.
TEST(PoolTest, TwoStreams)
{
  Pool mr{rmm::mr::get_current_device_resource(), 0};
  cudaStream_t stream;
  const int size = 10000;
  cudaStreamCreate(&stream);
  // Allocate a block from the pool on `stream`, then destroy the stream.
  EXPECT_NO_THROW(rmm::device_buffer buff(size, stream, &mr));
  cudaStreamDestroy(stream);
  // The next allocation tries to reclaim that block, which synchronizes the
  // already-destroyed stream and can segfault.
  mr.allocate(size);
}

As you identified, when we try to reclaim a block from another stream, we attempt to synchronize a stream that was already destroyed. Unfortunately, this isn’t guaranteed to return cudaErrorInvalidResourceHandle and can actually segfault.
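
A rough Python-level probe of the same hazard, assuming CuPy's thin CUDA runtime wrappers (streamCreate / streamDestroy / streamSynchronize); since synchronizing a destroyed stream is not well defined, this may raise, appear to succeed, or crash outright:

import cupy as cp

# Create a raw cudaStream_t handle, then destroy it so the handle dangles.
stream = cp.cuda.runtime.streamCreate()
cp.cuda.runtime.streamDestroy(stream)

# Synchronizing the dangling handle is what the pool ends up doing when it
# reclaims a block recorded against a destroyed stream.
try:
    cp.cuda.runtime.streamSynchronize(stream)
    print("synchronize on destroyed stream reported success")
except cp.cuda.runtime.CUDARuntimeError as err:
    print("synchronize on destroyed stream raised:", err)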

Passing == works perfectly fine; no memory leak that I can see.

With two streams, I’m not seeing failures consistently at the same place. Sometimes I get to the 2nd iteration, sometimes the 3rd iteration before the failure.

EDIT: You’re faster 😃

Thanks @harrism. I do see allocs without frees, but only if I include the model.predict call. Regardless of whether I use just fit or both fit and predict, the context appears to be destroyed. It also appears to occur only inside a Python loop, as Saloni noted above.

So far, I’ve tested KNN (Reg/Clf), Random Forest (Reg/Clf) and Logistic Regression with the following script. Only Random Forest appears to have this issue.

# to run: python rmm-model-logger.py rfr-logs.txt
import sys

import cudf
import cuml
import rmm
import numpy as np


logfilename = sys.argv[1]

# swap estimator class here
clf = cuml.ensemble.RandomForestClassifier

rmm.reinitialize(
    pool_allocator=True,
    managed_memory=False,
    initial_pool_size=2e9,
    logging=True,
    devices=0,
    log_file_name=logfilename,
)

X = cudf.DataFrame({"a": range(10), "b": range(10,20)}).astype("float32")
y = cudf.Series(np.random.choice([0, 1], 10))

for i in range(30):
    print(i)
    model = clf()
    model.fit(X, y)
    preds = model.predict(X)
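
If you swap in the regressor, float targets are a more natural fit than the 0/1 labels above; an illustrative tweak (values are placeholders):

# Hypothetical regressor variant of the same script: swap the estimator
# and use float32 targets instead of class labels.
clf = cuml.ensemble.RandomForestRegressor
y = cudf.Series(np.random.random(10), dtype="float32")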

Logs:

import pandas as pd

df = pd.read_csv("rfc-logs.dev0.txt")
print(df.Action.value_counts())

allocate    211
free        201
Name: Action, dtype: int64

df = pd.read_csv("rfr-logs.dev0.txt")
print(df.Action.value_counts())

allocate    204
free        189
Name: Action, dtype: int64

Attachments: rfr-logs.dev0.txt, rfc-logs.dev0.txt

Interestingly, if I run the script but comment out the preds = model.predict(X) line, I still get the destroyed context but the allocs match the frees.

import pandas as pd

df = pd.read_csv("rfc-fit-only-logs.dev0.txt")
print(df.Action.value_counts())

df = pd.read_csv("rfr-fit-only-logs.dev0.txt")
print(df.Action.value_counts())

free        185
allocate    185
Name: Action, dtype: int64

free        206
allocate    206
Name: Action, dtype: int64

Attachments: rfc-fit-only-logs.dev0.txt, rfr-fit-only-logs.dev0.txt

Full traceback:

python rmm-model-logger.py rfr-fit-only-logs.txt
0
1
Traceback (most recent call last):
  File "rmm-model-logger.py", line 30, in <module>
    model.fit(X, y)
  File "cuml/ensemble/randomforestregressor.pyx", line 393, in cuml.ensemble.randomforestregressor.RandomForestRegressor.fit
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20200817/lib/python3.7/site-packages/cuml/common/memory_utils.py", line 56, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "cuml/ensemble/randomforest_common.pyx", line 251, in cuml.ensemble.randomforest_common.BaseRandomForestModel._dataset_setup_for_fit
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20200817/lib/python3.7/site-packages/cuml/common/memory_utils.py", line 56, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20200817/lib/python3.7/site-packages/cuml/common/input_utils.py", line 188, in input_to_cuml_array
    X = convert_dtype(X, to_dtype=convert_to_dtype)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20200817/lib/python3.7/site-packages/cuml/common/memory_utils.py", line 56, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20200817/lib/python3.7/site-packages/cuml/common/input_utils.py", line 459, in convert_dtype
    would_lose_info = _typecast_will_lose_information(X, to_dtype)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20200817/lib/python3.7/site-packages/cuml/common/input_utils.py", line 504, in _typecast_will_lose_information
    (X < target_dtype_range.min) |
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20200817/lib/python3.7/site-packages/cudf/core/series.py", line 1537, in __lt__
    return self._binaryop(other, "lt")
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20200817/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20200817/lib/python3.7/site-packages/cudf/core/series.py", line 1083, in _binaryop
    outcol = lhs._column.binary_operator(fn, rhs, reflect=reflect)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20200817/lib/python3.7/site-packages/cudf/core/column/numerical.py", line 100, in binary_operator
    lhs=self, rhs=rhs, op=binop, out_dtype=out_dtype, reflect=reflect
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20200817/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20200817/lib/python3.7/site-packages/cudf/core/column/numerical.py", line 472, in _numeric_column_binop
    out = libcudf.binaryop.binaryop(lhs, rhs, op, out_dtype)
  File "cudf/_lib/binaryop.pyx", line 200, in cudf._lib.binaryop.binaryop
  File "cudf/_lib/scalar.pyx", line 361, in cudf._lib.scalar.as_scalar
  File "cudf/_lib/scalar.pyx", line 81, in cudf._lib.scalar.Scalar.__init__
  File "cudf/_lib/scalar.pyx", line 174, in cudf._lib.scalar._set_numeric_from_np_scalar
RuntimeError: CUDA error at: ../include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp365: cudaErrorContextIsDestroyed context is destroyed

Environment:

conda list | grep "rmm\|cudf\|cuml\|numba\|cupy\|rapids"
# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20200817:
cudf                      0.15.0a200817   py37_g1778921b0_4666    rapidsai-nightly
cuml                      0.15.0a200817   cuda10.2_py37_g1e5b7d348_1979    rapidsai-nightly
cupy                      7.7.0            py37h940342b_0    conda-forge
dask-cuda                 0.15.0a200817          py37_117    rapidsai-nightly
dask-cudf                 0.15.0a200817   py37_g1778921b0_4666    rapidsai-nightly
faiss-proc                1.0.0                      cuda    rapidsai-nightly
libcudf                   0.15.0a200817   cuda10.2_g1778921b0_4666    rapidsai-nightly
libcuml                   0.15.0a200817   cuda10.2_g1e5b7d348_1979    rapidsai-nightly
libcumlprims              0.15.0a200812       cuda10.2_61    rapidsai-nightly
librmm                    0.15.0a200817   cuda10.2_g17efc89_665    rapidsai-nightly
numba                     0.50.1           py37h0da4684_1    conda-forge
rmm                       0.15.0a200817   py37_g17efc89_665    rapidsai-nightly
ucx                       1.8.1+g6b29558       cuda10.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.15.0a200817+g6b29558        py37_203    rapidsai-nightly

cc @jakirkham @Salonijain27

Can you guys turn on logging and share the logs?

You’ll want to look for allocs without frees in the logs.
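
A sketch of one way to do that with pandas, assuming the log CSV has an Action column (as in the logs above) plus a Pointer column identifying each allocation; check the header of your log file, since column names can vary across RMM versions:

import pandas as pd

# Count allocate/free events per pointer; pointers with more allocates than
# frees are candidates for memory RMM never saw released.
df = pd.read_csv("rfr-logs.dev0.txt")

allocs = df[df.Action == "allocate"].groupby("Pointer").size()
frees = df[df.Action == "free"].groupby("Pointer").size()

leaked = allocs.subtract(frees, fill_value=0)
print(leaked[leaked > 0])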