cuml: [BUG] RMM-only context destroyed error with Random Forest in loop
It seems we may have an RMM-only memory leak with RandomForestRegressor. This could come up in a wide range of workloads, such as using RandomForestRegressor with RMM during hyper-parameter optimization.
In the following example:
- Without an RMM pool, repeatedly fitting the model, predicting, and deleting the model/predictions results in a peak memory usage of 1.2 GB
- With an RMM pool, repeatedly fitting the model, predicting, and deleting the model/predictions causes memory to grow uncontrollably. This can be triggered by uncommenting the RMM-related lines in the script below. After 15-17 iterations, we exhaust the entire 5 GB pool.
Is it possible there is a place where RMM isn’t getting visibility of a call to free memory?
```python
import cudf
import cuml
import rmm
import cupy as cp

from dask.utils import parse_bytes
from sklearn.datasets import make_regression

# cudf.set_allocator(pool=True, initial_pool_size=parse_bytes("5GB"))
# cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

NFEATURES = 20

X, y = make_regression(
    n_samples=10000,
    n_features=NFEATURES,
    random_state=12,
)
X = X.astype("float32")
X = cp.asarray(X)
y = cp.asarray(y)

for i in range(30):
    print(i)
    clf = cuml.ensemble.RandomForestRegressor(n_estimators=50)
    clf.fit(X, y)
    preds = clf.predict(X)
    del clf, preds
```
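(Not part of the original repro.) A minimal way to watch device memory across iterations, using only CuPy's runtime bindings, is to query free memory at the top of the loop. Note that with the RMM pool enabled the numbers mostly reflect what the pool has reserved from the driver, so the RMM allocation logs discussed further down are the better tool for spotting unmatched allocations.

```python
import cupy as cp

def report_free_memory(tag):
    # cudaMemGetInfo: returns (free_bytes, total_bytes) for the current device.
    free, total = cp.cuda.runtime.memGetInfo()
    print(f"{tag}: {free / 2**30:.2f} GiB free / {total / 2**30:.2f} GiB total")

# e.g. call report_free_memory(f"iter {i}") at the top of the loop above
```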
Environment: 2020-07-31 nightly at ~ 9AM EDT
About this issue
- State: closed
- Created 4 years ago
- Comments: 40 (40 by maintainers)
Commits related to this issue
- Patch for nightly test&bench (#4840) - Fix for MNMG TSVD (similar issue to [cudaErrorContextIsDestroyed in RandomForest](https://github.com/rapidsai/cuml/issues/2632#issuecomment-675753377)) - #4826... — committed to rapidsai/cuml by viclafargue 2 years ago
Fixed via PR 510 to the RMM repo.
Thanks for the repro @JohnZed. I was able to simplify it even further. This repro will actually segfault.
As you identified, when we try to reclaim a block from another stream, we attempt to synchronize a stream that was already destroyed. Unfortunately this isn't guaranteed to return a `cudaErrorInvalidResourceHandle` and can actually segfault.
Passing `==` works perfectly fine; no memory leak that I can see.
With two streams, I’m not seeing failures consistently at the same place. Sometimes I get to the 2nd iteration, sometimes the 3rd iteration before the failure.
EDIT: You’re faster 😃
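To make the failure mode above concrete, here is a hypothetical sketch (not the maintainer's actual simplified repro) of the two-stream pattern being described: a block is allocated and freed on a non-default CuPy stream that is then destroyed, and a later allocation can force the pool to reclaim that block, which involves synchronizing the now-dead stream. The allocator hookup mirrors the commented-out lines in the original script; the pool and allocation sizes are arbitrary.

```python
import cupy as cp
import rmm

# Small RMM pool, with CuPy routed through it (same hookup as the repro above).
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**28)  # 256 MiB
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

def churn_on_temporary_stream():
    # Allocate from the pool on a non-default stream, free the block while it
    # is still associated with that stream, then let the stream be destroyed.
    stream = cp.cuda.Stream(non_blocking=True)
    with stream:
        a = cp.zeros(2**26, dtype="float32")  # 256 MiB block from the pool
        del a  # returned to the pool's free list for `stream`
    # the underlying CUDA stream is destroyed when `stream` goes out of scope

churn_on_temporary_stream()

# A later allocation on the default stream may force the pool to reuse the
# block freed on the destroyed stream; reclaiming it is where the
# synchronize-a-destroyed-stream problem was hit.
b = cp.zeros(2**26, dtype="float32")
```

Whether this actually reproduces the segfault presumably depends on the RMM version in use; the fix referenced above (RMM PR 510) addressed this reclaim path.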
Thanks @harrism. I do see allocs without frees, but only if I include the `model.predict`. Regardless of whether I just use `fit` or use both `fit` and `predict`, the context appears to be destroyed. It also appears to only occur inside a Python loop, as Saloni noted above.
So far, I've tested KNN (Reg/Clf), Random Forest (Reg/Clf), and Logistic Regression with the following script. Only Random Forest appears to have this issue.
Logs:
rfr-logs.dev0.txt rfc-logs.dev0.txt
Interestingly, if I run the script but comment out the `preds = model.predict(X)` line, I still get the destroyed context, but the allocs match the frees.
Logs: rfc-fit-only-logs.dev0.txt rfr-fit-only-logs.dev0.txt
Full traceback:
Environment:
cc @jakirkham @Salonijain27
Can you guys turn on logging and share the logs?
You’ll want to look for allocs without frees in the logs.
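A sketch of how that can be enabled from Python, assuming the `rmm.reinitialize` keyword names from the RMM releases of that era (`logging`, `log_file_name`); RMM appends a per-device suffix to the log file name, which is where the `dev0` in the attached log names comes from:

```python
import rmm

# Re-create the default memory resource with the pool plus per-allocation
# logging; every alloc/free event is written to the log file, so unmatched
# allocations can be found by pairing the two event types.
rmm.reinitialize(
    pool_allocator=True,
    initial_pool_size=5 * 2**30,   # 5 GiB pool, as in the repro
    logging=True,
    log_file_name="rmm-log.txt",   # hypothetical file name
)
```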