cuml: [BUG] SIGABRT in cuML RF, out-of-bounds memory access

Describe the bug

SIGABRT, seemingly from an out-of-bounds memory access.

Steps/Code to reproduce bug

Unknown, but the input was just the Kaggle Paribas data with various frequency-encoding features added, bringing it to shape (91457, 331).

Parameters:

 OrderedDict([('output_type', 'numpy'), ('random_state', 840607124), ('verbose', False), ('n_estimators', 200), ('n_bins', 128), ('split_criterion', 1), ('max_depth', 18), ('max_leaves', 1024), ('max_features', 'auto'), ('min_samples_leaf', 1), ('min_samples_split', 10), ('min_impurity_decrease', 0.0)])

For a binary classification problem.
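
For reference, here is a minimal sketch of how these parameters map onto cuml.ensemble.RandomForestClassifier. The synthetic data (via scikit-learn's make_classification) is only a stand-in for the frequency-encoded Kaggle Paribas features at the reported shape, so it may well not reproduce the crash:

    import numpy as np
    from sklearn.datasets import make_classification
    from cuml.ensemble import RandomForestClassifier

    # Stand-in data with the reported shape (91457, 331); the actual
    # frequency-encoded Kaggle Paribas features are not available here,
    # so this sketch may not trigger the crash.
    X, y = make_classification(n_samples=91457, n_features=331, random_state=0)
    X = X.astype(np.float32)
    y = y.astype(np.int32)

    clf = RandomForestClassifier(
        output_type="numpy",
        random_state=840607124,
        verbose=False,
        n_estimators=200,
        n_bins=128,
        split_criterion=1,        # 1 == entropy
        max_depth=18,
        max_leaves=1024,          # non-default; implicated in the comments below
        max_features="auto",
        min_samples_leaf=1,
        min_samples_split=10,
        min_impurity_decrease=0.0,
    )
    clf.fit(X, y)                 # the reporter's runs ended in SIGABRT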

No messages in the console at all, even though it was run in debug mode with verbose=4. All I got was a SIGABRT, plus this in dmesg:

[Sun Jul 11 21:15:41 2021] NVRM: GPU at PCI:0000:01:00: GPU-0bb167f8-b3cd-8df7-9644-d5f95716e554
[Sun Jul 11 21:15:41 2021] NVRM: GPU Board Serial Number: 
[Sun Jul 11 21:15:41 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=2041, Graphics SM Warp Exception on (GPC 3, TPC 3, SM 0): Out Of Range Address
[Sun Jul 11 21:15:41 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=2041, Graphics SM Global Exception on (GPC 3, TPC 3, SM 0): Multiple Warp Errors
[Sun Jul 11 21:15:41 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=2041, Graphics Exception: ESR 0x51df30=0xc13000e 0x51df34=0x24 0x51df28=0x4c1eb72 0x51df2c=0x174
[Sun Jul 11 21:15:41 2021] NVRM: Xid (PCI:0000:01:00): 43, pid=6304, Ch 00000088
[Sun Jul 11 21:15:54 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=6304, Graphics SM Warp Exception on (GPC 4, TPC 2, SM 1): Out Of Range Address
[Sun Jul 11 21:15:54 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=6304, Graphics SM Global Exception on (GPC 4, TPC 2, SM 1): Multiple Warp Errors
[Sun Jul 11 21:15:54 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=6304, Graphics Exception: ESR 0x5257b0=0xc12000e 0x5257b4=0x24 0x5257a8=0x4c1eb72 0x5257ac=0x174
[Sun Jul 11 21:15:54 2021] NVRM: Xid (PCI:0000:01:00): 43, pid=8874, Ch 00000088

Expected behavior

Not to crash; to be more stable.

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Linux Distro/Architecture: Ubuntu 18.04LTS
  • GPU Model/Driver: RTX 2080, driver 460.80
  • CUDA: 11.2.2
  • Method of cuDF & cuML install: conda nightly 21.08 – nightly as of 7 days ago.

conda_list.txt.zip

Additional context

If I hit it again I will try to produce a repro, but I expect routine testing on NVIDIA's side will reveal it. I've only been using cuML RF for a day and already hit this after (maybe) 200 fits on small data.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 19 (9 by maintainers)

Most upvoted comments

Tagging @vinaydes, who is looking into the issue.

Confirming that the fix was merged just in time for 21.08. Thanks!!!

Thanks to @venkywonka, we seem to have found the cause of this issue. For now, to unblock you, I would suggest leaving the max_leaves parameter at its default value of -1; specifying a non-default value seems to be what triggers the problem. I'll update this bug when we have a proper fix.
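
In code, the suggested workaround would look something like the sketch below (hypothetical; it simply keeps max_leaves at its default of -1 while leaving the reporter's other settings unchanged):

    from cuml.ensemble import RandomForestClassifier

    # Workaround sketch: keep max_leaves at its default of -1 (no explicit
    # leaf limit) instead of 1024, the non-default value implicated above.
    clf = RandomForestClassifier(
        output_type="numpy",
        random_state=840607124,
        n_estimators=200,
        n_bins=128,
        split_criterion=1,
        max_depth=18,
        max_leaves=-1,            # default value; avoids the crashing code path
        max_features="auto",
        min_samples_leaf=1,
        min_samples_split=10,
        min_impurity_decrease=0.0,
    )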

@RAMitchell This issue is related to how we update n_leaves: its value can end up different for different threads of the same threadblock in nodeSplitKernel. I have a couple of ways to fix it, but since you are planning to update the node queue anyway, I was wondering whether that would also take care of this issue?