cuml: [BUG] SIGABRT in cuML RF, out-of-bounds memory access

Describe the bug

SIGABRT, seemingly from an out-of-bounds memory access.

Steps/Code to reproduce bug

Unknown, but the input was just the Kaggle Paribas data with various frequency-encoding features added, bringing it to shape (91457, 331).

Parameters:

 OrderedDict([('output_type', 'numpy'), ('random_state', 840607124), ('verbose', False), ('n_estimators', 200), ('n_bins', 128), ('split_criterion', 1), ('max_depth', 18), ('max_leaves', 1024), ('max_features', 'auto'), ('min_samples_leaf', 1), ('min_samples_split', 10), ('min_impurity_decrease', 0.0)])

For a binary classification problem.
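
For reference, here is a minimal sketch of how these parameters map onto cuml.ensemble.RandomForestClassifier. The synthetic data (via scikit-learn's make_classification) is only a stand-in for the frequency-encoded Kaggle Paribas features at the reported shape, so it may well not reproduce the crash:

    import numpy as np
    from sklearn.datasets import make_classification
    from cuml.ensemble import RandomForestClassifier

    # Stand-in data with the reported shape (91457, 331); the actual
    # frequency-encoded Kaggle Paribas features are not available here,
    # so this sketch may not trigger the crash.
    X, y = make_classification(n_samples=91457, n_features=331, random_state=0)
    X = X.astype(np.float32)
    y = y.astype(np.int32)

    clf = RandomForestClassifier(
        output_type="numpy",
        random_state=840607124,
        verbose=False,
        n_estimators=200,
        n_bins=128,
        split_criterion=1,        # 1 == entropy
        max_depth=18,
        max_leaves=1024,          # non-default; implicated in the comments below
        max_features="auto",
        min_samples_leaf=1,
        min_samples_split=10,
        min_impurity_decrease=0.0,
    )
    clf.fit(X, y)                 # the reporter's runs ended in SIGABRT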

No messages in the console at all, even though it was run in debug mode with verbose=4. All I got was a SIGABRT, plus this in dmesg:

[Sun Jul 11 21:15:41 2021] NVRM: GPU at PCI:0000:01:00: GPU-0bb167f8-b3cd-8df7-9644-d5f95716e554
[Sun Jul 11 21:15:41 2021] NVRM: GPU Board Serial Number: 
[Sun Jul 11 21:15:41 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=2041, Graphics SM Warp Exception on (GPC 3, TPC 3, SM 0): Out Of Range Address
[Sun Jul 11 21:15:41 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=2041, Graphics SM Global Exception on (GPC 3, TPC 3, SM 0): Multiple Warp Errors
[Sun Jul 11 21:15:41 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=2041, Graphics Exception: ESR 0x51df30=0xc13000e 0x51df34=0x24 0x51df28=0x4c1eb72 0x51df2c=0x174
[Sun Jul 11 21:15:41 2021] NVRM: Xid (PCI:0000:01:00): 43, pid=6304, Ch 00000088
[Sun Jul 11 21:15:54 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=6304, Graphics SM Warp Exception on (GPC 4, TPC 2, SM 1): Out Of Range Address
[Sun Jul 11 21:15:54 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=6304, Graphics SM Global Exception on (GPC 4, TPC 2, SM 1): Multiple Warp Errors
[Sun Jul 11 21:15:54 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=6304, Graphics Exception: ESR 0x5257b0=0xc12000e 0x5257b4=0x24 0x5257a8=0x4c1eb72 0x5257ac=0x174
[Sun Jul 11 21:15:54 2021] NVRM: Xid (PCI:0000:01:00): 43, pid=8874, Ch 00000088

Expected behavior

Not to crash; to be more stable.

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Linux Distro/Architecture: Ubuntu 18.04LTS
  • GPU Model/Driver: RTX 2080, driver 460.80
  • CUDA: 11.2.2
  • Method of cuDF & cuML install: conda nightly 21.08 – nightly as of 7 days ago.

conda_list.txt.zip

Additional context

If I hit it again I will try to produce a repro, but I expect routine testing on NVIDIA's side will reveal it. I've only been using cuML RF for a day and already hit this after (maybe) 200 fits on small data.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 19 (9 by maintainers)

Most upvoted comments

Tagging @vinaydes, who is looking into the issue.

Confirming that the fix was merged just in time for 21.08. Thanks!!!

Thanks to @venkywonka, we seem to have found the cause of this issue. For now, to unblock you, I would suggest leaving the max_leaves parameter at its default value of -1; specifying a non-default value seems to be what triggers the problem. I'll update this bug when we have a proper fix.
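
In code, the suggested workaround would look something like the sketch below (hypothetical; it simply keeps max_leaves at its default of -1 while leaving the reporter's other settings unchanged):

    from cuml.ensemble import RandomForestClassifier

    # Workaround sketch: keep max_leaves at its default of -1 (no explicit
    # leaf limit) instead of 1024, the non-default value implicated above.
    clf = RandomForestClassifier(
        output_type="numpy",
        random_state=840607124,
        n_estimators=200,
        n_bins=128,
        split_criterion=1,
        max_depth=18,
        max_leaves=-1,            # default value; avoids the crashing code path
        max_features="auto",
        min_samples_leaf=1,
        min_samples_split=10,
        min_impurity_decrease=0.0,
    )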

@RAMitchell This issue is related to how we update n_leaves: its value can end up different for different threads of the same threadblock in nodeSplitKernel. I have a couple of ways to fix it, but since you are planning to update the node queue anyway, I was wondering whether that would also take care of this issue?