vision: torch.randperm() on CUDA returns wrong values when n is large (n > 2^12)

πŸ› Bug

To Reproduce

Steps to reproduce the behavior:

  1. I followed the TorchVision Object Detection Finetuning Tutorial, using the same code as in the tutorial.
  2. The first time I trained on the CPU, which worked, but during evaluation the network still used the GPU; that is the first issue.
  3. Then I trained on the GPU with CUDA (I have a single GPU). During training I got RuntimeError: CUDA error: device-side assert triggered. I am using PyTorch 1.8.1 and torchvision 0.9.1.
  4. I then debugged the whole code and found that some images are fine, but not all. The bug is in the loss calculation, in the subsampling of positives and negatives: BalancedPositiveNegativeSampler() in _utils.py uses torch.randperm(positive.numel(), device=positive.device)[:num_pos] to generate random indices.
  5. However, the function returns wrong values: very large integers such as 4755801207605297152, even though positive.numel() is 265826. I tried different values of n and found that on my machine it fails to return a correct index list whenever n > 2^12. I suspect the limit is related to the GPU or some other hardware property.
  6. I think the random-index generation should include a check: if the given n exceeds the limit, it should force the use of the CPU (see the sketch after this list).
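A minimal sketch of the fallback proposed in step 6, assuming a hypothetical helper safe_randperm and treating the 2^12 threshold observed above as a tunable constant rather than a documented limit:

import torch

# Assumed threshold taken from the report above; not an official PyTorch constant.
CUDA_RANDPERM_LIMIT = 2**12

def safe_randperm(n, device):
    # Hypothetical workaround: build the permutation on the CPU when n exceeds
    # the observed limit, then move the result to the requested device.
    device = torch.device(device)
    if device.type == "cuda" and n > CUDA_RANDPERM_LIMIT:
        return torch.randperm(n).to(device)
    return torch.randperm(n, device=device)

# Usage mirroring the sampler:
# perm = safe_randperm(positive.numel(), positive.device)[:num_pos]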

wrong history.txt

Expected behavior

torch.randperm(n) should return a random permutation of the integers from 0 to n - 1.
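A quick check of that invariant (a sketch, not from the original report): sorting the output of randperm(n) must reproduce arange(n) exactly, so the assertion below fails on affected setups.

import torch

n = 2**15
idx = torch.randperm(n, device="cuda:0")
# Every integer 0..n-1 must appear exactly once in a valid permutation.
assert torch.equal(idx.sort().values, torch.arange(n, device=idx.device))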

Environment

Please copy and paste the output from our environment collection script (attached here as wrong message.txt)

(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch / torchvision Version (e.g., 1.0 / 0.4.0):
  • OS (e.g., Linux):
  • How you installed PyTorch / torchvision (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

wrong envs.txt

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (6 by maintainers)

Most upvoted comments

Hi @fmassa, I am experiencing the same issue (torch==1.8.1+cu111) on Ubuntu 16.04 with an A100 card:

import torch
# this is fine
idx = torch.randperm(2**14, device="cuda:0", dtype=torch.long)[:2]
print(idx) # tensor([13049,  6236], device='cuda:0')

# this is also fine
idx = torch.randperm(2**15, device="cpu", dtype=torch.long)[:2]
print(idx) # tensor([23385, 21083])

# this is buggy
idx = torch.randperm(2**15, device="cuda:0", dtype=torch.long)[:2]
print(idx) # tensor([ 336033229560773140, 4114788168274451291], device='cuda:0')
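To locate the failure threshold on a given machine (as step 5 above did by hand), one could scan powers of two; this is a sketch that assumes an out-of-range maximum is the symptom:

import torch

for p in range(10, 20):
    n = 2**p
    mx = torch.randperm(n, device="cuda:0").max().item()
    # A correct permutation of 0..n-1 always has maximum n - 1.
    print(f"n = 2**{p}: max = {mx} ({'ok' if mx == n - 1 else 'BUG'})")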

Hi @nothingwithyou,

I would recommend opening a new ticket on PyTorch and providing a minimal set of commands to reproduce it. From what you describe, something like torch.randperm(265826, device='cuda').max() should be enough to showcase any potential issue.

Unfortunately, when I run the above command, I don't get any values larger than n - 1. See below:

>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')

I would also advise trying the latest PyTorch nightly to see if the problem is resolved.
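For anyone re-testing on a newer build, a quick way to record the versions alongside the repro (a sketch):

import torch

print(torch.__version__, torch.version.cuda)
print(torch.randperm(265826, device="cuda").max())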