vision: torch.randperm() on CUDA returns wrong values when n is large (n > 2^12)
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- I used the same code as the TorchVision Object Detection Finetuning Tutorial.
- First I ran it on the CPU. Training was fine, but during evaluation the network still used the GPU; that is the first issue.
- Then I trained on the GPU with CUDA (I have only one GPU), and training failed with `RuntimeError: CUDA error: device-side assert triggered`. I am using PyTorch 1.8.1 and torchvision 0.9.1.
- I then debugged the whole pipeline and found that some images are fine, but not all. The bug is in the subsampling of positives and negatives during the loss calculation. It uses the function `BalancedPositiveNegativeSampler` in `_utils.py`, which calls `torch.randperm(positive.numel(), device=positive.device)[:num_pos]` to generate a random index.
- The call returns wrong values: very large ones such as 4755801207605297152, even though `positive.numel()` is only 265826. I tried different values of n, and on my machine, whenever n > 2^12, `torch.randperm` fails to return a valid index list. I suspect the limit on n is related to the GPU configuration or something similar.
- I think the code that generates the random index should include a check: if the given n is larger than the limit, it should force use of the CPU (see the sketch after this list).
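A minimal sketch of the suggested fallback; the helper name `randperm_safe` and the 2^12 threshold come from the observations above, not from any torchvision API:

```python
import torch

# Threshold above which CUDA randperm appeared unreliable in this report;
# 2**12 is an observation from one machine, not a documented limit.
_CUDA_RANDPERM_LIMIT = 2 ** 12

def randperm_safe(n: int, device: torch.device) -> torch.Tensor:
    """Random permutation of [0, n), falling back to the CPU for large n."""
    if device.type == 'cuda' and n > _CUDA_RANDPERM_LIMIT:
        # Generate on the CPU, then move the indices to the target device.
        return torch.randperm(n).to(device)
    return torch.randperm(n, device=device)
```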
Expected behavior
`torch.randperm(n)` should return a random permutation of integers from 0 to n - 1.
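For reference, a quick sanity check of that contract; the size here is arbitrary:

```python
import torch

perm = torch.randperm(10)
# A valid result contains every integer in [0, 10) exactly once,
# so the sorted permutation must equal arange(10).
assert torch.equal(perm.sort().values, torch.arange(10))
```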
Environment
Please copy and paste the output from our environment collection script
(or fill out the checklist below manually).
You can get the script and run it with:
```sh
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
```
- PyTorch / torchvision Version (e.g., 1.0 / 0.4.0):
- OS (e.g., Linux):
- How you installed PyTorch / torchvision (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:
Additional context
Hi @fmassa, I am experiencing the same issue (torch==1.8.1+cu111) on Ubuntu 16.04 with an A100 card.
Hi @nothingwithyou,
I would recommend opening a new ticket on PyTorch and providing a minimal set of commands to reproduce it. From what you describe, something like `torch.randperm(265826, device='cuda').max()` should be enough to showcase any potential issue. Unfortunately, when I run the above command, I don't get any values larger than `n`. I would also advise using the latest PyTorch nightly to see if the problem is resolved.
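A self-contained version of that check, assuming a CUDA device is available; the scanned sizes are taken from the report above:

```python
import torch

# Scan sizes around the reported 2**12 threshold, plus the failing size
# from the report, and flag any out-of-range index.
for n in (2 ** 12, 2 ** 12 + 1, 2 ** 16, 265826):
    perm = torch.randperm(n, device='cuda')
    bad = (perm < 0) | (perm >= n)
    if bad.any():
        print(f"n={n}: {int(bad.sum())} out-of-range values, e.g. {int(perm[bad][0])}")
    else:
        print(f"n={n}: OK (max={int(perm.max())})")
```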