bonnetal: How to fix the RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED? Thank you!

INTERFACE:
config yaml:  config/cityscapes/darknet21_aspp.yaml
log dir /home/pc/logs/2019-8-20-16:13/
model path None
eval only False
No batchnorm False
----------

Commit hash (training version):  b'5368eed'
----------

Opening config file config/cityscapes/darknet21_aspp.yaml
No pretrained directory found.
Copying files to /home/pc/logs/2019-8-20-16:13/ for further reference.
WARNING: Logging before flag parsing goes to stderr.
W0820 16:13:16.396194 140436803987200 deprecation_wrapper.py:119] From ../../common/logger.py:16: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Images from:  /home3/data/city/city_selected/leftImg8bit/train
Labels from:  /home3/data/city/city_selected/gtFine/train
Inference batch size:  1
Images from:  /home3/data/city/city_selected/leftImg8bit/val
Labels from:  /home3/data/city/city_selected/gtFine/val
Original OS:  32
New OS:  8
Strides:  [2, 2, 2, 1, 1]
Dilations:  [1, 1, 1, 2, 4]
Trying to get backbone weights online from Bonnetal server.
Using pretrained weights from bonnetal server for backbone
[Decoder] os:  4 in:  128 skip: 128 out:  128
[Decoder] os:  2 in:  128 skip: 64 out:  64
[Decoder] os:  1 in:  64 skip: 32 out:  32
Using normalized weights as bias for head.
No path to pretrained, using bonnetal Imagenet backbone weights and random decoder.
Total number of parameters:  19239412
Total number of parameters requires_grad:  19239412
Param encoder  14920544
Param decoder  4318208
Param head  660
Training in device:  cuda
Ignoring class  19  in IoU evaluation
[IOU EVAL] IGNORE:  tensor([19])
[IOU EVAL] INCLUDE:  tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18])
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [3,0,0], thread: [576,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [2,0,0], thread: [352,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [2,0,0], thread: [353,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [2,0,0], thread: [354,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [2,0,0], thread: [355,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "train.py", line 117, in <module>
    trainer.train()
  File "../../tasks/segmentation/modules/trainer.py", line 302, in train
    scheduler=self.scheduler)
  File "../../tasks/segmentation/modules/trainer.py", line 487, in train_epoch
    loss.backward()
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.5/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

Hi, I updated your comment to make it easier to read. The error occurs because your labels are out of range: the `t >= 0 && t < n_classes` assertion in the log fails before the cuDNN error is raised, so the cuDNN message is most likely a follow-on symptom of that failed assertion. The Cityscapes data needs to be preprocessed before use, using their API, so that all labels fall in the 0-19 range. The mapping for each label is user-defined and lives in the label-definition script of that API. I usually replace the trainIds 255 and -1 with 19 to get a consistent label set that works with cross-entropy.
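The remapping described in this comment can be sketched roughly as follows. This is only a minimal illustration, assuming the label images are already loaded as NumPy integer arrays of trainIds and that the model uses 20 classes (0-18 plus an ignore class at 19, as in the log). The names `remap_train_ids` and `check_label_range` are hypothetical helpers, not part of bonnetal or the Cityscapes API.

```python
import numpy as np

# Assumption: 19 evaluated classes (0-18) plus one "ignore" class at index 19,
# matching the "Ignoring class 19 in IoU evaluation" line in the log.
N_CLASSES = 20
IGNORE_INDEX = 19


def remap_train_ids(label):
    """Map the trainIds 255 and -1 to IGNORE_INDEX (hypothetical helper)."""
    # Cast to a signed type first so that both 255 and -1 compare safely.
    label = np.asarray(label).astype(np.int64)
    remapped = label.copy()
    remapped[(label == 255) | (label == -1)] = IGNORE_INDEX
    return remapped


def check_label_range(label, n_classes=N_CLASSES):
    """Raise ValueError if any label lies outside [0, n_classes)."""
    label = np.asarray(label).astype(np.int64)
    out_of_range = np.unique(label[(label < 0) | (label >= n_classes)])
    if out_of_range.size > 0:
        raise ValueError(
            "labels out of range [0, %d): %s" % (n_classes, out_of_range.tolist())
        )


if __name__ == "__main__":
    raw = np.array([[0, 18, 255], [-1, 7, 19]])
    fixed = remap_train_ids(raw)
    check_label_range(fixed)  # passes silently once labels are in range
    print(fixed)
```

Running a check like `check_label_range` on a few label images on the CPU, before training starts, tends to give a much clearer message than the device-side assertion shown in the log.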