pytorch-lightning: CUDA error: an illegal memory access was encountered after updating to the latest stable packages

Can anyone help with this error? "CUDA error: an illegal memory access was encountered"

It runs fine for several iterations…

🐛 Bug

Traceback (most recent call last):
  File "train_gpu.py", line 237, in <module>
    main_local(hparam_trial)
  File "train_gpu.py", line 141, in main_local
    trainer.fit(model)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 859, in fit
    self.single_gpu_train(model)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 503, in single_gpu_train
    self.run_pretrain_routine(model)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
    self.train()
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
    self.run_training_epoch()
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 419, in run_training_epoch
    _outputs = self.run_training_batch(batch, batch_idx)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 604, in run_training_batch
    self.batch_loss_value.append(loss)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 44, in append
    x = x.to(self.memory)
RuntimeError: CUDA error: an illegal memory access was encountered
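
Note on reading this traceback: CUDA kernels are launched asynchronously, so the Python frame that raises (here `x = x.to(self.memory)`) is often only the first place the error was noticed, not necessarily the operation that caused it. A minimal sketch of one way to narrow it down, assuming nothing has initialized CUDA yet when it runs: set the `CUDA_LAUNCH_BLOCKING` environment variable to `1` so launches become synchronous. The helper name `force_synchronous_cuda_launches` is hypothetical and only for illustration.

```python
import os


def force_synchronous_cuda_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given environment mapping.

    This has to happen before CUDA is initialized, so call it at the very
    top of the training script, before any CUDA work is done.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


force_synchronous_cuda_launches()

# Only after this point import torch / pytorch_lightning and start training,
# for example (names taken from the traceback above):
#   import torch
#   trainer.fit(model)
```

Setting the variable in the shell (`CUDA_LAUNCH_BLOCKING=1 python train_gpu.py`) should have the same effect. Training will be slower while it is set, so it is meant for debugging only.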

To Reproduce

Environment

CUDA:
    - GPU:
        - Quadro P6000
    - available: True
    - version: 10.2
Packages:
    - numpy: 1.18.1
    - pyTorch_debug: False
    - pyTorch_version: 1.5.0
    - pytorch-lightning: 0.7.6
    - tensorboard: 2.2.2
    - tqdm: 4.46.1
System:
    - OS: Linux
    - architecture:
        - 64bit
    - processor: x86_64
    - python: 3.7.0
    - version: #47~18.04.1-Ubuntu SMP Thu May 7 13:10:50 UTC 2020

About this issue

State: closed
Created 4 years ago
Comments: 18 (6 by maintainers)

Most upvoted comments

This seems to be related to mixing apex and CUDA somehow: https://github.com/pytorch/pytorch/issues/21819

williamFalcon on Jun 13, 2020
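
As a first step in following up on the apex suggestion, it may help to confirm whether apex is present in the environment at all, since the Environment section does not list it. A minimal sketch, assuming apex would be importable under the top-level name `apex`; the helper name `apex_is_installed` is hypothetical. It only reports whether the package can be found and says nothing about whether apex is the cause of the error.

```python
import importlib.util


def apex_is_installed():
    """Return True if a top-level `apex` package can be found in this environment."""
    return importlib.util.find_spec("apex") is not None


print("apex installed:", apex_is_installed())
```

If this prints `False`, the training run is not using apex, and the cause is more likely elsewhere.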