pytorch-lightning: CUDA error: an illegal memory access was encountered after updating to the latest stable packages
Can anyone help with this CUDA error: an illegal memory access was encountered?
It runs fine for several iterations…
🐛 Bug
Traceback (most recent call last):
  File "train_gpu.py", line 237, in <module>
    main_local(hparam_trial)
  File "train_gpu.py", line 141, in main_local
    trainer.fit(model)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 859, in fit
    self.single_gpu_train(model)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 503, in single_gpu_train
    self.run_pretrain_routine(model)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
    self.train()
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
    self.run_training_epoch()
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 419, in run_training_epoch
    _outputs = self.run_training_batch(batch, batch_idx)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 604, in run_training_batch
    self.batch_loss_value.append(loss)
  File "/shared/storage/cs/staffstore/username/anaconda3/envs/sh1/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 44, in append
    x = x.to(self.memory)
RuntimeError: CUDA error: an illegal memory access was encountered
To Reproduce
Environment
- CUDA:
  - GPU:
    - Quadro P6000
  - available: True
  - version: 10.2
- Packages:
  - numpy: 1.18.1
  - pyTorch_debug: False
  - pyTorch_version: 1.5.0
  - pytorch-lightning: 0.7.6
  - tensorboard: 2.2.2
  - tqdm: 4.46.1
- System:
  - OS: Linux
  - architecture:
    - 64bit
  - processor: x86_64
  - python: 3.7.0
  - version: #47~18.04.1-Ubuntu SMP Thu May 7 13:10:50 UTC 2020
About this issue
- State: closed
- Created 4 years ago
- Comments: 18 (6 by maintainers)
This seems to be related to mixing Apex and CUDA somehow. See https://github.com/pytorch/pytorch/issues/21819
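One thing that may help narrow this down: CUDA kernels launch asynchronously, so the frame where the error shows up in the traceback (`x = x.to(self.memory)` in supporters.py) is not necessarily where the illegal access happened. It can simply be the first call that synchronized with the GPU. Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable makes launches synchronous so the error is raised at the failing call. Below is a minimal sketch of doing that from Python; the helper name `enable_blocking_cuda_launches` is made up for illustration, and it assumes the variable is set before CUDA is initialized (setting it before importing torch is the simplest way to be sure).

```python
import os
import sys


def enable_blocking_cuda_launches() -> bool:
    """Ask CUDA to run kernels synchronously so errors surface at the failing call.

    Hypothetical helper, not part of PyTorch or pytorch-lightning.
    Returns True if torch had not been imported when this ran, i.e. the
    variable was set before any CUDA initialization could have happened.
    """
    set_before_torch = "torch" not in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return set_before_torch


if __name__ == "__main__":
    if not enable_blocking_cuda_launches():
        print("warning: torch was already imported; set the variable earlier")
    # Import torch and start training only after this point, e.g.:
    # import torch
    # trainer.fit(model)
```

The same thing from the shell is `CUDA_LAUNCH_BLOCKING=1 python train_gpu.py`. Training will run slower this way, so it is only useful for debugging, but the resulting traceback should point much closer to the operation that actually fails.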