pytorch-lightning: "MisconfigurationException: No supported gpu backend found!" with multi gpu training in jupyter notebooks

Bug description

When trying to train on two GPUs in a Jupyter notebook environment on jarvislabs.ai with the ddp_notebook strategy, I get the following error: "MisconfigurationException: No supported gpu backend found!".

I’m trying to train on two RTX 5000 GPUs. On a Kaggle GPU the same code runs without any problem.

Any ideas?

How to reproduce the bug

import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=2,
    accelerator="gpu",
    devices=2,
    precision=16,
    accumulate_grad_batches=2,
)
trainer.fit(model, train_dl, val_dl)

Error messages and logs

“MisconfigurationException: No supported gpu backend found!”

Environment


#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0): 1.7.7
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10): 1.11
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version: V11.6.55
#- GPU models and configuration: 2x RTX 5000
#- How you installed Lightning(`conda`, `pip`, source): pip
#- Running environment of LightningApp (e.g. local, cloud): jarvislabs.ai 

More info

No response

cc @justusschock @awaelchli

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 17 (6 by maintainers)

Most upvoted comments

I am also experiencing this issue all of a sudden after migrating from PTL 1.6.5 to 1.9.0

However, my colleagues and I solved it by exporting CUDA_VISIBLE_DEVICES as an environment variable on each of our nodes (we use 4 nodes with 8 GPUs each, combined with mpirun). The value is the GPU configuration for that node; since each of our nodes has 8 GPUs, we use export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7. Make sure you export this env var on every node, including the primary node.
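The same effect can be sketched in Python instead of a shell export, provided it runs before torch or any CUDA library initializes (the CUDA runtime reads the variable only once, at init time). The helper name below is hypothetical:

```python
import os

# Hypothetical helper: expose the first n_gpus devices on this node by
# setting CUDA_VISIBLE_DEVICES. Must run before torch/CUDA initializes,
# because the CUDA runtime reads the variable only at startup.
def make_all_gpus_visible(n_gpus: int) -> str:
    value = ",".join(str(i) for i in range(n_gpus))
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value

make_all_gpus_visible(8)
print(os.environ["CUDA_VISIBLE_DEVICES"])  # -> 0,1,2,3,4,5,6,7
```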

Downgrading to PyTorch Lightning 1.7.7 works for me. I don't know what the cause of the problem is!

@vacmar01 Was your PyTorch installed with GPU support? I suspect it was not. Please check what

import torch
print(torch.cuda.is_available())

returns for you. If it prints False, please install a CUDA-enabled build like so: pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116.
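CUDA-enabled PyTorch wheels carry a "+cuXXX" local version suffix (e.g. 1.11.0+cu116), while CPU-only wheels report "+cpu" or no suffix. A quick way to interpret the version string without touching the CUDA runtime (the helper name is hypothetical):

```python
# Hypothetical helper: decide from torch.__version__ whether the wheel
# was built against CUDA. CUDA builds use a "+cuXXX" local version
# suffix; CPU-only builds use "+cpu" or no suffix at all.
def wheel_has_cuda(version: str) -> bool:
    _, _, local = version.partition("+")
    return local.startswith("cu")

print(wheel_has_cuda("1.11.0+cu116"))  # True: CUDA 11.6 build
print(wheel_has_cuda("1.11.0+cpu"))    # False: CPU-only build
```

In practice you would pass torch.__version__ to the helper; a False result here lines up with torch.cuda.is_available() returning False.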