pytorch-lightning: [Bug] RuntimeError: No backend type associated with device type cpu

Bug description

Upgrading both torch and lightning to 2.1.0 and running DDP leads to the following error trace:

Traceback (most recent call last):
  File "/home/nikhil_valencediscovery_com/projects/openMLIP/src/mlip/train.py", line 126, in main
    train(cfg)
  File "/home/nikhil_valencediscovery_com/projects/openMLIP/src/mlip/train.py", line 102, in train
    trainer.fit(model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 990, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1034, in _run_stage
    self._run_sanity_check()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1063, in _run_sanity_check
    val_loop.run()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 141, in run
    return self.on_run_end()
           ^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 253, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 331, in _on_evaluation_epoch_end
    trainer._logger_connector.on_epoch_end()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 187, in on_epoch_end
    metrics = self.metrics
              ^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 226, in metrics
    return self.trainer._results.metrics(on_step)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 471, in metrics
    value = self._get_cache(result_metric, on_step)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 435, in _get_cache
    result_metric.compute()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 280, in wrapped_func
    self._computed = compute(*args, **kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 243, in compute
    value = self.meta.sync(self.value.clone())  # `clone` because `sync` is in-place
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 330, in reduce
    return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available
    return _sync_ddp(result, group=group, reduce_op=reduce_op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 221, in _sync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No backend type associated with device type cpu
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

On downgrading lightning to 2.0.1, the error goes away.

What version are you seeing the problem on?

master

How to reproduce the bug

No response

Error messages and logs

Environment

<details>
  <summary>Current environment</summary>

* CUDA:
	- GPU:               None
	- available:         False
	- version:           11.8
* Lightning:
	- lightning:         2.0.1.post0
	- lightning-cloud:   0.5.42
	- lightning-utilities: 0.9.0
	- pytorch-lightning: 2.1.0
	- torch:             2.1.0
	- torch-cluster:     1.6.3
	- torch-geometric:   2.4.0
	- torch-scatter:     2.1.2
	- torch-sparse:      0.6.18
	- torchmetrics:      1.2.0
* Packages:
	- absl-py:           2.0.0
	- aiobotocore:       2.5.4
	- aiohttp:           3.8.6
	- aioitertools:      0.11.0
	- aiosignal:         1.3.1
	- antlr4-python3-runtime: 4.9.3
	- anyio:             3.7.1
	- appdirs:           1.4.4
	- argon2-cffi:       23.1.0
	- argon2-cffi-bindings: 21.2.0
	- arrow:             1.3.0
	- ase:               3.22.1
	- asttokens:         2.4.0
	- async-lru:         2.0.4
	- async-timeout:     4.0.3
	- attrs:             23.1.0
	- babel:             2.13.0
	- backcall:          0.2.0
	- backoff:           2.2.1
	- backports.cached-property: 1.0.2
	- backports.functools-lru-cache: 1.6.5
	- beautifulsoup4:    4.12.2
	- black:             23.9.1
	- bleach:            6.1.0
	- blessed:           1.19.1
	- blinker:           1.6.3
	- boto3:             1.28.17
	- botocore:          1.31.17
	- brotli:            1.1.0
	- build:             0.10.0
	- cachecontrol:      0.12.14
	- cached-property:   1.5.2
	- cachetools:        5.3.1
	- certifi:           2023.7.22
	- cffi:              1.16.0
	- cfgv:              3.3.1
	- charset-normalizer: 3.3.0
	- cleo:              2.0.1
	- click:             8.1.7
	- colorama:          0.4.6
	- comm:              0.1.4
	- contourpy:         1.1.1
	- coverage:          7.3.2
	- crashtest:         0.4.1
	- croniter:          1.3.15
	- cryptography:      41.0.4
	- cycler:            0.12.1
	- datamol:           0.0.0
	- dateutils:         0.6.12
	- debugpy:           1.8.0
	- decorator:         5.1.1
	- deepdiff:          6.6.0
	- defusedxml:        0.7.1
	- distlib:           0.3.7
	- docker-pycreds:    0.4.0
	- dulwich:           0.21.6
	- e3nn:              0.5.1
	- einops:            0.6.0
	- entrypoints:       0.4
	- exceptiongroup:    1.1.3
	- executing:         1.2.0
	- fastapi:           0.88.0
	- fastjsonschema:    2.18.1
	- filelock:          3.12.4
	- flask:             3.0.0
	- fonttools:         4.43.1
	- fqdn:              1.5.1
	- freetype-py:       2.3.0
	- frozenlist:        1.4.0
	- fsspec:            2023.9.2
	- gcsfs:             2023.9.2
	- gitdb:             4.0.10
	- gitpython:         3.1.37
	- gmpy2:             2.1.2
	- google-api-core:   2.12.0
	- google-auth:       2.23.3
	- google-auth-oauthlib: 0.4.6
	- google-cloud-core: 2.3.3
	- google-cloud-storage: 2.12.0
	- google-crc32c:     1.1.2
	- google-resumable-media: 2.6.0
	- googleapis-common-protos: 1.61.0
	- greenlet:          3.0.0
	- grpcio:            1.59.1
	- h11:               0.14.0
	- h5py:              3.10.0
	- html5lib:          1.1
	- hydra-core:        1.3.2
	- identify:          2.5.30
	- idna:              3.4
	- importlib-metadata: 6.8.0
	- importlib-resources: 6.1.0
	- iniconfig:         2.0.0
	- inquirer:          3.1.3
	- installer:         0.7.0
	- ipdb:              0.13.13
	- ipykernel:         6.25.2
	- ipython:           8.16.1
	- ipywidgets:        8.1.1
	- isoduration:       20.11.0
	- itsdangerous:      2.1.2
	- jaraco.classes:    3.3.0
	- jedi:              0.19.1
	- jeepney:           0.8.0
	- jinja2:            3.1.2
	- jmespath:          1.0.1
	- joblib:            1.3.2
	- json5:             0.9.14
	- jsonpointer:       2.4
	- jsonschema:        4.19.1
	- jsonschema-specifications: 2023.7.1
	- jupyter-client:    8.4.0
	- jupyter-core:      5.4.0
	- jupyter-events:    0.7.0
	- jupyter-lsp:       2.2.0
	- jupyter-server:    2.7.3
	- jupyter-server-terminals: 0.4.4
	- jupyterlab:        4.0.7
	- jupyterlab-pygments: 0.2.2
	- jupyterlab-server: 2.25.0
	- jupyterlab-widgets: 3.0.9
	- keyring:           23.13.1
	- kiwisolver:        1.4.5
	- lightning:         2.0.1.post0
	- lightning-cloud:   0.5.42
	- lightning-utilities: 0.9.0
	- lockfile:          0.12.2
	- loguru:            0.7.2
	- markdown:          3.5
	- markdown-it-py:    3.0.0
	- markupsafe:        2.1.3
	- matplotlib:        3.8.0
	- matplotlib-inline: 0.1.6
	- matscipy:          0.7.0
	- mdurl:             0.1.0
	- mistune:           3.0.1
	- mlip:              0.0.1.dev157+gc3d9c0b.d20231016
	- more-itertools:    10.1.0
	- mpmath:            1.3.0
	- msgpack:           1.0.6
	- multidict:         6.0.4
	- munkres:           1.1.4
	- mypy-extensions:   1.0.0
	- nbclient:          0.8.0
	- nbconvert:         7.9.2
	- nbformat:          5.9.2
	- nest-asyncio:      1.5.8
	- networkx:          3.1
	- nodeenv:           1.8.0
	- notebook-shim:     0.2.3
	- numpy:             1.26.0
	- oauthlib:          3.2.2
	- omegaconf:         2.3.0
	- openqdc:           0.0.0
	- opt-einsum:        3.3.0
	- opt-einsum-fx:     0.1.4
	- ordered-set:       4.1.0
	- orjson:            3.9.8
	- overrides:         7.4.0
	- packaging:         23.2
	- pandas:            2.1.1
	- pandocfilters:     1.5.0
	- parso:             0.8.3
	- pathspec:          0.11.2
	- pathtools:         0.1.2
	- patsy:             0.5.3
	- pexpect:           4.8.0
	- pickleshare:       0.7.5
	- pillow:            10.1.0
	- pip:               23.3
	- pkginfo:           1.9.6
	- pkgutil-resolve-name: 1.3.10
	- platformdirs:      3.11.0
	- pluggy:            1.3.0
	- ply:               3.11
	- poetry:            1.5.1
	- poetry-core:       1.6.1
	- poetry-plugin-export: 1.5.0
	- pre-commit:        3.5.0
	- prettytable:       3.9.0
	- prometheus-client: 0.17.1
	- prompt-toolkit:    3.0.39
	- protobuf:          4.24.4
	- psutil:            5.9.5
	- ptyprocess:        0.7.0
	- pure-eval:         0.2.2
	- pyasn1:            0.5.0
	- pyasn1-modules:    0.3.0
	- pycairo:           1.25.0
	- pycparser:         2.21
	- pydantic:          1.10.13
	- pygments:          2.16.1
	- pyjwt:             2.8.0
	- pyopenssl:         23.2.0
	- pyparsing:         3.1.1
	- pyproject-hooks:   1.0.0
	- pyqt5:             5.15.9
	- pyqt5-sip:         12.12.2
	- pyrootutils:       1.0.4
	- pysocks:           1.7.1
	- pytest:            7.4.2
	- pytest-cov:        4.1.0
	- python-dateutil:   2.8.2
	- python-dotenv:     1.0.0
	- python-editor:     1.0.4
	- python-json-logger: 2.0.7
	- python-multipart:  0.0.6
	- pytorch-lightning: 2.1.0
	- pytz:              2023.3.post1
	- pyu2f:             0.1.5
	- pyyaml:            6.0.1
	- pyzmq:             25.1.1
	- rapidfuzz:         2.15.2
	- readchar:          4.0.5.dev0
	- referencing:       0.30.2
	- reportlab:         4.0.6
	- requests:          2.31.0
	- requests-oauthlib: 1.3.1
	- requests-toolbelt: 1.0.0
	- rfc3339-validator: 0.1.4
	- rfc3986-validator: 0.1.1
	- rich:              13.6.0
	- rlpycairo:         0.2.0
	- rpds-py:           0.10.6
	- rsa:               4.9
	- ruff:              0.0.292
	- s3fs:              2023.9.2
	- s3transfer:        0.6.2
	- scikit-learn:      1.3.1
	- scipy:             1.11.3
	- seaborn:           0.13.0
	- secretstorage:     3.3.3
	- selfies:           2.1.1
	- send2trash:        1.8.2
	- sentry-sdk:        1.32.0
	- setproctitle:      1.3.3
	- setuptools:        68.2.2
	- shellingham:       1.5.3
	- sip:               6.7.12
	- six:               1.16.0
	- smmap:             3.0.5
	- sniffio:           1.3.0
	- soupsieve:         2.5
	- sqlalchemy:        2.0.22
	- stack-data:        0.6.2
	- starlette:         0.22.0
	- starsessions:      1.3.0
	- statsmodels:       0.14.0
	- sympy:             1.12
	- tensorboard:       2.11.2
	- tensorboard-data-server: 0.6.1
	- tensorboard-plugin-wit: 1.8.1
	- terminado:         0.17.1
	- threadpoolctl:     3.2.0
	- tinycss2:          1.2.1
	- toml:              0.10.2
	- tomli:             2.0.1
	- tomlkit:           0.12.1
	- torch:             2.1.0
	- torch-cluster:     1.6.3
	- torch-geometric:   2.4.0
	- torch-scatter:     2.1.2
	- torch-sparse:      0.6.18
	- torchmetrics:      1.2.0
	- tornado:           6.3.3
	- tqdm:              4.66.1
	- traitlets:         5.11.2
	- triton:            2.1.0
	- trove-classifiers: 2023.9.19
	- types-python-dateutil: 2.8.19.14
	- typing-extensions: 4.8.0
	- typing-utils:      0.1.0
	- tzdata:            2023.3
	- ukkonen:           1.0.1
	- uri-template:      1.3.0
	- urllib3:           1.26.17
	- uvicorn:           0.23.2
	- virtualenv:        20.24.4
	- wandb:             0.15.12
	- wcwidth:           0.2.8
	- webcolors:         1.13
	- webencodings:      0.5.1
	- websocket-client:  1.6.4
	- websockets:        11.0.3
	- werkzeug:          3.0.0
	- wheel:             0.41.2
	- widgetsnbextension: 4.0.9
	- wrapt:             1.15.0
	- yarl:              1.9.2
	- zipp:              3.17.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.11.6
	- release:           5.15.0-1032-gcp
	- version:           #40~20.04.1-Ubuntu SMP Tue Apr 11 02:49:52 UTC 2023

</details>

More info

No response

About this issue

  • Original URL
  • State: open
  • Created 8 months ago
  • Reactions: 5
  • Comments: 21 (4 by maintainers)


Most upvoted comments

If I understood correctly, when using self.log(..., sync_dist=True) with DDP, you have to transfer the tensor to the GPU before logging.

Is it possible to move the tensors to the correct device automatically in LightningModule.log()? If not, I feel like this should be mentioned in the documentation, and it would be good to give a better error message. Currently, the 15-minute Lightning tutorial instructs you to remove any .cuda() or device calls, because LightningModules are hardware-agnostic.

It looks like the change was intentional. The changelog says:

self.log-ed tensors are now kept in the original device to reduce unnecessary host-to-device synchronizations (#17334)

This means that if you pass in a tensor, it already needs to be on the right device; the user has to perform the .to() call explicitly.
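
As a minimal sketch of what that means in user code (the metric names, the counter tensor, and the compute_loss helper are illustrative, not from the issue): any freshly created CPU tensor has to be moved explicitly before being logged with sync_dist=True.

def validation_step(self, batch, batch_idx):
    # inside a LightningModule; compute_loss is a hypothetical helper returning a tensor on self.device
    loss = self.compute_loss(batch)
    counter = torch.tensor(1.0)  # torch.tensor(...) allocates on CPU by default
    self.log("val_loss", loss, sync_dist=True)                        # fine, already on the step's device
    self.log("val_counter", counter.to(self.device), sync_dist=True)  # explicit .to() now required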

cc @carmocca

Same bug here after upgrading to torch==2.1.0 and lightning==2.1.0.

This bug appeared when running Metric.compute() of a torchmetric after a validation epoch.

Edit: I am using Lightning Fabric instead of the Lightning Trainer; the bug is also triggered there.

I’ve solved the issue on lightning==2.1.3. When overriding any epoch_end hook, if you log, just make sure the tensor is on the GPU device. If you initialize a new tensor, initialize it with device=self.device.
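
A minimal sketch of that suggestion (the hook body and the _epoch_errors buffer are illustrative, not from the issue): create or move the logged tensor onto self.device so the DDP reduction never sees a CPU tensor.

def on_validation_epoch_end(self):
    # inside a LightningModule; self._epoch_errors is a hypothetical list filled during validation_step
    # creating the tensor without device= would put it on CPU and trigger the RuntimeError above
    val_mae = torch.tensor(self._epoch_errors, device=self.device).mean()
    self.log("val_mae", val_mae, sync_dist=True)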

@awaelchli Thanks for clarifying. I’ve found another corner case where the new behaviour breaks existing code: If you re-use a trainer instance multiple times (e.g. for evaluating multiple epochs), you can end up with metrics moved to CPU even if you log them with GPU tensors.

The reason is that the logger connector moves all intermediate results to CPU on teardown, so on the second call to trainer.validate the helper state (e.g. cumulated_batch_size) of the cached results is on CPU. This can be fixed by removing all cached results via

trainer.validate_loop._results.clear()

Here’s a full example to reproduce this:

import torch
from lightning import Trainer, LightningModule
from torch.utils.data import DataLoader


class LitModel(LightningModule):
    def __init__(self) -> None:
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def training_step(self, x):
        loss = self.layer(x).mean()
        return loss

    def validation_step(self, *args, **kwargs):
        self.log(
            "foo", value=torch.zeros(1, device=self.device), on_step=True, sync_dist=True
        )
        return super().validation_step(*args, **kwargs)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def val_dataloader(self):
        return DataLoader(torch.randn(32, 1), batch_size=1)


def main():
    model = LitModel()
    trainer = Trainer(devices=2, accelerator="gpu", max_epochs=2)
    trainer.validate(model)
    # Uncomment the following line to fix the issue
    #trainer.validate_loop._results.clear()
    trainer.validate(model)


if __name__ == "__main__":
    main()

Has this issue been addressed in nightly? I was really trying to stick to either pip or conda versions and it looks like 2.0.8 is not available on either.

My feeling is that the DDP strategy in lightning==2.0.8 initialized distributed backends for both CPU and GPU when running with device=GPU. Below is a minimal example that works with 2.0.8, but crashes in 2.1.0:

import torch
from lightning import Trainer, LightningModule
from torch.utils.data import DataLoader


class LitModel(LightningModule):
    def __init__(self) -> None:
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def training_step(self, x):
        # Everything but the next line is just dummy-code to make it run
        self.log(
            "foo", value=torch.zeros(1, device="cpu"), on_step=True, sync_dist=True
        )
        loss = self.layer(x).mean()
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        return DataLoader(torch.randn(32, 1), batch_size=1)


def main():
    model = LitModel()
    trainer = Trainer(devices=2, accelerator="gpu", max_epochs=2)
    trainer.fit(model)


if __name__ == "__main__":
    main()

Note that this isn’t restricted to distributed code that’s run by lightning. We have some functionality that uses torch.distributed directly, and we run into the exact same issue when we try to broadcast non-CUDA tensors.
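
A minimal sketch of that failure mode with raw torch.distributed (not from the issue; assumes two local GPUs and a free port): with an NCCL-only process group, collectives on CPU tensors raise the same error, and moving the tensor to CUDA first avoids it.

import os
import torch
import torch.distributed as dist


def run(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    t = torch.zeros(1)  # CPU tensor
    # dist.broadcast(t, src=0)  # raises: No backend type associated with device type cpu

    t = t.cuda()
    dist.broadcast(t, src=0)  # works: NCCL handles CUDA tensors
    dist.destroy_process_group()


if __name__ == "__main__":
    torch.multiprocessing.spawn(run, args=(2,), nprocs=2)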

I’ve got the same error on torch==2.1.0 and lightning==2.1.0; it was fixed by downgrading to pytorch_lightning==2.0.8.

Same for me. Downgrading to pytorch-lightning==2.0.8 fixed the issue.

I’ve solved the issue on lightning==2.1.3. When overriding any epoch_end hook, if you log, just make sure the tensor is on the GPU device. If you initialize a new tensor, initialize it with device=self.device.

@ouioui199’s suggestion works. I changed my code from

self.log_dict(
    {f"test_map_{label}": value for label, value in zip(self.id2label.values(), mAP_per_class)},
    sync_dist=True,
)

to

self.log_dict(
    {f"test_map_{label}": value.to("cuda") for label, value in zip(self.id2label.values(), mAP_per_class)},
    sync_dist=True,
)

The resolution is not clear to me. I’m getting the message "RuntimeError: No backend type associated with device type cpu". If I were logging 20 things, some of them on CPU and some on GPU, what should I be doing? From your comment @awaelchli I would’ve thought adding .to('cpu') calls, but the error message makes me think the opposite (though moving CPU results back to GPU also seems silly).
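
A device-agnostic way to handle that mix (a sketch of what the thread converges on, not an official recommendation; the names are illustrative): move every value onto self.device before logging rather than adding .to('cpu') calls.

# inside a LightningModule; cpu_metric / gpu_metric are illustrative tensors on different devices
metrics = {"map_cpu": cpu_metric, "loss_gpu": gpu_metric}
self.log_dict({k: v.to(self.device) for k, v in metrics.items()}, sync_dist=True)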

It looks like the change comes from this PR: #17334 (found by git-bisecting with the code sample by @dsuess).