pytorch-lightning: [Bug] RuntimeError: No backend type associated with device type cpu
Bug description
After upgrading both torch and lightning to 2.1.0, running with DDP leads to the following error trace:
Traceback (most recent call last):
  File "/home/nikhil_valencediscovery_com/projects/openMLIP/src/mlip/train.py", line 126, in main
    train(cfg)
  File "/home/nikhil_valencediscovery_com/projects/openMLIP/src/mlip/train.py", line 102, in train
    trainer.fit(model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 990, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1034, in _run_stage
    self._run_sanity_check()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1063, in _run_sanity_check
    val_loop.run()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 141, in run
    return self.on_run_end()
           ^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 253, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 331, in _on_evaluation_epoch_end
    trainer._logger_connector.on_epoch_end()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 187, in on_epoch_end
    metrics = self.metrics
              ^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 226, in metrics
    return self.trainer._results.metrics(on_step)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 471, in metrics
    value = self._get_cache(result_metric, on_step)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 435, in _get_cache
    result_metric.compute()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 280, in wrapped_func
    self._computed = compute(*args, **kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 243, in compute
    value = self.meta.sync(self.value.clone()) # `clone` because `sync` is in-place
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 330, in reduce
    return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available
    return _sync_ddp(result, group=group, reduce_op=reduce_op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 221, in _sync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No backend type associated with device type cpu
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
On downgrading lightning to 2.0.1, the error goes away.
What version are you seeing the problem on?
master
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
* CUDA:
- GPU: None
- available: False
- version: 11.8
* Lightning:
- lightning: 2.0.1.post0
- lightning-cloud: 0.5.42
- lightning-utilities: 0.9.0
- pytorch-lightning: 2.1.0
- torch: 2.1.0
- torch-cluster: 1.6.3
- torch-geometric: 2.4.0
- torch-scatter: 2.1.2
- torch-sparse: 0.6.18
- torchmetrics: 1.2.0
* Packages:
- absl-py: 2.0.0
- aiobotocore: 2.5.4
- aiohttp: 3.8.6
- aioitertools: 0.11.0
- aiosignal: 1.3.1
- antlr4-python3-runtime: 4.9.3
- anyio: 3.7.1
- appdirs: 1.4.4
- argon2-cffi: 23.1.0
- argon2-cffi-bindings: 21.2.0
- arrow: 1.3.0
- ase: 3.22.1
- asttokens: 2.4.0
- async-lru: 2.0.4
- async-timeout: 4.0.3
- attrs: 23.1.0
- babel: 2.13.0
- backcall: 0.2.0
- backoff: 2.2.1
- backports.cached-property: 1.0.2
- backports.functools-lru-cache: 1.6.5
- beautifulsoup4: 4.12.2
- black: 23.9.1
- bleach: 6.1.0
- blessed: 1.19.1
- blinker: 1.6.3
- boto3: 1.28.17
- botocore: 1.31.17
- brotli: 1.1.0
- build: 0.10.0
- cachecontrol: 0.12.14
- cached-property: 1.5.2
- cachetools: 5.3.1
- certifi: 2023.7.22
- cffi: 1.16.0
- cfgv: 3.3.1
- charset-normalizer: 3.3.0
- cleo: 2.0.1
- click: 8.1.7
- colorama: 0.4.6
- comm: 0.1.4
- contourpy: 1.1.1
- coverage: 7.3.2
- crashtest: 0.4.1
- croniter: 1.3.15
- cryptography: 41.0.4
- cycler: 0.12.1
- datamol: 0.0.0
- dateutils: 0.6.12
- debugpy: 1.8.0
- decorator: 5.1.1
- deepdiff: 6.6.0
- defusedxml: 0.7.1
- distlib: 0.3.7
- docker-pycreds: 0.4.0
- dulwich: 0.21.6
- e3nn: 0.5.1
- einops: 0.6.0
- entrypoints: 0.4
- exceptiongroup: 1.1.3
- executing: 1.2.0
- fastapi: 0.88.0
- fastjsonschema: 2.18.1
- filelock: 3.12.4
- flask: 3.0.0
- fonttools: 4.43.1
- fqdn: 1.5.1
- freetype-py: 2.3.0
- frozenlist: 1.4.0
- fsspec: 2023.9.2
- gcsfs: 2023.9.2
- gitdb: 4.0.10
- gitpython: 3.1.37
- gmpy2: 2.1.2
- google-api-core: 2.12.0
- google-auth: 2.23.3
- google-auth-oauthlib: 0.4.6
- google-cloud-core: 2.3.3
- google-cloud-storage: 2.12.0
- google-crc32c: 1.1.2
- google-resumable-media: 2.6.0
- googleapis-common-protos: 1.61.0
- greenlet: 3.0.0
- grpcio: 1.59.1
- h11: 0.14.0
- h5py: 3.10.0
- html5lib: 1.1
- hydra-core: 1.3.2
- identify: 2.5.30
- idna: 3.4
- importlib-metadata: 6.8.0
- importlib-resources: 6.1.0
- iniconfig: 2.0.0
- inquirer: 3.1.3
- installer: 0.7.0
- ipdb: 0.13.13
- ipykernel: 6.25.2
- ipython: 8.16.1
- ipywidgets: 8.1.1
- isoduration: 20.11.0
- itsdangerous: 2.1.2
- jaraco.classes: 3.3.0
- jedi: 0.19.1
- jeepney: 0.8.0
- jinja2: 3.1.2
- jmespath: 1.0.1
- joblib: 1.3.2
- json5: 0.9.14
- jsonpointer: 2.4
- jsonschema: 4.19.1
- jsonschema-specifications: 2023.7.1
- jupyter-client: 8.4.0
- jupyter-core: 5.4.0
- jupyter-events: 0.7.0
- jupyter-lsp: 2.2.0
- jupyter-server: 2.7.3
- jupyter-server-terminals: 0.4.4
- jupyterlab: 4.0.7
- jupyterlab-pygments: 0.2.2
- jupyterlab-server: 2.25.0
- jupyterlab-widgets: 3.0.9
- keyring: 23.13.1
- kiwisolver: 1.4.5
- lightning: 2.0.1.post0
- lightning-cloud: 0.5.42
- lightning-utilities: 0.9.0
- lockfile: 0.12.2
- loguru: 0.7.2
- markdown: 3.5
- markdown-it-py: 3.0.0
- markupsafe: 2.1.3
- matplotlib: 3.8.0
- matplotlib-inline: 0.1.6
- matscipy: 0.7.0
- mdurl: 0.1.0
- mistune: 3.0.1
- mlip: 0.0.1.dev157+gc3d9c0b.d20231016
- more-itertools: 10.1.0
- mpmath: 1.3.0
- msgpack: 1.0.6
- multidict: 6.0.4
- munkres: 1.1.4
- mypy-extensions: 1.0.0
- nbclient: 0.8.0
- nbconvert: 7.9.2
- nbformat: 5.9.2
- nest-asyncio: 1.5.8
- networkx: 3.1
- nodeenv: 1.8.0
- notebook-shim: 0.2.3
- numpy: 1.26.0
- oauthlib: 3.2.2
- omegaconf: 2.3.0
- openqdc: 0.0.0
- opt-einsum: 3.3.0
- opt-einsum-fx: 0.1.4
- ordered-set: 4.1.0
- orjson: 3.9.8
- overrides: 7.4.0
- packaging: 23.2
- pandas: 2.1.1
- pandocfilters: 1.5.0
- parso: 0.8.3
- pathspec: 0.11.2
- pathtools: 0.1.2
- patsy: 0.5.3
- pexpect: 4.8.0
- pickleshare: 0.7.5
- pillow: 10.1.0
- pip: 23.3
- pkginfo: 1.9.6
- pkgutil-resolve-name: 1.3.10
- platformdirs: 3.11.0
- pluggy: 1.3.0
- ply: 3.11
- poetry: 1.5.1
- poetry-core: 1.6.1
- poetry-plugin-export: 1.5.0
- pre-commit: 3.5.0
- prettytable: 3.9.0
- prometheus-client: 0.17.1
- prompt-toolkit: 3.0.39
- protobuf: 4.24.4
- psutil: 5.9.5
- ptyprocess: 0.7.0
- pure-eval: 0.2.2
- pyasn1: 0.5.0
- pyasn1-modules: 0.3.0
- pycairo: 1.25.0
- pycparser: 2.21
- pydantic: 1.10.13
- pygments: 2.16.1
- pyjwt: 2.8.0
- pyopenssl: 23.2.0
- pyparsing: 3.1.1
- pyproject-hooks: 1.0.0
- pyqt5: 5.15.9
- pyqt5-sip: 12.12.2
- pyrootutils: 1.0.4
- pysocks: 1.7.1
- pytest: 7.4.2
- pytest-cov: 4.1.0
- python-dateutil: 2.8.2
- python-dotenv: 1.0.0
- python-editor: 1.0.4
- python-json-logger: 2.0.7
- python-multipart: 0.0.6
- pytorch-lightning: 2.1.0
- pytz: 2023.3.post1
- pyu2f: 0.1.5
- pyyaml: 6.0.1
- pyzmq: 25.1.1
- rapidfuzz: 2.15.2
- readchar: 4.0.5.dev0
- referencing: 0.30.2
- reportlab: 4.0.6
- requests: 2.31.0
- requests-oauthlib: 1.3.1
- requests-toolbelt: 1.0.0
- rfc3339-validator: 0.1.4
- rfc3986-validator: 0.1.1
- rich: 13.6.0
- rlpycairo: 0.2.0
- rpds-py: 0.10.6
- rsa: 4.9
- ruff: 0.0.292
- s3fs: 2023.9.2
- s3transfer: 0.6.2
- scikit-learn: 1.3.1
- scipy: 1.11.3
- seaborn: 0.13.0
- secretstorage: 3.3.3
- selfies: 2.1.1
- send2trash: 1.8.2
- sentry-sdk: 1.32.0
- setproctitle: 1.3.3
- setuptools: 68.2.2
- shellingham: 1.5.3
- sip: 6.7.12
- six: 1.16.0
- smmap: 3.0.5
- sniffio: 1.3.0
- soupsieve: 2.5
- sqlalchemy: 2.0.22
- stack-data: 0.6.2
- starlette: 0.22.0
- starsessions: 1.3.0
- statsmodels: 0.14.0
- sympy: 1.12
- tensorboard: 2.11.2
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- terminado: 0.17.1
- threadpoolctl: 3.2.0
- tinycss2: 1.2.1
- toml: 0.10.2
- tomli: 2.0.1
- tomlkit: 0.12.1
- torch: 2.1.0
- torch-cluster: 1.6.3
- torch-geometric: 2.4.0
- torch-scatter: 2.1.2
- torch-sparse: 0.6.18
- torchmetrics: 1.2.0
- tornado: 6.3.3
- tqdm: 4.66.1
- traitlets: 5.11.2
- triton: 2.1.0
- trove-classifiers: 2023.9.19
- types-python-dateutil: 2.8.19.14
- typing-extensions: 4.8.0
- typing-utils: 0.1.0
- tzdata: 2023.3
- ukkonen: 1.0.1
- uri-template: 1.3.0
- urllib3: 1.26.17
- uvicorn: 0.23.2
- virtualenv: 20.24.4
- wandb: 0.15.12
- wcwidth: 0.2.8
- webcolors: 1.13
- webencodings: 0.5.1
- websocket-client: 1.6.4
- websockets: 11.0.3
- werkzeug: 3.0.0
- wheel: 0.41.2
- widgetsnbextension: 4.0.9
- wrapt: 1.15.0
- yarl: 1.9.2
- zipp: 3.17.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.11.6
- release: 5.15.0-1032-gcp
- version: #40~20.04.1-Ubuntu SMP Tue Apr 11 02:49:52 UTC 2023
More info
No response
About this issue
- Original URL
- State: open
- Created 8 months ago
- Reactions: 5
- Comments: 21 (4 by maintainers)
Commits related to this issue
- fix https://github.com/Lightning-AI/pytorch-lightning/issues/18803 Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com> — committed to zhehuaichen/NeMo by zhehuaichen 6 months ago
- move log values to GPU before syncing https://github.com/Lightning-AI/pytorch-lightning/issues/18803 — committed to mehta-lab/VisCy by ziw-liu 4 months ago
If I understood correctly, when using `self.log(..., sync_dist=True)` with DDP, you have to transfer the tensor to the GPU before logging. Is it possible to move the tensors to the correct device automatically in `LightningModule.log()`? If not, I feel like this should be mentioned in the documentation, and it would be good to give a better error message. Currently the 15-minute Lightning tutorial instructs you to remove any `.cuda()` or device calls, because LightningModules are hardware agnostic.

It looks like the change was intentional. The changelog says:

This means if you pass in the tensor, it already needs to be on the right device, and the user needs to explicitly perform the `.to()` call.

cc @carmocca
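A minimal sketch of what this implies for user code (the module and metric names below are illustrative, not from this issue): move the value onto the module's device before logging with `sync_dist=True`.

```python
import torch
import lightning as L


class LitModel(L.LightningModule):
    def validation_step(self, batch, batch_idx):
        # Suppose the value to log ends up on CPU (e.g. computed from numpy).
        val_metric = torch.tensor(0.5)

        # Under DDP with lightning 2.1.0 this would fail, because the all_reduce
        # runs on a CPU tensor while only the GPU (NCCL) backend is initialized:
        # self.log("val_metric", val_metric, sync_dist=True)

        # Explicitly move the value to the module's device before logging:
        self.log("val_metric", val_metric.to(self.device), sync_dist=True)
```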
Same bug here after upgrading to torch==2.1.0 and lightning==2.1.0.

This bug appeared when running `Metric.compute()` of a torchmetric after a validation epoch.

Edit: I am using Lightning Fabric instead of the Lightning Trainer; the bug is also triggered there.
I’ve solved the issue on lightning==2.1.3. When overriding any epoch-end hook, if you log, just make sure that the tensor is on the GPU device. If you initialize a new tensor, initialize it with device=self.device.
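A hedged sketch of that workaround (hook and metric names are hypothetical), creating new tensors on `self.device` so the cross-rank sync runs over the GPU backend:

```python
import torch
import lightning as L


class LitModel(L.LightningModule):
    def on_validation_epoch_end(self):
        # Create any new tensor directly on the module's device, not on CPU.
        epoch_metric = torch.zeros(1, device=self.device)
        # ... accumulate values into epoch_metric here ...
        # The tensor lives on the GPU, so sync_dist uses the NCCL backend.
        self.log("val/epoch_metric", epoch_metric, sync_dist=True)
```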
@awaelchli Thanks for clarifying. I’ve found another corner case where the new behaviour breaks existing code: if you re-use a `Trainer` instance multiple times (e.g. for evaluating multiple epochs), you can end up with metrics moved to CPU even if you log them with GPU tensors.

The reason is that the logger connector moves all intermediate results to CPU on teardown. So on the second call to `trainer.validate`, the helper state (e.g. cumulated_batch_size) of the cached results is on CPU. This can be fixed by removing all cached results through

Here’s a full example to reproduce this:
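The commenter's full reproduction is not included above; the following is a hedged sketch of the described reuse pattern (module, data, and metric names are hypothetical):

```python
import torch
import lightning as L
from torch.utils.data import DataLoader, TensorDataset


class LitModel(L.LightningModule):
    def validation_step(self, batch, batch_idx):
        # Logged with a GPU tensor (batches are already on the device under DDP).
        self.log("val_loss", batch[0].sum(), sync_dist=True)


if __name__ == "__main__":
    data = DataLoader(TensorDataset(torch.randn(8, 2)), batch_size=4)
    model = LitModel()
    trainer = L.Trainer(accelerator="gpu", devices=2, strategy="ddp", logger=False)
    trainer.validate(model, dataloaders=data)
    # Re-using the same Trainer: teardown moved the cached result state to CPU,
    # so the second run can hit "No backend type associated with device type cpu".
    trainer.validate(model, dataloaders=data)
```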
Has this issue been addressed in nightly? I was really trying to stick to either pip or conda versions and it looks like 2.0.8 is not available on either.
My feeling is that the DDP strategy in `lightning==2.0.8` initialized distributed backends for both CPU and GPU when running with device=GPU. Below is a minimal example that works with 2.0.8 but crashes in 2.1.0 (see the sketch after this comment).

Note that this isn’t restricted to distributed code that’s run by Lightning. We have some functionality that uses `torch.distributed` directly, and we run into the exact same issue when we try to broadcast non-CUDA tensors.
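A hedged reconstruction of the kind of minimal example described (not the commenter's original code): under DDP on GPU, 2.1.0 only initializes the NCCL process group, so both `self.log(..., sync_dist=True)` on a CPU tensor and a direct `torch.distributed` collective on a non-CUDA tensor fail.

```python
import torch
import torch.distributed as dist
import lightning as L
from torch.utils.data import DataLoader, TensorDataset


class LitModel(L.LightningModule):
    def validation_step(self, batch, batch_idx):
        # A CPU tensor logged with sync_dist=True: fine on 2.0.8, crashes on 2.1.0.
        self.log("val_metric", torch.tensor(1.0), sync_dist=True)

    def on_validation_epoch_end(self):
        # Direct use of torch.distributed with a non-CUDA tensor hits the same error.
        cpu_tensor = torch.zeros(1)
        dist.broadcast(cpu_tensor, src=0)


if __name__ == "__main__":
    data = DataLoader(TensorDataset(torch.randn(8, 2)), batch_size=4)
    trainer = L.Trainer(accelerator="gpu", devices=2, strategy="ddp", logger=False)
    trainer.validate(LitModel(), dataloaders=data)
```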
torch==2.1.0andlightning==2.1.0and fixed when downgrading topytorch_lightning==2.0.8Same for me. Downgrading to
pytorch-lightning==2.0.8fixed the issue.@ouioui199 suggestion works. I changed my code from
self.log_dict( {f"test_map_{label}": value for label, value in zip(self.id2label.values(), mAP_per_class)}, sync_dist=True, )to
self.log_dict( {f"test_map_{label}": value.to("cuda") for label, value in zip(self.id2label.values(), mAP_per_class)}, sync_dist=True, )The resolution is not clear to me. I’m getting the message “RuntimeError: No backend type associated with device type cpu”. If I was logging 20 things some of them on CPU some of GPU what should I be doing? From your comment @awaelchli I would’ve thought adding
.to('cpu')calls but the error message makes me thing the opposite (but moving CPU results back to GPU also seems silly).It looks like the change comes from this PR: #17334 (git-bisecting code sample by @dsuess)