FastChat: Finetuning: RuntimeError: CUDA error: invalid device ordinal, Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I am facing this issue after setting `--bf16` to False; when I set `--bf16` to True, I get different errors. So with `--bf16` set to False, it throws the error below.
Note: I am using Google Cloud Platform with 1x NVIDIA T4, CUDA 11.6, and 30 GB of RAM.
```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4209 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4210 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4211 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4208) of binary: /opt/conda/envs/test/bin/python3.9
Traceback (most recent call last):
  File "/opt/conda/envs/test/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/test/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/test/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/envs/test/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/envs/test/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/test/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 17
Ensure that the number of NVIDIA GPUs equals `nproc_per_node`. So with a single GPU, set `--nproc_per_node=1`.
Yes, and if you have 4 NVIDIA GPUs, set `--nproc_per_node=4`.
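The logic behind this fix can be sketched with a small stdlib-only check. `torchrun` spawns `nproc_per_node` local workers and hands each one its own device ordinal (0 .. `nproc_per_node`-1), so requesting more workers than visible GPUs makes some worker call `cuda:N` for a device that does not exist, which is exactly the "invalid device ordinal" error. The helper name `check_nproc_per_node` and the reliance on `CUDA_VISIBLE_DEVICES` defaulting to a single device are illustrative assumptions, not FastChat or PyTorch code:

```python
import os

def check_nproc_per_node(nproc_per_node: int, env=None) -> int:
    """Validate torchrun's --nproc_per_node against the visible GPU count.

    Each local worker is assigned a device ordinal in
    0 .. nproc_per_node-1; if that range exceeds the number of
    visible GPUs, CUDA raises "invalid device ordinal".
    """
    env = os.environ if env is None else env
    # Assumption: one visible device ("0") when CUDA_VISIBLE_DEVICES is unset.
    visible = env.get("CUDA_VISIBLE_DEVICES", "0")
    n_gpus = len([d for d in visible.split(",") if d.strip()])
    if nproc_per_node > n_gpus:
        raise ValueError(
            f"--nproc_per_node={nproc_per_node}, but only {n_gpus} "
            f"GPU(s) are visible; set --nproc_per_node={n_gpus}."
        )
    return nproc_per_node

# On a single T4 the only valid value is 1:
check_nproc_per_node(1, {"CUDA_VISIBLE_DEVICES": "0"})
```

With one T4, `check_nproc_per_node(4, {"CUDA_VISIBLE_DEVICES": "0"})` would raise, mirroring the failed launch above; with four GPUs visible, `nproc_per_node=4` passes.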