FastChat: Finetuning: RuntimeError: CUDA error: invalid device ordinal, Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I am facing this issue after setting `--bf16` to False; when I set `--bf16` to True, I get different errors. So with `--bf16` set to False, it throws the error below.
Note: I am using Google Cloud Platform with 1x NVIDIA T4, CUDA 11.6, and 30 GB of RAM.
```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4209 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4210 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4211 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4208) of binary: /opt/conda/envs/test/bin/python3.9
Traceback (most recent call last):
  File "/opt/conda/envs/test/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/test/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/test/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/envs/test/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/envs/test/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/test/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 17
Ensure that the number of NVIDIA GPUs equals `nproc_per_node`. So with a single GPU, set `--nproc_per_node=1`.
Yes, and if you have 4 NVIDIA GPUs, set `--nproc_per_node=4`.
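The logic behind this fix can be sketched with a small stdlib-only check. `torchrun` spawns `nproc_per_node` local workers and hands each one its own device ordinal (0 .. `nproc_per_node`-1), so requesting more workers than visible GPUs makes some worker call `cuda:N` for a device that does not exist, which is exactly the "invalid device ordinal" error. The helper name `check_nproc_per_node` and the reliance on `CUDA_VISIBLE_DEVICES` defaulting to a single device are illustrative assumptions, not FastChat or PyTorch code:

```python
import os

def check_nproc_per_node(nproc_per_node: int, env=None) -> int:
    """Validate torchrun's --nproc_per_node against the visible GPU count.

    Each local worker is assigned a device ordinal in
    0 .. nproc_per_node-1; if that range exceeds the number of
    visible GPUs, CUDA raises "invalid device ordinal".
    """
    env = os.environ if env is None else env
    # Assumption: one visible device ("0") when CUDA_VISIBLE_DEVICES is unset.
    visible = env.get("CUDA_VISIBLE_DEVICES", "0")
    n_gpus = len([d for d in visible.split(",") if d.strip()])
    if nproc_per_node > n_gpus:
        raise ValueError(
            f"--nproc_per_node={nproc_per_node}, but only {n_gpus} "
            f"GPU(s) are visible; set --nproc_per_node={n_gpus}."
        )
    return nproc_per_node

# On a single T4 the only valid value is 1:
check_nproc_per_node(1, {"CUDA_VISIBLE_DEVICES": "0"})
```

With one T4, `check_nproc_per_node(4, {"CUDA_VISIBLE_DEVICES": "0"})` would raise, mirroring the failed launch above; with four GPUs visible, `nproc_per_node=4` passes.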